Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

University of Science and Technology Beijing
The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

Figure: VisionToM overview

VisionToM steers MLLMs toward critical visual evidence and task-relevant ToM reasoning via lightweight, training-free interventions on attention heads.

Abstract

We study video-only Theory of Mind (ToM) reasoning in multimodal large language models (MLLMs), where the model must infer agents' goals, beliefs, and actions from egocentric videos without relying on extra text descriptions. We introduce VisionToM, a vision-oriented, training-free intervention framework that learns lightweight vectors in the model's latent space and applies targeted edits to attention heads during inference. By grounding decisions in visual evidence and guiding task-specific ToM reasoning, VisionToM reduces spurious linguistic priors and improves answer reliability on video-only ToM benchmarks.

Highlights

  • Video-only setting: no additional captions or prompts are required at test time.
  • Training-free intervention: identify sensitive attention heads and edit them during inference.
  • Interpretable: separates visual-attention enhancement from ToM-reasoning guidance in the model’s internal representations.
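The head-selection step in the highlights above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes access to per-head attention weights and scores each head by the average attention mass it places on the visual tokens, keeping the top-k heads as "sensitive". The shapes, the scoring rule, and the `top_k` parameter are all assumptions for illustration.

```python
import numpy as np

def select_sensitive_heads(attn, visual_token_idx, top_k=2):
    """Rank attention heads by the mass they place on visual tokens.

    attn:             (num_heads, seq_len, seq_len) attention weights for
                      one layer (hypothetical shape; the paper does not
                      specify its exact selection criterion).
    visual_token_idx: column indices of the visual (video) tokens.
    Returns the indices of the top_k heads with the highest average
    attention to visual tokens.
    """
    # Mean attention each head pays to the visual-token columns.
    mass = attn[:, :, visual_token_idx].mean(axis=(1, 2))
    # Sort descending and keep the strongest heads.
    return np.argsort(mass)[::-1][:top_k]
```

In practice the same scoring could be run per layer to pick a small set of (layer, head) pairs to intervene on.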

Method

Figure: VisionToM method diagram

We extract internal attention representations, locate heads that are sensitive to visual evidence and task-specific reasoning, and apply targeted interventions during inference.
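The intervention step described above can be sketched as a simple additive edit: once sensitive heads are located, a lightweight steering vector is added to each selected head's output at inference time, with no weight updates. This is a hedged sketch under assumed shapes; the vector source, the normalization, and the strength knob `alpha` are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def steer_head_outputs(head_out, head_ids, vectors, alpha=1.0):
    """Training-free edit: add a steering vector to selected heads.

    head_out: (num_heads, seq_len, d_head) activations at one layer.
    head_ids: heads previously identified as sensitive.
    vectors:  dict mapping head_id -> (d_head,) steering direction
              (e.g., derived from contrasting grounded vs. ungrounded runs).
    alpha:    intervention strength (a hypothetical hyperparameter).
    """
    out = head_out.copy()
    for h in head_ids:
        v = vectors[h]
        # Normalize so alpha alone controls the edit magnitude.
        out[h] = out[h] + alpha * (v / (np.linalg.norm(v) + 1e-8))
    return out
```

Unselected heads pass through unchanged, which keeps the intervention targeted and easy to ablate.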

Qualitative Example

Figure: Qualitative example of VisionToM

VisionToM reduces hallucinated answers by grounding predictions in the egocentric video context.

BibTeX

@article{liu2026videoonly,
  title={Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models},
  author={Liu, Siqi and Li, Xinyang and Zou, Bochao and Zhuo, Junbao and Ma, Huimin and Chen, Jiansheng},
  journal={arXiv preprint arXiv:2603.24484},
  year={2026}
}