We study video-only Theory of Mind (ToM) reasoning in multimodal large language models (MLLMs), where the model must infer agents' goals, beliefs, and actions from egocentric videos without relying on additional text descriptions. We introduce VisionToM, a vision-oriented, training-free intervention framework that learns lightweight vectors in the model's latent space and applies targeted edits to attention heads during inference. By grounding decisions in visual evidence and guiding task-specific ToM reasoning, VisionToM reduces reliance on spurious linguistic priors and improves answer reliability on video-only ToM benchmarks.
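As a rough illustration of what such a lightweight vector could look like, the sketch below assumes it is estimated contrastively from per-head activations collected on visually grounded versus ungrounded answers; the paper's actual estimation procedure may differ.

```python
import torch

def estimate_steering_vector(grounded_acts: torch.Tensor,
                             ungrounded_acts: torch.Tensor) -> torch.Tensor:
    """Hypothetical lightweight vector for one attention head.

    grounded_acts / ungrounded_acts: (num_samples, head_dim) activations
    collected while the model answers with / without correct visual grounding.
    The vector points from the ungrounded toward the grounded region of the
    head's latent space and can be added back at inference time.
    """
    direction = grounded_acts.mean(dim=0) - ungrounded_acts.mean(dim=0)
    return direction / direction.norm()  # unit norm: a scalar strength then controls the edit
```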
We extract internal attention representations, locate heads that are sensitive to visual evidence and task-specific reasoning, and apply targeted interventions during inference.
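A minimal sketch of the localization and intervention steps, assuming heads are ranked by how well a probe on their activations predicts visual-evidence labels, and that the edit is applied with PyTorch forward hooks on a Hugging Face-style attention module; the module path, `self_attn` naming, and per-head slicing are assumptions for illustration, not the paper's implementation.

```python
import torch

def select_heads(probe_scores: dict, k: int = 32) -> list:
    """Keep the k (layer, head) pairs whose probing accuracy on
    visual-evidence labels is highest (hypothetical criterion)."""
    return sorted(probe_scores, key=probe_scores.get, reverse=True)[:k]

def make_head_edit_hook(head_idx: int, head_dim: int,
                        vec: torch.Tensor, alpha: float = 4.0):
    """Forward hook that shifts one head's slice of the attention output
    by the learned vector during generation (simplified per-head edit)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        lo = head_idx * head_dim
        hidden[..., lo:lo + head_dim] += alpha * vec.to(hidden.dtype)
        return output
    return hook

# Usage sketch (module path assumed; adapt to the actual MLLM backbone):
# attn = model.language_model.layers[layer_idx].self_attn
# handle = attn.register_forward_hook(make_head_edit_hook(head_idx, head_dim, vec))
# ... run generation on the video-question input, then handle.remove()
```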
VisionToM reduces hallucinated answers by grounding predictions in the egocentric video context.
@article{liu2026videoonly,
  title={Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models},
  author={Liu, Siqi and Li, Xinyang and Zou, Bochao and Zhuo, Junbao and Ma, Huimin and Chen, Jiansheng},
  journal={arXiv preprint arXiv:2603.24484},
  year={2026}
}