
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game

1 Tsinghua University
2 Fudan University
Teaser figure: Illustration of our proposed room escape environment EscapeCraft, which allows us to generate customized room scenes (left) and define the ground-truth reasoning paths of tasks (right). Based on EscapeCraft, we create the MM-Escape benchmark, which evaluates both task completion performance and the entire multimodal reasoning process of MLLMs.

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in real-world and virtual environments, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess final task completion, often reducing assessment to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing the reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond mere task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet performance drops dramatically as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props such as keys. We hope our work sheds light on new challenges in multimodal reasoning and uncovers potential improvements in MLLM capabilities.

The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven advances in applications that require strong multimodal reasoning abilities, such as visual perception, spatial awareness, and tool utilization. However, existing evaluation methods primarily focus on task completion rather than analyzing the entire reasoning process, limiting insights into model capabilities.
To address this, we introduce MM-Escape, a benchmark inspired by real-world escape games, designed to assess complex multimodal reasoning. At its core, we develop EscapeCraft, a customizable open environment that enables models to engage in free-form exploration through room escape tasks. This allows for a comprehensive evaluation of reasoning abilities beyond simple task completion, focusing on intermediate behaviors and decision-making.
Our findings reveal that while MLLMs show promising reasoning abilities, performance drops significantly as task complexity increases, exposing distinct failure patterns. These insights emphasize the need for more comprehensive analysis and improvements in multimodal reasoning.

Introduction

Game Settings

We introduce an automatic reasoning-chain generation procedure by configuring the Prop Chain: a singly linked list representing the ordered sequence of items and interactions required to complete the game. Each node in the chain corresponds to an interactive element, such as a key, a locked box, or a note with a password, and the tail node represents the exit of the game. To construct a complete escape game setting, we annotate the links between nodes in the prop chain to define how each prop is obtained (e.g., freely acquired or requiring a key to open) and their inclusion relationships (e.g., a key can be placed inside a box).
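To make this concrete, below is a minimal Python sketch of how such a prop chain could be represented. The class and field names (PropNode, access, contains) are our own illustration for this page, not the actual EscapeCraft API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PropNode:
    """One interactive element in the prop chain (e.g., a key, a locked box, the exit door)."""
    name: str                                           # e.g., "note", "box", "key", "door"
    access: str = "free"                                # how the prop is obtained: "free", "needs_key", "needs_password"
    contains: List[str] = field(default_factory=list)   # props hidden inside this one (inclusion relation)
    next: Optional["PropNode"] = None                   # next node in the chain; the tail node is the exit

def build_chain(nodes: List[PropNode]) -> PropNode:
    """Link the nodes into a singly linked list and return the head."""
    for cur, nxt in zip(nodes, nodes[1:]):
        cur.next = nxt
    return nodes[0]

# A Difficulty-3 style chain: a freely readable note reveals a password,
# the password opens a box containing a key, and the key unlocks the door (tail node).
chain = build_chain([
    PropNode("note"),
    PropNode("box", access="needs_password", contains=["key"]),
    PropNode("door", access="needs_key"),
])
```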

Difficulty-1: The simplest one-hop reasoning path where no props are needed to unlock the door. Models can exit by locating the door and interacting with it directly.

Difficulty-2: A two-hop reasoning path requiring an additional key or password compared with Difficulty-1. Models must search for the key or password and use it to unlock the door.

Difficulty-3: A three-hop reasoning path requiring both a password and a key, adding one hop to Difficulty-2. This level challenges models with spatial reasoning, visual search, and prop utilization.

Since the prop chain can grow arbitrarily long, our difficulty levels are inherently extendable (see the sketch below). Moreover, the type of questions or tasks in each reasoning hop is customizable and interchangeable, further enhancing the difficulty and flexibility of MM-Escape. We also explore extended settings that incorporate other tasks, such as embodied QA and visual logical reasoning.
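Because difficulty is defined by the length of the prop chain, a new level can be sketched simply by adding hops. The snippet below illustrates this idea with a plain list-of-dicts encoding; the field names and the extra embodied-QA hop are hypothetical illustrations, not the benchmark's actual configuration format.

```python
# Each hop is a dict describing what must be obtained and how it is gated.
# This mirrors the prop-chain sketch above, using plain dicts for brevity.
DIFFICULTY_1 = [{"prop": "door", "access": "free"}]

DIFFICULTY_2 = [
    {"prop": "key", "access": "free"},
    {"prop": "door", "access": "needs_key"},
]

DIFFICULTY_3 = [
    {"prop": "note", "access": "free"},                              # the note reveals a password
    {"prop": "box", "access": "needs_password", "contains": "key"},  # the box holds the key
    {"prop": "door", "access": "needs_key"},
]

def extend(extra_hops, chain):
    """Create a harder level by prepending extra hops to an existing chain."""
    return list(extra_hops) + list(chain)

# Hypothetical Difficulty-4: an embodied-QA hop whose answer reveals where the note is hidden.
DIFFICULTY_4 = extend([{"prop": "qa_hint", "access": "answer_question"}], DIFFICULTY_3)
```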

Game Levels

Analysis

Takeaways
Distinct Human-Like Behavioral Patterns: Models exhibit unique behaviors in the escape task. Gemini tends to remain in a fixed location at the start, scanning its surroundings before acting, while GPT-4o first observes a wider range for a more global understanding.
Diverse Observation Strategies: Gemini frequently adopts a downward-facing view to inspect objects like tables and chairs, whereas GPT-4o predominantly relies on a front-facing view.
Common Failure Modes: Models struggle with movement and interaction failures. GPT-4o tends to repeat trajectories, while Gemini and Claude get stuck in confined areas. Phi-3 and Qwen-VL fail at action combinations, leading to imprecise object interactions.
Adherence to Long-Term Goals: Most models focus on locating the exit and key objects, but Phi-3 consistently fails to recognize doors, unlike other models that engage with doors upon detection.

Analysis of Entire Path
The analysis focuses on three key aspects: (1) the number of steps required to obtain core props, (2) the number of steps needed to exit after acquiring the key or password, and (3) the relationship between grab success rate (GSR) and escape success.
Efficiency in Obtaining Key Props: GPT-4o performs best at locating and acquiring essential props, followed by Gemini. Claude performs well at higher difficulty levels but achieves a lower escape rate.
Efficiency in Exiting: Gemini performs well at moderate difficulty but struggles in complex environments, where GPT-4o benefits from prior memory and spatial understanding, leading to more efficient escapes.
Grab Success Rate (GSR) and Escape Success: A higher GSR correlates with better escape performance. GPT-4o and Claude 3.5 maintain relatively stable GSRs, while Qwen and Llama struggle due to weaker environmental perception and interaction capabilities.
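For reference, the sketch below shows one way these per-trajectory statistics could be computed from logged interactions. The log format (a list of (action, target, success) records) and the helper names are assumptions made for illustration, not the benchmark's actual logging schema.

```python
from typing import List, Tuple

# Hypothetical trajectory log: one (action, target, success) record per step.
Step = Tuple[str, str, bool]

def steps_to_prop(traj: List[Step], prop: str) -> int:
    """Number of steps until the given prop (e.g., the key) is successfully grabbed."""
    for i, (action, target, success) in enumerate(traj, start=1):
        if action == "grab" and target == prop and success:
            return i
    return len(traj)  # prop never obtained

def steps_to_exit_after(traj: List[Step], prop: str) -> int:
    """Steps taken after obtaining the prop until the door is successfully interacted with."""
    start = steps_to_prop(traj, prop)
    for j, (action, target, success) in enumerate(traj[start:], start=1):
        if action == "interact" and target == "door" and success:
            return j
    return len(traj) - start  # never exited

def grab_success_rate(traj: List[Step]) -> float:
    """Fraction of grab attempts that succeed (GSR)."""
    grabs = [success for action, _, success in traj if action == "grab"]
    return sum(grabs) / len(grabs) if grabs else 0.0

# Toy example: the key is grabbed at step 3, the door is opened 2 steps later, GSR = 0.5.
traj = [
    ("move", "forward", True),
    ("grab", "chair", False),
    ("grab", "key", True),
    ("move", "door", True),
    ("interact", "door", True),
]
print(steps_to_prop(traj, "key"), steps_to_exit_after(traj, "key"), grab_success_rate(traj))
```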
Extensibility of EscapeCraft
An extended case study evaluates model adaptability to different room configurations, where the password is displayed as a numerical pattern instead of written text. GPT-4o quickly recognizes and uses the pattern when it is near the door but struggles when it is farther away, revealing limitations in long-term and spatial reasoning. Gemini fails to recognize the pattern entirely, instead resorting to exhaustive room searches. This highlights the importance of enhancing models' ability to interpret environmental cues dynamically.
Post-game Debriefing
After escaping, models are tasked with recalling and reconstructing their escape process. Only GPT-4o and Gemini-1.5-Pro are evaluated, owing to their higher escape rates. The findings show that models prioritize key actions (e.g., obtaining the password) but neglect background context, suggesting a trade-off between efficiency and holistic understanding. Future improvements should focus on enhancing memory and reasoning abilities to better capture contextual details beyond task completion.

Experiment Results


Models | Difficulty-1: ER (%)↑ / Steps↓ / Grab SR (%)↑ / Grab Ratio | Difficulty-2: ER (%)↑ / Prop (%)↑ / Steps↓ / Grab SR (%)↑ / Grab Ratio | Difficulty-3: ER (%)↑ / Prop (%)↑ / Steps↓ / Grab SR (%)↑ / Grab Ratio | AVG ER (%)↑
Human | 100.00 / 5.73 / 95.45 / 0.19 | 100.00 / 100.00 / 13.64 / 81.81 / 0.19 | 100.00 / 100.00 / 21.45 / 75.45 / 0.19 | 100.00
GPT-4o | 100.00 / 11.27 / 37.82 / 0.42 | 72.73 / 81.82 / 36.73 / 36.73 / 0.26 | 71.36 / 90.00 / 50.19 / 31.36 / 0.35 | 81.36
Gemini-1.5-pro | 81.82 / 21.18 / 49.18 / 0.39 | 54.55 / 90.91 / 47.82 / 14.89 / 0.44 | 46.82 / 74.49 / 73.18 / 10.43 / 0.48 | 61.06
Claude 3.5 Sonnet | 72.73 / 22.09 / 30.64 / 0.36 | 45.45 / 54.55 / 57.45 / 20.64 / 0.17 | 39.61 / 54.83 / 82.36 / 16.21 / 0.22 | 52.60
Doubao 1.5 Pro | 91.91 / 16.27 / 44.68 / 0.27 | 45.45 / 54.55 / 63.18 / 13.63 / 0.25 | 9.52 / 33.33 / 93.19 / 6.76 / 0.26 | 48.96
Llama-3.2-11b-vision | 63.64 / 23.55 / 31.36 / 0.35 | 0.00 / 27.27 / 75.00 / 3.16 / 0.44 | 0.00 / 27.27 / 100.00 / 3.55 / 0.32 | 21.21
Qwen-VL-Max | 18.18 / 42.64 / 11.36 / 0.05 | 0.00 / 27.27 / 75.00 / 3.51 / 0.15 | 9.09 / 18.18 / 94.18 / 2.72 / 0.31 | 9.09
Phi-3-vision-128k | 0.00 / 50.00 / 0.00 / 0.01 | 0.00 / 0.00 / 75.00 / 0.00 / 0.02 | 0.00 / 0.00 / 100.00 / 0.00 / 0.01 | 0.00
GLM-4v Flash | 0.00 / 50.00 / 0.00 / 0.00 | 0.00 / 0.00 / 75.00 / 0.00 / 0.00 | 0.00 / 0.00 / 100.00 / 0.00 / 0.00 | 0.00
(ER: escape rate; Grab SR: grab success rate; ↑ higher is better, ↓ lower is better.)

BibTeX

@misc{wang2025multimodallargelanguagemodels,
      title={How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game}, 
      author={Ziyue Wang and Yurui Dong and Fuwen Luo and Minyuan Ruan and Zhili Cheng and Chi Chen and Peng Li and Yang Liu},
      year={2025},
      eprint={2503.10042},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10042}, 
  }