
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game

1 Tsinghua University
2 Fudan University
Teaser figure: Illustration of our proposed room escape environment EscapeCraft, which allows us to generate customized room scenes (left) and define the ground-truth reasoning paths of tasks (right). Based on EscapeCraft, we create the MM-Escape benchmark, which evaluates both task completion performance and the entire multimodal reasoning process of MLLMs.

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in real-world and virtual environments, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess final task completion, often reducing assessment to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing the reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond mere task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet performance drops dramatically as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props such as keys. We hope our work sheds light on new challenges in multimodal reasoning and uncovers potential improvements in MLLM capabilities.

The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven advances in applications that require strong multimodal reasoning abilities, such as visual perception, spatial awareness, and tool utilization. However, existing evaluation methods primarily focus on task completion rather than analyzing the entire reasoning process, limiting insights into model capabilities.
To address this, we introduce MM-Escape, a benchmark inspired by real-world escape games, designed to assess complex multimodal reasoning. At its core, we develop EscapeCraft, a customizable open environment that enables models to engage in free-form exploration through room escape tasks. This allows for a comprehensive evaluation of reasoning abilities beyond simple task completion, focusing on intermediate behaviors and decision-making.
Our findings reveal that while MLLMs show promising reasoning abilities, performance drops significantly as task complexity increases, exposing distinct failure patterns. These insights emphasize the need for more comprehensive analysis and improvements in multimodal reasoning.

Introduction

Game Settings

We introduce an automatic reasoning-chain generation procedure by configuring the Prop Chain: a singly linked list representing the ordered sequence of items and interactions required to complete the game. Each node in the chain corresponds to an interactive element, such as a key, a locked box, or a note with a password, and the tail node represents the exit of the game. To construct a complete escape game setting, we annotate the links between nodes in the prop chain to define how each prop is obtained (e.g., freely acquired or requiring a key to open) and their inclusion relationships (e.g., a key can be placed inside a box).
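To make this concrete, below is a minimal Python sketch of how such a prop chain could be represented. The class and field names (PropNode, access, contains) are our own illustration for this page, not the actual EscapeCraft API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PropNode:
    """One interactive element in the prop chain (e.g., a key, a locked box, the exit door)."""
    name: str                                           # e.g., "note", "box", "key", "door"
    access: str = "free"                                # how the prop is obtained: "free", "needs_key", "needs_password"
    contains: List[str] = field(default_factory=list)   # props hidden inside this one (inclusion relation)
    next: Optional["PropNode"] = None                   # next node in the chain; the tail node is the exit

def build_chain(nodes: List[PropNode]) -> PropNode:
    """Link the nodes into a singly linked list and return the head."""
    for cur, nxt in zip(nodes, nodes[1:]):
        cur.next = nxt
    return nodes[0]

# A Difficulty-3 style chain: a freely readable note reveals a password,
# the password opens a box containing a key, and the key unlocks the door (tail node).
chain = build_chain([
    PropNode("note"),
    PropNode("box", access="needs_password", contains=["key"]),
    PropNode("door", access="needs_key"),
])
```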

Difficulty-1: The simplest one-hop reasoning path where no props are needed to unlock the door. Models can exit by locating the door and interacting with it directly.

Difficulty-2: A two-hop reasoning path requiring an additional key or password compared with Difficulty-1. Models must search for the key or password and use it to unlock the door.

Difficulty-3: A three-hop reasoning path requiring both a password and a key, adding one hop to Difficulty-2. This level challenges models with spatial reasoning, visual search, and prop utilization.

Since the prop chain can grow arbitrarily long, our difficulty levels are inherently extendable (see the sketch below). Moreover, the type of questions or tasks in each reasoning hop is customizable and interchangeable, further enhancing the difficulty and flexibility of MM-Escape. We also explore extended settings that incorporate other tasks, such as embodied QA and visual logical reasoning.
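Because difficulty is defined by the length of the prop chain, a new level can be sketched simply by adding hops. The snippet below illustrates this idea with a plain list-of-dicts encoding; the field names and the extra embodied-QA hop are hypothetical illustrations, not the benchmark's actual configuration format.

```python
# Each hop is a dict describing what must be obtained and how it is gated.
# This mirrors the prop-chain sketch above, using plain dicts for brevity.
DIFFICULTY_1 = [{"prop": "door", "access": "free"}]

DIFFICULTY_2 = [
    {"prop": "key", "access": "free"},
    {"prop": "door", "access": "needs_key"},
]

DIFFICULTY_3 = [
    {"prop": "note", "access": "free"},                              # the note reveals a password
    {"prop": "box", "access": "needs_password", "contains": "key"},  # the box holds the key
    {"prop": "door", "access": "needs_key"},
]

def extend(extra_hops, chain):
    """Create a harder level by prepending extra hops to an existing chain."""
    return list(extra_hops) + list(chain)

# Hypothetical Difficulty-4: an embodied-QA hop whose answer reveals where the note is hidden.
DIFFICULTY_4 = extend([{"prop": "qa_hint", "access": "answer_question"}], DIFFICULTY_3)
```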

Game Levels

Analysis

Takeaways
Distinct Human-Like Behavioral Patterns: Models exhibit unique behaviors in the escape task. Gemini tends to remain in a fixed location at the start, scanning its surroundings before acting, while GPT-4o first observes a wider range for a more global understanding.
Diverse Observation Strategies: Gemini frequently adopts a downward-facing view to inspect objects like tables and chairs, whereas GPT-4o predominantly relies on a front-facing view.
Common Failure Modes: Models struggle with movement and interaction failures. GPT-4o tends to repeat trajectories, while Gemini and Claude get stuck in confined areas. Phi-3 and Qwen-VL fail at action combinations, leading to imprecise object interactions.
Adherence to Long-Term Goals: Most models focus on locating the exit and key objects, but Phi-3 consistently fails to recognize doors, unlike other models that engage with doors upon detection.

Analysis of Entire Path
The analysis focuses on three key aspects: (1) the number of steps required to obtain core props, (2) the number of steps needed to exit after acquiring the key or password, and (3) the relationship between grab success rate (GSR) and escape success.
Efficiency in Obtaining Key Props: GPT-4o performs best at locating and acquiring essential props, followed by Gemini. Claude performs well at higher difficulty levels but achieves a lower escape rate.
Efficiency in Exiting: Gemini performs well at moderate difficulty but struggles in complex environments, where GPT-4o benefits from prior memory and spatial understanding, leading to more efficient escapes.
Grab Success Rate (GSR) and Escape Success: A higher GSR correlates with better escape performance. GPT-4o and Claude 3.5 maintain relatively stable GSRs, while Qwen and Llama struggle due to weaker environmental perception and interaction capabilities.
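For reference, the sketch below shows one way these per-trajectory statistics could be computed from logged interactions. The log format (a list of (action, target, success) records) and the helper names are assumptions made for illustration, not the benchmark's actual logging schema.

```python
from typing import List, Tuple

# Hypothetical trajectory log: one (action, target, success) record per step.
Step = Tuple[str, str, bool]

def steps_to_prop(traj: List[Step], prop: str) -> int:
    """Number of steps until the given prop (e.g., the key) is successfully grabbed."""
    for i, (action, target, success) in enumerate(traj, start=1):
        if action == "grab" and target == prop and success:
            return i
    return len(traj)  # prop never obtained

def steps_to_exit_after(traj: List[Step], prop: str) -> int:
    """Steps taken after obtaining the prop until the door is successfully interacted with."""
    start = steps_to_prop(traj, prop)
    for j, (action, target, success) in enumerate(traj[start:], start=1):
        if action == "interact" and target == "door" and success:
            return j
    return len(traj) - start  # never exited

def grab_success_rate(traj: List[Step]) -> float:
    """Fraction of grab attempts that succeed (GSR)."""
    grabs = [success for action, _, success in traj if action == "grab"]
    return sum(grabs) / len(grabs) if grabs else 0.0

# Toy example: the key is grabbed at step 3, the door is opened 2 steps later, GSR = 0.5.
traj = [
    ("move", "forward", True),
    ("grab", "chair", False),
    ("grab", "key", True),
    ("move", "door", True),
    ("interact", "door", True),
]
print(steps_to_prop(traj, "key"), steps_to_exit_after(traj, "key"), grab_success_rate(traj))
```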
Extensibility of EscapeCraft
An extended case study evaluates model adaptability to different room configurations, where the password is displayed as a numerical pattern instead of written text. GPT-4o quickly recognizes and uses the pattern when it is near the door but struggles when it is farther away, revealing limitations in long-term and spatial reasoning. Gemini fails to recognize the pattern entirely, instead resorting to exhaustive room searches. This highlights the importance of enhancing models' ability to interpret environmental cues dynamically.
Post-game Debriefing
After escaping, models are tasked with recalling and reconstructing their escape process. Only GPT-4o and Gemini-1.5-Pro are evaluated, owing to their higher escape rates. The findings show that models prioritize key actions (e.g., obtaining the password) but neglect background context, suggesting a trade-off between efficiency and holistic understanding. Future improvements should focus on enhancing memory and reasoning abilities to better capture contextual details beyond task completion.

Experiment Results


Models | Difficulty-1: ER (%)↑ / Steps↓ / Grab SR (%)↑ / Grab Ratio | Difficulty-2: ER (%)↑ / Prop (%)↑ / Steps↓ / Grab SR (%)↑ / Grab Ratio | Difficulty-3: ER (%)↑ / Prop (%)↑ / Steps↓ / Grab SR (%)↑ / Grab Ratio | AVG ER (%)↑
Human | 100.00 / 5.73 / 95.45 / 0.19 | 100.00 / 100.00 / 13.64 / 81.81 / 0.19 | 100.00 / 100.00 / 21.45 / 75.45 / 0.19 | 100.00
GPT-4o | 100.00 / 11.27 / 37.82 / 0.42 | 72.73 / 81.82 / 36.73 / 36.73 / 0.26 | 71.36 / 90.00 / 50.19 / 31.36 / 0.35 | 81.36
Gemini-1.5-pro | 81.82 / 21.18 / 49.18 / 0.39 | 54.55 / 90.91 / 47.82 / 14.89 / 0.44 | 46.82 / 74.49 / 73.18 / 10.43 / 0.48 | 61.06
Claude 3.5 Sonnet | 72.73 / 22.09 / 30.64 / 0.36 | 45.45 / 54.55 / 57.45 / 20.64 / 0.17 | 39.61 / 54.83 / 82.36 / 16.21 / 0.22 | 52.60
Doubao 1.5 Pro | 91.91 / 16.27 / 44.68 / 0.27 | 45.45 / 54.55 / 63.18 / 13.63 / 0.25 | 9.52 / 33.33 / 93.19 / 6.76 / 0.26 | 48.96
Llama-3.2-11b-vision | 63.64 / 23.55 / 31.36 / 0.35 | 0.00 / 27.27 / 75.00 / 3.16 / 0.44 | 0.00 / 27.27 / 100.00 / 3.55 / 0.32 | 21.21
Qwen-VL-Max | 18.18 / 42.64 / 11.36 / 0.05 | 0.00 / 27.27 / 75.00 / 3.51 / 0.15 | 9.09 / 18.18 / 94.18 / 2.72 / 0.31 | 9.09
Phi-3-vision-128k | 0.00 / 50.00 / 0.00 / 0.01 | 0.00 / 0.00 / 75.00 / 0.00 / 0.02 | 0.00 / 0.00 / 100.00 / 0.00 / 0.01 | 0.00
GLM-4v Flash | 0.00 / 50.00 / 0.00 / 0.00 | 0.00 / 0.00 / 75.00 / 0.00 / 0.00 | 0.00 / 0.00 / 100.00 / 0.00 / 0.00 | 0.00
(ER: escape rate; Grab SR: grab success rate; ↑ higher is better, ↓ lower is better.)

BibTeX

@misc{wang2025multimodallargelanguagemodels,
      title={How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game}, 
      author={Ziyue Wang and Yurui Dong and Fuwen Luo and Minyuan Ruan and Zhili Cheng and Chi Chen and Peng Li and Yang Liu},
      year={2025},
      eprint={2503.10042},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10042}, 
  }