We introduce a procedure that generates reasoning chains automatically by configuring a Prop Chain. The Prop Chain is a singly linked list representing the ordered sequence of items and interactions required to complete the game. Each node in the chain corresponds to an interactive element, such as a key, a locked box, or a note with a password, and the tail node represents the exit point of the game. To construct a complete escape game setting, we annotate the links between nodes in the prop chain to define how each prop is obtained (for example, available without restriction or requiring a key to open) and the containment relationships between props (for example, a key can be placed in a box).
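The structure described above can be sketched as a small linked list. This is a minimal illustration, not the authors' implementation: the class name `PropNode`, the fields `acquire` and `inside`, and the prop names are all assumptions introduced here to mirror the link annotations (how a prop is obtained, and what contains it).

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class PropNode:
    """One interactive element in the chain (all field names are hypothetical)."""
    name: str
    # How this prop is obtained: "free" (no restriction),
    # "needs_key", or "needs_password".
    acquire: str = "free"
    # Optional container holding this prop, e.g. a key placed in a box.
    inside: Optional[str] = None
    # Singly linked: each node only knows its successor.
    next: Optional["PropNode"] = None


def build_chain(nodes: List[PropNode]) -> PropNode:
    """Link nodes in order and return the head; the tail is the exit."""
    if not nodes:
        raise ValueError("a prop chain needs at least an exit node")
    for current, following in zip(nodes, nodes[1:]):
        current.next = following
    return nodes[0]


def walk(head: PropNode) -> Iterator[PropNode]:
    """Yield nodes from head to tail."""
    node: Optional[PropNode] = head
    while node is not None:
        yield node
        node = node.next


# Illustrative chain: note -> locked box -> key -> exit door.
chain = build_chain([
    PropNode("note_with_password"),
    PropNode("locked_box", acquire="needs_password"),
    PropNode("key", inside="locked_box"),
    PropNode("exit_door", acquire="needs_key"),
])
order = [node.name for node in walk(chain)]
```

Walking the list from head to tail gives the order in which an agent would have to obtain and use the props, with the exit as the final node.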
Difficulty-1: The simplest one-hop reasoning path where no props are needed to unlock the door. Models can exit by locating the door and interacting with it directly.
Difficulty-2: A two-hop reasoning path requiring an additional key or password compared to Difficulty-1. Models should search for the key or password and interact with it to unlock the door.
Difficulty-3: A three-hop reasoning path requiring both a password and a key, one hop more than Difficulty-2. This level challenges models with spatial reasoning, visual search, and prop utilization.
Since the prop chain can be extended indefinitely, our difficulty levels are inherently extendable. Moreover, the types of questions or tasks in each reasoning hop are customizable and interchangeable, further enhancing the difficulty and flexibility of MM-Escape. We also explore some extended settings that incorporate other tasks, such as embodied QA and visual logical reasoning.
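As a rough sketch of how the three levels map onto chains of increasing length, and how a further level could be derived by adding a hop, consider the following. The prop names, the `DIFFICULTY_CHAINS` mapping, the helper functions, and the password-before-key ordering in level 3 are assumptions made for illustration; the source does not specify them.

```python
from typing import Dict, List

# Hypothetical prop sequences for the three levels; the last entry is the exit.
# The password -> key ordering in level 3 is an assumption.
DIFFICULTY_CHAINS: Dict[int, List[str]] = {
    1: ["door"],
    2: ["key_or_password", "door"],
    3: ["password", "key", "door"],
}


def hop_count(chain: List[str]) -> int:
    """Number of reasoning hops, taken here as the number of chain nodes."""
    return len(chain)


def add_hop(chain: List[str], prerequisite: str) -> List[str]:
    """Return a new chain with one more prerequisite in front of the old head."""
    return [prerequisite] + chain


# Extending level 3 with one more (hypothetical) prop gives a four-hop chain.
level_4 = add_hop(DIFFICULTY_CHAINS[3], "note_with_hint")
```

Because `add_hop` returns a new list, the existing levels stay unchanged while longer chains are derived from them.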
Takeaways
• Distinct Human-Like Behavioral Patterns: Models exhibit unique behaviors in the escape task. Gemini tends to remain in a fixed location at the start, scanning its surroundings before acting, while GPT-4o first observes a wider range for a more global understanding.
• Diverse Observation Strategies: Gemini frequently adopts a downward-facing view to inspect objects like tables and chairs, whereas GPT-4o predominantly relies on a front-facing view.
• Common Failure Modes: Models frequently fail at movement and interaction. GPT-4o tends to repeat trajectories, while Gemini and Claude get stuck in confined areas. Phi-3 and Qwen-VL fail to combine actions correctly, leading to imprecise object interactions.
• Adherence to Long-Term Goals: Most models focus on locating the exit and key objects, but Phi-3 consistently fails to recognize doors, unlike other models that engage with doors upon detection.
Models | Difficulty-1 ER (%)↑ | Difficulty-1 Steps↓ | Difficulty-1 Grab SR (%)↑ | Difficulty-1 Grab Ratio | Difficulty-2 ER (%)↑ | Difficulty-2 Prop (%)↑ | Difficulty-2 Steps↓ | Difficulty-2 Grab SR (%)↑ | Difficulty-2 Grab Ratio | Difficulty-3 ER (%)↑ | Difficulty-3 Prop (%)↑ | Difficulty-3 Steps↓ | Difficulty-3 Grab SR (%)↑ | Difficulty-3 Grab Ratio | AVG ER (%)↑ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Human | 100.00 | 5.73 | 95.45 | 0.19 | 100.00 | 100.00 | 13.64 | 81.81 | 0.19 | 100.00 | 100.00 | 21.45 | 75.45 | 0.19 | 100.00 |
GPT-4o | 100.00 | 11.27 | 37.82 | 0.42 | 72.73 | 81.82 | 36.73 | 36.73 | 0.26 | 71.36 | 90.00 | 50.19 | 31.36 | 0.35 | 81.36 |
Gemini-1.5-pro | 81.82 | 21.18 | 49.18 | 0.39 | 54.55 | 90.91 | 47.82 | 14.89 | 0.44 | 46.82 | 74.49 | 73.18 | 10.43 | 0.48 | 61.06 |
Claude 3.5 Sonnet | 72.73 | 22.09 | 30.64 | 0.36 | 45.45 | 54.55 | 57.45 | 20.64 | 0.17 | 39.61 | 54.83 | 82.36 | 16.21 | 0.22 | 52.60 |
Doubao 1.5 Pro | 91.91 | 16.27 | 44.68 | 0.27 | 45.45 | 54.55 | 63.18 | 13.63 | 0.25 | 9.52 | 33.33 | 93.19 | 6.76 | 0.26 | 48.96 |
Llama-3.2-11b-vision | 63.64 | 23.55 | 31.36 | 0.35 | 0.00 | 27.27 | 75.00 | 3.16 | 0.44 | 0.00 | 27.27 | 100.00 | 3.55 | 0.32 | 21.21 |
Qwen-VL-Max | 18.18 | 42.64 | 11.36 | 0.05 | 0.00 | 27.27 | 75.00 | 3.51 | 0.15 | 9.09 | 18.18 | 94.18 | 2.72 | 0.31 | 9.09 |
Phi-3-vision-128k | 0.00 | 50.00 | 0.00 | 0.01 | 0.00 | 0.00 | 75.00 | 0.00 | 0.02 | 0.00 | 0.00 | 100.00 | 0.00 | 0.01 | 0.00 |
GLM-4v Flash | 0.00 | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 75.00 | 0.00 | 0.00 | 0.00 | 0.00 | 100.00 | 0.00 | 0.00 | 0.00 |
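
The AVG ER column in the table above is consistent with an unweighted mean of the three per-difficulty ER values. The source does not state the formula, so this is an inference from the numbers. A short check over a few rows copied from the table (the function name `avg_er` is hypothetical):

```python
from typing import Dict, Sequence, Tuple


def avg_er(per_level_er: Sequence[float]) -> float:
    """Unweighted mean of per-difficulty escape rates, rounded to 2 decimals."""
    return round(sum(per_level_er) / len(per_level_er), 2)


# (Difficulty-1 ER, Difficulty-2 ER, Difficulty-3 ER) -> reported AVG ER
rows: Dict[str, Tuple[Tuple[float, float, float], float]] = {
    "GPT-4o": ((100.00, 72.73, 71.36), 81.36),
    "Gemini-1.5-pro": ((81.82, 54.55, 46.82), 61.06),
    "Claude 3.5 Sonnet": ((72.73, 45.45, 39.61), 52.60),
    "Llama-3.2-11b-vision": ((63.64, 0.00, 0.00), 21.21),
}

# Difference between the recomputed and the reported average for each model.
gaps = {name: abs(avg_er(er) - reported) for name, (er, reported) in rows.items()}
```
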
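Models | Difficulty-1 & Difficulty-1 ER (%)↑ | Difficulty-1 & Difficulty-1 Steps↓ | Difficulty-1 & Difficulty-1 Grab SR (%)↑ | Difficulty-1 & Difficulty-1 Grab Ratio | Difficulty-1 & Difficulty-2 ER (%)↑ | Difficulty-1 & Difficulty-2 Prop (%)↑ | Difficulty-1 & Difficulty-2 Steps↓ | Difficulty-1 & Difficulty-2 Grab SR (%)↑ | Difficulty-1 & Difficulty-2 Grab Ratio | Difficulty-2 & Difficulty-2 ER (%)↑ | Difficulty-2 & Difficulty-2 Prop (%)↑ | Difficulty-2 & Difficulty-2 Steps↓ | Difficulty-2 & Difficulty-2 Grab SR (%)↑ | Difficulty-2 & Difficulty-2 Grab Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|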
GPT-4o | 75.00 | 35.50 | 34.25 | 0.32 | 90.00 | 100.00 | 34.90 | 35.52 | 0.31 | 70.00 | 80.00 | 39.50 | 42.32 | 0.37 |
Gemini-1.5-pro | 22.22 | 40.22 | 22.89 | 0.38 | 40.00 | 50.00 | 56.60 | 16.79 | 0.05 | 60.00 | 80.00 | 60.00 | 22.71 | 0.34 |
Claude 3.5 Sonnet | 22.22 | 45.22 | 10.62 | 0.08 | 20.00 | 20.00 | 71.90 | 6.75 | 0.09 | 10.00 | 10.00 | 80.00 | 23.20 | 0.06 |
Llama-3.2-11b-vision | 55.56 | 31.00 | 36.25 | 0.36 | 10.00 | 60.00 | 66.40 | 4.40 | 0.40 | 10.00 | 40.00 | 76.80 | 27.00 | 0.19 |
Qwen-VL-Max | 22.22 | 40.33 | 12.96 | 0.16 | 30.00 | 50.00 | 57.70 | 42.30 | 0.28 | 0.00 | 10.00 | 80.00 | 23.66 | 0.32 |