CoSpace

CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models

Yiqi Zhu^1*, Ziyue Wang^1*, Can Zhang³, Peng Li^2†, Yang Liu^1,2†

¹ Department of Computer Science and Technology, Tsinghua University
² Institute for AI Industry Research (AIR), Tsinghua University
³ School of Computer and Communication Engineering, University of Science and Technology Beijing

^* Equal Contribution
^† Corresponding Author

arXiv | Github

Abstract

Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension. As the permitting length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs in tasks involving complex visual instructions following, multi-image understanding and spatial reasoning. However, they usually focus on spatially irrelevant images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underestimated. We term this characteristic as Continuous Space Perception. When observing a scene from a static viewpoint while shifting orientations, it produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the Continuous Space perception ability for VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We conduct evaluation across 19 proprietary and open-source VLMs. Results reveal that there exist pitfalls on the continuous space perception ability for most of the evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing the ability of continuous space perception is essential for VLMs to perform effectively in real-world tasks and encourage further research to advance this capability.

Task Design

Direction Recognition (DIR-Rec)

In real world, identifying direction is inevitable when one is placed in a new environment. Similarly, the DIR-Rec task requires models to recognize the direction of targets by answering questions such as “where is the building located?” Meanwhile, we provide detailed task instructions, including direction references such as “the first image is facing north, the second image is facing east”. We standardize 8 directional options, including four cardinal directions, North, East, South and West, and four intercardinal directions, Northeast, Southeast, Southwest, and Northwest.

Directional Object Perception (DIR-Obj)

Some real-world tasks demand abilities far beyond merely identifying the direction of a specified target. It is also a critical ability to distinguish content regarding different directions and to align objects with correct directions. Given a specified direction, the DIR-Obj task requires models to be aware of the appearing objects, and also distinguish those absent from that direction. To achieve this, models need to first reconstruct the entire space from continuous views, and then correctly identify the objects dedicate to a given direction. Specifically, in this task, questions typically focus on the intercardinal directions such as “What is visible in the southeast direction?”, which necessitates a more comprehensive understanding of the entire space.

Rotation-Angle (ROT-Ang)

For an observant, when rotating the perspective for more spatial information, there could be overlaps between adjacent observations. Specifically, human can tell the approximate turning angle between consecutive images observed from a static viewpoint by noticing identical furnishings appearing across images. In this task, we investigate if models exhibit similar ability as human does, which requires fine-grained continuous spatial understanding. We apply unified question “What is the turning angle between the adjacent image?” and provide two candidate options for model. This task requires models to carefully exam the overlapping and different areas between adjacent images and the shared features in spatially continuous image sequence.

Rotation-Difference (ROT-Dif)

Following the ROT-Ang task, we also investigate the model ability to identify distinguished rotation angles. For the ROT-Dif task, models are given a sequence of five images, four of which share the same turning angle and the other is exceptional. Compared to ROT-Ang, this task emphasizes more on the global understanding of images regarding continuous visual space. The unified question of this task is “Which image does not belong to this image sequence?” It is an open-ended question without options, where models are required to answer with index of the exceptional image.

Counting (CNT)

Counting is a widely adopted task in existing visual benchmarks that asks models to recognize objects and deduce the time of occurrence of target objects. Generally, existing counting tasks only require models to deal with a single image or multiple spatially discrete images, while in our benchmark, images from continuous visual space where identical objects can occur in multiple images are focused on. This raise a challenge for models to not only recognize targets and count for their occurrences, but also be aware of the existence of the same object across different images. To achieve this, models should locate the overlapping area of adjacent images and align the same object appearing different images. The CNT task is an open-ended task, where models should response with the total count of the target.

Planning-Question Answering (PLA-QA)

Following the implementation of Embodied Question Answering (EQA), we develop the PLA-QA task, requiring models to identify the location of a certain object given a continuous embodied space. In this task, instructions like “Where is the television regarding your position?” are provided for the models, and we formulate four options, containing candidate directions relative to the agent, for each question.

Planning-Decision (PLA-Dec)

This task further investigate the understanding of continuous visual space by asking models to select the proper route to reach the target object. For disambiguation, we standardize the action space as turning (turning to other directions without displacement) and go ahead. The PLA-Dec task especially focus on the order of actions. For instance “Turn back and go ahead” and “Go ahead and turn back” represent two totally different actions and end up in different positions.

Results

Models	DIR-Rec		DIR-Obj		CNT		ROT-Ang	ROT-Dif	PLA-QA	PLA-Dec	Average
Models	ACC_q	ACC_p	ACC_q	ACC_p	ACC_q	ACC_p	ROT-Ang	ROT-Dif	PLA-QA	PLA-Dec	Average
Proprietary Models
Claude-3.7-sonnet	44.40	29.20	45.60	35.60	45.00	38.00	64.33	93.50	54.73	69.34	51.97
Gemini-1.5-pro	37.60	15.60	40.60	31.60	38.25	24.00	59.67	82.00	51.64	62.91	44.39
GPT-4o	40.40	22.80	46.00	32.00	40.00	23.50	58.33	50.50	53.05	54.46	42.10
>70B Open-source Models
InternVL2_5-78B	32.20	24.40	54.40	47.60	51.25	43.00	50.00	77.00	32.39	42.72	45.50
Qwen2-VL-72B	31.00	23.20	53.60	44.80	44.75	37.00	50.00	62.00	42.72	59.15	44.82
InternVL2-76B	33.60	13.20	46.40	39.20	50.00	43.50	50.00	23.00	25.35	30.05	35.43
LLaVA-OneVision-72B	19.20	10.80	44.20	34.40	24.25	19.00	50.00	26.00	32.86	30.99	29.17
<13B Open-source Models
MiniCPM-V 2.6	32.80	21.20	40.40	31.60	38.50	31.50	50.00	56.00	41.31	27.70	37.10
Qwen2-VL-7B	26.40	16.40	39.20	31.60	45.00	36.00	50.00	51.50	34.27	26.76	35.71
Mantis-8B	30.60	24.40	31.20	28.00	41.75	36.50	50.00	39.50	33.33	27.70	34.30
InternVL2-8B	29.00	12.80	37.20	30.40	38.50	31.00	50.33	47.50	32.39	27.70	33.68
VILA1.5-8B	34.00	30.40	28.20	24.00	42.25	38.00	50.00	18.50	22.07	42.72	33.01
Idefics3-8B	34.00	21.20	38.60	28.00	32.75	25.00	48.00	28.00	26.76	25.35	30.77
LLaVA-OneVision-7B	18.80	12.00	37.60	33.60	37.75	33.00	50.00	21.50	29.58	26.29	30.01
Phi-3.5-vision	20.40	12.00	33.20	28.40	36.75	34.00	50.00	22.50	34.74	26.76	29.88
Brote-IM-XXL	33.00	13.20	31.00	24.80	30.75	29.50	50.00	18.00	26.76	20.19	27.72
LongVA-7B	23.00	16.00	31.60	28.00	28.00	22.00	48.67	21.50	19.25	30.99	26.90
Mono-InternVL-2B	30.00	27.60	32.00	27.60	14.00	8.50	50.00	18.00	25.35	24.88	25.79
mPLUG-Owl3-7B	22.80	6.80	29.20	15.60	22.25	9.00	46.33	16.50	27.70	21.13	21.73

Case Study