CODIS

Benchmarking Context-Dependent Visual Comprehension
for Multimodal Large Language Models

Fuwen Luo1*, Chi Chen1*, Zihao Wan1, Zhaolu Kang4, Qidong Yan3, Yingjie Li3,
Xiaolong Wang1, Siyu Wang2, Ziyue Wang1, Xiaoyue Mi5,
Peng Li2†, Ning Ma3, Maosong Sun1†, Yang Liu1,2

1Department of Computer Science and Technology, Tsinghua University, Beijing, China
2Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
3Key Laboratory of Linguistic and Cultural Computing, Ministry of Education,
Northwest Minzu University, China

4College of Software, Jilin University, China
5Institute of Computing Technology, Chinese Academy of Sciences

[Figure: staircase example with and without context]

The interpretation of an image can be heavily influenced by context. In this example, whether the photographer was ascending or descending the staircase is ambiguous without additional context (a). Once the context specifies where the greenery lies relative to the observer, the direction of movement becomes clear (b).

Introduction

In certain situations, images need to be interpreted within a broader context. We introduce a new benchmark, CODIS (COntext-Dependent Image diSambiguation), designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. It stands out from existing benchmarks in three main aspects:

1. Each image in CODIS contains inherent ambiguity that can only be resolved with additional context.
2. The questions are deliberately designed to highlight these ambiguities, requiring external context for accurate interpretation.
3. For every image-question pair, we provide two different contexts in free-form text, each leading to a different interpretation of the image.

CODIS Benchmark

Overview

CODIS (COntext-Dependent Image diSambiguation) assesses the ability of models to use context provided in free-form text to enhance visual comprehension. We identify five representative types of context: three types of global context that pertain to the overall scene, namely the global background (location and orientation), temporal information, and cultural background; and two types of local context related to objects within the scene, namely the attributes of objects and the relationships between people.

[Figure: taxonomy of the five context types]

To prevent models from guessing the correct answers without truly understanding the context, we organize our dataset in pairs. Each pair contains two queries $ (I,Q,C_1) $ and $ (I,Q,C_2) $, which share the same image $ I $ and question $ Q $ but differ in their contexts $ C_1 $ and $ C_2 $. We feed the two queries to an MLLM separately and obtain outputs $ O_1 $ and $ O_2 $.
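As an illustration, the pairing and two-pass querying could be organized as in the sketch below. The record fields and the `model(image_path, prompt)` interface are hypothetical, not the released data format or an official API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CodisPair:
    """One CODIS pair: the same image I and question Q with two contexts C1, C2."""
    image_path: str   # I
    question: str     # Q
    context_1: str    # C1
    context_2: str    # C2
    answer_1: str     # ground-truth answer under C1
    answer_2: str     # ground-truth answer under C2
    category: str     # e.g. "loc_ori", "temporal", "cultural", "attributes", "relationships"

def query_pair(model: Callable[[str, str], str], pair: CodisPair) -> tuple[str, str]:
    """Query the model separately on (I, Q, C1) and (I, Q, C2), returning (O1, O2)."""
    prompt_1 = f"Context: {pair.context_1}\nQuestion: {pair.question}"
    prompt_2 = f"Context: {pair.context_2}\nQuestion: {pair.question}"
    o1 = model(pair.image_path, prompt_1)  # O1
    o2 = model(pair.image_path, prompt_2)  # O2
    return o1, o2
```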

We use two evaluation metrics: pair-wise accuracy $ \mathrm{Acc}_p $ and query-wise accuracy $ \mathrm{Acc}_q $. For $ \mathrm{Acc}_p $, a model scores on a pair only if both of its answers are correct. For $ \mathrm{Acc}_q $, a model scores for every individual query it answers correctly.
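A small sketch of how the two metrics can be computed, assuming per-query correctness judgments (from a human or GPT-4 judge) are already available as booleans; the function name and input format are illustrative.

```python
def codis_accuracy(judgements: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Compute pair-wise and query-wise accuracy from per-query correctness.

    Each element is (correct_1, correct_2): whether O1 and O2 match the
    ground-truth answers under C1 and C2, respectively.
    """
    n_pairs = len(judgements)
    correct_pairs = sum(1 for c1, c2 in judgements if c1 and c2)
    correct_queries = sum(int(c1) + int(c2) for c1, c2 in judgements)
    acc_p = correct_pairs / n_pairs          # Acc_p: both answers in a pair correct
    acc_q = correct_queries / (2 * n_pairs)  # Acc_q: each query counted separately
    return acc_p, acc_q

# Example: 3 pairs -> Acc_p = 1/3, Acc_q = 4/6
print(codis_accuracy([(True, True), (True, False), (False, True)]))
```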

We manually collect images that contain ambiguities which can only be resolved with external context. Most of these images are real-scene photographs drawn from the publicly available ShareGPT4V dataset and the Internet; the remainder are created manually. For each collected image, we manually write the questions, contexts, and answers.

Comparisons with Existing Benchmarks

We summarize recent benchmarks for MLLMs in the following table. Most of them do not pair images with additional context. Only two, VisDial and MMDialog, include extra context, and in both cases it serves to support dialogue with humans rather than to clarify the meaning of images. As a result, these benchmarks cannot fully test the ability of MLLMs to understand images in a context-dependent manner.

[Table: comparison of CODIS with existing benchmarks]

Statistics

We collected 216 images and a total of 706 queries, spanning five categories and a wide range of scenarios. The distribution of categories and scenarios is illustrated in the figure below.

[Figure: distribution of categories and scenarios]

Evaluation

Leaderboard

We evaluate 14 popular MLLMs, divided into three groups: (1) API-based models; (2) open-source ~7B models; and (3) open-source ~13B models. "Human" refers to the average performance of five independent human annotators. The leaderboards below report $ \mathrm{Acc}_p $ and $ \mathrm{Acc}_q $ under both human evaluation and GPT-4 evaluation.
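For the GPT-4-based evaluation, a judging call might look like the sketch below, assuming the OpenAI Python SDK (v1+); the prompt wording is illustrative and not the exact grading rubric used for CODIS.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def gpt4_judge(question: str, context: str, reference: str, prediction: str) -> bool:
    """Ask GPT-4 whether a model's answer matches the reference (illustrative prompt)."""
    prompt = (
        "You are grading answers to a visual question answered with extra context.\n"
        f"Question: {question}\nContext: {context}\n"
        f"Reference answer: {reference}\nModel answer: {prediction}\n"
        "Reply with 'correct' or 'incorrect' only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")
```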

Leaderboard 1: human evaluation.

| Model | Loc & Ori | Temporal | Cultural | Attributes | Relationships | Overall |
|---|---|---|---|---|---|---|
| Human | 85.2 / 86.1 | 90.9 / 92.8 | 72.8 / 76.4 | 87.2 / 88.4 | 89.6 / 90.0 | 86.2 / 87.7 |
| API-based Models | | | | | | |
| GPT-4V | 33.3 / 54.2 | 28.4 / 52.1 | 25.5 / 60.6 | 26.7 / 54.7 | 51.9 / 70.2 | 32.3 / 56.9 |
| Gemini | 21.4 / 49.4 | 29.5 / 51.1 | 21.3 / 56.4 | 24.0 / 52.0 | 34.6 / 58.7 | 26.1 / 52.7 |
| Open-source ~13B Models | | | | | | |
| LLaVA-1.5-13B | 6.0 / 41.1 | 4.2 / 44.7 | 10.6 / 50.0 | 14.7 / 51.3 | 13.5 / 54.8 | 9.1 / 47.5 |
| BLIP-2-11B | 6.0 / 32.7 | 8.4 / 45.8 | 4.3 / 35.1 | 6.7 / 42.0 | 11.5 / 51.9 | 7.4 / 41.4 |
| InstructBLIP-13B | 6.0 / 39.3 | 2.1 / 41.6 | 4.3 / 50.0 | 4.0 / 44.7 | 7.7 / 51.0 | 4.5 / 44.2 |
| Open-source ~7B Models | | | | | | |
| mPLUG-Owl-2-7B | 13.1 / 42.3 | 9.5 / 41.6 | 6.4 / 42.6 | 12.0 / 44.7 | 19.2 / 51.9 | 11.9 / 44.1 |
| MiniGPT4-7B | 10.7 / 36.3 | 3.2 / 34.2 | 0.0 / 27.7 | 12.0 / 35.3 | 13.5 / 47.1 | 7.9 / 36.0 |
| LLaVA-1.5-7B | 11.9 / 42.9 | 5.3 / 44.7 | 4.3 / 43.6 | 9.3 / 39.3 | 7.7 / 47.1 | 7.9 / 43.3 |
| InstructBLIP-7B | 1.2 / 33.3 | 7.4 / 45.8 | 0.0 / 46.8 | 4.0 / 43.3 | 11.5 / 48.1 | 4.8 / 42.8 |
| Otter-7B | 2.4 / 32.7 | 5.3 / 41.1 | 4.3 / 28.7 | 0.0 / 26.0 | 5.8 / 40.4 | 3.4 / 34.1 |
| LLaVA-7B | 2.4 / 30.4 | 6.3 / 34.2 | 0.0 / 25.5 | 1.3 / 34.0 | 5.8 / 41.3 | 3.4 / 33.1 |
| Qwen-VL-Chat | 3.6 / 23.8 | 3.2 / 24.7 | 0.0 / 24.5 | 1.3 / 32.0 | 9.6 / 34.6 | 3.4 / 27.5 |
| OpenFlamingo-7B | 2.4 / 40.5 | 2.1 / 38.9 | 0.0 / 27.7 | 5.3 / 36.0 | 5.8 / 47.1 | 3.1 / 38.4 |
| BLIP-2-6.7B | 0.0 / 41.1 | 1.1 / 44.7 | 2.1 / 48.9 | 2.7 / 46.0 | 7.7 / 53.8 | 2.3 / 46.0 |
Leaderboard 2: GPT-4 evaluation.

| Model | Loc & Ori | Temporal | Cultural | Attributes | Relationships | Overall |
|---|---|---|---|---|---|---|
| API-based Models | | | | | | |
| GPT-4V | 33.3 / 53.6 | 28.4 / 50.5 | 21.3 / 53.2 | 25.3 / 54.0 | 50.0 / 69.2 | 31.2 / 55.1 |
| Gemini | 20.2 / 48.8 | 27.4 / 50.0 | 21.3 / 54.3 | 22.7 / 51.3 | 30.8 / 54.8 | 24.4 / 51.3 |
| Open-source ~13B Models | | | | | | |
| LLaVA-1.5-13B | 6.0 / 41.1 | 3.2 / 43.2 | 12.8 / 46.8 | 13.3 / 50.0 | 11.5 / 53.8 | 8.5 / 46.2 |
| BLIP-2-11B | 6.0 / 34.5 | 10.5 / 44.2 | 4.3 / 30.9 | 6.7 / 40.7 | 11.5 / 47.1 | 8.0 / 39.8 |
| InstructBLIP-13B | 6.0 / 39.9 | 2.1 / 41.1 | 6.4 / 46.8 | 4.0 / 44.7 | 5.8 / 48.1 | 4.5 / 43.3 |
| Open-source ~7B Models | | | | | | |
| mPLUG-Owl-2-7B | 13.1 / 39.9 | 9.5 / 40.0 | 4.3 / 41.5 | 9.3 / 42.7 | 11.5 / 48.1 | 9.9 / 41.9 |
| MiniGPT4-7B | 10.7 / 34.5 | 4.2 / 32.1 | 0.0 / 27.7 | 8.0 / 32.7 | 9.6 / 43.3 | 6.8 / 33.9 |
| LLaVA-1.5-7B | 8.3 / 37.5 | 1.1 / 36.3 | 2.1 / 40.4 | 9.3 / 37.3 | 7.7 / 48.1 | 5.7 / 39.1 |
| InstructBLIP-7B | 1.2 / 34.5 | 5.3 / 43.7 | 0.0 / 45.7 | 4.0 / 44.0 | 11.5 / 47.1 | 4.2 / 42.4 |
| Otter-7B | 2.4 / 31.5 | 3.2 / 35.3 | 0.0 / 23.4 | 1.3 / 27.3 | 5.8 / 34.6 | 2.5 / 31.0 |
| LLaVA-7B | 2.4 / 29.8 | 4.2 / 33.7 | 0.0 / 17.0 | 2.7 / 33.3 | 1.9 / 37.5 | 2.5 / 31.0 |
| Qwen-VL-Chat | 4.8 / 23.8 | 3.2 / 23.7 | 0.0 / 23.4 | 1.3 / 32.0 | 7.7 / 33.7 | 3.4 / 26.9 |
| OpenFlamingo-7B | 2.4 / 40.5 | 2.1 / 38.9 | 0.0 / 27.7 | 5.3 / 36.0 | 5.8 / 47.1 | 3.1 / 38.4 |
| BLIP-2-6.7B | 0.0 / 42.3 | 1.1 / 43.2 | 4.3 / 48.9 | 4.0 / 46.7 | 5.8 / 51.0 | 2.5 / 45.6 |

Overall results of different models on CODIS under human evaluation (Leaderboard 1) and GPT-4 evaluation (Leaderboard 2). In each cell, the first number is $ \mathrm{Acc}_p $ and the second is $ \mathrm{Acc}_q $.

Examples

Citation


      @article{luo2024codis,
        title={CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models},
        author={Fuwen Luo and Chi Chen and Zihao Wan and Zhaolu Kang and Qidong Yan and Yingjie Li and Xiaolong Wang and Siyu Wang and Ziyue Wang and Xiaoyue Mi and Peng Li and Ning Ma and Maosong Sun and Yang Liu},
        journal={arXiv preprint arXiv:2402.13607},
        year={2024}
      }