CODIS

Benchmarking Context-Dependent Visual Comprehension
for Multimodal Large Language Models

Fuwen Luo1*, Chi Chen1*, Zihao Wan1, Zhaolu Kang4, Qidong Yan3, Yingjie Li3,
Xiaolong Wang1, Siyu Wang2, Ziyue Wang1, Xiaoyue Mi5,
Peng Li2†, Ning Ma3, Maosong Sun1†, Yang Liu1,2

1Department of Computer Science and Technology, Tsinghua University, Beijing, China
2Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
3Key Laboratory of Linguistic and Cultural Computing, Ministry of Education,
Northwest Minzu University, China

4College of Software, Jilin University, China
5Institute of Computing Technology, Chinese Academy of Sciences

[Figure: staircase example with and without context]

The interpretation of an image can be heavily influenced by context. In this example, whether the photographer was ascending or descending the staircase is ambiguous without additional context (a). Once the context specifies where the greenery lies relative to the observer, the direction of movement becomes clear (b).

Introduction

In certain situations, images need to be interpreted within a broader context. We introduce a new benchmark, CODIS (COntext-Dependent Image diSambiguation), designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. It stands out from existing benchmarks in three main aspects:

1. Each image in CODIS contains inherent ambiguity that can only be resolved with additional context.
2. The questions are deliberately designed to highlight these ambiguities, requiring external context for accurate interpretation.
3. For every image-question pair, we provide two different contexts in free-form text, each leading to a different interpretation of the image.

CODIS Benchmark

Overview

CODIS (COntext-Dependent Image diSambiguation) assesses the ability of models to use context provided in free-form text to enhance visual comprehension. We identify five representative types of context: three types of global context that pertain to the overall scene, namely the global background (location and orientation), temporal information, and cultural background; and two types of local context related to objects within the scene, namely the attributes of objects and the relationships between people.

[Figure: taxonomy of the five context types]

To prevent models from guessing the correct answers without truly understanding the context, we organize our dataset in pairs. Each pair contains two queries $ (I,Q,C_1) $ and $ (I,Q,C_2) $, which share the same image $ I $ and question $ Q $ but differ in their contexts $ C_1 $ and $ C_2 $. We feed the two queries to an MLLM separately and obtain outputs $ O_1 $ and $ O_2 $.
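As an illustration, the pairing and two-pass querying could be organized as in the sketch below. The record fields and the `model(image_path, prompt)` interface are hypothetical, not the released data format or an official API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CodisPair:
    """One CODIS pair: the same image I and question Q with two contexts C1, C2."""
    image_path: str   # I
    question: str     # Q
    context_1: str    # C1
    context_2: str    # C2
    answer_1: str     # ground-truth answer under C1
    answer_2: str     # ground-truth answer under C2
    category: str     # e.g. "loc_ori", "temporal", "cultural", "attributes", "relationships"

def query_pair(model: Callable[[str, str], str], pair: CodisPair) -> tuple[str, str]:
    """Query the model separately on (I, Q, C1) and (I, Q, C2), returning (O1, O2)."""
    prompt_1 = f"Context: {pair.context_1}\nQuestion: {pair.question}"
    prompt_2 = f"Context: {pair.context_2}\nQuestion: {pair.question}"
    o1 = model(pair.image_path, prompt_1)  # O1
    o2 = model(pair.image_path, prompt_2)  # O2
    return o1, o2
```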

We use two evaluation metrics: pair-wise accuracy $ \mathrm{Acc}_p $ and query-wise accuracy $ \mathrm{Acc}_q $. For $ \mathrm{Acc}_p $, a model scores on a pair only if both of its answers are correct. For $ \mathrm{Acc}_q $, a model scores for every individual query it answers correctly.
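A small sketch of how the two metrics can be computed, assuming per-query correctness judgments (from a human or GPT-4 judge) are already available as booleans; the function name and input format are illustrative.

```python
def codis_accuracy(judgements: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Compute pair-wise and query-wise accuracy from per-query correctness.

    Each element is (correct_1, correct_2): whether O1 and O2 match the
    ground-truth answers under C1 and C2, respectively.
    """
    n_pairs = len(judgements)
    correct_pairs = sum(1 for c1, c2 in judgements if c1 and c2)
    correct_queries = sum(int(c1) + int(c2) for c1, c2 in judgements)
    acc_p = correct_pairs / n_pairs          # Acc_p: both answers in a pair correct
    acc_q = correct_queries / (2 * n_pairs)  # Acc_q: each query counted separately
    return acc_p, acc_q

# Example: 3 pairs -> Acc_p = 1/3, Acc_q = 4/6
print(codis_accuracy([(True, True), (True, False), (False, True)]))
```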

We manually collect images that contain ambiguities which can only be resolved with external context. Most of these images are real-scene photographs drawn from the publicly available ShareGPT4V dataset and the Internet; the remainder are created manually. For each collected image, we manually write the questions, contexts, and answers.

Comparisons with Existing Benchmarks

We summarize recent benchmarks for MLLMs in the following table. Most of them do not pair images with additional context. Only two, VisDial and MMDialog, include extra context, and in both cases it serves to support dialogue with humans rather than to clarify the meaning of images. As a result, these benchmarks cannot fully test the ability of MLLMs to understand images in a context-dependent manner.

[Table: comparison of CODIS with existing benchmarks]

Statistics

We collected 216 images and a total of 706 queries, spanning five categories and a wide range of scenarios. The distribution of categories and scenarios is illustrated in the figure below.

[Figure: distribution of categories and scenarios]

Evaluation

Leaderboard

We evaluate 14 popular MLLMs, divided into three groups: (1) API-based models; (2) open-source ~7B models; and (3) open-source ~13B models. "Human" refers to the average performance of five independent human annotators. The leaderboards below report $ \mathrm{Acc}_p $ and $ \mathrm{Acc}_q $ under both human evaluation and GPT-4 evaluation.
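For the GPT-4-based evaluation, a judging call might look like the sketch below, assuming the OpenAI Python SDK (v1+); the prompt wording is illustrative and not the exact grading rubric used for CODIS.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def gpt4_judge(question: str, context: str, reference: str, prediction: str) -> bool:
    """Ask GPT-4 whether a model's answer matches the reference (illustrative prompt)."""
    prompt = (
        "You are grading answers to a visual question answered with extra context.\n"
        f"Question: {question}\nContext: {context}\n"
        f"Reference answer: {reference}\nModel answer: {prediction}\n"
        "Reply with 'correct' or 'incorrect' only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")
```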

Leaderboard 1: human evaluation.

| Model | Loc & Ori | Temporal | Cultural | Attributes | Relationships | Overall |
|---|---|---|---|---|---|---|
| Human | 85.2 / 86.1 | 90.9 / 92.8 | 72.8 / 76.4 | 87.2 / 88.4 | 89.6 / 90.0 | 86.2 / 87.7 |
| API-based Models | | | | | | |
| GPT-4V | 33.3 / 54.2 | 28.4 / 52.1 | 25.5 / 60.6 | 26.7 / 54.7 | 51.9 / 70.2 | 32.3 / 56.9 |
| Gemini | 21.4 / 49.4 | 29.5 / 51.1 | 21.3 / 56.4 | 24.0 / 52.0 | 34.6 / 58.7 | 26.1 / 52.7 |
| Open-source ~13B Models | | | | | | |
| LLaVA-1.5-13B | 6.0 / 41.1 | 4.2 / 44.7 | 10.6 / 50.0 | 14.7 / 51.3 | 13.5 / 54.8 | 9.1 / 47.5 |
| BLIP-2-11B | 6.0 / 32.7 | 8.4 / 45.8 | 4.3 / 35.1 | 6.7 / 42.0 | 11.5 / 51.9 | 7.4 / 41.4 |
| InstructBLIP-13B | 6.0 / 39.3 | 2.1 / 41.6 | 4.3 / 50.0 | 4.0 / 44.7 | 7.7 / 51.0 | 4.5 / 44.2 |
| Open-source ~7B Models | | | | | | |
| mPLUG-Owl-2-7B | 13.1 / 42.3 | 9.5 / 41.6 | 6.4 / 42.6 | 12.0 / 44.7 | 19.2 / 51.9 | 11.9 / 44.1 |
| MiniGPT4-7B | 10.7 / 36.3 | 3.2 / 34.2 | 0.0 / 27.7 | 12.0 / 35.3 | 13.5 / 47.1 | 7.9 / 36.0 |
| LLaVA-1.5-7B | 11.9 / 42.9 | 5.3 / 44.7 | 4.3 / 43.6 | 9.3 / 39.3 | 7.7 / 47.1 | 7.9 / 43.3 |
| InstructBLIP-7B | 1.2 / 33.3 | 7.4 / 45.8 | 0.0 / 46.8 | 4.0 / 43.3 | 11.5 / 48.1 | 4.8 / 42.8 |
| Otter-7B | 2.4 / 32.7 | 5.3 / 41.1 | 4.3 / 28.7 | 0.0 / 26.0 | 5.8 / 40.4 | 3.4 / 34.1 |
| LLaVA-7B | 2.4 / 30.4 | 6.3 / 34.2 | 0.0 / 25.5 | 1.3 / 34.0 | 5.8 / 41.3 | 3.4 / 33.1 |
| Qwen-VL-Chat | 3.6 / 23.8 | 3.2 / 24.7 | 0.0 / 24.5 | 1.3 / 32.0 | 9.6 / 34.6 | 3.4 / 27.5 |
| OpenFlamingo-7B | 2.4 / 40.5 | 2.1 / 38.9 | 0.0 / 27.7 | 5.3 / 36.0 | 5.8 / 47.1 | 3.1 / 38.4 |
| BLIP-2-6.7B | 0.0 / 41.1 | 1.1 / 44.7 | 2.1 / 48.9 | 2.7 / 46.0 | 7.7 / 53.8 | 2.3 / 46.0 |
Leaderboard 2: GPT-4 evaluation.

| Model | Loc & Ori | Temporal | Cultural | Attributes | Relationships | Overall |
|---|---|---|---|---|---|---|
| API-based Models | | | | | | |
| GPT-4V | 33.3 / 53.6 | 28.4 / 50.5 | 21.3 / 53.2 | 25.3 / 54.0 | 50.0 / 69.2 | 31.2 / 55.1 |
| Gemini | 20.2 / 48.8 | 27.4 / 50.0 | 21.3 / 54.3 | 22.7 / 51.3 | 30.8 / 54.8 | 24.4 / 51.3 |
| Open-source ~13B Models | | | | | | |
| LLaVA-1.5-13B | 6.0 / 41.1 | 3.2 / 43.2 | 12.8 / 46.8 | 13.3 / 50.0 | 11.5 / 53.8 | 8.5 / 46.2 |
| BLIP-2-11B | 6.0 / 34.5 | 10.5 / 44.2 | 4.3 / 30.9 | 6.7 / 40.7 | 11.5 / 47.1 | 8.0 / 39.8 |
| InstructBLIP-13B | 6.0 / 39.9 | 2.1 / 41.1 | 6.4 / 46.8 | 4.0 / 44.7 | 5.8 / 48.1 | 4.5 / 43.3 |
| Open-source ~7B Models | | | | | | |
| mPLUG-Owl-2-7B | 13.1 / 39.9 | 9.5 / 40.0 | 4.3 / 41.5 | 9.3 / 42.7 | 11.5 / 48.1 | 9.9 / 41.9 |
| MiniGPT4-7B | 10.7 / 34.5 | 4.2 / 32.1 | 0.0 / 27.7 | 8.0 / 32.7 | 9.6 / 43.3 | 6.8 / 33.9 |
| LLaVA-1.5-7B | 8.3 / 37.5 | 1.1 / 36.3 | 2.1 / 40.4 | 9.3 / 37.3 | 7.7 / 48.1 | 5.7 / 39.1 |
| InstructBLIP-7B | 1.2 / 34.5 | 5.3 / 43.7 | 0.0 / 45.7 | 4.0 / 44.0 | 11.5 / 47.1 | 4.2 / 42.4 |
| Otter-7B | 2.4 / 31.5 | 3.2 / 35.3 | 0.0 / 23.4 | 1.3 / 27.3 | 5.8 / 34.6 | 2.5 / 31.0 |
| LLaVA-7B | 2.4 / 29.8 | 4.2 / 33.7 | 0.0 / 17.0 | 2.7 / 33.3 | 1.9 / 37.5 | 2.5 / 31.0 |
| Qwen-VL-Chat | 4.8 / 23.8 | 3.2 / 23.7 | 0.0 / 23.4 | 1.3 / 32.0 | 7.7 / 33.7 | 3.4 / 26.9 |
| OpenFlamingo-7B | 2.4 / 40.5 | 2.1 / 38.9 | 0.0 / 27.7 | 5.3 / 36.0 | 5.8 / 47.1 | 3.1 / 38.4 |
| BLIP-2-6.7B | 0.0 / 42.3 | 1.1 / 43.2 | 4.3 / 48.9 | 4.0 / 46.7 | 5.8 / 51.0 | 2.5 / 45.6 |

Overall results of different models on CODIS under human evaluation (Leaderboard 1) and GPT-4 evaluation (Leaderboard 2). In each cell, the first number is $ \mathrm{Acc}_p $ and the second is $ \mathrm{Acc}_q $.

Examples

Citation


      @article{luo2024codis,
        title={CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models},
        author={Fuwen Luo and Chi Chen and Zihao Wan and Zhaolu Kang and Qidong Yan and Yingjie Li and Xiaolong Wang and Siyu Wang and Ziyue Wang and Xiaoyue Mi and Peng Li and Ning Ma and Maosong Sun and Yang Liu},
        journal={arXiv preprint arXiv:2402.13607},
        year={2024}
      }