Brote

Browse and Concentrate:
Comprehending Multimodal Content via prior-LLM Context Fusion

Ziyue Wang^1*, Chi Chen^1*, Yiqi Zhu¹, Fuwen Luo¹,
Peng Li^2†, Ming Yan³, Fei Huang^3†, Maosong Sun¹, Yang Liu^1,2

¹ Department of Computer Science and Technology, Tsinghua University, Beijing, China
² Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
³ Institute of Intelligent Computing, Alibaba Group

^* Equal contribution
^† Corresponding authors

📖 arXiv | Github | Models 🤗

Introduction

With the bloom of Multimodal Large Language Models, the paradigm that extending Large Language Models with pre-trained vision encoders has shown remarkable abilities in visual reasoning and visual instruction-following tasks. However, this paradigm neglects essential crossmodality and inter-image interactions, leading to the LLM being presented with isolate visual and textual features without recognition of interleaved multimodal inputs, to which we refer as prior-LLM modality isolation and it obscures a deeper understanding of multi-image and interleaved inputs.

To mitigate the issue, we propose a novel paradigm named Browse-and-Concentrate (Brote). This paradigm begins with a browsing phase to generate a condition context vector, serving as a collection of browsing insights, encapsulating the main intent and visual information derived from images. Subsequently, a concentrating phase is employed to comprehend multimodal inputs, guided by the condition context vector. Our paradigm exhibits notable advancements, improving the average accuracy on 7 multi-image benchmarks by 2.13% and 7.60% against strong baselines with 3B and 11B LLMs, respectively.

Framework

Our paradigm progressively comprehends images via two phases, browsing and concentrating. In the browsing phase, the MLLM browses the entire input and generates a condition context as the browsing result, denoted as C. Then, in the concentrating phase, the model comprehends multimodal inputs under the guidance of C. We refer to the model of browsing phase as M_B and the model of concentrating phase as M_C.

Moreover, our proposed Brote can be further divided into two modes, explicit and implicit, regarding the distinct approaches of incorporating browsing result C. The explicit version, denoted as Brote-EX, operates with separated parameters (M_B ≠ M_C). This explicit mode first generates C using M_B, followed by M_C to infer the final outcomes. In contrast, for the implicit version, denoted as BroteIM, employs shared parameters for both phases (M_B = M_C), permitting M_C to directly predict the answer without the need to explicitly produce intermediate vectors from the other model.

Trainging Strategies

To encourage further exploration of information from C for VL tasks, we propose a new training strategy named context-dropping training. The strategy intentionally omits particular inputs yet requiring the model to infer for answers solely with the assistant of C. It motivates the model to compensate for the missing information from the provided condition context C. We propose three different dropping strategies:

Drop images: This involves two approaches, removing certain images (Context Dropping (IMG-N)), and replacing original images by blank placeholders (Context Dropping (IMG-B)).
Drop text: We remove the text before the last image (Context Dropping (TXT)).
Drop ALL: A combination of the above settings denoted as ALL, applied with the same probabilities.

Results

We report our results in the following tables:

Model	#Param LLM	In-context Learning		Multi-image / Video Tasks					AVG
Model	#Param LLM	VQA2	A-OKVQA	NLVR2	DEMON	SEED	MSVD QA	MSRVTT QA	AVG
KOSMOS-1	1.3B	51.8	-	-	-	-	-	-	-
InstructBLIP-XL	3B	31.76^*	39.13^*	52.59^*	32.59^*	52.7	43.40^*	12.12^*	37.77
MMICL-XL	3B	69.16	53.43^*	71.48^*	38.14^*	54.69^*	53.68	42.36^*	54.71
Otter	7B	45.39^*	38.42^*	49.54^*	24.51	39.7	25.87^*	9.78^*	-
VPG-C-LLaMA2	7B	-	-	-	37.22	-	-	-	-
Flamingo-9B	7B	56.3	-	-	-	-	30.2	13.7	-
Brote-EX-XL	3B	69.97	56.00	71.41	37.33	57.51	53.02	43.14	55.48
Brote-IM-XL	3B	68.94	56.43	76.02	37.34	57.86	56.06	45.08	56.84
InstructBlip-XXL	11B	48.21^*	45.92^*	64.54^*	33.00^*	50.81^*	44.30^*	15.49^*	43.18
MMICL-XXL	11B	70.56	54.85^*	56.16^*	36.30^*	56.66^*	52.19	39.46^*	52.18
EMU-2	33B	67.0	-	-	-	62.8	49.0	31.4	-
Flamingo-80B	70B	63.1	-	-	-	-	35.6	17.4	-
Brote-EX-XXL	11B	70.86	59.94	70.42	38.70	59.31	54.42	45.24	57.00
Brote-IM-XXL	11B	71.71	60.31	80.71	38.94	61.64	57.29	45.94	59.78

The best results for models larger/smaller than 10B are separately bolded and the seconds are underlined.
VQAv2 and A-OKVQA are conducted under four-shot setting.
SEED refers to SEED-Bench that contains both images and videos.
For video benchmarks, we uniformly extract eight frames from the given video clips for answering the questions.
For "AVG", we first average the MME scores over its subtasks, then calculate the average scores of all benchmarks in this table.

Model	#Param LLM	VQAv2	A-OKVQA	ScienceQA-IMG	MME Perception	MME Cognition	MMBench	AVG
InstructBLIP-XL	3B	36.77	54.57	70.40	1093.70^*	281.43^*	69.68^*	68.52
MMICL-XL	3B	69.13	52.12^*	72.58^*	1184.54^*	277.86^*	73.11^*	75.81
LLaVA	7B	-	-	-	457.82	214.64	36.2	-
Otter	7B	57.89^*	41.92^*	63.10	1292.26	306.43	48.3	69.51
Brote-EX-XL	3B	69.90	52.93	71.15	1203.87	301.79	73.27	77.18
Brote-IM-XL	3B	70.24	53.40	72.58	1181.95	266.79	74.29	75.90
InstructBlip-XXL	11B	63.69	57.10	70.60	1212.82^*	291.79^*	70.34^*	75.99
MMICL-XXL	11B	70.30	51.35^*	74.92^*	1313.88^*	311.79^*	76.58^*	80.41
MMICL-XXL (BLIP2)	11B	69.99	-	-	1381.74	428.93	65.24	-
Brote-EX-XXL	11B	71.58	56.47	77.69	1279.73	310.01	76.67	81.31
Brote-IM-XXL	11B	73.02	57.83	78.38	1284.13	300.00	77.34	81.66

The best results for models larger/smaller than 10B are separately bolded and the seconds are underlined.
VQAv2 and A-OKVQA are conducted under zero-shot setting.
ScienceQA is conducted under zero-shot CoT (ZS-CoT) setting.
For "AVG", we first average the MME scores over its subtasks, then calculate the average scores of all benchmarks in this table.

Citation

📑 If you find our project helpful to your research, please consider citing:

        @article{wang2024browse,
            title={Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion},
            author={Wang, Ziyue and Chen, Chi and Zhu, Yiqi and Luo, Fuwen and Li, Peng and Yan, Ming and Zhang, Ji and Huang, Fei and Sun, Maosong and Liu, Yang},
            journal={arXiv preprint arXiv:2402.12195},
            year={2024}
        }