Brote

Browse and Concentrate:
Comprehending Multimodal Content via prior-LLM Context Fusion


Ziyue Wang1*, Chi Chen1*, Yiqi Zhu1, Fuwen Luo1,
Peng Li2†, Ming Yan3, Fei Huang3†, Maosong Sun1, Yang Liu1,2


1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
3 Institute of Intelligent Computing, Alibaba Group


* Equal contribution
† Corresponding authors


📖 arXiv | Github | Models 🤗



Introduction

With the bloom of Multimodal Large Language Models (MLLMs), the paradigm of extending Large Language Models (LLMs) with pre-trained vision encoders has shown remarkable abilities on visual reasoning and visual instruction-following tasks. However, this paradigm neglects essential cross-modality and inter-image interactions: the LLM is presented with isolated visual and textual features and never perceives the interleaved multimodal input as a whole. We refer to this issue as prior-LLM modality isolation; it obscures a deeper understanding of multi-image and interleaved inputs.

To mitigate this issue, we propose a novel paradigm named Browse-and-Concentrate (Brote). It begins with a browsing phase that generates a condition context vector, a collection of browsing insights encapsulating the main intent and the visual information derived from the images. A concentrating phase then comprehends the multimodal inputs under the guidance of this condition context vector. Our paradigm exhibits notable advancements, improving the average accuracy on 7 multi-image benchmarks by 2.13% and 7.60% over strong baselines with 3B and 11B LLMs, respectively.



Framework

Our paradigm comprehends images progressively via two phases, browsing and concentrating. In the browsing phase, the MLLM browses the entire input and generates a condition context, denoted as C, as the browsing result. In the concentrating phase, the model then comprehends the multimodal inputs under the guidance of C. We refer to the model of the browsing phase as M_B and the model of the concentrating phase as M_C.
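To make the two phases concrete, the following is a minimal, self-contained PyTorch sketch. The module names (ToyBrowser, ToyConcentrator), the mean-pooling summary, and the 512-dimensional features are illustrative assumptions, not the released Brote implementation.

```python
import torch
import torch.nn as nn

class ToyBrowser(nn.Module):
    """Stand-in for M_B: summarizes the interleaved input into a context C."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, features):             # features: (batch, seq, dim)
        pooled = features.mean(dim=1)         # crude summary of the whole input
        return self.proj(pooled)              # condition context C: (batch, dim)

class ToyConcentrator(nn.Module):
    """Stand-in for M_C: re-reads the input under the guidance of C."""
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, features, context):
        ctx = context.unsqueeze(1).expand_as(features)
        return self.fuse(torch.cat([features, ctx], dim=-1))

features = torch.randn(2, 16, 512)            # fake interleaved multimodal features
C = ToyBrowser()(features)                    # browsing phase
out = ToyConcentrator()(features, C)          # concentrating phase, guided by C
print(out.shape)                              # torch.Size([2, 16, 512])
```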

Moreover, our proposed Brote can be further divided into two modes, explicit and implicit, according to how the browsing result C is incorporated. The explicit mode, denoted as Brote-EX, operates with separate parameters (M_B ≠ M_C): it first generates C with M_B and then uses M_C to infer the final answer. In contrast, the implicit mode, denoted as Brote-IM, shares parameters between the two phases (M_B = M_C), allowing the model to predict the answer directly without explicitly producing intermediate vectors from a separate model.
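The two modes can then be wired as follows, reusing the toy modules from the sketch above. In the real Brote-IM, M_B and M_C literally share parameters; the toy class below only mirrors the single-model, single-call interface, so treat it as an illustration of the control flow rather than of the weight sharing itself.

```python
# Brote-EX: separate parameter sets (M_B != M_C); C is produced explicitly
# by one model and then handed to a distinct concentrating model.
browser, concentrator = ToyBrowser(), ToyConcentrator()

def brote_ex_toy(features):
    C = browser(features)                     # explicit intermediate context
    return concentrator(features, C)

# Brote-IM: both phases live behind one model and one forward call, so no
# separate model has to emit C before the answer is predicted.
class BroteIMToy(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.browse = ToyBrowser(dim)
        self.concentrate = ToyConcentrator(dim)

    def forward(self, features):
        return self.concentrate(features, self.browse(features))

print(brote_ex_toy(features).shape, BroteIMToy()(features).shape)
```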



Training Strategies

To encourage further exploitation of the information in C for VL tasks, we propose a new training strategy named context-dropping training. The strategy intentionally omits particular inputs yet requires the model to infer the answer solely with the assistance of C, which motivates the model to compensate for the missing information using the provided condition context. We propose three dropping strategies (a toy sketch follows the list):

  1. Drop images: This involves two approaches, removing certain images entirely (Context Dropping (IMG-N)) and replacing the original images with blank placeholders (Context Dropping (IMG-B)).
  2. Drop text: We remove the text before the last image (Context Dropping (TXT)).
  3. Drop ALL: A combination of the above settings, denoted as ALL, applied with equal probabilities.
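Below is a rough sketch of how such context dropping could be applied to a single training example. The mode names follow the list above, but the 50% drop probability, the `<image>` placeholder token, and the blank-image construction are assumptions made for illustration, not the paper's exact recipe.

```python
import random
from PIL import Image

def context_drop(images, text, mode="ALL"):
    """Drop part of the input so the model must fall back on the condition context C."""
    if mode == "ALL":                          # sample one of the strategies
        mode = random.choice(["IMG-N", "IMG-B", "TXT"])

    if mode == "IMG-N":                        # remove some images entirely
        kept = [img for img in images if random.random() > 0.5]
        return kept or images[:1], text        # keep at least one image
    if mode == "IMG-B":                        # replace images with blank placeholders
        blank = Image.new("RGB", images[0].size, (255, 255, 255))
        return [blank if random.random() < 0.5 else img for img in images], text
    if mode == "TXT":                          # drop the text before the last image
        _, sep, tail = text.rpartition("<image>")
        return images, (sep + tail) if sep else text
    return images, text
```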



Results

We report our results in the following tables:

In-context learning and multi-image / video tasks:

| Model | #Param (LLM) | VQAv2 | A-OKVQA | NLVR2 | DEMON | SEED | MSVD QA | MSRVTT QA | AVG |
|---|---|---|---|---|---|---|---|---|---|
| KOSMOS-1 | 1.3B | 51.8 | - | - | - | - | - | - | - |
| InstructBLIP-XL | 3B | 31.76* | 39.13* | 52.59* | 32.59* | 52.7 | 43.40* | 12.12* | 37.77 |
| MMICL-XL | 3B | 69.16 | 53.43* | 71.48* | 38.14* | 54.69* | 53.68 | 42.36* | 54.71 |
| Otter | 7B | 45.39* | 38.42* | 49.54* | 24.51 | 39.7 | 25.87* | 9.78* | - |
| VPG-C-LLaMA2 | 7B | - | - | - | 37.22 | - | - | - | - |
| Flamingo-9B | 7B | 56.3 | - | - | - | - | 30.2 | 13.7 | - |
| Brote-EX-XL | 3B | 69.97 | 56.00 | 71.41 | 37.33 | 57.51 | 53.02 | 43.14 | 55.48 |
| Brote-IM-XL | 3B | 68.94 | 56.43 | 76.02 | 37.34 | 57.86 | 56.06 | 45.08 | 56.84 |
| InstructBLIP-XXL | 11B | 48.21* | 45.92* | 64.54* | 33.00* | 50.81* | 44.30* | 15.49* | 43.18 |
| MMICL-XXL | 11B | 70.56 | 54.85* | 56.16* | 36.30* | 56.66* | 52.19 | 39.46* | 52.18 |
| EMU-2 | 33B | 67.0 | - | - | - | 62.8 | 49.0 | 31.4 | - |
| Flamingo-80B | 70B | 63.1 | - | - | - | - | 35.6 | 17.4 | - |
| Brote-EX-XXL | 11B | 70.86 | 59.94 | 70.42 | 38.70 | 59.31 | 54.42 | 45.24 | 57.00 |
| Brote-IM-XXL | 11B | 71.71 | 60.31 | 80.71 | 38.94 | 61.64 | 57.29 | 45.94 | 59.78 |


| Model | #Param (LLM) | VQAv2 | A-OKVQA | ScienceQA-IMG | MME Perception | MME Cognition | MMBench | AVG |
|---|---|---|---|---|---|---|---|---|
| InstructBLIP-XL | 3B | 36.77 | 54.57 | 70.40 | 1093.70* | 281.43* | 69.68* | 68.52 |
| MMICL-XL | 3B | 69.13 | 52.12* | 72.58* | 1184.54* | 277.86* | 73.11* | 75.81 |
| LLaVA | 7B | - | - | - | 457.82 | 214.64 | 36.2 | - |
| Otter | 7B | 57.89* | 41.92* | 63.10 | 1292.26 | 306.43 | 48.3 | 69.51 |
| Brote-EX-XL | 3B | 69.90 | 52.93 | 71.15 | 1203.87 | 301.79 | 73.27 | 77.18 |
| Brote-IM-XL | 3B | 70.24 | 53.40 | 72.58 | 1181.95 | 266.79 | 74.29 | 75.90 |
| InstructBLIP-XXL | 11B | 63.69 | 57.10 | 70.60 | 1212.82* | 291.79* | 70.34* | 75.99 |
| MMICL-XXL | 11B | 70.30 | 51.35* | 74.92* | 1313.88* | 311.79* | 76.58* | 80.41 |
| MMICL-XXL (BLIP2) | 11B | 69.99 | - | - | 1381.74 | 428.93 | 65.24 | - |
| Brote-EX-XXL | 11B | 71.58 | 56.47 | 77.69 | 1279.73 | 310.01 | 76.67 | 81.31 |
| Brote-IM-XXL | 11B | 73.02 | 57.83 | 78.38 | 1284.13 | 300.00 | 77.34 | 81.66 |



Citation

📑 If you find our project helpful to your research, please consider citing:

        @article{wang2024browse,
            title={Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion},
            author={Wang, Ziyue and Chen, Chi and Zhu, Yiqi and Luo, Fuwen and Li, Peng and Yan, Ming and Zhang, Ji and Huang, Fei and Sun, Maosong and Liu, Yang},
            journal={arXiv preprint arXiv:2402.12195},
            year={2024}
        }