With the bloom of Multimodal Large Language Models (MLLMs), the paradigm of extending Large Language Models (LLMs) with pre-trained vision encoders has shown remarkable abilities in visual reasoning and visual instruction-following tasks. However, this paradigm neglects essential cross-modality and inter-image interactions: the LLM is presented with isolated visual and textual features and never perceives the interleaved multimodal input as a whole. We refer to this issue as prior-LLM modality isolation, and it obscures a deeper understanding of multi-image and interleaved inputs.
To mitigate this issue, we propose a novel paradigm named Browse-and-Concentrate (Brote). The paradigm begins with a browsing phase that generates a condition context vector, which serves as a collection of browsing insights encapsulating the main intent and visual information derived from the images. A concentrating phase then comprehends the multimodal inputs under the guidance of this condition context vector. Our paradigm delivers notable gains, improving the average accuracy on 7 multi-image benchmarks by 2.13% and 7.60% over strong baselines with 3B and 11B LLMs, respectively.
Our paradigm comprehends images progressively via two phases, browsing and concentrating. In the browsing phase, the MLLM browses the entire input and produces a condition context, denoted as C, as the browsing result. Then, in the concentrating phase, the model comprehends the multimodal inputs under the guidance of C. We refer to the model used in the browsing phase as MB and the model used in the concentrating phase as MC.
Moreover, Brote can be further divided into two modes, explicit and implicit, which differ in how the browsing result C is incorporated. The explicit mode, denoted as Brote-EX, operates with separate parameters (MB ≠ MC): it first generates C with MB, and MC then infers the final answer. In contrast, the implicit mode, denoted as Brote-IM, employs shared parameters for both phases (MB = MC), allowing MC to predict the answer directly without explicitly producing intermediate vectors from another model.
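To make the flow concrete, below is a minimal, self-contained PyTorch sketch of the two phases and of how Brote-EX and Brote-IM differ in parameter sharing. Everything in it is an assumption for illustration: the `ToyMLLM` class, the dimensions, and the way C is injected (prepended as a single soft token) are placeholders, not the released Brote architecture or API.

```python
# A minimal sketch of the browse-and-concentrate flow and of the two Brote
# modes. The ToyMLLM class, the dimensions, and the way C is injected
# (prepended as a single soft token) are illustrative placeholders, not the
# released Brote implementation.
import torch
import torch.nn as nn


class ToyMLLM(nn.Module):
    """Stand-in for one MLLM instance that can both browse and concentrate."""

    def __init__(self, dim=512, vocab=32000):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)    # maps pooled image features
        self.embed = nn.Embedding(vocab, dim)  # text token embeddings
        self.head = nn.Linear(dim, vocab)      # toy output head

    def browse(self, image_feats, text_ids):
        """Browsing phase: condense the whole interleaved input into C."""
        pooled = torch.stack([f.mean(dim=0) for f in image_feats])  # [n_img, dim]
        text_vec = self.embed(text_ids).mean(dim=0)                 # [dim]
        return self.img_proj(pooled.mean(dim=0) + text_vec)         # C: [dim]

    def concentrate(self, image_feats, text_ids, condition=None):
        """Concentrating phase: score tokens, optionally guided by C."""
        img_tok = torch.stack([self.img_proj(f.mean(dim=0)) for f in image_feats])
        txt_tok = self.embed(text_ids)
        pieces = [img_tok, txt_tok]
        if condition is not None:              # prepend C as one extra soft token
            pieces.insert(0, condition.unsqueeze(0))
        hidden = torch.cat(pieces, dim=0)      # [n_tokens, dim]
        return self.head(hidden)               # per-position logits


def run_brote_ex(image_feats, text_ids):
    """Explicit mode: MB != MC, C is produced explicitly by a separate model."""
    m_b, m_c = ToyMLLM(), ToyMLLM()                # two distinct parameter sets
    condition = m_b.browse(image_feats, text_ids)  # browsing phase
    return m_c.concentrate(image_feats, text_ids, condition)  # concentrating phase


def run_brote_im(image_feats, text_ids):
    """Implicit mode: MB == MC, a single shared model answers directly."""
    model = ToyMLLM()
    return model.concentrate(image_feats, text_ids)  # no explicit C from another model


if __name__ == "__main__":
    imgs = [torch.randn(16, 512), torch.randn(16, 512)]  # two images, 16 patches each
    query = torch.randint(0, 32000, (8,))                # eight text tokens
    print(run_brote_ex(imgs, query).argmax(-1).shape)    # torch.Size([11])
    print(run_brote_im(imgs, query).argmax(-1).shape)    # torch.Size([10])
```

The only load-bearing difference between the two helper functions is whether browsing and concentrating run on two separately parameterized models or on one shared model.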
To encourage the model to further exploit the information in C for VL tasks, we propose a new training strategy named context-dropping training. The strategy intentionally omits particular inputs while still requiring the model to infer the answer solely with the assistance of C, which motivates the model to compensate for the missing information using the provided condition context. We design three different dropping strategies; please refer to our paper for details.
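The sketch below shows one possible context-dropping training step under the same toy interface as above: C is produced from the complete input, some images are then withheld from the concentrating pass, and the loss pushes the model to recover the dropped information from C. The function name `context_dropping_step`, the random image-dropping rule, the `drop_prob` parameter, and the label handling are all illustrative assumptions, not the paper's three concrete dropping strategies.

```python
# A schematic context-dropping training step. C comes from the full input,
# part of the input is then withheld from the concentrating pass, and the loss
# pushes the model to recover the dropped information from C. The random
# image-dropping rule and the label handling are illustrative assumptions; the
# paper defines the three concrete dropping strategies.
import random

import torch.nn.functional as F


def context_dropping_step(m_b, m_c, image_feats, text_ids, labels, optimizer,
                          drop_prob=0.5):
    """One toy step; m_b / m_c follow the ToyMLLM interface sketched above."""
    # 1) Browsing: the complete interleaved input is condensed into C.
    condition = m_b.browse(image_feats, text_ids)

    # 2) Dropping: intentionally withhold some images from the concentrating input.
    kept = [f for f in image_feats if random.random() > drop_prob]
    if not kept:
        kept = image_feats[:1]          # keep at least one image

    # 3) Concentrating: answer from the reduced input plus C, so the model must
    #    compensate for the missing images using C alone.
    logits = m_c.concentrate(kept, text_ids, condition=condition)
    targets = labels[: logits.size(0)]  # toy per-position labels
    loss = F.cross_entropy(logits, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this toy setup, `labels` holds one target token id per fused input position; for the implicit mode, `m_b` and `m_c` would simply be the same object, so the gradients from the dropped-input loss also shape how C is produced.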
We report our results in the following tables:

In-context learning and multi-image / video tasks:

Model | #Param LLM | VQA2 | A-OKVQA | NLVR2 | DEMON | SEED | MSVD QA | MSRVTT QA | AVG
---|---|---|---|---|---|---|---|---|---
KOSMOS-1 | 1.3B | 51.8 | - | - | - | - | - | - | - |
InstructBLIP-XL | 3B | 31.76* | 39.13* | 52.59* | 32.59* | 52.7 | 43.40* | 12.12* | 37.77 |
MMICL-XL | 3B | 69.16 | 53.43* | 71.48* | 38.14* | 54.69* | 53.68 | 42.36* | 54.71 |
Otter | 7B | 45.39* | 38.42* | 49.54* | 24.51 | 39.7 | 25.87* | 9.78* | - |
VPG-C-LLaMA2 | 7B | - | - | - | 37.22 | - | - | - | - |
Flamingo-9B | 7B | 56.3 | - | - | - | - | 30.2 | 13.7 | - |
Brote-EX-XL | 3B | 69.97 | 56.00 | 71.41 | 37.33 | 57.51 | 53.02 | 43.14 | 55.48 |
Brote-IM-XL | 3B | 68.94 | 56.43 | 76.02 | 37.34 | 57.86 | 56.06 | 45.08 | 56.84 |
InstructBLIP-XXL | 11B | 48.21* | 45.92* | 64.54* | 33.00* | 50.81* | 44.30* | 15.49* | 43.18 |
MMICL-XXL | 11B | 70.56 | 54.85* | 56.16* | 36.30* | 56.66* | 52.19 | 39.46* | 52.18 |
EMU-2 | 33B | 67.0 | - | - | - | 62.8 | 49.0 | 31.4 | - |
Flamingo-80B | 70B | 63.1 | - | - | - | - | 35.6 | 17.4 | - |
Brote-EX-XXL | 11B | 70.86 | 59.94 | 70.42 | 38.70 | 59.31 | 54.42 | 45.24 | 57.00 |
Brote-IM-XXL | 11B | 71.71 | 60.31 | 80.71 | 38.94 | 61.64 | 57.29 | 45.94 | 59.78 |

General VL benchmarks:

Model | #Param LLM | VQAv2 | A-OKVQA | ScienceQA-IMG | MME Perception | MME Cognition | MMBench | AVG
---|---|---|---|---|---|---|---|---
InstructBLIP-XL | 3B | 36.77 | 54.57 | 70.40 | 1093.70* | 281.43* | 69.68* | 68.52 |
MMICL-XL | 3B | 69.13 | 52.12* | 72.58* | 1184.54* | 277.86* | 73.11* | 75.81 |
LLaVA | 7B | - | - | - | 457.82 | 214.64 | 36.2 | - |
Otter | 7B | 57.89* | 41.92* | 63.10 | 1292.26 | 306.43 | 48.3 | 69.51 |
Brote-EX-XL | 3B | 69.90 | 52.93 | 71.15 | 1203.87 | 301.79 | 73.27 | 77.18 |
Brote-IM-XL | 3B | 70.24 | 53.40 | 72.58 | 1181.95 | 266.79 | 74.29 | 75.90 |
InstructBLIP-XXL | 11B | 63.69 | 57.10 | 70.60 | 1212.82* | 291.79* | 70.34* | 75.99
MMICL-XXL | 11B | 70.30 | 51.35* | 74.92* | 1313.88* | 311.79* | 76.58* | 80.41 |
MMICL-XXL (BLIP2) | 11B | 69.99 | - | - | 1381.74 | 428.93 | 65.24 | - |
Brote-EX-XXL | 11B | 71.58 | 56.47 | 77.69 | 1279.73 | 310.01 | 76.67 | 81.31 |
Brote-IM-XXL | 11B | 73.02 | 57.83 | 78.38 | 1284.13 | 300.00 | 77.34 | 81.66 |
📑 If you find our project helpful to your research, please consider citing:
    @article{wang2024browse,
      title={Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion},
      author={Wang, Ziyue and Chen, Chi and Zhu, Yiqi and Luo, Fuwen and Li, Peng and Yan, Ming and Zhang, Ji and Huang, Fei and Sun, Maosong and Liu, Yang},
      journal={arXiv preprint arXiv:2402.12195},
      year={2024}
    }