We evaluate 14 popular multimodal large language models (MLLMs), divided into three groups: (1) API-based models; (2) open-source ~7B models; and (3) open-source ~13B models. "Human" denotes the average performance of five independent human annotators. The leaderboard reports $ \mathrm{Acc}_p $ and $ \mathrm{Acc}_q $ under both human evaluation and GPT-4 evaluation.
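The two accuracy metrics are not defined in this excerpt. As a rough illustration only, the sketch below assumes that $ \mathrm{Acc}_q $ is query-level accuracy (the fraction of individual questions answered correctly) and that $ \mathrm{Acc}_p $ is pair-level accuracy (a pair of context-dependent queries counts only if both answers are correct); the function name `compute_acc`, the `records` format, and the pairing scheme are illustrative assumptions, not the benchmark's actual evaluation code.

```python
# Hypothetical sketch of how Acc_q and Acc_p could be computed, assuming
# Acc_q = per-query accuracy and Acc_p = pair-wise accuracy (a pair is
# counted as correct only if both of its queries are answered correctly).
from collections import defaultdict

def compute_acc(records):
    """records: list of (pair_id, is_correct) tuples, one entry per query."""
    total_queries = len(records)
    correct_queries = sum(ok for _, ok in records)

    # Group per-query correctness by the pair the query belongs to.
    pairs = defaultdict(list)
    for pair_id, ok in records:
        pairs[pair_id].append(ok)

    total_pairs = len(pairs)
    correct_pairs = sum(all(oks) for oks in pairs.values())

    acc_q = correct_queries / total_queries
    acc_p = correct_pairs / total_pairs
    return acc_p, acc_q

# Toy example: two pairs, three of four queries answered correctly.
records = [("p1", True), ("p1", True), ("p2", True), ("p2", False)]
print(compute_acc(records))  # -> (0.5, 0.75)
```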
| Model | Loc & Ori $ \mathrm{Acc}_p $ | Loc & Ori $ \mathrm{Acc}_q $ | Temporal $ \mathrm{Acc}_p $ | Temporal $ \mathrm{Acc}_q $ | Cultural $ \mathrm{Acc}_p $ | Cultural $ \mathrm{Acc}_q $ | Attributes $ \mathrm{Acc}_p $ | Attributes $ \mathrm{Acc}_q $ | Relationships $ \mathrm{Acc}_p $ | Relationships $ \mathrm{Acc}_q $ | Overall $ \mathrm{Acc}_p $ | Overall $ \mathrm{Acc}_q $ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | 85.2 | 86.1 | 90.9 | 92.8 | 72.8 | 76.4 | 87.2 | 88.4 | 89.6 | 90.0 | 86.2 | 87.7 |
| GPT-4V | 33.3 | 54.2 | 28.4 | 52.1 | 25.5 | 60.6 | 26.7 | 54.7 | 51.9 | 70.2 | 32.3 | 56.9 |
| Gemini | 21.4 | 49.4 | 29.5 | 51.1 | 21.3 | 56.4 | 24.0 | 52.0 | 34.6 | 58.7 | 26.1 | 52.7 |
| LLaVA-1.5-13B | 6.0 | 41.1 | 4.2 | 44.7 | 10.6 | 50.0 | 14.7 | 51.3 | 13.5 | 54.8 | 9.1 | 47.5 |
| BLIP-2-11B | 6.0 | 32.7 | 8.4 | 45.8 | 4.3 | 35.1 | 6.7 | 42.0 | 11.5 | 51.9 | 7.4 | 41.4 |
| InstructBLIP-13B | 6.0 | 39.3 | 2.1 | 41.6 | 4.3 | 50.0 | 4.0 | 44.7 | 7.7 | 51.0 | 4.5 | 44.2 |
| mPLUG-Owl-2-7B | 13.1 | 42.3 | 9.5 | 41.6 | 6.4 | 42.6 | 12.0 | 44.7 | 19.2 | 51.9 | 11.9 | 44.1 |
| MiniGPT4-7B | 10.7 | 36.3 | 3.2 | 34.2 | 0.0 | 27.7 | 12.0 | 35.3 | 13.5 | 47.1 | 7.9 | 36.0 |
| LLaVA-1.5-7B | 11.9 | 42.9 | 5.3 | 44.7 | 4.3 | 43.6 | 9.3 | 39.3 | 7.7 | 47.1 | 7.9 | 43.3 |
| InstructBLIP-7B | 1.2 | 33.3 | 7.4 | 45.8 | 0.0 | 46.8 | 4.0 | 43.3 | 11.5 | 48.1 | 4.8 | 42.8 |
| Otter-7B | 2.4 | 32.7 | 5.3 | 41.1 | 4.3 | 28.7 | 0.0 | 26.0 | 5.8 | 40.4 | 3.4 | 34.1 |
| LLaVA-7B | 2.4 | 30.4 | 6.3 | 34.2 | 0.0 | 25.5 | 1.3 | 34.0 | 5.8 | 41.3 | 3.4 | 33.1 |
| Qwen-VL-Chat | 3.6 | 23.8 | 3.2 | 24.7 | 0.0 | 24.5 | 1.3 | 32.0 | 9.6 | 34.6 | 3.4 | 27.5 |
| OpenFlamingo-7B | 2.4 | 40.5 | 2.1 | 38.9 | 0.0 | 27.7 | 5.3 | 36.0 | 5.8 | 47.1 | 3.1 | 38.4 |
| BLIP-2-6.7B | 0.0 | 41.1 | 1.1 | 44.7 | 2.1 | 48.9 | 2.7 | 46.0 | 7.7 | 53.8 | 2.3 | 46.0 |
| Model | Loc & Ori $ \mathrm{Acc}_p $ | Loc & Ori $ \mathrm{Acc}_q $ | Temporal $ \mathrm{Acc}_p $ | Temporal $ \mathrm{Acc}_q $ | Cultural $ \mathrm{Acc}_p $ | Cultural $ \mathrm{Acc}_q $ | Attributes $ \mathrm{Acc}_p $ | Attributes $ \mathrm{Acc}_q $ | Relationships $ \mathrm{Acc}_p $ | Relationships $ \mathrm{Acc}_q $ | Overall $ \mathrm{Acc}_p $ | Overall $ \mathrm{Acc}_q $ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 33.3 | 53.6 | 28.4 | 50.5 | 21.3 | 53.2 | 25.3 | 54.0 | 50.0 | 69.2 | 31.2 | 55.1 |
| Gemini | 20.2 | 48.8 | 27.4 | 50.0 | 21.3 | 54.3 | 22.7 | 51.3 | 30.8 | 54.8 | 24.4 | 51.3 |
| LLaVA-1.5-13B | 6.0 | 41.1 | 3.2 | 43.2 | 12.8 | 46.8 | 13.3 | 50.0 | 11.5 | 53.8 | 8.5 | 46.2 |
| BLIP-2-11B | 6.0 | 34.5 | 10.5 | 44.2 | 4.3 | 30.9 | 6.7 | 40.7 | 11.5 | 47.1 | 8.0 | 39.8 |
| InstructBLIP-13B | 6.0 | 39.9 | 2.1 | 41.1 | 6.4 | 46.8 | 4.0 | 44.7 | 5.8 | 48.1 | 4.5 | 43.3 |
| mPLUG-Owl-2-7B | 13.1 | 39.9 | 9.5 | 40.0 | 4.3 | 41.5 | 9.3 | 42.7 | 11.5 | 48.1 | 9.9 | 41.9 |
| MiniGPT4-7B | 10.7 | 34.5 | 4.2 | 32.1 | 0.0 | 27.7 | 8.0 | 32.7 | 9.6 | 43.3 | 6.8 | 33.9 |
| LLaVA-1.5-7B | 8.3 | 37.5 | 1.1 | 36.3 | 2.1 | 40.4 | 9.3 | 37.3 | 7.7 | 48.1 | 5.7 | 39.1 |
| InstructBLIP-7B | 1.2 | 34.5 | 5.3 | 43.7 | 0.0 | 45.7 | 4.0 | 44.0 | 11.5 | 47.1 | 4.2 | 42.4 |
| Otter-7B | 2.4 | 31.5 | 3.2 | 35.3 | 0.0 | 23.4 | 1.3 | 27.3 | 5.8 | 34.6 | 2.5 | 31.0 |
| LLaVA-7B | 2.4 | 29.8 | 4.2 | 33.7 | 0.0 | 17.0 | 2.7 | 33.3 | 1.9 | 37.5 | 2.5 | 31.0 |
| Qwen-VL-Chat | 4.8 | 23.8 | 3.2 | 23.7 | 0.0 | 23.4 | 1.3 | 32.0 | 7.7 | 33.7 | 3.4 | 26.9 |
| OpenFlamingo-7B | 2.4 | 40.5 | 2.1 | 38.9 | 0.0 | 27.7 | 5.3 | 36.0 | 5.8 | 47.1 | 3.1 | 38.4 |
| BLIP-2-6.7B | 0.0 | 42.3 | 1.1 | 43.2 | 4.3 | 48.9 | 4.0 | 46.7 | 5.8 | 51.0 | 2.5 | 45.6 |
Overall results of different models on CODIS, under human evaluation (upper table) and GPT-4 evaluation (lower table). The best-performing model in each category is in bold, and the second best is underlined.