Awesome-Multimodal-Large-Language-Models

Our MLLM works

A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper | Citation | WeChat (MLLM WeChat group, welcome to join)

The first comprehensive survey of Multimodal Large Language Models (MLLMs).


VITA: Towards Open-Source Interactive Omni Multimodal LLM

We are excited to introduce VITA-1.5, a more powerful and more real-time version.

All code for VITA-1.5 has been released!

You can experience our Basic Demo on ModelScope directly. The Real-Time Interactive Demo needs to be configured according to the instructions.


Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Processes more than 4K frames, or over 1M visual tokens. State-of-the-art on Video-MME among models under 20B parameters!


MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Aligns MLLMs with human preferences, providing a high-quality dataset, a strong reward model, a new alignment algorithm, and two new benchmarks.


MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Jointly introduced by the MME, MMBench, and LLaVA teams.


Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Project Page | Paper | GitHub | Dataset | Leaderboard

We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in video analysis!

It includes short (< 2 min), medium (4-15 min), and long (30-60 min) videos, with durations ranging from 11 seconds to 1 hour. All data are newly collected and annotated by humans rather than drawn from any existing video dataset.
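If you want to group your own videos by these duration categories, a minimal illustrative sketch is below; the function name and bucket labels are our own and are not part of the official Video-MME toolkit.

```python
def video_mme_bucket(duration_seconds: float) -> str:
    """Map a video duration to the Video-MME category it falls into.

    Buckets follow the ranges stated above: short < 2 min, medium 4-15 min,
    long 30-60 min. Durations outside these ranges return "uncategorized".
    """
    minutes = duration_seconds / 60.0
    if minutes < 2:
        return "short"
    if 4 <= minutes <= 15:
        return "medium"
    if 30 <= minutes <= 60:
        return "long"
    return "uncategorized"


# Example: an 11-second clip is "short"; a 45-minute video is "long".
assert video_mme_bucket(11) == "short"
assert video_mme_bucket(45 * 60) == "long"
```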


MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Paper | Download | Eval Tool | Citation

A representative evaluation benchmark for MLLMs.


Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | GitHub

This is the first work to correct hallucinations in multimodal large language models.


Table of Contents

- Awesome Papers
  - Multimodal Instruction Tuning
  - Multimodal Hallucination
  - Multimodal In-Context Learning
  - Multimodal Chain-of-Thought
  - LLM-Aided Visual Reasoning
  - Foundation Models
  - Evaluation
  - Multimodal RLHF
  - Others
- Awesome Datasets
  - Datasets of Pre-Training for Alignment
  - Datasets of Multimodal Instruction Tuning
  - Datasets of In-Context Learning
  - Datasets of Multimodal Chain-of-Thought

Awesome Papers

Multimodal Instruction Tuning

Title Venue Date Code Demo
Star
Step3: Cost-Effective Multimodal Intelligence
StepFun 2025-08-06 Github Demo
Star
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
arXiv 2025-08-06 Github Demo
Star
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World
arXiv 2025-08-06 Github -
Qwen VLo: From "Understanding" the World to "Depicting" It Qwen 2025-08-06 - Demo
Star
MMSearch-R1: Incentivizing LMMs to Search
arXiv 2025-08-06 Github -
Star
Show-o2: Improved Native Unified Multimodal Models
arXiv 2025-08-06 Github -
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Google 2025-08-06 - -
Star
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
arXiv 2025-08-06 Github -
Star
MiMo-VL Technical Report
arXiv 2025-08-06 Github -
Star
OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
arXiv 2025-08-06 Github -
Star
Emerging Properties in Unified Multimodal Pretraining
arXiv 2025-08-06 Github Demo
Star
MMaDA: Multimodal Large Diffusion Language Models
arXiv 2025-08-06 Github Demo
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation arXiv 2025-08-06 - -
Star
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
arXiv 2025-08-06 Github Local Demo
Seed1.5-VL Technical Report arXiv 2025-08-06 - -
Star
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
arXiv 2025-08-06 Github -
Star
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
arXiv 2025-08-06 Github Local Demo
Star
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
arXiv 2025-08-06 Github -
Star
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
arXiv 2025-08-06 Github -
Star
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
arXiv 2025-08-06 Github -
Star
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025-08-06 Github Demo
Introducing GPT-4.1 in the API OpenAI 2025-08-06 - -
Star
Kimi-VL Technical Report
arXiv 2025-08-06 Github Demo
The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation Meta 2025-08-06 Hugging Face -
Star
Qwen2.5-Omni Technical Report
Qwen 2025-08-06 Github Demo
Addendum to GPT-4o System Card: Native image generation OpenAI 2025-08-06 - -
Star
Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
arXiv 2025-08-06 Github -
Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision arXiv 2025-08-06 - -
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs arXiv 2025-08-06 Hugging Face Demo
Star
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
arXiv 2025-08-06 Github -
Star
Qwen2.5-VL Technical Report
arXiv 2025-08-06 Github Demo
Star
Baichuan-Omni-1.5 Technical Report
Tech Report 2025-08-06 Github Local Demo
Star
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
arXiv 2025-08-06 Github -
Star
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
arXiv 2025-08-06 Github -
Star
QVQ: To See the World with Wisdom
Qwen 2025-08-06 Github Demo
Star
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
arXiv 2025-08-06 Github -
Apollo: An Exploration of Video Understanding in Large Multimodal Models arXiv 2025-08-06 - -
Star
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
arXiv 2025-08-06 Github Local Demo
StreamChat: Chatting with Streaming Video arXiv 2025-08-06 Coming soon -
CompCap: Improving Multimodal Large Language Models with Composite Captions arXiv 2025-08-06 - -
Star
LinVT: Empower Your Image-level Large Language Model to Understand Videos
arXiv 2025-08-06 Github -
Star
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
arXiv 2025-08-06 Github Demo
Star
NVILA: Efficient Frontier Visual Language Models
arXiv 2025-08-06 Github Demo
Star
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
arXiv 2025-08-06 Github -
Star
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
arXiv 2025-08-06 Github -
Star
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
arXiv 2025-08-06 Github Local Demo
Star
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
arXiv 2025-08-06 Github Demo
Star
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
arXiv 2025-08-06 Github -
Star
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
arXiv 2025-08-06 Github Local Demo
Star
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
CVPR 2025-08-06 Github Demo
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models arXiv 2025-08-06 Huggingface Demo
Star
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
arXiv 2025-08-06 Github Demo
Star
ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding
ICLR 2025-08-06 Github Local Demo
Star
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
arXiv 2025-08-06 Github -
Star
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
arXiv 2025-08-06 Github Demo
Star
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
arXiv 2025-08-06 Github -
Star
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
arXiv 2025-08-06 Github -
Star
VITA: Towards Open-Source Interactive Omni Multimodal LLM
arXiv 2025-08-06 Github -
Star
LLaVA-OneVision: Easy Visual Task Transfer
arXiv 2025-08-06 Github Demo
Star
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
arXiv 2025-08-06 Github Demo
VILA^2: VILA Augmented VILA arXiv 2025-08-06 - -
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models arXiv 2025-08-06 - -
EVLM: An Efficient Vision-Language Model for Visual Understanding arXiv 2025-08-06 - -
Star
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
arXiv 2025-08-06 Github -
Star
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
arXiv 2025-08-06 Github Demo
Star
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
arXiv 2025-08-06 Github Local Demo
Star
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
AAAI 2025-08-06 Github -
Star
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
arXiv 2025-08-06 Github Local Demo
Star
Long Context Transfer from Language to Vision
arXiv 2025-08-06 Github Local Demo
Star
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
ICML 2025-08-06 Github -
Star
TroL: Traversal of Layers for Large Language and Vision Models
EMNLP 2025-08-06 Github Local Demo
Star
Unveiling Encoder-Free Vision-Language Models
arXiv 2025-08-06 Github Local Demo
Star
VideoLLM-online: Online Video Large Language Model for Streaming Video
CVPR 2025-08-06 Github Local Demo
Star
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
CoRL 2025-08-06 Github Demo
Star
Comparison Visual Instruction Tuning
arXiv 2025-08-06 Github Local Demo
Star
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
arXiv 2025-08-06 Github -
Star
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
arXiv 2025-08-06 Github Local Demo
Star
Parrot: Multilingual Visual Instruction Tuning
arXiv 2025-08-06 Github -
Star
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
arXiv 2025-08-06 Github -
Star
Matryoshka Query Transformer for Large Vision-Language Models
arXiv 2025-08-06 Github Demo
Star
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
arXiv 2025-08-06 Github -
Star
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
arXiv 2025-08-06 Github Demo
Star
Libra: Building Decoupled Vision System on Large Language Models
ICML 2025-08-06 Github Local Demo
Star
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
arXiv 2025-08-06 Github Local Demo
Star
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
arXiv 2025-08-06 Github Demo
Star
Graphic Design with Large Multimodal Model
arXiv 2025-08-06 Github -
BRAVE: Broadening the visual encoding of vision-language models ECCV 2025-08-06 - -
Star
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
arXiv 2025-08-06 Github Demo
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs arXiv 2025-08-06 - -
Star
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
CVPR 2025-08-06 Github -
Star
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
NeurIPS 2025-08-06 Github Local Demo
TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model ACM TKDD 2025-08-06 - -
Star
LITA: Language Instructed Temporal-Localization Assistant
arXiv 2025-08-06 Github Local Demo
Star
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
arXiv 2025-08-06 Github Demo
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training arXiv 2025-08-06 - -
Star
MoAI: Mixture of All Intelligence for Large Language and Vision Models
arXiv 2025-08-06 Github Local Demo
Star
DeepSeek-VL: Towards Real-World Vision-Language Understanding
arXiv 2025-08-06 Github Demo
Star
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
arXiv 2025-08-06 Github Demo
Star
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
arXiv 2025-08-06 Github -
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation CVPR 2025-08-06 Coming soon Coming soon
Star
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
arXiv 2025-08-06 Github -
Star
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
arXiv 2025-08-06 Github -
Star
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
arXiv 2025-08-06 Github Demo
Star
CoLLaVO: Crayon Large Language and Vision mOdel
arXiv 2025-08-06 Github -
Star
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
ICML 2025-08-06 Github -
Star
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
arXiv 2025-08-06 Github -
Star
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
arXiv 2025-08-06 Github -
Star
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
NeurIPS 2025-08-06 Github -
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study arXiv 2025-08-06 Coming soon -
Star
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
Blog 2025-08-06 Github Demo
Star
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
arXiv 2025-08-06 Github Demo
Star
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
arXiv 2025-08-06 Github Demo
Star
Yi-VL
- 2025-08-06 Github Local Demo
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities arXiv 2025-08-06 - -
Star
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
ACL 2025-08-06 Github Local Demo
Star
MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices
arXiv 2025-08-06 Github -
Star
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
CVPR 2025-08-06 Github Demo
Star
Osprey: Pixel Understanding with Visual Instruction Tuning
CVPR 2025-08-06 Github Demo
Star
CogAgent: A Visual Language Model for GUI Agents
arXiv 2025-08-06 Github Coming soon
Pixel Aligned Language Models arXiv 2025-08-06 Coming soon -
Star
VILA: On Pre-training for Visual Language Models
CVPR 2025-08-06 Github Local Demo
See, Say, and Segment: Teaching LMMs to Overcome False Premises arXiv 2025-08-06 Coming soon -
Star
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
ECCV 2025-08-06 Github Demo
Star
Honeybee: Locality-enhanced Projector for Multimodal LLM
CVPR 2025-08-06 Github -
Gemini: A Family of Highly Capable Multimodal Models Google 2025-08-06 - -
Star
OneLLM: One Framework to Align All Modalities with Language
arXiv 2025-08-06 Github Demo
Star
Lenna: Language Enhanced Reasoning Detection Assistant
arXiv 2025-08-06 Github -
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding arXiv 2025-08-06 - -
Star
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
arXiv 2025-08-06 Github Local Demo
Star
Making Large Multimodal Models Understand Arbitrary Visual Prompts
CVPR 2025-08-06 Github Demo
Star
Dolphins: Multimodal Language Model for Driving
arXiv 2025-08-06 Github -
Star
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
arXiv 2025-08-06 Github Coming soon
Star
VTimeLLM: Empower LLM to Grasp Video Moments
arXiv 2025-08-06 Github Local Demo
Star
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model
arXiv 2025-08-06 Github -
Star
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
arXiv 2025-08-06 Github Coming soon
Star
LLMGA: Multimodal Large Language Model based Generation Assistant
arXiv 2025-08-06 Github Demo
Star
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
arXiv 2025-08-06 Github -
Star
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
arXiv 2025-08-06 Github Demo
Star
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
arXiv 2025-08-06 Github -
Star
An Embodied Generalist Agent in 3D World
arXiv 2025-08-06 Github Demo
Star
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
arXiv 2025-08-06 Github Demo
Star
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
CVPR 2025-08-06 Github -
Star
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
arXiv 2025-08-06 Github -
Star
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
arXiv 2025-08-06 Github Demo
Star
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
CVPR 2025-08-06 Github Demo
Star
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
arXiv 2025-08-06 Github Demo
Star
NExT-Chat: An LMM for Chat, Detection and Segmentation
arXiv 2025-08-06 Github Local Demo
Star
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
arXiv 2025-08-06 Github Demo
Star
OtterHD: A High-Resolution Multi-modality Model
arXiv 2025-08-06 Github -
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding arXiv 2025-08-06 Coming soon -
Star
GLaMM: Pixel Grounding Large Multimodal Model
CVPR 2025-08-06 Github Demo
Star
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
arXiv 2025-08-06 Github -
Star
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
arXiv 2025-08-06 Github Local Demo
Star
SALMONN: Towards Generic Hearing Abilities for Large Language Models
ICLR 2025-08-06 Github -
Star
Ferret: Refer and Ground Anything Anywhere at Any Granularity
arXiv 2025-08-06 Github -
Star
CogVLM: Visual Expert For Large Language Models
arXiv 2025-08-06 Github Demo
Star
Improved Baselines with Visual Instruction Tuning
arXiv 2025-08-06 Github Demo
Star
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
ICLR 2025-08-06 Github Demo
Star
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
arXiv 2025-08-06 Github -
Star
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
arXiv 2025-08-06 Github Local Demo
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model arXiv 2025-08-06 - -
Star
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
arXiv 2025-08-06 Github Local Demo
Star
DreamLLM: Synergistic Multimodal Comprehension and Creation
ICLR 2025-08-06 Github Coming soon
An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models arXiv 2025-08-06 Coming soon -
Star
TextBind: Multi-turn Interleaved Multimodal Instruction-following
arXiv 2025-08-06 Github Demo
Star
NExT-GPT: Any-to-Any Multimodal LLM
arXiv 2025-08-06 Github Demo
Star
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
arXiv 2025-08-06 Github -
Star
ImageBind-LLM: Multi-modality Instruction Tuning
arXiv 2025-08-06 Github Demo
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning arXiv 2025-08-06 - -
Star
PointLLM: Empowering Large Language Models to Understand Point Clouds
arXiv 2025-08-06 Github Demo
Star
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
arXiv 2025-08-06 Github Local Demo
Star
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
arXiv 2025-08-06 Github -
Star
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
arXiv 2025-08-06 Github Demo
Star
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
arXiv 2025-08-06 Github Demo
Star
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
ICLR 2025-08-06 Github Demo
Star
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
arXiv 2025-08-06 Github -
Star
BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
arXiv 2025-08-06 Github Demo
Star
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
arXiv 2025-08-06 Github -
Star
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
ICLR 2025-08-06 Github Demo
Star
LISA: Reasoning Segmentation via Large Language Model
arXiv 2025-08-06 Github Demo
Star
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
arXiv 2025-08-06 Github Local Demo
Star
3D-LLM: Injecting the 3D World into Large Language Models
arXiv 2025-08-06 Github -
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
arXiv 2025-08-06 - Demo
Star
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
arXiv 2025-08-06 Github Demo
Star
SVIT: Scaling up Visual Instruction Tuning
arXiv 2025-08-06 Github -
Star
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
arXiv 2025-08-06 Github Demo
Star
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
arXiv 2025-08-06 Github -
Star
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
arXiv 2025-08-06 Github Demo
Star
Visual Instruction Tuning with Polite Flamingo
arXiv 2025-08-06 Github Demo
Star
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
arXiv 2025-08-06 Github Demo
Star
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
arXiv 2025-08-06 Github Demo
Star
MotionGPT: Human Motion as a Foreign Language
arXiv 2025-08-06 Github -
Star
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
arXiv 2025-08-06 Github Coming soon
Star
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
arXiv 2025-08-06 Github Demo
Star
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
arXiv 2025-08-06 Github Demo
Star
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
arXiv 2025-08-06 Github Demo
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning arXiv 2025-08-06 - -
Star
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
arXiv 2025-08-06 Github Demo
Star
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
arXiv 2025-08-06 Github -
Star
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
arXiv 2025-08-06 Github Demo
Star
PandaGPT: One Model To Instruction-Follow Them All
arXiv 2025-08-06 Github Demo
Star
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
arXiv 2025-08-06 Github -
Star
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
arXiv 2025-08-06 Github Local Demo
Star
DetGPT: Detect What You Need via Reasoning
arXiv 2025-08-06 Github Demo
Star
Pengi: An Audio Language Model for Audio Tasks
NeurIPS 2025-08-06 Github -
Star
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
arXiv 2025-08-06 Github -
Star
Listen, Think, and Understand
arXiv 2025-08-06 Github Demo
Star
VisualGLM-6B
- 2025-08-06 Github Local Demo
Star
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
arXiv 2025-08-06 Github -
Star
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
arXiv 2025-08-06 Github Local Demo
Star
VideoChat: Chat-Centric Video Understanding
arXiv 2025-08-06 Github Demo
Star
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
arXiv 2025-08-06 Github Demo
Star
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
arXiv 2025-08-06 Github -
Star
LMEye: An Interactive Perception Network for Large Language Models
arXiv 2025-08-06 Github Local Demo
Star
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
arXiv 2025-08-06 Github Demo
Star
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
arXiv 2025-08-06 Github Demo
Star
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
arXiv 2025-08-06 Github -
Star
Visual Instruction Tuning
NeurIPS 2025-08-06 GitHub Demo
Star
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
ICLR 2025-08-06 Github Demo
Star
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
ACL 2025-08-06 Github -

Multimodal Hallucination

Title Venue Date Code Demo
Star
Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
arXiv 2025-08-06 Github -
Star
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
arXiv 2025-08-06 Github -
FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs arXiv 2025-08-06 Link -
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation arXiv 2025-08-06 - -
Star
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
ECCV 2025-08-06 Github -
Star
Evaluating and Analyzing Relationship Hallucinations in LVLMs
ICML 2025-08-06 Github -
Star
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
arXiv 2025-08-06 Github -
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models arXiv 2025-08-06 Coming soon -
Mitigating Object Hallucination via Data Augmented Contrastive Tuning arXiv 2025-08-06 Coming soon -
VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap arXiv 2025-08-06 Coming soon -
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback arXiv 2025-08-06 - -
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding arXiv 2025-08-06 - -
Star
What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models
arXiv 2025-08-06 Github -
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization arXiv 2025-08-06 - -
Star
Debiasing Multimodal Large Language Models
arXiv 2025-08-06 Github -
Star
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
arXiv 2025-08-06 Github -
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding arXiv 2025-08-06 - -
Star
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective
arXiv 2025-08-06 Github -
Star
Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models
arXiv 2025-08-06 Github -
Star
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs
arXiv 2025-08-06 Github -
Star
Unified Hallucination Detection for Multimodal Large Language Models
arXiv 2025-08-06 Github -
A Survey on Hallucination in Large Vision-Language Models arXiv 2025-08-06 - -
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models arXiv 2025-08-06 - -
Star
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
arXiv 2025-08-06 Github -
Star
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations
arXiv 2025-08-06 Github -
Star
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites
arXiv 2025-08-06 Github -
Star
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
arXiv 2025-08-06 Github Demo
Star
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
CVPR 2025-08-06 Github -
Star
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
CVPR 2025-08-06 Github -
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization arXiv 2025-08-06 Github Coming Soon
Mitigating Hallucination in Visual Language Models with Visual Supervision arXiv 2025-08-06 - -
Star
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
arXiv 2025-08-06 Github -
Star
An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
arXiv 2025-08-06 Github -
Star
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models
arXiv 2025-08-06 Github -
Star
Woodpecker: Hallucination Correction for Multimodal Large Language Models
arXiv 2025-08-06 Github Demo
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models arXiv 2025-08-06 - -
Star
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption
arXiv 2025-08-06 Github -
Star
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
ICLR 2025-08-06 Github -
Star
Aligning Large Multimodal Models with Factually Augmented RLHF
arXiv 2025-08-06 Github Demo
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models arXiv 2025-08-06 - -
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning arXiv 2025-08-06 - -
Star
Evaluation and Analysis of Hallucination in Large Vision-Language Models
arXiv 2025-08-06 Github -
Star
VIGC: Visual Instruction Generation and Correction
arXiv 2025-08-06 Github Demo
Detecting and Preventing Hallucinations in Large Vision Language Models arXiv 2025-08-06 - -
Star
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
ICLR 2025-08-06 Github Demo
Star
Evaluating Object Hallucination in Large Vision-Language Models
EMNLP 2025-08-06 Github -

Multimodal In-Context Learning

Title Venue Date Code Demo
Visual In-Context Learning for Large Vision-Language Models arXiv 2025-08-06 - -
Star
RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
RSS 2025-08-06 Github -
Star
Can MLLMs Perform Text-to-Image In-Context Learning?
arXiv 2025-08-06 Github -
Star
Generative Multimodal Models are In-Context Learners
CVPR 2025-08-06 Github Demo
Hijacking Context in Large Multi-modal Models arXiv 2025-08-06 - -
Towards More Unified In-context Visual Understanding arXiv 2025-08-06 - -
Star
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
arXiv 2025-08-06 Github Demo
Star
Link-Context Learning for Multimodal LLMs
arXiv 2025-08-06 Github Demo
Star
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
arXiv 2025-08-06 Github Demo
Star
Med-Flamingo: a Multimodal Medical Few-shot Learner
arXiv 2025-08-06 Github Local Demo
Star
Generative Pretraining in Multimodality
ICLR 2025-08-06 Github Demo
AVIS: Autonomous Visual Information Seeking with Large Language Models arXiv 2025-08-06 - -
Star
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
arXiv 2025-08-06 Github Demo
Star
Exploring Diverse In-Context Configurations for Image Captioning
NeurIPS 2025-08-06 Github -
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
arXiv 2025-08-06 Github Demo
Star
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
arXiv 2025-08-06 Github Demo
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv 2025-08-06 Github Demo
Star
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction
ICCV 2025-08-06 Github -
Star
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering
CVPR 2025-08-06 Github -
Star
Visual Programming: Compositional visual reasoning without training
CVPR 2025-08-06 Github Local Demo
Star
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
AAAI 2025-08-06 Github -
Star
Flamingo: a Visual Language Model for Few-Shot Learning
NeurIPS 2025-08-06 Github Demo
Multimodal Few-Shot Learning with Frozen Language Models NeurIPS 2025-08-06 - -

Multimodal Chain-of-Thought

Title Venue Date Code Demo
Star
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
arXiv 2025-08-06 Github -
Star
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
arXiv 2025-08-06 Github Local Demo
Star
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models
arXiv 2025-08-06 Github Local Demo
Star
Compositional Chain-of-Thought Prompting for Large Multimodal Models
CVPR 2025-08-06 Github -
Star
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
NeurIPS 2025-08-06 Github -
Star
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
arXiv 2025-08-06 Github Demo
Star
Explainable Multimodal Emotion Reasoning
arXiv 2025-08-06 Github -
Star
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
arXiv 2025-08-06 Github -
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction arXiv 2025-08-06 - -
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering arXiv 2025-08-06 - -
Star
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
arXiv 2025-08-06 Github Demo
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings arXiv 2025-08-06 Coming soon -
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
arXiv 2025-08-06 Github Demo
Chain of Thought Prompt Tuning in Vision Language Models arXiv 2025-08-06 Coming soon -
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv 2025-08-06 Github Demo
Star
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
arXiv 2025-08-06 Github Demo
Star
Multimodal Chain-of-Thought Reasoning in Language Models
arXiv 2025-08-06 Github -
Star
Visual Programming: Compositional visual reasoning without training
CVPR 2025-08-06 Github Local Demo
Star
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
NeurIPS 2025-08-06 Github -

LLM-Aided Visual Reasoning

Title Venue Date Code Demo
Star
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
arXiv 2025-08-06 Github Local Demo
Star
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
arXiv 2025-08-06 Github -
Star
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
arXiv 2025-08-06 Github Local Demo
Star
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
arXiv 2025-08-06 Github Demo
MM-VID: Advancing Video Understanding with GPT-4V(ision) arXiv 2025-08-06 - -
Star
ControlLLM: Augment Language Models with Tools by Searching on Graphs
arXiv 2025-08-06 Github -
Star
Woodpecker: Hallucination Correction for Multimodal Large Language Models
arXiv 2025-08-06 Github Demo
Star
MindAgent: Emergent Gaming Interaction
arXiv 2025-08-06 Github -
Star
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
arXiv 2025-08-06 Github Demo
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models arXiv 2025-08-06 - -
Star
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
arXiv 2025-08-06 Github -
AVIS: Autonomous Visual Information Seeking with Large Language Models arXiv 2025-08-06 - -
Star
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
arXiv 2025-08-06 Github Demo
Mindstorms in Natural Language-Based Societies of Mind arXiv 2025-08-06 - -
Star
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
arXiv 2025-08-06 Github -
Star
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
arXiv 2025-08-06 Github Local Demo
Star
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
arXiv 2025-08-06 Github -
Star
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
arXiv 2025-08-06 Github Demo
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
arXiv 2025-08-06 Github Demo
Star
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
arXiv 2025-08-06 Github Demo
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv 2025-08-06 Github Demo
Star
ViperGPT: Visual Inference via Python Execution for Reasoning
arXiv 2025-08-06 Github Local Demo
Star
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
arXiv 2025-08-06 Github Local Demo
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction ICCV 2025-08-06 - -
Star
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
arXiv 2025-08-06 Github Demo
Star
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
CVPR 2025-08-06 Github -
Star
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
CVPR 2025-08-06 Github Demo
Star
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
arXiv 2025-08-06 Github -
Star
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
CVPR 2025-08-06 Github -
Star
Visual Programming: Compositional visual reasoning without training
CVPR 2025-08-06 Github Local Demo
Star
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
arXiv 2025-08-06 Github -

Foundation Models

Title Venue Date Code Demo
Star
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
arXiv 2025-08-06 Github Demo
Star
Emu3: Next-Token Prediction is All You Need
arXiv 2025-08-06 Github Local Demo
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models Meta 2025-08-06 - Demo
Pixtral-12B Mistral 2025-08-06 - -
Star
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
arXiv 2025-08-06 Github -
The Llama 3 Herd of Models arXiv 2025-08-06 - -
Chameleon: Mixed-Modal Early-Fusion Foundation Models arXiv 2025-08-06 - -
Hello GPT-4o OpenAI 2025-08-06 - -
The Claude 3 Model Family: Opus, Sonnet, Haiku Anthropic 2025-08-06 - -
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context Google 2025-08-06 - -
Gemini: A Family of Highly Capable Multimodal Models Google 2025-08-06 - -
Fuyu-8B: A Multimodal Architecture for AI Agents blog 2025-08-06 Huggingface Demo
Star
Unified Model for Image, Video, Audio and Language Tasks
arXiv 2025-08-06 Github Demo
PaLI-3 Vision Language Models: Smaller, Faster, Stronger arXiv 2025-08-06 - -
GPT-4V(ision) System Card OpenAI 2025-08-06 - -
Star
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
arXiv 2025-08-06 Github -
Multimodal Foundation Models: From Specialists to General-Purpose Assistants arXiv 2025-08-06 - -
Star
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
NeurIPS 2025-08-06 Github -
Star
Generative Pretraining in Multimodality
arXiv 2025-08-06 Github Demo
Star
Kosmos-2: Grounding Multimodal Large Language Models to the World
arXiv 2025-08-06 Github Demo
Star
Transfer Visual Prompt Generator across LLMs
arXiv 2025-08-06 Github Demo
GPT-4 Technical Report arXiv 2025-08-06 - -
PaLM-E: An Embodied Multimodal Language Model arXiv 2025-08-06 - Demo
Star
Prismer: A Vision-Language Model with An Ensemble of Experts
arXiv 2025-08-06 Github Demo
Star
Language Is Not All You Need: Aligning Perception with Language Models
arXiv 2025-08-06 Github -
Star
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
arXiv 2025-08-06 Github Demo
Star
VIMA: General Robot Manipulation with Multimodal Prompts
ICML 2025-08-06 Github Local Demo
Star
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
NeurIPS 2025-08-06 Github -
Star
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
ICLR 2025-08-06 Github -
Star
Language Models are General-Purpose Interfaces
arXiv 2025-08-06 Github -

Evaluation

Title Venue Date Page
Stars
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
arXiv 2025-08-06 Github
Stars
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
arXiv 2025-08-06 Github
Stars
OmniBench: Towards The Future of Universal Omni-Language Models
arXiv 2025-08-06 Github
Stars
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
arXiv 2025-08-06 Github
Stars
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
TPAMI 2025-08-06 Github
Stars
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
arXiv 2025-08-06 Github
Stars
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
arXiv 2025-08-06 Github
Stars
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
arXiv 2025-08-06 Github
Stars
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
arXiv 2025-08-06 Github
Stars
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
arXiv 2025-08-06 Github
Stars
Benchmarking Large Multimodal Models against Common Corruptions
NAACL 2025-08-06 Github
Stars
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
arXiv 2025-08-06 Github
Stars
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
arXiv 2025-08-06 Github
Stars
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
arXiv 2025-08-06 Github
Star
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
arXiv 2025-08-06 Github
Star
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
arXiv 2025-08-06 Github
Star
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
arXiv 2025-08-06 Github
VLM-Eval: A General Evaluation on Video Large Language Models arXiv 2025-08-06 Coming soon
Star
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
arXiv 2025-08-06 Github
Star
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
arXiv 2025-08-06 Github
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead arXiv 2025-08-06 -
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging arXiv 2025-08-06 -
Star
An Early Evaluation of GPT-4V(ision)
arXiv 2025-08-06 Github
Star
Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation
arXiv 2025-08-06 Github
Star
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
CVPR 2025-08-06 Github
Star
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
ICLR 2025-08-06 Github
Star
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations
arXiv 2025-08-06 Github
Star
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
arXiv 2025-08-06 Github
Star
Can We Edit Multimodal Large Language Models?
arXiv 2025-08-06 Github
Star
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets
arXiv 2025-08-06 Github
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) arXiv 2025-08-06 -
Star
TouchStone: Evaluating Vision-Language Models by Language Models
arXiv 2025-08-06 Github
Star
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
arXiv 2025-08-06 Github
Star
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
arXiv 2025-08-06 Github
Star
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
arXiv 2025-08-06 Github
Star
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
arXiv 2025-08-06 Github
Star
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
CVPR 2025-08-06 Github
Star
MMBench: Is Your Multi-modal Model an All-around Player?
arXiv 2025-08-06 Github
Star
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
arXiv 2025-08-06 Github
Star
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
arXiv 2025-08-06 Github
Star
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
arXiv 2025-08-06 Github
Star
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
arXiv 2025-08-06 Github
Star
On The Hidden Mystery of OCR in Large Multimodal Models
arXiv 2025-08-06 Github

Multimodal RLHF

Title Venue Date Code Demo
Star
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
arXiv 2025-08-06 Github -
Star
Aligning Multimodal LLM with Human Preference: A Survey
arXiv 2025-08-06 Github -
Star
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
arXiv 2025-08-06 Github -
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization arXiv 2025-08-06 - -
Star
Silkie: Preference Distillation for Large Visual Language Models
arXiv 2025-08-06 Github -
Star
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
arXiv 2025-08-06 Github Demo
Star
Aligning Large Multimodal Models with Factually Augmented RLHF
arXiv 2025-08-06 Github Demo
Star
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
arXiv 2025-08-06 Github -

Others

Title Venue Date Code Demo
Star
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
arXiv 2025-08-06 Github -
Star
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
arXiv 2025-08-06 Github -
Star
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
arXiv 2025-08-06 Github Local Demo
Star
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
arXiv 2025-08-06 Github -
Star
Planting a SEED of Vision in Large Language Model
arXiv 2025-08-06 Github -
Star
Can Large Pre-trained Models Help Vision Models on Perception Tasks?
arXiv 2025-08-06 Github -
Star
Contextual Object Detection with Multimodal Large Language Models
arXiv 2025-08-06 Github Demo
Star
Generating Images with Multimodal Language Models
arXiv 2025-08-06 Github -
Star
On Evaluating Adversarial Robustness of Large Vision-Language Models
arXiv 2025-08-06 Github -
Star
Grounding Language Models to Images for Multimodal Inputs and Outputs
ICML 2025-08-06 Github Demo

Awesome Datasets

Datasets of Pre-Training for Alignment

Name Paper Type Modalities
ShareGPT4Video ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Caption Video-Text
COYO-700M COYO-700M: Image-Text Pair Dataset Caption Image-Text
ShareGPT4V ShareGPT4V: Improving Large Multi-Modal Models with Better Captions Caption Image-Text
AS-1B The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World Hybrid Image-Text
InternVid InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation Caption Video-Text
MS-COCO Microsoft COCO: Common Objects in Context Caption Image-Text
SBU Captions Im2Text: Describing Images Using 1 Million Captioned Photographs Caption Image-Text
Conceptual Captions Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning Caption Image-Text
LAION-400M LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs Caption Image-Text
VG Captions Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations Caption Image-Text
Flickr30k Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models Caption Image-Text
AI-Caps AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding Caption Image-Text
Wukong Captions Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark Caption Image-Text
GRIT Kosmos-2: Grounding Multimodal Large Language Models to the World Caption Image-Text-Bounding-Box
Youku-mPLUG Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks Caption Video-Text
MSR-VTT MSR-VTT: A Large Video Description Dataset for Bridging Video and Language Caption Video-Text
Webvid10M Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval Caption Video-Text
WavCaps WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research Caption Audio-Text
AISHELL-1 AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline ASR Audio-Text
AISHELL-2 AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale ASR Audio-Text
VSDial-CN X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages ASR Image-Audio-Text

Datasets of Multimodal Instruction Tuning

Name Paper Link Notes
Inst-IT Dataset Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning Link An instruction-tuning dataset which contains fine-grained multi-level annotations for 21k videos and 51k images
E.T. Instruct 164K E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding Link An instruction-tuning dataset for time-sensitive video understanding
MSQA Multi-modal Situated Reasoning in 3D Scenes Link A large scale dataset for multi-modal situated reasoning in 3D scenes
MM-Evol MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct Link An instruction dataset with rich diversity
UNK-VQA UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models Link A dataset designed to teach models to refrain from answering unanswerable questions
VEGA VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models Link A dataset for enhancing model capabilities in comprehension of interleaved information
ALLaVA-4V ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model Link Vision and language caption and instruction dataset generated by GPT4V
IDK Visually Dehallucinative Instruction Generation: Know What You Don't Know Link Dehallucinative visual instruction for "I Know" hallucination
CAP2QA Visually Dehallucinative Instruction Generation Link Image-aligned visual instruction dataset
M3DBench M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts Link A large-scale 3D instruction tuning dataset
ViP-LLaVA-Instruct Making Large Multimodal Models Understand Arbitrary Visual Prompts Link A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data
LVIS-Instruct4V To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning Link A visual instruction dataset via self-instruction from GPT-4V
ComVint What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning Link A synthetic instruction dataset for complex visual reasoning
SparklesDialogue Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models Link A machine-generated dialogue dataset for word-level interleaved multi-image and text interactions, designed to strengthen the conversational competence of instruction-following LLMs across multiple images and dialogue turns
StableLLaVA StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data Link A cheap and effective approach to collect visual instruction tuning data
M-HalDetect Detecting and Preventing Hallucinations in Large Vision Language Models Coming soon A dataset used to train and benchmark models for hallucination detection and prevention
MGVLID ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning - A high-quality instruction-tuning dataset including image-text and region-text pairs
BuboGPT BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs Link A high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data
SVIT SVIT: Scaling up Visual Instruction Tuning Link A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs
mPLUG-DocOwl mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding Link An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding
PF-1M Visual Instruction Tuning with Polite Flamingo Link A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo.
ChartLlama ChartLlama: A Multimodal LLM for Chart Understanding and Generation Link A multi-modal instruction-tuning dataset for chart understanding and generation
LLaVAR LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Link A visual instruction-tuning dataset for Text-rich Image Understanding
MotionGPT MotionGPT: Human Motion as a Foreign Language Link An instruction-tuning dataset covering multiple human motion-related tasks
LRV-Instruction Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning Link A visual instruction tuning dataset for mitigating hallucination
Macaw-LLM Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration Link A large-scale multi-modal instruction dataset in terms of multi-turn dialogue
LAMM-Dataset LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark Link A comprehensive multi-modal instruction tuning dataset
Video-ChatGPT Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Link 100K high-quality video instruction dataset
MIMIC-IT MIMIC-IT: Multi-Modal In-Context Instruction Tuning Link Multimodal in-context instruction tuning
M3IT M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning Link Large-scale, broad-coverage multimodal instruction tuning dataset
LLaVA-Med LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day Coming soon A large-scale, broad-coverage biomedical instruction-following dataset
GPT4Tools GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction Link Tool-related instruction datasets
MULTIS ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst Coming soon Multimodal instruction tuning dataset covering 16 multimodal tasks
DetGPT DetGPT: Detect What You Need via Reasoning Link Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs
PMC-VQA PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering Coming soon Large-scale medical visual question-answering dataset
VideoChat VideoChat: Chat-Centric Video Understanding Link Video-centric multimodal instruction dataset
X-LLM X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages Link Chinese multimodal instruction dataset
LMEye LMEye: An Interactive Perception Network for Large Language Models Link A multi-modal instruction-tuning dataset
cc-sbu-align MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models Link A multimodal aligned dataset for improving the model's usability and generation fluency
LLaVA-Instruct-150K Visual Instruction Tuning Link Multimodal instruction-following data generated by GPT
MultiInstruct MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning Link The first multimodal instruction tuning benchmark dataset
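
Most of the visual instruction-tuning sets listed above follow a LLaVA-style layout in which each record pairs an image reference with a multi-turn conversation. The snippet below is a minimal sketch of that layout, assuming a JSON file of such records; the field names and the toy loader are illustrative stand-ins, not any dataset's official schema or API.

```python
import json

# Illustrative record in the LLaVA-style layout (field names vary across datasets).
example_record = {
    "id": "000000001",
    "image": "coco/train2017/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the man holding?"},
        {"from": "gpt", "value": "He is holding a red umbrella."},
    ],
}

def load_instruction_data(path):
    """Read a JSON file containing a list of LLaVA-style records."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    # Keep only records that contain both an image and a non-empty conversation.
    return [r for r in records if r.get("image") and r.get("conversations")]

if __name__ == "__main__":
    with open("toy_instructions.json", "w", encoding="utf-8") as f:
        json.dump([example_record], f)
    print(load_instruction_data("toy_instructions.json")[0]["conversations"][0]["value"])
```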

Datasets of In-Context Learning

Name Paper Link Notes
MIC MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning Link A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs.
MIMIC-IT MIMIC-IT: Multi-Modal In-Context Instruction Tuning Link Multimodal in-context instruction dataset
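
Datasets such as MIC and MIMIC-IT target interleaved, multi-image few-shot usage. As a rough illustration of what multimodal in-context learning looks like at inference time, the sketch below assembles a prompt from (image, question, answer) demonstrations followed by a query; the "<image>" placeholder and the overall template are assumptions for illustration, not the format prescribed by either dataset.

```python
# Illustrative assembly of a multimodal in-context prompt: a few (image, question, answer)
# demonstrations followed by the query. How "<image>" placeholders map to actual images
# depends on the target MLLM; this is not any dataset's official template.
def build_icl_prompt(demonstrations, query_question):
    parts = []
    for question, answer in demonstrations:
        parts.append(f"<image>\nQuestion: {question}\nAnswer: {answer}")
    parts.append(f"<image>\nQuestion: {query_question}\nAnswer:")
    return "\n\n".join(parts)

demos = [
    ("How many dogs are in the picture?", "Two."),
    ("What color is the car?", "Blue."),
]
print(build_icl_prompt(demos, "What is the person doing?"))
# The demonstration images plus the query image are passed to the model alongside this
# text, one image per "<image>" placeholder.
```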

Datasets of Multimodal Chain-of-Thought

Name Paper Link Notes
EMER Explainable Multimodal Emotion Reasoning Coming soon A benchmark dataset for the explainable emotion reasoning task
EgoCOT EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought Coming soon Large-scale embodied planning dataset
VIP Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction Coming soon An inference-time dataset that can be used to evaluate VideoCOT
ScienceQA Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering Link Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains
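
Datasets like ScienceQA pair questions with rationales, which makes a two-stage prompting recipe natural: first elicit a step-by-step rationale, then condition the final answer on it. The sketch below assumes a hypothetical `model.generate(text, image)` interface; it shows one common multimodal chain-of-thought recipe rather than a procedure mandated by these datasets.

```python
# Two-stage multimodal chain-of-thought prompting (illustrative; `model.generate` is a
# hypothetical text+image interface, not a specific library's API).
def answer_with_cot(model, image, question, options):
    option_str = " ".join(f"({chr(ord('A') + i)}) {o}" for i, o in enumerate(options))

    # Stage 1: ask for a free-form rationale grounded in the image.
    rationale = model.generate(
        f"Question: {question}\nOptions: {option_str}\n"
        "Explain the reasoning step by step before answering.",
        image,
    )

    # Stage 2: condition the final choice on the generated rationale.
    answer = model.generate(
        f"Question: {question}\nOptions: {option_str}\n"
        f"Reasoning: {rationale}\nTherefore, the answer is:",
        image,
    )
    return rationale, answer
```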

Datasets of Multimodal RLHF

Name Paper Link Notes
VLFeedback Silkie: Preference Distillation for Large Visual Language Models Link A vision-language feedback dataset annotated by AI
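
Multimodal preference data such as VLFeedback is typically consumed as (image, prompt, preferred response, rejected response) tuples for reward modeling or DPO-style training. The record and helper below are a generic sketch of that layout; the actual VLFeedback schema and field names may differ.

```python
# A generic multimodal preference record (illustrative layout; not VLFeedback's exact schema).
preference_example = {
    "image": "images/kitchen_001.jpg",
    "prompt": "Describe what is happening in this image.",
    "chosen": "A person is chopping vegetables on a wooden cutting board.",
    "rejected": "A person is playing the guitar in a park.",  # hallucinated content
}

def to_pairwise_batch(records):
    """Split preference records into parallel chosen/rejected lists for pairwise training."""
    chosen = [(r["image"], r["prompt"], r["chosen"]) for r in records]
    rejected = [(r["image"], r["prompt"], r["rejected"]) for r in records]
    return chosen, rejected

chosen, rejected = to_pairwise_batch([preference_example])
print(chosen[0][2], "| preferred over |", rejected[0][2])
```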

Benchmarks for Evaluation

Name Paper Link Notes
Inst-IT Bench Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning Link A benchmark to evaluate fine-grained instance-level understanding in images and videos
M3CoT M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought Link A multi-domain, multi-step benchmark for multimodal CoT
MMGenBench MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective Link A benchmark that gauges the performance of constructing image-generation prompts given an image
MiCEval MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps Link A multimodal CoT benchmark to evaluate MLLMs' reasoning capabilities
LiveXiv LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content Link A live benchmark based on arXiv papers
TemporalBench TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Link A benchmark for evaluation of fine-grained temporal understanding
OmniBench OmniBench: Towards The Future of Universal Omni-Language Models Link A benchmark that evaluates models' capabilities of processing visual, acoustic, and textual inputs simultaneously
MME-RealWorld MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? Link A challenging benchmark that involves real-life scenarios
VELOCITI VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time? Link A video benchmark that evaluates perception and binding capabilities
MMR Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions Link A benchmark for measuring MLLMs' understanding capability and robustness to leading questions
CharXiv CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Link Chart understanding benchmark curated by human experts
Video-MME Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Link A comprehensive evaluation benchmark of Multi-modal LLMs in video analysis
VL-ICL Bench VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning Link A benchmark for M-ICL evaluation, covering a wide spectrum of tasks
TempCompass TempCompass: Do Video LLMs Really Understand Videos? Link A benchmark to evaluate the temporal perception ability of Video LLMs
GVLQA GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning Link A benchmark for evaluation of graph reasoning capabilities
CoBSAT Can MLLMs Perform Text-to-Image In-Context Learning? Link A benchmark for text-to-image ICL
VQAv2-IDK Visually Dehallucinative Instruction Generation: Know What You Don't Know Link A benchmark for assessing "I Know" visual hallucination
Math-Vision Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset Link A diverse mathematical reasoning benchmark
SciMMIR SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval Link A benchmark for scientific multi-modal information retrieval
CMMMU CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark Link A Chinese benchmark involving reasoning and knowledge across multiple disciplines
MMCBench Benchmarking Large Multimodal Models against Common Corruptions Link A benchmark for examining self-consistency under common corruptions
MMVP Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs Link A benchmark for assessing visual capabilities
TimeIT TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding Link A video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks.
ViP-Bench Making Large Multimodal Models Understand Arbitrary Visual Prompts Link A benchmark for visual prompts
M3DBench M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts Link A 3D-centric benchmark
Video-Bench Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models Link A benchmark for video-MLLM evaluation
Charting-New-Territories Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs Link A benchmark for evaluating geographic and geospatial capabilities
MLLM-Bench MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V Link GPT-4V evaluation with per-sample criteria
BenchLMM BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models Link A benchmark for assessment of the robustness against different image styles
MMC-Benchmark MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning Link A comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts
MVBench MVBench: A Comprehensive Multi-modal Video Understanding Benchmark Link A comprehensive multimodal benchmark for video understanding
Bingo Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges Link A benchmark for hallucination evaluation that focuses on two common types
MagnifierBench OtterHD: A High-Resolution Multi-modality Model Link A benchmark designed to probe models' ability of fine-grained perception
HallusionBench HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models Link An image-context reasoning benchmark for evaluation of hallucination
PCA-EVAL Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond Link A benchmark for evaluating multi-domain embodied decision-making.
MMHal-Bench Aligning Large Multimodal Models with Factually Augmented RLHF Link A benchmark for hallucination evaluation
MathVista MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models Link A benchmark that challenges both visual and math reasoning capabilities
SparklesEval Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models Link A GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria.
ISEKAI Link-Context Learning for Multimodal LLMs Link A benchmark comprising exclusively of unseen generated image-label pairs designed for link-context learning
M-HalDetect Detecting and Preventing Hallucinations in Large Vision Language Models Coming soon A dataset used to train and benchmark models for hallucination detection and prevention
I4 Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions Link A benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions
SciGraphQA SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs Link A large-scale chart-visual question-answering dataset
MM-Vet MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities Link An evaluation benchmark that examines large multimodal models on complicated multimodal tasks
SEED-Bench SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension Link A benchmark for evaluation of generative comprehension in MLLMs
MMBench MMBench: Is Your Multi-modal Model an All-around Player? Link A systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models
Lynx What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? Link A comprehensive evaluation benchmark including both image and video tasks
GAVIE Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning Link A benchmark to evaluate the hallucination and instruction following ability
MME MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models Link A comprehensive MLLM Evaluation benchmark
LVLM-eHub LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models Link An evaluation platform for MLLMs
LAMM-Benchmark LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark Link A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks
M3Exam M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models Link A multilingual, multimodal, multilevel benchmark for evaluating MLLMs
OwlEval mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality Link Dataset for evaluation on multiple capabilities
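
Many of the benchmarks above (MME is a typical case) use short, closed-form questions, so evaluation largely reduces to parsing a prediction and comparing it with the label. The loop below is a generic, hypothetical harness with a yes/no answer format and an assumed `ask(image_path, question)` callable; each benchmark ships its own scoring script, which should be used for reported numbers.

```python
# Generic accuracy loop for closed-form (e.g., yes/no) benchmark items.
# `ask(image_path, question) -> str` is a hypothetical model interface; real benchmarks
# provide their own scoring scripts, which should be preferred for reported results.
def evaluate(ask, items):
    correct = 0
    for image_path, question, ground_truth in items:
        prediction = ask(image_path, question).strip().lower()
        # Naive parsing: keep only the leading "yes"/"no" decision.
        prediction = "yes" if prediction.startswith("yes") else "no"
        correct += int(prediction == ground_truth.lower())
    return correct / max(len(items), 1)

toy_items = [
    ("img/cat.jpg", "Is there a cat in the image?", "yes"),
    ("img/street.jpg", "Is it raining in the image?", "no"),
]
print(evaluate(lambda img, q: "Yes, there is.", toy_items))  # 0.5 with this dummy model
```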

Others

Name Paper Link Notes
IMAD IMAD: IMage-Augmented multi-modal Dialogue Link Multimodal dialogue dataset
Video-ChatGPT Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Link A quantitative evaluation framework for video-based dialogue models
CLEVR-ATVC Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation Link A synthetic multimodal fine-tuning dataset for learning to reject instructions
Fruit-ATVC Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation Link A manually pictured multimodal fine-tuning dataset for learning to reject instructions
InfoSeek Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? Link A VQA dataset that focuses on asking information-seeking questions
OVEN Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities Link A dataset that focuses on recognizing visual entities from Wikipedia, given images in the wild