JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

1Renmin University of China 2Baichuan Inc.

Abstract

Compared to vision or audio large language models (LLMs), the key advantage of omni large language models lies in their joint audio-visual reasoning capability. Training such models requires datasets whose questions need both visual and auditory information to answer. Moreover, videos interleave complex audio signal types and scenes with one another, demanding a range of cognitive capabilities from models. However, current datasets lack challenging multi-scene tasks, diverse audio information types, and coverage of varied cognitive abilities.

This paper introduces JointAVBench, a benchmark of questions that necessitate audio-visual integration to answer, spanning 5 cognitive dimensions, 4 audio information types, and 3 scene spans. Our evaluation reveals that the best-performing omni-LLM achieves only 56.2% average accuracy, highlighting significant room for improvement, particularly in cross-scene reasoning.

Key Features

  • Large-scale Benchmark: 2,853 questions across 15 diverse task types
  • Automated Generation Pipeline: State-of-the-art vision-LLMs, audio-LLMs, and LLMs automatically synthesize questions requiring joint audio-visual reasoning
  • Multi-dimensional Coverage:
    • 5 cognitive dimensions: Temporal, Spatial, Long-form, Emotional, and Plot understanding
    • 4 audio information types: Speech, Sound events, Music, and Speech emotion
    • 3 scene spans: Single-scene, Multi-scene, and Full-scene reasoning
  • Comprehensive Evaluation: Evaluation suite covering the major mainstream omni-modal models
  • Challenging Tasks: Multi-scene tasks requiring complex cross-modal reasoning

Data Generation Pipeline

Our automated benchmark generation pipeline leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. The pipeline consists of several stages that ensure high-quality benchmark questions with strict audio-video correlation.

Figure: JointAVBench data generation pipeline for automated benchmark construction from raw videos
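
The released pipeline code is not shown here; the Python sketch below only illustrates the stage structure described above, and every function and type name in it is a hypothetical placeholder rather than the actual implementation.

from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative sketch of the generation stages described above. The model
# callables and their signatures are hypothetical placeholders.

@dataclass
class Segment:
    frames: object  # sampled video frames for the scene span
    audio: object   # the corresponding audio track

def build_entry(
    segment: Segment,
    caption_video: Callable[[object], str],   # vision-LLM: describe what is seen
    caption_audio: Callable[[object], str],   # audio-LLM: speech, sound events, music, emotion
    write_qa: Callable[[str, str], dict],     # general-purpose LLM: question, options, answer
    answerable: Callable[[dict, str], bool],  # checker: can the QA be solved from one description?
) -> Optional[dict]:
    visual_desc = caption_video(segment.frames)
    audio_desc = caption_audio(segment.audio)
    qa = write_qa(visual_desc, audio_desc)
    # Keep only questions that cannot be answered from a single modality,
    # enforcing the strict audio-video correlation required by the benchmark.
    if answerable(qa, visual_desc) or answerable(qa, audio_desc):
        return None
    return qa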

Benchmark Statistics

JointAVBench consists of 2,853 questions across 15 distinct tasks spanning multiple dimensions. The benchmark covers diverse cognitive dimensions, audio information types, and scene spans to provide comprehensive evaluation of omni-modal models.

Figure: Distribution of questions across cognitive dimensions, audio types, and scene spans

Dataset

The JointAVBench dataset is available on Hugging Face. The benchmark file jointavbench.json contains all 2,853 questions with metadata. Please note that due to content restrictions, we cannot share the raw videos. However, we provide a URL to the original YouTube video for each question.

Download the Benchmark

# Download the benchmark questions (raw videos are not included; see the video_url field of each question)
pip install huggingface_hub
huggingface-cli download JointAVBench/JointAVBench --local-dir ./data
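
The same files can also be fetched from Python via huggingface_hub. This is a minimal sketch mirroring the CLI command above; whether the repository is hosted as a model or a dataset repo is an assumption noted in the comment.

from huggingface_hub import snapshot_download

# Download the benchmark metadata into ./data (mirrors the CLI command above).
# Assumption: if the files are hosted as a Hugging Face dataset repository,
# pass repo_type="dataset" here (and add --repo-type dataset to the CLI call).
snapshot_download(repo_id="JointAVBench/JointAVBench", local_dir="./data")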

Data Format

Each question in the benchmark follows this format:

{
  "qid": "-CEDoGn0w1s_task1_0",
  "video_name": "-CEDoGn0w1s",
  "task": "STL",
  "question": "Which objects are mentioned only in the dialogue...",
  "correct_answer": "The broom, mentioned at around 6.34s",
  "explanation": "The object \"broom\" is mentioned...",
  "options": [...],
  "video_url": "https://www.youtube.com/watch?v=-CEDoGn0w1s",
  "segment_timestamp": [653.444, 699.657]
}
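
As a minimal loading sketch, the entries can be read with the standard json module. It assumes jointavbench.json is a top-level JSON list of entries in the format above, stored under ./data from the download step.

import json
from collections import Counter

# Assumptions: jointavbench.json is a top-level JSON list of entries in the
# format shown above and was downloaded to ./data.
with open("./data/jointavbench.json", "r", encoding="utf-8") as f:
    questions = json.load(f)

print(f"{len(questions)} questions loaded")

# Number of questions per task type (e.g. "STL").
print(Counter(q["task"] for q in questions).most_common())

# Each entry points back to its source video and segment.
first = questions[0]
print(first["video_url"], first["segment_timestamp"])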

Performance

Overall Performance

Figure: Overall performance comparison of omni-modal models on JointAVBench

Breakdown Performance

Figure: Model performance across cognitive dimensions and audio types
Figure: Performance comparison across scene spans

Key Findings

  • Multi-modal Reasoning is Hard: Even top models struggle to integrate audio-visual information effectively
  • Scene Complexity Matters: Performance degrades significantly for multi-scene and full-video tasks
  • Audio Type Dependency: Models perform differently on speech vs. music vs. sound events
  • Cognitive Dimension Gaps: Temporal and spatial reasoning show better results than plot understanding

Our benchmark reveals significant challenges for current omni-modal models:

  • Top Performance: 56.2% average accuracy
  • Cross-scene Reasoning: Particularly challenging (42-50% accuracy)
  • Single-scene Tasks: Relatively better (68% accuracy)
  • Performance Gaps: Significant variations across cognitive dimensions and audio types
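
For reference, the sketch below shows one way such accuracies could be aggregated per task from model outputs. It is not the official evaluation suite, and the prediction format (a mapping from qid to the chosen option string) is an assumption.

from collections import defaultdict

# Hypothetical scoring helper, not the official evaluation suite:
# `predictions` is assumed to map each qid to the option string the model chose.
def accuracy_by_task(questions: list, predictions: dict) -> dict:
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q["task"]] += 1
        if predictions.get(q["qid"]) == q["correct_answer"]:
            correct[q["task"]] += 1
    return {task: correct[task] / total[task] for task in total}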

Citation

If you find JointAVBench useful for your research, please cite our paper:

@article{chao2025jointavbench,
  title={JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation},
  author={Chao, Jianghan and Gao, Jianzhang and Tan, Wenhui and Sun, Yuchong and Song, Ruihua and Ru, Liyun},
  journal={arXiv preprint arXiv:2512.12772},
  year={2025}
}

Contact

For questions and feedback: