
MMLU & Composite Benchmarks

Comprehensive evaluation with MMLU, MMMU, AGIEval, and other composite benchmarks for measuring broad AI capabilities

40 min read · Advanced

What are Composite Benchmarks?

Composite benchmarks evaluate AI models across multiple domains and skill types in a single comprehensive assessment. Unlike specialized benchmarks that focus on specific capabilities, composite benchmarks like MMLU test broad knowledge and reasoning across subjects ranging from elementary mathematics to professional law, providing a holistic view of model capabilities and limitations.

MMLU: Massive Multitask Language Understanding

Dataset Characteristics

  • 15,908 questions across 57 subjects
  • 4-choice multiple choice format
  • 4 difficulty levels: Elementary, High School, College, Professional
  • Covers 4 major areas: STEM, Humanities, Social Sciences, Other
  • Few-shot evaluation with 5 examples per subject (prompt construction is sketched after this list)
  • Human expert baseline for comparison
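
The few-shot setup is mechanical enough to sketch: five solved questions from a subject's dev split are prepended to the test question, each rendered in the "lettered options + Answer:" style commonly used by MMLU evaluation code. The snippet below is a minimal, self-contained sketch of that prompt construction; the exemplar data and field names (`question`, `choices`, `answer`) are illustrative, not a specific loader's schema.

```python
# Minimal sketch of 5-shot MMLU prompt construction. The exemplar data below is
# illustrative; a real run draws the five solved examples from each subject's dev split.

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_example(question, choices, answer_idx=None):
    """Render one question with lettered options; leave the answer blank for the test item."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(CHOICE_LETTERS, choices)]
    lines.append("Answer:" if answer_idx is None else f"Answer: {CHOICE_LETTERS[answer_idx]}")
    return "\n".join(lines)

def build_few_shot_prompt(subject, dev_examples, test_item, k=5):
    """Concatenate k solved dev examples with the unsolved test question."""
    header = f"The following are multiple choice questions (with answers) about {subject.replace('_', ' ')}.\n\n"
    shots = "\n\n".join(format_example(ex["question"], ex["choices"], ex["answer"]) for ex in dev_examples[:k])
    query = format_example(test_item["question"], test_item["choices"])
    return header + shots + "\n\n" + query

# Tiny demonstration with a single made-up exemplar (a real subject would supply five).
dev = [{"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1}]
test = {"question": "What is 7 * 6?", "choices": ["40", "41", "42", "44"]}
print(build_few_shot_prompt("elementary_mathematics", dev, test, k=1))
```

The model's completion is then read off after the final "Answer:", either by comparing the likelihoods of the four letters or by parsing the generated letter.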

Subject Categories

STEM (18 subjects): Mathematics, Physics, Chemistry, Biology, Computer Science, Engineering
Humanities (13 subjects): History, Philosophy, Law, Ethics, World Religions
Social Sciences (12 subjects): Psychology, Sociology, Economics, Geography, Politics
Other (14 subjects): Business, Medicine, Health, Miscellaneous topics

Sample MMLU Questions

High School Chemistry

Question: Which of the following best explains why water has a higher boiling point than hydrogen sulfide?

A) Water has a lower molecular weight
B) Water molecules form hydrogen bonds
C) Water has a linear molecular structure
D) Water is more polar than hydrogen sulfide
Answer: B

College Mathematics

Question: Let f(x) = x³ - 3x + 1. What is the number of real roots of f(x) = 0?

A) 0
B) 1
C) 2
D) 3
Answer: D
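
The answer can be sanity-checked by hand: f(-2) = -1, f(0) = 1, f(1) = -1, and f(2) = 3, so the sign changes three times and the cubic has three real roots. A quick numerical check with NumPy:

```python
import numpy as np

# Roots of f(x) = x^3 - 3x + 1, coefficients in descending order of degree.
roots = np.roots([1.0, 0.0, -3.0, 1.0])
real_roots = sorted(float(r.real) for r in roots if np.isclose(r.imag, 0))
print([round(r, 3) for r in real_roots])  # approximately [-1.879, 0.347, 1.532]
print(len(real_roots))                    # 3, matching answer D
```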

| Model | MMLU Score | STEM | Humanities | Social Sciences |
|---|---|---|---|---|
| GPT-4 | 86.4% | 83.3% | 85.2% | 91.1% |
| Claude 3 Opus | 86.8% | 85.2% | 86.4% | 89.7% |
| Gemini Ultra | 83.7% | 80.9% | 84.2% | 88.1% |
| Human Expert | 89.8% | 87.5% | 91.2% | 92.8% |

MMMU: Massive Multi-discipline Multimodal Understanding

Multimodal Extension

  • 11,500 multimodal questions across 6 disciplines and 30 subjects
  • Images required for answering questions
  • Mix of multiple-choice and open-ended question formats
  • College-level difficulty across all subjects
  • Vision-language integration essential for success (see the prompt-assembly sketch after this list)
  • Expert validation of question quality
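
Because the images are interleaved with the question text, evaluation code has to assemble a mixed image/text prompt rather than a flat string. The sketch below shows one way to do that, assuming a record layout with an `options` list, numbered `image_N` fields, and `<image N>` placeholders in the question text; these field names are assumptions for illustration, not the dataset's definitive schema.

```python
import re

def to_multimodal_message(record):
    """Split the question on <image N> placeholders and interleave image parts with text parts."""
    parts = []
    for chunk in re.split(r"(<image \d+>)", record["question"]):
        match = re.fullmatch(r"<image (\d+)>", chunk)
        if match:
            # Attach the referenced image payload (bytes, path, or PIL image, depending on the loader).
            parts.append({"type": "image", "image": record[f"image_{match.group(1)}"]})
        elif chunk.strip():
            parts.append({"type": "text", "text": chunk})
    options = "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", record["options"]))
    parts.append({"type": "text", "text": f"\n{options}\nAnswer with a single letter."})
    return parts

# Toy record; the bytes stand in for a real image.
record = {
    "question": "Which architectural style is shown in <image 1>?",
    "options": ["Gothic", "Baroque", "Brutalist", "Art Deco"],
    "image_1": b"...image bytes...",
}
for part in to_multimodal_message(record):
    print(part["type"], part.get("text", "<image payload>"))
```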

Subject Coverage

  • Art & Design: Architecture, Fashion, Graphic Design
  • Business: Accounting, Economics, Marketing
  • Science: Biology, Chemistry, Physics, Geography
  • Health & Medicine: Pharmacy, Public Health
  • Humanities: History, Literature, Philosophy
  • Social Science: Psychology, Sociology

MMMU vs MMLU: Key Differences

MMLU (Text-only)

  • Pure language understanding
  • 57 subjects across all difficulty levels
  • Primarily multiple choice (4 options)
  • Tests factual knowledge & reasoning
  • Human expert baseline: ~90%

MMMU (Multimodal)

  • Vision-language understanding required
  • 30 subjects, college-level focus
  • Multiple question formats
  • Tests visual reasoning & interpretation
  • Human expert baseline: ~89%

Other Major Composite Benchmarks

AGIEval

Tests foundation models on human-centric standardized exams including SAT, LSAT, GRE, and national college entrance exams from China.

  • 20 official exams
  • Real-world assessment standards
  • English and Chinese versions
  • Human performance baselines

HellaSwag

Tests commonsense natural language inference: the model must choose the most plausible ending for a described situation from machine-generated, adversarially filtered candidates that humans find trivial (a scoring sketch follows the list below).

  • 70,000 multiple choice questions
  • Adversarially filtered examples
  • Human accuracy: ~95%
  • Tests physical commonsense
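
A common way to score HellaSwag-style items is to ask the model for the log-likelihood of each candidate ending given the context and pick the ending with the highest length-normalized score. The sketch below assumes a hypothetical `ending_logprob(context, ending)` hook into whatever model is being evaluated; it is not tied to any particular library.

```python
def pick_ending(context, endings, ending_logprob):
    """Return the index of the ending with the highest length-normalized log-likelihood.

    ending_logprob(context, ending) is a hypothetical model hook assumed to return
    (total log-probability of the ending tokens given the context, number of ending tokens).
    """
    best_idx, best_score = 0, float("-inf")
    for i, ending in enumerate(endings):
        logprob, n_tokens = ending_logprob(context, ending)
        score = logprob / max(n_tokens, 1)  # length normalization
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

Length normalization keeps longer endings from being penalized simply for containing more tokens; reporting both normalized and unnormalized accuracy is common practice.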

ARC (AI2 Reasoning Challenge)

Science exam questions that require reasoning and world knowledge, designed to be challenging for statistical AI systems.

  • Easy set: 5,197 questions
  • Challenge set: 2,590 questions
  • Grade 3-9 science level
  • Multiple choice format

BIG-bench

Collaborative benchmark with 200+ tasks designed to probe capabilities and limitations of large language models.

  • 200+ diverse tasks
  • Many tasks beyond current model capabilities
  • Community contributions
  • Emergent abilities focus

Meta Llama Human Evaluation Framework

Evaluation Methodology

  • Human preference evaluation on real-world prompts
  • Side-by-side comparisons with other models
  • Multiple evaluation criteria: Helpfulness, harmlessness, honesty
  • Diverse prompt categories from actual usage
  • Professional human evaluators with domain expertise
  • Inter-annotator agreement validation (a Cohen's kappa sketch follows this list)
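
Inter-annotator agreement is usually reported with a chance-corrected statistic such as Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance. The snippet below is a minimal pure-Python version for two annotators with illustrative labels; it is a generic sketch, not Meta's evaluation tooling.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement for two annotators: kappa = (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n       # p_o
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)      # p_e
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Two annotators judging which model in a side-by-side pair won each prompt (illustrative labels).
a = ["model_1", "model_2", "tie", "model_1", "model_1", "model_2"]
b = ["model_1", "model_2", "model_1", "model_1", "tie", "model_2"]
print(round(cohens_kappa(a, b), 3))
```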

Evaluation Categories

  • Creative writing: Stories, poems, creative tasks
  • Reasoning: Logic puzzles, mathematical problems
  • Code generation: Programming tasks and debugging
  • Information seeking: Q&A and research tasks
  • Summarization: Text condensation tasks
  • Instruction following: Complex multi-step tasks

Key Insights from Human Evaluation

Benchmark vs Human Preferences

  • Automated metrics don't always correlate with human preferences
  • Context-dependent performance varies significantly
  • Safety considerations crucial in real-world usage
  • Task complexity affects evaluation reliability

Model Capabilities

  • Creative tasks show high variance in quality
  • Factual accuracy remains challenging
  • Instruction following improves with model size
  • Domain expertise varies across subjects

Implementation Best Practices

MMLU & Composite Benchmark Evaluation Framework

✅ Best Practices

  • Use few-shot prompting with subject-specific examples
  • Track per-subject and per-difficulty performance
  • Report confidence intervals for statistical significance (see the sketch after this list)
  • Compare against human expert baselines
  • Analyze error patterns across domains
  • Consider domain transfer capabilities
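
For the confidence-interval and random-baseline points above, a Wilson score interval per subject is a simple, robust choice. The sketch below uses hypothetical per-subject results, flags subjects whose lower bound does not clear the 25% random baseline for 4-option questions, and contrasts the question-count-weighted (micro) average with the unweighted per-subject (macro) average, since the two diverge when subject sizes differ.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """95% Wilson score interval for a binomial proportion (here, per-subject accuracy)."""
    if total == 0:
        return (0.0, 0.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - half, center + half)

RANDOM_BASELINE = 0.25  # chance accuracy for 4-option multiple choice

# Hypothetical per-subject results: subject -> (correct, total).
results = {"college_mathematics": (61, 100), "professional_law": (810, 1534), "virology": (83, 166)}
for subject, (correct, total) in sorted(results.items()):
    lo, hi = wilson_interval(correct, total)
    flag = "above chance" if lo > RANDOM_BASELINE else "not clearly above chance"
    print(f"{subject:22s} acc={correct/total:.3f}  95% CI=({lo:.3f}, {hi:.3f})  {flag}")

# Question-count-weighted (micro) average vs unweighted per-subject (macro) average.
micro = sum(c for c, _ in results.values()) / sum(t for _, t in results.values())
macro = sum(c / t for c, t in results.values()) / len(results)
print(f"micro={micro:.3f}  macro={macro:.3f}")
```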

❌ Common Pitfalls

  • Focusing only on aggregate scores
  • Ignoring subject-specific performance patterns
  • Not accounting for random baseline (25% for 4-choice)
  • Overlooking prompt sensitivity
  • Missing statistical significance testing
  • Assuming correlation between different benchmarks

Future Directions in Composite Evaluation

Dynamic Evaluation

  • Adaptive testing based on performance
  • Real-time question generation
  • Preventing test set contamination
  • Personalized difficulty adjustment

Multimodal Integration

  • Video understanding benchmarks
  • Audio-visual reasoning tasks
  • 3D spatial reasoning evaluation
  • Embodied AI assessment

Interactive Evaluation

  • Multi-turn conversation assessment
  • Tool use and external knowledge
  • Collaborative problem solving
  • Real-world task simulation

Specialized Domains

  • Professional certification exams
  • Cultural and linguistic diversity
  • Domain-specific reasoning patterns
  • Cross-cultural knowledge evaluation