MMLU & Composite Benchmarks
Comprehensive evaluation with MMLU, MMMU, AGIEval, and other composite benchmarks for measuring broad AI capabilities
What are Composite Benchmarks?
Composite benchmarks evaluate AI models across multiple domains and skill types in a single comprehensive assessment. Unlike specialized benchmarks that focus on specific capabilities, composite benchmarks like MMLU test broad knowledge and reasoning across subjects ranging from elementary mathematics to professional law, providing a holistic view of model capabilities and limitations.
MMLU: Massive Multitask Language Understanding
Dataset Characteristics
- 15,908 questions across 57 subjects
- Four-option multiple-choice format
- 4 difficulty levels: Elementary, High School, College, Professional
- Covers 4 major areas: STEM, Humanities, Social Sciences, Other
- Few-shot evaluation with 5 examples per subject (prompt format sketched below)
- Human expert baseline for comparison
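MMLU is typically scored by building a 5-shot prompt from each subject's dev examples and checking the first answer letter the model produces. The Python sketch below follows that convention; `ask_model` and the example dictionaries (with `question`, `choices`, and integer `answer` fields) are placeholders for whatever model client and data loader you use (e.g., the `cais/mmlu` dataset on Hugging Face).

```python
# Minimal sketch of 5-shot MMLU prompting and scoring.
# `ask_model` and the example dicts are placeholders for your model client
# and data loader; field names follow the common question/choices/answer layout.

LETTERS = "ABCD"

def format_example(ex, include_answer=True):
    """Render one MMLU item as question, lettered choices, and an Answer line."""
    lines = [ex["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(ex["choices"])]
    answer = f" {LETTERS[ex['answer']]}" if include_answer else ""
    return "\n".join(lines) + f"\nAnswer:{answer}\n"

def build_prompt(subject, dev_examples, test_example):
    """5-shot prompt: instruction header, five solved dev items, then the test item."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n".join(format_example(ex) for ex in dev_examples[:5])
    return header + shots + "\n" + format_example(test_example, include_answer=False)

def score_subject(subject, dev_examples, test_examples, ask_model):
    """Accuracy over one subject; `ask_model(prompt)` should return text starting with A-D."""
    correct = 0
    for ex in test_examples:
        prediction = ask_model(build_prompt(subject, dev_examples, ex)).strip()[:1]
        correct += prediction == LETTERS[ex["answer"]]
    return correct / len(test_examples)
```

Per-subject accuracies are then aggregated over all 57 subjects to produce the headline MMLU number.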
Sample MMLU Questions
Question: Which of the following best explains why water has a higher boiling point than hydrogen sulfide?
Question: Let f(x) = x³ - 3x + 1. What is the number of real roots of f(x) = 0?
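For reference, the chemistry item hinges on hydrogen bonding (water forms strong intermolecular hydrogen bonds that hydrogen sulfide largely lacks), and the calculus item can be settled with a short sign analysis of the cubic:

```latex
f(x) = x^3 - 3x + 1, \qquad f'(x) = 3x^2 - 3 = 3(x - 1)(x + 1)
% Critical points at x = -1 and x = 1:
f(-1) = -1 + 3 + 1 = 3 > 0 \quad (\text{local maximum}), \qquad
f(1) = 1 - 3 + 1 = -1 < 0 \quad (\text{local minimum})
% A positive local maximum followed by a negative local minimum means the
% graph crosses the x-axis three times, so f(x) = 0 has exactly 3 real roots.
```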
Reported MMLU scores (overall and by major category):
| Model | MMLU Score | STEM | Humanities | Social Sci |
|---|---|---|---|---|
| GPT-4 | 86.4% | 83.3% | 85.2% | 91.1% |
| Claude-3 Opus | 86.8% | 85.2% | 86.4% | 89.7% |
| Gemini Ultra | 83.7% | 80.9% | 84.2% | 88.1% |
| Human Expert | 89.8% | 87.5% | 91.2% | 92.8% |
MMMU: Massive Multi-discipline Multimodal Understanding
Multimodal Extension
- 11,500 multimodal questions across 30 subjects
- Images required for answering questions
- 6 question types: Multiple choice, true/false, fill-in-blank, etc.
- College-level difficulty across all subjects
- Vision-language integration essential for success (evaluation loop sketched below)
- Expert validation of question quality
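Scoring MMMU-style items means handing both the referenced image(s) and the question text to a vision-language model, then parsing an option letter out of free-form output. The sketch below is schematic: `vlm_answer` and the item fields (`question`, `options`, `images`, `answer`) are hypothetical stand-ins for your multimodal client and data loader.

```python
import re

# Schematic MMMU-style evaluation loop. `vlm_answer(images, prompt)` is a
# hypothetical stand-in for a multimodal model client; the item fields used
# here are likewise placeholders.

LETTER_RE = re.compile(r"\b([A-J])\b")

def build_mmmu_prompt(item):
    """Combine the question with lettered options; images are passed separately."""
    option_lines = [
        f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(item["options"])
    ]
    return item["question"] + "\n" + "\n".join(option_lines) + "\nAnswer with the option letter."

def extract_letter(response, n_options):
    """Return the first in-range option letter found in the free-form response."""
    for match in LETTER_RE.finditer(response.upper()):
        if ord(match.group(1)) - ord("A") < n_options:
            return match.group(1)
    return None  # unparseable responses count as wrong

def evaluate(items, vlm_answer):
    correct = 0
    for item in items:
        response = vlm_answer(item["images"], build_mmmu_prompt(item))
        correct += extract_letter(response, len(item["options"])) == item["answer"]
    return correct / len(items)
```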
Subject Coverage
- Art & Design: Architecture, Fashion, Graphic Design
- Business: Accounting, Economics, Marketing
- Science: Biology, Chemistry, Physics, Geography
- Health & Medicine: Pharmacy, Public Health
- Humanities: History, Literature, Philosophy
- Social Science: Psychology, Sociology
- Tech & Engineering: Computer Science, Electronics, Mechanical Engineering
MMMU vs MMLU: Key Differences
MMLU (Text-only)
- Pure language understanding
- 57 subjects across all difficulty levels
- Four-option multiple choice throughout
- Tests factual knowledge & reasoning
- Human expert baseline: ~90%
MMMU (Multimodal)
- Vision-language understanding required
- 30 subjects, college-level focus
- Multiple question formats
- Tests visual reasoning & interpretation
- Human expert baseline: ~89%
Other Major Composite Benchmarks
AGIEval
Tests foundation models on human-centric standardized exams, including the SAT, LSAT, GRE, and China's national college entrance exam (Gaokao).
- 20 official exams
- Real-world assessment standards
- English and Chinese versions
- Human performance baselines
HellaSwag
Tests commonsense natural language inference with machine-generated, adversarially filtered sentence completions that models find difficult but humans find trivial.
- 70,000 multiple choice questions
- Adversarially filtered examples
- Human accuracy: ~95%
- Tests physical commonsense
ARC (AI2 Reasoning Challenge)
Science exam questions that require reasoning and world knowledge, designed to be challenging for statistical AI systems.
- Easy set: 5,197 questions
- Challenge set: 2,590 questions
- Grade 3-9 science level
- Multiple choice format
BIG-bench
Collaborative benchmark with 200+ tasks designed to probe capabilities and limitations of large language models.
- 200+ diverse tasks
- Tasks designed to stretch beyond current model capabilities
- Community contributions
- Focus on emergent abilities
Meta Llama Human Evaluation Framework
Evaluation Methodology
- Human preference evaluation on real-world prompts
- Side-by-side comparisons with other models
- Multiple evaluation criteria: Helpfulness, harmlessness, honesty
- Diverse prompt categories from actual usage
- Professional human evaluators with domain expertise
- Inter-annotator agreement validation (e.g., Cohen's kappa; see the sketch below)
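Human preference studies of this kind are usually summarized as win/tie/loss rates per category, with inter-annotator agreement checked before the numbers are trusted. The sketch below uses a made-up data layout to compute a win rate (ties counted as half a win) and Cohen's kappa between two annotators' pairwise judgments.

```python
from collections import Counter

# Sketch of summarizing side-by-side human preference judgments.
# Each judgment is "win", "tie", or "loss" for model A vs. model B;
# the data layout here is illustrative, not any lab's actual format.

def win_rate(judgments):
    """Win rate with ties counted as half a win (a common convention)."""
    counts = Counter(judgments)
    return (counts["win"] + 0.5 * counts["tie"]) / len(judgments)

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / n**2
    return (observed - expected) / (1 - expected)

annotator_1 = ["win", "win", "tie", "loss", "win", "tie"]
annotator_2 = ["win", "tie", "tie", "loss", "win", "win"]
print(f"win rate (annotator 1): {win_rate(annotator_1):.2f}")
print(f"inter-annotator kappa:  {cohens_kappa(annotator_1, annotator_2):.2f}")
```

Low kappa values are a signal to tighten the rating guidelines or prompt set before comparing win rates across models.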
Evaluation Categories
- Creative writing: Stories, poems, creative tasks
- Reasoning: Logic puzzles, mathematical problems
- Code generation: Programming tasks and debugging
- Information seeking: Q&A and research tasks
- Summarization: Text condensation tasks
- Instruction following: Complex multi-step tasks
Key Insights from Human Evaluation
Benchmark vs Human Preferences
- Automated metrics don't always correlate with human preferences
- Performance varies significantly with task context and prompt framing
- Safety considerations are crucial in real-world usage
- Task complexity affects evaluation reliability
Model Capabilities
- Creative tasks show high variance in quality
- Factual accuracy remains challenging
- Instruction following improves with model size
- Domain expertise varies across subjects
Implementation Best Practices
✅ Best Practices
- Use few-shot prompting with subject-specific examples
- Track per-subject and per-difficulty performance
- Report confidence intervals for statistical significance (see the reporting sketch after this list)
- Compare against human expert baselines
- Analyze error patterns across domains
- Consider domain transfer capabilities
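Several of these practices can be folded into a small reporting helper. The sketch below assumes a hypothetical per-question record layout and computes per-subject accuracy, a normal-approximation 95% confidence interval, and a flag for subjects whose lower bound does not clear the 25% random-guess baseline.

```python
import math
from collections import defaultdict

# Per-subject reporting sketch. Each result record is assumed to look like
# {"subject": str, "correct": bool}; adapt to your own logging format.

RANDOM_BASELINE = 0.25  # expected accuracy from guessing on 4-choice items

def summarize(results, z=1.96):
    by_subject = defaultdict(list)
    for r in results:
        by_subject[r["subject"]].append(r["correct"])

    report = {}
    for subject, outcomes in sorted(by_subject.items()):
        n = len(outcomes)
        acc = sum(outcomes) / n
        # Normal-approximation (Wald) 95% confidence interval on accuracy.
        half_width = z * math.sqrt(acc * (1 - acc) / n)
        lower = acc - half_width
        report[subject] = {
            "n": n,
            "accuracy": acc,
            "ci95": (max(0.0, lower), min(1.0, acc + half_width)),
            "above_random": lower > RANDOM_BASELINE,
        }
    return report

# Example: two subjects with very different sample sizes and accuracies.
results = (
    [{"subject": "college_physics", "correct": i % 3 != 0} for i in range(120)]
    + [{"subject": "professional_law", "correct": i % 4 == 0} for i in range(40)]
)
for subject, stats in summarize(results).items():
    print(subject, stats)
```

With only 100 questions in a subject, the interval is roughly plus or minus 10 percentage points at 50% accuracy, which is why aggregate-only reporting can hide statistically meaningless per-subject differences.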
❌ Common Pitfalls
- Focusing only on aggregate scores
- Ignoring subject-specific performance patterns
- Not accounting for the random baseline (25% for 4-choice questions)
- Overlooking prompt sensitivity
- Missing statistical significance testing
- Assuming performance on one benchmark transfers to another
Future Directions in Composite Evaluation
Dynamic Evaluation
- Adaptive testing based on performance
- Real-time question generation
- Preventing test set contamination
- Personalized difficulty adjustment
Multimodal Integration
- Video understanding benchmarks
- Audio-visual reasoning tasks
- 3D spatial reasoning evaluation
- Embodied AI assessment
Interactive Evaluation
- Multi-turn conversation assessment
- Tool use and external knowledge
- Collaborative problem solving
- Real-world task simulation
Specialized Domains
- Professional certification exams
- Cultural and linguistic diversity
- Domain-specific reasoning patterns
- Cross-cultural knowledge evaluation