MMLU & Composite Benchmarks
Comprehensive evaluation with MMLU, MMMU, AGIEval, and other composite benchmarks for measuring broad AI capabilities
What are Composite Benchmarks?
Composite benchmarks evaluate AI models across multiple domains and skill types in a single comprehensive assessment. Unlike specialized benchmarks that focus on specific capabilities, composite benchmarks like MMLU test broad knowledge and reasoning across subjects ranging from elementary mathematics to professional law, providing a holistic view of model capabilities and limitations.
MMLU: Massive Multitask Language Understanding
Dataset Characteristics
- 15,908 questions across 57 subjects
- Four-option multiple-choice format
- 4 difficulty levels: Elementary, High School, College, Professional
- Covers 4 major areas: STEM, Humanities, Social Sciences, Other
- Few-shot evaluation with 5 examples per subject (prompt format sketched below)
- Human expert baseline for comparison
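MMLU is typically scored by building a 5-shot prompt from each subject's dev examples and checking the first answer letter the model produces. The Python sketch below follows that convention; `ask_model` and the example dictionaries (with `question`, `choices`, and integer `answer` fields) are placeholders for whatever model client and data loader you use (e.g., the `cais/mmlu` dataset on Hugging Face).

```python
# Minimal sketch of 5-shot MMLU prompting and scoring.
# `ask_model` and the example dicts are placeholders for your model client
# and data loader; field names follow the common question/choices/answer layout.

LETTERS = "ABCD"

def format_example(ex, include_answer=True):
    """Render one MMLU item as question, lettered choices, and an Answer line."""
    lines = [ex["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(ex["choices"])]
    answer = f" {LETTERS[ex['answer']]}" if include_answer else ""
    return "\n".join(lines) + f"\nAnswer:{answer}\n"

def build_prompt(subject, dev_examples, test_example):
    """5-shot prompt: instruction header, five solved dev items, then the test item."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n".join(format_example(ex) for ex in dev_examples[:5])
    return header + shots + "\n" + format_example(test_example, include_answer=False)

def score_subject(subject, dev_examples, test_examples, ask_model):
    """Accuracy over one subject; `ask_model(prompt)` should return text starting with A-D."""
    correct = 0
    for ex in test_examples:
        prediction = ask_model(build_prompt(subject, dev_examples, ex)).strip()[:1]
        correct += prediction == LETTERS[ex["answer"]]
    return correct / len(test_examples)
```

Per-subject accuracies are then aggregated over all 57 subjects to produce the headline MMLU number.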
Sample MMLU Questions
Question: Which of the following best explains why water has a higher boiling point than hydrogen sulfide?
Question: Let f(x) = x³ - 3x + 1. What is the number of real roots of f(x) = 0?
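For reference, the chemistry item hinges on hydrogen bonding (water forms strong intermolecular hydrogen bonds that hydrogen sulfide largely lacks), and the calculus item can be settled with a short sign analysis of the cubic:

```latex
f(x) = x^3 - 3x + 1, \qquad f'(x) = 3x^2 - 3 = 3(x - 1)(x + 1)
% Critical points at x = -1 and x = 1:
f(-1) = -1 + 3 + 1 = 3 > 0 \quad (\text{local maximum}), \qquad
f(1) = 1 - 3 + 1 = -1 < 0 \quad (\text{local minimum})
% A positive local maximum followed by a negative local minimum means the
% graph crosses the x-axis three times, so f(x) = 0 has exactly 3 real roots.
```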
Reported MMLU scores (overall and by major category):
| Model | MMLU Score | STEM | Humanities | Social Sci |
|---|---|---|---|---|
| GPT-4 | 86.4% | 83.3% | 85.2% | 91.1% |
| Claude-3 Opus | 86.8% | 85.2% | 86.4% | 89.7% |
| Gemini Ultra | 83.7% | 80.9% | 84.2% | 88.1% |
| Human Expert | 89.8% | 87.5% | 91.2% | 92.8% |
MMMU: Massive Multi-discipline Multimodal Understanding
Multimodal Extension
- 11,500 multimodal questions across 30 subjects
- Images required for answering questions
- 6 question types: Multiple choice, true/false, fill-in-blank, etc.
- College-level difficulty across all subjects
- Vision-language integration essential for success (evaluation loop sketched below)
- Expert validation of question quality
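Scoring MMMU-style items means handing both the referenced image(s) and the question text to a vision-language model, then parsing an option letter out of free-form output. The sketch below is schematic: `vlm_answer` and the item fields (`question`, `options`, `images`, `answer`) are hypothetical stand-ins for your multimodal client and data loader.

```python
import re

# Schematic MMMU-style evaluation loop. `vlm_answer(images, prompt)` is a
# hypothetical stand-in for a multimodal model client; the item fields used
# here are likewise placeholders.

LETTER_RE = re.compile(r"\b([A-J])\b")

def build_mmmu_prompt(item):
    """Combine the question with lettered options; images are passed separately."""
    option_lines = [
        f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(item["options"])
    ]
    return item["question"] + "\n" + "\n".join(option_lines) + "\nAnswer with the option letter."

def extract_letter(response, n_options):
    """Return the first in-range option letter found in the free-form response."""
    for match in LETTER_RE.finditer(response.upper()):
        if ord(match.group(1)) - ord("A") < n_options:
            return match.group(1)
    return None  # unparseable responses count as wrong

def evaluate(items, vlm_answer):
    correct = 0
    for item in items:
        response = vlm_answer(item["images"], build_mmmu_prompt(item))
        correct += extract_letter(response, len(item["options"])) == item["answer"]
    return correct / len(items)
```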
Subject Coverage
- Art & Design: Architecture, Fashion, Graphic Design
- Business: Accounting, Economics, Marketing
- Science: Biology, Chemistry, Physics, Geography
- Health & Medicine: Pharmacy, Public Health
- Humanities: History, Literature, Philosophy
- Social Science: Psychology, Sociology
- Tech & Engineering: Computer Science, Electronics, Mechanical Engineering
MMMU vs MMLU: Key Differences
MMLU (Text-only)
- Pure language understanding
- 57 subjects across all difficulty levels
- Four-option multiple choice throughout
- Tests factual knowledge & reasoning
- Human expert baseline: ~90%
MMMU (Multimodal)
- Vision-language understanding required
- 30 subjects, college-level focus
- Multiple question formats
- Tests visual reasoning & interpretation
- Human expert baseline: ~89%
Other Major Composite Benchmarks
AGIEval
Tests foundation models on human-centric standardized exams, including the SAT, LSAT, GRE, and China's national college entrance exam (Gaokao).
- 20 official exams
- Real-world assessment standards
- English and Chinese versions
- Human performance baselines
HellaSwag
Tests commonsense natural language inference with machine-generated, adversarially filtered sentence completions that models find difficult but humans find trivial.
- 70,000 multiple choice questions
- Adversarially filtered examples
- Human accuracy: ~95%
- Tests physical commonsense
ARC (AI2 Reasoning Challenge)
Science exam questions that require reasoning and world knowledge, designed to be challenging for statistical AI systems.
- Easy set: 5,197 questions
- Challenge set: 2,590 questions
- Grade 3-9 science level
- Multiple choice format
BIG-bench
Collaborative benchmark with 200+ tasks designed to probe capabilities and limitations of large language models.
- 200+ diverse tasks
- Tasks designed to stretch beyond current model capabilities
- Community contributions
- Focus on emergent abilities
Meta Llama Human Evaluation Framework
Evaluation Methodology
- Human preference evaluation on real-world prompts
- Side-by-side comparisons with other models
- Multiple evaluation criteria: Helpfulness, harmlessness, honesty
- Diverse prompt categories from actual usage
- Professional human evaluators with domain expertise
- Inter-annotator agreement validation (e.g., Cohen's kappa; see the sketch below)
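Human preference studies of this kind are usually summarized as win/tie/loss rates per category, with inter-annotator agreement checked before the numbers are trusted. The sketch below uses a made-up data layout to compute a win rate (ties counted as half a win) and Cohen's kappa between two annotators' pairwise judgments.

```python
from collections import Counter

# Sketch of summarizing side-by-side human preference judgments.
# Each judgment is "win", "tie", or "loss" for model A vs. model B;
# the data layout here is illustrative, not any lab's actual format.

def win_rate(judgments):
    """Win rate with ties counted as half a win (a common convention)."""
    counts = Counter(judgments)
    return (counts["win"] + 0.5 * counts["tie"]) / len(judgments)

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / n**2
    return (observed - expected) / (1 - expected)

annotator_1 = ["win", "win", "tie", "loss", "win", "tie"]
annotator_2 = ["win", "tie", "tie", "loss", "win", "win"]
print(f"win rate (annotator 1): {win_rate(annotator_1):.2f}")
print(f"inter-annotator kappa:  {cohens_kappa(annotator_1, annotator_2):.2f}")
```

Low kappa values are a signal to tighten the rating guidelines or prompt set before comparing win rates across models.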
Evaluation Categories
- Creative writing: Stories, poems, creative tasks
- Reasoning: Logic puzzles, mathematical problems
- Code generation: Programming tasks and debugging
- Information seeking: Q&A and research tasks
- Summarization: Text condensation tasks
- Instruction following: Complex multi-step tasks
Key Insights from Human Evaluation
Benchmark vs Human Preferences
- Automated metrics don't always correlate with human preferences
- Performance varies significantly with task context and prompt framing
- Safety considerations are crucial in real-world usage
- Task complexity affects evaluation reliability
Model Capabilities
- Creative tasks show high variance in quality
- Factual accuracy remains challenging
- Instruction following improves with model size
- Domain expertise varies across subjects
Implementation Best Practices
✅ Best Practices
- Use few-shot prompting with subject-specific examples
- Track per-subject and per-difficulty performance
- Report confidence intervals for statistical significance (see the reporting sketch after this list)
- Compare against human expert baselines
- Analyze error patterns across domains
- Consider domain transfer capabilities
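Several of these practices can be folded into a small reporting helper. The sketch below assumes a hypothetical per-question record layout and computes per-subject accuracy, a normal-approximation 95% confidence interval, and a flag for subjects whose lower bound does not clear the 25% random-guess baseline.

```python
import math
from collections import defaultdict

# Per-subject reporting sketch. Each result record is assumed to look like
# {"subject": str, "correct": bool}; adapt to your own logging format.

RANDOM_BASELINE = 0.25  # expected accuracy from guessing on 4-choice items

def summarize(results, z=1.96):
    by_subject = defaultdict(list)
    for r in results:
        by_subject[r["subject"]].append(r["correct"])

    report = {}
    for subject, outcomes in sorted(by_subject.items()):
        n = len(outcomes)
        acc = sum(outcomes) / n
        # Normal-approximation (Wald) 95% confidence interval on accuracy.
        half_width = z * math.sqrt(acc * (1 - acc) / n)
        lower = acc - half_width
        report[subject] = {
            "n": n,
            "accuracy": acc,
            "ci95": (max(0.0, lower), min(1.0, acc + half_width)),
            "above_random": lower > RANDOM_BASELINE,
        }
    return report

# Example: two subjects with very different sample sizes and accuracies.
results = (
    [{"subject": "college_physics", "correct": i % 3 != 0} for i in range(120)]
    + [{"subject": "professional_law", "correct": i % 4 == 0} for i in range(40)]
)
for subject, stats in summarize(results).items():
    print(subject, stats)
```

With only 100 questions in a subject, the interval is roughly plus or minus 10 percentage points at 50% accuracy, which is why aggregate-only reporting can hide statistically meaningless per-subject differences.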
❌ Common Pitfalls
- Focusing only on aggregate scores
- Ignoring subject-specific performance patterns
- Not accounting for the random baseline (25% for 4-choice questions)
- Overlooking prompt sensitivity
- Missing statistical significance testing
- Assuming performance on one benchmark transfers to another
Future Directions in Composite Evaluation
Dynamic Evaluation
- Adaptive testing based on performance
- Real-time question generation
- Preventing test set contamination
- Personalized difficulty adjustment
Multimodal Integration
- Video understanding benchmarks
- Audio-visual reasoning tasks
- 3D spatial reasoning evaluation
- Embodied AI assessment
Interactive Evaluation
- Multi-turn conversation assessment
- Tool use and external knowledge
- Collaborative problem solving
- Real-world task simulation
Specialized Domains
- Professional certification exams
- Cultural and linguistic diversity
- Domain-specific reasoning patterns
- Cross-cultural knowledge evaluation