Common Sense Reasoning: PIQA, HellaSwag & More
Master common sense reasoning benchmarks and evaluation methodologies
What is Common Sense Reasoning?
Common sense reasoning evaluates an AI model's ability to understand and apply everyday knowledge that humans take for granted. These benchmarks test whether models can make logical inferences about physical properties, cause-and-effect relationships, and practical scenarios without explicit training on specific facts.
Modern benchmarks like PIQA, HellaSwag, and WinoGrande challenge models to demonstrate understanding of physical interactions, situation modeling, and contextual reasoning that requires both linguistic and world knowledge integration.
PIQA: Physical Interaction QA
Benchmark Overview
Dataset Statistics
- • Questions: 16,113 training, 1,838 validation
- • Format: Goal + two solution methods
- • Task: Choose physically correct solution
- • Human Performance: 94.9%
- • GPT-3 Performance: 82.8%
Example Question
A representative item (paraphrased from the PIQA dataset):
Goal: Separate the egg white from the yolk using an empty water bottle.
- • Solution A: Squeeze the bottle, press its mouth against the yolk, then release; the suction pulls the yolk into the bottle. ✓
- • Solution B: Press the bottle's mouth against the yolk and keep squeezing until the yolk is pushed inside.
Only Solution A is physically plausible, which is exactly the kind of everyday physical knowledge PIQA probes.
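A common way to evaluate open-weights models on PIQA-style items is to score each candidate solution by the log-likelihood the model assigns to it and pick the higher-scoring one. The sketch below is a minimal illustration of that scoring, assuming a Hugging Face causal LM ("gpt2" is just a placeholder) and the paraphrased item above; it is not the official PIQA evaluation harness.

```python
# Minimal sketch: score each PIQA solution by the log-likelihood a causal LM
# assigns to it, conditioned on the goal, and pick the higher-scoring option.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: swap in the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`.
    Note: BPE merging at the context/continuation boundary can shift the split
    by a token; that approximation is acceptable for a sketch."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position i predict token i+1, so shift by one position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, ctx_len:]                      # continuation tokens
    scores = log_probs[ctx_len - 1:].gather(1, targets.unsqueeze(1))
    return scores.sum().item()

# PIQA-style item: a goal and two candidate solutions (paraphrased).
goal = "To separate egg whites from the yolk using a water bottle, you should"
solutions = [
    " squeeze the bottle, press it against the yolk, then release so suction lifts the yolk.",
    " press the bottle against the yolk and keep squeezing until the yolk is pushed inside.",
]
scores = [continuation_logprob(goal, s) for s in solutions]
print("Model prefers solution", scores.index(max(scores)))
```

The same loop generalizes directly to HellaSwag (four endings) and OpenBookQA (four options): score every option and take the argmax, optionally normalizing by option length.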
HellaSwag: Commonsense NLI
Adversarial Filtering Approach
HellaSwag builds its wrong answers with Adversarial Filtering (AF): a language model generates candidate endings, and a discriminator iteratively filters them so that only endings which still fool the discriminator, yet read as obviously wrong to humans, are kept. This produces far harder distractors than random or naively sampled alternatives.
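A simplified, per-example sketch of the AF loop follows. The real procedure operates over the entire dataset, regenerating endings with a finetuned language model and retraining a BERT-style discriminator each round; here both components are stubbed out with toy stand-ins so the sketch runs end to end.

```python
# Sketch of the Adversarial Filtering (AF) idea behind HellaSwag's distractors.
# `generate_endings` and `train_discriminator` are placeholders for a generator
# LM and a real-vs-generated classifier; toy stand-ins keep the sketch runnable.
import random
from typing import Callable, List, Tuple

def adversarial_filter(
    context: str,
    real_ending: str,
    generate_endings: Callable[[str, int], List[str]],
    train_discriminator: Callable[[List[Tuple[str, str, int]]], Callable[[str, str], float]],
    n_distractors: int = 3,
    n_rounds: int = 5,
    pool_size: int = 32,
) -> List[str]:
    """Return machine-written endings the discriminator struggles to flag as fake."""
    distractors = generate_endings(context, n_distractors)
    for _ in range(n_rounds):
        # Retrain a discriminator on the current real (label 1) vs. fake (label 0) split.
        labeled = [(context, real_ending, 1)] + [(context, d, 0) for d in distractors]
        score_fake = train_discriminator(labeled)  # higher score = "looks machine-written"
        # Pool current distractors with fresh candidates and keep the n endings
        # the discriminator finds hardest to flag (i.e., most human-looking).
        pool = distractors + generate_endings(context, pool_size)
        pool.sort(key=lambda e: score_fake(context, e))
        distractors = pool[:n_distractors]
    return distractors

# Toy stand-ins so the sketch runs end to end; real AF uses a finetuned LM
# generator and a BERT-style discriminator trained over the whole dataset.
def toy_generator(context: str, n: int) -> List[str]:
    return [f"then something unrelated happens ({random.randint(0, 999)})." for _ in range(n)]

def toy_trainer(labeled: List[Tuple[str, str, int]]) -> Callable[[str, str], float]:
    return lambda ctx, ending: random.random()

hard_negatives = adversarial_filter(
    "A woman is giving her dog a bath in the yard.",
    "The dog shakes off the water and runs away again.",
    toy_generator,
    toy_trainer,
)
print(hard_negatives)
```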
- • Dataset Size: ~70K examples, with contexts drawn from ActivityNet captions and WikiHow articles
- • Human Accuracy: 95.6%
- • Task Format: 4-way multiple choice; pick the most plausible continuation of a short context
Example HellaSwag Question
A representative item (adapted from the HellaSwag paper's running example):
Context: "A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She..."
- • A) rinses the bucket off with soap and blow-dries the dog's head.
- • B) uses a hose to keep it from getting soapy.
- • C) gets the dog wet, then it runs away again. ✓
- • D) gets into a bathtub with the dog.
Only ending C is consistent with the scene; the other endings are machine-generated and adversarially filtered to read fluently while remaining implausible.
WinoGrande & Related Benchmarks
WinoGrande
A scaled-up version of the Winograd Schema Challenge with ~44K problems testing pronoun resolution that requires commonsense reasoning.
- • Pronoun resolution tasks
- • Requires world knowledge
- • Binary choice format
- • Human: 94.0% accuracy
OpenBookQA
Elementary-level science questions requiring multi-step reasoning with an open book of scientific facts.
- • Science reasoning
- • Multi-hop inference
- • 4-way multiple choice
- • Human: 92.0% accuracy
CommonsenseQA
Multiple-choice questions built from ConceptNet relations, requiring diverse types of commonsense knowledge to solve.
- • ConceptNet-grounded questions
- • 5-way multiple choice
- • ~12K questions
- • Human: 88.9% accuracy
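All five benchmarks in this section are mirrored on the Hugging Face Hub, which makes it convenient to evaluate them under a single harness. The dataset IDs and configs below are the commonly used Hub aliases and should be treated as assumptions; they can vary between datasets-library versions, and some script-based datasets (e.g., PIQA) may additionally require trust_remote_code=True.

```python
# Sketch: pulling validation splits from the Hugging Face Hub.
# The IDs/configs below are assumptions (common Hub aliases), not official names.
from datasets import load_dataset

benchmarks = {
    "piqa": load_dataset("piqa", split="validation"),
    "hellaswag": load_dataset("hellaswag", split="validation"),
    "winogrande": load_dataset("winogrande", "winogrande_xl", split="validation"),
    "openbookqa": load_dataset("openbookqa", "main", split="validation"),
    "commonsense_qa": load_dataset("commonsense_qa", split="validation"),
}

for name, ds in benchmarks.items():
    print(f"{name}: {len(ds)} examples, columns: {ds.column_names}")
```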
Comprehensive Evaluation Framework
Model Performance Analysis
| Model | PIQA | HellaSwag | WinoGrande | OpenBookQA | CommonsenseQA | Average |
|---|---|---|---|---|---|---|
| Human Performance | 94.9% | 95.6% | 94.0% | 92.0% | 88.9% | 93.1% |
| GPT-4 | 86.4% | 87.8% | 83.1% | 86.0% | 83.8% | 85.4% |
| Claude-3 | 85.1% | 85.9% | 81.6% | 84.4% | 82.3% | 83.9% |
| GPT-3.5-Turbo | 78.2% | 79.4% | 71.3% | 72.8% | 76.5% | 75.6% |
| Llama-2-70B | 82.8% | 84.2% | 80.2% | 79.1% | 78.7% | 81.0% |
| Random Baseline | 50.0% | 25.0% | 50.0% | 25.0% | 20.0% | 34.0% |
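The Average column and the human gap cited below can be recomputed directly from the per-benchmark numbers. The short sketch hard-codes the table values (benchmark order: PIQA, HellaSwag, WinoGrande, OpenBookQA, CommonsenseQA) purely to make the aggregation reproducible.

```python
# Recompute the "Average" column and each model's gap to the human average
# from the table above (scores in percent, benchmark order as in the table).
scores = {
    "Human Performance": [94.9, 95.6, 94.0, 92.0, 88.9],
    "GPT-4":             [86.4, 87.8, 83.1, 86.0, 83.8],
    "Claude-3":          [85.1, 85.9, 81.6, 84.4, 82.3],
    "GPT-3.5-Turbo":     [78.2, 79.4, 71.3, 72.8, 76.5],
    "Llama-2-70B":       [82.8, 84.2, 80.2, 79.1, 78.7],
}
human_avg = sum(scores["Human Performance"]) / len(scores["Human Performance"])

for model, vals in scores.items():
    avg = sum(vals) / len(vals)
    print(f"{model:18s} average = {avg:.1f}%   gap to human = {human_avg - avg:.1f} pts")
```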
Key Insights
- • Frontier models still trail humans by roughly 8-9 points on average
- • WinoGrande shows the largest human-model gap
- • Pronoun resolution requiring world knowledge remains difficult
Improvement Areas
- • Physical world understanding
- • Temporal reasoning
- • Causal relationships
Best Practices
- • Few-shot prompting
- • Chain-of-thought reasoning (see the prompt sketch below)
- • Knowledge integration
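The sketch below shows one way to combine few-shot exemplars with chain-of-thought rationales into a prompt for a commonsense multiple-choice item. The exemplar, questions, and helper function are illustrative only, not any benchmark's official prompt format.

```python
# Sketch: building a few-shot, chain-of-thought prompt for a commonsense
# multiple-choice item. The exemplar and helper are illustrative only.
FEW_SHOT_EXEMPLARS = [
    {
        "question": "Where would you put a plant so it gets the most sunlight?",
        "choices": ["basement", "windowsill", "closet"],
        "reasoning": "Plants need light; of the options, only a windowsill is exposed to the sun.",
        "answer": "windowsill",
    },
]

def build_prompt(question: str, choices: list[str]) -> str:
    parts = []
    for ex in FEW_SHOT_EXEMPLARS:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Choices: {', '.join(ex['choices'])}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # The target item ends after "Reasoning:" so the model writes its own
    # rationale before committing to an answer.
    parts.append(f"Question: {question}\nChoices: {', '.join(choices)}\nReasoning:")
    return "\n".join(parts)

print(build_prompt(
    "What do people typically use to dry their hands after washing them?",
    ["towel", "oven", "lamp"],
))
```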
Implementation Best Practices
✅ DO
- • Use stratified sampling for balanced evaluation
- • Score answer options by model confidence (e.g., per-option log-likelihood)
- • Apply consistent preprocessing across benchmarks
- • Report both accuracy and detailed error analysis
- • Use appropriate prompting strategies for each task
❌ DON'T
- • Cherry-pick favorable evaluation subsets
- • Ignore human performance baselines
- • Mix different benchmark versions
- • Over-optimize prompts on test data
- • Report single metrics without error bars (see the bootstrap sketch below)
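As a concrete alternative to single-number reporting, the sketch below attaches a percentile-bootstrap confidence interval to a benchmark accuracy. The simulated 0/1 outcomes stand in for the per-example results of a real evaluation run.

```python
# Sketch: reporting accuracy with a percentile-bootstrap confidence interval
# instead of a single point estimate. `correct` holds per-example 0/1 outcomes.
import random

def bootstrap_accuracy_ci(correct, n_resamples=2_000, alpha=0.05, seed=0):
    """Mean accuracy plus a (1 - alpha) percentile bootstrap CI."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(correct[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, (lo, hi)

# Simulated outcomes for 1,838 PIQA validation items at ~83% accuracy.
rng = random.Random(42)
outcomes = [1 if rng.random() < 0.83 else 0 for _ in range(1838)]
acc, (lo, hi) = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {acc:.3f}   95% CI = [{lo:.3f}, {hi:.3f}]")
```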