Common Sense Reasoning: PIQA, HellaSwag & More
Master common sense reasoning benchmarks and evaluation methodologies
What is Common Sense Reasoning?
Common sense reasoning evaluates an AI model's ability to understand and apply everyday knowledge that humans take for granted. These benchmarks test whether models can make logical inferences about physical properties, cause-and-effect relationships, and practical scenarios without explicit training on specific facts.
Modern benchmarks like PIQA, HellaSwag, and WinoGrande challenge models to demonstrate understanding of physical interactions, situation modeling, and contextual reasoning that requires both linguistic and world knowledge integration.
PIQA: Physical Interaction QA
Benchmark Overview
Dataset Statistics
- • Questions: 16,113 training, 1,838 validation
- • Format: Goal + two solution methods
- • Task: Choose physically correct solution
- • Human Performance: 94.9%
- • GPT-3 Performance: 82.8%
Example Question
A representative item (paraphrased from the PIQA dataset):
Goal: Separate the egg white from the yolk using an empty water bottle.
- • Solution A: Squeeze the bottle, press its mouth against the yolk, then release; the suction pulls the yolk into the bottle. ✓
- • Solution B: Press the bottle's mouth against the yolk and keep squeezing until the yolk is pushed inside.
Only Solution A is physically plausible, which is exactly the kind of everyday physical knowledge PIQA probes.
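A common way to evaluate open-weights models on PIQA-style items is to score each candidate solution by the log-likelihood the model assigns to it and pick the higher-scoring one. The sketch below is a minimal illustration of that scoring, assuming a Hugging Face causal LM ("gpt2" is just a placeholder) and the paraphrased item above; it is not the official PIQA evaluation harness.

```python
# Minimal sketch: score each PIQA solution by the log-likelihood a causal LM
# assigns to it, conditioned on the goal, and pick the higher-scoring option.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: swap in the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`.
    Note: BPE merging at the context/continuation boundary can shift the split
    by a token; that approximation is acceptable for a sketch."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position i predict token i+1, so shift by one position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, ctx_len:]                      # continuation tokens
    scores = log_probs[ctx_len - 1:].gather(1, targets.unsqueeze(1))
    return scores.sum().item()

# PIQA-style item: a goal and two candidate solutions (paraphrased).
goal = "To separate egg whites from the yolk using a water bottle, you should"
solutions = [
    " squeeze the bottle, press it against the yolk, then release so suction lifts the yolk.",
    " press the bottle against the yolk and keep squeezing until the yolk is pushed inside.",
]
scores = [continuation_logprob(goal, s) for s in solutions]
print("Model prefers solution", scores.index(max(scores)))
```

The same loop generalizes directly to HellaSwag (four endings) and OpenBookQA (four options): score every option and take the argmax, optionally normalizing by option length.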
HellaSwag: Commonsense NLI
Adversarial Filtering Approach
HellaSwag builds its wrong answers with Adversarial Filtering (AF): a language model generates candidate endings, and a discriminator iteratively filters them so that only endings which still fool the discriminator, yet read as obviously wrong to humans, are kept. This produces far harder distractors than random or naively sampled alternatives.
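A simplified, per-example sketch of the AF loop follows. The real procedure operates over the entire dataset, regenerating endings with a finetuned language model and retraining a BERT-style discriminator each round; here both components are stubbed out with toy stand-ins so the sketch runs end to end.

```python
# Sketch of the Adversarial Filtering (AF) idea behind HellaSwag's distractors.
# `generate_endings` and `train_discriminator` are placeholders for a generator
# LM and a real-vs-generated classifier; toy stand-ins keep the sketch runnable.
import random
from typing import Callable, List, Tuple

def adversarial_filter(
    context: str,
    real_ending: str,
    generate_endings: Callable[[str, int], List[str]],
    train_discriminator: Callable[[List[Tuple[str, str, int]]], Callable[[str, str], float]],
    n_distractors: int = 3,
    n_rounds: int = 5,
    pool_size: int = 32,
) -> List[str]:
    """Return machine-written endings the discriminator struggles to flag as fake."""
    distractors = generate_endings(context, n_distractors)
    for _ in range(n_rounds):
        # Retrain a discriminator on the current real (label 1) vs. fake (label 0) split.
        labeled = [(context, real_ending, 1)] + [(context, d, 0) for d in distractors]
        score_fake = train_discriminator(labeled)  # higher score = "looks machine-written"
        # Pool current distractors with fresh candidates and keep the n endings
        # the discriminator finds hardest to flag (i.e., most human-looking).
        pool = distractors + generate_endings(context, pool_size)
        pool.sort(key=lambda e: score_fake(context, e))
        distractors = pool[:n_distractors]
    return distractors

# Toy stand-ins so the sketch runs end to end; real AF uses a finetuned LM
# generator and a BERT-style discriminator trained over the whole dataset.
def toy_generator(context: str, n: int) -> List[str]:
    return [f"then something unrelated happens ({random.randint(0, 999)})." for _ in range(n)]

def toy_trainer(labeled: List[Tuple[str, str, int]]) -> Callable[[str, str], float]:
    return lambda ctx, ending: random.random()

hard_negatives = adversarial_filter(
    "A woman is giving her dog a bath in the yard.",
    "The dog shakes off the water and runs away again.",
    toy_generator,
    toy_trainer,
)
print(hard_negatives)
```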
- • Dataset Size: ~70K examples, with contexts drawn from ActivityNet captions and WikiHow articles
- • Human Accuracy: 95.6%
- • Task Format: 4-way multiple choice; pick the most plausible continuation of a short context
Example HellaSwag Question
A representative item (adapted from the HellaSwag paper's running example):
Context: "A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She..."
- • A) rinses the bucket off with soap and blow-dries the dog's head.
- • B) uses a hose to keep it from getting soapy.
- • C) gets the dog wet, then it runs away again. ✓
- • D) gets into a bathtub with the dog.
Only ending C is consistent with the scene; the other endings are machine-generated and adversarially filtered to read fluently while remaining implausible.
WinoGrande & Related Benchmarks
WinoGrande
A scaled-up version of the Winograd Schema Challenge with ~44K problems testing pronoun resolution that requires commonsense reasoning.
- • Pronoun resolution tasks
- • Requires world knowledge
- • Binary choice format
- • Human: 94.0% accuracy
OpenBookQA
Elementary-level science questions requiring multi-step reasoning with an open book of scientific facts.
- • Science reasoning
- • Multi-hop inference
- • 4-way multiple choice
- • Human: 92.0% accuracy
CommonsenseQA
Multiple-choice questions built from ConceptNet relations, requiring diverse types of commonsense knowledge to solve.
- • ConceptNet-grounded questions
- • 5-way multiple choice
- • ~12K questions
- • Human: 88.9% accuracy
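All five benchmarks in this section are mirrored on the Hugging Face Hub, which makes it convenient to evaluate them under a single harness. The dataset IDs and configs below are the commonly used Hub aliases and should be treated as assumptions; they can vary between datasets-library versions, and some script-based datasets (e.g., PIQA) may additionally require trust_remote_code=True.

```python
# Sketch: pulling validation splits from the Hugging Face Hub.
# The IDs/configs below are assumptions (common Hub aliases), not official names.
from datasets import load_dataset

benchmarks = {
    "piqa": load_dataset("piqa", split="validation"),
    "hellaswag": load_dataset("hellaswag", split="validation"),
    "winogrande": load_dataset("winogrande", "winogrande_xl", split="validation"),
    "openbookqa": load_dataset("openbookqa", "main", split="validation"),
    "commonsense_qa": load_dataset("commonsense_qa", split="validation"),
}

for name, ds in benchmarks.items():
    print(f"{name}: {len(ds)} examples, columns: {ds.column_names}")
```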
Comprehensive Evaluation Framework
Model Performance Analysis
| Model | PIQA | HellaSwag | WinoGrande | OpenBookQA | CommonsenseQA | Average |
|---|---|---|---|---|---|---|
| Human Performance | 94.9% | 95.6% | 94.0% | 92.0% | 88.9% | 93.1% |
| GPT-4 | 86.4% | 87.8% | 83.1% | 86.0% | 83.8% | 85.4% |
| Claude-3 | 85.1% | 85.9% | 81.6% | 84.4% | 82.3% | 83.9% |
| GPT-3.5-Turbo | 78.2% | 79.4% | 71.3% | 72.8% | 76.5% | 75.6% |
| Llama-2-70B | 82.8% | 84.2% | 80.2% | 79.1% | 78.7% | 81.0% |
| Random Baseline | 50.0% | 25.0% | 50.0% | 25.0% | 20.0% | 34.0% |
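The Average column and the human gap cited below can be recomputed directly from the per-benchmark numbers. The short sketch hard-codes the table values (benchmark order: PIQA, HellaSwag, WinoGrande, OpenBookQA, CommonsenseQA) purely to make the aggregation reproducible.

```python
# Recompute the "Average" column and each model's gap to the human average
# from the table above (scores in percent, benchmark order as in the table).
scores = {
    "Human Performance": [94.9, 95.6, 94.0, 92.0, 88.9],
    "GPT-4":             [86.4, 87.8, 83.1, 86.0, 83.8],
    "Claude-3":          [85.1, 85.9, 81.6, 84.4, 82.3],
    "GPT-3.5-Turbo":     [78.2, 79.4, 71.3, 72.8, 76.5],
    "Llama-2-70B":       [82.8, 84.2, 80.2, 79.1, 78.7],
}
human_avg = sum(scores["Human Performance"]) / len(scores["Human Performance"])

for model, vals in scores.items():
    avg = sum(vals) / len(vals)
    print(f"{model:18s} average = {avg:.1f}%   gap to human = {human_avg - avg:.1f} pts")
```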
Key Insights
- • Frontier models still trail humans by roughly 8-9 points on average
- • WinoGrande shows the largest human-model gap
- • Pronoun resolution requiring world knowledge remains difficult
Improvement Areas
- • Physical world understanding
- • Temporal reasoning
- • Causal relationships
Best Practices
- • Few-shot prompting
- • Chain-of-thought reasoning (see the prompt sketch below)
- • Knowledge integration
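The sketch below shows one way to combine few-shot exemplars with chain-of-thought rationales into a prompt for a commonsense multiple-choice item. The exemplar, questions, and helper function are illustrative only, not any benchmark's official prompt format.

```python
# Sketch: building a few-shot, chain-of-thought prompt for a commonsense
# multiple-choice item. The exemplar and helper are illustrative only.
FEW_SHOT_EXEMPLARS = [
    {
        "question": "Where would you put a plant so it gets the most sunlight?",
        "choices": ["basement", "windowsill", "closet"],
        "reasoning": "Plants need light; of the options, only a windowsill is exposed to the sun.",
        "answer": "windowsill",
    },
]

def build_prompt(question: str, choices: list[str]) -> str:
    parts = []
    for ex in FEW_SHOT_EXEMPLARS:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Choices: {', '.join(ex['choices'])}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # The target item ends after "Reasoning:" so the model writes its own
    # rationale before committing to an answer.
    parts.append(f"Question: {question}\nChoices: {', '.join(choices)}\nReasoning:")
    return "\n".join(parts)

print(build_prompt(
    "What do people typically use to dry their hands after washing them?",
    ["towel", "oven", "lamp"],
))
```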
Implementation Best Practices
✅ DO
- • Use stratified sampling for balanced evaluation
- • Score answer options by model confidence (e.g., per-option log-likelihood)
- • Apply consistent preprocessing across benchmarks
- • Report both accuracy and detailed error analysis
- • Use appropriate prompting strategies for each task
❌ DON'T
- • Cherry-pick favorable evaluation subsets
- • Ignore human performance baselines
- • Mix different benchmark versions
- • Over-optimize prompts on test data
- • Report single metrics without error bars (see the bootstrap sketch below)
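As a concrete alternative to single-number reporting, the sketch below attaches a percentile-bootstrap confidence interval to a benchmark accuracy. The simulated 0/1 outcomes stand in for the per-example results of a real evaluation run.

```python
# Sketch: reporting accuracy with a percentile-bootstrap confidence interval
# instead of a single point estimate. `correct` holds per-example 0/1 outcomes.
import random

def bootstrap_accuracy_ci(correct, n_resamples=2_000, alpha=0.05, seed=0):
    """Mean accuracy plus a (1 - alpha) percentile bootstrap CI."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(correct[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, (lo, hi)

# Simulated outcomes for 1,838 PIQA validation items at ~83% accuracy.
rng = random.Random(42)
outcomes = [1 if rng.random() < 0.83 else 0 for _ in range(1838)]
acc, (lo, hi) = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {acc:.3f}   95% CI = [{lo:.3f}, {hi:.3f}]")
```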