Apache Parquet

Master Apache Parquet: columnar storage format, compression, schema evolution, and analytics optimization.

What is Apache Parquet?

Apache Parquet is a columnar storage file format optimized for analytics workloads in big data processing. Developed as part of the Apache Hadoop ecosystem, Parquet provides efficient data compression and encoding schemes that significantly improve query performance and reduce storage costs for analytical applications.

Unlike row-based formats that store data record by record, Parquet organizes data by column, enabling better compression ratios, column pruning, and predicate pushdown optimizations. It's widely adopted by modern data processing engines like Apache Spark, Presto, and Amazon Athena.
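
To make the row-versus-column distinction concrete, here is a toy sketch in plain Python. The record fields (`user_id`, `country`, `amount`) and the helper `to_columns` are invented for illustration, and real Parquet stores encoded, compressed pages rather than Python lists. The point is only that an aggregate over one column never has to touch the others.

```python
# Toy illustration only: field names and to_columns() are hypothetical.
rows = [
    {"user_id": 1, "country": "US", "amount": 12.5},
    {"user_id": 2, "country": "CA", "amount": 7.0},
    {"user_id": 3, "country": "US", "amount": 3.25},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Column pruning: summing "amount" reads a single column's values and
# never looks at "user_id" or "country".
total_amount = sum(columns["amount"])
print(total_amount)  # 22.75
```

In the row layout, the same sum would have to walk every full record. In the column layout, similar values also sit next to each other, which is what makes the encodings described later effective.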

Parquet File Structure

File Header

Contains only the 4-byte magic number that identifies the file as Parquet.

• Magic number (PAR1)
• Version, schema, and column statistics are stored in the footer, not here

Row Groups

Horizontal partitions containing column chunks.

• Default 128MB size in the Java writer (other writers use different defaults)
• Column chunks per column
• Statistics and metadata
• Parallel processing unit
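
The per-row-group statistics are what let a reader skip whole row groups when a filter cannot match. The sketch below shows the idea with a range filter over integer values. `RowGroupStats` and `row_groups_to_scan` are hypothetical names, not part of any Parquet library, and a real reader takes these min/max values from the footer metadata.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's min/max within a row group."""
    index: int
    min_value: int
    max_value: int


def row_groups_to_scan(stats, lower, upper):
    """Return indices of row groups whose [min, max] range overlaps [lower, upper]."""
    return [
        s.index
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


stats = [
    RowGroupStats(index=0, min_value=1, max_value=100),
    RowGroupStats(index=1, min_value=101, max_value=200),
    RowGroupStats(index=2, min_value=201, max_value=300),
]

# Filter: value BETWEEN 150 AND 220. Row group 0 cannot contain a match.
selected = row_groups_to_scan(stats, 150, 220)
print(selected)  # [1, 2]
```

Skipping works best when the data is sorted or clustered on the filtered column, because the min/max ranges of different row groups then overlap less.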

Column Chunks

Data for a specific column within a row group.

• Data pages (compressed)
• Dictionary pages
• Column metadata
• Encoding information

Footer

Contains the file metadata and schema, followed by a 4-byte footer length and the closing magic number (PAR1).

• Format version
• Schema definition
• Row group metadata
• Column statistics
• Key-value metadata
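
A reader starts from the end of the file: it checks the trailing magic number, reads the 4-byte little-endian footer length just before it, and then knows where the metadata begins. The sketch below performs only that first step on a synthetic byte buffer, assuming an unencrypted file. The function name `footer_length` is invented here, and the "metadata" bytes are placeholders rather than real Thrift-encoded metadata.

```python
import struct

MAGIC = b"PAR1"


def footer_length(data: bytes) -> int:
    """Validate the magic bytes and return the footer metadata length in bytes."""
    # Smallest possible layout: 4 (magic) + 4 (length) + 4 (magic).
    if len(data) < 12:
        raise ValueError("too small to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The 4-byte little-endian length sits immediately before the trailing magic.
    (length,) = struct.unpack("<I", data[-8:-4])
    if length > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return length


# Synthetic buffer for illustration only; not a readable Parquet file.
fake_metadata = b"\x00" * 10
fake_file = (
    MAGIC
    + b"column-data"
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)
print(footer_length(fake_file))  # 10
```

Keeping the metadata at the end lets a writer stream row groups out in a single pass and record their offsets and statistics only once everything has been written.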

Real-World Parquet Implementations

Netflix

Uses Parquet for storing and analyzing petabytes of viewing data and content metadata.

• 100+ TB of new Parquet data daily
• 70% storage cost reduction vs JSON
• 10x faster analytical queries
• Integration with Spark and Presto

Uber

Leverages Parquet in their data lake for trip data, driver analytics, and ML features.

• Exabytes of Parquet data
• Real-time and batch analytics
• ML feature store integration
• Cost-optimized storage on S3

Airbnb

Powers their data warehouse and analytics platform with Parquet for all structured data.

• Complete migration from Avro
• 60% reduction in storage costs
• Faster query performance on Presto
• Schema evolution support

Twitter

Uses Parquet for tweet analytics, user behavior analysis, and ad targeting data.

• Billions of tweets in Parquet format
• Real-time streaming to Parquet
• Integration with Hadoop ecosystem
• Efficient analytical workloads

Compression & Encoding Strategies

Dictionary Encoding

Replaces repeated values with integer references to a dictionary, ideal for categorical data.

Dictionary Encoding Example
Original: ["US", "CA", "US", "UK", "US", "CA"]
Dictionary: {0: "US", 1: "CA", 2: "UK"}
Encoded: [0, 1, 0, 2, 0, 1]
# Savings grow as cardinality drops: few distinct values means small integer codes
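
The example above can be reproduced with a minimal sketch in plain Python. The function names are illustrative, and a real Parquet writer additionally bit-packs the integer codes and stores the dictionary in its own page. This version assigns codes in order of first appearance.

```python
def dictionary_encode(values):
    """Return (dictionary, codes), assigning codes in order of first appearance."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original values from the dictionary and the integer codes."""
    return [dictionary[code] for code in codes]


original = ["US", "CA", "US", "UK", "US", "CA"]
dictionary, encoded = dictionary_encode(original)
print(dictionary)  # ['US', 'CA', 'UK']
print(encoded)     # [0, 1, 0, 2, 0, 1]
```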

Run Length Encoding

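Compresses sequences of identical values by storing the value and count. Parquet combines it with bit-packing and applies it mainly to dictionary indices and repetition/definition levels.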

RLE Encoding Example
Original: [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
RLE: [(1, 4), (2, 2), (3, 5)]
# Ideal for sorted or grouped data
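
As a rough sketch of the same idea in plain Python (function names are illustrative): this is simple value/count RLE, not Parquet's actual hybrid of RLE and bit-packing.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


original = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
pairs = rle_encode(original)
print(pairs)  # [(1, 4), (2, 2), (3, 5)]
```

On unsorted data with no repeats, every value becomes its own pair and the output is larger than the input, which is why sorting or grouping beforehand matters.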

Delta Encoding

Stores differences between consecutive values, effective for timestamps and numeric sequences.

Delta Encoding Example
Original: [1000, 1001, 1005, 1008, 1010]
Delta: [1000, 1, 4, 3, 2]
# Reduces value ranges for better compression
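
A minimal sketch of the example above, again in plain Python with illustrative function names. Parquet's delta encodings go further by packing the deltas into blocks with a minimal bit width, which this sketch leaves out.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """A running sum over the deltas restores the original values."""
    return list(accumulate(deltas))


original = [1000, 1001, 1005, 1008, 1010]
deltas = delta_encode(original)
print(deltas)  # [1000, 1, 4, 3, 2]
```

After the first value, every stored number fits in a few bits instead of the width of the original values, which is where the space saving comes from.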

Parquet Best Practices

✅ Do

• Use Snappy compression for balanced performance
• Partition data by frequently filtered columns
• Optimize row group size (128MB-1GB)
• Sort data by commonly filtered columns
• Use appropriate data types (avoid strings for IDs)
• Enable predicate pushdown in query engines

❌ Don't

• Create too many small files (< 1MB)
• Use Parquet for frequent updates/deletes
• Ignore schema evolution considerations
• Over-partition your data
• Use overly complex nested schemas
• Forget to tune row group size for your workload