
Apache ORC

Master Apache ORC: Optimized Row Columnar format, ACID transactions, compression, and Hive integration.


What is Apache ORC?

Apache ORC (Optimized Row Columnar) is a columnar storage format designed for Hadoop workloads. Originally developed at Hortonworks as a successor to RCFile, ORC offers better compression and faster reads than RCFile, and it is the file format that Hive's ACID transactional tables are built on, which makes it a common choice for big data analytics and data warehousing.

ORC combines the benefits of columnar storage with advanced features like built-in indexes, bloom filters, and ACID compliance. It's tightly integrated with Apache Hive and supports vectorized query execution, making it particularly effective for analytical workloads requiring both high performance and data integrity.

ORC File Structure

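File Header

Identifies the file. The header itself holds only the magic bytes; version and compression details are recorded in the postscript at the end of the file.

• Magic bytes (ORC) at the start of the file
• Format version (stored in the postscript)
• Compression type (stored in the postscript)
• Writer identification (stored in the file tail)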
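
As a rough illustration of the signature check, the sketch below inspects raw bytes the way a reader might start: an ORC file begins with the three bytes ORC, and the final byte of the file gives the length of the postscript. The function name `inspect_orc_bytes` and the synthetic `fake_file` are assumptions made for this example; a real reader would go on to decode the postscript itself, which is not attempted here.

```python
ORC_MAGIC = b"ORC"


def inspect_orc_bytes(data: bytes) -> dict:
    """Check the leading magic bytes and read the postscript length byte.

    Hypothetical helper: it does not parse the postscript or footer.
    """
    if len(data) < len(ORC_MAGIC) + 1:
        raise ValueError("too short to be an ORC file")
    return {
        "has_magic": data[: len(ORC_MAGIC)] == ORC_MAGIC,
        # The last byte of an ORC file stores the postscript length.
        "postscript_length": data[-1],
    }


# Synthetic stand-in for a file: magic, opaque body, 5 placeholder
# postscript bytes, then the one-byte postscript length.
fake_file = ORC_MAGIC + b"\x00" * 16 + b"PSPSP" + bytes([5])
info = inspect_orc_bytes(fake_file)
print(info)  # {'has_magic': True, 'postscript_length': 5}
```

Reading from the tail first is why the format version and compression codec can be discovered without scanning the data.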

Stripes

Self-contained units with data, indexes, and footer.

• Configurable size (orc.stripe.size): 64 MB by default in current ORC writers, 250 MB in older Hive documentation
• Row data section
• Row index entries
• Stripe footer
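
Because each stripe can be read independently, the stripe size roughly bounds how many pieces a file can be split into for parallel reads. The sketch below is only back-of-the-envelope arithmetic; the function name `stripe_count` and the sizes used are illustrative assumptions, and real stripes are often smaller than the configured size.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def stripe_count(file_bytes: int, stripe_bytes: int) -> int:
    """Approximate number of stripes for a file, using ceiling division."""
    if file_bytes < 0 or stripe_bytes <= 0:
        raise ValueError("file_bytes must be >= 0 and stripe_bytes > 0")
    return -(-file_bytes // stripe_bytes)


# Illustrative sizes only: a 1 GiB file with two candidate stripe sizes.
small_stripes = stripe_count(1 * GIB, 64 * MIB)
large_stripes = stripe_count(1 * GIB, 256 * MIB)
print(small_stripes, large_stripes)  # 16 4
```

Smaller stripes give more units of parallelism but more per-stripe overhead, which is the trade-off behind choosing a stripe size.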

Indexes

Built-in row group indexes and bloom filters.

• Row group indexes
• Column statistics
• Bloom filters
• Min/max values
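
The min/max statistics let a reader skip row groups that cannot contain a requested value. The sketch below is a toy model of that idea, not ORC's actual index format: `RowGroupStats`, `build_stats`, and `groups_to_read` are hypothetical names, and a row group here is just a slice of a Python list.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    minimum: int
    maximum: int


def build_stats(values, stride):
    """Compute min/max per row group of `stride` consecutive values."""
    groups = (values[i:i + stride] for i in range(0, len(values), stride))
    return [RowGroupStats(min(g), max(g)) for g in groups]


def groups_to_read(stats, target):
    """Return indexes of row groups whose [min, max] range may hold target."""
    return [i for i, s in enumerate(stats) if s.minimum <= target <= s.maximum]


# The same 100 values, once sorted and once scattered across row groups.
sorted_values = list(range(100))
scattered_values = [(i * 37) % 100 for i in range(100)]

sorted_hits = groups_to_read(build_stats(sorted_values, 10), 42)
scattered_hits = groups_to_read(build_stats(scattered_values, 10), 42)
print(sorted_hits)          # [4]
print(len(scattered_hits))  # 10
```

With sorted data only one row group has to be read; with scattered data every group's range covers the target, so the index prunes nothing.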

File Footer

Metadata about stripes, schema, and statistics.

• Stripe information
• Type information
• File statistics
• User metadata

ACID Transaction Features

Atomicity

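Each INSERT, UPDATE, DELETE, or MERGE statement on a transactional table either completes fully or leaves no visible changes. Hive auto-commits every statement; its transaction documentation lists BEGIN, COMMIT, and ROLLBACK as unsupported, so multi-statement transactions are not available.

Atomic Statement Example
-- Each statement below is its own auto-committed transaction
INSERT INTO sales VALUES (1, 'Product A', 100);  -- fully applied or not at all
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 1;  -- a separate transaction
-- If the UPDATE fails, the INSERT above stays committed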

Consistency

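Readers always see a table state made up of whole committed statements, and column types are fixed by the ORC schema. In Hive, declared primary and foreign keys are informational and are not enforced.

Consistency Guarantees
-- Data types fixed by the ORC schema
-- NOT NULL and CHECK constraints can be enforced (Hive 3.0+)
-- Primary key and foreign key constraints are informational only
-- Readers never observe partially written statements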

Isolation

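Concurrent transactions don't interfere with each other: each query reads from a consistent snapshot of the table taken when it starts.

Isolation Mechanisms
-- Snapshot isolation for each query
-- Multi-version data kept in base and delta files
-- Write IDs determine which row versions a reader sees
-- Reads are not blocked by concurrent inserts, updates, or deletes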
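
One way to picture snapshot isolation is as a visibility filter over versioned rows. The sketch below is a simplified model, not Hive's implementation: `Snapshot`, `RowVersion`, and `visible_rows` are hypothetical names, and each row version simply carries the ID of the write that produced it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    high_watermark: int          # highest write ID known when the query began
    excluded: frozenset = frozenset()  # write IDs still open or aborted

    def can_see(self, write_id: int) -> bool:
        return write_id <= self.high_watermark and write_id not in self.excluded


@dataclass(frozen=True)
class RowVersion:
    key: int
    value: str
    write_id: int


def visible_rows(versions, snapshot):
    """Return the newest visible value per key for the given snapshot."""
    latest = {}
    for version in sorted(versions, key=lambda v: v.write_id):
        if snapshot.can_see(version.write_id):
            latest[version.key] = version.value
    return latest


versions = [
    RowVersion(1, "a", write_id=1),
    RowVersion(2, "x", write_id=2),
    RowVersion(1, "b", write_id=3),  # still uncommitted for the early reader
    RowVersion(2, "y", write_id=5),  # written after the early reader started
]
early = Snapshot(high_watermark=4, excluded=frozenset({3}))
late = Snapshot(high_watermark=5)
print(visible_rows(versions, early))  # {1: 'a', 2: 'x'}
print(visible_rows(versions, late))   # {1: 'b', 2: 'y'}
```

The early reader keeps seeing the same answer no matter what commits afterwards, which is the property snapshot isolation provides.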

Durability

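Committed changes are written as ORC base and delta files on the underlying file system, and the commit is recorded in the Hive metastore, so they survive process and node failures.

Durability Features
-- Changes persisted as delta files rather than a write-ahead log
-- Commit state recorded in the Hive metastore
-- Aborted or failed writes ignored by readers and cleaned up later
-- File system replication (for example HDFS) for availability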

Real-World ORC Implementations

Yahoo

Adopted ORC for its massive data warehouse operations.

• Petabytes of data in ORC format
• 3x compression improvement over RCFile
• Vectorized query execution benefits
• ACID transaction support for updates

Facebook

Uses ORC for data warehouse analytics and machine learning feature storage.

• Exabyte-scale data warehouse
• Integration with Presto queries
• ML feature store backend
• Real-time analytics on historical data

LinkedIn

Leverages ORC for member data analytics and recommendation system features.

• Member activity and profile data
• A/B testing analytics
• Recommendation engine features
• Real-time dashboard updates

Spotify

Uses ORC for music streaming analytics and user behavior analysis.

• Listening behavior analytics
• Music recommendation data
• Artist and content metadata
• Advertising targeting features

ORC Best Practices

✅ Do

• Use ORC for Hive-based data warehouses
• Enable ACID transactions when needed
• Leverage vectorized query execution
• Configure appropriate stripe sizes
• Use bloom filters for selective queries
• Sort data by commonly filtered columns
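
To make the bloom filter recommendation more concrete, the sketch below sizes a filter from an expected item count and a target false-positive probability using the standard formulas, then builds a toy filter. This is not ORC's implementation: ORC uses its own hashing, while this example swaps in SHA-256 with double hashing, and `bloom_parameters` and `BloomFilter` are hypothetical names. The values 10,000 and 0.05 are illustrative.

```python
import hashlib
import math


def bloom_parameters(expected_items: int, fpp: float):
    """Return (num_bits, num_hashes) from the standard bloom filter formulas."""
    bits = math.ceil(-expected_items * math.log(fpp) / (math.log(2) ** 2))
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return bits, hashes


class BloomFilter:
    def __init__(self, expected_items: int, fpp: float):
        self.num_bits, self.num_hashes = bloom_parameters(expected_items, fpp)
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all probe positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bloom = BloomFilter(expected_items=10_000, fpp=0.05)
for i in range(10_000):
    bloom.add(f"user-{i}")

print(bloom.num_hashes)                   # 4
print(bloom.might_contain("user-1234"))  # True (no false negatives)
```

A lower false-positive target means more bits per value, so the filter helps most on columns queried with selective equality predicates.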

❌ Don't

• Use ORC for write-heavy OLTP workloads
• Create excessively small stripe sizes
• Ignore compaction for updated tables
• Mix ORC with other formats unnecessarily
• Disable indexes and statistics
• Forget to tune bloom filter parameters
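
The compaction item is easier to see with a toy model of why it matters: updates and deletes accumulate as small delta files that every reader has to merge with the base data until compaction folds them in. The sketch below assumes a simplified representation (a dict for the base, lists of operations for deltas) and a hypothetical `compact` function; it is not Hive's compactor.

```python
def compact(base: dict, deltas: list) -> dict:
    """Fold ordered delta files (oldest first) into a new base snapshot."""
    merged = dict(base)  # leave the old base untouched for running readers
    for delta in deltas:
        for operation, key, value in delta:
            if operation == "delete":
                merged.pop(key, None)
            else:  # "upsert": insert a new row or replace an existing one
                merged[key] = value
    return merged


base = {1: "Product A", 2: "Product B"}
deltas = [
    [("upsert", 2, "Product B v2"), ("upsert", 3, "Product C")],
    [("delete", 1, None)],
]
new_base = compact(base, deltas)
print(new_base)  # {2: 'Product B v2', 3: 'Product C'}
```

Until the merge happens, each query pays the cost of replaying every delta, which is why tables with frequent updates slow down when compaction is neglected.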