What is Apache ORC?
Apache ORC (Optimized Row Columnar) is a columnar storage format designed specifically for Hadoop workloads. Originally developed by Hortonworks as an evolution of RCFile, ORC provides superior compression, performance, and ACID transaction capabilities for big data analytics and data warehousing applications.
ORC combines the benefits of columnar storage with advanced features like built-in indexes, bloom filters, and ACID compliance. It's tightly integrated with Apache Hive and supports vectorized query execution, making it particularly effective for analytical workloads requiring both high performance and data integrity.
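To make this concrete, here is a minimal sketch of a Hive table stored as ORC; the table and column names are illustrative, but `orc.compress` and `orc.bloom.filter.columns` are standard ORC table properties:

```sql
-- Illustrative Hive DDL; table and column names are hypothetical.
CREATE TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"             = "ZLIB",    -- codec: ZLIB, SNAPPY, ZSTD, or NONE
  "orc.bloom.filter.columns" = "user_id"  -- build bloom filters for selective lookups
);

-- Vectorized query execution is controlled by a Hive session setting:
SET hive.vectorized.execution.enabled = true;
```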
ORC File Structure
File Header
Contains file signature and version information.
• Format version
• Compression type
• Writer identification
Stripes
Self-contained units with data, indexes, and footer.
• Row data section
• Row index entries
• Stripe footer
Indexes
Built-in row group indexes and bloom filters.
• Column statistics
• Bloom filters
• Min/max values
File Footer
Metadata about stripes, schema, and statistics.
• Type information
• File statistics
• User metadata
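The layout above can be inspected with the file-dump tool that ships with Hive; the file path below is a placeholder, and the command requires a Hive installation:

```
# Print stripe, index, and footer metadata for an ORC file (path is illustrative).
hive --orcfiledump /warehouse/sales/part-00000.orc

# Newer ORC releases bundle a standalone tool offering the same view:
# java -jar orc-tools-*-uber.jar meta /warehouse/sales/part-00000.orc
```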
ACID Transaction Features
Atomicity
All operations in a transaction complete successfully or none do. Failed transactions are automatically rolled back.
BEGIN;
INSERT INTO sales VALUES (1, 'Product A', 100);
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 1;
COMMIT; -- Both operations succeed or both are rolled back
Consistency
Data integrity constraints are maintained across all transactions and concurrent operations.
-- Primary key constraints enforced
-- Foreign key relationships maintained
-- Check constraints validated
-- Data type constraints preserved
Isolation
Concurrent transactions don't interfere with one another; isolation is provided through snapshot isolation.
-- Read Committed (default)
-- Snapshot Isolation for consistency
-- Multi-version concurrency control
-- Lock-free read operations
Durability
Committed transactions persist through system failures with write-ahead logging.
-- Write-ahead logging (WAL)
-- Transaction commit confirmation
-- Crash recovery mechanisms
-- Data replication for availability
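In Hive, these ACID guarantees apply only to transactional tables, which must be stored as ORC. A minimal, illustrative setup (assuming Hive 3.x, where transactional tables no longer require bucketing) looks like this; the property names are standard Hive settings:

```sql
-- Session/server settings required for ACID tables.
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ACID tables must be stored as ORC and marked transactional.
CREATE TABLE inventory (
  product_id INT,
  quantity   INT
)
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");
```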
Real-World ORC Implementations
Yahoo
One of the earliest large-scale adopters of ORC, using it for massive data warehouse operations.
• Petabytes of data in ORC format
• 3x compression improvement over RCFile
• Vectorized query execution benefits
• ACID transaction support for updates
Facebook
Uses ORC for data warehouse analytics and machine learning feature storage.
• Exabyte-scale data warehouse
• Integration with Presto queries
• ML feature store backend
• Real-time analytics on historical data
LinkedIn
Leverages ORC for member data analytics and recommendation system features.
• Member activity and profile data
• A/B testing analytics
• Recommendation engine features
• Real-time dashboard updates
Spotify
Uses ORC for music streaming analytics and user behavior analysis.
• Listening behavior analytics
• Music recommendation data
• Artist and content metadata
• Advertising targeting features
ORC Best Practices
✅ Do
• Use ORC for Hive-based data warehouses
• Enable ACID transactions when needed
• Leverage vectorized query execution
• Configure appropriate stripe sizes
• Use bloom filters for selective queries
• Sort data by commonly filtered columns
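Several of these recommendations map directly to ORC table properties. The values below are common starting points rather than universal defaults, and the table and column names are hypothetical:

```sql
-- Illustrative tuning via ORC table properties.
CREATE TABLE events (
  event_id   BIGINT,
  country    STRING,
  event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  "orc.stripe.size"          = "67108864",  -- 64 MB stripes
  "orc.row.index.stride"     = "10000",     -- rows per row-index entry
  "orc.bloom.filter.columns" = "event_id",  -- bloom filters on selective columns
  "orc.bloom.filter.fpp"     = "0.05"       -- target false-positive probability
);
```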
❌ Don't
• Use ORC for write-heavy OLTP workloads
• Create excessively small stripe sizes
• Ignore compaction for updated tables
• Mix ORC with other formats unnecessarily
• Disable indexes and statistics
• Forget to tune bloom filter parameters
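The compaction point deserves emphasis: updates and deletes on ACID tables accumulate delta files until compaction merges them, and read performance degrades in the meantime. Compaction can be triggered manually with standard Hive syntax; the table name is illustrative:

```sql
-- 'minor' merges delta files; 'major' rewrites deltas and base files together.
ALTER TABLE sales COMPACT 'major';

-- Inspect queued, running, and completed compactions.
SHOW COMPACTIONS;
```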