
Apache Hive

Master Apache Hive: SQL on Hadoop, data warehousing, partitioning strategies, and query optimization.

40 min read · Intermediate

What is Apache Hive?

Apache Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Built on top of Hadoop, Hive provides a SQL-like language (HiveQL) to query data stored in the various file systems and databases that integrate with Hadoop.

Originally developed by Facebook and now an Apache top-level project, Hive abstracts the complexity of MapReduce programming by translating SQL queries into MapReduce, Tez, or Spark jobs. It's widely used for batch processing, ETL operations, and large-scale data analytics in enterprises handling petabyte-scale datasets.
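As a minimal sketch of what this looks like in practice (table and column names here are illustrative, not from any particular deployment), a HiveQL session defines a table over files in distributed storage and queries it with familiar SQL:

```sql
-- Hypothetical external table over raw log files in HDFS.
CREATE EXTERNAL TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION '/data/page_views';

-- Hive translates this query into MapReduce, Tez, or Spark jobs.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE dt = '2024-01-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```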

Hive Query Performance Calculator (interactive widget; example output)

• Query latency: 188 s
• Partition pruning: 90%
• Data scanned: 50 GB
• Throughput: 0.3 GB/s
• Storage (compressed): 125 GB
• Memory required: 20,480 MB
• Concurrent queries: 30

Hive Storage Formats

ORC (Optimized Row Columnar)

Default format optimized for Hive with excellent compression and performance.

• Columnar storage format
• Built-in indexes & statistics
• ACID transaction support
• Typically ~75% size reduction vs. raw text
• Predicate pushdown
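A sketch of an ORC table with ACID transactions enabled (the table is hypothetical, and row-level updates require a transactional-capable Hive setup):

```sql
-- Hypothetical ORC table with ACID support.
CREATE TABLE orders (
  order_id BIGINT,
  status   STRING,
  amount   DECIMAL(10,2)
)
STORED AS ORC
TBLPROPERTIES (
  'transactional' = 'true',   -- enables ACID operations
  'orc.compress'  = 'ZLIB'    -- ORC built-in compression
);

-- ACID support allows row-level updates and deletes:
UPDATE orders SET status = 'shipped' WHERE order_id = 42;
```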

Parquet

Cross-platform columnar format ideal for analytical workloads.

• Ecosystem compatibility
• Efficient compression
• Schema evolution support
• Column pruning
• Spark integration
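Creating a Parquet-backed table is similar (again an illustrative sketch); the resulting files can also be read by Spark, Presto/Trino, Impala, and other engines:

```sql
-- Hypothetical Parquet table for cross-engine analytics.
CREATE TABLE events_parquet (
  event_id   BIGINT,
  event_type STRING,
  payload    STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```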

Avro

Row-based format with schema evolution for data serialization.

• Schema in JSON format
• Splittable & compressible
• Rich data structures
• Schema evolution
• ETL friendly

TextFile/CSV

Human-readable format for simple data exchange and debugging.

• Human readable
• No compression by default
• Simple to generate
• Large storage footprint
• Slow query performance

Hive Execution Engines

MapReduce (Legacy)

Original execution engine with high latency but proven stability for batch processing.

MapReduce Configuration
SET hive.execution.engine=mr;
SET mapreduce.job.reduces=10;
SET mapreduce.map.memory.mb=4096;
-- High latency: 30s+ startup overhead

Apache Tez (Recommended)

DAG-based execution engine optimized for Hive with reduced latency and better resource utilization.

Tez Configuration
SET hive.execution.engine=tez;
SET hive.tez.container.size=4096;
SET hive.tez.java.opts=-Xmx3276m;
-- Often several times faster than MapReduce for complex queries

Apache Spark

In-memory processing engine with excellent performance for iterative algorithms and machine learning.

Spark Configuration
SET hive.execution.engine=spark;
SET spark.executor.memory=8g;
SET spark.executor.cores=4;
-- Best for ML workloads and iterative processing

Real-World Hive Implementations

Facebook

Hive's original creator, processing 4PB+ of data daily with thousands of users running analytics.

• 300PB+ data warehouse
• 600TB daily data ingestion
• 30K+ daily Hive queries
• Custom Presto integration for speed

Netflix

Analyzes viewing patterns and content performance across 200M+ subscribers.

• 100PB+ S3 data lake
• Hourly ETL pipelines
• A/B testing analytics
• Content recommendation data processing

Airbnb

Powers data infrastructure for search ranking and pricing optimization.

• 10PB+ data warehouse
• Real-time pricing models
• Search quality metrics
• Host/guest behavior analysis

Uber

Processes ride data for surge pricing and driver allocation optimization.

• 100PB+ analytical data
• Geospatial analytics
• Driver earnings calculations
• City operations dashboards

Hive Optimization Techniques

Query Optimization

• Use partitioning to reduce data scans
• Enable the cost-based optimizer (CBO)
• Leverage bucketing for efficient joins
• Use ORC/Parquet for columnar storage
• Apply predicate pushdown filters
• Collect table and column statistics
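Several of these techniques can be combined in one table definition; the following sketch (table and column names are hypothetical) shows partitioning, bucketing, CBO, and statistics collection working together:

```sql
-- Hypothetical table combining the optimization techniques above.
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)              -- enables partition pruning
CLUSTERED BY (customer_id) INTO 32 BUCKETS     -- efficient bucketed joins
STORED AS ORC;                                 -- columnar storage, predicate pushdown

-- Enable the cost-based optimizer and give it statistics to work with.
SET hive.cbo.enable=true;
ANALYZE TABLE sales PARTITION (sale_date) COMPUTE STATISTICS FOR COLUMNS;

-- The partition filter lets Hive skip every other partition entirely.
SELECT SUM(amount) FROM sales WHERE sale_date = '2024-01-01';
```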

Configuration Tuning

• Set an appropriate execution engine (Tez/Spark)
• Configure vectorized query execution
• Tune container and heap sizes
• Enable dynamic partition pruning
• Use map-side joins for small tables
• Configure parallel execution settings
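The tuning points above map to session-level settings roughly as follows; the values are illustrative and should be sized to your cluster, and property names can vary by Hive version:

```sql
-- Illustrative session-level tuning (values are assumptions, not recommendations).
SET hive.execution.engine=tez;                      -- Tez execution engine
SET hive.vectorized.execution.enabled=true;         -- vectorized query execution
SET hive.tez.container.size=4096;                   -- container size in MB
SET hive.tez.dynamic.partition.pruning=true;        -- dynamic partition pruning on Tez
SET hive.auto.convert.join=true;                    -- map-side joins for small tables
SET hive.exec.parallel=true;                        -- run independent stages in parallel
```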

Hive Best Practices

✅ Do

• Partition tables by date/region columns
• Use ORC format for best performance
• Collect statistics with ANALYZE TABLE
• Implement proper data compaction strategies
• Use Tez or Spark execution engines
• Monitor and optimize slow queries
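For the statistics point above, collection is a two-step habit (table name is hypothetical): basic statistics per partition, plus column statistics for the cost-based optimizer:

```sql
-- Hypothetical table; run after loading new data.
ANALYZE TABLE page_views PARTITION (dt) COMPUTE STATISTICS;       -- row counts, sizes
ANALYZE TABLE page_views COMPUTE STATISTICS FOR COLUMNS;          -- NDVs, min/max for CBO
```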

❌ Don't

• Create too many small partitions
• Use SELECT * on large tables
• Ignore data skew in joins
• Skip table statistics collection
• Use MapReduce for interactive queries
• Store data in text format for analytics