
Apache Pig

Master Apache Pig: data flow scripting, Pig Latin language, and ETL processing on Hadoop.

30 min read · Intermediate

What is Apache Pig?

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. Its language, Pig Latin, abstracts away the Java MapReduce idiom into a data-flow notation, playing much the same role for MapReduce that SQL plays for relational databases. Pig compiles Pig Latin scripts into MapReduce jobs that execute on Hadoop clusters.
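
For a sense of the abstraction, the canonical word-count job, which takes dozens of lines of Java MapReduce code, reduces to a few lines of Pig Latin (a minimal sketch; the paths are hypothetical):

-- Each statement below compiles into a stage of a MapReduce plan
lines   = LOAD '/data/input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO '/output/wordcount';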

Originally developed at Yahoo for researchers and data analysts, Pig provides a simple language for data analysis tasks. It's particularly well-suited for ETL operations, data preparation, and ad-hoc analysis of large datasets. Pig handles complex data structures and provides extensive built-in functions for data transformation, making it easier to work with semi-structured and unstructured data than traditional SQL-based tools.


Pig Core Features

Pig Latin Language

High-level, SQL-inspired data flow language for big data processing.

• LOAD, STORE operations
• FILTER, GROUP, JOIN
• FOREACH transformations
• ORDER, LIMIT, DISTINCT
• User-defined functions (UDFs)
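
Most of these operators appear in the full scripts later on this page; JOIN and DISTINCT, which those scripts omit, look like this (a minimal sketch with hypothetical relations and paths):

customers = LOAD '/data/customers.csv' USING PigStorage(',')
            AS (customer_id:int, name:chararray, region:chararray);
orders    = LOAD '/data/orders.csv' USING PigStorage(',')
            AS (order_id:int, customer_id:int, amount:double);

-- Inner equi-join; joined fields are referenced as orders::customer_id, etc.
enriched = JOIN orders BY customer_id, customers BY customer_id;

-- DISTINCT removes duplicate tuples from a relation
region_list = FOREACH customers GENERATE region;
regions     = DISTINCT region_list;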

Data Types

Rich data type system supporting complex nested structures.

• Numeric: int, long, float, double
• Text and binary: chararray, bytearray
• Complex: tuple, bag, map
• datetime for temporal data
• Automatic (implicit) type casting
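
A schema that exercises the complex types might be declared like this (a sketch; field names and the path are hypothetical):

events = LOAD '/data/events.tsv' AS (
    user_id:int,                            -- scalar field
    location:tuple(lat:double, lon:double), -- tuple: fixed, ordered fields
    tags:bag{t:tuple(tag:chararray)},       -- bag: unordered collection of tuples
    props:map[chararray]                    -- map: chararray keys and values
);

-- Tuples are dereferenced with '.', maps with '#'
coords = FOREACH events GENERATE location.lat, location.lon, props#'browser';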

Execution Modes

Multiple execution environments for development and production.

• Local mode (testing)
• MapReduce mode (Hadoop)
• Tez mode (faster execution)
• Spark mode (in-memory)
• Interactive Grunt shell
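
The engine is selected when Pig is launched, via the -x flag (shown here with the etl_pipeline.pig script from the examples below):

pig -x local etl_pipeline.pig      # single JVM, local filesystem: fast test cycles
pig -x mapreduce etl_pipeline.pig  # the default: MapReduce jobs on the cluster
pig -x tez etl_pipeline.pig        # Tez DAG engine (Pig 0.14+): fewer disk round-trips
pig -x spark etl_pipeline.pig      # Spark engine (Pig 0.17+): in-memory execution
pig                                # no script argument opens the interactive Grunt shell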

Built-in Functions

Extensive library of functions for data manipulation and analysis.

• String functions (UPPER, LOWER, TRIM)
• Aggregate functions (SUM, AVG, COUNT)
• Math functions (ABS, ROUND, SQRT)
• Date/time functions (ToDate, GetYear)
• Bag and map functions (SIZE, TOKENIZE)
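
A few of these builtins in combination, reusing the sales_data schema from the ETL example below (a sketch, not part of that script; the date format is an assumption):

clean = FOREACH sales_data GENERATE
    transaction_id,
    UPPER(TRIM(category))         AS category,   -- string builtins
    ROUND(amount * 100.0) / 100.0 AS amount,     -- math builtin: round to cents
    ToDate(date, 'yyyy-MM-dd')    AS sale_date;  -- chararray to datetime

by_year = GROUP clean BY GetYear(sale_date);
yearly  = FOREACH by_year GENERATE group AS year,
          COUNT(clean) AS transactions, SUM(clean.amount) AS revenue;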

Real-World Pig Implementations

Yahoo

Original creator using Pig for large-scale log processing and web analytics.

  • 40% of Hadoop jobs run on Pig
  • Web crawl data processing
  • Search ranking algorithm data prep
  • User behavior analytics

Netflix

Uses Pig for recommendation engine data preparation and content analytics.

  • Content recommendation data pipelines
  • User viewing pattern analysis
  • A/B testing data processing
  • Content metadata enrichment

Twitter

Processes billions of tweets and user interactions for analytics and ML.

  • Tweet sentiment analysis
  • Trending topics calculation
  • User engagement metrics
  • Ad targeting data preparation

E-commerce

Retailers use Pig for customer analytics, inventory management, and pricing.

  • Customer behavior analysis
  • Product recommendation engines
  • Inventory optimization
  • Dynamic pricing algorithms

Pig Latin Script Examples

Basic ETL Pipeline

etl_pipeline.pig
-- Load raw sales data
sales_data = LOAD '/data/sales.csv' USING PigStorage(',')
             AS (transaction_id:int, product:chararray, category:chararray, 
                 amount:double, customer_id:int, date:chararray);

-- Filter valid transactions
valid_sales = FILTER sales_data BY amount > 0 AND customer_id IS NOT NULL;

-- Group by category and calculate aggregates
category_group = GROUP valid_sales BY category;
category_stats = FOREACH category_group GENERATE 
                   group AS category,
                   COUNT(valid_sales) AS transaction_count,
                   SUM(valid_sales.amount) AS total_revenue,
                   AVG(valid_sales.amount) AS avg_amount;

-- Sort by revenue and save top categories
sorted_categories = ORDER category_stats BY total_revenue DESC;
top_categories = LIMIT sorted_categories 10;

STORE top_categories INTO '/output/top_categories' USING PigStorage(',');

Log Analysis with UDFs

log_analysis.pig
REGISTER 'myudfs.jar';
DEFINE ExtractIP com.mycompany.pig.ExtractIP();
DEFINE ParseUserAgent com.mycompany.pig.ParseUserAgent();

-- Load web server logs
logs = LOAD '/logs/access.log' AS (line:chararray);

-- Parse log entries
-- (backslashes in Pig string literals must be doubled to reach the regex engine)
parsed_logs = FOREACH logs GENERATE 
    ExtractIP(line) AS client_ip,
    REGEX_EXTRACT(line, '\\[(.*?)\\]', 1) AS timestamp,
    REGEX_EXTRACT(line, '"(\\w+)\\s+(\\S+)\\s+HTTP', 1) AS method,
    REGEX_EXTRACT(line, '"(\\w+)\\s+(\\S+)\\s+HTTP', 2) AS url,
    (int)REGEX_EXTRACT(line, '\\s(\\d{3})\\s', 1) AS status_code,
    ParseUserAgent(line) AS user_agent;

-- Analyze traffic patterns
hourly_traffic = GROUP parsed_logs BY 
    SUBSTRING(timestamp, 0, 14); -- 'dd/MMM/yyyy:HH' prefix: group by hour

-- Nested FOREACH: FILTER runs per group, then the aggregates are generated
traffic_stats = FOREACH hourly_traffic {
    error_logs = FILTER parsed_logs BY status_code >= 400;
    GENERATE 
        group AS hour,
        COUNT(parsed_logs) AS requests,
        COUNT(error_logs) AS errors,
        (double)COUNT(error_logs) / COUNT(parsed_logs) * 100.0 AS error_rate;
};

STORE traffic_stats INTO '/output/hourly_traffic' USING PigStorage('\t');
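
The ExtractIP and ParseUserAgent functions registered above are custom code, not Pig builtins. An eval UDF is a Java class extending org.apache.pig.EvalFunc; the sketch below is a hypothetical implementation of ExtractIP, not the actual class behind this example:

package com.mycompany.pig;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ExtractIP extends EvalFunc<String> {
    // Leading IPv4 address at the start of a combined-log-format line
    private static final Pattern IP =
        Pattern.compile("^(\\d{1,3}(?:\\.\\d{1,3}){3})");

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;  // propagate nulls instead of failing the task
        }
        Matcher m = IP.matcher(input.get(0).toString());
        return m.find() ? m.group(1) : null;
    }
}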

Pig Best Practices

✅ Do

  • Use early filtering to reduce data volume
  • Leverage replicated joins for small datasets (see the sketch after this list)
  • Set appropriate parallelism with SET default_parallel
  • Use ILLUSTRATE and DESCRIBE for debugging
  • Implement proper error handling in UDFs
  • Use compression for intermediate and output data
  • Take advantage of Pig's automatic optimization
  • Use meaningful aliases for relations
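
A sketch tying several of these together (relation names and paths are hypothetical):

SET default_parallel 20;  -- reducers for subsequent reduce-phase operators

-- Replicated join: every relation after the first is loaded into memory on
-- each map task, eliminating the reduce phase (the small side must fit in RAM)
clicks = LOAD '/data/clicks' AS (user_id:int, url:chararray);
users  = LOAD '/data/users'  AS (user_id:int, segment:chararray);
joined = JOIN clicks BY user_id, users BY user_id USING 'replicated';

DESCRIBE joined;    -- print the schema of a relation
ILLUSTRATE joined;  -- trace a small sampled dataset through the dataflow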

❌ Don't

  • Load unnecessary columns from large datasets
  • Use the CROSS operation on large datasets
  • Ignore data skew in join operations (see the sketch after this list)
  • Forget to handle null values appropriately
  • Use inefficient regular expressions in tight loops
  • Mix different execution modes in production
  • Skip schema validation in data loading
  • Create overly complex nested operations
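
For the skew and null points in particular (a sketch; clicks and profiles are hypothetical relations):

-- Null keys never match in a join; dropping them first shrinks the shuffle
clicks_nn = FILTER clicks BY user_id IS NOT NULL;

-- Skewed join: Pig samples the key distribution and spreads hot keys
-- across multiple reducers instead of piling them onto one
joined_sk = JOIN clicks_nn BY user_id, profiles BY user_id USING 'skewed';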