What is Apache Pig?
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. Its language, Pig Latin, abstracts the low-level Java MapReduce idiom into a high-level data-flow notation, much as SQL does for relational databases. Pig compiles Pig Latin scripts into MapReduce jobs that execute on Hadoop clusters.
Originally developed at Yahoo for researchers and data analysts, Pig provides a simple language for data analysis tasks. It's particularly well-suited for ETL operations, data preparation, and ad-hoc analysis of large datasets. Pig handles complex data structures and provides extensive built-in functions for data transformation, making it easier to work with semi-structured and unstructured data than traditional SQL-based tools.
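As a first taste of the language, the canonical word-count job fits in a few lines of Pig Latin; the input and output paths here are illustrative:

```pig
-- Load raw text, one line per record
lines = LOAD '/data/corpus.txt' AS (line:chararray);
-- Split each line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count occurrences
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/output/wordcount';
```

Each statement defines a relation; Pig builds a logical plan from the whole script and only runs jobs when STORE (or DUMP) forces execution.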
Pig Core Features
Pig Latin Language
High-level data flow language with SQL-like syntax for big data processing.
• FILTER, GROUP, JOIN
• FOREACH transformations
• ORDER, LIMIT, DISTINCT
• User-defined functions (UDFs)
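The operators listed above compose into short data-flow scripts. A sketch combining FILTER, JOIN, FOREACH, and DISTINCT (file paths and schemas are illustrative):

```pig
customers = LOAD '/data/customers.csv' USING PigStorage(',')
    AS (customer_id:int, name:chararray, region:chararray);
orders = LOAD '/data/orders.csv' USING PigStorage(',')
    AS (order_id:int, customer_id:int, total:double);
-- Keep only large orders, then join back to customers
big_orders = FILTER orders BY total > 100.0;
joined = JOIN big_orders BY customer_id, customers BY customer_id;
-- Distinct regions that placed at least one large order
regions_all = FOREACH joined GENERATE customers::region AS region;
regions = DISTINCT regions_all;
```

After a JOIN, fields are disambiguated with the `relation::field` notation, as in `customers::region` above.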
Data Types
Rich data type system supporting complex nested structures.
• Scalar: int, long, chararray, bytearray
• Complex: tuple, bag, map
• Datetime types
• Automatic type conversion
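The complex types can be declared directly in a LOAD schema and unnested with FLATTEN; the field names below are illustrative:

```pig
events = LOAD '/data/events' AS (
    user:tuple(id:int, name:chararray),   -- tuple: ordered set of fields
    tags:bag{t:tuple(tag:chararray)},     -- bag: unordered collection of tuples
    props:map[]);                         -- map: key-value pairs
-- FLATTEN unnests the bag into one output row per tag
per_tag = FOREACH events GENERATE user.id AS user_id, FLATTEN(tags) AS tag;
-- Map values are looked up with the # operator
langs = FOREACH events GENERATE props#'language' AS language;
```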
Execution Modes
Multiple execution environments for development and production.
• MapReduce mode (Hadoop)
• Tez mode (faster execution)
• Spark mode (in-memory)
• Interactive Grunt shell
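The execution mode is typically selected with the `-x` flag when launching Pig (Tez and Spark support arrived in Pig 0.14 and 0.17, respectively):

```shell
# Local mode: single JVM, local filesystem -- handy for development
pig -x local script.pig
# MapReduce mode (the default) against a Hadoop cluster
pig -x mapreduce script.pig
# Tez or Spark execution engines
pig -x tez script.pig
pig -x spark script.pig
# With no script argument, Pig opens the interactive Grunt shell
pig -x local
```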
Built-in Functions
Extensive library of functions for data manipulation and analysis.
• Aggregate functions (SUM, AVG, COUNT, MIN, MAX)
• Date/time functions
• Collection functions
• String functions (UPPER, TRIM, REPLACE)
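A sketch of several built-ins working together, bucketing transactions by hour with ToDate and GetHour before aggregating (schema and paths are illustrative):

```pig
txns = LOAD '/data/txns.csv' USING PigStorage(',')
    AS (id:int, amount:double, ts:chararray);
-- Parse the timestamp string and extract the hour of day
with_time = FOREACH txns GENERATE amount,
    GetHour(ToDate(ts, 'yyyy-MM-dd HH:mm:ss')) AS hour;
by_hour = GROUP with_time BY hour;
stats = FOREACH by_hour GENERATE group AS hour,
    COUNT(with_time) AS n,
    SUM(with_time.amount) AS total,
    AVG(with_time.amount) AS mean;
```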
Real-World Pig Implementations
Yahoo
Original creator using Pig for large-scale log processing and web analytics.
• 40% of Hadoop jobs run on Pig
• Web crawl data processing
• Search ranking algorithm data prep
• User behavior analytics
Netflix
Uses Pig for recommendation engine data preparation and content analytics.
• Content recommendation data pipelines
• User viewing pattern analysis
• A/B testing data processing
• Content metadata enrichment
Twitter
Processes billions of tweets and user interactions for analytics and ML.
• Tweet sentiment analysis
• Trending topics calculation
• User engagement metrics
• Ad targeting data preparation
E-commerce
Retailers use Pig for customer analytics, inventory management, and pricing.
• Customer behavior analysis
• Product recommendation engines
• Inventory optimization
• Dynamic pricing algorithms
Pig Latin Script Examples
Basic ETL Pipeline
-- Load raw sales data
sales_data = LOAD '/data/sales.csv' USING PigStorage(',')
AS (transaction_id:int, product:chararray, category:chararray,
amount:double, customer_id:int, date:chararray);
-- Filter valid transactions
valid_sales = FILTER sales_data BY amount > 0 AND customer_id IS NOT NULL;
-- Group by category and calculate aggregates
category_group = GROUP valid_sales BY category;
category_stats = FOREACH category_group GENERATE
group AS category,
COUNT(valid_sales) AS transaction_count,
SUM(valid_sales.amount) AS total_revenue,
AVG(valid_sales.amount) AS avg_amount;
-- Sort by revenue and save top categories
sorted_categories = ORDER category_stats BY total_revenue DESC;
top_categories = LIMIT sorted_categories 10;
STORE top_categories INTO '/output/top_categories' USING PigStorage(',');
Log Analysis with UDFs
REGISTER 'myudfs.jar';
DEFINE ExtractIP com.mycompany.pig.ExtractIP();
DEFINE ParseUserAgent com.mycompany.pig.ParseUserAgent();
-- Load web server logs
logs = LOAD '/logs/access.log' AS (line:chararray);
-- Parse log entries
parsed_logs = FOREACH logs GENERATE
ExtractIP(line) AS client_ip,
REGEX_EXTRACT(line, '\\[(.*?)\\]', 1) AS timestamp,
REGEX_EXTRACT(line, '"(\\w+)\\s+(\\S+)\\s+HTTP', 1) AS method,
REGEX_EXTRACT(line, '"(\\w+)\\s+(\\S+)\\s+HTTP', 2) AS url,
(int)REGEX_EXTRACT(line, '\\s(\\d{3})\\s', 1) AS status_code,
ParseUserAgent(line) AS user_agent;
-- Analyze traffic patterns
hourly_traffic = GROUP parsed_logs BY
SUBSTRING(timestamp, 0, 13); -- Group by hour
traffic_stats = FOREACH hourly_traffic {
    -- FILTER can only appear inside a nested FOREACH block, not inline
    error_logs = FILTER parsed_logs BY status_code >= 400;
    GENERATE group AS hour,
        COUNT(parsed_logs) AS requests,
        COUNT(error_logs) AS errors,
        (double)COUNT(error_logs) / COUNT(parsed_logs) * 100.0 AS error_rate;
};
STORE traffic_stats INTO '/output/hourly_traffic' USING PigStorage('\t');
Pig Best Practices
✅ Do
• Use early filtering to reduce data volume
• Leverage replicated joins for small datasets
• Set appropriate parallelism with SET default_parallel
• Use ILLUSTRATE and DESCRIBE for debugging
• Implement proper error handling in UDFs
• Use compression for intermediate and output data
• Take advantage of Pig's automatic optimization
• Use meaningful aliases for relations
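Two of these practices, replicated joins and explicit parallelism, look like this in a script; relation names and paths are illustrative:

```pig
-- Raise reducer parallelism for all subsequent operators
SET default_parallel 20;
big = LOAD '/data/clicks' AS (user_id:int, url:chararray);
small = LOAD '/data/users' AS (user_id:int, segment:chararray);
-- Fragment-replicated join: every relation after the first is loaded
-- into memory on each map task, avoiding a reduce phase entirely
joined = JOIN big BY user_id, small BY user_id USING 'replicated';
```

In a replicated join the small relation(s) must come last and fit in memory; Pig fails the job if they do not.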
❌ Don't
• Load unnecessary columns from large datasets
• Use CROSS operation on large datasets
• Ignore data skew in join operations
• Forget to handle null values appropriately
• Use inefficient regular expressions in tight loops
• Mix different execution modes in production
• Skip schema validation in data loading
• Create overly complex nested operations
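Null handling in particular deserves an explicit step: inner joins silently drop records with null keys, and nulls propagate through arithmetic. A minimal sketch using a filter plus the bincond operator (schema and path are illustrative):

```pig
raw = LOAD '/data/sales.csv' USING PigStorage(',')
    AS (customer_id:int, amount:double);
-- Drop records whose join key is null before any JOIN or GROUP
clean = FILTER raw BY customer_id IS NOT NULL;
-- Default null amounts to 0.0 with the bincond (?:) operator
defaulted = FOREACH clean GENERATE customer_id,
    (amount IS NULL ? 0.0 : amount) AS amount;
```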