
Hadoop & HDFS

Master Hadoop and HDFS: distributed storage, MapReduce processing, and big data architecture patterns.

45 min read · Advanced

What is Hadoop HDFS?

The Hadoop Distributed File System (HDFS) provides reliable, scalable storage for big data applications, with automatic fault tolerance and high-throughput access to large datasets. It is designed to run on commodity hardware and to handle very large files spread across many machines.

HDFS stores data across multiple nodes with configurable replication factors, providing fault tolerance through data redundancy. It's optimized for large sequential reads and writes, making it ideal for batch processing workloads and analytics applications.

HDFS Capacity & Performance Calculator

Cluster Metrics (sample values from the interactive calculator)

  • Raw Storage: 120.0 TB
  • Usable Storage: 40.0 TB (raw ÷ 3× replication)
  • Effective Storage: 28.0 TB (70% of usable)
  • Read Throughput: 15 GB/s
  • Write Throughput: 3.3 GB/s
  • Monthly Cost: $3,826
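
The storage figures above follow from simple arithmetic. A minimal sketch, assuming the calculator divides raw capacity by a replication factor of 3 and then reserves 30% of the result as operational headroom (the throughput and cost figures depend on hardware and are not derived here):

def hdfs_capacity(raw_tb, replication=3, utilization=0.70):
    """Estimate usable and effective HDFS capacity from raw disk capacity.

    Assumptions (mirroring the sample numbers above, not the calculator's
    actual source): usable space is raw capacity divided by the replication
    factor, and effective space keeps ~30% headroom for temporary data,
    rebalancing, and growth.
    """
    usable_tb = raw_tb / replication
    effective_tb = usable_tb * utilization
    return usable_tb, effective_tb

usable, effective = hdfs_capacity(120.0)
print(f"Usable: {usable:.1f} TB, Effective: {effective:.1f} TB")
# Usable: 40.0 TB, Effective: 28.0 TB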

Real-World Examples

Yahoo!

One of the largest HDFS deployments for web search and advertising

  • 40,000+ nodes in production
  • 600 PB of storage capacity
  • 100+ TB processed daily
  • 5+ years of web crawl data

Facebook

Social graph analytics and data warehouse

  • 30,000+ nodes in largest cluster
  • 300 PB total data storage
  • 1+ billion users' data processed
  • 600 TB ingested daily

Netflix

Content recommendation and viewing analytics

  • 15 PB of viewing data
  • 2.5 TB processed per hour
  • 1000+ recommendation models
  • 150+ million viewing sessions analyzed

eBay

Search optimization and business intelligence

  • 10 PB of structured data
  • 5000+ nodes across clusters
  • 500+ TB log data daily
  • Data processed from 50+ countries

Best Practices

Do's

  • Use large block sizes (128-256MB) for big data
  • Implement rack awareness for optimal placement
  • Monitor cluster and DataNode health with regular fsck and dfsadmin reports
  • Use compression (Snappy, LZO) for space savings
  • Configure proper replication factors by data criticality
  • Enable short-circuit reads for local data access
  • Implement regular backup strategies for metadata
  • Use heterogeneous storage for tiered data

Don'ts

  • Don't store large numbers of small files (< 128 MB); see the sketch after this list for why
  • Don't ignore NameNode memory limitations
  • Don't run without a SecondaryNameNode (or Standby NameNode) checkpointing NameNode metadata
  • Don't mix critical and non-critical data in same cluster
  • Don't over-replicate non-critical temporary data
  • Don't ignore disk balancing across DataNodes
  • Don't disable security in production environments
  • Don't neglect regular cluster maintenance tasks
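
The small-files and NameNode-memory items above are two sides of the same constraint: the NameNode keeps every file, directory, and block as an object in heap memory, commonly estimated at roughly 150 bytes per object. A rough back-of-the-envelope sketch (the 150-byte figure is a rule of thumb, not an exact constant):

def namenode_heap_gb(num_files, blocks_per_file=1.0, bytes_per_object=150):
    """Rough heap needed by the NameNode to track the namespace.

    Every file and every block is an in-memory object on the NameNode;
    ~150 bytes per object is a common rule of thumb, not an exact figure.
    """
    objects = num_files * (1 + blocks_per_file)   # file entries + block entries
    return objects * bytes_per_object / 1024 ** 3

# 100 million small files (one block each) vs. the same data packed into
# 1 million large files of ~8 blocks each (e.g. 1 GB files with 128 MB blocks)
print(f"{namenode_heap_gb(100_000_000):.1f} GB")                    # ~27.9 GB of heap
print(f"{namenode_heap_gb(1_000_000, blocks_per_file=8):.1f} GB")   # ~1.3 GB of heap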

Core Concepts Deep Dive

Master-slave distributed file system designed for storing massive datasets across clusters of commodity hardware with automatic fault tolerance and high throughput.

Key Points:

  • Master-slave architecture with single NameNode and multiple DataNodes
  • Large block size (128MB-256MB) optimized for sequential access patterns
  • Write-once-read-many model designed for streaming data access
  • Automatic data replication (default factor of 3) for fault tolerance
  • Rack-aware placement strategy optimizing network bandwidth usage
  • Strong consistency guarantees for metadata and data integrity

Implementation Example:

# Core HDFS Configuration (hdfs-site.xml)
<configuration>
  <!-- Block size configuration -->
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value> <!-- 256MB blocks -->
    <description>Default block size for HDFS files</description>
  </property>
  
  <!-- Replication factor -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default replication factor</description>
  </property>
  
  <!-- NameNode directories -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///hadoop/hdfs/namenode</value>
    <description>NameNode metadata storage</description>
  </property>
  
  <!-- DataNode directories -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///hadoop/hdfs/datanode</value>
    <description>DataNode block storage</description>
  </property>
  
  <!-- Security and permissions -->
  <property>
    <name>dfs.permissions.enabled</name>
    <value>true</value>
  </property>
  
  <!-- Heartbeat interval -->
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>
    <description>DataNode heartbeat interval in seconds</description>
  </property>
</configuration>

# Basic HDFS Operations
# Format NameNode (initial setup only)
hdfs namenode -format

# Start HDFS services
start-dfs.sh

# Basic file operations
hdfs dfs -mkdir /user/data
hdfs dfs -put local_file.txt /user/data/
hdfs dfs -ls /user/data/
hdfs dfs -cat /user/data/local_file.txt
hdfs dfs -get /user/data/local_file.txt downloaded_file.txt

# Cluster health check
hdfs fsck / -files -blocks -locations
hdfs dfsadmin -report

# Python HDFS Client Example (uses the 'hdfs' WebHDFS client: pip install hdfs)
from hdfs import InsecureClient

class HDFSManager:
    def __init__(self, namenode_url='http://localhost:9870'):
        self.client = InsecureClient(namenode_url)
    
    def upload_file(self, local_path, hdfs_path, overwrite=False):
        """Upload local file to HDFS"""
        try:
            with open(local_path, 'rb') as local_file:
                self.client.write(hdfs_path, local_file, overwrite=overwrite)
            print(f"Uploaded {local_path} to {hdfs_path}")
        except Exception as e:
            print(f"Upload failed: {e}")
    
    def download_file(self, hdfs_path, local_path):
        """Download file from HDFS to local filesystem"""
        try:
            # chunk_size > 0 makes the reader yield byte chunks for streaming to disk
            with self.client.read(hdfs_path, chunk_size=65536) as hdfs_file:
                with open(local_path, 'wb') as local_file:
                    for chunk in hdfs_file:
                        local_file.write(chunk)
            print(f"Downloaded {hdfs_path} to {local_path}")
        except Exception as e:
            print(f"Download failed: {e}")
    
    def list_directory(self, hdfs_path='/'):
        """List directory contents"""
        try:
            contents = self.client.list(hdfs_path, status=True)
            for item, status in contents:
                file_type = 'DIR' if status['type'] == 'DIRECTORY' else 'FILE'
                size = status.get('length', 0)
                print(f"{file_type}	{size}	{item}")
        except Exception as e:
            print(f"List failed: {e}")

# Usage example
hdfs = HDFSManager()
hdfs.upload_file('large_dataset.csv', '/user/data/large_dataset.csv')
hdfs.list_directory('/user/data/')

Additional Use Cases

Data Lake Storage & Analytics

Centralized repository storing petabytes of raw data in native format for machine learning, analytics, and business intelligence across the enterprise.

Store structured and unstructured data cost-effectively with schema-on-read capabilities, integrating with Spark, Hive, and Presto for flexible data exploration.
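
As a sketch of the schema-on-read pattern described above, assuming Spark is installed with access to the cluster and using a hypothetical /data/lake path layout and column names (Hive and Presto would expose the same files through external tables):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-exploration").getOrCreate()

# Schema-on-read: raw JSON events sit in HDFS as-is; a schema is inferred
# (or supplied) only at query time. Paths and column names are hypothetical.
events = spark.read.json("hdfs:///data/lake/raw/events/")
events.printSchema()

# Curate a derived dataset back into the lake as Parquet for BI tools.
(events.groupBy("event_type")
       .count()
       .write.mode("overwrite")
       .parquet("hdfs:///data/lake/curated/event_counts/"))

spark.stop()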

Log Aggregation & Monitoring

Scalable ingestion and storage of high-volume application logs, system metrics, and event streams for real-time monitoring and historical analysis.

Collect logs from distributed applications using Flume or Kafka, store in HDFS for long-term retention, and process with MapReduce/Spark for insights.
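
A minimal sketch of the processing step of that pipeline, assuming Flume or Kafka has already landed plain-text logs in HDFS under a hypothetical date-partitioned path:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-error-summary").getOrCreate()

# Hypothetical layout: Flume/Kafka sinks write plain-text logs to
# /logs/app/<yyyy-MM-dd>/ in HDFS; the path and log format are assumptions.
logs = spark.read.text("hdfs:///logs/app/2024-01-15/")

# Count ERROR lines per component, extracting the [component] tag from each line.
errors = (logs.filter(F.col("value").contains("ERROR"))
              .withColumn("component", F.regexp_extract("value", r"\[([^\]]+)\]", 1))
              .groupBy("component")
              .count()
              .orderBy(F.desc("count")))

errors.show(20, truncate=False)
spark.stop()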

Scientific Data Processing

High-performance platform for storing and processing large scientific datasets, simulation results, and research data with parallel computing capabilities.

Support collaborative research with shared datasets, reproducible workflows, and integration with scientific computing frameworks and visualization tools.

Media & Content Management

Petabyte-scale storage for digital media assets including high-resolution video, audio, and images with parallel processing for transcoding workflows.

Optimize for large file storage, integrate with CDNs for content distribution, and provide metadata management for media asset search and discovery.

Enterprise Backup & Archival

Cost-effective long-term data retention solution with 99.999% durability through triple replication and automated lifecycle management.

Replace expensive proprietary storage systems with commodity hardware, ensure regulatory compliance, and integrate with existing enterprise backup tools.
