
Apache YARN

Master Apache YARN: resource management, job scheduling, and cluster orchestration in Hadoop.


What is Apache YARN?

Apache YARN (Yet Another Resource Negotiator) is the resource-management and job-scheduling layer introduced in Hadoop 2.0. It allows multiple data processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to run against data stored in a single platform, rather than limiting the cluster to MapReduce alone.

YARN separates the resource management functionality from the programming model by splitting the JobTracker responsibilities into separate daemons: ResourceManager for resource management, NodeManager for node monitoring, and ApplicationMaster for application lifecycle management. This architecture allows multiple applications to run simultaneously while efficiently sharing cluster resources.
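Because the ResourceManager is the central authority, its state is a natural starting point for monitoring. A minimal sketch of querying its REST API follows; the hostname is an assumption, and the response below is an abridged, illustrative payload (field names follow the ResourceManager Cluster Metrics API):

```python
import json

RM_HOST = "rm.example.com"  # assumption: your ResourceManager host

def metrics_url(host, port=8088):
    """Build the ResourceManager cluster-metrics endpoint URL."""
    return f"http://{host}:{port}/ws/v1/cluster/metrics"

# A live fetch would look like:
#   from urllib.request import urlopen
#   payload = json.load(urlopen(metrics_url(RM_HOST)))
# Illustrative response shape (abridged, values made up for this example):
sample = json.loads("""
{"clusterMetrics": {"appsRunning": 13, "containersAllocated": 68,
 "totalMB": 139264, "allocatedMB": 118272, "totalNodes": 10}}
""")

m = sample["clusterMetrics"]
utilization = m["allocatedMB"] / m["totalMB"]
print(f"{m['containersAllocated']} containers, {utilization:.0%} memory used")
```

The same endpoint family (`/ws/v1/cluster/apps`, `/ws/v1/cluster/nodes`) backs the ResourceManager web UI mentioned below.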

YARN Cluster Performance Calculator

Example sizing: a cluster with 136 GB of RAM and 70 vcores available for containers can run about 68 containers, supporting roughly 13 concurrent applications and 22 jobs per hour at 85% resource utilization, while tolerating the failure of up to 30% of its nodes.
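The sizing math behind these numbers can be sketched as follows: container count is bounded by whichever resource runs out first. The container size used here (2 GB RAM, 1 vcore) is an illustrative assumption:

```python
# Illustrative cluster-sizing sketch (container size is an assumption).
TOTAL_MEM_MB = 136 * 1024   # 136 GB available for containers
TOTAL_VCORES = 70
CONTAINER_MEM_MB = 2048     # assumed per-container memory request
CONTAINER_VCORES = 1        # assumed per-container vcore request

# The scarcer resource caps the container count.
max_containers = min(TOTAL_MEM_MB // CONTAINER_MEM_MB,
                     TOTAL_VCORES // CONTAINER_VCORES)
print(max_containers)  # → 68 (memory-bound: the 70 vcores could host 70)
```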

YARN Core Components

ResourceManager

Central authority managing cluster resources and scheduling applications globally.

• Global resource allocation
• Application scheduling
• Queue management
• High availability support
• Web UI and REST APIs

NodeManager

Per-node agent managing containers and monitoring local resources.

• Container lifecycle management
• Local resource monitoring
• Log aggregation
• Health checking
• Security enforcement

ApplicationMaster

Per-application coordinator managing task execution and resource negotiation.

• Resource negotiation
• Task coordination
• Progress monitoring
• Failure handling
• Application-specific logic

Containers

Resource allocation units providing isolated execution environments for tasks.

• CPU, memory, disk allocation
• Process isolation
• Environment setup
• Resource monitoring
• Cleanup on completion

Real-World YARN Implementations

Yahoo

Original creator running massive clusters for web search and advertising analytics.

  • 40,000+ node clusters
  • Mixed MapReduce and Spark workloads
  • Multi-tenant resource sharing
  • Real-time and batch processing

eBay

Manages complex e-commerce analytics and recommendation systems.

  • User behavior analysis
  • Fraud detection systems
  • Recommendation engines
  • Financial reporting pipelines

LinkedIn

Powers professional networking analytics and machine learning workflows.

  • Member connection analysis
  • Job recommendation algorithms
  • Content feed optimization
  • A/B testing frameworks

Financial Services

Banks use YARN for risk analytics, regulatory reporting, and fraud detection.

  • Risk calculation workflows
  • Regulatory compliance reports
  • Real-time fraud detection
  • Customer analytics pipelines

YARN Configuration Examples

Resource Configuration

yarn-site.xml
<configuration>
  <!-- ResourceManager Configuration -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.com</value>
  </property>
  
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>rm.example.com:8088</value>
  </property>
  
  <!-- NodeManager Resource Configuration -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
    <description>Total memory available for containers</description>
  </property>
  
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
    <description>Total CPU cores available for containers</description>
  </property>
  
  <!-- Container Memory Limits -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>
  
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  
  <!-- Virtual Memory Settings -->
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
  </property>
  
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Disable virtual memory checking</description>
  </property>
</configuration>
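With the values above, each NodeManager offers 12288 MB and 8 vcores, and every request is clamped into the scheduler's [minimum, maximum] allocation range. A rough per-node capacity sketch (the 1 GB / 1 vcore container request is an illustrative assumption, and real YARN additionally rounds requests up in increments of the minimum allocation):

```python
# Per-node container capacity under the yarn-site.xml values above.
NODE_MEM_MB = 12288
NODE_VCORES = 8
MIN_ALLOC_MB, MAX_ALLOC_MB = 512, 8192

def clamp(requested_mb):
    """Clamp a container request into the scheduler's allocation bounds."""
    return max(MIN_ALLOC_MB, min(requested_mb, MAX_ALLOC_MB))

container_mb = clamp(1024)  # assumed 1 GB container request
per_node = min(NODE_MEM_MB // container_mb, NODE_VCORES // 1)
print(per_node)  # → 8: vcore-bound, since 12288 MB could fit 12 containers
```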

Fair Scheduler Configuration

fair-scheduler.xml
<allocations>
  <!-- Production Queue -->
  <queue name="production">
    <weight>3.0</weight>
    <minResources>4096 mb,4 vcores</minResources>
    <maxResources>16384 mb,16 vcores</maxResources>
    <maxRunningApps>10</maxRunningApps>
    <schedulingPolicy>fair</schedulingPolicy>
    <aclSubmitApps>prod_users</aclSubmitApps>
  </queue>
  
  <!-- Development Queue -->
  <queue name="development">
    <weight>1.0</weight>
    <minResources>2048 mb,2 vcores</minResources>
    <maxResources>8192 mb,8 vcores</maxResources>
    <maxRunningApps>5</maxRunningApps>
    <schedulingPolicy>drf</schedulingPolicy>
    <aclSubmitApps>dev_users</aclSubmitApps>
  </queue>
  
  <!-- Queue Placement Policy -->
  <queuePlacementPolicy>
    <rule name="specified" create="false" />
    <rule name="primaryGroup" create="false" />
    <rule name="user" create="true" />
    <rule name="default" queue="development"/>
  </queuePlacementPolicy>
  
  <!-- Global Settings -->
  <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
  <queueMaxAppsDefault>15</queueMaxAppsDefault>
  <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
  <defaultFairSharePreemptionTimeout>600</defaultFairSharePreemptionTimeout>
</allocations>
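When both queues are busy, the Fair Scheduler divides the cluster in proportion to the weights above, bounded by each queue's minResources/maxResources. A simplified first-pass sketch (the cluster size is an assumption, and real fair-share computation redistributes capacity left over after a queue hits its cap):

```python
# First-pass fair-share sketch for the two queues defined above.
CLUSTER_MB = 24576  # assumption: 24 GB of schedulable memory
weights = {"production": 3.0, "development": 1.0}
max_mb = {"production": 16384, "development": 8192}

total_weight = sum(weights.values())
# Weight-proportional share, capped at each queue's maxResources.
share = {q: min(CLUSTER_MB * w / total_weight, max_mb[q])
         for q, w in weights.items()}
print(share)  # production 18432 MB capped to 16384; development gets 6144
```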

YARN Best Practices

✅ Do

  • Configure appropriate memory and CPU allocations per node
  • Use Fair or Capacity Scheduler for multi-tenant environments
  • Enable ResourceManager high availability for production
  • Monitor resource utilization and queue performance
  • Set up log aggregation for centralized log management
  • Configure preemption policies for resource guarantees
  • Use node labels for heterogeneous hardware
  • Implement health checks for NodeManager nodes

❌ Don't

  • Allocate all node memory to YARN (leave 15-20% for OS)
  • Enable virtual memory checking (often causes issues)
  • Use FIFO scheduler in multi-tenant environments
  • Ignore container memory leaks and resource violations
  • Run ResourceManager without high availability
  • Set container limits too small (causes thrashing)
  • Forget to configure queue ACLs and user limits
  • Neglect monitoring and alerting for cluster health
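The first "Don't" above translates directly into a sizing rule for `yarn.nodemanager.resource.memory-mb`: reserve headroom for the OS and co-located daemons (DataNode, NodeManager) before handing memory to containers. A minimal sketch, assuming a 128 GB node and a 15% reserve:

```python
# Sizing sketch: leave 15-20% of node RAM outside YARN's control.
NODE_RAM_MB = 128 * 1024  # assumption: 128 GB node
OS_RESERVE = 0.15         # reserve 15% for OS and non-YARN daemons

yarn_mem_mb = int(NODE_RAM_MB * (1 - OS_RESERVE))
print(yarn_mem_mb)  # candidate value for yarn.nodemanager.resource.memory-mb
```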