
Resource Management with YARN:
YARN Past, Present and Future
Anubhav Dhoot
Software Engineer
Cloudera
Resource Management
[Diagram: MapReduce, Impala, and Spark all running on top of YARN (dynamic resource management)]
YARN (Yet Another Resource Negotiator)
Traditional Operating System
• Storage: File System
• Execution/Scheduling: Processes / Kernel Scheduler

Hadoop
• Storage: Hadoop Distributed File System (HDFS)
• Execution/Scheduling: Yet Another Resource Negotiator (YARN)
Overview of Talk
• History of YARN
• Recent features
• Ongoing features
• Future
WHY YARN
Traditional Distributed Execution Engines
[Diagram: Clients submit work to a single Master, which assigns Tasks to Workers across the cluster]
MapReduce v1 (MR1)
[Diagram: Clients submit jobs to the JobTracker, which assigns Map and Reduce tasks to TaskTrackers]
JobTracker tracks every task in the cluster!
MR1 Utilization
[Diagram: a 4 GB node divided into fixed 1024 MB Map and Reduce slots]
Fixed-size slot model forces slots large enough for the biggest task!
Running multiple frameworks…
[Diagram: multiple frameworks side by side, each with its own Master, its own Clients, and its own set of Workers running Tasks]
YARN to the rescue!
• Scalability: Track only applications, not all tasks.
• Utilization: Allocate only as many resources as needed.
• Multi-tenancy: Share resources between frameworks and users.
Physical resources – memory, CPU, disk, network
YARN Architecture
[Diagram: Clients submit applications to the ResourceManager, which tracks Cluster State and Applications State; each NodeManager hosts an Application Master and Containers for running applications]
MR1 to YARN/MR2 functionality mapping
• JobTracker is split into
  o ResourceManager – cluster management, scheduling, and application state handling
  o ApplicationMaster – handles tasks (containers) per application (e.g. an MR job)
  o JobHistoryServer – serves MR history
• TaskTracker maps to NodeManager
EARLY FEATURES
Handling faults on Workers
[Diagram: when a NodeManager fails, the ResourceManager's Cluster State reflects the loss and the Application Master gets replacement containers on the remaining NodeManagers]
Master Fault-tolerance - RM Recovery
[Diagram: the ResourceManager persists Applications State to an RM Store and rebuilds its state from the store after a restart]
Master Node Fault tolerance
High Availability (Active / Standby)
[Diagram: an Active ResourceManager and a Standby ResourceManager, each with an Elector, share the RM Store and coordinate through ZooKeeper (ZK); when the Active RM fails, the Standby is elected Active and Clients, NodeManagers, and Application Masters reconnect to it]
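A minimal yarn-site.xml sketch for this Active/Standby setup; the RM ids, hostnames, and ZooKeeper quorum below are illustrative placeholders, not values from the talk:

<!-- yarn-site.xml: RM High Availability (hostnames are illustrative) -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>
<property>
  <!-- ZooKeeper quorum used for leader election (and the ZK-backed RM Store) -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>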
Scheduler
• Lives inside the ResourceManager
• Decides who gets to run when and where
• Uses “Queues” to describe organization needs
• Applications are submitted to a queue
• Two schedulers out of the box (configuration sketch below)
  o Fair Scheduler
  o Capacity Scheduler
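The scheduler implementation is selected in yarn-site.xml; a sketch showing the two stock options, with the Fair Scheduler active and the Capacity Scheduler left as a commented alternative:

<!-- yarn-site.xml: pick one scheduler implementation -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- Fair Scheduler -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  <!-- or, for the Capacity Scheduler:
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  -->
</property>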
Fair Scheduler Hierarchical Queues
Root – Mem Capacity: 12 GB, CPU Capacity: 24 cores
  • Marketing – Fair Share Mem: 4 GB, Fair Share CPU: 8 cores
    o Jim’s Team – Fair Share Mem: 2 GB, Fair Share CPU: 4 cores
  • R&D – Fair Share Mem: 4 GB, Fair Share CPU: 8 cores
    o Bob’s Team – Fair Share Mem: 2 GB, Fair Share CPU: 4 cores
  • Sales – Fair Share Mem: 4 GB, Fair Share CPU: 8 cores
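A fair-scheduler.xml allocation-file sketch that would produce a hierarchy like the one above; the queue names are illustrative, and with equal weights each of the three top-level queues gets one third of the 12 GB / 24 core cluster (4 GB / 8 cores):

<!-- fair-scheduler.xml (allocation file): illustrative queue hierarchy -->
<allocations>
  <queue name="marketing">
    <weight>1.0</weight>
    <queue name="jims_team"/>
  </queue>
  <queue name="r_and_d">
    <weight>1.0</weight>
    <queue name="bobs_team"/>
  </queue>
  <queue name="sales">
    <weight>1.0</weight>
  </queue>
</allocations>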
Fair Scheduler Queue Placement Policies
<queuePlacementPolicy>
  <rule name="specified" />
  <rule name="primaryGroup" create="false" />
  <rule name="default" />
</queuePlacementPolicy>
Multi-Resource Scheduling
● Node capacities expressed in both memory and CPU
● Memory in MB and CPU in terms of vcores
● Scheduler uses the dominant resource for making decisions
Multi-Resource Scheduling
● Queue 1 Usage: 12 GB (33% of capacity), 3 cores (25% of capacity) – its dominant resource is memory
● Queue 2 Usage: 10 GB (28% of capacity), 6 cores (50% of capacity) – its dominant resource is CPU
Multi-Resource Enforcement
● YARN kills containers that use too much memory
● CGroups for limiting CPU
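A yarn-site.xml sketch for turning on CGroups-based CPU enforcement via the LinuxContainerExecutor; the cgroup hierarchy value is the usual default, and additional setup (e.g. container-executor.cfg) is omitted here:

<!-- yarn-site.xml: CGroups CPU enforcement (sketch) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- cgroup hierarchy under which YARN creates per-container groups -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>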
RECENTLY ADDED FEATURES
RM recovery without losing work
• Preserves running containers across an RM restart (configuration sketch below)
• NM no longer kills containers on resync
• AM re-registers with the RM on resync
RM recovery without losing work
[Diagram: on RM restart, Applications State is reloaded from the RM Store while Application Masters and Containers keep running on the NodeManagers]
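A yarn-site.xml sketch for work-preserving RM recovery, assuming the ZooKeeper-backed state store:

<!-- yarn-site.xml: work-preserving RM recovery (sketch) -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- RM Store implementation; ZKRMStateStore keeps state in ZooKeeper -->
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>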
NM Recovery without losing work
• NM stores each container and its associated state in a local store (configuration sketch below)
• On restart, state is reconstructed from the store
• Default implementation uses LevelDB
• Supports rolling restarts with no user impact
NM Recovery without losing work
[Diagram: the NodeManager writes container state to a local State Store, so Application Masters and Containers survive an NM restart]
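A yarn-site.xml sketch for NM recovery; the recovery directory path is illustrative, and pinning yarn.nodemanager.address to a fixed port is the usual companion setting so containers can reconnect after a restart:

<!-- yarn-site.xml: NodeManager recovery (sketch) -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- local directory for the NM's LevelDB state store (illustrative path) -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<property>
  <!-- fixed RPC port instead of an ephemeral one, so state survives restarts -->
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>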
Fair Scheduler Dynamic User Queues
Root – Mem Capacity: 12 GB, CPU Capacity: 24 cores
  • Marketing – Fair Share Mem: 4 GB, Fair Share CPU: 8 cores
    o Moe – Fair Share Mem: 4 GB, Fair Share CPU: 8 cores
  • R&D – Fair Share Mem: 4 GB, Fair Share CPU: 8 cores
    o Larry – Fair Share Mem: 2 GB, Fair Share CPU: 4 cores
  • Sales – Fair Share Mem: 4 GB, Fair Share CPU: 8 cores
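A fair-scheduler.xml placement-policy sketch that creates a per-user queue under each department queue, roughly matching the picture above; rule names are from the Fair Scheduler documentation, and this assumes users' primary groups map to the department queues:

<!-- fair-scheduler.xml: dynamic per-user queues nested under group queues (sketch) -->
<queuePlacementPolicy>
  <rule name="nestedUserQueue">
    <!-- parent queue chosen by the user's primary group, e.g. root.marketing -->
    <rule name="primaryGroup" create="true" />
  </rule>
  <rule name="default" />
</queuePlacementPolicy>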
ONGOING FEATURES
Long Running Apps on Secure Clusters (YARN-896)
● Update tokens of running applications
● Reset the AM failure count to allow multiple failures over a long time
● Need to access logs while the application is running
● Need a way to show progress
Application Timeline Server (YARN-321, YARN-1530)
● Currently we have a JobHistoryServer for MapReduce history
● Generic history server
● Gives information even while a job is running
Application Timeline Server
● Store and serve generic data like when containers ran, container logs
● Apps post app-specific events
  o e.g. MapReduce Attempt Succeeded/Failed
● Pluggable framework-specific UIs
● Pluggable storage backend
  o Default: LevelDB
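A yarn-site.xml sketch for enabling the timeline server with the default LevelDB store; the hostname and path are illustrative placeholders:

<!-- yarn-site.xml: Application Timeline Server (sketch) -->
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- host running the timeline server (illustrative) -->
  <name>yarn.timeline-service.hostname</name>
  <value>timeline.example.com</value>
</property>
<property>
  <!-- local path for the default LevelDB timeline store (illustrative) -->
  <name>yarn.timeline-service.leveldb-timeline-store.path</name>
  <value>/var/lib/hadoop-yarn/timeline</value>
</property>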
Disk scheduling (YARN-2139)
● Disk as a resource in addition to CPU and memory
● Expressed as virtual disks, similar to vcores for CPU
● Dominant resource fairness can handle this on the scheduling side
● Use the CGroups blkio controller for enforcement
Reservation-based Scheduling (YARN-1051)
FUTURE FEATURES
Container Resizing (YARN-1197)
● Change a container’s resource allocation
● Very useful for frameworks like Spark that schedule multiple tasks within a container
● Follows the same paths as acquiring and releasing containers
Admin labels (YARN-796)
● Admin tags nodes with labels (e.g. GPU)
● Applications can include labels in container requests
[Diagram: an Application Master requesting “I want a GPU” gets a container on the NodeManager labeled GPU, beefy rather than the one labeled Windows]
Container Delegation (YARN-1488)
● Problem: a single process wants to run work on behalf of multiple users
● Want to count the resources used against the users that use them
● E.g. Impala or HDFS caching
Container Delegation (YARN-1488)
● Solution: let apps “delegate” their containers to other containers on the same node
● The delegated container never runs
● The framework container gets its resources
● The framework container is responsible for fairness within itself
Questions?
Thank You!
Anubhav Dhoot, Software Engineer, Cloudera
[email protected]