Understanding Hadoop Performance on Lustre

Stephen Skory, PhD | Seagate Technology
Collaborators: Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan
LUG Conference | 15 April 2015
Our live demo of Hadoop on Lustre at Supercomputing, November 17-20, 2014
• Why Hadoop on Lustre?
• LustreFS Hadoop Plugin
• Our Test System
• HDFS vs Lustre Benchmarks
• Performance experiments on Lustre and Hadoop Configuration
• Diskless Node Optimization
• Final Thoughts
Why Hadoop on Lustre?
• Lustre is POSIX compliant and convenient for a wide range of workflows and applications, while HDFS is a distributed object store designed narrowly for the write-once, read-many Hadoop paradigm
• Disaggregate computation and storage: HDFS requires storage to accompany compute, while with Lustre each part can be scaled or resized according to needs
• Lower storage overhead: Lustre is typically backed by RAID with ~40% overhead, while HDFS has a default 200% overhead
• Speed: in our testing we see a nearly 30% TeraSort v2 performance boost over HDFS; when transfer time is considered, the improvement is over 50%
Hadoop on HDFS and on Lustre Data Flow
Typical workflow: ingestion of data before analysis.
• Hadoop on HDFS: 1) transfer (and usually duplicate) the data onto HDFS storage; 2) analyze the data.
• Hadoop on Lustre: 1) analyze the data in place on Lustre storage.
[Diagram: compute nodes either copying data from Lustre storage into HDFS storage before analysis, or reading directly from Lustre storage.]
LustreFS Plugin
• It is a single Java jar file that needs to be placed in the CLASSPATH of whatever service needs to access Lustre files
• Requires only minimal changes to the Hadoop configuration XML files (a configuration sketch follows below)
• Instead of file:/// or hdfs://hostname:8020/, files are accessed using lustrefs:///
• HDFS can co-exist with this plugin, and jobs can run between the two file systems seamlessly
• Hadoop service accounts (e.g. "yarn" or "mapred") need permissions to a directory hierarchy on Lustre for temporary and other accounting data
• Freely available today at http://github.com/seagate/lustrefs, with documentation
• A PR has been issued to include the plugin with Apache Hadoop
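As a rough illustration of how little changes, the steps below sketch a possible setup. The jar filename, the Lustre mount point, the group names, and the exact configuration property shown are assumptions for illustration only; the plugin's README is the authoritative reference.

    # Put the plugin jar where the Hadoop services will find it on their CLASSPATH
    # (filename and destination directory are illustrative).
    cp lustrefs-hadoop-*.jar $HADOOP_HOME/share/hadoop/common/lib/

    # core-site.xml: point the default file system at Lustre via the lustrefs:/// scheme.
    # (Shown here as a comment; check the plugin documentation for any additional
    # properties it requires.)
    #   <property>
    #     <name>fs.defaultFS</name>
    #     <value>lustrefs:///</value>
    #   </property>

    # Give the Hadoop service accounts a directory hierarchy on Lustre for
    # temporary and accounting data (paths and groups are illustrative).
    mkdir -p /mnt/lustre/hadoop/yarn /mnt/lustre/hadoop/mapred
    chown yarn:hadoop   /mnt/lustre/hadoop/yarn
    chown mapred:hadoop /mnt/lustre/hadoop/mapred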
Our Hadoop on Lustre Development Cluster
All benchmarks shown were performed on this system.
• Hadoop rack:
  • 1 Namenode, 18 Datanodes
  • 12 cores and 12 HDDs per Datanode, for a total of 216 of each
  • 19x 10 GbE connections within the rack for Hadoop traffic
  • Each Hadoop node has a second 10 GbE connection to the ClusterStor network, for a total of 190 Gb full-duplex bandwidth
• ClusterStor rack:
  • 2 ClusterStor 6000 SSUs
  • 160 data HDDs
How Much Faster is Hadoop on Lustre?
Terasort timings (transfer time + analysis time = completion time)
[Chart: Terasort completion times, in seconds, for Hadoop 1 on HDFS, Hadoop 1 on Lustre, Hadoop 2 on HDFS, and Hadoop 2 on Lustre.]
• Hadoop 1 on Lustre: 38% faster analysis; 63% faster overall, since the data transfer onto HDFS is not needed
• Hadoop 2 on Lustre: 28% faster analysis; 53% faster overall, since the data transfer onto HDFS is not needed
How Much Faster is Hadoop on Lustre?
HoL Improvement over HDFS
Ecosystem timings compared to HDFS (transfer times not included):

    Benchmark    Hadoop 1   Hadoop 2
    TeraSort         38%        28%
    Sort             17%        15%
    Wordcount         6%         5%
    Mahout            8%         5%
    Pig               1%        -7%
    Hive              0%        -1%

All remaining benchmarks presented were performed on Hadoop v2.
Hadoop on Lustre Performance
Stripe Size Performance
Stripe size is not a strong factor in Hadoop performance, except perhaps for 1 MB stripes in some cases. We recommend 4 MB or greater.
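As a concrete illustration, a 4 MB (or larger) default stripe can be set on the Hadoop working directory with standard Lustre tooling; the path below is an assumption.

    # Set a 4 MB stripe size, striped across all OSTs, on the directory Hadoop
    # jobs read from and write to; new files created there inherit this layout.
    lfs setstripe -S 4M -c -1 /mnt/lustre/hadoop

    # Show the default layout new files in the directory will inherit.
    lfs getstripe -d /mnt/lustre/hadoop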
Hadoop on Lustre Performance
max_rpcs_in_flight and max_dirty_mb
Terasort performance on 373 GB with a 4 MB stripe size, 216 Mappers and 108 Reducers:

    max_rpcs_in_flight   max_dirty_mb   Time
                     8             32   5m04s
                    16            128   5m03s
                   128            512   5m07s

Changing these values does not strongly affect performance.
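For reference, these are standard Lustre client tunables set with lctl on the Hadoop nodes; a minimal sketch using the middle row of the table above:

    # Outstanding RPCs per OSC and the per-OSC dirty page cache, applied to all OSCs.
    lctl set_param osc.*.max_rpcs_in_flight=16
    lctl set_param osc.*.max_dirty_mb=128

    # Read the current values back.
    lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb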
Hadoop on Lustre Performance
Number of Mappers and Reducers
[Chart: Terasort on 373 GB with varying numbers of mappers and reducers.]
The number of mappers and reducers has a much stronger effect on performance than any Lustre client configuration.
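For example, the 216 mapper / 108 reducer split used above is set per job on the Hadoop side; a sketch of such a Terasort run follows, where the jar name and the Lustre paths are assumptions.

    # Generate ~373 GB of input with 216 map tasks (teragen takes the map count
    # from mapreduce.job.maps; each generated row is 100 bytes), then sort it
    # with 108 reduce tasks.
    hadoop jar hadoop-mapreduce-examples.jar teragen \
        -Dmapreduce.job.maps=216 3730000000 lustrefs:///hadoop/terasort-in

    hadoop jar hadoop-mapreduce-examples.jar terasort \
        -Dmapreduce.job.reduces=108 \
        lustrefs:///hadoop/terasort-in lustrefs:///hadoop/terasort-out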
Hadoop on Lustre Performance
Some additional observations
• When it comes to Terasort, and likely other I/O-intensive jobs, most performance gains come from tuning Hadoop rather than Lustre
• In contrast to HDFS, this means that Hadoop performance is less tied to how (or where, in the case of HDFS) the data is stored on disk, which gives users more options in how to tune their applications
• It's best to focus on tuning Hadoop parameters for best performance:
  • We have seen that Lustre does better with fewer writers than the same job going to HDFS (e.g. the number of Mappers for Teragen, or Reducers in Terasort)
  • A typical Hadoop rule of thumb is to build clusters, and run jobs, following the 1:1 core:HDD rule. With Hadoop on Lustre, this isn't necessarily true
  • Particularly for CPU-intensive jobs, performance tuning for Lustre is very similar to typical Hadoop on HDFS
Hadoop on Lustre Diskless Modification
• Standard Hadoop uses local disks for both HDFS and temporary data
• However, when using Hadoop on Lustre on diskless compute nodes, all data needs to go to Lustre, including temporary data (a configuration sketch follows after this list)
• Beginning in 2011, we investigated an idea for optimizing Map/Reduce performance when using Lustre as the temporary data store
• This required a modification to core Hadoop functionality, and we released these changes at SC14:
  • https://github.com/seagate/hadoop-on-lustre
  • https://github.com/seagate/hadoop-on-lustre2
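In the unmodified diskless case this mostly amounts to pointing Hadoop's scratch directories at Lustre. A minimal sketch, assuming a per-node directory on an illustrative Lustre mount point (the two property names are standard Hadoop 2 settings):

    # On each diskless node, create a per-node scratch area on Lustre and point
    # YARN's and MapReduce's local directories at it.
    NODE_TMP=/mnt/lustre/hadoop/tmp/$(hostname)
    mkdir -p "$NODE_TMP"/nm-local-dir "$NODE_TMP"/mapred-local

    # Corresponding settings in yarn-site.xml / mapred-site.xml:
    #   yarn.nodemanager.local-dirs  ->  $NODE_TMP/nm-local-dir
    #   mapreduce.cluster.local.dir  ->  $NODE_TMP/mapred-local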
Map/Reduce Data Flow Basics
[Diagram: Hadoop on HDFS data flow. The Mapper reads a block from an HDFS disk (in most cases local to the mapper, but it can be remote) and writes intermediate data to local disk. The transfer task (Reducer) pulls that data over the network and sorts it in RAM if there is sufficient RAM, or on local disk otherwise. The Reduce output is written as 1 local HDFS copy plus 2 network HDFS copies.]
• HDFS data is split into blocks, and the block size is analogous to Lustre's stripe size. The default block size is 128 MB for Hadoop 2
• The ResourceManager makes a best effort to assign computation (the Mapper) to nodes that already have the data
• When done, the Mapper writes intermediate data to local scratch disk (no replication)
• Reducers collect data from Mappers (reading from local disk) over HTTP (the "shuffle" phase) and sort the incoming data either in RAM or on disk (again, local scratch disk); see the sketch after this list
• When the data is sorted, the Reducer applies the reduction algorithm and writes output to HDFS, where it is 3x replicated
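Whether that sort happens in RAM or spills to the scratch directories is governed by ordinary MapReduce memory settings, independent of the file system. A hedged sketch with illustrative values (not the values used in these benchmarks):

    # -D overrides for a job that parses generic options (e.g. the bundled terasort);
    # cluster-wide defaults would instead go in mapred-site.xml.
    hadoop jar hadoop-mapreduce-examples.jar terasort \
        -Dmapreduce.task.io.sort.mb=512 \
        -Dmapreduce.reduce.shuffle.input.buffer.percent=0.70 \
        lustrefs:///hadoop/terasort-in lustrefs:///hadoop/terasort-out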
Map/Reduce Data Flow Basics
[Diagram: the Hadoop-on-HDFS flow from the previous slide, compared with Hadoop on Lustre with local disks. The Mapper reads its input from Lustre, intermediate data and any sort spill still go to local disk, and the Reducer writes a single network copy of its output to Lustre.]
Using Lustre instead of HDFS is very similar, except that the Mapper reads data from Lustre over the network, and the Reducer output is not triple replicated.
The lower diagram describes the setup used for all the benchmarks presented so far: both Lustre and local disks are attached to each Hadoop compute node.
Hadoop on Lustre Map/Reduce Diskless Modification
[Diagram: three flows compared. Hadoop on HDFS: as before, the Mapper writes intermediate data to local disk, the transfer task (Reducer) shuffles and sorts it, and the Reduce output is 1 local plus 2 network HDFS copies. Hadoop on Lustre, diskless, unmodified: intermediate data and sort spill go to Lustre instead of local disk, the transfer task (Reducer) still runs, and the Reduce output is a single network copy to Lustre. Hadoop on Lustre, diskless, modified: the separate transfer task is eliminated, so the Mapper output already on Lustre feeds the sort and Reduce directly; the Reduce output is again a single network copy to Lustre.]
Hadoop on Lustre Diskless Modification
[Diagram: a Hadoop node in the unmodified and modified cases, showing the Mapper and Reducer IO caches and the temporary files saved to Lustre.]
Terasort timings:

    Size     Unmodified   Modified
    100 GB        131 s      155 s
    500 GB        538 s      558 s

The modified method is slower!
Takeaway: IO caching speeds up the unmodified method.
Final Thoughts
• We have explored three methods of Hadoop on Lustre:
  • Using local disk(s) for temporary storage
  • Using Lustre for temporary storage with no modifications to Map/Reduce
  • Using Lustre for temporary storage, with modifications to Map/Reduce
• We find that each method in this list is slower than the methods listed before it
• The local disks do not need to be anywhere near "big" (where "big" means the size of the largest dataset analyzed), but it is best to have some kind of local disk for temporary data
• Thank you to my collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan