Slides - Splunk.com

Copyright © 2014 Splunk Inc.

Exploratory Analytics for Shared-service Hadoop Clusters
Sagi Zelnick, Principal Architect, Yahoo

Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Overview
- Hadoop @ Yahoo: 8+ years of innovation
- Hunk @ Yahoo: organization-wide investment for next 3+ years
- Yahoo providing Hunk as a self-service to explore, analyze & visualize data in HDFS
  – Hunk allows for visually browsing very complex tables (250+ fields)
  – Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
  – Cuts down on development cycles through faster interaction with results
  – Built-in graphs/charts make for a powerful solution for many situations

History of Hadoop Innovation @ Yahoo

Over 600PB of Hadoop Storage (Over Half an Exabyte)
- Very large clusters used by many groups across the enterprise
- More than 35,000 individual datanodes
- Hadoop is provided as a service
- Multiple cluster types such as research, dev, sandbox and production
- Services such as HBase, Hive, Oozie, etc.
- Users are free to run jobs, but have resource constraints
- Maintained by the Grid Operations Group

Integrated Analytics Platform for Diverse Data Stores
Full-featured, Integrated Product: Explore, Analyze, Visualize, Dashboards, Share
Fast Insights for Everyone
Works with What You Have Today
Hadoop Client Libraries; Hadoop Clusters
Streaming Resource Libraries; NoSQL and Other Data Stores

Improving Operational Visibility with Hunk
- We pointed Hunk at many operational logs and event data we already had on the grid
- This includes system metrics, HDFS ops, JVM stats and YARN metrics
- Created instrumentation to measure usage per user and job
- Analyzed terabytes of NameNode audit logs
- Job history leveraged for visualizing usage/growth and historical views
- Custom events for HBase statistics

Tracking Hadoop Performance & Metrics in Hunk

Use Case                         | Customer                  | Benefits
System metrics from 35k nodes    | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights of resources | All Grid Customers        | Track organic growth
Job performance                  | All Grid Customers        | Improved job SLAs
HBase metrics                    | All Grid Customers        | Track region/RS/table metrics…
Job logs in near real-time       | All Grid Customers / Ops  | Search for errors directly from the YARN logs

Measuring NameNode Performance Pre & Post Upgrades
- Historical visualizations of all operations
- Search data in Hunk from billions of NameNode events
- Measure JVM and memory usage
- Insights into operational performance

✓ 10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)
Visualization Using Hunk

[Chart: operation counts over _time; x-axis ticks Fri May 16, Sun May 18, Tue May 20, 2014; y-axis ticks 200,000,000 to 600,000,000. Series: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations, num_blockChecksumOp]

Statistics, first row (_time = 2014-05-15 01:00):
  num_BlockReports            1124437.735902
  num_CopyBlockOperations     46721126.819672
  num_HeartBeats              514957.384098
  num_ReadBlockOperations     12930433.077869
  num_ReadMetadataOperations  0.000000
  num_ReplaceBlockOperations  94210832.786885
  num_WriteBlockOperations    63512425.967213
  num_blockChecksumOp         13975.306557
Sample Troubleshooting in Hunk of 750 Million Events

[Chart: operation counts over _time; x-axis ticks 12:00 PM Tue May 20, 2014, 12:00 AM Wed May 21, 12:00 PM; y-axis ticks 250,000,000 to 1,000,000,000. Series: num_BlockReports, num_CopyBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_WriteBlockOperations, num_blockChecksumOp]

Statistics, first row (_time = 2014-05-20 01:15:00; values as shown, cut off by the column width in the screenshot):
  num_BlockReports            105604
  num_CopyBlockOperations     34677652.
  num_HeartBeats              12412
  num_ReadBlockOperations     26242490.8
  num_ReadMetadataOperations  0.000000
  num_ReplaceBlockOperations  88112292.80
  num_WriteBlockOperations    126478486.
  num_blockChecksumOp         51405.34
Big Picture Plus Granular Details

Analyzing NameNode RPC Calls (Troubleshooting)
- Who is making what RPC call (open, listStatus, create, etc.)
- How often they are making these RPC calls
- Which IP/host the calls are coming from
- Search and visualize historical data from billions of events
- Prevent NameNode abuse/misuse

Visualizing 834 Million Discrete Events … Continued

Queue Insights (Capacity & Provisioning)
- Each Hadoop job runs in a specific queue
- We track every aspect of the YARN framework
- Immediate queue performance and configuration profiling via job history server
- Historical views and trends that enable better capacity management
- Improved queue utilization and allocation management

Visualizing Queues

Search (Splunk 6.1.0), time range: Last 7 days

    index="jobsummary_logs_all_red" cluster="dilithium*"
    | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds)
    | eval gb_hours=((total_slot_seconds * 0.5) / 3600)
    | eval gb_hours=round(gb_hours)
    | timechart span=6h sum(gb_hours) as gb_hours by queue

✓ 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
[Chart: gb_hours by queue over _time; x-axis ticks Wed May 21 to Mon May 26, 2014; y-axis ticks 200,000 to 600,000]

Statistics, first row (_time = 2014-05-20 18:00):
  OTHER                4154
  apg_dailyhigh_p3     45512
  apg_dailymedium_p5   7071
  apg_hourlyhigh_p1    25643
  apg_hourlylow_p4     12111
  apg_hourlymedium_p2  29664
  apg_p7               3473
  curveball_large      26547
  curveball_med        14192
  slingshot            60875
  slingstone           45376
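The arithmetic in the "Visualizing Queues" search can be sketched in Python to make the unit conversion explicit. This is a minimal sketch, not Yahoo's or Splunk's implementation. It assumes that the 0.5 factor in the query means 0.5 GB of memory per slot (the slides do not say so), that whole-number rounding is half-up, and that 6-hour buckets are aligned to the Unix epoch with no time zone handling. The sample events and their values are made up for illustration; only the field names and queue names come from the slides.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

GB_PER_SLOT = 0.5            # assumption: the 0.5 in the query is GB per slot
SPAN = timedelta(hours=6)    # mirrors "timechart span=6h"
EPOCH = datetime(1970, 1, 1)


def gb_hours(map_slot_seconds, reduce_slot_seconds):
    """Convert slot-seconds to whole GB-hours, as the eval steps do."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    hours = total_slot_seconds * GB_PER_SLOT / 3600
    # Half-up rounding to a whole number (assumed to match the query's round()).
    return math.floor(hours + 0.5)


def bucket_start(ts, span=SPAN):
    """Return the start of the epoch-aligned bucket that contains ts."""
    return EPOCH + ((ts - EPOCH) // span) * span


def gb_hours_by_queue(events):
    """Sum GB-hours per (bucket start, queue), like 'sum(gb_hours) by queue'."""
    totals = defaultdict(int)
    for event in events:
        key = (bucket_start(event["time"]), event["queue"])
        totals[key] += gb_hours(event["mapSlotSeconds"], event["reduceSlotSeconds"])
    return dict(totals)


# Hypothetical job-summary events; the numbers are illustrative only.
sample_events = [
    {"time": datetime(2014, 5, 20, 18, 30), "queue": "slingshot",
     "mapSlotSeconds": 7200, "reduceSlotSeconds": 0},
    {"time": datetime(2014, 5, 20, 19, 0), "queue": "slingshot",
     "mapSlotSeconds": 3600, "reduceSlotSeconds": 3600},
    {"time": datetime(2014, 5, 21, 0, 10), "queue": "apg_p7",
     "mapSlotSeconds": 36000, "reduceSlotSeconds": 0},
]

result = gb_hours_by_queue(sample_events)
for (start, queue), total in sorted(result.items()):
    print(start, queue, total)
```

Rounding each event before summing follows the order of operations in the query; rounding after the sum would give slightly different totals for jobs with fractional GB-hours.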
Self-Service Job Reports
- Each job is unique and so are the map and reduce elements
- How to start analyzing jobs?
- Historical job performance and profiling enables in-depth performance tuning
- Long-term historical views and trending of growth

Time range: Yesterday

    index="jobsummary_logs_all_blue" cluster="*" user="gmon"
    | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds)
    | eval gb_hours=((total_slot_seconds * 0.5) / 3600)
    | eval gb_hours=round(gb_hours,2)
    | eval runtime=(finishTime-submitTime)/1000
    | stats sum(gb_hours) as gb-hours avg(runtime) as run_mins by cluster user queue jobName jobId status
    | eval run_mins=round(run_mins/60,2)
    | sort -gb-hours

✓ 4,871 events (5/26/14 12:00:00.000 AM to 5/27/14 12:00:00.000 AM)
Statistics (4,871)

cluster | user | queue   | jobName                               | jobId                   | status    | gb-hours | run_mins
cobalt  | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_315271 | SUCCEEDED | 108.00   | 33.07
cobalt  | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_312700 | SUCCEEDED | 104.00   | 37.37
cobalt  | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_309715 | SUCCEEDED | 88.00    | 29.83
cobalt  | gmon | gridops | distcp:                               | job_1398982765383_309921 | SUCCEEDED | 36.00    | 68.49
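The per-job report above can be sketched in Python to show how the two aggregates are derived. This is a minimal sketch under stated assumptions: `submitTime` and `finishTime` are taken to be epoch milliseconds (the division by 1000 in the query suggests this, but the slides do not confirm it), and the 0.5 factor is taken to be GB per slot. The function name `job_report`, the job IDs, and all numeric inputs are hypothetical.

```python
from collections import defaultdict

GB_PER_SLOT = 0.5  # assumption: the 0.5 in the query is GB per slot

GROUP_FIELDS = ("cluster", "user", "queue", "jobName", "jobId", "status")


def job_report(events):
    """Group events per job, sum GB-hours, average runtime, sort by GB-hours."""
    groups = defaultdict(lambda: {"gb_hours": 0.0, "runtimes": []})
    for event in events:
        key = tuple(event[field] for field in GROUP_FIELDS)
        slot_seconds = event["mapSlotSeconds"] + event["reduceSlotSeconds"]
        # Round each event to 2 decimals before summing, as the query does.
        groups[key]["gb_hours"] += round(slot_seconds * GB_PER_SLOT / 3600, 2)
        # Assumed epoch milliseconds, converted to seconds.
        groups[key]["runtimes"].append((event["finishTime"] - event["submitTime"]) / 1000)

    rows = []
    for key, agg in groups.items():
        row = dict(zip(GROUP_FIELDS, key))
        avg_runtime_seconds = sum(agg["runtimes"]) / len(agg["runtimes"])
        row["gb-hours"] = agg["gb_hours"]
        row["run_mins"] = round(avg_runtime_seconds / 60, 2)
        rows.append(row)

    # Largest consumers first, like "sort -gb-hours".
    rows.sort(key=lambda r: r["gb-hours"], reverse=True)
    return rows


# Hypothetical events; IDs and numbers are illustrative only.
sample_events = [
    {"cluster": "cobalt", "user": "gmon", "queue": "gridops",
     "jobName": "example-copy", "jobId": "job_B", "status": "SUCCEEDED",
     "mapSlotSeconds": 259200, "reduceSlotSeconds": 0,
     "submitTime": 1000, "finishTime": 4110400},
    {"cluster": "cobalt", "user": "gmon", "queue": "grideng",
     "jobName": "example-pig", "jobId": "job_A", "status": "SUCCEEDED",
     "mapSlotSeconds": 700000, "reduceSlotSeconds": 77600,
     "submitTime": 0, "finishTime": 1984200},
]

rows = job_report(sample_events)
for row in rows:
    print(row["jobId"], row["gb-hours"], row["run_mins"])
```

Because each job ID appears once per group here, the average runtime is simply that job's runtime; the averaging only matters if several summary events share the same grouping fields.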
...It’s Not Just Logs We’re Looking At

More data to tap into with the metastore / Hive sources
- Using the metastore, we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up-front
- Visualize very complex tables (250+ fields)
- Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
- Built-in aggregates and graphs/charts
- Accelerates development workflow by providing faster interaction with data

THANK YOU