Copyright © 2014 Splunk Inc.

Exploratory Analytics for Shared-service Hadoop Clusters
Sagi Zelnick, Principal Architect, Yahoo

Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Overview
• Hadoop @ Yahoo: 8+ years of innovation
• Hunk @ Yahoo: organization-wide investment for the next 3+ years
• Yahoo is providing Hunk as a self-service to explore, analyze & visualize data in HDFS
  – Hunk allows for visually browsing very complex tables (250+ fields)
  – Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
  – Cuts down on development cycles through faster interaction with results
  – Built-in graphs/charts make for a powerful solution in many situations

History of Hadoop Innovation @ Yahoo

Over 600PB of Hadoop Storage (Over Half an Exabyte)
• Very large clusters used by many groups across the enterprise
• More than 35,000 individual datanodes
• Hadoop is provided as a service
• Multiple cluster types such as research, dev, sandbox and production
• Services such as HBase, Hive, Oozie, etc.
• Users are free to run jobs, but have resource constraints
• Maintained by the Grid Operations Group

Integrated Analytics Platform for Diverse Data Stores
• Full-featured, Integrated Product
• Fast Insights for Everyone
• Works with What You Have Today
[Diagram labels: Explore, Analyze, Visualize, Dashboards, Share; Hadoop Client Libraries, Hadoop Clusters, Streaming Resource Libraries, NoSQL and Other Data Stores]

Improving Operational Visibility with Hunk
• We pointed Hunk at many operational logs and event data we already had on the grid
• This includes system metrics, HDFS ops, JVM stats and YARN metrics
• Created instrumentation to measure usage per user and job
• Analyzed terabytes of NameNode audit logs
• Job history leveraged for visualizing usage/growth and historical views
• Custom events for HBase statistics

Tracking Hadoop Performance & Metrics in Hunk
Use Case | Customer | Benefits
System metrics from 35k nodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights of resources | All Grid Customers | Track organic growth
Job performance | All Grid Customers | Improved job SLAs
HBase metrics | All Grid Customers | Track region/RS/table metrics…
Job logs in near real-time | All Grid Customers / Ops | Search for errors directly from the YARN logs

Measuring NameNode Performance Pre & Post Upgrades
• Historical visualizations of all operations
• Search data in Hunk from billions of NameNode events
• Measure JVM and memory usage
• Insights into operational performance

Visualization Using Hunk
[Screenshot: Hunk search, 10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM). Chart of NameNode operation counts over _time, Fri May 16 2014 to Tue May 20, y-axis up to 600,000,000.]
Series: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations, num_blockChecksumOp
Sample row, _time = 2014-05-15 01:00:
  num_BlockReports            1124437.735902
  num_CopyBlockOperations     46721126.819672
  num_HeartBeats              514957.384098
  num_ReadBlockOperations     12930433.077869
  num_ReadMetadataOperations  0.000000
  num_ReplaceBlockOperations  94210832.786885
  num_WriteBlockOperations    63512425.967213
  num_blockChecksumOp         13975.306557

Sample Troubleshooting in Hunk of 750 Million Events
Big Picture Plus Granular Details
[Screenshot: chart of the same NameNode operation series over _time, 12:00 PM Tue May 20 2014 to 12:00 PM Wed May 21, y-axis up to 1,000,000,000.]
Sample row, _time = 2014-05-20 01:15:00 (values as shown in the screenshot, some cut off):
  num_BlockReports            105604
  num_CopyBlockOperations     34677652.
  num_HeartBeats              12412
  num_ReadBlockOperations     26242490.8
  num_ReadMetadataOperations  0.000000
  num_ReplaceBlockOperations  88112292.80
  num_WriteBlockOperations    126478486.
  num_blockChecksumOp         51405.34

Analyzing NameNode RPC Calls (Troubleshooting)
• Who is making which RPC call (open, listStatus, create, etc.)
• How often they are making these RPC calls
• Which IP/host the calls are coming from
• Search and visualize historical data from billions of events
• Prevent NameNode abuse/misuse

Visualizing 834 Million Discrete Events

… Continued

Queue Insights (Capacity & Provisioning)
• Each Hadoop job runs in a specific queue
• We track every aspect of the YARN framework
• Immediate queue performance and configuration profiling via job history server
• Historical views and trends that enable better capacity management
• Improved queue utilization and allocation management

Visualizing Queues
[Screenshot: Splunk 6.1.0, New Search]
Search:
  index="jobsummary_logs_all_red" cluster="dilithium*"
  | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds)
  | eval gb_hours=((total_slot_seconds * 0.5) / 3600)
  | eval gb_hours=round(gb_hours)
  | timechart span=6h sum(gb_hours) as gb_hours by queue
Time range: Last 7 days
Result: 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
[Chart: gb_hours by queue over _time, Wed May 21 2014 to Mon May 26, y-axis up to 600,000.]
Sample row, _time = 2014-05-20 18:00:
  OTHER                4154
  apg_dailyhigh_p3     45512
  apg_dailymedium_p5   7071
  apg_hourlyhigh_p1    25643
  apg_hourlylow_p4     12111
  apg_hourlymedium_p2  29664
  apg_p7               3473
  curveball_large      26547
  curveball_med        14192
  slingshot            60875
  slingstone           45376

Self-Service Job Reports
• Each job is unique and so are the map and reduce elements
• How to start analyzing jobs?
• Historical job performance and profiling enable in-depth performance tuning
• Long-term historical views and growth trends

Search:
  index="jobsummary_logs_all_blue" cluster="*" user="gmon"
  | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds)
  | eval gb_hours=((total_slot_seconds * 0.5) / 3600)
  | eval gb_hours=round(gb_hours,2)
  | eval runtime=(finishTime-submitTime)/1000
  | stats sum(gb_hours) as gb-hours avg(runtime) as run_mins by cluster user queue jobName jobId status
  | eval run_mins=round(run_mins/60,2)
  | sort -gb-hours
Time range: Yesterday
Result: 4,871 events (5/26/14 12:00:00.000 AM to 5/27/14 12:00:00.000 AM)

Statistics (4,871)
cluster | user | queue | jobName | jobId | status | gb-hours | run_mins
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_315271 | SUCCEEDED | 108.00 | 33.07
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_312700 | SUCCEEDED | 104.00 | 37.37
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_309715 | SUCCEEDED | 88.00 | 29.83
cobalt | gmon | gridops | distcp: | job_1398982765383_309921 | SUCCEEDED | 36.00 | 68.49

...It’s Not Just Logs We’re Looking At
More data to tap into with the metastore / Hive sources
• Using the metastore we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up-front
• Visualize very complex tables (250+ fields)
• Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
• Built-in aggregates and graphs/charts
• Accelerates development workflow by providing faster interaction with data

THANK YOU
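
Note on the GB-hours metric: both searches in this deck derive gb_hours from the mapSlotSeconds and reduceSlotSeconds fields with the same two eval steps. The Python sketch below reproduces only that arithmetic so it can be checked by hand. The field names come from the searches; the function names and the sample records are hypothetical, and reading the 0.5 factor as "GB of memory per slot" is an assumption, since the slides do not say what it stands for. Rounding is left out, as the two searches round differently.

```python
from collections import defaultdict

# Assumption: the 0.5 multiplier in the searches means 0.5 GB per slot.
GB_PER_SLOT = 0.5


def gb_hours(map_slot_seconds, reduce_slot_seconds, gb_per_slot=GB_PER_SLOT):
    """Mirror the two eval steps: total slot seconds, then GB-hours."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    return (total_slot_seconds * gb_per_slot) / 3600


def gb_hours_by_queue(jobs):
    """Sum GB-hours per queue, roughly what 'sum(gb_hours) ... by queue' does."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job["queue"]] += gb_hours(
            job["mapSlotSeconds"], job["reduceSlotSeconds"]
        )
    return dict(totals)


# Hypothetical job-summary records; the numbers are made up for illustration.
jobs = [
    {"queue": "grideng", "mapSlotSeconds": 7200, "reduceSlotSeconds": 0},
    {"queue": "grideng", "mapSlotSeconds": 3600, "reduceSlotSeconds": 3600},
    {"queue": "gridops", "mapSlotSeconds": 1800, "reduceSlotSeconds": 1800},
]

if __name__ == "__main__":
    for queue, total in sorted(gb_hours_by_queue(jobs).items()):
        print(queue, total)
```

Under these assumptions, a job that holds slots for 7,200 slot-seconds accounts for 1 GB-hour, which is the unit shown in the gb-hours column of the job report.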