HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM
APACHE HADOOP WORKSHOP

AGENDA
• Introduction
• What is Hadoop, and the rationale behind it
• Hadoop Distributed File System (HDFS) and MapReduce
• Common Hadoop use cases
• How Hadoop integrates with other systems such as relational databases and data warehouses
• The other components in a typical Hadoop “stack”: Hive, Pig, HBase, Sqoop, Flume and Oozie
• Conclusion

ABOUT TRIFORCE
Triforce provides critical, reliable IT infrastructure solutions and services to Australian and New Zealand listed corporations and government agencies. Triforce has qualified and experienced technical and sales consultants, and demonstrated experience in designing and delivering enterprise Apache Hadoop solutions.

TRIFORCE BIG DATA PARTNERSHIP
• NetApp – The NetApp Open Solution for Hadoop provides customers with flexible choices for delivering enterprise-class Hadoop.
• Cloudera – Cloudera is the market leader in Hadoop enterprise solutions. Cloudera’s 100% open-source distribution including Apache Hadoop (CDH), combined with Cloudera Enterprise, comprises the most reliable and complete Hadoop solution available.

WHAT IS HADOOP?
• “a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.” (http://hadoop.apache.org/)
• “Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data.” (http://en.wikipedia.org/wiki/Hadoop/)

THE RATIONALE FOR HADOOP
• “Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits.
With Hadoop, no data is too big.” (http://www.cloudera.com)
• Hadoop processes petabytes of unstructured data in parallel across potentially thousands of commodity servers, using an open-source filesystem and related tools.
• Hadoop has been all about innovative ways to process, store and, eventually, analyse huge volumes of multi-structured data.

EXAMPLES
• 2.7 zettabytes of data exist in the digital universe today. (Gigabyte → Terabyte → Petabyte → Exabyte → Zettabyte)
• Facebook stores, accesses and analyses 30+ petabytes of user-generated data.
• Decoding the human genome originally took 10 years; now it can be achieved in one week.
• YouTube users upload 48 hours of new video every minute of the day.
• 100 terabytes of data are uploaded to Facebook daily.

HADOOP
• Handles all types of data – structured, unstructured, log files, pictures, audio files, communications records, email
• No prior need for a schema – you don’t need to know how you intend to query your data before you store it
• Makes all of your data usable – not just what’s in your databases. Hadoop lets you see relationships that were hidden before and reveal answers that have always been just out of reach. You can start making more decisions based on hard data instead of hunches, and look at complete data sets, not just samples.
• Two parts to Hadoop:
– MapReduce
– Hadoop Distributed File System (HDFS)

What is this Big Elephant?
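The MapReduce half of Hadoop can be illustrated with a minimal, Hadoop-free sketch in plain Python (a toy model of the programming model, not Hadoop's actual Java API): a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group — the classic word count.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key (here: sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big elephant", "big cluster"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'elephant': 1, 'cluster': 1}
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data over the network; the logic, however, is exactly this shape.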
HADOOP
Geever Paul Pulikkottil
Big Data Solutions Architect (CCAH, CCDH)

CASE FOR BIG DATA
Databases – here for more than 20 years – continue to store structured transactional data:
• Large server(s)
• Multiple CPUs
• Huge memory buffer
• SAN disks
– Relatively low-latency queries over indexed data

CASE FOR BIG DATA
TYPICAL DATABASE WORKLOADS
OLTP (online transaction processing)
• Typical use: e-commerce, banking
• Nature: user-facing, real-time, low latency, highly concurrent
• Job: relatively small set of “standard” transactional queries
• Data access pattern: random reads, updates, writes (relatively small data)

OLAP (online analytical processing)
• Typical use: BI, data mining
• Nature: back-end processing, batch workloads
• Job: complex analytical queries, often ad hoc
• Data access pattern: table scans, large queries

CASE FOR BIG DATA
Data warehouse:
– Consolidated database loaded from CRM, ERP and OLTP systems
– Process: staging, cleansing, loading
– Purpose: BI reporting, forecasts, quarterly reporting
– Size: larger server, multiple CPUs, SAN disks – many TBs
• Challenges:
– As the data grows over time, things get slower
– Batch jobs must fit within the daily or weekly loading cycle
– Relatively expensive to license, store and manage

CASE FOR BIG DATA
New objective: businesses want to “connect” with the customer
• We are generating lots of data – and discarding most of it
• Likes and dislikes – Facebook, Twitter, LinkedIn
• Predictable outcomes – you can predict them when you know the customer
• React quickly – time missed = opportunity lost!
Question: can a DW provide that?
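The two access patterns contrasted above can be sketched in a few lines of Python (a toy in-memory table and index, hypothetical stand-ins rather than any particular database): OLTP does an indexed point lookup touching one row, while OLAP scans every row to aggregate.

```python
# Hypothetical "orders" table: (order_id, customer, amount)
orders = [(1, "alice", 120.0), (2, "bob", 75.5), (3, "alice", 42.0)]

# OLTP-style access: a point lookup through an index (a hash map on the key).
index = {row[0]: row for row in orders}

def oltp_lookup(order_id):
    """OLTP: one indexed read, low latency, touches a single row."""
    return index[order_id]

def olap_total_by_customer():
    """OLAP: a full table scan with aggregation, touches every row."""
    totals = {}
    for _, customer, amount in orders:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals

print(oltp_lookup(2))            # (2, 'bob', 75.5)
print(olap_total_by_customer())  # {'alice': 162.0, 'bob': 75.5}
```

The DW pain described above is what happens when the scan side of this picture grows to many terabytes: the full-table pass no longer fits in the nightly window.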
• Where can you store TBs or PBs of unstructured data more economically?
• How can you scale out easily, rather than doing forklift upgrades?
• How can I finish batch jobs when the data grows beyond TBs?
• We need a scalable, distributed system that can store and process large amounts of data.

CASE FOR BIG DATA
• Distributed systems are not new:
– Common frameworks include MPI and PVM
– Focus on distributing the processing workload
– Powerful compute nodes, with separate systems for data storage
– Fast network connections – InfiniBand
• Typical processing pattern:
– Step 1: copy input data from storage to a compute node
– Step 2: perform the necessary processing
– Step 3: copy output data back to storage
– Often hundreds to thousands of nodes, with GPUs

CASE FOR BIG DATA
Distributed HPC – built for relatively small amounts of data:
– Doesn’t scale with large amounts of data
– More time is spent copying data than actually processing it
– Getting data to the processors is the bottleneck
– It gets worse as more compute nodes are added
– Each node competes for the same bandwidth
– Compute nodes become starved for data
“Distributed systems pay for compute scalability by adding complexity” – CUDA Fortran, PGI programming?

BIG DATA SOLUTION: HADOOP
What is Hadoop?
– An open-source distributed computing platform
– Based on Google’s GFS file system
– Commodity hardware – no SAN, no InfiniBand
– Scales from single servers to thousands of machines, each offering local computation and storage
– Designed to detect and handle failures at the application layer
– Adding more nodes increases “performance” and “capacity” with no penalty
– Commodity hardware is prone to failures – Hadoop knows that!

HADOOP CLUSTER STACK
Master nodes (1st rack):
- Name Node
- Standby Name Node
- Job Tracker
Slave nodes (all racks):
- Data Nodes with direct-attached large-capacity disks (SATA)
Plus:
- Management or Admin Node
- Hadoop Client Node(s)
(Typical setup)

MAPREDUCE PROGRAMMING
Hadoop is great for large-data processing!
- MapReduce code requires you to write Java classes and driver code
- It’s complicated to write MapReduce jobs, so we need a simpler method:
- Develop a higher-level language to facilitate large-data processing
- Hive: an SQL-like language for Hadoop, called HQL
- Pig: Pig Latin is a scripting language, a bit like Perl
- Both translate into, and run, a series of Map-only or MapReduce jobs

ECOSYSTEM TOOLS: HIVE AND PIG
Hive:
- Data warehousing application on Hadoop
- Query language is HQL, a variant of SQL
- Tables are stored on HDFS as flat files
- Developed by Facebook, now open source
Pig:
- Large-scale data processing system
- Scripts are written in Pig Latin, a dataflow language
- Developed by Yahoo!, now open source
Objective:
- A higher-level language to facilitate large-data processing
- The higher-level language “compiles down” to Hadoop jobs

HIVE AND PIG EXAMPLE CODE
Hive example:
Pig example:

ECOSYSTEM TOOLS: SQOOP
- Imports data from an RDBMS into Hadoop
- Individual tables, portions (WHERE clause) or entire databases
- Stored in HDFS as delimited text files or SequenceFiles
- Provides the ability to import from SQL databases straight into your Hive data warehouse
- Uses JDBC to connect to the RDBMS; additional connectors are available for BI/DW systems
- Sqoop automatically generates a Java class to import the data into Hadoop
- Sqoop provides an incremental import mode
- Can also export tables from Hadoop back to an RDBMS

SQOOP IMPORT EXAMPLES
> Importing data into HDFS as a Hive table using Sqoop:
user@dbserver$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS --local \
  --hive-import
> Importing data into HDFS as compressed sequence files (no Hive) using Sqoop:
user@dbserver$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
  --as-sequencefile
> Importing data into HBase using Sqoop:
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hbase-create-table --hbase-table ORDERS --column-family mysql
> Exporting data to an RDBMS using Sqoop:
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS

• The first command connects to the MySQL database on that server and imports the USERS table into HDFS.
• The --local option instructs Sqoop to take advantage of a local MySQL connection.
• With the --hive-import option, after reading the data into HDFS, Sqoop connects to the Hive metastore, creates a table named USERS with the same columns and types (translated into their closest analogues in Hive), and loads the data into the Hive warehouse directory on HDFS (instead of a subdirectory of your HDFS home directory).

SQOOP CUSTOM CONNECTORS
Sqoop works over standard JDBC connections with common databases; custom, faster, tuned connectors are also available:
– Cloudera Connector for Teradata
– Cloudera Connector for Netezza
– Cloudera Connector for MicroStrategy
– Cloudera Connector for Tableau
– Quest Data Connector for Oracle and Hadoop

ECOSYSTEM TOOLS: FLUME
Flume: gathers data/logs from multiple systems and inserts them into HDFS as they are generated. Typically used to ingest log files from real-time systems such as web servers, firewalls and mail servers into HDFS.
Each Flume agent has a source and a sink:
– Source – tells the node where to receive data from
– Sink – tells the node where to send data to
– Channel – a queue between the source and sink
• Can be in-memory only, or ‘durable’
• Durable channels will not lose data if power is lost

ECOSYSTEM TOOLS: FUSE
FUSE: “Filesystem in Userspace”
– Allows HDFS to be mounted as a UNIX file system
– Users can run 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use standard POSIX calls such as open, write, read and close.
– You can export a FUSE mount using NFS

ECOSYSTEM TOOLS: OOZIE
Oozie:
– A ‘workflow engine’
– Runs workflows of Hadoop jobs: Pig, Hive, Sqoop
– Jobs can be run at specific times, one-off or recurring
– Jobs can also be run when data is present in a directory

ECOSYSTEM TOOLS: MAHOUT
Mahout:
- A machine-learning library for Hadoop
- Contains many pre-written ML algorithms
- R is another open-source toolset used by data scientists

ECOSYSTEM TOOLS: IMPALA (CDH 4.1)
Impala:
– Brings real-time, ad hoc query capability to Hadoop
– Queries data stored in HDFS or HBase
– SELECT, JOIN and aggregate functions in real time
– Uses the same Hive metadata
– SQL syntax (Hive SQL), ODBC driver
– Same user interface (Hue Beeswax) as Hive, plus the Impala shell
– Released 26th Oct 2012 with CDH 4.1

HBASE – REAL-TIME DATA WITH UPDATES
HBase is a distributed, sparse, column-oriented data store:
– Real-time read/write access to data on HDFS
– Modelled after Google’s BigTable data store
– Designed to use multiple machines to store and serve data
– Leverages HDFS to store data
– Each row may or may not have values for all columns
– Data is stored grouped by column, rather than by row
– Columns are grouped into ‘column families’, which define which columns are physically stored together
– Scales to provide very high write throughput – hundreds of thousands of inserts per second
– Has a constrained access model: NoSQL
• Insert a row, retrieve a row, do a full or partial table scan
• Only one column (the ‘row key’) is indexed
– Based on a key/value store: [rowkey, column family, column qualifier, timestamp] -> cell value
• [TheRealMT, info, password, 1329088818321] -> abc123
• [TheRealMT, info, password, 13290888321289] -> newpass123

HBASE
HBase:
– Indexed by [rowkey + column qualifier + timestamp]
• HBase is not a relational database:
– No SQL query language (GET/PUT/SCAN)
– No joins, no secondary indexing, no transactions
– A table is split into regions
– Regions are served by Region Servers
– Region Servers
are Java processes running on DataNodes
– Two special tables: ROOT and .META
– MemStore and HFiles: every MemStore flush creates one HFile per column family
– Compactions (major/minor) reduce and consolidate HFiles

DATA HAS CHANGED – HADOOP USE CASES
What do we know today?
• We love to be connected and to collaborate
• We love to share emotions, likes and dislikes
• Digital marketing has a focus on social media
• Get more insights across collections of data
• All sorts of data need to be stored and analysed
• Real-time recommendation engines
• Predictive modelling with data science

COMMON HADOOP USE CASES
• Financial services
– Consumer & market risk modelling
– Personalisation & recommendations
– Fraud detection & anti-money laundering
– Portfolio valuations

COMMON HADOOP USE CASES
• Government
– Cyber security & fraud detection
– Geospatial image & video processing

COMMON HADOOP USE CASES
• Media & entertainment
– Search & recommendation optimisation
– User engagement & digital content analysis
– Ad/offer targeting
– Sentiment & social media analysis

HADOOP USE CASES: DATA STORES
OLTP database (OLTP)
• For user-facing transactions; retains records
Extract-Transform-Load (ETL)
• Periodic ETL (e.g. nightly); extract records from the source
• Transform: clean data, check integrity, aggregate, etc.
• Load into an OLAP database
OLAP database for data warehousing (DW)
• Business intelligence: reporting, ad hoc queries, data mining

HADOOP USE CASES: REPLACE THE DW?
Reporting is often a nightly task:
• ETL is often slow, and runs after the business day
• What happens if processing 24 hours of data takes longer than 24 hours?
Hadoop is a perfect fit – most likely, you already have some DW:
• Ingest is limited only by the speed of HDFS
• Scales out with more nodes – massively parallel
• Ability to use any processing tool
• Much cheaper than parallel databases
• ETL is a batch process anyway!
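The HBase cell addressing shown earlier — [rowkey, column family, column qualifier, timestamp] -> cell value — can be sketched with a Python dict (a toy model for illustration, not the real HBase client API): a GET returns the most recent version of a cell, exactly as in the password example from the slides.

```python
# Toy model of HBase cell addressing; not the real HBase client API.
# Key: (rowkey, column_family, column_qualifier, timestamp) -> cell value
store = {}

def put(rowkey, family, qualifier, timestamp, value):
    """Insert one versioned cell."""
    store[(rowkey, family, qualifier, timestamp)] = value

def get(rowkey, family, qualifier):
    """GET returns the latest version of the cell (highest timestamp)."""
    versions = {ts: v for (r, f, q, ts), v in store.items()
                if (r, f, q) == (rowkey, family, qualifier)}
    return versions[max(versions)] if versions else None

put("TheRealMT", "info", "password", 1329088818321, "abc123")
put("TheRealMT", "info", "password", 13290888321289, "newpass123")
print(get("TheRealMT", "info", "password"))  # newpass123
```

Because only the row key is indexed, anything other than a lookup by (rowkey, family, qualifier) would require scanning the whole store — which is precisely HBase's constrained access model.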
CLOUDERA DISTRIBUTION HADOOP 4.1
Cloudera Enterprise subscription options:
• Cloudera Enterprise Core
• Cloudera Enterprise RTD (Real-Time Delivery)
• Cloudera Enterprise RTQ (Real-Time Query)

WHERE TO FROM HERE?
• Understand use cases
• Confirm data sources
• Build a business case
• Design a solution
• Deploy Hadoop infrastructure
• Use Hadoop to answer questions

CONTACT TRIFORCE
• Call 1300 664 667
• Email: [email protected]
• View our Big Data Resources page at www.triforce.com.au
• Follow us on LinkedIn: http://www.linkedin.com/company/triforceaustralia
© Copyright 2024