Transactions + Interactions + Observations = BIG DATA

Data volume tiers, plotted against increasing data variety and complexity:
- Megabytes: ERP
- Gigabytes: CRM
- Terabytes: WEB
- Petabytes: BIG DATA

Data types shown in the diagram: Mobile Web; Sentiment; SMS/MMS; Speech to Text; User Click Stream; Social Interactions & Feeds; Web logs; Spatial & GPS Coordinates; A/B testing; Sensors / RFID / Devices; Behavioral Targeting; Business Data Feeds; Dynamic Pricing; Segmentation; External Demographics; Search Marketing; Purchase detail; Customer Touches; Support Contacts; Purchase record; Payment record; User Generated Content; Affiliate Networks; Offer details; Dynamic Funnels; Offer history; HD Video, Audio, Images; Product/Service Logs.

APPLICATIONS: OLTP, ERP, CRM Systems; Custom Applications; Business Analytics; Packaged Applications
DATA SYSTEM (REPOSITORIES): RDBMS; EDW; MPP
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)
New data types: Unstructured documents, emails; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation; Clickstream

Data growth (Source: IDC):
- 2.8 ZB in 2012
- 85% from New Data Types
- 15x Machine Data by 2020
- 40 ZB by 2020

The same architecture, with tooling and Hadoop capabilities added:
DEV & DATA TOOLS: Build & Test
OPERATIONS TOOLS: Provision, Manage & Monitor
DATA SYSTEM: Data Management; Data Access; Governance & Integration; Security; Operations (alongside the RDBMS, EDW and MPP repositories)

A Modern Data Architecture/Data Lake

An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale.

SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data
Data lake capabilities: Data Access; Data Management; Operations; Security; Governance & Integration
Existing repositories: RDBMS; EDW; MPP
Chart labels: SCALE; SCOPE; New Analytic Apps; New types of data; LOB-driven; Data Lake

Hortonworks Data Platform (HDP) 2.1

GOVERNANCE & INTEGRATION
  Data Workflow, Lifecycle & Governance: Falcon; Sqoop; Flume; WebHDFS
DATA ACCESS
  Batch: Map Reduce
  Script: Pig
  SQL: Hive/Tez, HCatalog
  NoSQL: HBase
  Stream: Storm
  Search: Solr
  Others: In-Memory Analytics, ISV engines
DATA MANAGEMENT
  YARN: Data Operating System
  HDFS (Hadoop Distributed File System), nodes 1 to N
SECURITY
  Authentication; Authorization; Accounting; Data Protection
  Storage: HDFS
  Resources: YARN
  Access: Hive, …
  Pipeline: Falcon
  Cluster: Knox
OPERATIONS
  Provision, Manage & Monitor: Ambari (SCOM); Zookeeper
  Scheduling: Oozie
Deployment Choice: Linux; Windows; On-Premise; Cloud

The Only Completely Open Distribution for Apache Hadoop
- Fundamentally Versatile and Comprehensive enterprise capabilities
- Wholly Integrated for deep ecosystem interoperability

HDP certifies most recent & stable community innovation

Releases: HDP 1.3 (May 2013); HDP 2.0 (October 2013); HDP 2.1 (April 2014)
Components, grouped under Data Management, Data Access, Governance & Integration, Operations and Security: Hadoop & YARN; Pig; Hive & HCatalog; HBase; Phoenix; Tez; Storm; Solr; Mahout; Sqoop; Flume; Falcon; Oozie; Ambari; Zookeeper; Knox
[Chart: component versions per release. Version labels, in extraction order, with the pairing to components lost: 1.5.1, 3.4.5, 3.3.2, 0.94.6, 1.2.5, 1.4.3, 0.7.0, 1.1.2, 1.3.1, 0.11.0, 1.4.4, 0.8.0, 0.4.0, 4.0.0, 1.4.0, 1.4.4, 0.11.0, 0.5.0, 0.96.1, 0.12.0, 2.2.0, 4.7.2, 0.9.1, 4.0.0, 0.12.0, 0.9.0, 0.98.0, 0.12.1, 0.4.0, 2.4.0, 0.13.0]

DEV & DATA TOOLS
OPERATIONAL TOOLS
SOURCES
DATA SYSTEM
APPLICATIONS
INFRASTRUCTURE
Products shown: HDInsight; Azure; Power BI (New!)

Traditional Database compared with Hadoop Platform
Scale axis (storage & processing) labels: EDW; MPP Analytics; NoSQL

              | Traditional Database | Hadoop Platform
schema        | Required on write | Required on read
speed         | Reads are fast | Writes are fast
governance    | Standards and structured | Loosely structured
processing    | Limited, no data processing | Processing coupled with data
data types    | Structured | Multi and unstructured
best fit use  | Interactive OLAP Analytics; Complex ACID Transactions; Operational Data Store | Data Discovery; Processing unstructured data; Massive Storage/Processing

Microsoft and Hortonworks offerings
- Hortonworks Data Platform (HDP) for Windows
- Microsoft Azure HDInsight
- Microsoft Analytics Platform System (APS)
All offerings are co-engineered by Hortonworks and Microsoft. Enjoy seamless interoperability across on-premises and cloud.

Data Operating System of Hadoop

DATA ACCESS
  Batch: Map Reduce
  Script: Pig
  SQL: Hive/Tez, HCatalog
  NoSQL: HBase, Accumulo
  Stream: Storm
  Search: Solr
  Others: In-Memory Analytics, ISV engines
DATA MANAGEMENT
  YARN: Data Operating System
  HDFS (Hadoop Distributed File System), nodes 1 to N

1st Gen of Hadoop: Single Use System (Batch Apps)
- Classic Hadoop Apps
- MapReduce (cluster resource management & data processing)
- HDFS (redundant, reliable storage)

2nd Gen of Hadoop: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
- Batch: MapReduce
- Batch & Interactive: Tez
- Online Data Processing: HBase, Accumulo
- Stream Processing: Storm
- Flexible Data Processing: Hive, Pig, others…
- others …
- Efficient Cluster Resource Management & Shared Services (YARN)
- Redundant, Reliable Storage (HDFS)

[Diagram: a ResourceManager with its Scheduler and twelve NodeManagers. Containers shown on the NodeManagers: map1.1, map1.2 and reduce1.1 (Batch); vertex1.1.1, vertex1.1.2, vertex1.2.1 and vertex1.2.2 (Interactive SQL); nimbus0, nimbus1 and nimbus2 (Real-Time).]

Stinger Initiative

Stack shown: Custom Apps and Business Analytics; SQL; Apache Hive; Apache Tez and Apache MapReduce; Apache YARN; HDFS (Hadoop Distributed File System), nodes 1 to N

Apache Hive Contribution… an Open Community at its finest
- 1,672 Jira Tickets Closed
- 145 Developers
- 44 Companies
- ~390,000 Lines Of Code Added… (2x)
- 13 Months

Apache Tez
- Replaces MapReduce as the primitive for Hive, Pig, etc.
- A task has a pluggable Input, Processor and Output
- Tez Task - <Input, Processor, Output>

Tez avoids unneeded writes to HDFS

    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state

[Diagram: the query plan under Hive – MR and under Hive – Tez. Stage labels: SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN(a, c); JOIN(a, b) with GROUP BY a.state, COUNT(*), AVG(c.price). The Hive – MR plan is a chain of map (M) and reduce (R) jobs with HDFS writes between them; the Hive – Tez plan connects the same stages without those intermediate HDFS writes.]

SQL Compliance

Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.

Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR

Hive SQL Semantics:
- SELECT, INSERT
- GROUP BY, ORDER BY, SORT BY
- JOIN on explicit join key
- Inner, outer, cross and semi joins
- Sub-queries in FROM clause
- ROLLUP and CUBE
- UNION
- Windowing Functions (OVER, RANK, etc.)
- Custom Java UDFs
- Standard Aggregation (SUM, AVG, etc.)
- Advanced UDFs (ngram, XPath, URL)
- Sub-queries for IN/NOT IN, HAVING
- Expanded JOIN Syntax
- INTERSECT / EXCEPT

Legend: Hive 0.11; Hive 0.12 (HDP 2.0); Hive 0.13 (HDP 2.1)

Apache Falcon

Provides a key governance framework for:
- Disaster recovery and backup between environments (Site to Site)
- Publishing data between environments for discovery (Site to Cloud)
- Defining sophisticated retention policies
- Simplifying data retention for audit, compliance, or data re-processing

Retention by stage:
- Staged Data: Retain 5 Years
- Cleansed Data: Retain 3 Years
- Conformed Data: Retain 3 Years
- Presented Data: Retain Last Copy Only

Apache Solr

[Diagram: a MapReduce Indexing Job running over HDFS (Hadoop Distributed File System).]

Apache Storm

Apache Knox

- Knox Gateway (GW): a stateless reverse proxy instance deployed in the DMZ
- Requests are streamed through the GW to Hadoop services after authentication
- URLs are rewritten to refer to the gateway

[Diagram: a Browser, a REST Client and a JDBC Client reach the Knox Gateway, which sits in a DMZ between two firewalls and authenticates against the Enterprise Identity Provider (LDAP/AD). Behind the gateway are HDP Cluster 1 and HDP Hadoop Cluster 2, each showing Masters (NN, JT), DN, TT, WebHCat, Oozie, YARN, HBase and Hive.]

Ambari: Deploy, Manage, Monitor

- AMBARI WEB talks to AMBARI SERVER through REST APIs
- PROVISION | MANAGE | MONITOR compute & storage

Ambari SCOM
- Ambari SCOM Server aggregates and exposes Hadoop metrics
- Ambari SCOM monitors health and alerts in case of problems
- Components shown: Ambari SCOM Mgmt Pack; Ambari SCOM Server; Hadoop (Storage & Process at Scale)

Resources
- http://www.trySQLSever.com
- http://www.powerbi.com
- http://microsoft.com/bigdata
- http://channel9.msdn.com/Events/TechEd
- www.microsoft.com/learning
- http://microsoft.com/technet
- http://microsoft.com/msdn
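
Appendix sketch: the three-way join from the "Tez avoids unneeded writes to HDFS" slide can be tried locally to see what result the plan stages are building toward. This is only an illustration under stated assumptions: SQLite (via Python's standard sqlite3 module) stands in for Hive, the table contents are invented, the ORDER BY clause is added to make the output deterministic, and the helper names build_sample_db and run_state_report are hypothetical. The aggregate is spelled AVG, which both SQLite and Hive accept.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database with tiny, made-up tables a, b and c."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
        CREATE TABLE b (id INTEGER);
        CREATE TABLE c (itemId INTEGER, price REAL);

        -- Row 4 of a has no partner in b, so the first join drops it.
        INSERT INTO a VALUES (1, 10, 'WA'), (2, 11, 'WA'), (3, 10, 'CA'), (4, 12, 'OR');
        INSERT INTO b VALUES (1), (2), (3);
        INSERT INTO c VALUES (10, 4.0), (11, 6.0), (12, 9.0);
        """
    )
    return conn


def run_state_report(conn):
    """Run the slide's join and aggregation; return (state, count, avg_price) rows."""
    query = """
        SELECT a.state, COUNT(*), AVG(c.price)
        FROM a
        JOIN b ON (a.id = b.id)
        JOIN c ON (a.itemId = c.itemId)
        GROUP BY a.state
        ORDER BY a.state  -- assumption: added only for stable output
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in run_state_report(build_sample_db()):
        print(row)
```

With the sample rows above, the script prints ('CA', 1, 4.0) and ('WA', 2, 5.0). On a cluster, the same statement would be compiled by Hive into either a chain of MapReduce jobs or a single Tez DAG; the result set is the same, only the intermediate writes differ.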
© Copyright 2024