Under the Covers of Hadoop on Windows

Petabytes
BIG DATA
Transactions + Interactions + Observations = BIG DATA
Mobile Web
Sentiment
SMS/MMS
Speech to Text
User Click Stream
Social Interactions & Feeds
Terabytes
WEB
Web logs
Spatial & GPS Coordinates
A/B testing
Sensors / RFID / Devices
Behavioral Targeting
Gigabytes
CRM
Business Data Feeds
Dynamic Pricing
Segmentation
External Demographics
Search Marketing
Megabytes
ERP
Purchase detail
Customer Touches
Support Contacts
Purchase record
Payment record
User Generated Content
Affiliate Networks
Offer details
Dynamic Funnels
Offer history
HD Video, Audio, Images
Product/Service Logs
Increasing Data Variety and Complexity
APPLICATIONS
OLTP, ERP, CRM Systems
Custom Applications
Business Analytics
Packaged Applications
Unstructured documents, emails
Server logs
DATA SYSTEM
2.8 ZB in 2012
85% from New Data Types
RDBMS
EDW
Sentiment, Web Data
MPP
REPOSITORIES
15x Machine Data by 2020
40 ZB by 2020
Sensor, Machine Data
Source: IDC
SOURCES
Geolocation
Existing Sources (CRM, ERP, Clickstream, Logs)
Clickstream
APPLICATIONS
OLTP, ERP, CRM Systems
Custom Applications
Business Analytics
Packaged Applications
DEV & DATA TOOLS: Build & Test
OPERATIONS TOOLS: Provision, Manage & Monitor
DATA SYSTEM
REPOSITORIES: RDBMS, EDW, MPP
Data Management
Data Access
Governance & Integration
Security
Operations
SOURCES
Server logs
Unstructured documents, emails
Sentiment, Web Data
Sensor, Machine Data
Geolocation
OLTP, ERP, CRM Systems
Documents, Emails
Web Logs, Click Streams
Social Networks
Machine Generated
Sensor Data
Geolocation Data
Clickstream
A Modern Data Architecture/Data Lake
An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale.
SCALE
SCOPE
Data Lake
New Analytic Apps
New types of data
LOB-driven
RDBMS
EDW
MPP
Data Access
Data Management
Governance & Integration
Security
Operations
Hortonworks Data Platform (HDP)
HDP 2.1
Hortonworks Data Platform
GOVERNANCE & INTEGRATION
Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, WebHDFS
DATA ACCESS
Batch: MapReduce
Script: Pig
SQL: Hive/Tez, HCatalog
NoSQL: HBase
Stream: Storm
Search: Solr
Others: In-Memory Analytics, ISV engines
HDFS (Hadoop Distributed File System)
DATA MANAGEMENT
YARN: Data Operating System
SECURITY
Authentication, Authorization, Accounting, Data Protection
Storage: HDFS
Resources: YARN
Access: Hive, …
Pipeline: Falcon
Cluster: Knox
OPERATIONS
Provision, Manage & Monitor: Ambari (SCOM), Zookeeper
Scheduling: Oozie
Deployment Choice
Linux, Windows, On-Premise, Cloud
The Only Completely Open Distribution for Apache Hadoop
Fundamentally Versatile and Comprehensive enterprise capabilities
Wholly Integrated for deep ecosystem interoperability
HDP certifies most recent & stable community innovation
1.5.1
HDP 2.1
Data
Management
3.4.5
3.3.2
0.94.6
Data Access
Governance
& Integration
Hortonworks Data Platform
Operations
Knox
Zookeeper
Ambari
1.2.5
Flume
Storm
Sqoop
1.4.3
0.7.0
Phoenix
2013
1.1.2
Tez
May
1.3.1
0.11.0
Hadoop & YARN
HDP 1.3
1.4.4
0.8.0
0.4.0
4.0.0
1.4.0
1.4.4
0.11.0
2013
0.5.0
0.96.1
0.12.0
2.2.0
4.7.2
0.9.1
HBase
October
4.0.0
0.12.0
Pig
HDP 2.0
0.9.0
Falcon
2014
0.98.0
Oozie
0.12.1
Solr
0.4.0
Mahout
2.4.0
Hive & HCatalog
April
0.13.0
Security
DEV & DATA TOOLS
OPERATIONAL TOOLS
HDInsight
Azure
SOURCES
DATA SYSTEM
APPLICATIONS
New!
Power BI
INFRASTRUCTURE
SCALE (storage & processing)
Traditional Database (EDW, MPP Analytics) | NoSQL, Hadoop Platform
schema: Required on write | Required on read
speed: Reads are fast | Writes are fast
governance: Standards and structured | Loosely structured
processing: Limited, no data processing | Processing coupled with data
data types: Structured | Multi and unstructured
best fit use: Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store | Data Discovery, Processing unstructured data, Massive Storage/Processing
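The "required on write" versus "required on read" contrast in the comparison above can be sketched in a few lines of Python. This is only a toy illustration: the `SCHEMA` dictionary, the field names, and the in-memory lists standing in for a database table and a raw file store are all assumptions made for the example, not part of any product mentioned here.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"user_id": int, "amount": float}


def write_with_schema(table, record):
    """Schema on write: validate before storing, so reads need no checks."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    table.append(record)


def write_raw(store, line):
    """Schema on read: append the raw line untouched, so writes stay cheap."""
    store.append(line)


def read_with_schema(store):
    """Apply the schema only when reading; skip lines that do not fit."""
    rows = []
    for line in store:
        try:
            raw = json.loads(line)
            rows.append({f: t(raw[f]) for f, t in SCHEMA.items()})
        except (ValueError, KeyError, TypeError):
            continue  # malformed data surfaces at read time, not write time
    return rows


table, store = [], []
write_with_schema(table, {"user_id": 1, "amount": 9.5})
write_raw(store, '{"user_id": "2", "amount": "3.25", "note": "extra"}')
write_raw(store, "not json at all")
rows = read_with_schema(store)
```

In this sketch the raw store accepts both lines without complaint, and only the read step decides which of them fit the schema.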
Hortonworks Data Platform (HDP) for Windows
Microsoft Azure HDInsight
Microsoft Analytics Platform System (APS)
All offerings co-engineered by Hortonworks and Microsoft
Enjoy seamless interoperability across on-premises and cloud
Data Operating System of Hadoop
DATA ACCESS
Batch: MapReduce
Script: Pig
SQL: Hive/Tez, HCatalog
NoSQL: HBase, Accumulo
Stream: Storm
Search: Solr
Others: In-Memory Analytics, ISV engines
YARN : Data Operating System
HDFS (Hadoop Distributed File System)
DATA MANAGEMENT
1st Gen of Hadoop: Single Use System (Batch Apps)
MapReduce (cluster resource management & data processing)
HDFS (redundant, reliable storage)
2nd Gen of Hadoop: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
Classic Hadoop Apps: Batch, MapReduce
Flexible Data Processing: Hive, Pig, others…
Online Data Processing: HBase, Accumulo
Stream Processing: Storm
Batch & Interactive: Tez
others …
Efficient Cluster Resource Management & Shared Services (YARN)
Redundant, Reliable Storage (HDFS)
ResourceManager
Scheduler
NodeManager (12 instances)
Batch: map1.1, map1.2, reduce1.1
Interactive SQL: vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2
Real-Time: nimbus0, nimbus1, nimbus2
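The idea of one ResourceManager placing containers from batch, interactive, and real-time applications onto shared NodeManagers can be illustrated with a toy model. The sketch below is not YARN's API or its scheduling policy: the class name, the slot counts, the node names, and the "most free slots first" rule are assumptions chosen only to keep the example small.

```python
from collections import defaultdict


class ToyResourceManager:
    """Toy model of a central scheduler placing containers on worker nodes.

    Not YARN's real API or scheduling policy; all names are illustrative.
    """

    def __init__(self, nodes, slots_per_node):
        self.free = {node: slots_per_node for node in nodes}
        self.placement = defaultdict(list)  # node -> container names

    def allocate(self, container):
        """Place one container on the node with the most free slots."""
        node = max(self.free, key=lambda n: self.free[n])
        if self.free[node] == 0:
            return None  # cluster is full; the request would have to wait
        self.free[node] -= 1
        self.placement[node].append(container)
        return node


rm = ToyResourceManager(nodes=["nm1", "nm2", "nm3"], slots_per_node=2)
requests = [
    "map1.1", "map1.2", "reduce1.1",   # batch (MapReduce)
    "vertex1.1.1", "vertex1.1.2",      # interactive SQL (Tez)
    "nimbus0",                         # real-time (Storm)
]
for name in requests:
    rm.allocate(name)
overflow = rm.allocate("vertex1.2.1")  # no free slot left in this toy cluster
```

Each node ends up hosting containers from more than one kind of workload, which is the point the diagram makes about a multi-use platform.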
Stinger Initiative
Custom Apps
Business Analytics
SQL
Apache Hive
Apache Tez
Apache MapReduce
Apache YARN
HDFS (Hadoop Distributed File System)
Apache Hive Contribution… an Open Community at its finest
1,672 Jira Tickets Closed
145 Developers
44 Companies
~390,000 Lines Of Code Added… (2x)
13 Months
Tez replaces MapReduce as the primitive for Hive, Pig, etc.
Task with pluggable Input, Processor and Output
Input
Processor
Output
Task
Tez Task - <Input, Processor, Output>
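The "task with pluggable Input, Processor and Output" idea can be sketched conceptually as follows. This is not the Tez API (Tez is a Java framework); the `Task` class, its field names, and the sample records are assumptions made purely to show how the three pieces compose.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    """A task assembled from three pluggable pieces (conceptual, not Tez's API)."""
    read_input: Callable[[], Iterable]         # Input: where records come from
    process: Callable[[Iterable], Iterable]    # Processor: the logic applied
    write_output: Callable[[Iterable], None]   # Output: where results go

    def run(self) -> None:
        # Wire the three pieces together: input -> processor -> output.
        self.write_output(self.process(self.read_input()))


sink: List[str] = []

upper_task = Task(
    read_input=lambda: ["ca", "ny", "wa"],
    process=lambda records: (r.upper() for r in records),
    write_output=sink.extend,
)
upper_task.run()
```

Because each piece is just a callable, any one of them can be swapped (for example, a different output) without touching the other two.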
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
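To see what this query computes, it can be run against a few rows using Python's built-in `sqlite3` module as a stand-in for Hive. The table contents are invented for the example, the average is written as `AVG` (the aggregate name both Hive and SQLite accept), and an `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up sample data; only the table and column names come from the query.
conn.executescript("""
    CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (itemId INTEGER, price REAL);
    INSERT INTO a VALUES (1, 10, 'CA'), (2, 11, 'CA'), (3, 10, 'NY');
    INSERT INTO b VALUES (1), (2), (3);
    INSERT INTO c VALUES (10, 4.0), (11, 6.0);
""")

rows = conn.execute("""
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state
    ORDER BY a.state
""").fetchall()

for state, count, avg_price in rows:
    print(state, count, avg_price)
```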
Tez avoids unneeded writes to HDFS
[Figure: execution plan for the query above, Hive – MR vs. Hive – Tez. Both plans use the same operators: SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, c); JOIN (a, b) with GROUP BY a.state, COUNT(*), AVG(c.price). In the Hive – MR plan the map (M) and reduce (R) stages hand off to each other through HDFS.]
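The saving from avoiding intermediate HDFS writes can be illustrated with a small simulation. This is a simplified model, not a measurement of Hive or Tez: the three stage functions are stand-ins for the joins and the aggregation, and a counter stands in for materializing output on HDFS. How many writes a real plan saves depends on the query.

```python
def run_plan(stages, records, persist_between_stages):
    """Run stages in order; return (result, simulated HDFS write count)."""
    hdfs_writes = 0
    for index, stage in enumerate(stages):
        records = stage(records)
        is_last = index == len(stages) - 1
        if persist_between_stages or is_last:
            hdfs_writes += 1  # stand-in for materializing this output on HDFS
    return records, hdfs_writes


def drop_missing(rows):
    return [r for r in rows if r["price"] is not None]


def to_pairs(rows):
    return [(r["state"], r["price"]) for r in rows]


def total_by_state(pairs):
    totals = {}
    for state, price in pairs:
        totals[state] = totals.get(state, 0.0) + price
    return sorted(totals.items())


stages = [drop_missing, to_pairs, total_by_state]
rows = [
    {"state": "CA", "price": 4.0},
    {"state": "CA", "price": 6.0},
    {"state": "NY", "price": None},
]

# Chain of separate jobs: every stage persists its output.
mr_result, mr_writes = run_plan(stages, rows, persist_between_stages=True)
# Single DAG: only the final result is persisted.
dag_result, dag_writes = run_plan(stages, rows, persist_between_stages=False)
```

Both runs produce the same result; only the number of simulated writes differs.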
SQL Compliance
Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
Hive SQL Datatypes
INT
TINYINT/SMALLINT/BIGINT
BOOLEAN
FLOAT
DOUBLE
STRING
TIMESTAMP
BINARY
DECIMAL
ARRAY, MAP, STRUCT, UNION
DATE
VARCHAR
CHAR
Hive SQL Semantics
SELECT, INSERT
GROUP BY, ORDER BY, SORT BY
JOIN on explicit join key
Inner, outer, cross and semi joins
Sub-queries in FROM clause
ROLLUP and CUBE
UNION
Windowing Functions (OVER, RANK, etc)
Custom Java UDFs
Standard Aggregation (SUM, AVG, etc.)
Advanced UDFs (ngram, Xpath, URL)
Sub-queries for IN/NOT IN, HAVING
Expanded JOIN Syntax
INTERSECT / EXCEPT
Legend: Hive 0.11, Hive 0.12 (HDP 2.0), Hive 0.13 (HDP 2.1)
Apache Falcon
Provides key governance framework for:
Disaster Recovery and Backup between environments (Site to Site)
Publishing data between environments for Discovery (Site to Cloud)
Define sophisticated retention policies
Simplify data retention for audit, compliance, or for data re-processing
Staged Data: Retain 5 Years
Cleansed Data: Retain 3 Years
Conformed Data: Retain 3 Years
Presented Data: Retain Last Copy Only
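The retention tiers above can be expressed as a small policy check. This sketch is not Falcon's configuration format or API. The pairing of stages with retention periods follows the order in which they appear on the slide, which is an assumption, and the function and dictionary names are hypothetical.

```python
from datetime import date, timedelta

# Assumed mapping of stages to retention periods, in days.
# None means "keep only the most recent copy".
RETENTION_DAYS = {
    "staged": 5 * 365,
    "cleansed": 3 * 365,
    "conformed": 3 * 365,
    "presented": None,
}


def expired(stage, copies, today):
    """Return the copy dates that the stage's retention policy would evict."""
    limit = RETENTION_DAYS[stage]
    if limit is None:
        return sorted(copies)[:-1]  # everything except the latest copy
    cutoff = today - timedelta(days=limit)
    return sorted(d for d in copies if d < cutoff)


today = date(2014, 6, 1)
copies = [date(2008, 1, 1), date(2010, 1, 1), date(2014, 1, 1)]

staged_evictions = expired("staged", copies, today)
cleansed_evictions = expired("cleansed", copies, today)
presented_evictions = expired("presented", copies, today)
```

The same set of copies yields different evictions per stage, which is what a per-dataset retention policy is meant to capture.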
Apache Solr
MapReduce Indexing Job
HDFS (Hadoop Distributed File System)
Apache Storm
Apache Knox
Knox Gateway (GW)
A stateless reverse proxy instance deployed in DMZ
-Requests streamed through GW to Hadoop services after auth.
-URLs rewritten to refer to gateway
Clients: Browser, REST Client, JDBC Client
Identity Providers: Enterprise Identity Provider, LDAP/AD
Network: Firewall, DMZ, Firewall
HDP Cluster 1: Masters, NN, JT, DN, TT, Web HCat, Oozie, YARN, HBase, Hive
HDP Hadoop Cluster 2: Masters, NN, JT, DN, TT, Web HCat, Oozie, YARN, HBase, Hive
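The "URLs rewritten to refer to gateway" behaviour of a reverse proxy can be sketched with the standard library. This is not Knox's rewrite-rule engine: the gateway address, the internal host names, and the function name are all hypothetical, and the example only swaps the scheme and host while keeping the service path and query.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical gateway base URL and internal service addresses.
GATEWAY = "https://gateway.example.com:8443/gateway/cluster1"
INTERNAL_HOSTS = {"namenode.internal:50070", "oozie.internal:11000"}


def rewrite_to_gateway(internal_url):
    """Rewrite an internal cluster URL so that it points at the gateway."""
    parts = urlsplit(internal_url)
    if parts.netloc not in INTERNAL_HOSTS:
        raise ValueError(f"unknown internal service: {parts.netloc}")
    gw = urlsplit(GATEWAY)
    # Keep the service path and query; replace scheme and host with the gateway's.
    return urlunsplit(
        (gw.scheme, gw.netloc, gw.path + parts.path, parts.query, parts.fragment)
    )


rewritten = rewrite_to_gateway(
    "http://namenode.internal:50070/webhdfs/v1/tmp?op=LISTSTATUS"
)
print(rewritten)
```

With rewriting like this, a client outside the firewall never needs to know, or be able to reach, the internal host names.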
Ambari: Deploy, Manage, Monitor
AMBARI WEB
REST APIs
AMBARI SERVER
PROVISION | MANAGE | MONITOR
compute & storage
Ambari SCOM Server aggregates + exposes Hadoop metrics
Ambari SCOM Mgmt Pack
Ambari SCOM Server
Ambari SCOM monitors health + alerts in case of problems
HADOOP
Storage & Process at Scale
http://www.trySQLSever.com
http://www.powerbi.com
http://microsoft.com/bigdata
http://channel9.msdn.com/Events/TechEd
www.microsoft.com/learning
http://microsoft.com/technet
http://microsoft.com/msdn