“Cloudera Hadoop”

By Kittirak Moungmingsuk
Managing Director of Cluster Kit and President of the Open Source Education and Development Association
Big Data & Analytics seminar by Data Cube (facebook.com/datacube.th)
Cloudera Hadoop
Kittirak Moungmingsuk
[email protected]
Apr 4, 2015 Data Cube Seminar @ KU HOME
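Getting Acquainted
Kittirak Moungmingsuk, nickname กก
Currently fills many roles at a small business called “Cluster Kit”
and has been entrusted by many people to serve as President of the Open Source Education and Development Association (OSEDA)
Education
Nak Tham Tri (elementary-level Dhamma studies), Ubon Ratchathani provincial school, Wat Pa Wiwek (Thammachan)
Enjoys using the Internet, traveling, and taking part in various activities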
2
Cluster Kit: Achievements
ThaiGrid (Tera Cluster): 800 cores, Linux cluster; 133 cores, Windows cluster
Sila Cluster @ Ramkhamhaeng U.: 286 cores
BIOTEC (Eclipse Cluster): 704 cores
Virgin Radio Thailand: 7 nodes, web cluster
Geo-Informatics and Space Technology Development Agency (GISTDA): 10 nodes, web cluster
HAII (HAII Cluster I, II): 480 cores
3
Top500.org (update Nov 2014)
4
Top500 Architecture Share (June 2014)
5
Top500 OS Share
6
Why Big Data?
7
Source: https://practicalanalytics.files.wordpress.com/2012/10/newstyleofit.jpg
8
Source: http://smartdatacollective.com/yellowfin/75616/why-big-data-and-business-intelligence-one-direction
9
10
Facebook Usage Statistics (June 2014)
829 million daily active users
654 million mobile daily active users
1.32 billion monthly active users
1.07 billion mobile monthly active users
Approximately 81.7% of our daily active users are outside the US and Canada
Source: http://newsroom.fb.com/company-info/
11
Google Usage Statistics
Data from http://expandedramblings.com/index.php/by-the-numbers-a-gigantic-list-of-google-stats-and-facts/#.VDavqq2mhNA
Number of monthly Google searches
11.944 billion (3/20/14)
Number of monthly unique visitors
187 million (3/25/14)
12
Number of servers
Google
More than a million servers
Facebook
180,900 servers
https://www.facebook.com/ArcadianLearnings/posts/549836811713533
13
Low Cost
High Performance
14
http://www.opencompute.org/
15
Software
Linux
Python, C++, Java, JavaScript, Go, Sawzall (a custom logging language)
Hadoop
Linux
PHP, C++, Java, Python, and Ruby
Apache Web Server
MySQL
Hadoop
Memcached, Flashcache
HipHop to transform PHP source code into C++ and gain performance benefits
16
What is Hadoop?
HDFS
MapReduce
How to build Hadoop cluster
How to execute MapReduce
Hive SQL
17
Hadoop – How was it Born?
To process huge volumes of data (Big Data), as the amount of generated data continued to increase rapidly.
The Web was also generating more and more information, and indexing that content was becoming quite challenging.
18
What Is Apache Hadoop?
The Apache Hadoop software library is a
framework that allows for the distributed
processing of large data sets across clusters of
computers using simple programming models.
Image Source: http://blogs.ejb.cc/archives/4290/hadoop-technical-manuals-athe-hadoop-ecosystem/tumblr_lbbwggcer71qappj8
19
HDFS Architecture
Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
20
Data Replication
Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
21
Hadoop - Basic Architecture
Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html
22
Hadoop - Basic Architecture (contd.)
Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html
23
MapReduce
MapReduce is a programming model for processing
large data sets, and the name of an implementation
of the model by Google. – Wikipedia
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
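The map and reduce signatures above can be illustrated with a small single-process sketch. This is not how Hadoop runs jobs (there is no distribution, no HDFS, no fault tolerance); it only shows the data flow of map, group by key, and reduce. The names `run_mapreduce`, `max_temp_mapper`, `max_temp_reducer`, and the temperature readings are hypothetical and exist only for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in a single process."""
    # Map phase: each (k1, v1) record yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # Shuffle phase: collect all values that share the same k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: each (k2, list(v2)) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

# Hypothetical example job: maximum temperature per city.
def max_temp_mapper(line_no, line):
    city, temp = line.split(",")
    return [(city, int(temp))]

def max_temp_reducer(city, temps):
    return [(city, max(temps))]

readings = [
    (0, "Bangkok,35"),
    (1, "Chiang Mai,28"),
    (2, "Bangkok,38"),
    (3, "Chiang Mai,31"),
]
result = run_mapreduce(readings, max_temp_mapper, max_temp_reducer)
print(result)
```

In a real cluster the shuffle step moves data between machines, which is why the mapper and reducer are written as independent functions that only see their own key and values.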
24
MapReduce
Output in a list of (Key, Value)
Image source: http://www.rabidgremlin.com/data20/#(3)/
Output in a list of (Key, List of Values)
25
WordCount - MapReduce
Map function
Reduce function
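The map and reduce functions on this slide were shown as images, so the following is only a minimal Python sketch of what a WordCount pair might look like, not the code from the slide. The function names `map_words`, `reduce_counts`, and `word_count` are hypothetical, and lowercasing words before counting is an assumption made for this example.

```python
from collections import defaultdict

def map_words(offset, line):
    """Map function: emit (word, 1) for every word in the line."""
    # Lowercasing is an assumption so that "Hello" and "hello" count together.
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    """Reduce function: sum the counts collected for one word."""
    return (word, sum(counts))

def word_count(lines):
    """Tiny local driver: group mapper output by word, then reduce each group."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, one in map_words(offset, line):
            grouped[word].append(one)
    return dict(reduce_counts(word, ones) for word, ones in grouped.items())

counts = word_count(["Hello Hadoop", "hello MapReduce hello"])
print(counts)
```

The mapper never needs to know the total for a word; it only emits a 1 per occurrence, and the grouping step delivers all of them to a single reducer call.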
26
WordCount by Pig
Apache Pig is a platform for analyzing large data
sets that consists of a high-level language for
expressing data analysis programs
https://pig.apache.org/
A = load 'input/*';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into 'output/wordcount-pig';
Source: http://salsahpc.indiana.edu/ScienceCloud/pig_word_count_tutorial.htm
27
Hive
Hive provides a mechanism to project structure onto data stored in Hadoop
and to query that data using a SQL-like
language called HiveQL.
hive> show tables;
hive> create table country (country_id int, country string) row format delimited fields terminated by ',' stored as textfile;
hive> desc country;
hive> load data local inpath '/tmp/country.csv' into table country;
hive> select count(country_id) from country where country like 'T%';
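To make the result of the final query concrete, here is a small Python sketch of the same computation over comma-delimited text: count the rows that have a country_id and whose country name starts with "T". It does not use Hive at all. The sample rows in `SAMPLE_CSV` are hypothetical stand-ins for /tmp/country.csv, whose real contents are not given in the slides, and `count_countries_with_prefix` is an illustrative name.

```python
import csv
import io

# Hypothetical sample standing in for /tmp/country.csv (country_id,country).
SAMPLE_CSV = "1,Thailand\n2,Japan\n3,Turkey\n4,Laos\n5,Tonga\n"

def count_countries_with_prefix(csv_text, prefix):
    """Count rows with a non-empty country_id whose country starts with prefix."""
    count = 0
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip malformed lines
        country_id, country = row[0], row[1]
        # count(country_id) ignores missing ids; the prefix test mirrors like 'T%'.
        if country_id != "" and country.startswith(prefix):
            count += 1
    return count

t_count = count_countries_with_prefix(SAMPLE_CSV, "T")
print(t_count)
```

The point of Hive is that the one-line query expresses this loop declaratively and lets the cluster run it over files far too large for a single machine.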
28
Apache™ Mahout is a library of scalable machine learning algorithms, implemented on top of Apache
Hadoop® and using the MapReduce paradigm.
Mahout supports four main data science use cases:
Collaborative filtering
Clustering
Classification
Frequent itemset mining
29
List of algorithms (for distributed mode)
Distributed Item-based Collaborative Filtering
Canopy Clustering
Dirichlet Process Clustering
Hierarchical Clustering
Collaborative Filtering Using a Parallel Matrix Factorization
Latent Dirichlet Allocation
Bayesian
Fuzzy K-Means
K-Means Clustering
Mean Shift Clustering
Minhash Clustering
Spectral Clustering
Random Forests
Parallel FP Growth Algorithm
Source: http://hortonworks.com/hadoop/mahout/
30
Source: http://imgbuddy.com/hadoop-ecosystem-components.asp
31
HADOOP 1.0 vs 2.0
Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/
32
Hadoop 2.0 : YARN
(Yet Another Resource Negotiator)
Source: http://hortonworks.com/get-started/yarn/
33
Cloudera Hadoop (CDH)
CDH is Cloudera's open source software distribution
http://www.cloudera.com/
Source: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
34
Cloudera GUI (Hue)
35
Other Hadoop Platforms
Hortonworks
http://hortonworks.com/
MapR
https://www.mapr.com/
36
References
https://www.facebook.com/Engineering
https://www.facebook.com/data
https://www.facebook.com/publication
http://research.google.com
http://googleblog.blogspot.com
37
References
Stratapps, “An Introduction to Hadoop”, http://stratapps.net/intro-hadoop.php
edureka!, “Introduction to Hadoop 2.0 and advantages of Hadoop 2.0 over 1.0”, http://www.edureka.co/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/, May 2014
38
Second-Hand Computers for Children in Rural Areas Project
39
The End.
Download these slides at
http://goo.gl/DoibT2
Tweet to me at @kittirak