How to program MapReduce jobs in Hadoop with R
Group 8
João Rosa, Mario Almeida, Alex Pérez
Index
● Introduction
● Hadoop
● MapReduce
● R
● Why and how?
● Possible uses? Business opportunities?
● Conclusion
● Questions
● References
Nowadays, we have lots of data. BIG DATA!
If we need to analyse this, we have a problem...
...but if we need to analyse this, we have a BIG DATA problem!
How can we analyse this BIG DATA?
Hadoop + R: a possible solution!
The Apache™ Hadoop™ project develops open-source
software for reliable, scalable, distributed computing.
The project includes these subprojects:
Hadoop Common is a set of utilities that support the Hadoop subprojects.
Hadoop Common includes FileSystem, RPC, and serialization libraries.
The Hadoop Distributed File System (HDFS) is a
distributed file system designed to run on commodity
hardware. It has many similarities with existing distributed
file systems. However, the differences from other
distributed file systems are significant.
● Highly fault-tolerant to hardware failure
● Designed to be deployed on low-cost hardware
● Streaming data access
● Large data sets
● Portability across heterogeneous hardware and software platforms
Hadoop MapReduce supports distributed computing on large data sets on clusters of computers.
It processes large amounts of raw data in two phases: Map + Reduce
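The Map + Reduce idea can be sketched on a single machine. The example below is a minimal illustration in Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made up for this sketch. It counts words by emitting key/value pairs, grouping them by key, and summing each group.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, which a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    print(word_count(["big data", "Big problem", "big data problem"]))
```

On a real cluster the map calls run in parallel on different nodes and the grouping step moves data across the network; here everything runs in one process to keep the logic visible.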
R is the language of Pirates!!!
Rrrrr
What is R?
It's a language and environment for statistical computing and graphics!
● Used by about 2 million analysts!
● Popular in quantitative finance!
● Used at Google, Facebook and LinkedIn!
Why R?
● Current analytic solutions are costly!
● New methods for analyzing complex datasets!
Why Hadoop with R?
● "Easiest, most productive, most elegant way to write map reduce jobs."
● One to two orders of magnitude less code than Java.
● Readable, reusable and extensible language.
● Gives R analysts a way to access the MapReduce programming paradigm on big data sets.
How to use Hadoop with R?
RHadoop = rmr + rhdfs + rhbase
● rhdfs - functions providing file management of the HDFS from within R (via rJava).
● rhbase - functions providing database management for the HBase distributed database from within R (via Thrift).
● rmr - functions providing Hadoop MapReduce functionality in R.
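To show why a job written this way stays short, here is a rough sketch of the programming model: the analyst supplies only a map function and a reduce function, and a generic runner does the rest. It is written in Python as a stand-in and is not the rmr API; the helper names `keyval` and `mapreduce`, the functions `map_sale` and `reduce_mean`, and the sales records are all hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def keyval(key, value):
    """Hypothetical helper: wrap a key and a value as one pair."""
    return (key, value)


def mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory runner: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


def map_sale(record):
    """Map: key each sale amount by its store."""
    store, amount = record
    return [keyval(store, amount)]


def reduce_mean(store, amounts):
    """Reduce: average all amounts recorded for one store."""
    return keyval(store, sum(amounts) / len(amounts))


# Made-up records, (store, amount), for illustration only.
sales = [("north", 10.0), ("south", 4.0), ("north", 20.0), ("south", 6.0)]
result = mapreduce(sales, map_sale, reduce_mean)
print(result)
```

The whole analysis is the two small functions `map_sale` and `reduce_mean`; the runner is the part a framework would provide.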
Business opportunities?
[Comic from xkcd.com]
Conclusions
● Productivity vs efficiency
● Wide variety of statistical and graphical techniques
● Business orientation
Questions?
References
http://hadoop.apache.org/ - Apache Hadoop project
http://www.r-bloggers.com/how-to-program-mapreduce-jobs-in-hadoop-with-r/ - teacher's page
http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf - MapReduce
https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial - MapReduce in R tutorial
http://www.inside-r.org/r-doc/base/lapply - R lapply
http://www.inside-r.org/r-doc/base/tapply - R tapply
http://www.revolutionanalytics.com/what-is-open-source-r/ - What is R?
http://www.r-project.org/ - What is R? Official page
http://en.wikipedia.org/wiki/R_(programming_language) - R wiki
http://www.johndcook.com/R_language_for_programmers.html - R programming for those coming from other languages
http://www.revolutionanalytics.com/why-revolution-r/whitepapers/r-is-hot.php - why R is hot
Pictures
We tried to use CC pictures; below are their respective links:
http://www.flickr.com/photos/nanagyei/4880468290 - pig pirates
http://www.flickr.com/photos/timypenburg/5328226108 - maths and pen
http://www.flickr.com/photos/48481327 - graduate
http://s0.geograph.org.uk/geophotos/01/53/43/1534341_7dc47500.jpg - store
http://www.flickr.com/photos/dizfunk/3066153143/ - nerd
http://geekithawaii.com/wp-content/uploads/2011/01/7562581_l.jpg - sky
http://www.robweir.com/blog/wp-content/uploads/2011/01/numbers.jpg - numbers
http://delightfulchildrensbooks.files.wordpress.com/2011/02/read-around-the-world.jpg - children
Others:
http://www.xkcd.com