How to program MapReduce jobs in Hadoop with R

Group 8
João Rosa, Mario Almeida, Alex Pérez

Index
● Introduction
● Hadoop
● MapReduce
● R
● Why and how?
● Possible uses?
● Business opportunities?
● Conclusion
● Questions
● References

Nowadays, we have lots of data. BIG DATA!

If we need to analyse this, we have a problem... but if we need to analyse this, we have a BIG DATA problem!

How can we analyse this BIG DATA?

Hadoop + R: a possible solution!

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing. The project includes these subprojects:

Hadoop Common is a set of utilities that support the Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
● Highly fault-tolerant to hardware failure
● Designed to be deployed on low-cost hardware
● Streaming data access
● Large data sets
● Portability across heterogeneous hardware and software platforms

MapReduce supports distributed computing on large data sets on clusters of computers. It processes large amounts of raw data in two steps: Map + Reduce.

R is the language of Pirates!!! Rrrrr

What is R?
● It's a language and environment for statistical computing and graphics!
● 2 million analysts!
● Quantitative finance!
● Google, Facebook and LinkedIn!

Why R?
● Current analytic solutions are costly!
● New methods for analyzing complex datasets!

Why Hadoop with R?
● "Easiest, most productive, most elegant way to write map reduce jobs."
● One to two orders of magnitude less code than Java.
● Readable, reusable and extensible language.
● To give R analysts a way to access the MapReduce programming paradigm on big data sets.

How to use Hadoop with R?
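A minimal sketch of what a MapReduce job written in R can look like, in the spirit of the counting example from the RHadoop tutorial listed in the references. It is not taken from the slides. It assumes the rmr2 package (a later release of rmr) is installed and uses its local backend, so no cluster is needed; the sample data and the variable names (`groups`, `local.counts`, `mr.counts`) are illustrative. The rmr2 part is skipped when the package is missing, so the plain R part still runs on its own.

```r
# Illustrative sample data (not from the slides): one observation per record.
groups <- c(3, 1, 2, 3, 3, 2, 1, 3)

# Plain R: count how often each value occurs.
local.counts <- tapply(groups, groups, length)

# The same count as a MapReduce job. Assumes the rmr2 package
# (a later release of rmr) is installed; skipped otherwise.
if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)

  # Local backend: runs the job in-process, without a Hadoop cluster.
  rmr.options(backend = "local")

  job <- mapreduce(
    input = to.dfs(groups),
    # Map: emit each value as a key paired with a count of 1.
    map = function(k, v) keyval(v, 1),
    # Reduce: the number of values received for a key is its count.
    reduce = function(k, vv) keyval(k, length(vv))
  )

  # Read the result back into the R session as key/value pairs.
  result <- from.dfs(job)
  mr.counts <- setNames(values(result), keys(result))
}
```

Here `local.counts` holds 2, 2 and 4 for the values 1, 2 and 3, and the MapReduce job is meant to produce the same counts. On a real cluster the local backend line would be dropped so the job runs on Hadoop; that setup is outside this sketch.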
RHadoop = rmr + rhdfs + rhbase
● rhdfs - functions providing file management of HDFS from within R (via rJava).
● rhbase - functions providing database management for the HBase distributed database from within R (via Thrift).
● rmr - functions providing Hadoop MapReduce functionality in R.

Business opportunities?
[Comic from xkcd.com]

Conclusions
● Productivity vs. efficiency
● Wide variety of statistical and graphical techniques
● Business orientation

Questions?

References
http://hadoop.apache.org/ - Apache Hadoop project
http://www.r-bloggers.com/how-to-program-mapreduce-jobs-in-hadoop-with-r/ - teacher's page
http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf - MapReduce
https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial - MapReduce in R tutorial
http://www.inside-r.org/r-doc/base/lapply - R lapply
http://www.inside-r.org/r-doc/base/tapply - R tapply
http://www.revolutionanalytics.com/what-is-open-source-r/ - What is R?
http://www.r-project.org/ - What is R? Official page
http://en.wikipedia.org/wiki/R_(programming_language) - R wiki
http://www.johndcook.com/R_language_for_programmers.html - R programming for those coming from other languages
http://www.revolutionanalytics.com/why-revolution-r/whitepapers/r-is-hot.php - why R is hot

Pictures
We tried to use CC pictures; below are their respective links:
http://www.flickr.com/photos/nanagyei/4880468290 - pig pirates
http://www.flickr.com/photos/timypenburg/5328226108 - maths and pen
http://www.flickr.com/photos/48481327 - graduate
http://s0.geograph.org.uk/geophotos/01/53/43/1534341_7dc47500.jpg - store
http://www.flickr.com/photos/dizfunk/3066153143/ - nerd
http://geekithawaii.com/wp-content/uploads/2011/01/7562581_l.jpg - sky
http://www.robweir.com/blog/wp-content/uploads/2011/01/numbers.jpg - numbers
http://delightfulchildrensbooks.files.wordpress.com/2011/02/read-around-the-world.jpg - children
Others: http://www.xkcd.com