Pig and Pig Latin
• [email protected] pig.apache.org

Pig
Pig is a dataflow language for massive data sets. Currently it runs on top of MapReduce, but other frameworks are coming. It is used by many companies (Twitter, Facebook, etc.) and is often combined with Hive.

What is Pig?
• Pig is an implementation of the relational operators
• Think of relational algebra: union, intersect, join, group…
• Pig is all about relations (think tables)

Where does Pig live?
• Pig is installed on the user machine
• No need to install anything on the Hadoop cluster
• Pig controls job submission to the cluster
• Extendable with User Defined Functions (UDFs)

MapReduce API
• Low level but flexible
• Not human-fault tolerant: people make mistakes
• The burden of complexity rests on the programmer

Word Count in Pig Latin
    lines = load 'mary' as (line);
    words = foreach lines generate flatten(TOKENIZE(line)) as word;
    grpd = group words by word;
    cntd = foreach grpd generate group, COUNT(words);
    dump cntd;
Slide by YAHOO

Pig code
Slide by YAHOO

Translates into four MapReduce jobs
Slide by YAHOO

Using Pig
There are several ways to use Pig:
• local or cluster mode
• file-based or interactive command line

Pig's local mode
• Local mode is good for testing and learning.
• No HDFS is used. You can read and write to local files.
• You only have to change your input/output paths when you want to run the same script on a cluster.

Using Pig in local mode
• Connect to your VM:
    $ ssh [email protected] -p 2222
• Start Pig in local mode:
    $ pig -x local
• When using a script:
    $ pig -x local filename.pig

Pig cluster mode
• Connect to your VM:
    $ ssh [email protected] -p 2222
• Start Pig in cluster mode:
    $ pig
• When using a script:
    $ pig filename.pig

Remember
• Cluster mode on your VM is slow: it tries to mimic an entire cluster.
• You can speed things up by using Tez mode:
    $ pig -x tez [filename]
  (Tez is a newer, optimised execution framework that Pig can use instead of MapReduce)

Hue
• You can use HUE on your VM, but it runs only in (slow) cluster mode

Data model
    d = (1, {(2,3),(4,6),(5,7)}, ['apache'#'search'])
• d is the alias; its value is a tuple with three fields
• f1 (atom): 1
• f2 (bag): {(2,3),(4,6),(5,7)}
• f3 (map): ['apache'#'search']

Load
• Loads a data file (from HDFS or the local FS)
• Assumes that every dataset is a sequence of tuples
• A schema can be specified with AS
• USING specifies a loader, which parses the input
    A = LOAD 'myfile.txt' USING PigStorage(',') AS (f1,f2,f3);

LOAD with a schema
    Songs = LOAD 'lastfm/songs.tsv' AS (user:chararray, timestamp, artist, track:chararray);

Filter
    A = LOAD 'myfile.txt' USING PigStorage(',');
    B = FILTER A BY $1 <= 21;
Here $1 refers to the second field ($0 is the first).

Input:
    (1,2,3)
    (4,2,1)
    (8,3,4)
    (4,3,3)
    (7,2,5)
    (8,4,3)
Script:
    A = LOAD 'data' AS (a1:int,a2:int,a3:int);
    X = FILTER A BY a3 == 3;
    DUMP X;
Output:
    (1,2,3)
    (4,3,3)
    (8,4,3)

Don't filter late!
Filter early

Group BY
GROUPing creates a relation with unique keys and all associated rows.
Example: group all student data (rows) by department
    students = LOAD 'students.txt' AS (first:chararray, last:chararray, age:int, dept:chararray);
    students_grouped = GROUP students BY dept;
Result:
    (CS,{(Willia,Cracknell,18,CS),(Del,Graefe,20,CS),(Douglass,Adelizzi,23,CS)})
    (EE,{(Wes,Knill,23,EE)})
    (Psych,{(Warner,Caminita,24,Psych)})
    (Biology,{(Lesley,Kellywood,20,Biology)})
    (English,{(Francesco,Corraro,21,English),(Ellyn,Meyerhoefer,18,English)})
    (History,{(Lino,Feddes,22,History),(Lucius,Orlosky,20,History)})
There is one row per unique key, and the bag holds all entries containing that key.

Group by
    X = GROUP A BY f1;
A =
    (1,2,3)
    (4,2,1)
    (8,3,4)
    (4,3,3)
    (7,2,5)
    (8,4,3)
X =
    (1,{(1,2,3)})
    (4,{(4,2,1),(4,3,3)})
    (7,{(7,2,5)})
    (8,{(8,3,4),(8,4,3)})
• The first field is named 'group'
• The second field is now called 'A'

For each … generate
FOREACH takes a record as input and generates a new one by applying a set of expressions to it. It is essentially a projection operator: it selects fields from a record, applies some transformations to them and outputs a new record.

Example foreach
    A = LOAD 'input' AS (user, name, income_a, income_b);
    B = FOREACH A GENERATE user, income_a + income_b AS total_income;

What % does each age group make up of total?
    students = LOAD 'students.txt' AS (first:chararray, last:chararray, age:int, dept:chararray);
    students_grp = GROUP students BY age;
    students_ct = FOREACH students_grp GENERATE group AS age, COUNT_STAR(students) AS ct;
    students_total = FOREACH (GROUP students_ct ALL) GENERATE SUM(students_ct.ct) AS total;
    students_proj = FOREACH students_ct GENERATE age, (double)ct / (long)students_total.total AS pct;

JOIN performs an inner join on two or more relations based on common field values.

Word Count in Pig

Advanced stuff…
• Nested foreach…
• Suppose you group stocks by exchange and want to find the average number of unique stocks per exchange
• In general, you want to iterate over the items in a group

Nested foreach

Finally… USER DEFINED FUNCTIONS (UDFs)
• Write your own functions in Java or Python
• Libraries like Piggybank and DataFu contain many useful UDFs

Named entity recognition
    register lib/stanford-ner.jar;
    register pigner.jar;
    -- Define the function
    DEFINE pigner nl.surfsara.pig.Pigner();
    -- Load some text line by line
    lines = LOAD 'sample.txt' USING TextLoader();
    -- Apply the function to each line
    taggedlines = FOREACH lines GENERATE pigner();
    -- Dump output tuples to standard out
    dump taggedlines;

Exercises: http://bit.ly/1IWZN4f
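
The nested foreach question above (the average number of unique stocks per exchange) can be traced by hand outside Pig. The sketch below is a plain-Python emulation, not Pig itself: it mimics GROUP, then a DISTINCT and COUNT inside each group, then an average over the groups. The sample rows and the names daily, group_by and unique_symbols_per_exchange are hypothetical and chosen only for illustration; the Pig Latin lines in the comments are a rough equivalent, not code taken from the slides.

```python
from collections import defaultdict

# Hypothetical (exchange, symbol) rows; this is not data from the slides.
daily = [
    ("NYSE", "IBM"),
    ("NYSE", "IBM"),
    ("NYSE", "GE"),
    ("NASDAQ", "AAPL"),
    ("NASDAQ", "MSFT"),
    ("NASDAQ", "AAPL"),
    ("NASDAQ", "GOOG"),
]


def group_by(rows, key_index):
    """Emulate `GROUP rows BY field`: one (key, bag) pair per unique key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    # Sort by key so the result is deterministic (Pig gives no such guarantee).
    return sorted(bags.items())


def unique_symbols_per_exchange(rows):
    """Emulate a nested FOREACH over each group's bag."""
    result = []
    for exchange, bag in group_by(rows, 0):      # grpd = GROUP daily BY exchange;
        symbols = [row[1] for row in bag]        # sym = daily.symbol;
        uniq = set(symbols)                      # uniq_sym = DISTINCT sym;
        result.append((exchange, len(uniq)))     # GENERATE group, COUNT(uniq_sym);
    return result


uniqcnt = unique_symbols_per_exchange(daily)
# Average over all groups, comparable to AVG after `GROUP uniqcnt ALL`.
avg_unique = sum(count for _, count in uniqcnt) / len(uniqcnt)

print(uniqcnt)      # [('NASDAQ', 3), ('NYSE', 2)]
print(avg_unique)   # 2.5
```

Calling group_by on the six tuples from the Group by slide, keyed on the first field, produces the same key and bag pairs as the X relation shown there, which is a quick way to check one's reading of GROUP before running a script on a cluster.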
© Copyright 2024