Pig - BeeHub

Pig and Pig Latin
pig.apache.org
Pig
Pig is a dataflow language for massive data sets.
Currently it runs on top of MapReduce, but
other frameworks are coming…
Used by many companies (Twitter, Facebook,
etc.) and often combined with Hive.
What is Pig?
• Pig is an implementation of the relational operators
• Think of relational algebra: union, intersect, join, group…
• Pig is all about relations (think tables)
Where does Pig live?
• Pig is installed on the user machine
• No need to install anything on the Hadoop cluster
• Pig controls job submission to the cluster
• Extendable by User Defined Functions (UDFs)
MapReduce API
• Low level but flexible
• Not human fault tolerant - people make
mistakes
• The burden of complexity rests on the
programmer
Word Count in Pig Latin
lines = load 'mary' as (line);
words = foreach lines generate flatten(TOKENIZE(line)) as word;
grpd = group words by word;
cntd = foreach grpd generate group, COUNT(words);
dump cntd;
Slide by YAHOO
Pig code
Translates into four Map Reduce Jobs
Using Pig
There are several ways to use Pig:
• in local mode or in cluster mode
• from a script file or from the interactive command line
Pig's local mode
• Local mode is good for testing and learning.
• No HDFS is used. You read from and write to local files.
• You only have to change your input/output paths when you want to run the same script on a cluster.
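One way to keep the script itself identical between local and cluster runs is Pig's parameter substitution. The following is a minimal sketch based on the word count script shown earlier. The file name `wordcount.pig` and the parameter names `input` and `output` are assumptions made for illustration.

```pig
-- wordcount.pig (hypothetical file name)
-- The defaults below are local paths; override them on the command line
-- when the same script has to run against HDFS.
%default input 'mary'
%default output 'wordcount_out'

lines = LOAD '$input' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd  = GROUP words BY word;
cntd  = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO '$output';
```

Locally this would be started with `pig -x local wordcount.pig`. On a cluster, the paths could be passed as `pig -param input=<hdfs input path> -param output=<hdfs output path> wordcount.pig`, where the two paths are placeholders.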
Using Pig in local mode
• Connect to your VM:
$ ssh [email protected] -p 2222
• Start Pig in local mode:
$ pig -x local
• When using a script:
$ pig -x local filename.pig
Pig cluster mode
• Connect to your VM:
$ ssh [email protected] -p 2222
• Start Pig in cluster mode:
$ pig
• When using a script:
$ pig filename.pig
Remember
• Cluster mode on your VM is slow… it tries to mimic an entire cluster…
• You can speed things up by using Tez mode:
$ pig -x tez [filename]
(Tez is a newer execution framework that runs a script as a single graph of tasks instead of a chain of separate MapReduce jobs)
Hue
• You can use Hue on your VM; it runs only in (slow) cluster mode
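Data model
d = (1, {(2,3),(4,6),(5,7)}, ['apache'#'search'])
• d is the alias; the whole value in parentheses is a tuple
• f1 (atom): 1
• f2 (bag): {(2,3),(4,6),(5,7)}
• f3 (map): ['apache'#'search']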
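To make the three field kinds concrete, here is a hedged sketch of how such a tuple could be declared and taken apart. The input file `data` and the inner field names `x` and `y` are assumptions. Only `f1`, `f2` and `f3` come from the slide.

```pig
-- Hypothetical file 'data' whose rows look like the tuple d above.
d = LOAD 'data' AS (f1:int,
                    f2:bag{t:tuple(x:int, y:int)},
                    f3:map[]);

-- f2.x projects one column out of the bag (the result is again a bag);
-- f3#'apache' looks up the value stored under the key 'apache'.
e = FOREACH d GENERATE f1, f2.x, f3#'apache';
DUMP e;
```

The `#` operator is how Pig Latin dereferences a map, and the dot is how it projects fields out of a bag or tuple.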
Load
Loads a data file (from HDFS or the local file system).
Assumes that every dataset is a sequence of tuples.
USING specifies a loader which parses the input.
A schema can be specified with AS:
A = LOAD 'myfile.txt' USING PigStorage(',') AS (f1,f2,f3);
LOAD with a schema
Songs = LOAD 'lastfm/songs.tsv' AS (user:chararray, timestamp, artist, track:chararray);
LOAD without a schema
A = LOAD 'myfile.txt' USING PigStorage(',');
B = FILTER A BY $1 <= 21;
Here $1 refers to the second field ($0 is the first…).
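Named and positional references can be mixed once a schema is given. This is a small sketch with made-up field names (`name`, `age`, `gpa`) to show that both forms address the same column.

```pig
-- Hypothetical comma-separated input with three columns.
A = LOAD 'myfile.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);

-- These two filters are equivalent: age is the second field, i.e. $1.
B = FILTER A BY age <= 21;
C = FILTER A BY $1 <= 21;

-- DESCRIBE prints the schema Pig has recorded for an alias.
DESCRIBE A;
```

Declaring types in the schema also avoids relying on implicit casts from the default bytearray type.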
Filter
Contents of A:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
X = FILTER A BY a3 == 3;
DUMP X;
Output:
(1,2,3)
(4,3,3)
(8,4,3)
Don’t filter late!
Filter early
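As an illustration of the advice, the sketch below filters before grouping so that fewer rows reach the expensive GROUP step. It assumes the `students.txt` layout used elsewhere in these slides. The threshold of 21 and the alias names are made up.

```pig
students = LOAD 'students.txt' AS (first:chararray, last:chararray, age:int, dept:chararray);

-- Filter first: only the surviving rows are shuffled for the GROUP.
adults  = FILTER students BY age >= 21;
by_dept = GROUP adults BY dept;
counts  = FOREACH by_dept GENERATE group AS dept, COUNT(adults) AS n;
DUMP counts;
```

Pig's optimizer can move some filters earlier on its own, but writing the filter where it belongs keeps the intent clear and does not depend on that.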
Group BY
GROUPing creates a relation with unique keys and all
associated rows.
Example: group all student data (rows) by department
students = LOAD 'students.txt' AS (first:chararray, last:chararray, age:int, dept:chararray);
students_grouped = GROUP students BY dept;
Result:
(CS,{(Willia,Cracknell,18,CS),(Del,Graefe,20,CS),(Douglass,Adelizzi,23,CS)})
(EE,{(Wes,Knill,23,EE)})
(Psych,{(Warner,Caminita,24,Psych)})
(Biology,{(Lesley,Kellywood,20,Biology)})
(English,{(Francesco,Corraro,21,English),(Ellyn,Meyerhoefer,18,English)})
(History,{(Lino,Feddes,22,History),(Lucius,Orlosky,20,History)})
One row per unique key; the bag holds all
entries containing that key.
Group by
X = GROUP A BY f1;   (f1 is the first field of A)
A =
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
X =
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
The first field of X is named 'group'.
The second field is a bag named 'A', after the relation that was grouped.
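A grouped relation is usually fed straight into an aggregate. The following sketch assumes A was loaded with the field names `f1`, `f2`, `f3`. The schema and the output aliases `n` and `f3_total` are assumptions.

```pig
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
X = GROUP A BY f1;

-- 'group' holds the key; 'A' is the bag of original tuples for that key.
Y = FOREACH X GENERATE group AS f1, COUNT(A) AS n, SUM(A.f3) AS f3_total;
DUMP Y;
```

With the six tuples of A shown above, Y would contain (1,1,3), (4,2,4), (7,1,5) and (8,2,7), in no guaranteed order.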
FOREACH … GENERATE
FOREACH takes a record as input and generates
a new one by applying a set of expressions to it.
It is essentially a projection operator: it selects
fields from a record, applies some transformations
to them and outputs a new record.
Example foreach
A = LOAD 'input' AS (user, name, income_a, income_b);
B = FOREACH A GENERATE user, income_a + income_b AS total_income;
What % does each age group make up of the total?
students = LOAD 'students.txt' AS (first:chararray, last:chararray, age:int, dept:chararray);
students_grp = GROUP students BY age;
students_ct = FOREACH students_grp GENERATE group AS age, COUNT_STAR(students) AS ct;
students_total = FOREACH (GROUP students_ct ALL) GENERATE SUM(students_ct.ct) AS total;
students_proj = FOREACH students_ct GENERATE age, (double)ct / (long)students_total.total AS pct;
JOIN performs an inner join on two or more
relations based on common field values
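A hedged sketch of an inner join follows. It reuses the `students.txt` layout from the grouping example and assumes a second, hypothetical lookup file `depts.txt` with a department name and a building.

```pig
students = LOAD 'students.txt' AS (first:chararray, last:chararray, age:int, dept:chararray);
depts    = LOAD 'depts.txt' AS (dept:chararray, building:chararray);

-- Inner join: rows without a matching dept on the other side are dropped.
joined = JOIN students BY dept, depts BY dept;

-- After a join, same-named fields are disambiguated with relation::field.
result = FOREACH joined GENERATE students::first, students::last, depts::building;
DUMP result;
```

The joined relation contains all fields of both inputs, so a FOREACH is typically used afterwards to keep only the columns that are needed.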
Word Count in Pig
Advanced stuff…
• Nested foreach…
• Suppose you group stocks by exchange and want to find the average number of unique stocks per exchange
• In general you want to iterate over the items in a group
Nested foreach
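The stock example could look roughly like the sketch below. The input file `daily_trades` and its two-column schema are assumptions made for illustration.

```pig
-- Hypothetical input: one row per trade, with the exchange and the stock symbol.
daily = LOAD 'daily_trades' AS (exchange:chararray, symbol:chararray);
grpd  = GROUP daily BY exchange;

-- Inside the braces, the statements run on the bag of each group in turn.
uniqcnt = FOREACH grpd {
    sym      = daily.symbol;
    uniq_sym = DISTINCT sym;
    GENERATE group AS exchange, COUNT(uniq_sym) AS n_symbols;
};

-- Average the per-exchange counts over all exchanges.
avg_uniq = FOREACH (GROUP uniqcnt ALL) GENERATE AVG(uniqcnt.n_symbols);
DUMP avg_uniq;
```

The nested block is what allows DISTINCT to be applied per group rather than to the whole relation.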
Finally…
USER DEFINED FUNCTIONS (UDFs)
Write your own functions in Java or Python…
Libraries like Piggybank and DataFu contain
many useful UDFs.
Named entity recognition
register lib/stanford-ner.jar;
register pigner.jar;
-- Define the function
DEFINE pigner nl.surfsara.pig.Pigner();
-- Load some text line by line
lines = LOAD 'sample.txt' USING TextLoader();
-- Apply the function to each line ($0 is the single field produced by TextLoader)
taggedlines = FOREACH lines GENERATE pigner($0);
-- Dump output tuples to standard out
dump taggedlines;
Exercises:
http://bit.ly/1IWZN4f