Graphics I Outline Types of Plots Know your data types

4/25/15 Outline •  Review plot types Graphics I –  Univariate –  Bivariate –  Mul>variate •  Design –  Best prac>ce –  PiBalls –  Cri>que –  Percep>on & color •  Graphics Models in R –  Base –  Grammar of Graphics •  Informa>on visualiza>on vs Sta>s>cal graphics Know your data types Types of Plots The appropriate graphical techniques depend on the kind of data that you are working with •  Quan>ta>ve –  con>nuous – e.g., height, weight –  discrete – numeric data with few values, e.g., number of children in family •  Qualita>ve –  ordered – categories with an order but no meaningful distance between, e.g., number of stars for a movie ra>ng –  nominal -­‐ categories have no meaningful order, e.g., gender 1 4/25/15 One Variable Case: Infant Health load(url("hYp://www.stat.berkeley.edu/
users/nolan/data/KaiserBabies.rda")) Kaiser Study •  Oakland Kaiser mothers •  1960s •  Measure the babies weight (in ounces) at birth •  All babies: –  Male –  Single births (no twins, etc.) –  Survived 28 days Informa>on collected on mother’s and their babies • 
• 
• 
• 
• 
• 
• 
Birth weight (ounces) Gesta>on (weeks) Parity -­‐ total number of previous pregnancies Mother’s height and weight Mother’s smoking status Mother’s age, race, educa>on level, income And more… 2 4/25/15 Rug plot & Strip chart Histogram 80
100
120
140
160
180
Birth Weight
Density
60
0.000
Each baby’s weight is represented as a >ckmark. The thicker lines are from mul>ple babies with similar weights. I added a liYle random noise to the weights to keep them from falling on top of each other. 60
80
100
140
180
Birth Weight (oz)
Male babies born at Oakland Kaiser in the 1960s
0.020
0.005
0.010
0.015
Boxplot: 0.000
Density (proportion per oz)
Density plot Male babies born at Oakland Kaiser in the 1960s
150
140
180
Birth Weight (oz)
100
Violin Plot 60
80
100
120
Birth Weight (oz)
140
160
180
60
Density (proportion per oz)
100
0.000 0.005 0.010 0.015 0.020 0.025
50
1
3 4/25/15 150
0 50
•  This quan>ta>ve variable is different from birth weight – there are only a few possible values, i.e., it’s not possible to have 2.3 siblings, and it’s highly unlikely to have 17 > table(infants$parity)
0 1 2 3 4 5 6 7 8 9 10 11 13 315 310 238 168 83 52 32 16 8 7 4 2 1 Number of Siblings 250
Parity: Number of siblings 0
1
2
3
4
5
6
7
8
9
11
Number of siblings
0.20
Dot chart 0.10
13
11
10
9
8
7
6
5
4
3
2
1
0
0.00
Proportion
Alterna>ve – bar width has no meaning 0
0 1 2 3 4 5 6 7 8 9
11
13
50
100
150
200
250
300
Parity
Number of siblings
4 4/25/15 Smoking Status -­‐ Categorical 500
Pie chart pie Never
400
Qualita>ve Variables Bar chart 200
300
Unknown
Once
Until
0
100
Now
Never
Now
Until
Once
Unknown
ANGLES can be hard to compare Two Qualita>ve – Mosaic plot Trade
Some College
College
Now
Until
Once
Two Variables High School
Never
No HS Some HS
5 4/25/15 Line plots 150
100
Count
0
0
50
200
100
400
150
200
200
Barplots: side-­‐by-­‐side & stacked Now
Until
Once
Never
Now
Until
Once
200
0
50
Never
Some HS
HS
Trade
Some College
0
No HS
No HS
Some HS
Trade
College
San Francisco Chronicle lis>ngs Data •  Record: house sold in a par>cular >me period •  Over 200,000 houses •  Subset to a dozen ci>es in the East Bay – about 25,000 houses Variables: •  City •  County •  Price •  # bedrooms •  Lot square footage •  and 10 more load(url("hYp://www.stat.berkeley.edu/users/
nolan/data/Projects/SFHousing.rda")) 6 4/25/15 Side-­‐by-­‐Side Boxplots Rela>onship between city and sale price Data types: City -­‐ factor Sale price -­‐ numeric Side-­‐by-­‐side Violin plots Ci>es ordered by median price 2000
ppsf
1500
1000
500
0
Albany BerkeleyEmeryvilleOakland Piedmont
Walnut Creek
El CerritoKensingtonMoraga Orinda LafayetteRichmond
factor(city)
7 4/25/15 WHAT’s Wrong with this plot? Rela>onship between price per square foot and total square foot Both are quan>ta>ve 800 1200
400
0
Price/ft^2
Smooth scaYer plot 1000
2000
3000
4000
5000
6000
Area (ft^2)
8 4/25/15 Summary of graph rela>onships between two variables Rela>onships between more than 2 variables •  Two Qualita>ve variables – mosaicplot, side-­‐by-­‐side barplots, line plots •  Qualita>ve informa>on can be conveyed in plots through color, plomng symbol, juxtaposed panels •  The following plot uses informa>on from 4 variables: city, number of bedrooms, lot size (sq n), and price per square n •  One Quan>ta>ve and one Qualita>ve – Boxplots, dotcharts, mul>ple density plots, violin plots •  Two Quan>ta>ve variables – ScaYer plot, line plot Bsqn, PPSF, Bedrooms, City Berkeley
●
Piedmont
●
●
●
●
1 bedrooms
800
600
400
●
●
●
●
●
●
●
●
●● ●
●
● ● ●●
●
●
● ●
●
●
●
● ● ●
●
●
4 bedrooms
●
●
●
● ●
●
●
●
●
●
●
●
●●
●
●
● ● ●●●
●
●
●
●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●●● ● ● ●
●
● ●●●
●●
●
●
●
● ●
●
●
● ●
●
●●
●
●
● ●●
●
●
● ●●● ● ●
● ●●
● ●●●
●
●
●● ● ● ● ●
●
●●
●● ● ● ●● ●
● ●●
●
●
● ● ●●
●
●
●
●
●
●
●
●
●
●
●● ●● ● ●
●
● ●
● ● ●● ●●
● ●●
●● ● ●●
●●●●●●●● ● ● ●
●
●
●●
●
●
● ●●
●
●
●
● ●
●● ●
●●
●●● ● ● ● ● ●
●
●
●
●
●
● ●● ●● ●●
●●●
●● ●●
●
●
●
●
●● ●●●
●
●
● ●
● ●●
● ● ● ●
●
●●● ●● ●
●
● ●● ● ●● ● ●●
●
●
● ●●
● ●●●
●
●● ●
●
●●
● ●●
●● ●●●
●● ● ●
●
●
●● ●●●
●● ● ● ●
●●
● ● ● ●● ●●●
●
●
● ● ●●●●●●
●
●
● ● ●● ● ●●
●
●● ●●●
●
●
●●
●●
● ● ●●
●
●
●
● ●
●
●
● ● ● ● ●●●●●●● ● ●● ● ●●● ●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●●● ●
● ●●
●● ●
●
●
●●
●
●●
●
●
●● ● ●●
●●●
●●● ●●
●
●●
●
●
● ● ●●●● ●
●
● ●●● ●●●
● ●●
●● ●
●
● ●●
●
●
●
● ● ●● ●
●● ●
● ●●●●● ●●
● ●●●
●
●●
●● ●
●●●●
●●
●
● ●●● ●●●
●●●●●●
●
●
●
●●
●●●
●
●
●●●
● ● ●●● ●●
●
●●
●
●
●
●●
●●
●
●
●● ●
●●●
●
●
●
●● ● ●
● ●● ● ●●●
●● ● ●
●●
●
●●
●
●●
●
●●●●●
●●
●
●●●
●●
●
●● ●
●
● ●
●●●●●● ●●●● ●
● ●
●
●● ● ●
●
● ●● ●
●●
●
●●
●● ● ●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ●● ● ● ●● ●●●
●●
●
●
●●● ●●●
●● ●● ● ●
●
●● ●● ● ●● ● ●
●
● ● ● ● ● ●●
● ●
●●
●
●
● ●
●
● ● ●● ●
●● ● ●● ●●●●● ●
● ● ●● ●●
● ●
●
●
●
●● ●●●●●●●● ●● ● ●● ● ● ●●● ● ● ●●
● ●●● ● ● ●●●
●●
●● ● ●
●●
●
●● ● ●
●●● ● ●●●
● ●●●
●●●
● ●● ●
● ● ●●●● ● ● ● ●●
●●●●
●
●
●● ●●●
● ●●●●
●
●●
●
●
● ●●
●●●
●
● ●● ●
●●● ●●
●
●
●
●
● ●
●
● ●●●●● ●
● ●●
●
●
● ●
●● ●●
●●●●●
● ●●
●
●●
●●●
● ●● ●●
●●●
●● ● ●
●
●●
● ●●
●
● ●
●●●
●●
●●●
●● ●●● ●● ●●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●● ● ● ●
●
● ● ●●
● ●
●● ●
●●
●
● ●●
●● ●●
● ●
●
●
●●● ●●
●
● ●●
●
● ●
● ●● ● ●● ●
●●
●
●
●
● ●●●●●
●● ● ●●
●
● ● ●●● ●●● ●● ●● ● ● ●
●●●● ●●
● ●●●
●● ● ● ●
●
●
● ● ● ● ● ●● ● ●●●
●
●●
●●●●●●● ●●●
●
● ● ●●
●
● ●● ●
●
●
●
●
●● ●● ●
●
● ● ● ●●●
● ●
●● ● ● ●
●● ●
●
● ●●●● ●●●
●
● ●●
● ●●● ●
●
● ● ● ● ●● ●● ●●
●●
●
●●● ●●
● ●
●
●
●
●
●
●
●● ●●●● ●●
● ●●
●●●
●
●● ● ●●
●● ● ●●●
● ●
●
●●●
● ● ● ●● ●●●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●● ● ●● ● ●●●●●● ● ●
● ● ●● ●
● ● ● ●● ●
● ●●
●
● ●
●●●●●● ● ●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●● ●
●
●●●
● ● ●●
●
● ● ●●●
●
● ●● ● ●●●● ●
●
●● ●●●
●● ● ●●
●
●
●
●
●● ● ●
●
●
●
●●● ●
●
●●
●●●● ●●●
●
● ● ●●●●
● ● ●●
● ●●
●●●
●
● ● ●●
●
●
●
●● ●
●●
●● ● ●● ●
●●●
●●
●●●●
●● ●● ●
●●● ● ●
●
●
●●
● ● ● ●
●
● ●●
● ● ● ● ●●
●
●●
●●●●●
●● ● ●●●● ●●
● ● ●●
● ●
●
●
● ●●●
● ●
●●
● ● ●●●
● ● ●● ●
●
●●● ●
● ● ● ● ● ●●
●●● ●● ●
● ●
● ●
●
●●●
●● ●● ●●
● ●●
●
●
●
● ●●●
●● ● ●
●
●
●●
●● ● ● ●●●●●
● ● ● ●●● ● ●
●●
●
●●● ● ● ●●●●● ● ● ● ●● ●
●
●●
● ●
● ● ●
●
●●
●●●● ● ●
●
● ●● ●
● ● ●
●
●● ●
●●● ● ●
● ●●
●●
● ●
●
● ● ●
●
●● ●
●●●● ●●
●
● ●
●●● ●
● ●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●●●●
●
●
● ●
● ● ●● ●
●● ●
●●
●
●●●● ●● ●
● ● ● ●● ●
●● ●● ●●● ●
●●
●
●● ●
●● ●●
●●
●
● ●
● ●●
●●
●●
●●
● ●●●
● ●●
●●●
●
●
●●●●●
●● ●
● ● ● ● ● ●
●
● ●
●●●●
● ●
●●●
● ● ●●
●
● ●
● ● ●● ● ● ●
●
●
● ●●
●
●
● ● ● ●●● ● ● ● ●●● ●●
● ● ● ●
● ● ● ● ●●
● ●● ●● ● ●● ●
●
●
●●●
●
●●
●
● ●●
●
●
●
●
● ●● ●●● ●● ●●
●●
●
● ●●
●● ●●●●
●● ●●
●● ● ●
●
●●
● ●
● ●
●●●●● ●●
● ● ●●●
● ●●● ●●
● ● ● ● ●●
●●● ●
●
● ●
●● ● ●● ●
●
●
● ● ●●●●●●●
●●
●●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●● ●
●●● ● ●
● ●
●
●
●
● ●●
●● ● ●●●
● ● ●
● ● ●● ●●
●●
● ●●
●
●●●
●●● ● ● ●●●
● ●●● ●●
● ●
●●
●● ●● ●
●● ● ● ● ●
● ● ●● ● ●
●
● ● ● ● ● ● ●● ● ●●
● ● ●
● ●
●●●●
●●
●
●● ●
●●
●
●● ● ● ● ●
●
● ● ●● ● ●
●
●
● ●● ●● ● ●
●
●●
●
●
● ● ●● ●
●●
●
●● ●
●
●● ● ● ●
● ●
●
● ●
● ● ●
●● ●
●
●
●●
●
●● ●
●●
●● ● ● ● ●
● ●
●
●
●
●
●
●
●
●● ●
●
●
●● ●
● ●●
●● ● ●
●●
●
● ● ●●
●●
●●
●
● ●● ●
●
●
● ●● ●
●
●
●
●
●
●●
●● ● ●
●●
● ●
●
●
●●
●
● ● ●●
● ●●
● ● ●
● ●
●
●
●●
● ●
●
●
●
●
●
●
●● ●
● ●● ● ● ●
●
● ●●
● ●
●
● ●●
●● ●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●●
●● ●
●
●●
● ●
●
●●
●
●
●
●
●●
●
●
● ●●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●●
●
●
● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
3 bedrooms
●
●
●
1000
●
●
●
2000
●
5 bedrooms
6 bedrooms
7 bedrooms
8 bedrooms
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
2 bedrooms
●
3 bedrooms
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
●
● ●●● ●● ●
● ●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●●
●●
●
●
●
●●
●●
●
●
●
●
●
●
●● ●
● ●
●●●●
●
●
● ●
● ●
●
●●
●
●●
● ●●
●
●●●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
●●
●
●
●
●
●● ● ●
● ● ●
●●
● ●●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
● ●
●
●
●●
●
●
●
● ●
●
●●
●
●
●
●
● ●● ●
●
●
●●
●●
●
●
●
●●
●
●●
● ●●
●
● ●
●
●
● ●
● ●
● ●
●
●
● ●
● ● ●● ●
●
●
●
● ●●
●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
Square Feet
●
4000
5000
3
1000
●
●
4
500
5
0
Moraga
1000
Kensington
●
●
●
0
El Cerrito
●
●
●
3000
2
Walnut Creek
1500
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
1
Piedmont
●
●
factor(br)
0
8 bedrooms
●●
Oakland
500
7 bedrooms
●
Emeryville
1000
6 bedrooms
●
●●
●
● ●
●
●
●
Berkeley
1500
5 bedrooms
●
●
●
●
●
●
●●
●
● ●
●
●
● ●
●
●
●
● ●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
4 bedrooms
●
●
●●
Albany
1 bedrooms
●
●
800
●
600
2 bedrooms
●
400
●
Price per square feet
●
●
ppsf
●
●
●
200
Price per square feet
●
200
●
●
2000
3000
Square Feet
4000
5000
Orinda
Lafayette
Richmond
6
1500
7
1000
500
8
0
1096.633
8103.084
1096.633
8103.084
1096.633
8103.084
1096.633
8103.084
bsqft
9 4/25/15 ScaYerplots of all pairs of variables 1.0e+09
30
50
70
5
10
15
1.0
3.0
200
0.0e+00
Image Map with 3-­‐numeric variables 0
2.5
80 120
1.0e+09
100
ctry
2.0
0.8
0.0e+00
pop
0.6
50
70
0
Kappa
40
imRates
1.5
gini
90
30
0.4
70
1.0
50
lifeExpect
10
15
0.2
0.5
5
health
400 800
0.0
0.5
1.0
1.5
2.0
2.5
3.0
0
matM
0
100
200
0
40
80 120
50
70
90
0
400 800
Lambda
3D ScaYerplot 5
4
3
4
2
3
1
2
1
0
0
Kappa
[0, 0.5)
[0.5, 2)
[2, 5)
5
0.0
0.5
1.0
1.5
2.0
2.5
Upper Quartile Offspring log10
Best Prac>ces 3.0
Lambda
10 4/25/15 Outline •  Vocabulary •  3 Proper>es of good graph construc>on Vocabulary –  Data stand out –  Facilitate comparison –  Informa>on rich •  Percep>on •  Case studies Title: Temperature in August
●
●
●
●
Vertical Axis
Reference line: 75
plotting
symbol
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
1. Make the Data Stand Out ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Legend
●
●
data label ●
Min: Aug 17
●
●
●
●
●
●
●
over 70
below 70
●
●
tick
mark
10
15
20
Axis Label: Day
25
30
tick
mark
label
11 4/25/15 Avoid having other graph elements interfere with data 80
Use visually prominent symbols ●
●
75
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
70
●
Degrees
●
●
●
●
●
●
●
●
●
65
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0
1
3
5
7
9
12
15
18
21
24
27
5
10
15
30
20
25
30
Day
Day
One way to avoid over plomng: JiYer the values Avoid over-­‐plomng 1200 Families
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
75
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
70
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Shrink the plomng symbol so they don’t plot on top of each other 65
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
60
●
65
70
●
60
Dad Height
75
●
●
Add a liYle bit of random noise so all of the values aren’t ploYed on top of each other jitter(dht, 2)
Why are there so few data points? 55
55
60
65
70
See a point cloud -­‐ 60
65
70
jitter(ht, 2)
Mom Height
12 4/25/15 0.20
0.10
Density
0
50
100
150
200
Choosing the Scale of the Axis • 
• 
• 
• 
0.00
0.10
0.00
Density
0.20
Different values of data may obscure each other 0
2
4
6
8
10
Include all or nearly all of the data Fill data region Origin need not be on the scale Choose a scale that improves resolu>on (to be con>nued) Most of the data are in the 0 to 10 range. The few large values obscure the bulk of the data. Consider men>oning these large values in a cap>on, instead of showing them in the plot. Eliminate superfluous material •  Chart junk – stuff that adds no meaning, e.g. buYerflies on top of barplots, background images •  Extra >ck marks and grid lines •  Unnecessary text and arrows •  Decimal places beyond the measurement error or the level of difference 2. Facilitate Comparisons 13 4/25/15 Put Juxtaposed plots on same scale 0.00
15
20
25
20
30
40
50
60
0.00 0.05 0.10 0.15 0.20
0.02
Density
Density
0.04
0.06
0.20
0.15
0.10
0.05
0.00
Density
Groups
Group 2
0.08
Group 1
Make it easy to dis>nguish elements of superposed plots (e.g. color) 20
Choosing the Scale •  Keep scales on x and y axes the same for both plots to facilitate the comparison •  Zoom in to focus on the region that contains the bulk of the data •  These two principles may go counter to one another •  Keep the scale the same throughout the plot (i.e., don’t change it mid-­‐axis) 30
40
50
60
Emphasizes the important difference A B Which of these side-­‐by-­‐side bar plots emphasizes the important difference? 14 4/25/15 Avoid Jiggling the baseline It is difficult to see how a country has changed over >me because the boYom/
base line moves Comparison: volume, area, height We naturally compare the volume of the barrels, but the change is really the height of the barrels Banking: Aspect Ra>o Percep>on Color, shape (including banking) can affect your ability to make good comparisons •  The height/width of the data region was selected to be about 1 so that the trend line is at about 45 degrees. •  The Aspect ra>o affects our visual decoding of the rate of change •  The banking to 45 degrees helps us see rate of change •  The ability to effec>vely judge rate of change allows us to see important paYerns in data 15 4/25/15 Banking at 45 degrees 2
4
6
8
10
200
50
20
10
5
2
2
5
50
10
20
100
50
150
100
200
log-log transformtation
100
200
log-transformation
0
•  Roughly: Examine the absolute value of the orienta>on of segments, they should be centered at 45 degrees. •  Transforma>ons to improve the aspect ra>o uncovers the structure of the rela>onship between variables •  Easier to see important features Bank to 45 degrees 2
4
6
8
10
1
2
5
10
Bar plot vs Pie chart Shapes •  Cleveland’s experiment had a group of subjects judge 40 pairs of values on bar chars and the same 40 pairs on pie charts: What percent the smaller was of the larger? •  Pie chart judgments are less accurate than bar chart judgments •  Bar chart errors are about the same size for all percents. •  Pie chart errors tend to be larger for percents greater than 35% 16 4/25/15 Color Guidelines Color Colorfulness •  Saturated/colorful colors are hard to look at for a long >me. •  They tend to produce an aner-­‐image effect which can be distrac>ng. •  Choosing a set of colors which work well together is a challenging task for anyone who does not have an intui>ve gin for color •  7-­‐10% of males are red-­‐green color blind. Luminance •  If the size of the areas presented in a graph is important, then the areas should be rendered with colors of similar luminance (brightness). •  Lighter colors tend to make areas look larger than darker colors 17 4/25/15 Data Type and Color •  Qualita>ve – Choose a qualita;ve scheme that makes it easy to dis>nguish between categories •  Quan>ta>ve – Choose a color scheme that implies magnitude. –  Does the data progress from low to high? Use a sequen;al scheme where light colors are for low values –  Do both low and high value deserve equal emphasis? Use a diverging scheme where light colors represent middle values Brewer’s Qualita>ve PaleYes Set3
Set2
Set1
Pastel2
Pastel1
Paired
Dark2
Accent
Brewer’s Diverging PaleYes Spectral
RdYlGn
RdYlBu
RdGy
RdBu
PuOr
PRGn
PiYG
BrBG
Brewer’s Sequen>al PaleYes YlOrRd
YlOrBr
YlGnBu
YlGn
Reds
RdPu
Purples
PuRd
PuBuGn
PuBu
OrRd
Oranges
Greys
Greens
GnBu
BuPu
BuGn
Blues
18 4/25/15 How to make a plot informa>on rich 3. Add Informa>on •  Describe what you see in the Cap;on •  Add context with Reference Markers (lines and points) including text •  Add Legends and Labels •  Use color and plomng symbols to add more informa>on •  Plot the same thing more than once in different ways/scales •  Reduce cluYer Cap>ons •  Cap>ons should be comprehensive •  Self-­‐contained •  Cap>ons should: –  Describe what has been graphed –  Draw aYen>on to important features –  Describe conclusions drawn from graph Good Plot Making Prac>ce Put major conclusions in graphical form Provide reference informa>on Proof read for clarity and consistency Graphing is an itera>ve process Mul>plicity is OK, i.e., two plots of the same variable may provide different messages •  Make plots data rich • 
• 
• 
• 
• 
19 4/25/15 The Plomng Process Cases • 
• 
• 
• 
Determine what’s the message Help the data speak Plomng is an itera>ve process – An ar>st makes many sketches before pain>ng the masterpiece How would you improve this plot? Case: Voter Registra>on Trends in California 20 4/25/15 Changes • 
• 
• 
• 
• 
• 
What’s the message? Loca>on of >ck marks under bars Color of bars – indicate party Title Y-­‐axis label confusing X-­‐axis label missing Check data for understanding of how plot is made •  How party registra>on has changed over the past presiden>al elec>ons •  More informa>ve if we have registra>on figures for people not coun>es •  County size may be a lurking variable -­‐ small coun>es tend to be rural and conserva>ve Can we make it more informa>on rich? hYp://www.sos.ca.gov/elec>ons/ror/60day presprim/hist reg stats.pdf 30
Case: CO2 emissions around the world 20
Democrat
Republican
Other
Decline to state
0
10
Percent
40
50
Party Affiliation of Registered Voters in California
1992
1996
Colors from Brewer’s Set1 Qualita>ve paleYe 2000
2004
2008
Year
21 4/25/15 ManyEyes and CO2 How might we improve this plot? Changes •  Superpose rather than stack the curves so the baseline doesn’t jiggle •  Use color on the lines rather than filling polygons What can you see now? Case: CO2 levels at Mauna Loa Time and the horizontal axis 22 4/25/15 Mauna Loa Observatory •  Far from any con>nent, the air sampled is a good average for the central pacific. •  Being high, it is above the inversion layer where local effects are present. •  Measurements of atmospheric CO2 since 1958 – longest con>nuous record Atmospheric Carbon Dioxide •  The increasing amount of CO2 in the atmosphere from the burning of fossil fuels has become a serious environmental concern. •  Upper safety limit for atmospheric CO2 is 350 parts per million •  Does a rise in CO2 lead to a rise in world temperatures? 380
360
co2
320
320
340
co2
360
380
Points are typically not the best way to plot >me series Connect the measurements with line segments 340
Time Series – Pairs: (>me, CO2) 1960
1970
1980
1990
date
2000
2010
1960
1970
1980
1990
2000
2010
date
23 4/25/15 Seasonality vs the long-­‐term Trend Monthly Average CO2
360
320
340
CO2 (ppm)
380
•  The height/width of the data region was selected to be about 1 so that the trend line is at about 45 degrees. •  The banking to 45 degrees let’s us see that the curve is convex •  This means that the rate of increase of CO2 is increasing through >me 1988
1960
1970
1980
1990
Aspect Ra>o 2000
2010
Date
Global Warming Chartology •  1981 US Senate convened scien>st for tes>mony on global warning •  Senator Al Gore said that the Mauna Loa data clearly demonstrated increases in CO2 •  PewiY (witness for the DOE) said that the graph was misleading because it doesn’t include 0 PewiY took issue with the graph, saying “It is a clever piece of chartology” because it can be read the wrong way. He con>nued, “It is intellectually just exactly correct. It displays 315 going to 336, but it appears to be going from 0 to very large amounts.” Steven Schneider (Global Warming) called PewiY’s objec>on “double talk” 24 Including 0 & The Aspect Ra>o 200
co2
100
0
2000
200
date
0
co2
400
1960
What do you think of this plot? When we include 0 and bank at 45 degrees, then the plot must be tall and narrow. With this plot it’s hard to see any other features. There is also a lot of empty space. To fill the space with data, we need to stretch the data region to be wide and short. Now, it’s hard to see the most important feature, the curvature, because the banking is nearly 0. 300
400
4/25/15 1960
1970
1980
1990
2000
date
Eliminate: •  background color, •  Grid lines, •  Polygon filling Take logs 2010
FIND 5 things that you would change •  Examine rela>ve growth so we can add more variables •  Add legend for different variables •  Add reference lines for important dates 25 4/25/15 What alterna>ve plot would beYer enable comparisons? A.  Stacked Barchart B.  Dot Chart C.  Pie Chart D.  Line plot E.  Mosaic plot Data Stand Out Graphics Best Prac>ces -­‐ Recap •  Avoid having other graph elements interfere with data •  Use visually prominent symbols •  Avoid over-­‐plomng •  Choose appropriate axis scales •  Eliminate superfluous material 26 4/25/15 Facilitate Comparisons • 
• 
• 
• 
• 
• 
• 
Juxtapose or Superpose plots (use same scale) Don’t change scale mid-­‐axis Use only one scale on one axis Use color Emphasize the important difference Avoid jiggling the baseline Care with the visual metaphor (linear, area, volume Informa>on Rich •  Describe what you see in the Cap;on •  Add context with Reference Markers (lines and points) including text •  Add Legends and Labels •  Use color and plomng symbols to add more informa>on •  Plot the same thing more than once in different ways/scales •  Reduce cluYer Good Plot Making Prac>ce Put major conclusions in graphical form Provide reference informa>on Proof read for clarity and consistency Graphing is an itera>ve process Mul>plicity is OK, i.e., two plots of the same variable may provide different messages •  Make plots data rich • 
• 
• 
• 
• 
Graphics Models in R 27 4/25/15 R’s graphics model •  There are two models in R – •  The painter’s model, which is the base graphics •  The object-­‐based model, which follows Wilkinson’s Grammar of Graphics R’s base graphics model •  High-­‐level func>on plot() –  Starts a new plot on a new page –  creates a complete plot •  Low-­‐level func>ons add more output to the current plot, if necessary –  lines and line segments –  points, text –  legend Review Plomng Func>ons • 
• 
• 
• 
• 
• 
hist()
boxplot()
dotchart()
plot() barchart()
mosaicplot()
•  abline() •  curve() •  points()
•  lines()
•  polygon() •  text() add text Review Plot Arguments ?plot.default
"p" for points, "l" for lines, "n" for nothing •  ylim = c(0, 1) the range for the scale of the axis •  type = "l"
•  xlab = "x axis
label"
•  main = "plot title"
•  col = vector of colors •  log = "y" use log scale on y axis, can be "x" or "xy" •  lwd = 2 thickness of line •  pch = 19 plomng character – check other numbers •  cex = 0.5 character magnifica>on •  lty = 2 type of line – check other numbers •  las = 1 0,1,2, or 3 style of >ck mark labels 28 4/25/15 plot(x = fbDF$gini, y = fbDF$im,
type = "n”, log = "y",
xlab = "Gini Index",
ylab = "Infant Mortality Rate (log scale)",
main = "World Trends" )
100
World Trends
●
50
●
●
●
●
●
●
20
●
●
●
●
●
●
●●
●
●
5
symbols(x = fbDF$gini, y = fbDF$imRates,
bg = myColors[fbDF$cutLE], add = TRUE,
circles = pmax(sqrt(fbDF$pop)/8000, 0.2),
inches = FALSE)
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
10
Infant Mortality Rate (log scale)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●●
●
●
Life Expectancy
●
●
●
< 55
55 − 65
65 − 70
70−75
75 − 80
>85
2
●
30
40
50
60
70
Gini Index
legend("bottomright", title = "Life Expectancy",
legend = lifeLabels),
bty = "n", fill = myColors)
ggplot2 model •  Func>ons produce plot components (aka layers) •  Combine components in different ways with the + operator to produce a variety of plots ggplot2 model -­‐ steps •  Define the data you want to plot and create an empty plot object with ggplot()
•  Specify what graphcis shapes, aka, geoms, to view the data, e.g., plomng symbols, lines. Add these to the plot with geom_point or geom_line
•  Specify features/aestheitcs used to represent the data, e.g., x, y loca>ons, color, symbol size 29 4/25/15 p = ggplot(fbDF)
p +
geom_point(aes(x = gini, y = imRates,
color = cutLE, size = popS)) +
With geographic informa>on we can make maps scale_y_continuous(trans = "log",
name = "Infant Mortality Rate (log scale)") +
●
●
●
●
●
● ●
●
●
scale_color_manual(values = myColors,
name = "Life Expectancy") +
Infant Mortality
●
●
●
(0,10]
(10,25]
(25,50]
(50,75]
(75,150]
● ●●
● ●
●● ● ●
●● ●● ● ● ●●
●
●
● ●● ●
● ● ●
●●
●
●
●
●
●
●
● ● ●● ●●
●●
●
●
●
●
● ●
●
●●
● ●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
● ●
● ● ● ●●
●
●
●
●
●
●
●
●
●
●
●●
●●
● ●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●
● ●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●●
scale_size(name = "Population") +
ggtitle("World Trends")
●
●
●
●●
●
●
scale_x_continuous(name = "Gini Index") +
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
We can use apps like Google Earth to display our data Informa>on Visualiza>on vs Sta>s>cal Graphics 30 4/25/15 Info-­‐Viz vs Sta>s>cal Graphics Billion dollar o-­‐gram •  Informa>on visualiza>ons tend to be grabby and visually striking –  Drama>ze the problem with a unique and visually-­‐
appealing image that draws casual viewer in deeper •  Sta>s>cal graphics try to reveal paYerns and discrepancies and make comparisons –  Careful to make judicious use of all aspects, onen for viewers already interested in the problem From Gelman and Unwin’s ar>cle (2013) Word Clouds vs dotcharts elasticsearchpandas
json
cloudera
map
oracle
postgres
ruby
microsoft weka dashboard
solr nosql
java
spss hbase
mcmc
lisp
cloud scikit−learn
visualization
sql excel
splus
tableau scrape
hive
c++
hadoop r sas
rest
shell
aws
d3
git
python
javascript
cluster
matlab
perl
algorithm
github
mpi
mapreduce mysql bayes
mongodb
mahout
cassandra
windows
s−plus
pig parallel
reduce
stata zookeeper
linux
amazon
soap
hdfs
c
xml
unix scipy
scala
numpy
gpu
haskell
r
python
sql
hadoop
java
sas
matlab
visualization
excel
hive
c
c++
spss
perl
amazon
linux
nosql
mapreduce
cloud
pig
tableau
microsoft
oracle
mysql
stata
ruby
spark
javascript
unix
scala
hbase
reduce
algorithm
cassandra
mongodb
mahout
map
cluster
github
pandas
numpy
d3
postgres
scipy
scikit−learn
aws
parallel
weka
storm
solr
shell
rest
git
xml
hdfs
s−plus
mcmc
Rectangle Map vs Dot chart ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0
100
200
300
400
500
31 4/25/15 Windrose vs density plots and bar plots General Graphics References •  Cleveland,The elements of graphing data •  Tune, The visual display of quan;ta;ve informa;on •  Wainer, Graphic discovery: A trout in the milk and other visual adventures •  Wilkinson, Grammar of Graphics •  Yau, Visualize This: The flowing data guide to design, visualiza;on, and sta;s;cs Parallel-­‐
coordinates plot vs scaYerplot R Graphics References •  Murrell, R Graphics, 2nd edi;on •  Sarkar, Mul;variate data visualiza;on with R •  Wickham, ggplot2 – Elegant graphics for data analysis 32