4/25/15

Graphics I

Outline
• Review plot types
  – Univariate
  – Bivariate
  – Multivariate
• Design
  – Best practice
  – Pitfalls
  – Critique
  – Perception & color
• Graphics Models in R
  – Base
  – Grammar of Graphics
• Information visualization vs Statistical graphics

Types of Plots

Know your data types
The appropriate graphical techniques depend on the kind of data that you are working with
• Quantitative
  – continuous – e.g., height, weight
  – discrete – numeric data with few values, e.g., number of children in family
• Qualitative
  – ordered – categories with an order but no meaningful distance between them, e.g., number of stars for a movie rating
  – nominal – categories have no meaningful order, e.g., gender

One Variable

Case: Infant Health
load(url("http://www.stat.berkeley.edu/users/nolan/data/KaiserBabies.rda"))

Kaiser Study
• Oakland Kaiser mothers
• 1960s
• Measure the babies' weight (in ounces) at birth
• All babies:
  – Male
  – Single births (no twins, etc.)
  – Survived 28 days

Information collected on mothers and their babies
• Birth weight (ounces)
• Gestation (weeks)
• Parity - total number of previous pregnancies
• Mother's height and weight
• Mother's smoking status
• Mother's age, race, education level, income
• And more…

Rug plot & Strip chart
Histogram
[Figure: Birth Weight, 60 to 180 oz, with a Density axis]
Each baby's weight is represented as a tick mark. The thicker lines are from multiple babies with similar weights. I added a little random noise to the weights to keep them from falling on top of each other.
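The jittering described above can be sketched in R as follows. This is a minimal illustration rather than the code behind the slide: the weights are simulated stand-ins, and the vector name `bwt` and its distribution are assumptions made only so the example runs without the data file.

```r
# Simulated stand-in for birth weights recorded in whole ounces
# (an assumption; the real values are in the KaiserBabies data).
set.seed(1)
bwt <- round(rnorm(500, mean = 120, sd = 18))

# Add a little uniform noise (at most half an ounce either way)
# so babies with the same recorded weight do not overplot.
bwt_jit <- jitter(bwt, amount = 0.5)

# Histogram on the density scale with a rug of jittered tick marks
hist(bwt, freq = FALSE, xlab = "Birth Weight (oz)", main = "")
rug(bwt_jit)

# Strip chart: the same idea, with the noise applied vertically
stripchart(bwt, method = "jitter", pch = 19, cex = 0.3,
           xlab = "Birth Weight (oz)")
```

Because the noise is smaller than the recording precision, the jittered rug still reflects where the weights fall while making ties visible as clusters of tick marks.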
[Figure: histogram of Birth Weight (oz), 60 to 180, y-axis Density (proportion per oz); title "Male babies born at Oakland Kaiser in the 1960s"]

Density plot
[Figure: density curve of Birth Weight (oz), y-axis Density (proportion per oz); title "Male babies born at Oakland Kaiser in the 1960s"]

Boxplot
[Figure: boxplot of Birth Weight (oz)]

Violin Plot
[Figure: violin plot of Birth Weight (oz), 60 to 180]

Parity: Number of siblings
• This quantitative variable is different from birth weight – there are only a few possible values, i.e., it's not possible to have 2.3 siblings, and it's highly unlikely to have 17

    > table(infants$parity)

      0   1   2   3   4   5   6   7   8   9  10  11  13
    315 310 238 168  83  52  32  16   8   7   4   2   1

[Figure: bar plot of counts (0 to 300) by Number of siblings]

Alternative – bar width has no meaning
[Figure: Proportion (0.00 to 0.20) against Number of siblings]

Dot chart
[Figure: dot chart of Parity (0 to 13) against count (0 to 300)]

Qualitative Variables

Smoking Status - Categorical
Bar chart
[Figure: bar chart of counts (0 to 500) by smoking status: Never, Now, Until, Once, Unknown]
Pie chart
[Figure: pie chart of smoking status: Never, Now, Until, Once, Unknown]
ANGLES can be hard to compare

Two Variables

Two Qualitative – Mosaic plot
[Figure: mosaic plot of smoking status (Now, Until, Once, Never) by education (No HS, Some HS, High School, Trade, Some College, College)]

Barplots: side-by-side & stacked
[Figure: side-by-side and stacked barplots of Count by smoking status (Now, Until, Once, Never)]

Line plots
[Figure: line plots across education levels (No HS, Some HS, HS, Trade, Some College, College)]

San Francisco Chronicle listings
Data
• Record: house sold in a particular time period
• Over 200,000 houses
• Subset to a dozen cities in the East Bay – about 25,000 houses
Variables:
• City
• County
• Price
• # bedrooms
• Lot square footage
• and 10 more
load(url("http://www.stat.berkeley.edu/users/nolan/data/Projects/SFHousing.rda"))

Side-by-Side Boxplots
Relationship between city and sale price
Data types:
City - factor
Sale price - numeric

Side-by-side Violin plots
Cities ordered by median price
[Figure: side-by-side violin plots, y-axis ppsf]
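Ordering the cities by median price, as the side-by-side plots do, might be done along these lines. This is a sketch under stated assumptions: the data frame `housing` is simulated, and only the column names `city` and `ppsf` are taken from the axis labels on the slides.

```r
# Hypothetical stand-in for the housing data; values are simulated.
set.seed(2)
housing <- data.frame(
  city = rep(c("Richmond", "Oakland", "Berkeley", "Piedmont"), each = 200),
  ppsf = c(rlnorm(200, log(250), 0.3), rlnorm(200, log(350), 0.3),
           rlnorm(200, log(450), 0.3), rlnorm(200, log(550), 0.3))
)

# Reorder the factor levels by each city's median price per square foot,
# so the boxes appear from lowest to highest median instead of alphabetically.
housing$city <- reorder(factor(housing$city), housing$ppsf, FUN = median)

boxplot(ppsf ~ city, data = housing, las = 1,
        xlab = "City", ylab = "Price per square foot")
```

Sorting a nominal variable by a summary of the quantitative variable makes the comparison across groups much easier to read than the default alphabetical order.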
What's wrong with this plot?

Relationship between price per square foot and total square foot
Both are quantitative
[Figure: Price/ft^2 (0 to 1200) against Area (ft^2) (1000 to 6000)]
Smooth scatter plot

Summary of graph relationships between two variables
• Two Qualitative variables
  – mosaicplot, side-by-side barplots, line plots
• One Quantitative and one Qualitative
  – Boxplots, dotcharts, multiple density plots, violin plots
• Two Quantitative variables
  – Scatter plot, line plot

Relationships between more than 2 variables
• Qualitative information can be conveyed in plots through color, plotting symbol, juxtaposed panels
• The following plot uses information from 4 variables: city, number of bedrooms, lot size (sq ft), and price per square ft

Bsqft, PPSF, Bedrooms, City
[Figure: Price per square feet (200 to 800) against Square Feet (1000 to 5000); panels for Berkeley and Piedmont; plotting symbols for 1 to 8 bedrooms]
[Figure: ppsf (0 to 1500) against bsqft, one panel per city (Albany, Berkeley, Emeryville, Oakland, Piedmont, Walnut Creek, El Cerrito, Kensington, Moraga, Orinda, Lafayette, Richmond), colored by factor(br) from 1 to 8]

Scatterplots of all pairs of variables
[Figure: scatterplot matrix of ctry, pop, imRates, gini, lifeExpect, health, matM]

Image Map with 3 numeric variables
3D Scatterplot
[Figures: axes Kappa and Lambda; label "Upper Quartile Offspring log10"; legend bins [0, 0.5), [0.5, 2), [2, 5)]

Best Practices

Outline
• Vocabulary
• 3 Properties of good graph construction
  – Data stand out
  – Facilitate comparison
  – Information rich
• Perception
• Case studies

Vocabulary
[Figure: annotated plot titled "Temperature in August" labeling the Title, Vertical Axis, Axis Label (Day), tick mark, tick mark label, plotting symbol, Reference line (75), data label (Min: Aug 17), and Legend (over 70, below 70)]

1. Make the Data Stand Out

Avoid having other graph elements interfere with data
Use visually prominent symbols
[Figure: Degrees (65 to 80) against Day (1 to 30)]

Avoid over-plotting
1200 Families
[Figure: Dad Height against Mom Height]
Why are there so few data points?

One way to avoid over plotting: Jitter the values
Add a little bit of random noise so all of the values aren't plotted on top of each other
Shrink the plotting symbols so they don't plot on top of each other
See a point cloud
[Figure: jitter(dht, 2) against jitter(ht, 2)]

Different values of data may obscure each other
[Figure: two Density plots, one spanning 0 to 200 and one restricted to 0 to 10]
Most of the data are in the 0 to 10 range. The few large values obscure the bulk of the data. Consider mentioning these large values in a caption, instead of showing them in the plot.

Choosing the Scale of the Axis
• Include all or nearly all of the data
• Fill data region
• Origin need not be on the scale
• Choose a scale that improves resolution (to be continued)

Eliminate superfluous material
• Chart junk – stuff that adds no meaning, e.g., butterflies on top of barplots, background images
• Extra tick marks and grid lines
• Unnecessary text and arrows
• Decimal places beyond the measurement error or the level of difference

2. Facilitate Comparisons

Put Juxtaposed plots on same scale
[Figure: Density plots for Group 1 and Group 2]
Make it easy to distinguish elements of superposed plots (e.g. color)

Choosing the Scale
• Keep scales on x and y axes the same for both plots to facilitate the comparison
• Zoom in to focus on the region that contains the bulk of the data
• These two principles may go counter to one another
• Keep the scale the same throughout the plot (i.e., don't change it mid-axis)

Emphasizes the important difference
[Figure: two side-by-side bar plots, A and B]
Which of these side-by-side bar plots emphasizes the important difference?

Avoid Jiggling the baseline
It is difficult to see how a country has changed over time because the bottom/base line moves

Comparison: volume, area, height
We naturally compare the volume of the barrels, but the change is really the height of the barrels

Perception
Color, shape (including banking) can affect your ability to make good comparisons

Banking: Aspect Ratio
• The height/width of the data region was selected to be about 1 so that the trend line is at about 45 degrees.
• The Aspect ratio affects our visual decoding of the rate of change
• The banking to 45 degrees helps us see rate of change
• The ability to effectively judge rate of change allows us to see important patterns in data

Banking at 45 degrees
[Figure: panels titled "log-transformation" and "log-log transformation"]
• Roughly: Examine the absolute value of the orientation of segments; they should be centered at 45 degrees.
• Transformations to improve the aspect ratio uncover the structure of the relationship between variables
• Easier to see important features

Bank to 45 degrees

Shapes

Bar plot vs Pie chart
• Cleveland's experiment had a group of subjects judge 40 pairs of values on bar charts and the same 40 pairs on pie charts: What percent was the smaller of the larger?
• Pie chart judgments are less accurate than bar chart judgments
• Bar chart errors are about the same size for all percents.
• Pie chart errors tend to be larger for percents greater than 35%

Color

Color Guidelines
Colorfulness
• Saturated/colorful colors are hard to look at for a long time.
• They tend to produce an after-image effect which can be distracting.
• Choosing a set of colors which work well together is a challenging task for anyone who does not have an intuitive gift for color
• 7-10% of males are red-green color blind.
Luminance
• If the size of the areas presented in a graph is important, then the areas should be rendered with colors of similar luminance (brightness).
• Lighter colors tend to make areas look larger than darker colors

Data Type and Color
• Qualitative – Choose a qualitative scheme that makes it easy to distinguish between categories
• Quantitative – Choose a color scheme that implies magnitude.
  – Does the data progress from low to high? Use a sequential scheme where light colors are for low values
  – Do both low and high values deserve equal emphasis? Use a diverging scheme where light colors represent middle values

Brewer's Qualitative Palettes
Set3, Set2, Set1, Pastel2, Pastel1, Paired, Dark2, Accent

Brewer's Diverging Palettes
Spectral, RdYlGn, RdYlBu, RdGy, RdBu, PuOr, PRGn, PiYG, BrBG

Brewer's Sequential Palettes
YlOrRd, YlOrBr, YlGnBu, YlGn, Reds, RdPu, Purples, PuRd, PuBuGn, PuBu, OrRd, Oranges, Greys, Greens, GnBu, BuPu, BuGn, Blues

3. Add Information

How to make a plot information rich
• Describe what you see in the Caption
• Add context with Reference Markers (lines and points) including text
• Add Legends and Labels
• Use color and plotting symbols to add more information
• Plot the same thing more than once in different ways/scales
• Reduce clutter

Captions
• Captions should be comprehensive
• Self-contained
• Captions should:
  – Describe what has been graphed
  – Draw attention to important features
  – Describe conclusions drawn from graph

Good Plot Making Practice
• Put major conclusions in graphical form
• Provide reference information
• Proofread for clarity and consistency
• Graphing is an iterative process
• Multiplicity is OK, i.e., two plots of the same variable may provide different messages
• Make plots data rich

Cases

The Plotting Process
• Determine what's the message
• Help the data speak
• Plotting is an iterative process
  – An artist makes many sketches before painting the masterpiece

Case: Voter Registration Trends in California
How would you improve this plot?

Changes
• What's the message?
• Location of tick marks under bars
• Color of bars – indicate party
• Title
• Y-axis label confusing
• X-axis label missing

Check data for understanding of how plot is made
• How party registration has changed over the past presidential elections
• More informative if we have registration figures for people not counties
• County size may be a lurking variable - small counties tend to be rural and conservative

Can we make it more information rich?
http://www.sos.ca.gov/elections/ror/60day presprim/hist reg stats.pdf
[Figure: Party Affiliation of Registered Voters in California; Percent (0 to 50) by Year (1992, 1996, 2000, 2004, 2008); legend: Democrat, Republican, Other, Decline to state]
Colors from Brewer's Set1 Qualitative palette

Case: CO2 emissions around the world

ManyEyes and CO2
How might we improve this plot?
Changes
• Superpose rather than stack the curves so the baseline doesn't jiggle
• Use color on the lines rather than filling polygons
What can you see now?

Case: CO2 levels at Mauna Loa
Time and the horizontal axis

Mauna Loa Observatory
• Far from any continent, the air sampled is a good average for the central Pacific.
• Being high, it is above the inversion layer where local effects are present.
• Measurements of atmospheric CO2 since 1958 – longest continuous record

Atmospheric Carbon Dioxide
• The increasing amount of CO2 in the atmosphere from the burning of fossil fuels has become a serious environmental concern.
• Upper safety limit for atmospheric CO2 is 350 parts per million
• Does a rise in CO2 lead to a rise in world temperatures?

Time Series – Pairs: (time, CO2)
Points are typically not the best way to plot time series
Connect the measurements with line segments
[Figure: co2 (320 to 380) against date (1960 to 2010), drawn once with points and once with connected line segments]

Seasonality vs the long-term Trend
[Figure: Monthly Average CO2; CO2 (ppm) (320 to 380) against Date (1960 to 2010)]

Aspect Ratio
• The height/width of the data region was selected to be about 1 so that the trend line is at about 45 degrees.
• The banking to 45 degrees lets us see that the curve is convex
• This means that the rate of increase of CO2 is increasing through time

Global Warming Chartology
• 1981 US Senate convened scientists for testimony on global warming
• Senator Al Gore said that the Mauna Loa data clearly demonstrated increases in CO2
• Pewitt (witness for the DOE) said that the graph was misleading because it doesn't include 0

Pewitt took issue with the graph, saying “It is a clever piece of chartology” because it can be read the wrong way. He continued, “It is intellectually just exactly correct.
It displays 315 going to 336, but it appears to be going from 0 to very large amounts.” Stephen Schneider (Global Warming) called Pewitt's objection “double talk”

Including 0 & The Aspect Ratio
[Figure: co2 (0 to 400) against date, drawn once tall and narrow and once wide and short]
What do you think of this plot?
When we include 0 and bank at 45 degrees, then the plot must be tall and narrow. With this plot it's hard to see any other features. There is also a lot of empty space.
To fill the space with data, we need to stretch the data region to be wide and short. Now, it's hard to see the most important feature, the curvature, because the banking is nearly 0.

FIND 5 things that you would change
Eliminate:
• background color,
• Grid lines,
• Polygon filling
Take logs
• Examine relative growth so we can add more variables
• Add legend for different variables
• Add reference lines for important dates

What alternative plot would better enable comparisons?
A. Stacked Barchart
B. Dot Chart
C. Pie Chart
D. Line plot
E. Mosaic plot

Graphics Best Practices - Recap

Data Stand Out
• Avoid having other graph elements interfere with data
• Use visually prominent symbols
• Avoid over-plotting
• Choose appropriate axis scales
• Eliminate superfluous material

Facilitate Comparisons
• Juxtapose or Superpose plots (use same scale)
• Don't change scale mid-axis
• Use only one scale on one axis
• Use color
• Emphasize the important difference
• Avoid jiggling the baseline
• Care with the visual metaphor (linear, area, volume)

Information Rich
• Describe what you see in the Caption
• Add context with Reference Markers (lines and points) including text
• Add Legends and Labels
• Use color and plotting symbols to add more information
• Plot the same thing more than once in different ways/scales
• Reduce clutter

Good Plot Making Practice
• Put major conclusions in graphical form
• Provide reference information
• Proofread for clarity and consistency
• Graphing is an iterative process
• Multiplicity is OK, i.e., two plots of the same variable may provide different messages
• Make plots data rich

Graphics Models in R

R's graphics model
• There are two models in R
  – The painter's model, which is the base graphics
  – The object-based model, which follows Wilkinson's Grammar of Graphics

R's base graphics model
• High-level function plot()
  – Starts a new plot on a new page
  – creates a complete plot
• Low-level functions add more output to the current plot, if necessary
  – lines and line segments
  – points, text
  – legend

Review Plotting Functions
• hist()
• boxplot()
• dotchart()
• plot()
• barplot()
• mosaicplot()

• abline()
• curve()
• points()
• lines()
• polygon()
• text() add text

Review Plot Arguments
?plot.default
• ylim = c(0, 1) the range for the scale of the axis
• type = "l" "p" for points, "l" for lines, "n" for nothing
• xlab = "x axis label"
• main = "plot title"
• col = vector of colors
• log = "y" use log scale on y axis, can be "x" or "xy"
• lwd = 2 thickness of line
• pch = 19 plotting character – check other numbers
• cex = 0.5 character magnification
• lty = 2 type of line – check other numbers
• las = 1 0, 1, 2, or 3 style of tick mark labels

    # Set up axes, labels, and the log scale; type = "n" draws no points yet
    plot(x = fbDF$gini, y = fbDF$imRates, type = "n", log = "y",
         xlab = "Gini Index", ylab = "Infant Mortality Rate (log scale)",
         main = "World Trends")

    # Add circles scaled by population and filled by life-expectancy group
    symbols(x = fbDF$gini, y = fbDF$imRates,
            bg = myColors[fbDF$cutLE], add = TRUE,
            circles = pmax(sqrt(fbDF$pop)/8000, 0.2), inches = FALSE)

    # Legend without a box, using the same fill colors
    legend("bottomright", title = "Life Expectancy",
           legend = lifeLabels, bty = "n", fill = myColors)

[Figure: World Trends; Infant Mortality Rate (log scale) against Gini Index (30 to 70); legend Life Expectancy: < 55, 55 - 65, 65 - 70, 70 - 75, 75 - 80, > 85]

ggplot2 model
• Functions produce plot components (aka layers)
• Combine components in different ways with the + operator to produce a variety of plots

ggplot2 model - steps
• Define the data you want to plot and create an empty plot object with ggplot()
• Specify what graphics shapes, aka geoms, to view the data, e.g., plotting symbols, lines.
  Add these to the plot with geom_point or geom_line
• Specify features/aesthetics used to represent the data, e.g., x, y locations, color, symbol size

    library(ggplot2)

    # Empty plot object, then one layer of points plus the scales and title
    p = ggplot(fbDF)
    p + geom_point(aes(x = gini, y = imRates, color = cutLE, size = popS)) +
      scale_x_continuous(name = "Gini Index") +
      scale_y_continuous(trans = "log", name = "Infant Mortality Rate (log scale)") +
      scale_color_manual(values = myColors, name = "Life Expectancy") +
      scale_size(name = "Population") +
      ggtitle("World Trends")

With geographic information we can make maps
[Figure: map with legend Infant Mortality: (0,10], (10,25], (25,50], (50,75], (75,150]]

We can use apps like Google Earth to display our data

Information Visualization vs Statistical Graphics

Info-Viz vs Statistical Graphics
• Information visualizations tend to be grabby and visually striking
  – Dramatize the problem with a unique and visually appealing image that draws the casual viewer in deeper
• Statistical graphics try to reveal patterns and discrepancies and make comparisons
  – Careful to make judicious use of all aspects, often for viewers already interested in the problem
From Gelman and Unwin's article (2013)

Billion dollar o-gram

Word Clouds vs dotcharts
[Figure: a word cloud and a dot chart of the same terms; the dot chart labels begin r, python, sql, hadoop, java, sas, matlab, visualization, excel, hive]

Rectangle Map vs Dot chart
[Figure: dot chart with an axis from 0 to 500]

Windrose vs density plots and bar plots

Parallel-coordinates plot vs scatterplot

General Graphics References
• Cleveland, The elements of graphing data
• Tufte, The visual display of quantitative information
• Wainer, Graphic discovery: A trout in the milk and other visual adventures
• Wilkinson, Grammar of Graphics
• Yau, Visualize This: The flowing data guide to design, visualization, and statistics

R Graphics References
• Murrell, R Graphics, 2nd edition
• Sarkar, Multivariate data visualization with R
• Wickham, ggplot2 – Elegant graphics for data analysis