Hands-On Data Science with R
Exploring Data with GGPlot2
[email protected]
22nd May 2015

Visit http://HandsOnDataScience.com/ for more Chapters.

The ggplot2 (Wickham and Chang, 2015) package implements a grammar of graphics. A plot is built starting with the dataset and aesthetics (e.g., x-axis and y-axis) and adding geometric elements, statistical operations, scales, facets, coordinates, and options.

The required packages for this chapter are:

    library(ggplot2)      # Visualise data.
    library(scales)       # Include commas in numbers.
    library(RColorBrewer) # Choose different color.
    library(rattle)       # Weather dataset.
    library(randomForest) # Use na.roughfix() to deal with missing data.
    library(gridExtra)    # Layout multiple plots.
    library(wq)           # Regular grid layout.
    library(xkcd)         # Some xkcd fun.
    library(extrafont)    # Fonts for xkcd.
    library(sysfonts)     # Font support for xkcd.
    library(GGally)       # Parallel coordinates.
    library(dplyr)        # Data wrangling.
    library(lubridate)    # Dates and time.

As we work through this chapter, new R commands will be introduced. Be sure to review the command's documentation and understand what the command does. You can ask for help using the ? command as in:

    ?read.csv

We can obtain documentation on a particular package using the help= option of library():

    library(help=rattle)

This chapter is intended to be hands on. To learn effectively, you are encouraged to have R running (e.g., RStudio) and to run all the commands as they appear here. Check that you get the same output, and that you understand the output. Try some variations. Explore.

Copyright © 2013-2014 Graham Williams. You can freely copy, distribute, or adapt this material, as long as the attribution is retained and derivative work is provided under the same license.

1 Preparing the Dataset

We use the relatively large weatherAUS dataset from rattle (Williams, 2015) to illustrate the capabilities of ggplot2.
For plots which generate large images we might use random subsets of the same dataset.

    library(rattle)
    dsname <- "weatherAUS"
    ds     <- get(dsname)

The dataset is summarised below.

    dim(ds)
    ## [1] 105379     24

    names(ds)
    ##  [1] "Date"          "Location"      "MinTemp"       "MaxTemp"
    ##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "WindGustDir"
    ##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "WindSpeed9am"
    ## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "Pressure9am"
    ## [17] "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"
    ## [21] "Temp3pm"       "RainToday"     "RISK_MM"       "RainTomorrow"
    ....

    head(ds)
    ##         Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
    ## 1 2008-12-01   Albury    13.4    22.9      0.6          NA       NA
    ## 2 2008-12-02   Albury     7.4    25.1      0.0          NA       NA
    ## 3 2008-12-03   Albury    12.9    25.7      0.0          NA       NA
    ## 4 2008-12-04   Albury     9.2    28.0      0.0          NA       NA
    ## 5 2008-12-05   Albury    17.5    32.3      1.0          NA       NA
    ....

    tail(ds)
    ##              Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
    ## 105374 2015-03-25    Uluru    20.4    25.5        0          NA       NA
    ## 105375 2015-03-26    Uluru    18.7    29.5        0          NA       NA
    ## 105376 2015-03-27    Uluru    17.9    31.3        0          NA       NA
    ## 105377 2015-03-28    Uluru    17.5    30.8        0          NA       NA
    ## 105378 2015-03-29    Uluru    21.1    27.7        0          NA       NA
    ....

    str(ds)
    ## 'data.frame': 105379 obs. of 24 variables:
    ##  $ Date     : Date, format: "2008-12-01" "2008-12-02" ...
    ##  $ Location : Factor w/ 49 levels "Adelaide","Albany",..: 3 3 3 3 3 3 ...
    ##  $ MinTemp  : num 13.4 7.4 12.9 9.2 17.5 14.6 14.3 7.7 9.7 13.1 ...
    ##  $ MaxTemp  : num 22.9 25.1 25.7 28 32.3 29.7 25 26.7 31.9 30.1 ...
    ##  $ Rainfall : num 0.6 0 0 0 1 0.2 0 0 0 1.4 ...
    ....

    summary(ds)
    ##       Date                Location        MinTemp         MaxTemp
    ##  Min.   :2007-11-01   Canberra: 2618   Min.   :-8.50   Min.   :-4.1
    ##  1st Qu.:2010-06-07   Sydney  : 2526   1st Qu.: 7.70   1st Qu.:18.0
    ##  Median :2012-01-31   Adelaide: 2375   Median :12.00   Median :22.6
    ##  Mean   :2012-01-29   Brisbane: 2375   Mean   :12.19   Mean   :23.2
    ##  3rd Qu.:2013-10-09   Darwin  : 2375   3rd Qu.:16.90   3rd Qu.:28.2
    ....

2 Collecting Information

    names(ds) <- normVarNames(names(ds)) # Optional lower case variable names.
    vars      <- names(ds)
    target    <- "rain_tomorrow"
    id        <- c("date", "location")
    ignore    <- id
    inputs    <- setdiff(vars, target)
    numi      <- which(sapply(ds[vars], is.numeric))
    numi
    ##        min_temp        max_temp        rainfall     evaporation
    ##               3               4               5               6
    ##        sunshine wind_gust_speed  wind_speed_9am  wind_speed_3pm
    ##               7               9              12              13
    ##    humidity_9am    humidity_3pm    pressure_9am    pressure_3pm
    ##              14              15              16              17
    ....

    numerics  <- names(numi)
    numerics
    ##  [1] "min_temp"        "max_temp"        "rainfall"
    ##  [4] "evaporation"     "sunshine"        "wind_gust_speed"
    ##  [7] "wind_speed_9am"  "wind_speed_3pm"  "humidity_9am"
    ## [10] "humidity_3pm"    "pressure_9am"    "pressure_3pm"
    ## [13] "cloud_9am"       "cloud_3pm"       "temp_9am"
    ## [16] "temp_3pm"        "risk_mm"
    ....

    cati      <- which(sapply(ds[vars], is.factor))
    cati
    ##      location wind_gust_dir  wind_dir_9am  wind_dir_3pm    rain_today
    ##             2             8            10            11            22
    ## rain_tomorrow
    ##            24

    categorics <- names(cati)
    categorics
    ## [1] "location"      "wind_gust_dir" "wind_dir_9am"  "wind_dir_3pm"
    ## [5] "rain_today"    "rain_tomorrow"

We perform missing value imputation simply to avoid warnings from ggplot2, ignoring whether this is appropriate to do so from a data integrity point of view.

    library(randomForest)
    sum(is.na(ds))
    ## [1] 226034

    ds[setdiff(vars, ignore)] <- na.roughfix(ds[setdiff(vars, ignore)])
    sum(is.na(ds))
    ## [1] 0

3 Scatter Plot

[Figure: scatter plot of max_temp against min_temp, coloured by rain_tomorrow (No, Yes).]

A scatter plot displays points scattered over a plot.
The points are specified as x and y for a two dimensional plot, as specified by the aesthetics. We use geom_point() to plot the points.

We first identify a random subset of just 1,000 observations to plot, rather than filling the plot completely with points, as might happen when we plot too many points on a scatter plot. We then subset the dataset simply by providing the random vector of indices to the subset operator of the dataset. This subset is piped (using %>%) to ggplot(), identifying the x-axis, the y-axis, and how the points should be coloured as the aesthetics of the plot using aes(), to which a geom_point() layer is then added.

    sobs <- sample(nrow(ds), 1000)
    ds[sobs,] %>%
      ggplot(aes(x=min_temp, y=max_temp, colour=rain_tomorrow)) +
      geom_point()

3.1 Adding a Smooth Fitted Curve

[Figure: scatter plot of max_temp against date (2008 to 2014) with a smooth fitted curve.]

Here we have added a smooth fitted curve using geom_smooth(). Since the dataset has many points the recommended smoothing method is method="gam", which will be chosen automatically if not specified, with a message displayed to inform us of this. The formula specified, formula=y~s(x, bs="cs"), is also the default for method="gam".

Typical of scatter plots of big data, there is a mass of overlaid points. We can see patterns though, noting the obvious pattern of seasonality. Notice also the three or four white vertical bands. These indicate some systemic issue with the data, in that for specific days it would seem no data was collected. This is something to explore and understand further, but it is outside our scope for now.
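The white vertical bands suggest periods with no observations. One way to begin exploring such gaps, sketched here on a small synthetic data frame rather than on weatherAUS, is to list the calendar months that have no rows at all. The data frame `toy` and its values are assumptions made purely for illustration.

```r
# Synthetic daily observations with March 2011 deliberately removed.
dates <- seq(as.Date("2011-01-01"), as.Date("2011-06-30"), by="day")
toy   <- data.frame(date=dates,
                    max_temp=20 + 10 * sin(seq_along(dates) / 58))
toy   <- toy[format(toy$date, "%Y-%m") != "2011-03", ]

# Every month we expect between the first and last observation.
first    <- as.Date(format(min(toy$date), "%Y-%m-01"))
expected <- format(seq(first, max(toy$date), by="month"), "%Y-%m")

# The months actually present in the data.
observed <- unique(format(toy$date, "%Y-%m"))

# Months with no observations at all.
missing_months <- setdiff(expected, observed)
print(missing_months)
```

The same idea could be applied per location, or per day rather than per month, depending on how fine the gaps appear to be.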
    ds[sobs,] %>%
      ggplot(aes(x=date, y=max_temp)) +
      geom_point() +
      geom_smooth(method="gam", formula=y~s(x, bs="cs"))

3.2 Using Facet to Plot Multiple Locations

[Figure: max_temp against date, one panel per location (Adelaide to Woomera), each with points and a smooth fitted curve.]

Partitioning the dataset by a categoric variable will reduce the blob effect for big data. For the above plot we use facet_wrap() to separately plot each location's maximum temperature over time. The x-axis tick labels are rotated 45°.

The seasonal effect is again very clear, and a mild increase in the maximum temperature over the small period is evident in many locations. Using large dots on the smaller plots still leads to much overlaid plotting.
    ds[sobs,] %>%
      ggplot(aes(x=date, y=max_temp)) +
      geom_point() +
      geom_smooth(method="gam", formula=y~s(x, bs="cs")) +
      facet_wrap(~location) +
      theme(axis.text.x=element_text(angle=45, hjust=1))

3.3 Facet over Locations with Small Dots

[Figure: max_temp against date, one panel per location (Adelaide to Woomera), drawn with small points and a smooth fitted curve.]

Here we have changed the size of the points to size=0.2, smaller than the default. The points then play less of a role in the display. Making the plotted points much smaller de-clutters the plot significantly and improves the presentation.
    ds[sobs,] %>%
      ggplot(aes(x=date, y=max_temp)) +
      geom_point(size=0.2) +
      geom_smooth(method="gam", formula=y~s(x, bs="cs")) +
      facet_wrap(~location) +
      theme(axis.text.x=element_text(angle=45, hjust=1))

3.4 Facet over Locations with Lines

[Figure: max_temp against date, one panel per location (Adelaide to Woomera), drawn with lines and a smooth fitted curve.]

Changing to lines using geom_line(), instead of points using geom_point(), might improve the presentation somewhat over the big dots, though it is still a little dense.
    ds[sobs,] %>%
      ggplot(aes(x=date, y=max_temp)) +
      geom_line() +
      geom_smooth(method="gam", formula=y~s(x, bs="cs")) +
      facet_wrap(~location) +
      theme(axis.text.x=element_text(angle=45, hjust=1))

3.5 Facet over Locations with Thin Lines

[Figure: max_temp against date, one panel per location (Adelaide to Woomera), drawn with thin lines and a smooth fitted curve.]

Finally, we use a smaller size to draw thinner lines. You can decide which looks best for the data you have. Compare the use of lines here with the earlier plots using geom_point(). Often the decision comes down to what you are aiming to achieve with the plots or what story you will tell.

    ds[sobs,] %>%
      ggplot(aes(x=date, y=max_temp)) +
      geom_line(size=0.1) +
      geom_smooth(method="gam", formula=y~s(x, bs="cs")) +
      facet_wrap(~location) +
      theme(axis.text.x=element_text(angle=45, hjust=1))

4 Histograms and Bar Charts

[Figure: bar chart of count by wind_dir_3pm (N, NNE, NE, ..., NNW).]

A histogram displays the frequency of observations using bars.
Such plots are easy to display using ggplot2, as we see in the code used to generate the plot. Here the data= is identified as ds and the x= aesthetic is wind_dir_3pm. Using these parameters we then add a bar geometric to build a bar plot for us.

The resulting plot shows the frequency of the levels of the categoric variable wind_dir_3pm from the dataset.

You will have noticed that we placed the plot at the top of the page so that as we turn over to the next page in this module we get a bit of an animation that highlights what changes.

In reviewing the above plot we might note that it looks rather dark and drab, so we try to turn it into an appealing graphic that draws us in to wanting to look at it and understand the story it is telling.

    p <- ds %>%
      ggplot(aes(x=wind_dir_3pm)) +
      geom_bar()
    p

4.1 Saving Plots to File

Once we have a plot displayed we can save the plot to file quite simply using ggsave(). The format is determined automatically by the name of the file to which we save the plot. Here we save the plot as a PDF:

    ggsave("barchart.pdf", width=11, height=7)

The default width= and height= are those of the current plotting window. For saving the plot above we have specified a particular width and height, presumably to suit our requirements for placing the plot in a document or in some presentation.

Essentially, ggsave() is a wrapper around the various functions that save plots to files, and adds some value to these default plotting functions. The traditional approach is to initiate a device, print the plot, and then close the device.

    pdf("barchart.pdf", width=11, height=7)
    p
    dev.off()

Note that for a pdf() device the default width= and height= are 7 (inches). Thus, for both of the above examples we are widening the plot whilst retaining the height.

There is some art required in choosing a good width and height.
When we increase the height or width, any text displayed on the plot stays essentially the same size. Thus by increasing the plot size, the text will appear smaller. By decreasing the plot size, the text becomes larger. Some experimentation is often required to get the right size for any particular purpose.

4.2 Bar Chart

[Figure: bar chart of count by wind_dir_3pm.]

    ds %>%
      ggplot(aes(x=wind_dir_3pm)) +
      geom_bar()

4.3 Stacked Bar Chart

[Figure: stacked bar chart of count by wind_dir_3pm, filled by rain_tomorrow (No, Yes).]

    ds %>%
      ggplot(aes(x=wind_dir_3pm, fill=rain_tomorrow)) +
      geom_bar()

4.4 Stacked Bar Chart with Reordered Legend

[Figure: stacked bar chart of count by wind_dir_3pm, filled by rain_tomorrow, with the legend order reversed (Yes, No).]

A common annoyance with how the bar chart's legend is arranged is that it is stacked in the reverse order to the stacks in the bar chart itself. This is basically understandable as the bars are plotted in the same order as the levels of the factor. In this case that is:

    levels(ds$rain_tomorrow)
    ## [1] "No"  "Yes"

So for each bar the count of the frequency of "No" is plotted first, and then the frequency of "Yes". Then the legend is listed in the same order, but from the top of the legend to the bottom. Logically that makes sense, but visually it makes more sense to stack the bars and the legend in the same order. We can do this simply with guides() and the reverse= option of guide_legend().
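The counts behind each stacked segment, and the level order the text refers to, can be inspected directly before plotting. A minimal sketch on a made-up data frame (the values in `toy` are illustrative assumptions, not taken from weatherAUS):

```r
# Toy data: wind direction and whether it rains tomorrow.
toy <- data.frame(
  wind_dir_3pm  = factor(c("N", "N", "S", "S", "S", "N"),
                         levels=c("N", "S")),
  rain_tomorrow = factor(c("No", "Yes", "No", "No", "Yes", "No"),
                         levels=c("No", "Yes")))

# Level order of the fill variable.
lv <- levels(toy$rain_tomorrow)
print(lv)

# Cross-tabulated counts: one cell per bar segment.
counts <- table(toy$wind_dir_3pm, toy$rain_tomorrow)
print(counts)
```

Each row of the table corresponds to one bar, and each column to one fill colour within it.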
    ds %>%
      ggplot(aes(x=wind_dir_3pm, fill=rain_tomorrow)) +
      geom_bar() +
      guides(fill=guide_legend(reverse=TRUE))

4.5 Dodged Bar Chart

[Figure: dodged bar chart of count by wind_dir_3pm, with side-by-side bars for rain_tomorrow (No, Yes).]

Notice for this version of the bar chart that the legend order is quite appropriate, and so no reversing of it is required.

    ds %>%
      ggplot(aes(x=wind_dir_3pm, fill=rain_tomorrow)) +
      geom_bar(position="dodge")

4.6 Stacked Bar Chart with Options

[Figure: stacked bar chart titled "Rain Expected by Wind Direction at 3pm"; y-axis Number of Days with comma formatted labels, x-axis Wind Direction 3pm with a timestamp, legend Tomorrow (Rain Expected, No Rain Expected).]

Here we illustrate the stacked bar chart, but now include a variety of options to improve its appearance. The options, of course, are infinite, and we will explore more options through this chapter. Here though we use brewer.pal() from RColorBrewer (Neuwirth, 2014) to generate a dark/light pair of colours that are softer in presentation. We change the order of the pair, so dark blue is first and light blue second, and use only these first two colours from the generated palette. We also include more informative labels through labs(), as well as using comma() from scales (Wickham, 2014) for formatting the thousands on the y-axis.
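To see what the palette selection amounts to, the colours can be inspected before they are handed to the plot. This sketch assumes RColorBrewer is installed and simply prints the hex codes picked out by the indexing [c(2, 1)]; the variable names `pal` and `fills` are introduced here for illustration.

```r
library(RColorBrewer)

# First four colours of the "Paired" palette: a light/dark blue pair
# followed by a light/dark green pair.
pal <- brewer.pal(4, "Paired")
print(pal)

# Take the second colour (dark blue) first, then the first (light blue).
fills <- pal[c(2, 1)]
print(fills)
```

Printing the vector this way makes it easy to confirm which colour will be matched to which factor level before building the full plot.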
    library(scales)
    library(RColorBrewer)
    ds %>%
      ggplot(aes(x=wind_dir_3pm, fill=rain_tomorrow)) +
      geom_bar() +
      scale_y_continuous(labels=comma) +
      scale_fill_manual(values=brewer.pal(4, "Paired")[c(2,1)],
                        guide=guide_legend(reverse=TRUE),
                        labels=c("No Rain Expected", "Rain Expected")) +
      labs(title="Rain Expected by Wind Direction at 3pm",
           x=paste0("Wind Direction 3pm\n", Sys.time()),
           y="Number of Days",
           fill="Tomorrow")

4.7 Narrow Bars

[Figure: bar chart of count by wind_dir_3pm with narrow bars.]

There are many options available to change the appearance of the histogram to make it look like almost anything we could want. In the following pages we will illustrate a few simpler modifications.

This first example simply makes the bars narrower using the width= option. Here we make them half width. Perhaps that helps to make the plot look less dark!

    ds %>%
      ggplot(aes(wind_dir_3pm)) +
      geom_bar(width=0.5)

4.8 Full Width Bars

[Figure: bar chart of count by wind_dir_3pm with touching, full width bars.]

Going the other direction, the bars can be made to touch by specifying a full width with width=1.

    ds %>%
      ggplot(aes(wind_dir_3pm)) +
      geom_bar(width=1)

4.9 Full Width Bars with Borders

[Figure: bar chart of count by wind_dir_3pm with full width grey bars and blue borders.]

We can change the appearance by adding a blue border to the bars, using the colour= option. By itself that would look a bit ugly, so we also fill the bars with a grey rather than a black fill.
We can play with different colours to achieve a pleasing and personalised result.

    ds %>%
      ggplot(aes(wind_dir_3pm)) +
      geom_bar(width=1, colour="blue", fill="grey")

4.10 Coloured Histogram Without a Legend

[Figure: bar chart of count by wind_dir_3pm, each bar filled with a different colour, no legend.]

Now we really add a flamboyant streak to our plot by adding quite a spread of colour. To do so we simply specify a fill= aesthetic to be controlled by the values of the variable wind_dir_3pm, which of course is the variable being plotted on the x-axis. A good set of colours is chosen by default.

We add a theme() to remove the legend that would be displayed by default, by indicating that the legend.position= is none.

    ds %>%
      ggplot(aes(wind_dir_3pm, fill=wind_dir_3pm)) +
      geom_bar() +
      theme(legend.position="none")

4.11 Comma Formatted Labels

[Figure: coloured bar chart of count by wind_dir_3pm with comma formatted y-axis labels (2,500 to 10,000).]

Since ggplot2 version 0.9.0 the scales (Wickham, 2014) package has handled many of the scale operations, in such a way as to support base and lattice graphics, as well as ggplot2 graphics. Scale operations include position guides, as in the axes, and aesthetic guides, as in the legend.

Notice that the y-axis has numbers using commas to separate the thousands. This is always a good idea as it assists us in quickly determining the magnitude of the numbers we are looking at. As a matter of course, I recommend we always use commas in plots (and tables). We do this through scale_y_continuous() and indicating labels= to include a comma.
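The labels= argument of a scale accepts a function that maps break values to strings. As a rough illustration of what a comma formatter does, here is a base R stand-in; `comma_label` is a hypothetical helper written for this sketch and is not part of scales.

```r
# Hypothetical helper: format numbers with a comma as the thousands
# separator, without padding and without scientific notation.
comma_label <- function(x) {
  format(x, big.mark=",", scientific=FALSE, trim=TRUE)
}

labels <- comma_label(c(0, 2500, 5000, 7500, 10000))
print(labels)
```

A function of this shape could be supplied as scale_y_continuous(labels=comma_label) in place of comma, although the scales version handles more cases.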
    ds %>%
      ggplot(aes(wind_dir_3pm, fill=wind_dir_3pm)) +
      geom_bar() +
      scale_y_continuous(labels=comma) +
      theme(legend.position="none")

4.12 Dollar Formatted Labels

[Figure: coloured bar chart by wind_dir_3pm with dollar formatted y-axis labels ($0 to $10,000).]

Though we are not actually plotting currency here, one of the supported scales is dollar. Choosing this will add the comma but also prefix the amount with $.

    ds %>%
      ggplot(aes(wind_dir_3pm, fill=wind_dir_3pm)) +
      geom_bar() +
      scale_y_continuous(labels=dollar) +
      labs(y="") +
      theme(legend.position="none")

4.13 Multiple Bars

[Figure: bar chart of mean temp_3pm by location, with overlapping location labels on the x-axis.]

Here we see another interesting plot, showing the mean temperature at 3pm for each location in the dataset. However, we notice that the location labels overlap and are quite a mess. Something has to be done about that.
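The bar heights in this plot are per-location means, which stat_summary() computes for us. The same kind of number can be checked by hand; a sketch using tapply() on a small made-up data frame (the values in `toy` are illustrative only):

```r
toy <- data.frame(
  location = factor(c("Albury", "Albury", "Uluru", "Uluru", "Uluru")),
  temp_3pm = c(20, 24, 30, 33, 36))

# Mean temp_3pm within each location: the height of each bar.
means <- tapply(toy$temp_3pm, toy$location, mean)
print(means)
```

Comparing a few such hand-computed means against the plot is a quick sanity check that the summary is doing what we expect.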
    # Note: recent ggplot2 releases name this argument fun= rather than fun.y=.
    ds %>%
      ggplot(aes(x=location, y=temp_3pm, fill=location)) +
      stat_summary(fun.y="mean", geom="bar") +
      theme(legend.position="none")

4.14 Rotated Labels

[Figure: bar chart of mean temp_3pm by location, with location labels rotated 90°.]

The obvious solution is to rotate the labels. We achieve this through modifying the theme(), setting axis.text.x= to be rotated 90°.

    ds %>%
      ggplot(aes(location, temp_3pm, fill=location)) +
      stat_summary(fun.y="mean", geom="bar") +
      theme(legend.position="none",
            axis.text.x=element_text(angle=90))

4.15 Horizontal Histogram

[Figure: horizontal bar chart of mean temp_3pm by location, with Woomera at the top and Adelaide at the bottom.]

Alternatively here we flip the coordinates and produce a horizontal histogram.
    ds %>%
      ggplot(aes(location, temp_3pm, fill=location)) +
      stat_summary(fun.y="mean", geom="bar") +
      theme(legend.position="none") +
      coord_flip()

4.16 Reorder the Levels

[Figure: horizontal bar chart of mean temp_3pm by location, with Adelaide at the top and Woomera at the bottom.]

We also want the labels in alphabetic order from the top, which makes the plot more accessible. After flipping the coordinates this requires reversing the order of the levels. We do this with mutate() inside the pipeline, so the original dataset is left unchanged and remains available below.
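Reversing the levels changes only the order in which the categories are laid out, not the data values themselves. A small base R sketch of the idiom on a toy factor (the names `loc` and `loc_rev` are introduced here for illustration):

```r
loc <- factor(c("Sydney", "Adelaide", "Perth", "Adelaide"))

# Default levels are sorted alphabetically.
print(levels(loc))

# Rebuild the factor with the same values but the levels reversed.
loc_rev <- factor(loc, levels=rev(levels(loc)))
print(levels(loc_rev))

# The underlying values are unchanged.
print(as.character(loc_rev))
```

This is the same expression used inside mutate() in this section, applied to a standalone vector.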
    ds %>%
      mutate(location=factor(location, levels=rev(levels(location)))) %>%
      ggplot(aes(location, temp_3pm, fill=location)) +
      stat_summary(fun.y="mean", geom="bar") +
      theme(legend.position="none") +
      coord_flip()

4.17 Plot the Mean with CI

[Figure: horizontal bar chart of mean temp_3pm by location, with an error bar on each bar.]

Here we add a confidence interval around the mean.

    ds %>%
      mutate(location=factor(location, levels=rev(levels(location)))) %>%
      ggplot(aes(location, temp_3pm, fill=location)) +
      stat_summary(fun.y="mean", geom="bar") +
      stat_summary(fun.data="mean_cl_normal", geom="errorbar",
                   conf.int=0.95, width=0.35) +
      theme(legend.position="none") +
      coord_flip()

4.18 Text Annotations

[Figure: horizontal bar chart of count by location, each bar annotated with its count (for example 2618 for Canberra, 2526 for Sydney).]

It would be informative to also show the actual numeric values on the plot. This plot shows the counts.

    # Note: from ggplot2 2.0.0 a discrete x needs stat="count" rather than
    # stat="bin".
    ds %>%
      mutate(location=factor(location, levels=rev(levels(location)))) %>%
      ggplot(aes(location, fill=location)) +
      geom_bar(width=1, colour="white") +
      theme(legend.position="none") +
      coord_flip() +
      geom_text(stat="bin", color="white", hjust=1.0, size=3,
                aes(y=..count.., label=..count..))

Exercise: Instead of plotting the counts, plot the mean temp_3pm, and include the textual value.

4.19 Text Annotations with Commas

[Figure: horizontal bar chart of count by location, each bar annotated with its comma formatted count (for example 2,618 for Canberra).]

A small variation is to add commas to the numeric annotations to separate the thousands (Will Beasley 121230 email).
    ds %>%
      mutate(location=factor(location, levels=rev(levels(location)))) %>%
      ggplot(aes(location, fill=location)) +
      geom_bar(width=1, colour="white") +
      theme(legend.position="none") +
      coord_flip() +
      geom_text(stat="bin", color="white", hjust=1.0, size=3,
                aes(y=..count.., label=scales::comma(..count..)))

Exercise: Do this without having to use scales::, perhaps using a format.

5 Density Distributions

[Figure: density of max_temp.]

    ds %>%
      ggplot(aes(x=max_temp)) +
      geom_density()

5.1 Skewed Distributions

[Figure: density of rainfall, heavily skewed towards low values.]

This plot certainly informs us about the skewed nature of the amount of rainfall recorded for any one day, but we lose a lot of resolution at the low end. Note that we use a subset of the dataset to include only those observations of rainfall (i.e., where the rainfall is non-zero). Otherwise a warning will note many rows contain non-finite values in calculating the density statistic.

    ds %>%
      filter(rainfall != 0) %>%
      ggplot(aes(x=rainfall)) +
      geom_density() +
      scale_y_continuous(labels=comma) +
      theme(legend.position="none")

5.2 Log Transform X Axis

[Figure: density of rainfall on a log10 x-axis.]

A common approach to dealing with the skewed distribution is to transform the scale to be a logarithmic scale, base 10.
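One reason zero values are awkward on a logarithmic axis is visible from the transform itself: log10(0) is -Inf, which cannot be positioned on the scale. A brief sketch with made-up rainfall values (the vector `rainfall` here is illustrative, not the dataset column):

```r
rainfall <- c(0, 0.2, 1, 10, 100)

# Zero maps to -Inf on a log scale; positive values map to finite numbers.
logged <- log10(rainfall)
print(logged)

# Keep only the finite values, which is what dropping zero rainfall achieves.
finite <- logged[is.finite(logged)]
print(length(finite))
```

The powers of ten 1, 10, and 100 land on 0, 1, and 2, which is why they make natural axis breaks on such a scale.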
    ds %>%
      filter(rainfall != 0) %>%
      ggplot(aes(x=rainfall)) +
      geom_density() +
      scale_x_log10() +
      theme(legend.position="none")

5.3 Log Transform X Axis with Breaks

[Figure: density plot of rainfall on a log10 x-axis with breaks at 1, 10 and 100. Axes: rainfall, density.]

    ds %>%
      filter(rainfall != 0) %>%
      ggplot(aes(x=rainfall)) +
      geom_density(binwidth=0.1) +
      scale_x_log10(breaks=c(1, 10, 100), labels=comma) +
      theme(legend.position="none")

5.4 Log Transform X Axis with Ticks

[Figure: as above, with logarithmic tick marks along the bottom and top of the plot. Axes: rainfall, density.]

    ds %>%
      filter(rainfall != 0) %>%
      ggplot(aes(x=rainfall)) +
      geom_density(binwidth=0.1) +
      scale_x_log10(breaks=c(1, 10, 100), labels=comma) +
      annotation_logticks(sides="bt") +
      theme(legend.position="none")

5.5 Log Transform X Axis with Custom Label

[Figure: as above, with x-axis labels 1mm, 10mm and 100mm. Axes: rainfall, density.]

    mm <- function(x) { sprintf("%smm", x) }

    ds %>%
      filter(rainfall != 0) %>%
      ggplot(aes(x=rainfall)) +
      geom_density() +
      scale_x_log10(breaks=c(1, 10, 100), labels=mm) +
      annotation_logticks(sides="bt") +
      theme(legend.position="none")

5.6 Transparent Plots

[Figure: overlapping semi-transparent density plots of temp_3pm for Canberra, Darwin, Melbourne and Sydney. Axes: temp_3pm, density. Legend: location.]

    cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")

    ds %>%
      filter(location %in% cities) %>%
      ggplot(aes(temp_3pm, colour = location, fill = location)) +
      geom_density(alpha = 0.55)

6 Box Plot Distributions
[Figure: notched box plots of max_temp for each year from 2007 to 2015. Axes: year, max_temp.]

A box plot, also known as a box and whiskers plot, shows the median (the second quartile) within a box which extends to the first and third quartiles. We note that each quartile delimits one quarter of the dataset and hence the box itself contains half the dataset.

Colour is added simply to improve the visual appeal of the plot rather than to convey new information. Since we include fill= we also turn off the otherwise included legend.

Here we observe the overall change in the maximum temperature over the years. Notice the first and last box plots, which probably reflect truncated data. We should confirm this in the data before making significant statements regarding these observations.

    ds %>%
      mutate(year=factor(format(ds$date, "%Y"))) %>%
      ggplot(aes(x=year, y=max_temp, fill=year)) +
      geom_boxplot(notch=TRUE) +
      theme(legend.position="none")

6.1 Violin Plot

[Figure: violin plots of max_temp for each year from 2007 to 2015. Axes: year, max_temp.]

A violin plot is another interesting way to present a distribution, using a shape that resembles a violin. We again use colour to improve the visual appeal of the plot.

    ds %>%
      mutate(year=factor(format(ds$date, "%Y"))) %>%
      ggplot(aes(x=year, y=max_temp, fill=year)) +
      geom_violin() +
      theme(legend.position="none")

6.2 Violin with Box Plot

[Figure: violin plots of max_temp for each year from 2007 to 2015, each overlaid with a box plot. Axes: year, max_temp.]

We can overlay the violin plot with a box plot to show the quartiles.
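The quartiles that delimit the box can also be computed directly. The following is a small sketch in base R using simulated temperatures (the vector temps is hypothetical, not the max_temp variable of the dataset). It illustrates the point that the box, running from the first to the third quartile, holds about half of the observations.

```r
# Simulated maximum temperatures; hypothetical values for illustration only.
set.seed(42)
temps <- rnorm(1000, mean = 23, sd = 7)

# First quartile, median (second quartile) and third quartile.
q <- quantile(temps, probs = c(0.25, 0.50, 0.75))
print(q)

# Proportion of observations lying within the box.
inside_box <- mean(temps >= q[1] & temps <= q[3])
print(inside_box)
```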
    ds %>%
      mutate(year=factor(format(ds$date, "%Y"))) %>%
      ggplot(aes(x=year, y=max_temp, fill=year)) +
      geom_violin() +
      geom_boxplot(width=.5, position=position_dodge(width=0)) +
      theme(legend.position="none")

6.3 Violin with Box Plot by Location

[Figure: violin plots with overlaid box plots of max_temp by year, one panel for each of the 49 locations from Adelaide to Woomera. Axes: year, max_temp.]

We can readily split the plot across the locations. Things get a little crowded, but we get an overall view across all of the different weather stations. Notice we also rotated the x-axis labels so that they don't overlap.

We can immediately see one of the issues with this dataset, noting that three weather stations have fewer observations than the others. Various other observations are also interesting. Some locations have little variation in their maximum temperatures over the years.
    ds %>%
      mutate(year=factor(format(ds$date, "%Y"))) %>%
      ggplot(aes(x=year, y=max_temp, fill=year)) +
      geom_violin() +
      geom_boxplot(width=.5, position=position_dodge(width=0)) +
      theme(legend.position="none",
            axis.text.x=element_text(angle=45, hjust=1)) +
      facet_wrap(~location)

7 Misc Plots

7.1 Waterfall Chart

A Waterfall Chart (also known as a cascade chart) is useful in visualizing the breakdown of a measure into some components that are introduced to reduce the original value. An example here is of the cash flow for a small business over a year. See https://learnr.wordpress.com/2010/05/10/ggplot2-waterfall-charts/.

We will invent some simple data to illustrate. Imagine we have a spreadsheet with a company's starting cash balance for the year and then a series of items that are either cash flows in or out. Notice we force an ordering on the items using base::factor(). Also note the technique to sum the vector of items to add the final balance to the vector using the pipe operator.

    items <- c("Start", "IT", "Sales", "Adv", "Elect", "Tax", "Salary",
               "Rebate", "Advise", "Consult", "End")
    amounts <- c(30, -5, 15, -8, -4, -7, 11, 8, -4, 20) %>% c(., sum(.))
    cash <- data.frame(Transaction=factor(items, levels=items), Amount=amounts)
    cash

    ##   Transaction Amount
    ## 1       Start     30
    ## 2          IT     -5
    ## 3       Sales     15
    ## 4         Adv     -8
    ## 5       Elect     -4
    ....

    levels(cash$Transaction)

    ## [1] "Start"   "IT"      "Sales"   "Adv"     "Elect"   "Tax"     "Salary"
    ## [8] "Rebate"  "Advise"  "Consult" "End"

We'll record whether the flow is in or out as a convenience for us in the plot.
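As a quick check of the invented cash-flow figures, the running balance can be worked out in base R without the pipe. This is only a sketch of the arithmetic; the names flows, balance and closing are hypothetical and are not used elsewhere in the chapter.

```r
# The ten invented cash flows, from the starting balance to the last item.
flows <- c(30, -5, 15, -8, -4, -7, 11, 8, -4, 20)

# The running balance after each item is the cumulative sum of the flows.
balance <- cumsum(flows)

# The closing balance is the sum of all flows, which is also the last
# running balance.
closing <- sum(flows)

# Appending the closing balance gives the same vector as the piped version.
amounts_check <- c(flows, closing)

print(balance)
print(amounts_check)
```

The closing balance is the value plotted as the final "End" bar, and the running balance gives the heights at which the intermediate rectangles start and end.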
    library(magrittr)  # Provides the %<>% assignment pipe.

    cash %<>%
      mutate(Direction=ifelse(Transaction %in% c("Start", "End"), "Net",
                              ifelse(Amount>0, "Credit", "Debit")) %>%
               factor(levels=c("Debit", "Credit", "Net")))

We now need to calculate the locations of the rectangles along the x-axis. We can count the transactions from 1 and use this as an offset for specifying where to draw the rectangle. Then we need to calculate the y value for the rectangles. This is based on the cumulative sum.

    cash %<>%
      mutate(id=seq_along(Transaction),
             end=c(head(cumsum(Amount), -1), tail(Amount, 1)),
             start=c(0, head(end, -2), 0))
    cash

    ##   Transaction Amount Direction id end start
    ## 1       Start     30       Net  1  30     0
    ## 2          IT     -5     Debit  2  25    30
    ## 3       Sales     15    Credit  3  40    25
    ## 4         Adv     -8     Debit  4  32    40
    ## 5       Elect     -4     Debit  5  28    32
    ....

    cash %>%
      ggplot(aes(fill=Direction)) +
      geom_rect(aes(x=Transaction,
                    xmin=id-0.45, xmax=id+0.45,
                    ymin=end, ymax=start)) +
      geom_text(aes(x=id, y=start, label=Transaction),
                size=3,
                vjust=ifelse(cash$Direction == "Debit", -0.2, 1.2)) +
      geom_text(aes(x=id, y=start, label=Amount),
                size=3, colour="white",
                vjust=ifelse(cash$Direction == "Debit", 1.3, -0.3)) +
      theme(axis.text.x = element_blank(),
            axis.ticks.x = element_blank()) +
      ylab("Balance in Thousand Dollars") +
      ggtitle("Annual Cash Flow Analysis")

[Figure: Annual Cash Flow Analysis. Waterfall chart of the balance in thousand dollars for each transaction, from Start (30) to End (56), with each bar labelled with its transaction and amount and filled by Direction (Debit, Credit, Net). Axes: Transaction, Balance in Thousand Dollars.]

8 Pairs Plot: Using ggpairs()
[Figure: pairs plot of MinTemp, MaxTemp, Evaporation, Sunshine, WindDir9am, Cloud3pm, RainToday and RainTomorrow from the weather dataset, combining scatter plots, correlation values, histograms and box plots.]

A sophisticated pairs plot, as delivered by ggpairs() from GGally (Schloerke et al., 2014), brings together many of the kinds of plots we have already seen to compare pairs of variables. Here we have used it to plot the weather dataset from rattle (Williams, 2015). We see scatter plots, histograms, and box plots in the pairs plot.

    weather[c(3,4,6,7,10,19,22,24)] %>%
      na.omit() %>%
      ggpairs(params=c(shape=I("."), outlier.shape=I(".")))

9 Cumulative Distribution Plot

[Figure: Cumulative Rainfall. Line plot of cumulative rainfall in millimetres from 2009 to 2015 for Adelaide, Canberra, Darwin and Hobart. Axes: date, millimetres. Legend: location.]

Here we show the cumulative sum of a variable for different locations.
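The idea of a per-location cumulative sum can be tried on a tiny data frame first. This is a minimal sketch that assumes dplyr is installed; the data frame toy and its values are invented for illustration. Note that it sorts the rows with arrange() before calling cumsum(), which is a simpler substitute for the order_by() helper used in the chapter's code.

```r
library(dplyr)

# Hypothetical observations for two made-up locations, deliberately unsorted.
toy <- data.frame(
  location = c("A", "B", "A", "B", "A"),
  date     = as.Date(c("2009-01-02", "2009-01-01", "2009-01-01",
                       "2009-01-02", "2009-01-03")),
  rainfall = c(5, 1, 2, 0, 3)
)

# Sort by location and date, then accumulate rainfall within each location.
cum <- toy %>%
  arrange(location, date) %>%
  group_by(location) %>%
  mutate(cumRainfall = cumsum(rainfall)) %>%
  ungroup()

print(cum)
```

Each location restarts its own running total, which is why the lines in a cumulative plot rise independently for each city.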
    cities <- c("Adelaide", "Canberra", "Darwin", "Hobart")

    ds %>%
      filter(location %in% cities & date >= "2009-01-01") %>%
      group_by(location) %>%
      mutate(cumRainfall=order_by(date, cumsum(rainfall))) %>%
      ggplot(aes(x=date, y=cumRainfall, colour=location)) +
      geom_line() +
      ylab("millimetres") +
      scale_y_continuous(labels=comma) +
      ggtitle("Cumulative Rainfall")

10 Parallel Coordinates Plot

[Figure: parallel coordinates plot of the scaled numeric variables min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm, temp_9am, temp_3pm and risk_mm. Axes: variable, value.]

A parallel coordinates plot provides a graphical summary of multivariate data. Generally it works best with 10 or fewer variables. The variables will be scaled to a common range (e.g., by performing a z-score transform which subtracts the mean and divides by the standard deviation) and each observation is plotted as a line running horizontally across the plot.

One of the main uses of parallel coordinate plots is to identify common groups of observations through their common patterns of variable values. We can use parallel coordinate plots to identify patterns of systematic relationships between variables over the observations.

For more than about 10 variables consider the use of heatmaps or fluctuation plots.

Here we use ggparcoord() from GGally (Schloerke et al., 2014) to generate the parallel coordinates plot. The underlying plotting is performed by ggplot2 (Wickham and Chang, 2015).
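The z-score transform mentioned above is simple to reproduce by hand. The following base R sketch uses a small invented data frame (toy) and a hypothetical helper zscore(). It only illustrates the arithmetic of subtracting the mean and dividing by the standard deviation, and is not a description of how ggparcoord() is implemented.

```r
# Hypothetical observations on very different scales.
toy <- data.frame(
  min_temp     = c(12, 18, 25, 9),
  rainfall     = c(80, 120, 95, 150),
  pressure_9am = c(1008, 1002, 1011, 998)
)

# z-score: subtract the mean and divide by the standard deviation.
zscore <- function(x) (x - mean(x)) / sd(x)

# Apply the transform to every column.
scaled <- as.data.frame(lapply(toy, zscore))

print(round(scaled, 2))
print(round(colMeans(scaled), 10))   # each column now has mean 0
print(sapply(scaled, sd))            # and standard deviation 1
```

After the transform every variable is centred on zero with unit spread, so temperatures, rainfall and pressures can share a single vertical axis.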
    library(GGally)

    cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")

    ds %>%
      filter(location %in% cities & rainfall>75) %>%
      ggparcoord(columns=numi) +
      theme(axis.text.x=element_text(angle=45))

10.1 Labels Aligned

[Figure: the same parallel coordinates plot with the rotated x-axis labels aligned at their top edge. Axes: variable, value.]

We notice the labels are by default aligned by their centres. Rotating 45° causes the labels to sit over the plot region. We can ask the labels to be aligned at the top edge instead, using hjust=1.

    ds %>%
      filter(location %in% cities & rainfall>75) %>%
      ggparcoord(columns=numi) +
      theme(axis.text.x=element_text(angle=45, hjust=1))

10.2 Colour and Label

[Figure: the same parallel coordinates plot with lines coloured by location (Canberra, Darwin, Melbourne, Sydney) and a horizontal legend placed inside the plot. Axes: variable, value.]

Here we add some colour and force the legend to be within the plot rather than, by default, reducing the plot size and adding the legend to the right. We can discern just a little structure relating to locations. We have limited the data to those days where more than 75mm of rain is recorded and clearly Darwin becomes prominent. Darwin has many days with at least this much rain, Sydney has a few days and Canberra and Melbourne only one day. We would confirm this with actual queries of the data.
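Such a confirming query might look like the following sketch, which assumes dplyr is installed. To keep it self-contained it runs on a small invented data frame (toy) rather than on ds, so the counts it prints say nothing about the real dataset. With the chapter's data loaded, ds could be substituted for toy.

```r
library(dplyr)

# Hypothetical stand-in for ds: one row per day, with made-up rainfall.
toy <- data.frame(
  location = c("Darwin", "Darwin", "Darwin", "Sydney", "Sydney",
               "Canberra", "Melbourne", "Perth"),
  rainfall = c(120, 90, 10, 80, 3, 76, 0, 200)
)

cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")

# Count the days with more than 75mm of rain for each city of interest.
wet_days <- toy %>%
  filter(location %in% cities & rainfall > 75) %>%
  count(location) %>%
  arrange(desc(n))

print(wet_days)
```

Locations with no qualifying days drop out of the result, so a city missing from the table has no days above the threshold.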
Apart from this, the parallel coordinates plot in this instance does not show much structure.

The example code illustrates several aspects of placing the legend. We specified a specific location within the plot itself, justified as top left, laid out horizontally, and transparent.

    ds %>%
      filter(location %in% cities & rainfall>75) %>%
      ggparcoord(columns=numi, groupColumn="location") +
      theme(axis.text.x=element_text(angle=45, hjust=1),
            legend.position=c(0.25, 1),
            legend.justification=c(0.0, 1.0),
            legend.direction="horizontal",
            legend.background=element_rect(fill="transparent"),
            legend.key=element_rect(fill="transparent",
                                    colour="transparent"))

11 Polar Coordinates

[Figure: bar chart of the maximum max_temp for each month, January to December, drawn on polar coordinates. Axes: month, max.]

Here we use a polar coordinate to plot what is otherwise a bar chart. The data preparation begins with converting the date to a month, then grouping by the month, after which we obtain the maximum max_temp for each group. The months are then ordered correctly. We then generate a bar plot and place the plot on a polar coordinate. The result is not particularly easy to read, and it is probably better not to use polar coordinates in this case!

    ds %>%
      mutate(month=format(date, "%b")) %>%
      group_by(month) %>%
      summarise(max=max(max_temp)) %>%
      mutate(month=factor(month, levels=month.abb)) %>%
      ggplot(aes(x=month, y=max, group=1)) +
      geom_bar(width=1, stat="identity", fill="orange", color="black") +
      coord_polar(theta="x", start=-pi/12)

12 Multiple Plots: Using grid.arrange()

Here we illustrate the ability to lay out multiple plots in a regular grid using grid.arrange() from gridExtra (Auguie, 2012). We illustrate this with the weatherAUS dataset from rattle.
We generate a number of informative plots, using the dplyr package to aggregate the cumulative rainfall. The scales (Wickham, 2014) package is used to provide labels with commas.

    library(rattle)
    library(ggplot2)
    library(scales)
    library(dplyr)   # Provides %>%, filter(), group_by() and mutate().

    cities <- c("Adelaide", "Canberra", "Darwin", "Hobart")

    dss <- ds %>%
      filter(location %in% cities & date >= "2009-01-01") %>%
      group_by(location) %>%
      mutate(cumRainfall=order_by(date, cumsum(rainfall)))

    p  <- ggplot(dss, aes(x=date, y=cumRainfall, colour=location))
    p1 <- p + geom_line()
    p1 <- p1 + ylab("millimetres")
    p1 <- p1 + scale_y_continuous(labels=comma)
    p1 <- p1 + ggtitle("Cumulative Rainfall")

    p2 <- ggplot(dss, aes(x=date, y=max_temp, colour=location))
    p2 <- p2 + geom_point(alpha=.1)
    p2 <- p2 + geom_smooth(method="loess", alpha=.2, size=1)
    p2 <- p2 + ggtitle("Fitted Max Temperature Curves")

    p3 <- ggplot(dss, aes(x=pressure_3pm, colour=location))
    p3 <- p3 + geom_density()
    p3 <- p3 + ggtitle("Air Pressure at 3pm")

    p4 <- ggplot(dss, aes(x=sunshine, fill=location))
    p4 <- p4 + facet_grid(location ~ .)
    p4 <- p4 + geom_histogram(colour="black", binwidth=1)
    p4 <- p4 + ggtitle("Hours of Sunshine")
    p4 <- p4 + theme(legend.position="none")

13 Multiple Plots: Using grid.arrange()

[Figure: a 2 by 2 grid of four plots for Adelaide, Canberra, Darwin and Hobart: Cumulative Rainfall (date against millimetres), Fitted Max Temperature Curves (date against max_temp), Air Pressure at 3pm (density of pressure_3pm), and Hours of Sunshine (histograms of sunshine, one row per location).]

    library(gridExtra)
    grid.arrange(p1, p2, p3, p4)

The actual plots are arranged by grid.arrange().
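The shape of the grid can also be chosen explicitly. The sketch below assumes ggplot2 and gridExtra are installed and uses two throwaway plots (pa and pb, built from an invented data frame) instead of the four weather plots. It uses arrangeGrob(), the gridExtra function that builds the layout without drawing it, with the ncol= argument to request a single row of two columns.

```r
library(ggplot2)
library(gridExtra)

# Invented data for two simple placeholder plots.
toy <- data.frame(x = 1:10, y = (1:10)^2)

pa <- ggplot(toy, aes(x, y)) + geom_line()  + ggtitle("Line")
pb <- ggplot(toy, aes(x, y)) + geom_point() + ggtitle("Points")

# Build a layout with the two plots side by side.
side_by_side <- arrangeGrob(pa, pb, ncol = 2)

# Draw it on a fresh page of the current graphics device.
grid::grid.newpage()
grid::grid.draw(side_by_side)
```

Calling grid.arrange(pa, pb, ncol = 2) builds and draws the same layout in one step. Keeping the arranged object around is convenient when the layout is to be saved or drawn more than once.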
A collection of plots such as this can be quite informative and effective in displaying the information efficiently. We can see that Darwin stands out. It is located in the tropics, whereas the remaining cities are in the southern regions of Australia.

14 Multiple Plots: Alternative Arrangements

15 Multiple Plots: Sharing a Legend

16 Multiple Plots: 2D Histogram

17 Multiple Plots: 2D Histogram Plot

18 Multiple Plots: Using layOut()

18.1 Arranging Plots with layOut()

18.2 Arranging Plots Through Viewports

18.3 Plot

19 Plotting a Table

20 Plotting a Table and Text
21 Interactive Plot Building

22 Having Some Fun with xkcd

[Figure: scatter plot of max_temp against min_temp drawn in the hand-drawn xkcd style. Axes: min_temp, max_temp.]

For a bit of fun we can generate plots in the style of the popular online comic strip xkcd using xkcd (Manzanera, 2014).

On Ubuntu we can install the required fonts with the following steps:

    library(extrafont)
    download.file("http://simonsoftware.se/other/xkcd.ttf", destfile="xkcd.ttf")
    system("mkdir ~/.fonts")
    system("mv xkcd.ttf ~/.fonts")
    font_import()
    loadfonts()

The font packages can be removed again with:

    remove.packages(c("extrafont","extrafontdb"))

We can then generate a roughly yet neatly drawn plot, as if it might have been drawn by the hand of the author of the comic strip.

    library(xkcd)
    xrange <- range(ds$min_temp)
    yrange <- range(ds$max_temp)
    ds[sobs,] %>%
      ggplot(aes(x=min_temp, y=max_temp)) +
      geom_point() +
      xkcdaxis(xrange, yrange)

22.1 Bar Chart

[Figure: xkcd-style bar chart of the maximum temperature recorded in each year from 2007 to 2014. Axes: Year, Temperature Degrees Celsius.]

We can do a bar chart in the same style.

    library(xkcd)
    library(scales)
    library(lubridate)

    nds <- ds %>%
      mutate(year=year(ds$date)) %>%
      group_by(year) %>%
      summarise(max_temp=max(max_temp)) %>%
      mutate(xmin=year-0.1, xmax=year+0.1, ymin=0, ymax=max_temp)

    mapping <- aes(xmin=xmin, ymin=ymin, xmax=xmax, ymax=ymax)

    ggplot() +
      xkcdrect(mapping, nds) +
      xkcdaxis(c(2006.8, 2014.2), c(0, 50)) +
      labs(x="Year", y="Temperature Degrees Celsius") +
      scale_y_continuous(labels=comma)

23 Exercises

Exercise 1 Dashboard

Your computer collects various information about its usage.
These are often called log files. Identify some log files that you have access to on your computer. Build a one page PDF dashboard using KnitR and ggplot2 to fill the page with informative plots. Be sure to include some of the following types of plots. (Compare this to my bin/dashboard.sh script.)

1. Bar plot
2. Line chart
3. etc.

24 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This chapter is one of many chapters available from http://HandsOnDataScience.com. In particular follow the links on the website with a * which indicates the generally more developed chapters.

Other resources include:

- The GGPlot2 documentation is quite extensive and useful.
- The R Cookbook is a great resource explaining how to do many types of plots using ggplot2.

25 References

Auguie B (2012). gridExtra: functions in Grid graphics. R package version 0.9.1, URL http://CRAN.R-project.org/package=gridExtra.

Manzanera ET (2014). xkcd: Plotting ggplot2 graphics in a XKCD style. R package version 0.0.3, URL http://CRAN.R-project.org/package=xkcd.

Neuwirth E (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2, URL http://CRAN.R-project.org/package=RColorBrewer.

R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Schloerke B, Crowley J, Cook D, Hofmann H, Wickham H, Briatte F, Marbach M, Thoen E (2014). GGally: Extension to ggplot2. R package version 0.5.0, URL http://CRAN.R-project.org/package=GGally.

Wickham H (2014). scales: Scale functions for graphics. R package version 0.2.4, URL http://CRAN.R-project.org/package=scales.

Wickham H, Chang W (2015). ggplot2: An Implementation of the Grammar of Graphics. R package version 1.0.1, URL http://CRAN.R-project.org/package=ggplot2.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45–55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York.

Williams GJ (2015). rattle: Graphical User Interface for Data Mining in R. R package version 3.4.2, URL http://rattle.togaware.com/.

This document, sourced from GGPlot2O.Rnw revision 779, was processed by KnitR version 1.9 of 2015-01-20 and took 67.4 seconds to process. It was generated by gjw on theano running Ubuntu 14.04.2 LTS with Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz having 4 cores and 3.9GB of RAM. It completed the processing 2015-05-22 08:06:18.