Visualising data with ggplot2

Hands-On Data Science with R
Exploring Data with GGPlot2
[email protected]
22nd May 2015
Visit http://HandsOnDataScience.com/ for more Chapters.
The required packages for this chapter are:
T
The ggplot2 (Wickham and Chang, 2015) package implements a grammar of graphics. A plot
is built starting with the dataset and aesthetics (e.g., x-axis and y-axis) and adding geometric
elements, statistical operations, scales, facets, coordinates, and options.
DR
AF
library(ggplot2)
# Visualise data.
library(scales)
# Include commas in numbers.
library(RColorBrewer) # Choose different color.
library(rattle)
# Weather dataset.
library(randomForest) # Use na.roughfix() to deal with missing data.
library(gridExtra) # Layout multiple plots.
library(wq)
# Regular grid layout.
library(xkcd)
# Some xkcd fun.
library(extrafont) # Fonts for xkcd.
library(sysfonts) # Font support for xkcd.
library(GGally)
# Parallel coordinates.
library(dplyr)
# Data wrangling.
library(lubridate) # Dates and time.
As we work through this chapter, new R commands will be introduced. Be sure to review the
command’s documentation and understand what the command does. You can ask for help using
the ? command as in:
?read.csv
We can obtain documentation on a particular package using the help= option of library():
library(help=rattle)
This chapter is intended to be hands on. To learn effectively, you are encouraged to have R
running (e.g., RStudio) and to run all the commands as they appear here. Check that you get
the same output, and you understand the output. Try some variations. Explore.
Copyright © 2013-2014 Graham Williams. You can freely copy, distribute,
or adapt this material, as long as the attribution is retained and derivative
work is provided under the same license.
Data Science with R
1
Hands-On
Exploring Data with GGPlot2
Preparing the Dataset
We use the relatively large weatherAUS dataset from rattle (Williams, 2015) to illustrate the
capabilities of ggplot2. For plots which generate large images we might use random subsets of
the same dataset.
library(rattle)
dsname <- "weatherAUS"
ds
<- get(dsname)
The dataset is summarised below.
dim(ds)
## [1] 105379
24
names(ds)
head(ds)
##
## 1
## 2
## 3
## 4
## 5
....
"Location"
"Evaporation"
"WindDir9am"
"Humidity9am"
"Cloud9am"
"RainToday"
"MinTemp"
"Sunshine"
"WindDir3pm"
"Humidity3pm"
"Cloud3pm"
"RISK_MM"
"MaxTemp"
"WindGustDir"
"WindSpeed9am"
"Pressure9am"
"Temp9am"
"RainTomorrow"
T
"Date"
"Rainfall"
"WindGustSpeed"
"WindSpeed3pm"
"Pressure3pm"
"Temp3pm"
DR
AF
## [1]
## [5]
## [9]
## [13]
## [17]
## [21]
....
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
2008-12-01
Albury
13.4
22.9
0.6
NA
NA
2008-12-02
Albury
7.4
25.1
0.0
NA
NA
2008-12-03
Albury
12.9
25.7
0.0
NA
NA
2008-12-04
Albury
9.2
28.0
0.0
NA
NA
2008-12-05
Albury
17.5
32.3
1.0
NA
NA
tail(ds)
##
## 105374
## 105375
## 105376
## 105377
## 105378
....
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
2015-03-25
Uluru
20.4
25.5
0
NA
NA
2015-03-26
Uluru
18.7
29.5
0
NA
NA
2015-03-27
Uluru
17.9
31.3
0
NA
NA
2015-03-28
Uluru
17.5
30.8
0
NA
NA
2015-03-29
Uluru
21.1
27.7
0
NA
NA
str(ds)
## 'data.frame': 105379 obs. of 24 variables:
## $ Date
: Date, format: "2008-12-01" "2008-12-02" ...
## $ Location
: Factor w/ 49 levels "Adelaide","Albany",..: 3 3 3 3 3 3 ...
## $ MinTemp
: num 13.4 7.4 12.9 9.2 17.5 14.6 14.3 7.7 9.7 13.1 ...
## $ MaxTemp
: num 22.9 25.1 25.7 28 32.3 29.7 25 26.7 31.9 30.1 ...
## $ Rainfall
: num 0.6 0 0 0 1 0.2 0 0 0 1.4 ...
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 1 of 68
Data Science with R
Hands-On
Exploring Data with GGPlot2
....
summary(ds)
Location
Canberra: 2618
Sydney : 2526
Adelaide: 2375
Brisbane: 2375
Darwin : 2375
MinTemp
Min.
:-8.50
1st Qu.: 7.70
Median :12.00
Mean
:12.19
3rd Qu.:16.90
MaxTemp
Min.
:-4.1
1st Qu.:18.0
Median :22.6
Mean
:23.2
3rd Qu.:28.2
DR
AF
T
##
Date
## Min.
:2007-11-01
## 1st Qu.:2010-06-07
## Median :2012-01-31
## Mean
:2012-01-29
## 3rd Qu.:2013-10-09
....
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 2 of 68
Data Science with R
Collecting Information
normVarNames(names(ds)) # Optional lower case variable names.
names(ds)
"rain_tomorrow"
c("date", "location")
id
setdiff(vars, target)
which(sapply(ds[vars], is.numeric))
min_temp
max_temp
3
4
sunshine wind_gust_speed
7
9
humidity_9am
humidity_3pm
14
15
numerics
numerics
## [1]
## [4]
## [7]
## [10]
## [13]
## [16]
....
cati
cati
<<<<<<<-
rainfall
5
wind_speed_9am
12
pressure_9am
16
<- names(numi)
"min_temp"
"evaporation"
"wind_speed_9am"
"humidity_3pm"
"cloud_9am"
"temp_3pm"
"max_temp"
"sunshine"
"wind_speed_3pm"
"pressure_9am"
"cloud_3pm"
"risk_mm"
evaporation
6
wind_speed_3pm
13
pressure_3pm
17
T
names(ds)
vars
target
id
ignore
inputs
numi
numi
##
##
##
##
##
##
....
Exploring Data with GGPlot2
"rainfall"
"wind_gust_speed"
"humidity_9am"
"pressure_3pm"
"temp_9am"
DR
AF
2
Hands-On
<- which(sapply(ds[vars], is.factor))
##
location wind_gust_dir
##
2
8
## rain_tomorrow
##
24
wind_dir_9am
10
wind_dir_3pm
11
rain_today
22
categorics <- names(cati)
categorics
## [1] "location"
## [5] "rain_today"
"wind_gust_dir" "wind_dir_9am"
"rain_tomorrow"
"wind_dir_3pm"
We perform missing value imputation simply to avoid warnings from ggplot2, ignoring whether
this is appropriate to do so from a data integrity point of view.
library(randomForest)
sum(is.na(ds))
## [1] 226034
ds[setdiff(vars, ignore)] <- na.roughfix(ds[setdiff(vars, ignore)])
sum(is.na(ds))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 3 of 68
Data Science with R
Hands-On
Exploring Data with GGPlot2
DR
AF
T
## [1] 0
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 4 of 68
Data Science with R
3
Hands-On
Exploring Data with GGPlot2
Scatter Plot
40
max_temp
30
rain_tomorrow
No
20
Yes
0
10
min_temp
20
30
DR
AF
0
T
10
A scatter plot displays points scattered over a plot. The points are specified as x and y for a two
dimensional plot, as specified by the aesthetics. We use geom point() to plot the points.
We first identify a random subset of just 1,000 observations to plot, rather than filling the plot
completely with points, as might happen when we plot too points on a scatter plot.
We then subset the dataset simply by providing the random vector of indicies to the subset
operator of the dataset. This subset is piped (using %>%) to ggplot(), identifying the x-axis,
the y-axis, and how the points should be coloured as the aesthetics of the plot using aes(), to
which a geom point() layer is then added.
sobs <- sample(nrow(ds), 1000)
ds[sobs,]
ggplot(aes(x=min_temp, y=max_temp, colour=rain_tomorrow))
geom_point()
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
Page: 5 of 68
Data Science with R
3.1
Hands-On
Exploring Data with GGPlot2
Adding a Smooth Fitted Curve
40
20
T
max_temp
30
DR
AF
10
0
2008
2010
date
2012
2014
Here we have added a smooth fitted curve using geom smooth(). Since the dataset has many
points the smoothing method recommend is method="gam" and will be automatically chosen if not
specified, with a message displayed to inform us of this. The formula specified, formula=y~
s(x,
bs="cs") is also the default for method="gam".
Typical of scatter plots of big data there is a mass of overlaid points. We can see patterns
though, noting the obvious pattern of seasonality. Notice also the three or four white vertical
bands—this is indicative of some systemic issue with the data in that for specific days it would
seem there is no data collected. Something to explore and understand further, but our scope for
now.
ds[sobs,]
ggplot(aes(x=date, y=max_temp))
geom_point()
geom_smooth(method="gam", formula=y~s(x, bs="cs"))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
Page: 6 of 68
Data Science with R
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
14
12
20
10
20
08
20
20
20
14
Woomera
12
Wollongong
20
08
Uluru
20
14
20
12
20
10
20
08
20
14
20
12
20
10
20
08
14
20
date
Witchcliffe
Tuggeranong
10
Townsville
Williamtown
20
12
20
10
20
08
Watsonia
20
14
10
08
20
14
Walpole
20
12
20
08
20
20
10
WaggaWagga
50
0
-50
-100
-150
20
50
0
-50
-100
-150
12
50
0
-50
-100
-150
20
50
0
-50
-100
-150
DR
AF
50
0
-50
-100
-150
20
50
0
-50
-100
-150
T
Adelaide
50
0
-50
-100
-150
max_temp
Exploring Data with GGPlot2
Using Facet to Plot Multiple Locations
20
3.2
Hands-On
Partitioning the dataset by a categoric variable will reduce the blob effect for big data. For the
above plot we use facet wrap() to separately plot each location’s maximum temperature over
time. The x axis tick labels are rotated 45◦ .
The seasonal effect is again very clear, and a mild increase in the maximum temperature over
the small period is evident in many locations.
Using large dots on the smaller plots still leads to much overlaid plotting.
ds[sobs,]
ggplot(aes(x=date, y=max_temp))
geom_point()
geom_smooth(method="gam", formula=y~s(x, bs="cs"))
facet_wrap(~location)
theme(axis.text.x=element_text(angle=45, hjust=1))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
+
+
Page: 7 of 68
Data Science with R
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
14
12
20
10
20
08
20
20
20
14
Woomera
12
Wollongong
20
08
Uluru
20
14
20
12
20
10
20
08
20
14
20
12
20
10
20
08
14
20
date
Witchcliffe
Tuggeranong
10
Townsville
Williamtown
20
12
20
10
20
08
Watsonia
20
14
10
08
20
14
Walpole
20
12
20
08
20
20
10
WaggaWagga
50
0
-50
-100
-150
20
50
0
-50
-100
-150
12
50
0
-50
-100
-150
20
50
0
-50
-100
-150
DR
AF
50
0
-50
-100
-150
20
50
0
-50
-100
-150
T
Adelaide
50
0
-50
-100
-150
max_temp
Exploring Data with GGPlot2
Facet over Locations with Small Dots
20
3.3
Hands-On
Here we have changed the size of the points to be size=0.2 instead of the default size=0.5.
The points then play less of a role in the display.
Changing to plotted points to be much smaller de-clutters the plot significantly and improves
the presentation.
ds[sobs,]
ggplot(aes(x=date, y=max_temp))
geom_point(size=0.2)
geom_smooth(method="gam", formula=y~s(x, bs="cs"))
facet_wrap(~location)
theme(axis.text.x=element_text(angle=45, hjust=1))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
+
+
Page: 8 of 68
Data Science with R
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
14
12
20
10
20
08
20
20
20
14
Woomera
12
Wollongong
20
08
Uluru
20
14
20
12
20
10
20
08
20
14
20
12
20
10
20
08
14
20
date
Witchcliffe
Tuggeranong
10
Townsville
Williamtown
20
12
20
10
20
08
Watsonia
20
14
10
08
20
14
Walpole
20
12
20
08
20
20
10
WaggaWagga
50
0
-50
-100
-150
20
50
0
-50
-100
-150
12
50
0
-50
-100
-150
20
50
0
-50
-100
-150
DR
AF
50
0
-50
-100
-150
20
50
0
-50
-100
-150
T
Adelaide
50
0
-50
-100
-150
max_temp
Exploring Data with GGPlot2
Facet over Locations with Lines
20
3.4
Hands-On
Changing to lines using geom line() instead of points using geom points() might improve the
presentation over the big dots, somewhat. Though it is still a little dense.
ds[sobs,]
ggplot(aes(x=date, y=max_temp))
geom_line()
geom_smooth(method="gam", formula=y~s(x, bs="cs"))
facet_wrap(~location)
theme(axis.text.x=element_text(angle=45, hjust=1))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
+
+
Page: 9 of 68
Data Science with R
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
14
12
20
10
20
08
20
20
20
14
Woomera
12
Wollongong
20
08
Uluru
20
14
20
12
20
10
20
08
20
14
20
12
20
10
20
08
14
20
date
Witchcliffe
Tuggeranong
10
Townsville
Williamtown
20
12
20
10
20
08
Watsonia
20
14
10
08
20
14
Walpole
20
12
20
08
20
20
10
WaggaWagga
50
0
-50
-100
-150
20
50
0
-50
-100
-150
12
50
0
-50
-100
-150
20
50
0
-50
-100
-150
DR
AF
50
0
-50
-100
-150
20
50
0
-50
-100
-150
T
Adelaide
50
0
-50
-100
-150
max_temp
Exploring Data with GGPlot2
Facet over Locations with Thin Lines
20
3.5
Hands-On
Finally, we use a smaller point size to draw the lines. You can decide which looks best for the
data you have. Compare the use of lines here and above using geom point(). Often the decision
comes down to what you are aiming to achieve with the plots or what story you will tell.
ds[sobs,]
ggplot(aes(x=date, y=max_temp))
geom_line(size=0.1)
geom_smooth(method="gam", formula=y~s(x, bs="cs"))
facet_wrap(~location)
theme(axis.text.x=element_text(angle=45, hjust=1))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
+
+
Page: 10 of 68
Data Science with R
4
Hands-On
Exploring Data with GGPlot2
Histograms and Bar Charts
10000
5000
T
count
7500
DR
AF
2500
0
N
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW SW WSW
wind_dir_3pm
W WNW NW NNW
A histogram displays the frequency of observations using bars. Such plots are easy to display
using ggplot2, as we see in the above code used to generate the plot. Here the data= is identified
as ds and the x= aesthetic is wind dir 3pm. Using these parameters we then add a bar geometric
to build a bar plot for us.
The resulting plot shows the frequency of the levels of the categoric variable wind dir 3pm from
the dataset.
You will have noticed that we placed the plot at the top of the page so that as we turn over to
the next page in this module we get a bit of an animation that highlights what changes.
In reviewing the above plot we might note that it looks rather dark and drab, so we try to turn
it into an appealing graphic that draws us in to wanting to look at it and understand the story
it is telling.
p <- ds
ggplot(aes(x=wind_dir_3pm))
geom_bar()
p
Copyright © 2013-2014 [email protected]
%>%
+
Module: GGPlot2O
Page: 11 of 68
Data Science with R
4.1
Hands-On
Exploring Data with GGPlot2
Saving Plots to File
Once we have a plot displayed we can save the plot to file quite simply using ggsave(). The
format is determined automatically by the name of the file to which we save the plot. Here we
save the plot as a PDF:
ggsave("barchart.pdf", width=11, height=7)
The default width= and height= are those of the current plotting window. For saving the plot
above we have specified a particular width and height, presumably to suit our requirements for
placing the plot in a document or in some presentation.
Essentially, ggsave() is a wrapper around the various functions that save plots to files, and adds
some value to these default plotting functions.
The traditional approach is to initiate a device, print the plot, and the close the device.
T
pdf("barchart.pdf", width=11, height=7)
p
dev.off()
Note that for a pdf() device the default width= and height= are 7 (inches). Thus, for both of
the above examples we are widening the plot whilst retaining the height.
DR
AF
There is some art required in choosing a good width and height. By increasing the height or
width any text that is displayed on the plot, essentially stays the same size. Thus by increasing
the plot size, the text will appear smaller. By decreasing the plot size, the text becomes larger.
Some experimentation is often required to get the right size for any particular purpose.
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 12 of 68
Data Science with R
4.2
Hands-On
Exploring Data with GGPlot2
Bar Chart
10000
5000
T
count
7500
DR
AF
2500
0
N
ds
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW SW WSW
wind_dir_3pm
W WNW NW NNW
ggplot(aes(x=wind_dir_3pm))
geom_bar()
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
Page: 13 of 68
Data Science with R
4.3
Hands-On
Exploring Data with GGPlot2
Stacked Bar Chart
10000
count
7500
rain_tomorrow
No
5000
Yes
0
ds
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW
wind_dir_3pm
SW
WSW
W
WNW NW
NNW
DR
AF
N
T
2500
ggplot(aes(x=wind_dir_3pm, fill=rain_tomorrow))
geom_bar()
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
Page: 14 of 68
Data Science with R
4.4
Hands-On
Exploring Data with GGPlot2
Stacked Bar Chart with Reordered Legend
10000
count
7500
rain_tomorrow
Yes
5000
No
0
N
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW
wind_dir_3pm
T
2500
SW
WSW
W
WNW NW
NNW
DR
AF
A common annoyance with how the bar chart’s legend is arranged that it is stacked in the reverse
order to the stacks in the bar chart itself. This is basically understandable as the bars are plotted
in the same order as the levels of the factor. In this case that is:
levels(ds$rain_tomorrow)
## [1] "No"
"Yes"
So for each bar the count of the frequency of “No” is plotted first, and then the frequency of
“Yes”. Then the legend is listed in the same order, but from the top of the legend to the bottom.
Logically that makes sense, but visually it makes more sense to stack the bars and the legend in
the same order. We can do this simply with the guide= option.
ds
ggplot(aes(x=wind_dir_3pm, fill=rain_tomorrow))
geom_bar()
guides(fill=guide_legend(reverse=TRUE))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
Page: 15 of 68
Data Science with R
4.5
Hands-On
Exploring Data with GGPlot2
Dodged Bar Chart
8000
count
6000
rain_tomorrow
No
4000
Yes
0
N
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW
wind_dir_3pm
T
2000
SW
WSW
W
WNW
NW
NNW
ds
DR
AF
Notice for this version of the bar chart that the legend order is quite appropriate, and so no
reversing of it is required.
ggplot(aes(x=wind_dir_3pm, fill=rain_tomorrow))
geom_bar(position="dodge")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
Page: 16 of 68
Data Science with R
4.6
Hands-On
Exploring Data with GGPlot2
Stacked Bar Chart with Options
Rain Expected by Wind Direction at 3pm
10,000
Number of Days
7,500
Tomorrow
Rain Expected
5,000
No Rain Expected
2,500
N
NNE
NE
ENE
E
ESE
SE
SSE
S
T
0
SSW SW WSW
Wind Direction 3pm
2015-05-22 08:05:29
W
WNW NW NNW
DR
AF
Here we illustrate the stacked bar chart, but now include a variety of options to improve its
appearance. The options, of course, are infinite, and we will explore more options through this
chapter. Here though we use brewer.pal() from RColorBrewer (Neuwirth, 2014) to generate a
dark/light pair of colours that are softer in presentation. We change the order of the pair, so
dark blue is first and light blue second, and use only these first two colours from the generated
palette. We also include more informative labels through labs(), as well as using comma() from
scales (Wickham, 2014) for formatting the thousands on the y-axis.
library(scales)
library(RColorBrewer)
ds
ggplot(aes(x=wind_dir_3pm, fill=rain_tomorrow))
geom_bar()
scale_y_continuous(labels=comma)
scale_fill_manual(values=brewer.pal(4, "Paired")[c(2,1)]
guide=guide_legend(reverse=TRUE)
labels=c("No Rain Expected", "Rain Expected"))
labs(title="Rain Expected by Wind Direction at 3pm"
x=paste0("Wind Direction 3pm\n", Sys.time())
y="Number of Days"
fill="Tomorrow")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
+
,
,
+
,
,
,
Page: 17 of 68
Data Science with R
4.7
Hands-On
Exploring Data with GGPlot2
Narrow Bars
10000
5000
T
count
7500
DR
AF
2500
0
N
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW SW WSW
wind_dir_3pm
W WNW NW NNW
There are many options available to change the appearance of the histogram to make it look
like almost anything we could want. In the following pages we will illustrate a few simpler
modifications.
This first example simply makes the bars narrower using the width= option. Here we make them
half width. Perhaps that helps to make the plot look less dark!
ds
ggplot(aes(wind_dir_3pm))
geom_bar(width=0.5)
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
Page: 18 of 68
Data Science with R
4.8
Hands-On
Exploring Data with GGPlot2
Full Width Bars
10000
5000
T
count
7500
DR
AF
2500
0
N
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW SW WSW
wind_dir_3pm
W WNW NW NNW
Going the other direction, the bars can be made to touch by specifying a full width with
width=1.
ds
ggplot(aes(wind_dir_3pm))
geom_bar(width=1)
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
Page: 19 of 68
Data Science with R
4.9
Hands-On
Exploring Data with GGPlot2
Full Width Bars with Borders
10000
5000
T
count
7500
DR
AF
2500
0
N
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW SW WSW
wind_dir_3pm
W WNW NW NNW
We can change the appearance by adding a blue border to the bars, using the colour= option.
By itself that would look a bit ugly, so we also fill the bars with a grey rather than a black fill.
We can play with different colours to achieve a pleasing and personalised result.
ds
ggplot(aes(wind_dir_3pm))
geom_bar(width=1, colour="blue", fill="grey")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
Page: 20 of 68
Data Science with R
4.10
Hands-On
Exploring Data with GGPlot2
Coloured Histogram Without a Legend
10000
5000
T
count
7500
DR
AF
2500
0
N
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW SW WSW
wind_dir_3pm
W WNW NW NNW
Now we really add a flamboyant streak to our plot by adding quite a spread of colour. To do so
we simply specify a fill= aesthetic to be controlled by the values of the variable wind dir 3pm
which of course is the variable being plotted on the x-axis. A good set of colours is chosen by
default.
We add a theme() to remove the legend that would be displayed by default, by indicating that
the legend.position= is none.
ds
ggplot(aes(wind_dir_3pm, fill=wind_dir_3pm))
geom_bar()
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
Page: 21 of 68
Data Science with R
4.11
Hands-On
Exploring Data with GGPlot2
Comma Formatted Labels
10,000
5,000
T
count
7,500
DR
AF
2,500
0
N
NNE
NE
ENE
E
ESE
SE
SSE
S
SSW SW WSW W WNW NW NNW
wind_dir_3pm
Since ggplot2 Version 0.9.0 the scales (Wickham, 2014) package has been introduced to handle
many of the scale operations, in such a way as to support base and lattice graphics, as well as
ggplot2 graphics. Scale operations include position guides, as in the axes, and aesthetic guides,
as in the legend.
Notice that the y-axis has numbers using commas to separate the thousands. This is always a
good idea as it assists us in quickly determining the magnitude of the numbers we are looking
at. As a matter of course, I recommend we always use commas in plots (and tables). We do this
through scale y continuous() and indicating labels= to include a comma.
ds
ggplot(aes(wind_dir_3pm, fill=wind_dir_3pm))
geom_bar()
scale_y_continuous(labels=comma)
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
+
Page: 22 of 68
Data Science with R
4.12
Hands-On
Exploring Data with GGPlot2
Dollar Formatted Labels
$10,000
$7,500
T
$5,000
DR
AF
$2,500
$0
N
NNE NE
ENE
E
ESE
SE
SSE
S
SSW SW WSW W WNW NW NNW
wind_dir_3pm
Though we are not actually plotting currency here, one of the supported scales is dollar. Choosing this will add the comma but also prefix the amount with $.
ds
ggplot(aes(wind_dir_3pm, fill=wind_dir_3pm))
geom_bar()
scale_y_continuous(labels=dollar)
labs(y="")
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
+
+
Page: 23 of 68
Data Science with R
4.13
Hands-On
Exploring Data with GGPlot2
Multiple Bars
30
T
temp_3pm
20
DR
AF
10
0
Adelaide
Albany
AliceSprings
BadgerysCreek
AlburyBallarat
Bendigo
Brisbane
Cairns
Canberra
CoffsHarbour
Cobar
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
MelbourneAirport
Melbourne
Mildura
MountGambier
Moree
MountGinini
Newcastle
NorahHead
NorfolkIsland
NhilNuriootpa
PearceRAAF
Penrith
PerthAirport
Perth
Portland
Richmond
SalmonGums
Sale
SydneyAirport
Sydney
Townsville
Tuggeranong
WaggaWagga
Uluru
Walpole
Watsonia
Williamtown
Witchcliffe
Wollongong
Woomera
location
Here we see another interesting plot, showing the mean temperature at 3pm for each location
in the dataset. However, we notice that the location labels overlap and are quite a mess.
Something has to be done about that.
ds
ggplot(aes(x=location, y=temp_3pm, fill=location))
stat_summary(fun.y="mean", geom="bar")
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
Page: 24 of 68
Data Science with R
4.14
Hands-On
Exploring Data with GGPlot2
Rotated Labels
30
temp_3pm
20
DR
AF
T
10
Adelaide
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
Townsville
Tuggeranong
Uluru
WaggaWagga
Walpole
Watsonia
Williamtown
Witchcliffe
Wollongong
Woomera
0
location
The obvious solution is to rotate the labels. We achieve this through modifying the theme(),
setting the axis.text= to be rotated 90◦ .
ds
ggplot(aes(location, temp_3pm, fill=location))
stat_summary(fun.y="mean", geom="bar")
theme(legend.position="none",
axis.text.x=element_text(angle=90))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
Page: 25 of 68
Data Science with R
Exploring Data with GGPlot2
Horizontal Histogram
T
Woomera
Wollongong
Witchcliffe
Williamtown
Watsonia
Walpole
WaggaWagga
Uluru
Tuggeranong
Townsville
SydneyAirport
Sydney
SalmonGums
Sale
Richmond
Portland
PerthAirport
Perth
Penrith
PearceRAAF
Nuriootpa
NorfolkIsland
NorahHead
Nhil
Newcastle
MountGinini
MountGambier
Moree
Mildura
MelbourneAirport
Melbourne
Launceston
Katherine
Hobart
GoldCoast
Darwin
Dartmoor
CoffsHarbour
Cobar
Canberra
Cairns
Brisbane
Bendigo
Ballarat
BadgerysCreek
AliceSprings
Albury
Albany
Adelaide
DR
AF
location
4.15
Hands-On
0
10
temp_3pm
20
30
Alternatively here we flip the coordinates and produce a horizontal histogram.
ds
ggplot(aes(location, temp_3pm, fill=location))
stat_summary(fun.y="mean", geom="bar")
theme(legend.position="none")
coord_flip()
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
+
+
+
Page: 26 of 68
Data Science with R
Exploring Data with GGPlot2
Reorder the Levels
T
Adelaide
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
Townsville
Tuggeranong
Uluru
WaggaWagga
Walpole
Watsonia
Williamtown
Witchcliffe
Wollongong
Woomera
DR
AF
location
4.16
Hands-On
0
10
temp_3pm
20
30
We also want to have the labels in alphabetic order which makes the plot more accessible. This
requires we reverse the order of the levels in the original dataset. We do this and save the result
into another dataset so as to revert to the original dataset when appropriate below.
ds
mutate(location=factor(location, levels=rev(levels(location))))
ggplot(aes(location, temp_3pm, fill=location))
stat_summary(fun.y="mean", geom="bar")
theme(legend.position="none")
coord_flip()
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
+
Page: 27 of 68
Data Science with R
Exploring Data with GGPlot2
Plot the Mean with CI
T
Adelaide
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
Townsville
Tuggeranong
Uluru
WaggaWagga
Walpole
Watsonia
Williamtown
Witchcliffe
Wollongong
Woomera
DR
AF
location
4.17
Hands-On
0
10
20
temp_3pm
30
Here we add a confidence interval around the mean.
ds
mutate(location=factor(location, levels=rev(levels(location))))
ggplot(aes(location, temp_3pm, fill=location))
stat_summary(fun.y="mean", geom="bar")
stat_summary(fun.data="mean_cl_normal", geom="errorbar",
conf.int=0.95, width=0.35)
theme(legend.position="none")
coord_flip()
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
+
+
Page: 28 of 68
Data Science with R
Exploring Data with GGPlot2
Text Annotations
Adelaide
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
Townsville
Tuggeranong
Uluru
WaggaWagga
Walpole
Watsonia
Williamtown
Witchcliffe
Wollongong
Woomera
2222
2222
2222
2191
2222
2222
2222
2191
2191
2191
2222
760
2222
2191
2191
2191
2222
2222
2222
760
2375
2375
2618
2375
2375
2375
2375
T
2186
2191
2191
2191
2221
2191
2191
2191
2191
2183
DR
AF
location
4.18
Hands-On
760
0
1000
count
2191
2222
2221
2526
2191
2188
2191
2191
2191
2222
2191
2000
It would be informative to also show the actual numeric values on the plot. This plot shows the
counts.
ds
mutate(location=factor(location, levels=rev(levels(location))))
ggplot(aes(location, fill=location))
geom_bar(width=1, colour="white")
theme(legend.position="none")
coord_flip()
geom_text(stat="bin", color="white", hjust=1.0, size=3,
aes(y=..count.., label=..count..))
%>%
%>%
+
+
+
+
Exercise: Instead of plotting the counts, plot the mean temp 3pm, and include the textual
value.
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 29 of 68
Data Science with R
Exploring Data with GGPlot2
Text Annotations with Commas
Adelaide
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
Townsville
Tuggeranong
Uluru
WaggaWagga
Walpole
Watsonia
Williamtown
Witchcliffe
Wollongong
Woomera
2,375
2,222
2,222
2,222
2,191
2,222
2,222
2,375
2,222
2,191
2,191
2,191
2,618
2,375
2,222
2,375
760
2,222
2,375
2,191
2,191
2,191
2,222
2,222
2,222
2,186
2,191
2,191
2,191
2,221
2,375
2,191
2,191
2,191
2,191
2,183
2,526
2,191
2,222
2,221
T
760
DR
AF
location
4.19
Hands-On
760
0
1000
count
2,191
2,188
2,191
2,191
2,191
2,222
2,191
2000
A small variation is to add commas to the numeric annotations to separate the thousands (Will
Beasley 121230 email).
ds
mutate(location=factor(location, levels=rev(levels(location))))
ggplot(aes(location, fill=location))
geom_bar(width=1, colour="white")
theme(legend.position="none")
coord_flip()
geom_text(stat="bin", color="white", hjust=1.0, size=3,
aes(y=..count.., label=scales::comma(..count..)))
%>%
%>%
+
+
+
+
Exercise: Do this without having to use scales::, perhaps using a format.
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 30 of 68
Data Science with R
5
Hands-On
Exploring Data with GGPlot2
Density Distributions
density
0.04
DR
AF
T
0.02
0.00
0
ds
10
20
max_temp
30
40
ggplot(aes(x=max_temp))
geom_density()
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
50
%>%
+
Page: 31 of 68
Data Science with R
5.1
Hands-On
Exploring Data with GGPlot2
Skewed Distributions
0.20
0.10
T
density
0.15
DR
AF
0.05
0.00
0
100
200
rainfall
300
This plot certainly informs us about the skewed nature of the amount of rainfall recorded for
any one day, but we lose a lot of resolution at the low end.
Note that we use a subset of the dataset to include only those observations of rainfall (i.e., where
the rainfall is non-zero). Otherwise a warning will note many rows contain non-finite values in
calculating the density statistic.
ds
filter(rainfall != 0)
ggplot(aes(x=rainfall))
geom_density()
scale_y_continuous(labels=comma)
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
+
Page: 32 of 68
Data Science with R
5.2
Hands-On
Exploring Data with GGPlot2
Log Transform X Axis
0.75
T
density
0.50
DR
AF
0.25
0.00
1
rainfall
100
A common approach to dealing with the skewed distribution is to transform the scale to be a
logarithmic scale, base 10.
ds
filter(rainfall != 0)
ggplot(aes(x=rainfall))
geom_density()
scale_x_log10()
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
+
Page: 33 of 68
Data Science with R
5.3
Hands-On
Exploring Data with GGPlot2
Log Transform X Axis with Breaks
0.75
T
density
0.50
DR
AF
0.25
0.00
1
ds
rainfall
10
100
filter(rainfall != 0)
ggplot(aes(x=rainfall))
geom_density(binwidth=0.1)
scale_x_log10(breaks=c(1, 10, 100), labels=comma)
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
+
Page: 34 of 68
Data Science with R
5.4
Hands-On
Exploring Data with GGPlot2
Log Transform X Axis with Ticks
0.75
T
density
0.50
DR
AF
0.25
0.00
1
ds
rainfall
10
100
filter(rainfall != 0)
ggplot(aes(x=rainfall))
geom_density(binwidth=0.1)
scale_x_log10(breaks=c(1, 10, 100), labels=comma)
annotation_logticks(sides="bt")
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
+
+
Page: 35 of 68
Data Science with R
5.5
Hands-On
Exploring Data with GGPlot2
Log Transform X Axis with Custom Label
0.75
T
density
0.50
DR
AF
0.25
0.00
1mm
10mm
rainfall
100mm
mm <- function(x) { sprintf("%smm", x) }
ds
filter(rainfall != 0)
ggplot(aes(x=rainfall))
geom_density()
scale_x_log10(breaks=c(1, 10, 100), labels=mm)
annotation_logticks(sides="bt")
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
+
+
Page: 36 of 68
Data Science with R
5.6
Hands-On
Exploring Data with GGPlot2
Transparent Plots
0.20
0.15
Canberra
Darwin
Melbourne
0.10
Sydney
T
density
location
DR
AF
0.05
0.00
10
20
temp_3pm
30
40
cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
ds
filter(location %in% cities)
ggplot(aes(temp_3pm, colour = location, fill = location))
geom_density(alpha = 0.55)
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
Page: 37 of 68
Data Science with R
6
Hands-On
Exploring Data with GGPlot2
Box Plot Distributions
50
40
max_temp
30
20
T
10
DR
AF
0
2007
2008
2009
2010
2011
year
2012
2013
2014
2015
A box plot, also known as a box and whiskers plot, shows the median (the second quartile)
within a box which extends to the first and third quartiles. We note that each quartile delimits
one quarter of the dataset and hence the box itself contains half the dataset.
Colour is added simply to improve the visual appeal of the plot rather than to convey new
information. Since we include fill= we also turn off the otherwise included legend.
Here we observe the overall change in the maximum temperature over the years. Notice the first
and last plots which probably reflect truncated data, providing motivation to confirm this in the
data, before making significant statements regarding these observations.
ds
mutate(year=factor(format(ds$date, "%Y")))
ggplot(aes(x=year, y=max_temp, fill=year))
geom_boxplot(notch=TRUE)
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
Page: 38 of 68
Data Science with R
6.1
Hands-On
Exploring Data with GGPlot2
Violin Plot
50
40
max_temp
30
20
T
10
DR
AF
0
2007
2008
2009
2010
2011
year
2012
2013
2014
2015
A violin plot is another interesting way to present a distribution, using a shape that resembles a
violin.
We again use colour to improve the visual appeal of the plot.
ds
mutate(year=factor(format(ds$date, "%Y")))
ggplot(aes(x=year, y=max_temp, fill=year))
geom_violin()
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
Page: 39 of 68
Data Science with R
6.2
Hands-On
Exploring Data with GGPlot2
Violin with Box Plot
50
40
max_temp
30
20
T
10
DR
AF
0
2007
2008
2009
2010
2011
year
2012
2013
2014
2015
We can overlay the violin plot with a box plot to show the quartiles.
ds
mutate(year=factor(format(ds$date, "%Y")))
ggplot(aes(x=year, y=max_temp, fill=year))
geom_violin()
geom_boxplot(width=.5, position=position_dodge(width=0))
theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
+
Page: 40 of 68
Data Science with R
50
40
30
20
10
0
max_temp
50
40
30
20
10
0
50
40
30
20
10
0
50
40
30
20
10
0
50
40
30
20
10
0
Adelaide
Albany
Albury
AliceSprings
BadgerysCreek
Ballarat
Bendigo
Brisbane
Cairns
Canberra
Cobar
CoffsHarbour
Dartmoor
Darwin
GoldCoast
Hobart
Katherine
Launceston
Melbourne
MelbourneAirport
Mildura
Moree
MountGambier
MountGinini
Newcastle
Nhil
NorahHead
NorfolkIsland
Nuriootpa
PearceRAAF
Penrith
Perth
PerthAirport
Portland
Richmond
Sale
SalmonGums
Sydney
SydneyAirport
Townsville
Tuggeranong
Uluru
WaggaWagga
Walpole
Watsonia
Williamtown
Witchcliffe
Wollongong
Woomera
20
2007
2008
2009
2010
2011
2012
2013
2014
1
20 5
2007
2008
2009
2010
2011
2012
2013
2014
1
20 5
2007
2008
2009
2010
2011
2012
2013
2014
1
20 5
2007
2008
2009
2010
2011
2012
2013
2014
1
20 5
2007
2008
2009
2010
2011
2012
2013
2014
1
20 5
2007
2008
2009
2010
2011
2012
2013
2014
1
20 5
2007
2008
2009
2010
2011
2012
2013
2014
15
50
40
30
20
10
0
Violin with Box Plot by Location
T
50
40
30
20
10
0
Exploring Data with GGPlot2
DR
AF
6.3
Hands-On
year
We can readily split the plot across the locations. Things get a little crowded, but we get an
overall view across all of the different weather stations. Notice we also rotated the x-axis labels
so that they don’t overlap.
We can immediately see one of the issues with this dataset, noting that three weather stations
have fewer observations that then others.
Various other observations are also interesting. Some locations have little variation in their
maximum temperatures over the years.
ds
mutate(year=factor(format(ds$date, "%Y")))
ggplot(aes(x=year, y=max_temp, fill=year))
geom_violin()
geom_boxplot(width=.5, position=position_dodge(width=0))
theme(legend.position="none",
axis.text.x=element_text(angle=45, hjust=1))
facet_wrap(~location)
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
+
+
+
Page: 41 of 68
Data Science with R
Exploring Data with GGPlot2
Misc Plots
DR
AF
T
7
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 42 of 68
Data Science with R
7.1
Hands-On
Exploring Data with GGPlot2
Waterfall Chart
A Waterfall Chart (also known as a cascade chart) is useful in visualizing the breakdown of a
measure into some components that are introduced to reduce the original value. An example
here is of the cash flow for a small business over a year. https://learnr.wordpress.com/2010/
05/10/ggplot2-waterfall-charts/.
We will invent some simple data to illustrate. Imagine we have a spreadsheet with a companies
starting cash balance for the year and then a series of items that are either cash flows in or out.
Notice we force an ordering on the items using base::factor(). Also note the technique to sum
the vector of items to add the final balance to the vector using the pipe operator.
items <- c("Start", "IT", "Sales", "Adv", "Elect",
"Tax", "Salary", "Rebate",
"Advise", "Consult", "End")
amounts <- c(30, -5, 15, -8, -4, -7, 11, 8, -4, 20) %>% c(., sum(.))
cash <- data.frame(Transaction=factor(items, levels=items), Amount=amounts)
cash
T
Transaction Amount
Start
30
IT
-5
Sales
15
Adv
-8
Elect
-4
DR
AF
##
## 1
## 2
## 3
## 4
## 5
....
levels(cash$Transaction)
##
##
[1] "Start"
[8] "Rebate"
"IT"
"Advise"
"Sales"
"Adv"
"Consult" "End"
"Elect"
"Tax"
"Salary"
We’ll record whether the flow is in or out as a conveinence for us in the plot.
cash %<>% mutate(Direction=ifelse(Transaction %in% c("Start", "End"),
"Net", ifelse(Amount>0, "Credit", "Debit")) %>%
factor(levels=c("Debit", "Credit", "Net")))
We now need to calculate the locations of the rectangles along the x-axis. We can count the
transactions from 1 and use this as an offset for specifying where to draw the rectangle. Then
we need to calculate the y value for the rectangles. This is based on the cumulative sum.
cash %<>% mutate(id=seq_along(Transaction),
end=c(head(cumsum(Amount), -1), tail(Amount, 1)),
start=c(0, head(end, -2), 0))
cash
##
##
##
##
##
##
1
2
3
4
5
Transaction Amount Direction id end start
Start
30
Net 1 30
0
IT
-5
Debit 2 25
30
Sales
15
Credit 3 40
25
Adv
-8
Debit 4 32
40
Elect
-4
Debit 5 28
32
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 43 of 68
Data Science with R
Hands-On
Exploring Data with GGPlot2
....
cash %>%
ggplot(aes(fill=Direction)) +
geom_rect(aes(x=Transaction,
xmin=id-0.45, xmax=id+0.45,
ymin=end, ymax=start)) +
geom_text(aes(x=id, y=start, label=Transaction),
size=3,
vjust=ifelse(cash$Direction == "Debit", -0.2, 1.2)) +
geom_text(aes(x=id, y=start, label=Amount),
size=3, colour="white",
vjust=ifelse(cash$Direction == "Debit", 1.3, -0.3)) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
ylab("Balance in Thousand Dollars") +
ggtitle("Annual Cash Flow Analysis")
DR
AF
T
Annual Cash Flow Analysis
Balance in Thousand Dollars
Advise
-4
Adv
-8
40
20
Consult
15
Sales
Direction
Debit
Tax
-7
Credit
Net
11
Salary
20
0
8
Rebate
Elect
-4
IT
-5
30
Start
56
End
Transaction
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 44 of 68
Data Science with R
8
Hands-On
Pairs Plot: Using ggpairs()
NNE
NE
ENE
ESE
SE
SSE
SSW
S SW
WSW
WNW
W NW
NNW
20
MinTemp
Exploring Data with GGPlot2
15
Corr:
0.744
10
5
No
Corr:
0.635
Corr:
0.00527
Corr:
0.132
Corr:
0.675
Corr:
0.446
Corr:
-0.141
Corr:
0.311
Corr:
-0.104
Yes
No
Yes
No
Yes
0
MaxTemp
-5
30
20
Evaporation
10
10
5
Corr:
-0.657
T
5
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
5
4
3
2
1
0
8
6
4
2
RainToday
0
10
MinTemp
20
10
20
30
MaxTemp
0
5
10
Evaporation
15 0
5
10
Sunshine
NNE
N NE
ENE
ESE
SE
SSE
SSW
S SW
WSW
WNW
WNW
NNW0
WindDir9am
2
4
Yes
15
10
5
0
15
10
5
0
No
RainTomorrow
0
15
10
5
0
15
10
5
0
DR
AF
Cloud3pm
10
NNE
NE
ENE
ESE
SE
SSE
SSW
S SW
WSW
WNW
W NW
NNW
WindDir9am
Sunshine
0
6
Cloud3pm
8
No
Yes
RainToday
RainTomorrow
A sophisticated pairs plot, as delivered by ggpairs() from GGally (Schloerke et al., 2014), brings
together many of the kinds of plots we have already seen to compare pairs of variables. Here we
have used it to plot the weather dataset from rattle (Williams, 2015). We see scatter plotes,
histograms, and box plots in the pairs plot.
weather[c(3,4,6,7,10,19,22,24)]
na.omit()
ggpairs(params=c(shape=I("."), outlier.shape=I(".")))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
Page: 45 of 68
Data Science with R
9
Hands-On
Exploring Data with GGPlot2
Cumulative Distribution Plot
Cumulative Rainfall
12,500
millimetres
10,000
location
7,500
Adelaide
Canberra
Darwin
5,000
Hobart
0
2010
2011
2012
date
2013
2014
2015
DR
AF
2009
T
2,500
Here we show the cumulative sum of a variable for different locations.
cities <- c("Adelaide", "Canberra", "Darwin", "Hobart")
ds
filter(location %in% cities & date >= "2009-01-01")
group_by(location)
mutate(cumRainfall=order_by(date, cumsum(rainfall)))
ggplot(aes(x=date, y=cumRainfall, colour=location))
geom_line()
ylab("millimetres")
scale_y_continuous(labels=comma)
ggtitle("Cumulative Rainfall")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
%>%
%>%
+
+
+
+
Page: 46 of 68
Data Science with R
10
Hands-On
Exploring Data with GGPlot2
Parallel Coordinates Plot
5.0
value
2.5
0.0
ax
DR
AF
m
m
in
_t
em
p
_t
em
p
ra
in
fa
ll
ev
ap
or
at
io
su
n
ns
w
h
in
in
d_
e
gu
st
w
in
_s
d_
pe
sp
ed
ee
w
in
d
d_
_9
sp
am
ee
hu
d_
m
3p
id
m
it
hu y_9
am
m
id
ity
pr
_3
es
pm
su
re
pr
_9
es
am
su
re
_
3p
cl
ou
m
d_
9a
cl
m
ou
d_
3p
te
m
m
p_
9a
te
m
m
p_
3p
m
ris
k_
m
m
T
-2.5
variable
A parallel coordinates plot provides a graphical summary of multivariate data. Generally it
works best with 10 or fewer variables. The variables will be scaled to a common range (e.g., by
performing a z-score transform which subtracts the mean and divides by the standard deviation)
and each observation is plot as a line running horizontally across the plot.
One of the main uses of parallel coordinate plots is to identify common groups of observations
through their common patterns of variable values. We can use parallel coordinate plots to identify
patterns of systematic relationships between variables over the observations.
For more than about 10 variables consider the use of heatmaps or fluctuation plots.
Here we use ggparcoord() from GGally (Schloerke et al., 2014) to generate the parallel coordinates plot. The underlying plotting is performed by ggplot2 (Wickham and Chang, 2015).
library(GGally)
cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
ds
%>%
%>%
+
filter(location %in% cities & rainfall>75)
ggparcoord(columns=numi)
theme(axis.text.x=element_text(angle=45))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 47 of 68
Data Science with R
10.1
Hands-On
Exploring Data with GGPlot2
Labels Aligned
5.0
value
2.5
0.0
DR
AF
m
in
_t
em
m
p
ax
_t
em
p
ra
i
n
ev
f
ap all
or
at
io
n
su
w
in
n
d_
sh
gu
in
e
w
st
in
_
sp
d_
e
sp
ee ed
w
in
d
d_
_9
sp
am
ee
d
_3
hu
pm
m
id
ity
_9
hu
am
m
id
ity
pr
_3
es
su pm
re
pr
_9
es
am
su
re
_3
pm
cl
ou
d_
9a
cl
m
ou
d_
3p
te
m
m
p_
9a
te
m
m
p_
3p
m
ris
k_
m
m
T
-2.5
variable
We notice the labels are by default aligned by their centres. Rotating 45◦ causes the labels to
sit over the plot region. We can ask the labels to be aligned at the top edge instead, using
hjust=1.
ds
filter(location %in% cities & rainfall>75)
ggparcoord(columns=numi)
theme(axis.text.x=element_text(angle=45, hjust=1))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
Page: 48 of 68
Data Science with R
10.2
Hands-On
Exploring Data with GGPlot2
Colour and Label
5.0
location
Canberra
Darwin
Melbourne
Sydney
value
2.5
0.0
DR
AF
m
in
_t
em
m
p
ax
_t
em
p
ra
i
n
ev
f
ap all
or
at
io
n
su
w
in
n
d_
sh
gu
in
e
w
st
in
_
sp
d_
e
sp
ee ed
w
in
d
d_
_9
sp
am
ee
d
_3
hu
pm
m
id
ity
_9
hu
am
m
id
ity
pr
_3
es
su pm
re
pr
_9
es
am
su
re
_3
pm
cl
ou
d_
9a
cl
m
ou
d_
3p
te
m
m
p_
9a
te
m
m
p_
3p
m
ris
k_
m
m
T
-2.5
variable
Here we add some colour and force the legend to be within the plot rather than, by default,
reducing the plot size and adding the legend to the right.
We can discern just a little structure relating to locations. We have limited the data to those
days where more than 74mm of rain is recorded and clearly Darwin becomes prominent. Darwin
has many days with at least this much rain, Sydney has a few days and Canberra and Melbourne
only one day. We would confirm this with actual queries of the data. Apart from this the parallel
coordinates in this instance is not showing much structure.
The example code illustrates several aspects of placing the legend. We specified a specific location
within the plot itself, justified as top left, laid out horizontally, and transparent.
ds
filter(location %in% cities & rainfall>75)
ggparcoord(columns=numi, group="location")
theme(axis.text.x=element_text(angle=45, hjust=1),
legend.position=c(0.25, 1),
legend.justification=c(0.0, 1.0),
legend.direction="horizontal",
legend.background=element_rect(fill="transparent"),
legend.key=element_rect(fill="transparent",
colour="transparent"))
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
+
Page: 49 of 68
Data Science with R
11
Hands-On
Exploring Data with GGPlot2
Polar Coordinates
Jan
Dec
Feb
40
30
Nov
Mar
20
0
Oct
Apr
T
max
10
May
DR
AF
Sep
Aug
Jun
Jul
month
Here we use a polar coordinate to plot what is otherwise a bar chart. The data preparation begins
with converting the date to a month, then grouping by the month, after which we obtain the
maximum max temp for each group. The months are then ordered correctly. We then generate
a bar plot and place the plot on a polar coordinate. Not necessarily particularly easy to read,
and probably better to not use polar coordinates in this case!
ds
mutate(month=format(date, "%b"))
group_by(month)
summarise(max=max(max_temp))
mutate(month=factor(month, levels=month.abb))
ggplot(aes(x=month, y=max, group=1))
geom_bar(width=1, stat="identity", fill="orange", color="black")
coord_polar(theta="x", start=-pi/12)
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
%>%
%>%
%>%
%>%
%>%
+
+
Page: 50 of 68
Data Science with R
12
Hands-On
Exploring Data with GGPlot2
Multiple Plots: Using grid.arrange()
Here we illustrate the ability to layout multiple plots in a regular grid using grid.arrange()
from gridExtra (Auguie, 2012). We illustrate this with the weatherAUS dataset from rattle. We
generate a number of informative plots, using the plyr (?) package to aggregate the cumulative
rainfall. The scales (Wickham, 2014) package is used to provide labels with commas.
library(rattle)
library(ggplot2)
library(scales)
cities <- c("Adelaide", "Canberra", "Darwin", "Hobart")
dss <- ds %>%
filter(location %in% cities & date >= "2009-01-01") %>%
group_by(location) %>%
mutate(cumRainfall=order_by(date, cumsum(rainfall)))
<<<<<-
ggplot(dss, aes(x=date, y=cumRainfall, colour=location))
p + geom_line()
p1 + ylab("millimetres")
p1 + scale_y_continuous(labels=comma)
p1 + ggtitle("Cumulative Rainfall")
p2
p2
p2
p2
<<<<-
ggplot(dss, aes(x=date, y=max_temp, colour=location))
p2 + geom_point(alpha=.1)
p2 + geom_smooth(method="loess", alpha=.2, size=1)
p2 + ggtitle("Fitted Max Temperature Curves")
DR
AF
T
p
p1
p1
p1
p1
p3 <- ggplot(dss, aes(x=pressure_3pm, colour=location))
p3 <- p3 + geom_density()
p3 <- p3 + ggtitle("Air Pressure at 3pm")
p4
p4
p4
p4
p4
<<<<<-
ggplot(dss, aes(x=sunshine, fill=location))
p4 + facet_grid(location ~ .)
p4 + geom_histogram(colour="black", binwidth=1)
p4 + ggtitle("Hours of Sunshine")
p4 + theme(legend.position="none")
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 51 of 68
Data Science with R
13
Hands-On
Exploring Data with GGPlot2
Multiple Plots: Using grid.arrange()
Cumulative Rainfall
12,500
Fitted Max Temperature Curves
40
10,000
7,500
Adelaide
Canberra
5,000
Darwin
Hobart
location
max_temp
millimetres
location
2,500
Adelaide
30
Canberra
Darwin
20
Hobart
10
0
2010
2011
2012
date
2013
2014
2015
2009
Air Pressure at 3pm
Canberra
Darwin
0.05
Hobart
1000
1020
pressure_3pm
1040
2013
2014
2015
Hours of Sunshine
0
DR
AF
980
date
5
sunshine
Hobart
0.00
count
Adelaide
2012
Darwin
density
location
2011
Adelaide Canberra
0.10
1200
800
400
0
1200
800
400
0
1200
800
400
0
1200
800
400
0
2010
T
2009
10
15
library(gridExtra)
grid.arrange(p1, p2, p3, p4)
The actual plots are arranged by grid.arrange().
A collection of plots such as this can be quite informative and effective in displaying the information efficiently. We can see that Darwin is quite a stick out. It is located in the tropics, whereas
the remaining cities are in the southern regions of Australia.
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 52 of 68
Data Science with R
Exploring Data with GGPlot2
Multiple Plots: Alternative Arrangements
DR
AF
T
14
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 53 of 68
Data Science with R
Exploring Data with GGPlot2
Multiple Plots: Sharing a Legend
DR
AF
T
15
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 54 of 68
Data Science with R
Exploring Data with GGPlot2
Multiple Plots: 2D Histogram
DR
AF
T
16
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 55 of 68
Data Science with R
Exploring Data with GGPlot2
Multiple Plots: 2D Histogram Plot
DR
AF
T
17
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 56 of 68
Data Science with R
Exploring Data with GGPlot2
Multiple Plots: Using layOut()
DR
AF
T
18
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 57 of 68
Data Science with R
Exploring Data with GGPlot2
Arranging Plots with layOut()
DR
AF
T
18.1
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 58 of 68
Data Science with R
Exploring Data with GGPlot2
Arranging Plots Through Viewports
DR
AF
T
18.2
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 59 of 68
Data Science with R
Exploring Data with GGPlot2
Plot
DR
AF
T
18.3
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 60 of 68
Data Science with R
Exploring Data with GGPlot2
Plotting a Table
DR
AF
T
19
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 61 of 68
Data Science with R
Exploring Data with GGPlot2
Plotting a Table and Text
DR
AF
T
20
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 62 of 68
Data Science with R
Exploring Data with GGPlot2
Interactive Plot Building
DR
AF
T
21
Hands-On
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 63 of 68
Data Science with R
22
Hands-On
Exploring Data with GGPlot2
Having Some Fun with xkcd
max_temp
40
30
20
10
0
10
min_temp
20
30
T
0
For a bit of fun we can generate plots in the style of the popular online comic strip xkcd using
xkcd (Manzanera, 2014).
DR
AF
On Ubuntu we can install install the required fonts with the following steps:
library(extrafont)
download.file("http://simonsoftware.se/other/xkcd.ttf", dest="xkcd.ttf")
system("mkdir ~/.fonts")
system("mv xkcd.ttf ~/.fonts")
font_import()
loadfonts()
The installation can be uninstalled with:
remove.packages(c("extrafont","extrafontdb"))
We can then generate a roughly yet neatly drawn plot, as if it might have been drawn by the
hand of the author of the comic strip.
library(xkcd)
xrange <- range(ds$min_temp)
yrange <- range(ds$max_temp)
ds[sobs,]
ggplot(aes(x=min_temp, y=max_temp))
geom_point()
xkcdaxis(xrange, yrange)
Copyright © 2013-2014 [email protected]
%>%
+
+
Module: GGPlot2O
Page: 64 of 68
Data Science with R
22.1
Hands-On
Exploring Data with GGPlot2
Bar Chart
Temperature Degrees Celcius
50
40
30
20
10
0
2010
2012
Year
We can do a bar chart in the same style.
DR
AF
library(xkcd)
library(scales)
library(lubridate)
2014
T
2008
nds <- ds
mutate(year=year(ds$date))
group_by(year)
summarise(max_temp=max(max_temp))
mutate(xmin=year-0.1, xmax=year+0.1, ymin=0, ymax=max_temp)
%>%
%>%
%>%
%>%
mapping <- aes(xmin=xmin, ymin=ymin, xmax=xmax, ymax=ymax)
ggplot()
xkcdrect(mapping, nds)
xkcdaxis(c(2006.8, 2014.2), c(0, 50))
labs(x="Year", y="Temperature Degrees Celcius")
scale_y_continuous(labels=comma)
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
+
+
+
+
Page: 65 of 68
Data Science with R
23
Hands-On
Exploring Data with GGPlot2
Exercises
Exercise 1
Dashboard
Your copmuter collects various information about its usage. These are often called log files.
Identify some log files that you have access to on your computer. Build a one page PDF dashboard
using KnitR and ggplot2 to fill the page with information plots. Be sure to include some of the
following types of plots. (Compare this to my bin/dashboard.sh script).
1. Bar plot
2. Line chart
DR
AF
T
3. etc.
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 66 of 68
Data Science with R
24
Hands-On
Exploring Data with GGPlot2
Further Reading and Acknowledgements
The Rattle Book, published by Springer, provides a comprehensive
introduction to data mining and analytics using Rattle and R.
It is available from Amazon. Other documentation on a broader
selection of R topics of relevance to the data scientist is freely
available from http://datamining.togaware.com, including the
Datamining Desktop Survival Guide.
This chapter is one of many chapters available from http://
HandsOnDataScience.com. In particular follow the links on the
website with a * which indicates the generally more developed chapters.
Other resources include:
ˆ The GGPlot2 documentation is quite extensive and useful
DR
AF
T
ˆ The R Cookbook is a great resource explaining how to do many types of plots using ggplot2.
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 67 of 68
Data Science with R
25
Hands-On
Exploring Data with GGPlot2
References
Auguie B (2012). gridExtra: functions in Grid graphics. R package version 0.9.1, URL http:
//CRAN.R-project.org/package=gridExtra.
Manzanera ET (2014). xkcd: Plotting ggplot2 graphics in a XKCD style. R package version
0.0.3, URL http://CRAN.R-project.org/package=xkcd.
Neuwirth E (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2, URL
http://CRAN.R-project.org/package=RColorBrewer.
R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Schloerke B, Crowley J, Cook D, Hofmann H, Wickham H, Briatte F, Marbach M, Thoen E
(2014). GGally: Extension to ggplot2. R package version 0.5.0, URL http://CRAN.R-project.
org/package=GGally.
T
Wickham H (2014). scales: Scale functions for graphics. R package version 0.2.4, URL http:
//CRAN.R-project.org/package=scales.
Wickham H, Chang W (2015). ggplot2: An Implementation of the Grammar of Graphics. R
package version 1.0.1, URL http://CRAN.R-project.org/package=ggplot2.
DR
AF
Williams GJ (2009). “Rattle: A Data Mining GUI for R.” The R Journal, 1(2), 45–55. URL
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.
Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge
discovery. Use R! Springer, New York.
Williams GJ (2015). rattle: Graphical User Interface for Data Mining in R. R package version
3.4.2, URL http://rattle.togaware.com/.
This document, sourced from GGPlot2O.Rnw revision 779, was processed by KnitR version 1.9 of
2015-01-20 and took 67.4 seconds to process. It was generated by gjw on theano running Ubuntu
14.04.2 LTS with Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz having 4 cores and 3.9GB of
RAM. It completed the processing 2015-05-22 08:06:18.
Copyright © 2013-2014 [email protected]
Module: GGPlot2O
Page: 68 of 68