Workshop 1: Getting R to work for you

Workshop 1: Getting R to work for you
R for Landscape Ecology Workshop Series
Fall 2011
Paul Galpern
NRI, University of Manitoba
Contents
1 Workshop overview
2
2 First steps in R
2.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3 Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
3
4
4
3 Help with help
3.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
5
5
4 R Objects
4.1 Example . . . . . . . .
4.1.1 Simple objects
4.1.2 Matrices . . . .
4.1.3 Data frames . .
4.2 Exercise . . . . . . . .
4.3 Challenge . . . . . . .
5 Manipulating
5.1 Example .
5.2 Exercise .
5.3 Challenge
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
6
6
6
8
9
11
11
real data
12
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1
1
Workshop overview
For each of these three workshops I will prepare a document like this one. These
will serve as a guide for us, and a record for you of what we did.
The intended structure of our workshop will be: (1) Paul illustrates an example;
(2) you complete a short exercise based on the example, while Paul is on-hand
to assist; (3) if there is time, you can do the optional challenge which takes
the exercise a step further. Please don’t hesitate to email me if you have R
questions!
There are also many useful resources online. University of Manitoba has over
50 R specific e-books in its library that are ready to download. Click here for a
complete list. These are some of the more intriguing titles that relate to ecology,
and only a sample of analyses that are fully-documented and ready to run in R.
1. A beginner’s guide to R
2. A tiny handbook of R
3. Introductory statistics with R
4. A handbook of statistical analyses using R
5. Using R for data management, statistical analysis, and graphics
6. Numerical ecology with R
7. A practical guide to ecological modelling
8. Mixed effects models and extensions in ecology with R
9. A primer of ecology with R
10. Applied spatial data analysis with R
11. Ecological models and data in R
12. A modern approach to regression with R
13. Data manipulation With R
14. Morphometrics with R
15. Nonlinear regression with R
16. An introduction to applied multivariate analysis with R
17. Hidden Markov models for time series
18. Introductory time series with R
19. Bioinformatics and computational biology solutions using R and Bioconductor
20. Extending the linear model with R
2
2
First steps in R
Here we familiarize ourselves with the R workspace by writing a bit of code.
2.1
Example
Here is some sample code. Note that lines that begin with a number sign are
code comments and don’t have any effect in R.
> # A textual announcement
> cat("Hello world!")
Hello world!
# A graphical announcement
# Create a blank graphical plot with dimensions 100 by 100
plot.default(c(1,100), c(1,100), type="n", xlab="", ylab="")
# Add your announcement in the middle of the plot
text(50, 50, "Hello world!", cex=5)
60
80
100
>
>
>
>
>
0
20
40
Hello world!
0
20
40
60
80
100
We entered the code in the console window, and we saw the plot in its own
window. Now let’s see a demo of some basic R graphics.
3
> demo(graphics)
Finally, let’s review some of the help system. We can enter the following in the
console and use the search engine to explore the function plot
> help.start()
2.2
Exercise
1. Open R, and manually type in the code listed above.
2. Add a different message to the ”Hello World” plot.
3. Determine the function of the cex parameter.
2.3
Challenge
1. Use the help. Figure out how to change the colour of your text.
4
3
Help with help
3.1
Example
Every R function is documented, but the documentation can sometimes be
obscure. Here are a few hints to make sense of it.
1. Search for what you are looking for using help.start()
2. Functions are each associated with a package. A few packages are loaded
automatically when you turn on R. Everything else needs to be loaded
using the library() function before you use it. For example:
> library(stats)
3. Functions in R do things. They may have a statistical, graphical, or data
manipulation purpose. They accept arguments (e.g. data or parameters),
produce values (e.g. results), and sometimes have side effects (e.g. saving
files, or making pictures). Some arguments need to be supplied, and others
can be left out. Help should clear this up.
4. Most functions come with examples. You can run these examples, and
learn LOTS about how to use each function with example(functionName).
> library(stats)
> example(glm)
5. Many functions and packages come with demos. You can see the full list
of demos using the demo function. Each demo also produces code which
indicates how to create the results.
> demo()
> # Run one of these demos
> # demo(persp)
Using help, example() and demo() goes a long way to getting R to work for you
quickly. If we want to go beyond R as a ”black box” for your analyses, we need
to understand a bit better how R works. This is the goal of the next section.
3.2
Exercise
1. Use demo() to find a list of demos. Then run one that interests you.
5
4
R Objects
The goal in this section is to cover enough of the basics so that the help section
becomes intelligible, and you can begin to independently use the resources that
come with R.
4.1
4.1.1
Example
Simple objects
Let’s start by creating some R objects. Another more generic programming
term for an object is a variable. The <- operator means ”assigned to”.
> a <- 1
> a
[1] 1
> hero <- "Superman"
> hero
[1] "Superman"
> capitalsMatter <- 365
> capitalsMatter==365
[1] TRUE
>
>
>
>
>
+
+
+
+
# The following line will not work because object does not exist:
# Capitalsmatter==365
# Some logic
if (hero=="Superman") {
cat(hero, "is my hero too!")
} else {
cat(hero, "is a villain!")
}
Superman is my hero too!
The basic type of R object is the vector. Here are some examples:
> B <- 1:10
> B
[1]
1
2
3
4
5
6
7
8
9 10
> tenRandomNumbers <- rnorm(10)
> tenRandomNumbers
6
[1]
[7]
2.8875586 -0.2024185 -0.1590609 -0.2871859
1.0879563 1.3634984 -0.1082369 -0.4326746
1.1751339 -1.3016199
Note the c() below. This is a way of creating a vector
> myChemMarks <- c(0.49, 0.75, 0.86, 0.67)
> myChemMarks
[1] 0.49 0.75 0.86 0.67
> politicalHeroes <- c("Obama", "Layton", "Gandhi", "Mandela")
> politicalHeroes
[1] "Obama"
"Layton"
"Gandhi"
"Mandela"
> warblerCount <- c(6, 4, 10, 4, 8, 0, 0, 0, 5)
> warblerCount
[1]
6
4 10
4
8
0
0
0
5
Now let’s do something useful with these vectors:
> # How many political heroes do I have?
> length(politicalHeroes)
[1] 4
> # What is my chemistry average?
> mean(myChemMarks)
[1] 0.6925
> # How often did I fail a chemistry assignment?
> myChemMarks
[1] 0.49 0.75 0.86 0.67
> myChemMarks < 0.5
[1]
TRUE FALSE FALSE FALSE
> sum(myChemMarks < 0.5)
[1] 1
> # At which point count locations did I see no warblers?
> warblerCount
[1]
6
4 10
4
8
0
0
0
5
> warblerCount == 0
[1] FALSE FALSE FALSE FALSE FALSE
> which(warblerCount==0)
[1] 6 7 8
7
TRUE
TRUE
TRUE FALSE
> # I didn't count yellow-rumpeds, and I saw them everywhere
> warblerCount <- warblerCount + 1
> warblerCount
[1]
7
5 11
5
9
1
1
1
6
> which(warblerCount==0)
integer(0)
> ## Aah good!
4.1.2
I saw them everywhere!
Matrices
Matrices and data frames are two important types of R objects, and it is these
that make R especially suitable for data analysis and statistics. Let’s make some
matrices first. These are just the two-dimensional equivalent of vectors, and you
can treat them in the same way.
> # Create a 2 x 5 matrix using a vector of numbers from 1 to 10
> A <- matrix(data=1:10, nrow=2, ncol=5)
> A
[1,]
[2,]
[,1] [,2] [,3] [,4] [,5]
1
3
5
7
9
2
4
6
8
10
> # Scalar arithmetic operations on matrices
> X <- A*2
> X
[1,]
[2,]
[,1] [,2] [,3] [,4] [,5]
2
6
10
14
18
4
8
12
16
20
> Y <- A+4
> Y
[1,]
[2,]
[,1] [,2] [,3] [,4] [,5]
5
7
9
11
13
6
8
10
12
14
> # A matrix of text
> politicalMatrix <- matrix(data=politicalHeroes, nrow=2, ncol=2)
> politicalMatrix
[,1]
[,2]
[1,] "Obama" "Gandhi"
[2,] "Layton" "Mandela"
8
>
> # You can't multiply these however!
> # politicalMatrix*2
4.1.3
Error!
Data frames
Data frames are like matrices but more flexible. Unlike matrices, which can only
be of one type (e.g. all numbers, all text), data frames can mix these elements,
making them very useful for data analysis.
Also notice how R doesn’t mind when we split things onto multiple lines. But
watch the nested brackets.
> blackBirds <- data.frame(species=c("CORA", "AMCR", "COGR", "EUST"),
+
count1=c(1, 3, 1, 12),
+
count2=c(2, 2, 6, 8),
+
count3=c(0, 1, 0, 6))
> blackBirds
1
2
3
4
species count1 count2 count3
CORA
1
2
0
AMCR
3
2
1
COGR
1
6
0
EUST
12
8
6
We can pull out parts of the data frame as follows:
> # Get a row
> blackBirds[1, ]
1
species count1 count2 count3
CORA
1
2
0
> # Get a column
> blackBirds[ , 1]
[1] CORA AMCR COGR EUST
Levels: AMCR COGR CORA EUST
> # Get a column by name
> blackBirds$species
[1] CORA AMCR COGR EUST
Levels: AMCR COGR CORA EUST
> # Another way to get a column by name
> blackBirds[, "species"]
[1] CORA AMCR COGR EUST
Levels: AMCR COGR CORA EUST
9
> # Get a range of columns
> blackBirds[ , 2:4]
1
2
3
4
count1 count2 count3
1
2
0
3
2
1
1
6
0
12
8
6
> # Get a single datum
> blackBirds[1,2]
[1] 1
And we can change a single datum by assigning
> blackBirds[1, "count1"] <- 7
Adding columns can often by helpful, for example to produce new variables that
are derived from existing ones. Adding rows (i.e. adding a new observation)
occurs less often in typical data analysis. Often rows will be added outside R
at the stage of raw data preparation (e.g. in Excel). Examples of adding both
columns and rows to data frames follow:
> # Add a column called "commonName"
> blackBirds$commonName <+
c("Common raven",
+
"American crow",
+
"Common grackle",
+
"European starling")
> blackBirds
1
2
3
4
species count1 count2 count3
commonName
CORA
7
2
0
Common raven
AMCR
3
2
1
American crow
COGR
1
6
0
Common grackle
EUST
12
8
6 European starling
> # Adding a row is more complicated, however
> blackBirds <- rbind(blackBirds,
+
data.frame(species="RUBL",
+
count1=0,
+
count2=0,
+
count3=1,
+
commonName="Rusty blackbird"))
> blackBirds
1
2
3
species count1 count2 count3
CORA
7
2
0
AMCR
3
2
1
COGR
1
6
0
commonName
Common raven
American crow
Common grackle
10
4
5
EUST
RUBL
12
0
8
0
6 European starling
1
Rusty blackbird
> # Add a row total column derived from count columns
> blackBirds$totals <- rowSums(blackBirds[ , 2:4])
> blackBirds
1
2
3
4
5
species count1 count2 count3
commonName totals
CORA
7
2
0
Common raven
9
AMCR
3
2
1
American crow
6
COGR
1
6
0
Common grackle
7
EUST
12
8
6 European starling
26
RUBL
0
0
1
Rusty blackbird
1
4.2
Exercise
1. Create a vector containing the numbers 10 to 20
2. Create a vector containing the names of five people in the room
3. Create a data frame with your favourite restaurants. It should contain four columns and at least three rows. Columns should be ”name”,
”cuisineStyle”, ”costPerVisit” and ”timesVisited”
4. Find your average expenditure per visit
5. Find your total restaurant expenditure
4.3
Challenge
1. Create a vector containing 25 random numbers using the rnorm() function.
2. Find how many of these numbers are less than zero, and how many are
greater than zero.
11
5
Manipulating real data
Here we will import some data from a CSV file (produced by Excel) into an
R data frame object. We will demonstrate some basic and more advanced
manipulations of this data.
The datasets slktbbs.csv and slktbbslatlon.csv gives observations from breeding
bird survey point counts conducted in the vicinity of Sioux Lookout, Ontario
between 2006 and 2010. You should have received a copy of these datasets.
They can also be downloaded by clicking on filenames.
5.1
Example
We’ll use the dataset slktbbs.csv in this session. The second dataset with lat/lon
information will be useful for future tutorials when we explore spatial data.
Let’s start by loading the csv file. To do this we need to know where on our
computer’s file system the csv file has been saved.
In case you are unfamiliar with the concept of a file path we’ll assume that you
saved the csv files in your Windows default documents folder (My Documents
on Windows XP). In R we can refer to this default documents folder using the
tilde. Note that if you do give a file path in R you must replace back slashes
with forward slashes!
>
>
>
>
>
>
# Assumes the file slktbbs.csv is saved in documents folder
bbs <- read.csv("~/slktbbs.csv")
# It is now loaded into a data frame with object name bbs
# Examine the entire data frame (note "V"iew)
View(bbs)
You can see by viewing the data frame that it is organized by route, year and
species with the counts of the species at each of 50 route locations given as
variables. It is not terribly useful in this format. Our first goal is to impose
some order on this.
Add up counts for 25 consecutive stops. This gives us two abundance observations for each route/year/species combination. We’ll turn these into two new
columns (or variables).
> # Use column names to find column numbers
> names(bbs)
[1]
[5]
[9]
[13]
"route"
"species"
"stop3"
"stop7"
"year"
"habitat"
"stop4"
"stop8"
"aou"
"stop1"
"stop5"
"stop9"
12
"speciesname"
"stop2"
"stop6"
"stop10"
[17]
[21]
[25]
[29]
[33]
[37]
[41]
[45]
[49]
[53]
>
>
>
>
>
+
"stop11"
"stop15"
"stop19"
"stop23"
"stop27"
"stop31"
"stop35"
"stop39"
"stop43"
"stop47"
"stop12"
"stop16"
"stop20"
"stop24"
"stop28"
"stop32"
"stop36"
"stop40"
"stop44"
"stop48"
"stop13"
"stop17"
"stop21"
"stop25"
"stop29"
"stop33"
"stop37"
"stop41"
"stop45"
"stop49"
"stop14"
"stop18"
"stop22"
"stop26"
"stop30"
"stop34"
"stop38"
"stop42"
"stop46"
"stop50"
# Sum across rows of selected columns to make 2 new columns
bbs$stops1to25 <- rowSums(bbs[ , 7:31])
bbs$stops26to50 <- rowSums(bbs[ , 32:56])
# Make a smaller data frame with just what we need
birds <- bbs[, c("route", "year", "species", "habitat",
"stops1to25", "stops26to50")]
We should also know the total abundance for each route/year/species. We can
add the two new variable columns we made. This is vector addition, so the first
element of the first vector is added to the first element of the second vector, and
so on.
> # Create another new column
> birds$total <- birds$stops1to25 + birds$stops26to50
Before we move on, let’s take a look at our analysis-ready data frame. We can
use View() as shown above, but we can also look at it in two additional ways:
> head(birds)
1
2
3
4
5
6
route
SilverDollar
SilverDollar
SilverDollar
SilverDollar
SilverDollar
SilverDollar
year species
habitat stops1to25 stops26to50 total
2006
COLO
water
2
4
6
2006
HEGU
water
0
6
6
2006
COGO
water
0
1
1
2006
AMBI
wetland
3
0
3
2006
WISN
wetland
5
0
5
2006
RUGR understory
4
1
5
> # Summarize the structure of the data frame
> str(birds)
'data.frame':
$ route
:
$ year
:
$ species
:
$ habitat
:
$ stops1to25 :
$ stops26to50:
$ total
:
351 obs. of 7 variables:
Factor w/ 2 levels "BigSandy","SilverDollar": 2 2 2 ..
int 2006 2006 2006 2006 2006 2006 2006 2006 2006 20..
Factor w/ 83 levels "ALFL","AMBI",..: 24 39 22 2 75 ..
Factor w/ 5 levels "forested","open",..: 4 4 4 5 5 3..
num 2 0 0 3 5 4 0 1 0 0 ...
num 4 6 1 0 0 1 1 0 3 1 ...
num 6 6 1 3 5 5 1 1 3 1 ...
13
Notice how each column or variable in the data frame is given a different row
in this summary. The year variable is currently an integer ”int”. We need to
convert this to a ”factor” so we can use it as a grouping variable in the subsequent
steps.
> birds$year <- as.factor(birds$year)
Now comes the interesting part! Let’s ask some questions about the data by
grouping observations in different ways. A powerful function for this purposes
is aggregate().
We can tell aggregate() to use an entire data frame. This saves us constantly
referring to the data frame object. We also use a ”formula” which appears often
in R. A formula uses a tilde to separate the dependent from the independent
variables. Although in this case we are not doing statistics and dependence
doesn’t matter, this is a convenient way of saying how we want things to be
grouped.
The following finds the total number of birds found in each habitat, aggregating
all data from years and routes.
> # The bird abundance in each habitat
> aggregate(total ~ habitat, data=birds, function(subtotal) sum(subtotal))
habitat total
1
forested 1772
2
open
148
3 understory
865
4
water
154
5
wetland
162
In the above example, aggregate sorts the total column into five groups representing each habitat type. It creates a vector called subtotal with the totals just
for that group. The last part ”sum(subtotal)” takes that vector and sums it to
produce the total abundance for just that group.
We can find abundance by habitat and year in the same way
> # The bird abundance in each habitat for each year
> aggregate(total ~ habitat + year, data=birds,
+
function(subtotal) sum(subtotal))
1
2
3
4
5
6
7
8
habitat
forested
open
understory
water
wetland
forested
open
understory
year total
2006
544
2006
34
2006
277
2006
35
2006
46
2007
319
2007
31
2007
133
14
9
water 2007
10
wetland 2007
11
forested 2008
12
open 2008
13 understory 2008
14
water 2008
15
wetland 2008
16
forested 2009
17
open 2009
18 understory 2009
19
water 2009
20
wetland 2009
21
forested 2010
22
open 2010
23 understory 2010
24
water 2010
25
wetland 2010
31
27
523
45
244
65
51
194
25
118
16
20
192
13
93
7
18
What about species richness? Again, we need to ask for what subgroups in the
data set we want species richness (i.e. by year, by route, by year and route).
We’ll use a different approach because now we want to count the number of
group members rather than perform a calculation based on the abundance in
that group.
We will find the subset of the data frame we are interested in and then count
how many unique bird species names are in that subset.
> # Species richness for Big Sandy route all years combined
> nrow(unique(subset(birds, route=="BigSandy", "species")))
[1] 71
> # Species richness for the Silver Dollar route all years combined
> nrow(unique(subset(birds, route=="SilverDollar", "species")))
[1] 69
> # Species richness for wetland habitats in 2008
> nrow(unique(subset(birds,
+
habitat=="wetland" & year=="2008", "species")))
[1] 7
> # Species richness for the dataset
> length(unique(birds$species))
[1] 83
Now for some plots. On the subsequent page we use aggregate() and subset() to
give us full control over a plot. We also can specify a subset with aggregate().
This code looks complicated, but don’t be alarmed. All the principles presented
15
here have been covered in previous pages. Use ?points ?lines and ?plot.default
to get more info on these functions. We could go much further to make a more
useful plot, with legend etc. But for simplicity we’ll stop here.
5.2
Exercise
1. Use read.csv() to load the slktbbs.csv dataset
2. Make the column variable ”total” as shown
3. Use aggregate to find abundance totals by species
4. Use aggregate and the subset= option to find abundance totals by species
in open habitats
5.3
Challenge
1. Modify the plotting code provided to use colour
2. Then, modify the code provided to create a per-year plot of abundance
for four species in open habitats on the Silver Dollar route
3. Then, create a third plot showing how species richness and total bird
abundance for Silver Dollar in each of five years.
16
# Put an aggregated table into an object
aggBird <- aggregate(total ~ species + year, data=birds,
subset=(route=="SilverDollar"),
function(subtotal) sum(subtotal))
# Turn year back into a number for plotting
aggBird$year <- as.integer(as.character(aggBird$year))
# Create an empty plot
plot.default(x=c(min(aggBird[,"year"]), max(aggBird[, "year"])),
y=c(0, max(aggBird[, "total"])),
type="n", xlab="Year", ylab="Abundance")
# Plot data for three species
subBird <- subset(aggBird, species=="WTSP", c("year", "total"))
points(subBird, pch=20)
lines(subBird, lty="solid")
subBird <- subset(aggBird, species=="ALFL", c("year", "total"))
points(subBird, pch=21)
lines(subBird, lty="dashed")
subBird <- subset(aggBird, species=="COWA", c("year", "total"))
points(subBird, pch=22)
lines(subBird, lty="dotted", lwd=4)
80
>
>
+
+
>
>
>
>
+
+
>
>
>
>
>
>
>
>
>
>
●
●
●
60
●
40
Abundance
●
20
●
●
●
●
2009
2010
0
●
2006
2007
2008
Year
Figure 1: Abundance for three species on the Silver Dollar route. Whitethroated sparrow (solid), Alder flycatcher (dashed), Connecticut warbler (dotted)
17