R basics
Albin Sandelin

Ultra-short R introduction
Most life scientists use spreadsheet programs for analysis - like Excel or Statistica. Why?
Ease of use
- click buttons, select data by hand
- you see the data in front of you
- you can do limited programming
Disadvantages
• Hard to handle large datasets (>100 points)
• Inflexible, few analyses available
• "Dumb-down" click-button effect
• Hard to repeat analyses systematically with new data
• R is a computational environment somewhere between a program and a
programming language
• No buttons, no wizards: only a command line
interface
• Is a professional statistics toolset - likely has
all the analyses you will ever need in your
career
• Is also a programming language
• Can handle large datasets
• Very powerful graphics
• The state-of-the-art statistics program for
bioinformatics, especially array analysis
• Free, and open source!
In this course, we will use R for
Exploring data
Select subsets of data
Plotting data
Statistical tests on data
…etc
= almost everything.
It is the most important tool in this course!
R is hard
R is challenging to learn:
No visual cues
Very different from spreadsheets
Hard to find the method to use
Cryptic manual pages (at times)
It requires a lot of effort to get inside the system, but bear in mind that it is worth it: you gain an edge over all spreadsheet users
Reading material
• Keep the reference sheet handy
• Save code from lecture exercises in a text file
• Powerpoint slides are available after lectures
at the web page
• Can also recommend the tutorials on the
CRAN website – this is how I learned R
• Dahlgaard book is very good, also for future
reference
Expected:
• You will have R open at all times during this
course
• We will make MANY smaller exercises during
class, most often in pairs
• Understanding these exercises is the key to
succeeding in the exam
• We will have at least two teachers present at all lectures - use us to get help when needed
• NB: It is YOUR responsibility to ensure that you understand the examples/exercises
- if not, you have to ask for clarification during the walkthrough of the problems
First challenge of the day
• Start R
• Type
demo(graphics)
• Hit enter a few times
Starting with R
We will now walk through the most basic
R concepts
We will add successive detail on this –
with many exercises - and in a while
get over to plotting, basic statistics, and
finally some programming
Challenging?
• Those who know R or have a math background might find most of today's material not so challenging
• We cover this to keep everyone on the
same ground
Vectors
• In statistics, we are almost always dealing with several "data points"
• A vector is a collection of numbers and/or strings:
– ("albin", "sanne", "jens")
– (1, 5.2, 10, 7, 2, 21)
– (3)
• The last example is a vector of length 1
In R, we make a vector by the c() command (for concatenate)
> c(1,5,10, 7, 2, 1)
[1] 1 5 10 7 2 1
Here c is the method name, the numbers in parentheses are the method arguments, and the printed output is also called the "return value(s)".
> c('albin', 'sanne', 'jens')
[1] "albin" "sanne" "jens"
When we input strings or characters, we have to surround them with " or '
If making vectors of size 1, we can skip c():
> 3
[1] 3
Challenge:
1) Make the following vector: 45, 5, 12, 10
2) What happens with the following commands?
c(1:100)
c(50:2)
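One possible solution (a sketch - the variable name v is arbitrary):
> v <- c(45, 5, 12, 10)   # 1) the requested vector
> c(1:100)                # prints the integers 1, 2, ..., 100 - A:B expands to the whole sequence
> c(50:2)                 # prints 50, 49, ..., 2 - the sequence can also run backwards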
A vector is a data structure, and the most fundamental
in R: almost everything in R is some kind of vector,
although sometimes in several dimensions – vectors
within vectors
The reference sheet is your
friend!
• You will get overwhelmed by different command names fast
• Use the reference sheet to remind
yourself in all exercises
Assignment to memory
• The c() command is almost useless in itself - we want to keep the vector for other analyses
• The assignment concept:
> 4+5    # add 4 and 5
[1] 9
> a<-4   # store 4 as "a"
> b<-5   # store 5 as "b"
> a      # just checking
[1] 4
> b
[1] 5
> a+b    # add a+b (4+5)
[1] 9    # yes, works
Anything after a # will be disregarded by R. Used for making comments.
Expanding this to a whole vector
my_vector<- c(1,5,10, 7, 2)
Note that there is no "return value" printed now - it is captured by "my_vector". What happens if you type my_vector after this?
my_vector is a variable, with the variable name my_vector. Variable names are totally arbitrary!
The anatomy of the vector:
Name:   my_vector
Values:  1   5  10   7   2
Index:  [1] [2] [3] [4] [5]
We can access part of the vector like this
my_vector[5] # will give the 5th item in the vector
What happens if you do this?
my_vector<- c(1,5,10, 7, 2) # define the vector
my_vector [c(1,3,5)]
my_vector[1:4]
my_vector[4:1]
> my_vector[c(1,3,5)]
[1] 1 10 2
> my_vector[1:4]
[1] 1 5 10 7
Making a series of integers:
A:B will give a vector of A,A+1, A+2…B
>my_vector[4:1] # also works backwards
[1] 7 10 5 1
Challenge
• Using the reference sheet, figure out at least three ways of making R print your vector in the other direction
my_vector<- c(1,5,10, 7, 2) # define the vector
> my_vector[c(5,4,3,2,1)]
[1] 2 7 10 5 1
> my_vector[c(5:1)]
[1] 2 7 10 5 1
> rev(my_vector)
[1] 2 7 10 5 1
>
Interlude: naming rules and
the danger of over-writing
Naming: We can name vectors almost anything. The most basic rule is: never start a vector name with a number.
> a<- c(1,5,4,2)   # OK
> 1a<- c(1,5,4,2)  # NOT OK
Error: syntax error
> a1<- c(1,5,4,2)  # OK
Over-writing:
my_vector<- c(1,5,10, 7, 2)
my_vector<- c(10,5,2, 3, 1)
#what does my_vector contain now?
Analyzing vectors
• Many functions work directly on vectors - most have
logical names
• For instance, length(my_vector) gives the number of
items in the vector (= 5)
• Challenge:
– Make a vector big_vector with values 1 to 10000 using the
a:b construct
– What is the
• Length of the vector
• Sum of all items in vector: the sum() function
• Mean(= average) of all items in the vector: the mean() function
> big_vector<-(1:10000)
> length(big_vector)
[1] 10000
> sum(big_vector)
[1] 50005000
> mean(big_vector)
[1] 5000.5
Interlude: the R help system
• R has hundreds of functions where most
have a non-trivial syntax: The easiest way of
retrieving documentation is by saying
?function_name           # OR
help(function_name)      # OR
help.search("something") # for 'fuzzy' searches
• Look at the help for sample() and sort() and then try them out on big_vector
• Also try out RSiteSearch("sample") - what happens?
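A possible try-out sketch (sample() draws random elements, sort() orders a vector):
> ?sample                       # open the help page
> sample(big_vector, 5)         # 5 values drawn at random from big_vector
> sort(sample(big_vector, 5))   # the same idea, with the result sorted
> RSiteSearch("sample")         # opens a search of the R site in your web browser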
Adding and multiplying a number
to a vector
Sometimes we want to add a number, like 10, to each element in
the vector:
> big_vector +10
Test this:
big_vector2<-big_vector +10
Also test
min(big_vector)
max(big_vector)
min(big_vector2)
max(big_vector2)
What happens?
> big_vector2<-big_vector +10
> min(big_vector)
[1] 1
> max(big_vector)
[1] 10000
> min(big_vector2)
[1] 11
> max(big_vector2)
[1] 10010
Adding vectors
• We can also add one vector to another vector
• Say that we have the three vectors
A<-c(10, 20, 30, 50)
B<-c(1,4,2,3)
C<-c(2.5, 3.5)
• Test what happens and explain the outcome:
A+B
A+C
> A+B
[1] 11 24 32 53
> A+C
[1] 12.5 23.5 32.5 53.5
A<-c(10, 20, 30, 50)
B<-c(1, 4, 2, 3)
C<-c(2.5, 3.5)
A+B is easy to understand : A[1]+B[1] , etc
A+C is trickier - the C vector is just of length 2. It is reused! So
A[1]+C[1]=12.5
A[2]+C[2]=23.5
A[3]+C[1]=32.5
A[4]+C[2]=53.5
Actually, this is what is happening also with A+10 …the
10 is used many times.
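The recycling is silent when the lengths divide evenly; when they do not, R still recycles but warns. A small sketch (the exact warning wording may differ between R versions):
> A + c(1, 2, 3)   # length 3 does not divide length 4 - R recycles anyway: 11 22 33 51
Warning message:
longer object length is not a multiple of shorter object length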
Plotting vectors
• Let's make up some semi-random data:
dat<-rnorm(100) # draw 100 random, normally distributed data points
• Test the following:
plot(dat)
plot(dat, type="l")
barplot(dat)
hist(dat)
• What is the difference?
Why are your three first plots different than mine?
Why is your last plot more similar to mine?
Graph options
• Generally, you can give extra options to graphical commands
like this
• plot(some_vector, col="blue", type="l")
• In plot: try to vary the following options - note that you can use several at once (and figure out what they do)
type="l", "b"
col="hotpink"
main="important plot"
type="h"
type="S"
These options are really arguments to the plot() function
More about functions
In almost all cases, a function needs some
input, like plot(dat).
‘dat’ here is an unnamed argument, and this
works because plot() assumes we mean
x values = dat.
We could also say plot(x=dat) - a named
argument. If you have many arguments, most
of them are named - such as
plot(some_vector, col="blue", type="s")
The help pages will tell you what type of
arguments you can use
The danger of unnamed arguments…
• …is that the order of them will make a
big difference. Try this out - what is the
difference between the plot commands?
a<-rnorm(100)
b<-rnorm(100)*2
plot(a,b)
plot(b,a)
plot(x=b, y=a)
[Figures: three scatterplots - plot(a,b) puts a on the x axis and b on the y axis; plot(b,a) swaps the axes; plot(x=b, y=a) looks like plot(b,a), since the named arguments pin down which vector goes on which axis]
Some generic R arguments to
plots - the par() function
The par() function is used to set general plot properties. It has hundreds of possible arguments - see ?par
Two very handy par() arguments are mfrow and mfcol - these allow many plots on one page
You give these arguments a vector of length 2 - this gives the number of cells on the page (see example)
Example:
> par( mfrow=c(3,1) )
> plot (a,b)
> plot (b,a)
> plot(x=b, y=a)
> par( mfrow=c(2,2) )
> plot (a,b)
> plot (b,a)
> plot(x=b, y=a)
Challenge - can you get the
three plots in a row using
mfrow?
> par( mfrow=c(1,3) )
> plot (a,b)
> plot (b,a)
> plot(x=b, y=a)
Overlaying plots
• Sometimes we want to put many data sets within one graph, on top of each other
• This is often done with the lines() or points() commands, like this
> plot(b, type="l", col="blue")
> lines(a, col="red")
Why did I start with plotting b?
What would have happened if using points() instead of lines()?
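One hedged answer sketch: the first plot() call fixes the axes, so start with the series with the widest range (or set the limits yourself with ylim); points() would overlay dots instead of a connected line:
> rng <- range(c(a, b))                    # the widest range over both series
> plot(b, type="l", col="blue", ylim=rng)  # axes now have room for both a and b
> points(a, col="red")                     # dots on top instead of a line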
Sizing graphs
Simple concept, but awkward to write
Change X scale: xlim=c(start_value,
end_value)
Change Y scale: ylim=c(start_value,
end_value)
> plot(a, type="l", col="blue")
> plot(a, type="l", col="blue", ylim=c(-5,5))
Saving graphs
• Different on different systems!
• All systems can use the original, tricky device system - see the introduction PDFs if you have an interest. We won't use this.
OSX
• Click on the graph, and just copy it. Will
become a pdf or a bitmap when pasting.
• On my Mac you have to open a new file in Preview and paste there first. On others you can paste it into other programs directly
• Or, just have the graphics window
active and do file->save
Windows
Right-click on graph, copy as metafile or
bitmap, paste
Linux
• Use the awkward device system, like pdf(...), or
• Use GIMP and just make a screen capture from there by "acquire"
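For completeness, a minimal sketch of the device route (works on every platform; the file name is an arbitrary example):
> pdf("my_plot.pdf")    # open a PDF graphics device
> plot(dat, type="l")   # everything drawn now goes into the file, not the screen
> dev.off()             # close the device - only now is the file written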
Statistics part 1
• Describing the data: descriptive
statistics
• What is the center point of the data?
• What is the spread of the data?
• How does the distribution look?
Summary statistics
• hist() (= histogram) is a graphical way of summarizing distributions - it creates a number of "bins" and counts how many of the data points fall into each bin.
• We can also summarize by the center points
in the data:
– mean()
• median():
– Sort all the data, pick the number in the center. If
the number of data points is even, take the mean
of the two center points
• Challenge:
• We make another vector
dat2<-rnorm(10)
And add a few extra points to it
dat2<-c(dat2, 10, 10.5, 30 )
• Test mean() and median() on dat2. Are they the
same? Can you explain the differences by plotting a
histogram? What is the advantage/disadvantage of
each measure?
> dat2<-rnorm(10)
> dat2<-c(dat2, 10, 10.5, 30 )
> median(dat2)
[1] 0.11125
> mean(dat2)
[1] 3.636139
> hist(dat2)
Means are sensitive to
outliers! Very common
situation in genomics…
Percentiles
• An extension of the median concept
• Best explained by example:
– the 20th percentile is the value (or score) below
which 20 percent of the observations may be
found.
• The median is the same as the 50th
percentile
• The first quartile is the 25th percentile, the
third is the 75th
• Try summary(dat) and summary(dat2)
summary(dat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.20600 -0.60480 -0.06930 -0.01333 0.68030 3.49600
> summary(dat2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.4010 -0.5385 0.1113 3.6360 0.6771 30.0000
The command ecdf() (empirical cumulative distribution) calculates "all" percentiles in your data - and its result also understands plot()
Try plot(ecdf(dat2))
[Figure: the ECDF plot - the y axis shows the fraction of the data covered at point x; the x axis shows the sorted values of dat2]
Boxplots
• An "easier" representation of ECDFs. It is based on making boxes that tell us about both the center point and the "spread" of the data
• First, calculate the first quartile, the median
and the third quartile
• Calculate the “inter-quartile range” (IQR): 3rd
quartile -1st quartile
• These will be used to draw a “box”
> boxplot(dat)
> rug(dat, side=2)
[Figure: the resulting boxplot - the box spans the interquartile range (IQR), from the 1st quartile up to the 3rd quartile, with the median marked inside]
…continued
• Sounds more complicated than it is:
Any data observation which lies more than
1.5*IQR lower than the first quartile is
considered an outlier.
Indicate where the smallest value that is not
an outlier is by a vertical tic mark or "whisker",
and connect the whisker to the box via a
horizontal line.
Do the same for higher values
[Figure: a boxplot with whiskers - the upper whisker marks the highest data point that is still within 1.5 IQR from the box; more extreme data points are drawn individually as outliers; the box again shows the 1st quartile, median and 3rd quartile]
Variance, standard deviation
and data spread
• What is the difference between these
distributions?
• Same mean and median, but
different spread over the x axis
• This can be measured by the variance of the data:
• It is basically the average of the squared differences between each point and the mean
• The standard deviation is simply the square root of the variance.
• This gives nice statistical features that you will get to know if you study statistics
Why are variance and standard deviation important?
• Variance tells you something about the quality of measurements
• The higher the variance, the harder it is to say with certainty that two measurements are different
• (whiteboard demo)
• What are the R functions for variance
and standard deviation?
• If we make some random data
smallset<-rnorm(100)
largeset<-rnorm(10000)
What is the variance and standard deviation
for these?
Is the standard deviation really the square
root of the variance (what is the function for
square root?)
>smallset<-rnorm(100)
>largeset<-rnorm(10000)
>var(smallset)
[1] 0.9504884
>var(largeset)
[1] 0.98416
Why do we get about the same variance?
>sd(largeset)
[1] 0.9920484
>sd(smallset)
[1] 0.97493
>sqrt(var(smallset))
[1] 0.97493
Dealing with real data
• …involves interacting with messy data
from other people
• Almost all interesting data is “real”
• We need methods to get the data into R
to start with.
Reading data from files
• Typically, you have data in a .txt or .xls file - how do you get it into R for analysis?
• Four steps:
1 Format it to a tab-separated text file
2 Tell R what directory to work in
3 Read in the file
4 Attach data to memory (optional)
What R likes:
• A “data frame”(=table) with columns.
The first row can be column names, like
this:
The first row is the name row ("header"); the row after it is the first data row:

Lecturer  height  age  Poker_games_won
Albin     181     34   14
Jens      189     31   10
Sanne     170     NA   21

Lecturer holds string data, the other columns numerical data. NA is the code for missing data - do not ask ladies for age...
Reading data(1)
• General rules:
– Use tabs as separators
– Each row has to have the same number of
columns
– Missing data is NA, not empty
• As .txt: The ideal format!
• As .xls: save as tabbed text
Get some data to play with:
• Update: the old file was wrong! Please download the data again - see the course web page
• http://people.binf.ku.dk/albin/teaching/htbinf/R/
• Download the zipped file - save this at a good place that you know where it is
• I will refer to this place on your hard drive as "the data directory"
Reading data(2)
• Tell R where to work.
– In R, you always work in a given directory, called the "working directory".
– You can set this graphically in Windows or MacOS (next pages)
– Or use setwd("some directory")
– In Linux, it pays to start R in the same directory
• It pays to have one directory per larger analysis, and to call the data files informative things (not "Data.txt")
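A minimal sketch (the path is a made-up example - point it at your own data directory):
> getwd()                            # where is R working right now?
> setwd("C:/Users/me/course_data")   # hypothetical path to the data directory
> list.files()                       # the downloaded data files should show up here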
[Screenshots: where the current working directory is shown and how to change it, in OSX and in Windows]
Reading in data (3)
my_data<- read.table("filename", header=T)
read.table(...) is the actual command; my_data becomes a data frame object holding all the data; header=T is needed only if you have column names - I always have that.
Lets try…
• In the data directory, there is a file
“flowers.txt”. Have a look at it (it is a normal
text file – can be opened in almost anything).
Now:
my_data<-read.table("flowers.txt", header=T)
names(my_data)
What happens – what does “names” do?
> my_data<-read.table("flowers.txt", h=T)
# note to self: show name completion feature
> names(my_data)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         5.1         3.5          1.4         0.2  setosa
         4.9         3.0          1.4         0.2  setosa
         4.7         3.2          1.3         0.2  setosa
         4.6         3.1          1.5         0.2  setosa
         5.0         3.6          1.4         0.2  setosa
Etc...
Accessing the columns
• There are three ways of getting the data
in a data frame
– Get the column by saying
dataframe_name$column_name
– Get the columns as separate vectors,
named as their header - the attach
command
– More complex queries, using index
operations (more in a few slides)
Columns by names
• The attach(some_data_frame) function will
put all the columns of some_data_frame as
vectors in working memory
• So, in our case
• attach(my_data)
• Will give the 5 vectors
"Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width" and "Species"
• (which we then can analyze)
Dangers of the attach()
function
• attach() will make a vector out of each column, named after the column name
• What happens if we already have a
vector with that name?
• What does ls() and detach() do? Why
would we want to do this?
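One hedged way to explore those questions (a sketch):
> Sepal.Length <- c(1, 2, 3)   # a vector whose name clashes with a column
> attach(my_data)              # R warns that Sepal.Length is masked by your vector
> ls()                         # lists the objects in your workspace
> detach(my_data)              # removes the attached columns from the search path again
> rm(Sepal.Length)             # delete the clashing vector before re-attaching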
Challenge
• Using the flowers file,
– Plot a histogram of the length of sepals
(“Sepal.Length”)
– Plot the correlation between sepal length and
sepal width (an x-y plot)
• What is the mean of Sepal.Length and
Sepal.Width?
– Why do you get strange results on Sepal.Length?
– Can you solve the problem by using the
appropriate help file?
> hist(Sepal.Length)
> plot(Sepal.Length, Sepal.Width)
> mean(Sepal.Width)
[1] 3.057333
> mean(Sepal.Length)
[1] NA
>
If we look at the file:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         5.1         3.5          1.4         0.2  setosa
         4.9         3.0          1.4         0.2  setosa
          NA         3.2          1.3         0.2  setosa    # strange!
         4.6         3.1          1.5         0.2  setosa
         5.0         3.6          1.4         0.2  setosa
         5.4         3.9          1.7         0.4  setosa
         4.6         3.4          1.4         0.3  setosa
Missing values
• In real data, we often have missing
values. In R, we label these “NA”
• Any examples of missing data?
• Some functions can by default deal with
these - for instance hist()
• Some functions will need to be told what
to do with them, like mean()
> mean(Sepal.Length)
[1] NA
#na.rm is the option for mean to remove all NAs first
> mean(Sepal.Length, na.rm=T)
[1] 5.851007
With new files, you should check whether there are any
NA values first.
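A sketch of that check - is.na() flags the missing values:
> any(is.na(Sepal.Length))   # TRUE if at least one value is missing
> sum(is.na(Sepal.Length))   # how many values are missing
> summary(my_data)           # the per-column summaries also report NA counts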
What if we wanted just the 5 first rows? Or just all values from the species versicolor?

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         5.1         3.5          1.4         0.2  setosa
         4.9         3.0          1.4         0.2  setosa
         4.7         3.2          1.3         0.2  setosa
         4.6         3.1          1.5         0.2  setosa
         5.0         3.6          1.4         0.2  setosa
         6.2         2.9          4.3         1.3  versicolor
         5.1         2.5          3.0         1.1  versicolor
         5.7         2.8          4.1         1.3  versicolor
         6.3         3.3          6.0         2.5  virginica
         5.8         2.7          5.1         1.9  virginica
         7.1         3.0          5.9         2.1  virginica
         6.3         2.9          5.6         1.8  virginica
The index concept
• Recap:
• In a vector, we can select individual values by saying
• Sepal.Width[some position]
> Sepal.Width[5]
[1] 3.6
• Sepal.Width[some_positions]
> Sepal.Width[2:10]
[1] 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1
Sepal.Width[c(2,4,6,8)]
[1] 3.0 3.1 3.9 3.4
• The first position will always have the number 1, the last will have the
length of the vector.
• Challenge: what is the mean of the first 100 values of
Sepal.Width?
> a<-Sepal.Width[1:100]
> mean(a)
[1] 3.099
# or, in just one line
> mean(Sepal.Width[1:100])
[1] 3.099
Selecting parts of vectors
depending on values
• Often, we want parts of a vector depending on the values of the vector - for instance all values <10, or all strings equaling "versicolor".
• This is done with a slightly odd syntax (it will make sense later on when we have many vectors)
• To get all values over 10 in Sepal.Length
Sepal.Length[ Sepal.Length >10 ]
• So, we are using a “criterion” instead of the index
Try out
• Make a boxplot of all values in
Sepal.Length that are higher than 5
• How many values are there that satisfy this criterion?
Albin’s code
a<-Sepal.Length[Sepal.Length >5 ]
boxplot (a)
length(a)
[1] 119
Criteria in R
<   smaller than
>   bigger than
>=  bigger than or equal to
<=  smaller than or equal to
==  equal to (not "=")
!=  not equal to
• Challenge:
how many rows for the virginica species?
> length(Species[Species==virginica] )
Error in NextMethod("[") : object "virginica" not found
# what happens here is that R thinks virginica is an
object - a vector or something else
> length(Species[Species=="virginica"] )
[1] 50
For string comparisons, we need to enclose the
string with “ ” or ‘’
Selection using two vectors
• Often we have two vectors from a file,
and we want to use the value of one
vector to select values from the other.
• For instance, sepal length broken up by
species
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         5.1         3.5          1.4         0.2  setosa
         4.9         3.0          1.4         0.2  setosa
         4.7         3.2          1.3         0.2  setosa
         4.6         3.1          1.5         0.2  setosa
         5.0         3.6          1.4         0.2  setosa
         6.2         2.9          4.3         1.3  versicolor
         5.1         2.5          3.0         1.1  versicolor
         5.7         2.8          4.1         1.3  versicolor
         6.3         3.3          6.0         2.5  virginica
         5.8         2.7          5.1         1.9  virginica
• Use the same syntax as before - but change the vector names!
• some_vector[other_vector == something]
• Challenge:
• select the Sepal.Width values that
correspond to the setosa species
• What is the mean of these values?
• What is the distribution of these values?
Albin’s code
> a<-Sepal.Width[Species=="setosa"]
> mean(a)
[1] 3.428
> hist(a)
Fancy way of breaking up
data (only works with some
functions)
• boxplot (Sepal.Width~Species)
• Try it out…
• This is a “formula” construction - break
up width in the levels of species
Even more complex: multiple
criteria
• & “and”: combine two criteria
• | “or”: either this or that has to be true
So,
Species[ Sepal.Width >4 & Sepal.Length <5]
will give the species rows where width >4 AND Length is <5.
…and, Species[ Sepal.Width >4 | Sepal.Length <5] will give?
Challenge: Make a boxplot of Sepal.Width
where Sepal.Length >3 AND the species is
virginica
>boxplot (Sepal.Width[Sepal.Length >3 &
Species=="virginica"] )
More complex data selections
• The data frame is really a two-dimensional vector - a vector that contains vectors
• We can use almost the same
commands to select sub-sets of the
data frame, without using the attach()
function
• This is very handy at times
• The data frame is really a big table - a 2D vector.
• Just as some_vector[5] gives us the fifth value in a vector,
• some_dataframe[10, 2] gives us the cell in the 10th row and the 2nd column
>my_data[10, 2]
[1] 3.1
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         5.1         3.5          1.4         0.2  setosa
         4.9         3.0          1.4         0.2  setosa
         4.7         3.2          1.3         0.2  setosa
         4.6         3.1          1.5         0.2  setosa
         5.0         3.6          1.4         0.2  setosa
         5.4         3.9          1.7         0.4  setosa
         4.6         3.4          1.4         0.3  setosa
         5.0         3.4          1.5         0.2  setosa
         4.4         2.9          1.4         0.2  setosa
         4.9         3.1          1.5         0.1  setosa
         5.4         3.7          1.5         0.2  setosa
         4.8         3.4          1.6         0.2  setosa
         4.8         3.0          1.4         0.1  setosa
Selecting rows
• If we instead say
my_data[10, ]
• We will get the WHOLE 10th row back
• One can think of it as saying
my_data[10, "give me everything"]
• So, what will happen if we say
my_data[1:5, ]
• Or
my_data[5:1, ]
• Try it out…
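The same bracket syntax also answers the questions from the earlier slide without attach() - a sketch using the $ column access from before:
> my_data[1:5, ]                             # the 5 first rows
> my_data[my_data$Species=="versicolor", ]   # all rows of the versicolor species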
Indexing exercise
• In the data directory, there is a file
called Rindexing.pdf – it is a set of
small exercises designed to help you
understand indexing
• It works with a very small data file, just
so that you can check your results by
eye
• Indexing is very important for the rest of
the course
Nota Bene:
1) Students on the reserve list
• Please talk to me to get signed on to the course (I need your student number or similar)
• Otherwise you cannot go to the exam => no credit
2) A watch was found in the last lecture - it will be given back if you can describe it well
Sorting one vector by another
• Very common situation with
experimental data – what are the genes
with the highest expression in my data
set?
• Recap: what happened if you said
my_data[1:5, ]
• Or
my_data[5:1, ]
• ?
# get some data to play with
d<-read.table("mtcar.txt", h=T)
attach(d)
Just by looking at the data (type d to print it), what car has the highest and lowest horsepower (=hp)?

              car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Permutations
sort.list(x) will give the permutation of x that sorts x
This is much easier to explain with a small example...
> a<-c(3,1,10)
> sort.list(a)
[1] 2 1 3
> a<-c(3,1,10)
> sort.list(a)
[1] 2 1 3
> b<-sort.list(a)
> a[b]
[1] 1 3 10
> sort.list(hp)
 [1] 19  8 20 18 26 27  3  9 21  6 32  1  2  4 28 10 11 22 23  5 25 30 12 13 14 15 16 17  7 24 29 31
[The slide shows the full data frame again (as printed earlier), with the sort.list(hp) permutation - 19, 8, 20, 18, ..., 29, 31 - listed alongside it, so that each entry of the permutation points at a row of the table]
Thought exercise
What happens if we now say
hp[c(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6,
32, 1, 2, 4, 28, 10, 11, 22, 23, 5, 25,
30, 12, 13, 14 ,15 ,16 ,17 , 7 ,24 ,29,
31)]
?
We get hp sorted…
hp[c(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6,
32, 1, 2, 4, 28, 10, 11, 22, 23, 5, 25,
30, 12, 13, 14 ,15 ,16 ,17 , 7 ,24 ,29,
31)]
[1] 52 62 65 66 66 91 93 95 97 105
109 110 110 110 113 123 123 150 150
175 175 175 180 180 180 205 215 230
245 245 264 335
So, sort.list really gives an
instruction how to sort
So, we can also say
sort_order<-sort.list(hp)
hp[sort_order]
What will happen if we say
car_name[sort_order]
?
> car_name[sort_order]
 [1] Honda Civic         Merc 240D           Toyota Corolla      Fiat 128            Fiat X1-9           Porsche 914-2       Datsun 710
 [8] Merc 230            Toyota Corona       Valiant             Volvo 142E          Mazda RX4           Mazda RX4 Wag       Hornet 4 Drive
[15] Lotus Europa        Merc 280            Merc 280C           Dodge Challenger    AMC Javelin         Hornet Sportabout   Pontiac Firebird
[22] Ferrari Dino        Merc 450SE          Merc 450SL          Merc 450SLC         Cadillac Fleetwood  Lincoln Continental Chrysler Imperial
[29] Duster 360          Camaro Z28          Ford Pantera L      Maserati Bora
32 Levels: AMC Javelin Cadillac Fleetwood Camaro Z28 Chrysler Imperial Datsun 710 Dodge Challenger Duster 360 Ferrari Dino Fiat 128 ... Volvo 142E
So, we use the hp permutation to reorder the cars!
We can also do this with the whole dataframe, in
the same way
> d[sort_order,]
              car_name  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
What type of data are we
looking at?
• Type here refers to the kind of data representation - for example "strings" or "numbers"
• R usually does a good job at interpreting this for us
when we read data
• This becomes important later on, as some functions
only take vectors of a certain data type - for instance,
you cannot take the mean() of a vector with “words”!
Simple types
• Numbers:
num_vector<- c(1,5,2,10)
• Strings (or characters):
String_vector<-c("a", "trial1")
• How can we know?
class(num_vector)
[1] "numeric"
Factors/groups
• Factors: slightly more tricky. A factor is often a
grouping variable - such as the gender or experiment
group, like
gender<-c('f', 'f', 'm', 'm', 'f')
experiment_number<-c(1,2,4,5,5,6)
• Problem here is that R will think that the first vector is
a string vector, and the last a numerical vector.
> class(gender)
[1] "character"
> class(experiment_number)
[1] "numeric"
• This is resolved by
> experiment_number<-as.factor(c(1,2,4,5,5,6))
> class(experiment_number)
[1] "factor"
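A small sketch of the same idea on the string vector - as.factor() converts it, and levels() shows the groups R found:
> gender<-as.factor(c('f', 'f', 'm', 'm', 'f'))
> class(gender)
[1] "factor"
> levels(gender)   # the distinct groups
[1] "f" "m"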
Over to analysis of data!
boxplot (Sepal.Width~Species)
Is there a real
difference?
We now move over from "pure R" to "using R to make statistical tests"
• Focus is: what typical tests are there and how to use them
• We skip almost all statistical theory (!) - it is covered in the course in the fall
Population and sample
• Population refers to a set of potential measurements or values, including not only cases actually observed but those that are potentially observable. It really depends on the question at hand!
• Sample is the data we actually have - generally this is very small compared to the whole population, but we use it to draw inferences about the population
• So, the sample is a small part of a much larger population, in most cases
• Can sample==population? Examples?
Significance tests
• In real life, we always get some difference
between datasets, because we sample
limited data, and there is always noise
• So, is the difference in sepal width we see
between species something that is just due to
“luck” - or is it real?
• Challenge: What three features in the dataset will be important for finding this out (significant difference or just difference due to random noise)? Discuss with the person next to you for a minute
Albin’s take
• The observed difference
– Obviously, the bigger the observed difference, the less likely it is to be random
• The data depth
– The more data you have, the less likely it is that strange things happen due to random effects
• Data spread (variance)
– Slightly less intuitive. The less spread, the less the difference can be explained by random events.
The concept of null hypothesis
and P-values
• Always in statistics, we set a null-hypothesis.
This is basically what we want to disprove!
• Usually, this is something like “there is no
difference in mean in these two samples”.
The alternative hypothesis is then, by default,
“There is a difference”
• Then, we test how likely this is, with certain
assumptions (depending on what test)
• The P-value of a test can be interpreted as the chance of seeing a difference at least as large as the observed one if the null hypothesis is true
– This only makes sense if your test is relevant!
• So, a small P-value means that the null
hypothesis is not a good model - which
means that there is a difference
• We often want P-values to be as small as
possible - the standard cutoff for
“significance” is 5%, or 0.05
• However, this is just tradition – it is
completely arbitrary
Two-sample t-test
• [deliberate skip of math details – Jens
will cover some later on]
• The t-test tests if two distributions come
from distributions with the same mean.
• The assumption is that both distributions are "normal-shaped" - we should always check this first
[Figure: a normal distribution annotated with its mean and its standard deviation - a measure of spread]
Normality - brief check
• There are many tests for normality. Now
we will do the simplest: by eye. Make a
histogram and see if the data looks
reasonably “bell-shaped”
• Challenge: Test to make histograms for
the sepal length of each species
Albin’s take
#warning: wrong approach
hist(Sepal.Length)
# not very normal-shaped!
#why is this the wrong approach?
Albin’s take, again
#now for each species
par (mfrow=c(1,3))
hist(Sepal.Length[Species=="setosa"])
hist(Sepal.Length[Species=="versicolor"])
hist(Sepal.Length[Species=="virginica"])
#IMHO, these are not very far from normal-shaped, given the amount of data
Slightly more advanced way - the quantile-quantile plot
• If two distributions have the same shape, their cumulative distributions should match well
[Figures: plot(ecdf(rnorm(50))) and plot(ecdf(Sepal.Length)) side by side, both showing Fn(x) from 0.0 to 1.0 against x]
qqnorm(Sepal.Length)
[Figure: Normal Q-Q Plot of Sepal.Length - Sample Quantiles plotted against Theoretical Quantiles]
qqnorm(Sepal.Length[Species=="setosa"])
[Figure: Normal Q-Q Plot of the setosa subset only]
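A hedged sketch for trying all three species at once - qqline() adds the straight reference line that a normal sample should roughly follow:
> par(mfrow=c(1,3))
> qqnorm(Sepal.Length[Species=="setosa"]); qqline(Sepal.Length[Species=="setosa"])
> qqnorm(Sepal.Length[Species=="versicolor"]); qqline(Sepal.Length[Species=="versicolor"])
> qqnorm(Sepal.Length[Species=="virginica"]); qqline(Sepal.Length[Species=="virginica"])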
So, are they different?
> setosa<-Sepal.Length[Species=="setosa"]
> versicolor<-Sepal.Length[Species=="versicolor"]
> t.test(setosa, versicolor)
Welch Two Sample t-test
data: setosa and versicolor
t = -10.521, df = 86.538, p-value < 2.2e-16 #Very!
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1057074 -0.7542926
sample estimates:
mean of x mean of y
5.006 5.936
Lets test it with some real data
In your data directory, there are two files with binding site lengths, from estrogen receptor alpha and beta: ERA_widths.txt, ERB_widths.txt
1) Read these into two different vectors, called era, erb
2) Do they have significantly different means?
Albin's code
> d<-read.table("ERA_widths.txt", h=T)
> attach(d)
> d<-read.table("ERB_widths.txt", h=T)
> attach(d)
>
> t.test(ERA_width, ERB_width)
Welch Two Sample t-test
data: ERA_width and ERB_width
t = 2.7288, df = 1066.975, p-value = 0.00646
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
11.18740 68.45565
sample estimates:
mean of x mean of y
514.1370 474.3155
Hey, we forgot to check for
normality!
> par(mfrow=c(1,2))
> hist (ERA_width)
> hist (ERB_width)
# NOT really normal shaped! t.test is not valid!
The two-sample Wilcoxon test
• The t-test is assuming that samples are
normal-distributed, which is often not true
with biological data
• The Wilcoxon test, wilcox.test(), is a less powerful test which does not make such assumptions.
• Tests of this kind are called non-parametric
• Use when the normality assumption is
doubtful
Challenge
• Repeat the analysis of the ERA/beta
sites using wilcox.test. Is the difference
significant?
> wilcox.test(ERA_width, ERB_width)
Wilcoxon rank sum test with continuity correction
data: ERA_width and ERB_width
W = 151243.5, p-value = 0.05551
alternative hypothesis: true location shift is not equal to 0
The “AHH, so close to significance” situation
Different types of alternative
hypothesis
Null hypothesis: "there is no difference"
Default alternative hypothesis: "there is a difference". This is called "two-sided", as the difference can be in either direction
Optional: We can make the alternative hypothesis "one-sided" - like "there is a difference - distribution A has a greater mean than distribution B". This increases the statistical power (the P-value is roughly halved)
In R
> wilcox.test(ERA_width, ERB_width)
Wilcoxon rank sum test with continuity correction
data: ERA_width and ERB_width
W = 151243.5, p-value = 0.05551
alternative hypothesis: true location shift is not equal to 0
> wilcox.test(ERA_width, ERB_width, alternative="greater")
Wilcoxon rank sum test with continuity correction
data: ERA_width and ERB_width
W = 151243.5, p-value = 0.02776
alternative hypothesis: true location shift is greater than 0
# what happens if alternative="less"?
> wilcox.test(ERA_width, ERB_width, alternative="less")
Wilcoxon rank sum test with continuity correction
data: ERA_width and ERB_width
W = 151243.5, p-value = 0.9723
alternative hypothesis: true location shift is less than 0
Dangers with “hypothesis
fiddling”
• This should never be used to get “twice
as good P-value” without some
justification
• Homework for Tuesday: read BMJ
1994;309:248 (linked at the web page)
for discussion on when this is
appropriate
Testing many samples at once
• Sometimes, we have a table of groups, and
we want to see if any of the groups are
different in means
         treatment1  treatment2  treatment3
plant1      ...         ...         ...
plant2      ...         ...         ...
plant3      ...         ...         ...
etc
• This is the typical one-way Anova test
situation
boxplot (Sepal.Width~Species)
Is there a real
difference?
Last time we checked this, we just tested setosa vs versicolor
In R, one-way ANOVA tests can be made easy by the oneway.test() command
oneway.test() works by giving it a formula, in our case
Sepal.Width~Species
This will basically give the function a "table" with three columns of values, one for each species.
It will then test if the mean of any of these columns is different from the others
Test flowers set
> d<-read.table("flowers.txt", h=T)
> attach(d)
oneway.test(Sepal.Width~Species)
One-way analysis of means (not assuming equal variances)
data: Sepal.Width and Species
F = 45.012, num df = 2.000, denom df = 97.402, p-value = 1.433e-14
As expected, very, very significant.
Is the difference in length of sepals (not widths) even
more significant? Test it!
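A one-line sketch for that follow-up - compare the reported P-values:
> oneway.test(Sepal.Length~Species)   # the same test on lengths instead of widths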
Non-parametric alternative to
Anova
As with t-tests, Anova tests assume that
each of the columns has a normal-like
distribution. This is often not the case
The Kruskal-Wallis test is analogous to the Wilcoxon test - a non-parametric variant. Practically, it works exactly the same way as oneway.test().
                            2 data sets     >2 data sets
Normal distribution         t.test()        oneway.test()
Not normal distribution
(non-parametric tests)      wilcox.test()   kruskal.test()
Larger exercise
In the data folder, there is file gene_lengths.txt,
showing gene lengths from three different
chromosomes.
Is there a significant difference in the center
points (means or medians) of gene lengths
for the different chromosomes? What is the
method of choice? What two chromosomes
have the most/least significant difference?
> d<-read.table("gene_lengths.txt", h=T)
> attach(d)
> names(d)
[1] "id"          "chr"         "gene_length"
#first - is the gene length distribution normal-shaped?
> hist(gene_length)
# nope!
# => nonparametric tests
# needed
# many more than two groups,
# and non-parametric test needed =>kruskal wallis test
> kruskal.test(gene_length~chr)
Kruskal-Wallis rank sum test
data: gene_length by chr
Kruskal-Wallis chi-squared = 131.7681, df = 2, p-value <
2.2e-16
# very significant difference!
#are all chromosome pairs significant?
# test case: use wilcox.test on two chromosomes
> wilcox.test(gene_length[chr=="chr2"] ,
gene_length[chr=="chr3"])
Wilcoxon rank sum test with continuity correction
data: gene_length[chr == "chr2"] and gene_length[chr ==
"chr3"]
W = 1068584, p-value = 0.1822
alternative hypothesis: true location shift is not equal to 0
# so, all pairs are not significantly different!
      chr1     chr2     chr3
chr1  1
chr2  2.2E-16  1
chr3  2.2E-16  0.1822   1
#for us who are lazy - we can do all vs all tests like this:
> pairwise.wilcox.test(gene_length, g=chr )
Pairwise comparisons using Wilcoxon rank sum test
data: gene_length and chr
     chr1   chr2
chr2 <2e-16 -
chr3 <2e-16 0.18
P value adjustment method: holm
Small homework I: Reading about one- and two-sided tests
What is horribly wrong in this (published) study?
Small homework (II)
What is the statistical problem with
making these pairwise tests en masse,
for instance making 1000 t-tests and
reporting all cases where P<0.05?
Hint: Go back to the slides about what the
5% P value cutoff really means
Homework and Examination
Big homework for examination
goes online this evening!
Examination
• Three homeworks during the course
– These will together give 50% of the final points
– Main focus here is technical skill (how to do analysis x)
– Theoretically optional. Very strict deadlines.
• One large homework at the end of the course
– Will give 50% of the final points
– Overlaps substantially with the three first, but is more high-level
– Main focus here is to merge technical skill with biologically motivated analysis - real science
• Do the math... the three smaller homeworks pay off
• Note: A high level of activity is needed from the start of the course. It will be demanding! The homeworks are the examination!
• The points with homework
– Encourage study during the course
– It is easy to think that you understood
something, but when meeting the real thing…
– Only realistic way of training
– Evaluation
Rules for homework
• Working together in 2-3 person teams is allowed and encouraged, BUT
– Each person hands in a separate report
– ...which must be unique. This is usually not a problem if you write it yourself...
– If you work with others, write who you worked with on the report ("I worked with X and Y on this homework!")
• Cases of copying between students WILL be
reported. The BEST thing that can happen is
that you will get lower scores
Using material from others
• Copying text/images from other sources
– including fellow students – is
considered cheating
• You can quote sources, given that you
cite them, but this has a limit
Questions?
Correlations
• Sometimes, we do not want to show that two
things are different, but that they are
correlated.
• For instance, weight and height for individuals
are clearly different, but also highly correlated
• This is a positive correlation - but we could
also have a negative correlation
• Correlations are dangerous things - it depends on how we ask questions. Easily the most misused and misunderstood part of statistics
• Correlation != Causation
Consider : Why is it that…
Truth: Compared to the
general population,
there is a strong
correlation between
being an American
football player and risk
for testicle cancer
Interpretation: American
football leads to cancer!
Consider : Why is it that…
Truth: There is an
inverse correlation
between yearly
income and running
speed
Interpretation:
carrying money
makes you slow!
Time to explore
Open the height.txt file in R, and look at
the different data types. Try to plot
various columns against each other, like
plot (mean_height, year)
What columns correlate? Can we break
up the data even more?
Try it out!
>plot (mean_weight, year )
> plot (mean_height, year )
> plot (mean_height, mean_weight )
Hmmm…obviously
there are TWO
datasets here! How
can we break this
up?
Visualize the groups with color
plot (mean_weight, year )
lines (mean_weight[gender=="F"], year[gender=="F"], col="red" )
lines (mean_weight[gender=="M"], year[gender=="M"], col="blue" )
# a typical case of OVER-plotting
Assessing correlations
• First: plot a standard x-y plot
• Second, make a correlation test:
• cor(x, y) gives the Pearson correlation coefficient between x and y, which goes from -1 (perfect negative correlation) to +1 (perfect correlation). 0 = no correlation
Test it…
> cor(mean_weight, year)
[1] 0.1688499
#BAD correlation - close to zero
What happens if we break it up by gender
as in the last plot? Test it!
> cor (mean_weight[gender=="M"], year[gender=="M"] )
[1] 0.9527568
> cor (mean_weight[gender=="F"], year[gender=="F"] )
[1] 0.931249
# very, very correlated!
Important: Pearson correlation gives a “score”, not a
significance value. Obviously, it is easier to get a good
correlation if you only have a few values!
Often, we also want to TEST whether the correlation is there. The null hypothesis is then 0 correlation
This is done with the cor.test() command
> cor.test (mean_weight, year )
Pearson's product-moment correlation
data: mean_weight and year
t = 0.5934, df = 12, p-value = 0.5639
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3973253 0.6419208
sample estimates:
cor
0.1688499
> cor.test (mean_weight[gender=="F"],
year[gender=="F"] )
Pearson's product-moment correlation
data: mean_weight[gender == "F"] and year[gender
== "F"]
t = 5.7147, df = 5, p-value = 0.002293
alternative hypothesis: true correlation is not equal to
0
95 percent confidence interval:
0.5965244 0.9900208
sample estimates:
cor
0.931249
• Note that cor and cor.test are not the same thing
• cor gives a coefficient - how well the data is described by a linear correlation
• cor.test tests if that coefficient is 0 or not. If you have enough data, you don't need a nice cor value to get a small P-value. So cor.test does not test if the correlation is linear or not
Making a simple linear model fit
• If there is a correlation, we often want to fit a simple model to it
– the classical y = kx + m
• R does this with the lm() function, which can also do a lot more complex things
Here, we need to use the formula construct that we
used before, so
lm (y~x) # make a model where x describes the
outcome y
This will return a lm “object”, which holds all kinds of
information - the model, how well it fitted the data, etc
Some caveats
Here, we deal with linear regression - one
of the simplest usages of lm()
lm() can do a lot of other funky things that
are “advanced manual”
Some necessary math
In linear regression, we are fitting the response (y) to the data (x) by a linear function
y = kx + m
where k is the "slope", and m the intercept (high school math!)
How to make a linear regression between mean weight and years, only females:
lm( year[gender=="F"] ~ mean_weight[gender=="F"] )

Call:
lm(formula = year[gender == "F"] ~ mean_weight[gender == "F"])

Coefficients:
               (Intercept)  mean_weight[gender == "F"]
                  1616.551                       6.297
(this is the y = kx + m from before: m = 1616.551, k = 6.297)
Plotting a fit
# first, plot the actual data (x = mean_weight, y = year):
plot (mean_weight[gender=="F"], year[gender=="F"], col="red")
# then make the fit - note the switch of year and mean_weight (the formula is y ~ x)!
my_model <-lm ( year[gender=="F"] ~ mean_weight[gender=="F"] )
abline(my_model, col="red")
Challenge…
We made this plot
previously - weight
vs year (females in
red, males in blue)
Now, make one
linear fit for each
gender and
display them in the
same figure
(example on next
page)
attach(d)
> plot (mean_weight, year )
> lines (mean_weight[gender=="F"], year[gender=="F"], col="red" )   # draws on top of the plot
> lines (mean_weight[gender=="M"], year[gender=="M"], col="blue" )  # draws on top of the plot
> females_model <-lm ( year[gender=="F"] ~ mean_weight[gender=="F"] )
> males_model <-lm ( year[gender=="M"] ~ mean_weight[gender=="M"] )
> abline(females_model, col="orange" )
> abline(males_model, col="cyan" )
How good is a fit?
A whole science in itself!
Here, we will focus on R^2:
R = the Pearson correlation coefficient from before (-1 to +1)
R^2 is just R squared (so, from 0 to 1).
We can access these, and a LOT of other things, by saying
summary(my_fitting)
A horrible amount of data...

Call:
lm(formula = year[gender == "F"] ~ mean_weight[gender == "F"])

Residuals:
      1       2       3       4       5       6       7
-2.3856  2.4661 -4.0161 -0.7568 -1.2754  0.7246  5.2432

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                1616.551     66.311  24.378 2.17e-06 ***
mean_weight[gender == "F"]    6.297      1.102   5.715  0.00229 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.409 on 5 degrees of freedom
Multiple R-Squared: 0.8672, Adjusted R-squared: 0.8407
F-statistic: 32.66 on 1 and 5 DF, p-value: 0.002293
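If you only want the R^2 numbers, they can be pulled out of the summary object - a sketch (females_model is the fit from the challenge above):
> s<-summary(females_model)
> s$r.squared       # Multiple R-squared
> s$adj.r.squared   # Adjusted R-squared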
Count statistics
• Often, we have ONLY categorical data, such as "number of genes of category Y"
• Typically, these are situations where you have a table where all the cells add up to all cases or experiments
• This is called a contingency table, or a matrix
Matrices
• A matrix is almost the same as a data frame - a table
• The only difference is that all data are of only ONE DATATYPE - only numbers, only characters or only logicals
• Here, we will mostly deal with numbers
• How to create a matrix from some values:
– matrix(c(2,10, 5, 8,7,4), nrow=2)
Number of rows
matrix(c(2,10, 5, 8,7,4), nrow=2)
[,1] [,2] [,3]
[1,] 2 5 7
[2,] 10 8 4
By default, fills in the matrix column by column…
> matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T)
[,1] [,2] [,3]
[1,] 2 10 5
[2,] 8 7 4
But can do it by row also!
Want an extra row or column?
• The cbind() command adds an extra column to a matrix. Very handy at times! It can be understood as "bind a column to my matrix". So, if we want to add another column to our matrix:
> m<- matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T) # just as before
> m
     [,1] [,2] [,3]
[1,]    2   10    5
[2,]    8    7    4
> m2<-cbind(m, c(9,7))
> m2
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
Challenge
1) Figure out what the rbind() command does compared to cbind()
• Using these two commands, go from this matrix
     [,1] [,2] [,3]
[1,]    2   10    5
[2,]    8    7    4
to this
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5
2) If your final matrix is called m3, test the following:
rbind(m3, 1)
rbind(m3, 1:3)
What is happening and why?
Albin's try
m2<-cbind(m, c(9,7)) # add the column = a vector of size 2
> m2
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
> m3<-rbind(m2, c(2,1,4,5)) # add a row - this is now of size 4, since we added a column above
> m3
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5
> rbind(m3, 1)
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5
[4,]    1    1    1    1
> rbind(m3, 1:3)
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5
[4,]    1    2    3    1
Warning message:
number of columns of result is not a multiple of vector length (arg 2) in: rbind(m3, 1:3)
Dangerous feature of many R commands - the vector you supply is reused if possible!! In example 1, it is used 4 times. In example 2 it is used 1.33 times (hence the warning)
Indexes in matrices
• Works exactly as in data frames!
• m2[x, y] gives the cell in the x row and y
column
• m2[, y] gives the whole column y
• Etc…
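A quick sketch with the m2 matrix from above:
> m2[2, 3]   # row 2, column 3: gives 4
> m2[, 2]    # the whole 2nd column: 10 7
> m2[1, ]    # the whole 1st row: 2 10 5 9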
We can also add annotation to the matrix…
Often useful to give rows and columns names, similarly to a dataframe
> m<- matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T, dimnames =
    list(c("first_player", "second_player"), c("score1", "score2", "score3")))
> m
              score1 score2 score3
first_player       2     10      5
second_player      8      7      4
The list() construct we have not used so far - view it just as syntax at this stage
This matrix is called a 2x3 contingency table
2x2 contingency table - a
common situation in biology
• Say that we have two types of genes, and two types
of annotation features, for instance “related to
cancer”. Is one of the groups more cancer-related?
                 Cancer-related   Not cancer-related
Genes of type A              40                   60
Genes of type B               6                    4

A (40%) vs B (60%)... but the B group is very small. Are the proportions significantly different - could this arise just by chance?
Important!
• Contingency tables must have
MUTUALLY EXCLUSIVE cells!
• As in all other tests, we are assuming that samples are independently drawn!
Fisher's Exact Test
• Tests "proportions" in a 2x2 table by permutations
• Actually, the null hypothesis is "no relation between rows and columns"
my_matrix<-matrix(c(40, 60, 6, 4), nrow=2, byrow=T)
> fisher.test(my_matrix)
Fisher's Exact Test for Count Data
data: my_matrix
p-value = 0.3152
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.08719973 2.02538340
sample estimates:
odds ratio
0.4478183
NOT significant... Maybe the size of the B set is too small?
• Test the following tables

                 Cancer-related   Not cancer-related
Genes of type A              40                   60
Genes of type B              60                   40

                 Cancer-related   Not cancer-related
Genes of type A            5100                 4900
Genes of type B            4900                 5900
• Test the following tables

P=0.00706
                 Cancer-related   Not cancer-related
Genes of type A              40                   60
Genes of type B              60                   40

P<2.2E-16
                 Cancer-related   Not cancer-related
Genes of type A            5100                 4900
Genes of type B            4900                 5900
Bonus take-home message
• Significance is due to
– Actual signal (like 60% vs 40%) strength
– How much data you have
• So, you can get significance by
– Having big signal on small data
– Having weak signal but lots of data
– Or some combination
• This is an issue with genome-wide data,
where you often have TONS of data
Going from typical vectors to
contingency tables
• Say that we have typical data files with categorical data ("female", "male"):

Gender  Car_owner
M       N
F       Y
M       Y
F       Y
M       N
M       Y
M       Y

            Owns car (Y)   Does not own car (N)
Female (F)             2                      0
Male (M)               3                      2

• We often want to reshape this into a contingency table: how many males have a car, how many do not, etc
The table() function
…makes this very easy
> gender<-c("M", "F", "M", "F", "M", "M", "M")
> car_owner<-c("N", "Y", "Y", "Y", "N", "Y", "Y")
> table(gender, car_owner)
# important: gender and car should not be continuous values
      car_owner
gender N Y
     F 0 2
     M 2 3
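The resulting table can be fed straight into the count tests from before - a sketch:
> tab<-table(gender, car_owner)
> fisher.test(tab)   # works directly on the contingency table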
The chi-square test
• chisq.test()
• Similar to Fisher Exact test, but less
computationally intensive: we can use if for
any size of a matrix.
• Sensitive to small counts, especially zeroes will warn about this
• Almost indentical in usage to fisher.test()
Let's add some dimensions...
Is the blood group distribution significantly different between Europeans and Africans? Test this with both fisher.test and chisq.test

Blood group   Africa   Europe
A                 58       90
B                 34       16
AB                 9        8
0                100       86
# horrible mistake ahead... can you spot it?
> a<-matrix(c(58, 34, 9, 100 , 90, 16, 8, 86), nrow=2 )
> chisq.test(a)

        Pearson's Chi-squared test
data: a
X-squared = 194.352, df = 3, p-value < 2.2e-16

# the fix: the values must be filled in column by column, four per column
> a<-matrix(c(58, 34, 9, 100 , 90, 16, 8, 86), ncol=2 )
> a
     [,1] [,2]
[1,]   58   90
[2,]   34   16
[3,]    9    8
[4,]  100   86
> chisq.test(a)

        Pearson's Chi-squared test
data: a
X-squared = 14.5091, df = 3, p-value = 0.002288

> fisher.test(a)

        Fisher's Exact Test for Count Data
data: a
p-value = 0.002048
alternative hypothesis: two.sided
When to use Fisher/Chi?
• The Fisher exact test is "exact" - it tests all possible permutations, while chi-square is an approximation
• Chi-square has problems with 0-count bins
• Chi-square is less computationally demanding - fisher will run out of memory if the set is too large
Some handy functions on
matrices
apply()
Applies a function to all rows or columns in a matrix; it works like this:
apply(X, MARGIN, FUN, ...)
X is the matrix
MARGIN selects rows or columns (1=rows, 2=columns, c(1,2)=both)
FUN is the function, like "mean"
Some examples
> a
     [,1] [,2]
[1,]   58   90
[2,]   34   16
[3,]    9    8
[4,]  100   86
> apply(a, 2, sum)   # column-wise
[1] 201 200
> apply(a, 1, sum)   # row-wise
[1] 148 50 17 186
> apply(a, 2, sort)
     [,1] [,2]
[1,]    9    8
[2,]   34   16
[3,]   58   86
[4,]  100   90
> apply(a, 1, sort)
     [,1] [,2] [,3] [,4]
[1,]   58   16    8   86
[2,]   90   34    9  100
Note that we lost the order between columns here!
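A side note: for plain sums there are also dedicated shortcuts (a sketch):
> colSums(a)   # the same as apply(a, 2, sum)
[1] 201 200
> rowSums(a)   # the same as apply(a, 1, sum)
[1] 148 50 17 186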
Writing custom functions in R
• As you have found out, sometimes we
have to write a lot in R to get things
done
• We often save the things you write in a
separate file and do copy paste
• Sometimes this gets awkward –
especially if you are repeating
something very many times with small
changes
Example 1
• Say that we have 30 experiments, where each is a vector in a dataframe
• We want to have the mean, max and
min value of each of those, and plot the
distribution of the values
• That’s a LOT of typing
• Instead, we’ll write a function that does
this for us
• Defining your own function in R
– Let's make a function that calculates x^2 + x + 1 for any given x
my_function<- function(x) {
x^2 + x + 1
}
The keyword "function" tells R that we define a function; my_function is the function name; x is the argument to the function; the "code" of the function is anything within the { }
#define it
my_function<- function(x) {
x^2 + x + 1
}
# run it
my_function(10)
[1] 111
my_function(5)
[1] 31
Challenges
1) Remake the function so that it calculates x^3 + x^2 + x + 1
2) Make a new function that calculates the sum 1+2+3... etc up to x
3) Remake this function to calculate the sum from y to x (where y also is an input)
> my_function<- function(x) {
x^3 +x^2 + x + 1
}
> # run it
> my_function(10)
[1] 1111
> my_function<- function(x) {
sum(1:x)
}
> # run it
> my_function(10)
[1] 55
> my_function<- function(x,y) {
sum(y:x)
}
> # run it
> my_function(5,10)
[1] 55
> my_function<- function(y, x) {
sum(y:x)
}
>
> my_function(10)
Error in my_function(10) : element 1 is empty;
the part of the args list of ':' being evaluated
was:
(y, x)
> my_function(5,10)
[1] 45
> my_function(9,10)
[1] 19
Arguments
• Sometimes we want default inputs
• We can set defaults to arguments, like this
> my_function<- function(y=5,x=10){
sum(y:x)
}
> my_function()
[1] 45
> my_function(5,10)
[1] 45
> my_function(5,11)
[1] 56
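With defaults set, named arguments let you override just one of them - a small sketch:
> my_function(x=20)   # y keeps its default 5, so sum(5:20)
[1] 200
> my_function(y=2)    # x keeps its default 10, so sum(2:10)
[1] 54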
Iterators/loops
Often, we want to “walk over” a set of
numbers, a set of vectors, and repeat
an analysis
Like applying the mean_and_max function to 30 experiments, one by one
Or, in the flowers.txt file, to plot all
columns, each in a separate picture
The for() loop
Best described by example – this will print all values in the vector
1:5
for(i in 1:5){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Challenge
• Rewrite this program to write out the
square root of each value in the vector x
my_function<-function(x){
for(i in x){
print(i)
}
}
my_function<-function(x){
for(i in x){
print (sqrt(i))
}
}
Return values
Most functions in R return a value (or something more complex) instead of writing on the screen
There are two ways of doing this:
1) By default, R will return the last statement evaluated
2) We can instead force returning something by saying return(whatever). If R returns something it will also end the function (!)
Examples
my_function<- function(x){
x^2 + x + 1
}
# run it
my_function(10)
[1] 111
my_function(5)
[1] 31
my_function<- function(x) {
my_val <- x^2 + x + 1
return (my_val)
}
# run it
my_function(10)
[1] 111
my_function(5)
[1] 31
Hardest challenge so far...
• Rewrite this program to RETURN – not
print - a vector which is the square root
values of x (where x also is a vector)
my_function<-function(x){
for(i in x){
print (sqrt(i))
}
}
#look at next slide before continuing
Why is this the wrong solution:
my_function<-function(x){
for(i in x){
return(sqrt(i))
}
}
my_function(c(2,3,4))
[1] 1.414214 # we should get three values
my_function<-function(x){
for(i in x){
return(sqrt(i))
}
}
We need to somehow build a new vector, and fill that in with the
square roots as the for loop progresses
Hint: we can append some value to a vector by saying
my_vector<-c(my_vector, somevalue)
my_function<-function(x){
my_vector<-c(); # empty vector
for(i in x){
my_value<-(sqrt(i))
# append this to the vector
my_vector<-c(my_vector, my_value)
# and repeat
}
return(my_vector)
}
my_function(c(2,3,4))
[1] 1.414214 1.732051 2.000000
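For the record, an aside: many R functions are already vectorized, so in real code this particular loop can be skipped (a sketch):
> sqrt(c(2,3,4))           # sqrt works element-wise on a whole vector
[1] 1.414214 1.732051 2.000000
> sapply(c(2,3,4), sqrt)   # sapply applies a function to each element
[1] 1.414214 1.732051 2.000000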
Extensions to R
• R has many “add-on” packages for
specific analyses
• Some of these you already have, but
they are not accessible unless you “ask”
for them
• Others you need to download
• Bioconductor, for array analysis and more, is such an add-on. We will use it later in the course
Testing out Lattice graphics
• A different type of plotting method: basically a "conditional plot"
• This is a rewrite of R's standard plot() command - called xyplot()
Testing…
my_data<-read.table("flowers.txt", h=T)
names(my_data)
attach(my_data)
# complex command ahead!
xyplot(Sepal.Length~Sepal.Width|Species )
# won't work! Why?
Need to "activate" the library first
• library(lattice)
xyplot(Sepal.Length~Sepal.Width|Species)
help(package="lattice") will give an overview
xyplot(Sepal.Length~Sepal.Width|Species )
Interpreted as
Plot length as a function of width,
broken up by species
Free-form exercise/homework
Using reference sheet and exploration of
the drop-down menus in R, figure out
how to get, install and run the barplot2()
command in the gplots package (it is
NOT in your computer)
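One possible route (a sketch - it assumes an internet connection; the drop-down menus achieve the same):
> install.packages("gplots")   # fetch the package from CRAN
> library(gplots)              # activate it
> ?barplot2                    # read its help page, then try it out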
Now what?
• With the commands and concepts we
have looked at so far, you can do quite
a lot
• But, we have only scratched the R
surface
• Best to view the content so far as a
“driving license”: now you are capable
of exploring by yourself
• We will now move on to using R in a genomics context