R basics
Albin Sandelin

Ultra-short R introduction

Most life scientists use spreadsheet programs for analysis - like Excel or Statistica. Why? Ease of use:
- click buttons, select data by hand
- you see the data in front of you
- you can do limited programming

Disadvantages
• Hard to handle large datasets (>100 points)
• Inflexible, few analyses available
• "Dumbed-down" click-button effect
• Hard to repeat analyses systematically with new data

• R is a computational environment somewhere between a program and a programming language
• No buttons, no wizards: only a command-line interface
• It is a professional statistics toolset - likely has all the analyses you will ever need in your career
• It is also a programming language
• Can handle large datasets
• Very powerful graphics
• The state-of-the-art statistics program for bioinformatics, especially array analysis
• Free, and open source!

In this course, we will use R for
- Exploring data
- Selecting subsets of data
- Plotting data
- Statistical tests on data
- ...etc = almost everything. It is the most important tool in this course!

R is hard
R is challenging to learn:
- No visual cues
- Very different from spreadsheets
- Hard to find the method to use
- Cryptic manual pages (at times)
It requires a lot of effort to get inside the system, but bear in mind that it is worth it: you gain an edge over all spreadsheet users.

Reading material
• Keep the reference sheet handy
• Save code from lecture exercises in a text file
• PowerPoint slides are available after lectures at the web page
• I can also recommend the tutorials on the CRAN website - this is how I learned R
• The Dalgaard book is very good, also for future reference

Expected:
• You will have R open at all times during this course
• We will do MANY smaller exercises during class, most often in pairs
• Understanding these exercises is the key to succeeding in the exam
• We will have at least two teachers present at all lectures - use us to get help when needed
• NB: It is YOUR responsibility to make sure that you understand the examples/exercises - if not, you have to ask for clarification during the walkthrough of the problems

First challenge of the day
• Start R
• Type demo(graphics)
• Hit enter a few times

Starting with R
We will now walk through the most basic R concepts. We will add successive detail on this - with many exercises - and in a while get over to plotting, basic statistics, and finally some programming.

Challenging?
• Those who know R or have a math background might find most things for today not-so-challenging
• We cover this to keep everyone on the same ground

Vectors
• In statistics, we are almost always dealing with several "data points"
• A vector is a collection of numbers and/or strings:
  - ("albin", "sanne", "jens")
  - (1, 5.2, 10, 7, 2, 21)
  - (3)
• The last example is a vector of length 1

In R, we make a vector with the c() command (c for concatenate):

> c(1, 5, 10, 7, 2, 1)    # c is the method name, the values are the method arguments
[1] 1 5 10 7 2 1          # the method output is also called the "return value(s)"

> c('albin', 'sanne', 'jens')
[1] "albin" "sanne" "jens"

When we input strings or characters, we have to surround them with " or '.

If making vectors of size 1, we can skip c():

> 3
[1] 3

Challenge:
1) Make the following vector: 45, 5, 12, 10
2) What happens with the following commands?
c(1:100)
c(50:2)
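A possible solution sketch for the challenge above (the variable name v is just an arbitrary choice):

v<-c(45, 5, 12, 10)   # 1) the requested vector
c(1:100)              # 2) prints the integers 1, 2, ..., 100 (the c() is redundant here)
c(50:2)               #    counts downwards: 50, 49, ..., 2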
A vector is a data structure, and the most fundamental in R: almost everything in R is some kind of vector, although sometimes in several dimensions - vectors within vectors.

The reference sheet is your friend!
• You will get overwhelmed by different command names fast
• Use the reference sheet to remind yourself in all exercises

Assignment to memory
• The c() command is almost useless in itself - we want to keep the vector for other analyses
• The assignment concept:

> 4+5      # add 4 and 5
[1] 9
> a<-4     # store 4 as "a"
> b<-5     # store 5 as "b"
> a        # just checking
[1] 4
> b
[1] 5
> a+b      # add a+b (4+5)
[1] 9      # yes, works

Anything after a # will be disregarded by R. Used for making comments.

Expanding this to a whole vector:

my_vector<-c(1,5,10,7,2)

Note that there is no "return value" now - it is caught by my_vector. What happens if you type my_vector after this?

my_vector is a variable, with the variable name my_vector. Variable names are totally arbitrary!

The anatomy of the vector:

Name:   my_vector
Values:  1    5   10    7    2
Index:  [1]  [2]  [3]  [4]  [5]

We can access part of the vector like this:

my_vector[5]   # will give the 5th item in the vector

What happens if you do this?

my_vector<-c(1,5,10,7,2)   # define the vector
my_vector[c(1,3,5)]
my_vector[1:4]
my_vector[4:1]

> my_vector[c(1,3,5)]
[1] 1 10 2
> my_vector[1:4]
[1] 1 5 10 7
> my_vector[4:1]   # also works backwards
[1] 7 10 5 1

Making a series of integers: A:B will give a vector of A, A+1, A+2, ..., B.

Challenge
• Using the reference sheet, figure out at least three ways of making R print your vector in the other direction

my_vector<-c(1,5,10,7,2)   # define the vector
> my_vector[c(5,4,3,2,1)]
[1] 2 7 10 5 1
> my_vector[c(5:1)]
[1] 2 7 10 5 1
> rev(my_vector)
[1] 2 7 10 5 1

Interlude: naming rules and the danger of over-writing
Naming: we can name vectors almost anything. The most basic rule is: never start a vector name with a number.

> a<-c(1,5,4,2)    # OK
> 1a<-c(1,5,4,2)   # NOT OK
Error: syntax error
> a1<-c(1,5,4,2)   # OK

Over-writing:

my_vector<-c(1,5,10,7,2)
my_vector<-c(10,5,2,3,1)   # what does my_vector contain now?

Analyzing vectors
• Many functions work directly on vectors - most have logical names
• For instance, length(my_vector) gives the number of items in the vector (= 5)
• Challenge:
  - Make a vector big_vector with values 1 to 10000 using the a:b construct
  - What is the
    • length of the vector?
    • sum of all items in the vector: the sum() function
    • mean (= average) of all items in the vector: the mean() function

> big_vector<-(1:10000)
> length(big_vector)
[1] 10000
> sum(big_vector)
[1] 50005000
> mean(big_vector)
[1] 5000.5

Interlude: the R help system
• R has hundreds of functions, and most have a non-trivial syntax. The easiest way of retrieving documentation is by saying

?function_name            # OR
help(function_name)       # OR
help.search("something")  # for 'fuzzy' searches

• Look at the help for sample() and sort() and then try them out on big_vector
• Also try out RSiteSearch("sample") - what happens?
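One possible way to try out the two functions together (a sketch; the sample size of 5 is an arbitrary choice):

sample(big_vector, 5)        # draw 5 random elements from big_vector, in random order
sort(sample(big_vector, 5))  # the same draw, sorted in increasing order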
Adding and multiplying a number to a vector
Sometimes we want to add a number, like 10, to each element in the vector:

> big_vector + 10

Test this:

big_vector2<-big_vector + 10

Also test:

min(big_vector)
max(big_vector)
min(big_vector2)
max(big_vector2)

What happens?

> big_vector2<-big_vector + 10
> min(big_vector)
[1] 1
> max(big_vector)
[1] 10000
> min(big_vector2)
[1] 11
> max(big_vector2)
[1] 10010

Adding vectors
• We can also add one vector to another vector
• Say that we have the three vectors

A<-c(10, 20, 30, 50)
B<-c(1, 4, 2, 3)
C<-c(2.5, 3.5)

• Test what happens and explain the outcome:

A+B
A+C

> A+B
[1] 11 24 32 53
> A+C
[1] 12.5 23.5 32.5 53.5

A+B is easy to understand: A[1]+B[1], etc.
A+C is trickier - the C vector is just of length 2. It is reused! So:

A[1]+C[1]=12.5
A[2]+C[2]=23.5
A[3]+C[1]=32.5
A[4]+C[2]=53.5

Actually, this is what is happening also with A+10 ... the 10 is used many times.

Plotting vectors
• Let's make up some semi-random data:

dat<-rnorm(100)   # draw 100 random, normally distributed data points

• Test the following:

plot(dat)
plot(dat, type="l")
barplot(dat)
hist(dat)

• What is the difference? Why are your three first plots different than mine? Why is your last plot more similar to mine?

Graph options
• Generally, you can give extra options to graphical commands like this:

plot(some_vector, col="blue", type="l")

• In plot: try to vary the following options - note that you can use several at once (and figure out what they do):

type="l", "b"
col="hotpink"
main="important plot"
type="h"
type="S"

These options are really arguments to the plot() function.

More about functions
In almost all cases, a function needs some input, like plot(dat). 'dat' here is an unnamed argument, and this works because plot() assumes we mean x values = dat. We could also say plot(x=dat) - a named argument.
If you have many arguments, most of them are named - such as plot(some_vector, col="blue", type="s"). The help pages will tell you what type of arguments you can use.

The danger of unnamed arguments...
• ...is that their order will make a big difference. Try this out - what is the difference between the plot commands?

a<-rnorm(100)
b<-rnorm(100)*2
plot(a,b)
plot(b,a)
plot(x=b, y=a)

(plots: plot(a,b) puts a on the x axis and b on the y axis; plot(b,a) and plot(x=b, y=a) both put b on the x axis and a on the y axis)

Some generic R arguments to plots - the par() function
The par() function is used to set general plot properties. It has hundreds of possible arguments - see ?par.
Two very handy par() arguments are mfrow and mfcol - these allow many plots on one page.
You give these arguments a vector of length 2: this gives the number of cells on the page (see example).

Example:
> par(mfrow=c(3,1))
> plot(a,b)
> plot(b,a)
> plot(x=b, y=a)

> par(mfrow=c(2,2))
> plot(a,b)
> plot(b,a)
> plot(x=b, y=a)

Challenge - can you get the three plots in a row using mfrow?

> par(mfrow=c(1,3))
> plot(a,b)
> plot(b,a)
> plot(x=b, y=a)

Overlaying plots
• Sometimes we want to put many data sets within one graph, on top of each other
• This is often done with the lines() or points() command, like this:

> plot(b, type="l", col="blue")
> lines(a, col="red")

Why did I start with plotting b? What would have happened if using points() instead of lines()?
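A sketch of the points() variant, for comparison (same data as above):

> plot(b, type="l", col="blue")
> points(a, col="red")   # overlays individual markers instead of a connected line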
Sizing graphs
A simple concept, but awkward to write:
Change X scale: xlim=c(start_value, end_value)
Change Y scale: ylim=c(start_value, end_value)

> plot(a, type="l", col="blue")
> plot(a, type="l", col="blue", ylim=c(-5,5))

Saving graphs
• Different on different systems!
• All systems can use the original, tricky way with the device system - see the introduction PDFs if you have an interest. We won't use this.

OSX
• Click on the graph, and just copy it. It will become a pdf or a bitmap when pasting.
• On my Mac you have to open a new file in Preview and paste there first. On others you can paste it into other programs directly.
• Or, just have the graphics window active and do File->Save.

Windows
• Right-click on the graph, copy as metafile or bitmap, paste.

Linux
• Use the awkward device system, like pdf(...), or
• Use GIMP and just make a screen capture from there by "acquire".

Statistics part 1
• Describing the data: descriptive statistics
• What is the center point of the data?
• What is the spread of the data?
• How does the distribution look?

Summary statistics
• hist() (= histogram) is a graphical way of summarizing distributions - it creates a number of "bins" and counts how many of the data points fall into each bin.
• We can also summarize by the center points in the data:
  - mean()
  - median(): sort all the data, pick the number in the center. If the number of data points is even, take the mean of the two center points.
• Challenge:
  We make another vector

dat2<-rnorm(10)

  and add a few extra points to it:

dat2<-c(dat2, 10, 10.5, 30)

  Test mean() and median() on dat2. Are they the same? Can you explain the differences by plotting a histogram? What is the advantage/disadvantage of each measure?

> dat2<-rnorm(10)
> dat2<-c(dat2, 10, 10.5, 30)
> median(dat2)
[1] 0.11125
> mean(dat2)
[1] 3.636139
> hist(dat2)

Means are sensitive to outliers! A very common situation in genomics...

Percentiles
• An extension of the median concept
• Best explained by example: the 20th percentile is the value (or score) below which 20 percent of the observations may be found.
• The median is the same as the 50th percentile
• The first quartile is the 25th percentile, the third is the 75th
• Try summary(dat) and summary(dat2)

> summary(dat)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-2.20600 -0.60480 -0.06930 -0.01333  0.68030  3.49600
> summary(dat2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.4010 -0.5385  0.1113  3.6360  0.6771 30.0000

The command ecdf() (empirical cumulative distribution) calculates "all" percentiles in your data - and it also understands plot(). Try:

plot(ecdf(dat2))

(plot: x axis = sorted values of dat2; y axis = the fraction of the data that has been covered at point x)

Boxplots
• An "easier" representation of ECDFs, based on drawing boxes that tell us about both the center point and the "spread" of the data
• First, calculate the first quartile, the median and the third quartile
• Calculate the "inter-quartile range" (IQR): 3rd quartile - 1st quartile
• These will be used to draw a "box"

> boxplot(dat)
> rug(dat, side=2)

(plot: a box spanning the 1st to the 3rd quartile, with the median as the middle line; the box height is the interquartile range, IQR)

...continued
• Sounds more complicated than it is: any data observation which lies more than 1.5*IQR lower than the first quartile is considered an outlier. Indicate where the smallest value that is not an outlier is by a vertical tic mark or "whisker", and connect the whisker to the box via a horizontal line. Do the same for higher values.

(plot: the whiskers reach the most extreme data points that are still within 1.5*IQR from the box; more extreme data points are drawn individually as outliers)

Variance, standard deviation and data spread
• What is the difference between these distributions? Same mean and median, but different spread over the x axis
• This can be measured by the variance of the data: basically the average of the squared differences between each point and the mean
• The standard deviation is simply the square root of the variance
• This gives nice statistical features that you will get to know if you study statistics
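A minimal sketch of the variance formula in R code, next to the built-in functions (x is just made-up data):

x<-rnorm(50)                          # some made-up data
sum((x - mean(x))^2)/(length(x)-1)    # the sample variance formula, written out
var(x)                                # gives the same number
sqrt(var(x))                          # equals sd(x)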
Why are variance and standard deviation important?
• Variance tells you something about the quality of measurements
• The higher the variance, the harder it is to say with certainty that two measurements are different
• (whiteboard demo)
• What are the R functions for variance and standard deviation?
• If we make some random data:

smallset<-rnorm(100)
largeset<-rnorm(10000)

What is the variance and standard deviation for these? Is the standard deviation really the square root of the variance (what is the function for square root?)

> smallset<-rnorm(100)
> largeset<-rnorm(10000)
> var(smallset)
[1] 0.9504884
> var(largeset)
[1] 0.98416

Why do we get about the same variance?

> sd(largeset)
[1] 0.9920484
> sd(smallset)
[1] 0.97493
> sqrt(var(smallset))
[1] 0.97493

Dealing with real data
• ...involves interacting with messy data from other people
• Almost all interesting data is "real"
• We need methods to get the data into R to start with

Reading data from files
• Typically, you have data in a .txt or .xls file - how do you get it into R for analysis?
• Four steps:
  1. Format it to a tab-separated text file
  2. Tell R what directory to work in
  3. Read in the file
  4. Attach the data to memory (optional)

What R likes:
• A "data frame" (= table) with columns. The first row can be column names - the "header" - like this:

Lecturer  height  age  Poker_games_won
Albin     181     34   14
Jens      189     31   10
Sanne     170     NA   21

(The first column is string data, the others numerical data; NA is the code for missing data. Do not ask ladies for age...)

Reading data (1)
• General rules:
  - Use tabs as separators
  - Each row has to have the same number of columns
  - Missing data is NA, not empty
• As .txt: the ideal format!
• As .xls: save as tabbed text

Get some data to play with
• Update: the old file is wrong! Please download the data again - see the course web page!
• http://people.binf.ku.dk/albin/teaching/htbinf/R/
• Download the zipped file - save it at a good place that you know where it is
• I will refer to this place on your hard drive as "the data directory"

Reading data (2)
• Tell R where to work:
  - In R, you always work in a given directory, called the "working directory"
  - You can set this graphically in Windows or MacOS (via the "Change working directory" menu item), or use setwd("some directory")
  - In Linux, it pays to start R in the same directory
• It pays to have one directory per larger analysis, and to call the data files informative things (not "Data.txt")
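A small sketch of the command-line route (the path below is hypothetical - substitute your own data directory):

> getwd()                      # check the current working directory
> setwd("/path/to/data_dir")   # point R at your data directory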
> my_data<-read.table("flowers.txt", h=T) # note to self: show name completion feature > names(my_data) [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" Sepal.Length Sepal.Width 5.1 3.5 1.4 0.2 4.9 3.0 1.4 0.2 4.7 3.2 1.3 0.2 4.6 3.1 1.5 0.2 5.0 3.6 1.4 0.2 Etc… Petal.Length Petal.Width Species setosa setosa setosa setosa setosa Accessing the columns • There are three ways of getting the data in a data frame – Get the column by saying dataframe_name$column_name – Get the columns as separate vectors, named as their header - the attach command – More complex queries, using index operations (more in a few slides) Columns by names • The attach(some_data_frame) function will put all the columns of some_data_frame as vectors in working memory • So, in our case • attach(my_data) • Will give the 5 vectors • "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" and "Species" • (which we then can analyze) Dangers of the attach() function • Attach will make a vector out each column, named after the column name • What happens if we already have a vector with that name? • What does ls() and detach() do? Why would we want to do this? Challenge • Using the flowers file, – Plot a histogram of the length of sepals (“Sepal.Length”) – Plot the correlation between sepal length and sepal width (an x-y plot) • What is the mean of Sepal.Length and Sepal.Width? – Why do you get strange results on Sepal.Length? – Can you solve the problem by using the appropriate help file? > hist(Sepal.Length) > plot(Sepal.Length, Sepal.Width) > mean(Sepal.Width) [1] 3.057333 > mean(Sepal.Length) [1] NA > If we look at the file: Sepal.Length Sepal.Width 5.1 3.5 1.4 0.2 4.9 3.0 1.4 0.2 NA 3.2 1.3 0.2 4.6 3.1 1.5 0.2 5.0 3.6 1.4 0.2 5.4 3.9 1.7 0.4 4.6 3.4 1.4 0.3 # strange! Petal.Length Petal.Width Specie setosa setosa setosa setosa setosa setosa setosa Missing values • In real data, we often have missing values. In R, we label these “NA” • Any examples of missing data? • Some functions can by default deal with these - for instance hist() • Some functions will need to be told what to do with them, like mean() > mean(Sepal.Length) [1] NA #na.rm is the option for mean to remove all NAs first > mean(Sepal.Length, na.rm=T) [1] 5.851007 With new files, you should check whether there are any NA values first. What if we wanted just the 5 first rows? Or just all values from the species versicolor? Sepal.Length 5.1 3.5 4.9 3.0 4.7 3.2 4.6 3.1 5.0 3.6 6.2 2.9 5.1 2.5 5.7 2.8 6.3 3.3 5.8 2.7 7.1 3.0 6.3 2.9 Sepal.Width 1.4 0.2 1.4 0.2 1.3 0.2 1.5 0.2 1.4 0.2 4.3 1.3 3.0 1.1 4.1 1.3 6.0 2.5 5.1 1.9 5.9 2.1 5.6 1.8 Petal.Length setosa setosa setosa setosa setosa versicolor versicolor versicolor virginica virginica virginica virginica Petal.Width Species The index concept • Recap: • In a vector, we can select individual values by saying • Sepal.Width[some position] > Sepal.Width[5] [1] 3.6 • Sepal.Width[some_positions] > Sepal.Width[2:10] [1] 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 Sepal.Width[c(2,4,6,8)] [1] 3.0 3.1 3.9 3.4 • The first position will always have the number 1, the last will have the length of the vector. • Challenge: what is the mean of the first 100 values of Sepal.Width? • • • • • • > a<-Sepal.Width[1:100] > mean(a) [1] 3.099 # or, in just one line > mean(Sepal.Width[1:100]) [1] 3.099 Selecting parts of vectors depending on values • Often, we want parts of vector due to the values of the vector -for instance all values <10, or all strings equaling “versicolor”. 
Selecting parts of vectors depending on values
• Often, we want parts of a vector based on the values of the vector - for instance all values <10, or all strings equaling "versicolor".
• This is done with a slightly odd syntax (it will make sense later on when we have many vectors)
• To get all values over 10 in Sepal.Length:

Sepal.Length[ Sepal.Length > 10 ]

• So, we are using a "criterion" instead of the index

Try out
• Make a boxplot of all values in Sepal.Length that are higher than 5
• How many values are there that satisfy this criterion?

Albin's code:

a<-Sepal.Length[Sepal.Length > 5]
boxplot(a)
length(a)
[1] 119

Criteria in R
<    smaller than
>    bigger than
>=   bigger than or equal to
<=   smaller than or equal to
==   equal to (not "=")
!=   not equal to

• Challenge: how many rows for the virginica species?

> length(Species[Species==virginica])
Error in NextMethod("[") : object "virginica" not found
# what happens here is that R thinks virginica is an object - a vector or something else
> length(Species[Species=="virginica"])
[1] 50

For string comparisons, we need to enclose the string with " " or ' '.

Selection using two vectors
• Often we have two vectors from a file, and we want to use the values of one vector to select values from the other.
• For instance, sepal length broken up by species:

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5.0           3.6          1.4           0.2          setosa
6.2           2.9          4.3           1.3          versicolor
5.1           2.5          3.0           1.1          versicolor
5.7           2.8          4.1           1.3          versicolor
6.3           3.3          6.0           2.5          virginica
5.8           2.7          5.1           1.9          virginica

• Use the same syntax as before - but change the vector names!

some_vector[ other_vector == something ]

• Challenge:
  - Select the Sepal.Width values that correspond to the setosa species
  - What is the mean of these values?
  - What is the distribution of these values?

Albin's code:

> a<-Sepal.Width[Species=="setosa"]
> mean(a)
[1] 3.428
> hist(a)

Fancy way of breaking up data (only works with some functions)

boxplot(Sepal.Width~Species)

• Try it out...
• This is a "formula" construction - break up width by the levels of species

Even more complex: multiple criteria
• & "and": combines two criteria
• | "or": either this or that has to be true

So,

Species[ Sepal.Width > 4 & Sepal.Length < 5 ]

will give the species rows where width > 4 AND length < 5.

...and what will

Species[ Sepal.Width > 4 | Sepal.Length < 5 ]

give?

Challenge: make a boxplot of Sepal.Width where Sepal.Length > 3 AND the species is virginica

> boxplot(Sepal.Width[Sepal.Length > 3 & Species=="virginica"])
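A related sketch: logical criteria can also be counted directly, since TRUE counts as 1 in sums:

> sum(Sepal.Width > 4)                        # how many rows satisfy the criterion
> sum(Sepal.Width > 4 & Species=="setosa")    # works for combined criteria too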
More complex data selections
• The data frame is really a two-dimensional vector - a vector that contains vectors
• We can use almost the same commands to select subsets of the data frame, without using the attach() function
• This is very handy at times
• The data frame is really a big table - a 2D vector
• Just as some_vector[5] gives us the fifth value in a vector, some_dataframe[10, 2] gives us the cell in the 10th row and the 2nd column

> my_data[10, 2]
[1] 3.1

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5.0           3.6          1.4           0.2          setosa
5.4           3.9          1.7           0.4          setosa
4.6           3.4          1.4           0.3          setosa
5.0           3.4          1.5           0.2          setosa
4.4           2.9          1.4           0.2          setosa
4.9           3.1          1.5           0.1          setosa
5.4           3.7          1.5           0.2          setosa
4.8           3.4          1.6           0.2          setosa
4.8           3.0          1.4           0.1          setosa

Selecting rows
• If we instead say my_data[10, ] we will get the WHOLE 10th row back
• One can think of it as saying my_data[10, "give me everything"]
• So, what will happen if we say

my_data[1:5, ]

or

my_data[5:1, ]

• Try it out...

Indexing exercise
• In the data directory, there is a file called Rindexing.pdf - it is a set of small exercises designed to help you understand indexing
• It works with a very small data file, just so that you can check your results by eye
• Indexing is very important for the rest of the course

Nota bene:
1) Students on the reserve list
• Please talk to me to get signed on to the course (I need your student number or similar)
• Otherwise you cannot go to the exam => no credit
2) A watch was found in the last lecture - it will be given back if you can describe it well

Sorting one vector by another
• A very common situation with experimental data - what are the genes with the highest expression in my data set?
• Recap: what happened when you said my_data[1:5, ] or my_data[5:1, ]?

# get some data to play with
d<-read.table("mtcar.txt", h=T)
attach(d)

Just by looking at the data, d: what car has the highest and lowest horsepower (=hp)?

   car_name             mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
1  Mazda RX4            21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4
2  Mazda RX4 Wag        21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4
3  Datsun 710           22.8  4    108.0  93   3.85  2.320  18.61  1   1   4     1
4  Hornet 4 Drive       21.4  6    258.0  110  3.08  3.215  19.44  1   0   3     1
5  Hornet Sportabout    18.7  8    360.0  175  3.15  3.440  17.02  0   0   3     2
6  Valiant              18.1  6    225.0  105  2.76  3.460  20.22  1   0   3     1
7  Duster 360           14.3  8    360.0  245  3.21  3.570  15.84  0   0   3     4
8  Merc 240D            24.4  4    146.7  62   3.69  3.190  20.00  1   0   4     2
9  Merc 230             22.8  4    140.8  95   3.92  3.150  22.90  1   0   4     2
10 Merc 280             19.2  6    167.6  123  3.92  3.440  18.30  1   0   4     4
11 Merc 280C            17.8  6    167.6  123  3.92  3.440  18.90  1   0   4     4
12 Merc 450SE           16.4  8    275.8  180  3.07  4.070  17.40  0   0   3     3
13 Merc 450SL           17.3  8    275.8  180  3.07  3.730  17.60  0   0   3     3
14 Merc 450SLC          15.2  8    275.8  180  3.07  3.780  18.00  0   0   3     3
15 Cadillac Fleetwood   10.4  8    472.0  205  2.93  5.250  17.98  0   0   3     4
16 Lincoln Continental  10.4  8    460.0  215  3.00  5.424  17.82  0   0   3     4
17 Chrysler Imperial    14.7  8    440.0  230  3.23  5.345  17.42  0   0   3     4
18 Fiat 128             32.4  4    78.7   66   4.08  2.200  19.47  1   1   4     1
19 Honda Civic          30.4  4    75.7   52   4.93  1.615  18.52  1   1   4     2
20 Toyota Corolla       33.9  4    71.1   65   4.22  1.835  19.90  1   1   4     1
21 Toyota Corona        21.5  4    120.1  97   3.70  2.465  20.01  1   0   3     1
22 Dodge Challenger     15.5  8    318.0  150  2.76  3.520  16.87  0   0   3     2
23 AMC Javelin          15.2  8    304.0  150  3.15  3.435  17.30  0   0   3     2
24 Camaro Z28           13.3  8    350.0  245  3.73  3.840  15.41  0   0   3     4
25 Pontiac Firebird     19.2  8    400.0  175  3.08  3.845  17.05  0   0   3     2
26 Fiat X1-9            27.3  4    79.0   66   4.08  1.935  18.90  1   1   4     1
27 Porsche 914-2        26.0  4    120.3  91   4.43  2.140  16.70  0   1   5     2
28 Lotus Europa         30.4  4    95.1   113  3.77  1.513  16.90  1   1   5     2
29 Ford Pantera L       15.8  8    351.0  264  4.22  3.170  14.50  0   1   5     4
30 Ferrari Dino         19.7  6    145.0  175  3.62  2.770  15.50  0   1   5     6
31 Maserati Bora        15.0  8    301.0  335  3.54  3.570  14.60  0   1   5     8
32 Volvo 142E           21.4  4    121.0  109  4.11  2.780  18.60  1   1   4     2
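A sketch of letting R answer the question directly, instead of scanning the table by eye:

> car_name[hp == max(hp)]   # highest horsepower: Maserati Bora (hp 335)
> car_name[hp == min(hp)]   # lowest horsepower: Honda Civic (hp 52)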
Permutations

sort.list(x) will give the permutation of x that sorts x. It is much easier to explain with a small example...

> a<-c(3,1,10)
> sort.list(a)
[1] 2 1 3
> b<-sort.list(a)
> a[b]
[1] 1 3 10

> sort.list(hp)
 [1] 19  8 20 18 26 27  3  9 21  6 32  1  2  4 28 10 11 22 23  5 25 30 12 13 14 15 16 17  7 24 29 31

Thought exercise
What happens if we now say

hp[c(19, 8, 20, 18, 26, 27, 3, 9, 21, 6, 32, 1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12, 13, 14, 15, 16, 17, 7, 24, 29, 31)]

?

We get hp sorted...

> hp[c(19, 8, 20, 18, 26, 27, 3, 9, 21, 6, 32, 1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12, 13, 14, 15, 16, 17, 7, 24, 29, 31)]
 [1]  52  62  65  66  66  91  93  95  97 105 109 110 110 110 113 123 123 150 150 175 175 175 180 180 180 205 215 230 245 245 264 335

So, sort.list really gives an instruction for how to sort. So, we can also say:

sort_order<-sort.list(hp)
hp[sort_order]
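Side note (beyond the slides): for a single vector, order() does the same job as sort.list(), and is the name you will most often see in R code:

> sort_order<-order(hp)
> hp[sort_order]   # hp sorted, ascending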
What will happen if we say car_name[sort_order]?

> car_name[sort_order]
 [1] Honda Civic         Merc 240D           Toyota Corolla      Fiat 128            Fiat X1-9           Porsche 914-2       Datsun 710
 [8] Merc 230            Toyota Corona       Valiant             Volvo 142E          Mazda RX4           Mazda RX4 Wag       Hornet 4 Drive
[15] Lotus Europa        Merc 280            Merc 280C           Dodge Challenger    AMC Javelin         Hornet Sportabout   Pontiac Firebird
[22] Ferrari Dino        Merc 450SE          Merc 450SL          Merc 450SLC         Cadillac Fleetwood  Lincoln Continental Chrysler Imperial
[29] Duster 360          Camaro Z28          Ford Pantera L      Maserati Bora
32 Levels: AMC Javelin Cadillac Fleetwood Camaro Z28 Chrysler Imperial Datsun 710 Dodge Challenger Duster 360 Ferrari Dino Fiat 128 ... Volvo 142E

So, we use the hp permutation to reorder the cars!

We can also do this with the whole dataframe, in the same way:

> d[sort_order,]
   car_name             mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
19 Honda Civic          30.4  4    75.7   52   4.93  1.615  18.52  1   1   4     2
8  Merc 240D            24.4  4    146.7  62   3.69  3.190  20.00  1   0   4     2
20 Toyota Corolla       33.9  4    71.1   65   4.22  1.835  19.90  1   1   4     1
18 Fiat 128             32.4  4    78.7   66   4.08  2.200  19.47  1   1   4     1
26 Fiat X1-9            27.3  4    79.0   66   4.08  1.935  18.90  1   1   4     1
27 Porsche 914-2        26.0  4    120.3  91   4.43  2.140  16.70  0   1   5     2
3  Datsun 710           22.8  4    108.0  93   3.85  2.320  18.61  1   1   4     1
9  Merc 230             22.8  4    140.8  95   3.92  3.150  22.90  1   0   4     2
21 Toyota Corona        21.5  4    120.1  97   3.70  2.465  20.01  1   0   3     1
6  Valiant              18.1  6    225.0  105  2.76  3.460  20.22  1   0   3     1
32 Volvo 142E           21.4  4    121.0  109  4.11  2.780  18.60  1   1   4     2
1  Mazda RX4            21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4
2  Mazda RX4 Wag        21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4
4  Hornet 4 Drive       21.4  6    258.0  110  3.08  3.215  19.44  1   0   3     1
28 Lotus Europa         30.4  4    95.1   113  3.77  1.513  16.90  1   1   5     2
10 Merc 280             19.2  6    167.6  123  3.92  3.440  18.30  1   0   4     4
11 Merc 280C            17.8  6    167.6  123  3.92  3.440  18.90  1   0   4     4
22 Dodge Challenger     15.5  8    318.0  150  2.76  3.520  16.87  0   0   3     2
23 AMC Javelin          15.2  8    304.0  150  3.15  3.435  17.30  0   0   3     2
5  Hornet Sportabout    18.7  8    360.0  175  3.15  3.440  17.02  0   0   3     2
25 Pontiac Firebird     19.2  8    400.0  175  3.08  3.845  17.05  0   0   3     2
30 Ferrari Dino         19.7  6    145.0  175  3.62  2.770  15.50  0   1   5     6
12 Merc 450SE           16.4  8    275.8  180  3.07  4.070  17.40  0   0   3     3
13 Merc 450SL           17.3  8    275.8  180  3.07  3.730  17.60  0   0   3     3
14 Merc 450SLC          15.2  8    275.8  180  3.07  3.780  18.00  0   0   3     3
15 Cadillac Fleetwood   10.4  8    472.0  205  2.93  5.250  17.98  0   0   3     4
16 Lincoln Continental  10.4  8    460.0  215  3.00  5.424  17.82  0   0   3     4
17 Chrysler Imperial    14.7  8    440.0  230  3.23  5.345  17.42  0   0   3     4
7  Duster 360           14.3  8    360.0  245  3.21  3.570  15.84  0   0   3     4
24 Camaro Z28           13.3  8    350.0  245  3.73  3.840  15.41  0   0   3     4
29 Ford Pantera L       15.8  8    351.0  264  4.22  3.170  14.50  0   1   5     4
31 Maserati Bora        15.0  8    301.0  335  3.54  3.570  14.60  0   1   5     8

What type of data are we looking at?
• "Type" here refers to the kind of data representation - for example "strings" or "numbers"
• R usually does a good job at interpreting this for us when we read data
• This becomes important later on, as some functions only take vectors of a certain data type - for instance, you cannot take the mean() of a vector of "words"!

Simple types
• Numbers:

num_vector<-c(1,5,2,10)

• Strings (or characters):

String_vector<-c("a", "trial1")

• How can we know?

> class(num_vector)
[1] "numeric"
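Side note (beyond the slides): str() is a quick way to see the type of every column in a data frame at once:

> str(my_data)   # compactly lists each column with its type and first few values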
Factors/groups
• Factors: slightly more tricky. A factor is often a grouping variable - such as the gender or experiment group, like

gender<-c('f', 'f', 'm', 'm', 'f')
experiment_number<-c(1,2,4,5,5,6)

• The problem here is that R will think that the first vector is a string vector, and the last a numerical vector.

> class(gender)
[1] "character"
> class(experiment_number)
[1] "numeric"

• This is resolved by:

> experiment_number<-as.factor(c(1,2,4,5,5,6))
> class(experiment_number)
[1] "factor"

Over to analysis of data!

boxplot(Sepal.Width~Species)

Is there a real difference?

We now move over from "pure R" to "using R to make statistical tests"
• Focus is: what typical tests are there and how to use them
• We skip almost all statistical theory(!) covered in the course in fall

Population and sample
• Population refers to a set of potential measurements or values, including not only cases actually observed but those that are potentially observable. It really depends on the question at hand!
• Sample is the data we actually have - generally this is very small compared to the whole population, but we use it to draw inferences about the population
• So, the sample is a small part of a much larger population, in most cases
• Can sample == population? Examples?

Significance tests
• In real life, we always get some difference between datasets, because we sample limited data, and there is always noise
• So, is the difference in sepal width we see between species something that is just due to "luck" - or is it real?
• Challenge: What three features of the dataset will be important for finding this out (significant difference, or just difference due to random noise)? Discuss with your sideman for a minute

Albin's take
• The observed difference
  - Obviously, the bigger the observed difference, the less random it is
• The data depth
  - The more data you have, the less likely it is that strange things happen due to random effects
• Data spread (variance)
  - Slightly less intuitive. The less spread, the less it can be explained by random events.

The concept of null hypothesis and P-values
• Always in statistics, we set a null hypothesis. This is basically what we want to disprove!
• Usually, this is something like "there is no difference in mean in these two samples". The alternative hypothesis is then, by default, "there is a difference"
• Then, we test how likely this is, with certain assumptions (depending on what test)
• The P-value of a test can be interpreted as the chance that the observed difference occurs if the null hypothesis is true
  - This only makes sense if your test is relevant!
• So, a small P-value means that the null hypothesis is not a good model - which means that there is a difference
• We often want P-values to be as small as possible - the standard cutoff for "significance" is 5%, or 0.05
• However, this is just tradition - it is completely arbitrary

Two-sample t-test
• [deliberate skip of math details - Jens will cover some later on]
• The t-test tests if two samples come from distributions with the same mean.
• The assumption is that both distributions are "normal-shaped" - we should always check this first

(figure: a normal distribution annotated with its mean and its standard deviation - a measure of spread)

Normality - brief check
• There are many tests for normality. Now we will do the simplest: by eye. Make a histogram and see if the data looks reasonably "bell-shaped".
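Side note (beyond the slides): R also has a formal normality test, if eyeballing feels too loose - a sketch:

> shapiro.test(Sepal.Length[Species=="setosa"])   # a small P-value is evidence against normality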
• Challenge: Try to make histograms of the sepal length of each species

Albin's take

# warning: wrong approach
hist(Sepal.Length)   # not very normal-shaped!
# why is this the wrong approach?

Albin's take, again

# now for each species
par(mfrow=c(1,3))
hist(Sepal.Length[Species=="setosa"])
hist(Sepal.Length[Species=="versicolor"])
hist(Sepal.Length[Species=="virginica"])
# IMHO, these are not very far away from normal-shaped given the amount of data

Slightly more advanced way - the quantile-quantile plot
• If two distributions have the same shape, their cumulative distributions should match well

(plots: plot(ecdf(rnorm(50))) and plot(ecdf(Sepal.Length)) side by side - two S-shaped cumulative curves, Fn(x) against x)
(plots: qqnorm(Sepal.Length) and qqnorm(Sepal.Length[Species=="setosa"]) - "Normal Q-Q Plot", theoretical quantiles on the x axis against sample quantiles on the y axis)

So, are they different?

> setosa<-Sepal.Length[Species=="setosa"]
> versicolor<-Sepal.Length[Species=="versicolor"]
> t.test(setosa, versicolor)

        Welch Two Sample t-test

data:  setosa and versicolor
t = -10.521, df = 86.538, p-value < 2.2e-16   # Very!
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1057074 -0.7542926
sample estimates:
mean of x mean of y
    5.006     5.936

Let's test it with some real data
In your data directory, there are two files with binding site lengths, from estrogen receptor alfa and beta: ERA_widths.txt, ERB_widths.txt
1) Read these into two different vectors, called era, erb
2) Do they have significantly different means?

Albin's code

> d<-read.table("ERA_widths.txt", h=T)
> attach(d)
> d<-read.table("ERB_widths.txt", h=T)
> attach(d)
> t.test(ERA_width, ERB_width)

        Welch Two Sample t-test

data:  ERA_width and ERB_width
t = 2.7288, df = 1066.975, p-value = 0.00646
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 11.18740 68.45565
sample estimates:
mean of x mean of y
 514.1370  474.3155

Hey, we forgot to check for normality!

> par(mfrow=c(1,2))
> hist(ERA_width)
> hist(ERB_width)
# NOT really normal-shaped! The t.test is not valid!

The two-sample Wilcoxon test
• The t-test assumes that samples are normally distributed, which is often not true for biological data
• The Wilcoxon test, wilcox.test(), is a less powerful test which does not make such assumptions
• Tests of this kind are called non-parametric
• Use it when the normality assumption is doubtful

Challenge
• Repeat the analysis of the ERA/ERB sites using wilcox.test(). Is the difference significant?

> wilcox.test(ERA_width, ERB_width)

        Wilcoxon rank sum test with continuity correction

data:  ERA_width and ERB_width
W = 151243.5, p-value = 0.05551
alternative hypothesis: true location shift is not equal to 0

The "AHH, so close to significance" situation.
Different types of alternative hypothesis
Null hypothesis: "there is no difference".
Default alternative hypothesis: "there is a difference". This is called "two-sided", as the differences can be in either direction.
Optional: we can make the alternative hypothesis "one-sided" - like "there is a difference: distribution A has a greater mean than distribution B".
This increases the statistical power (for these tests, the one-sided P-value is half the two-sided one).

In R

> wilcox.test(ERA_width, ERB_width)

        Wilcoxon rank sum test with continuity correction

data:  ERA_width and ERB_width
W = 151243.5, p-value = 0.05551
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(ERA_width, ERB_width, alternative="greater")

        Wilcoxon rank sum test with continuity correction

data:  ERA_width and ERB_width
W = 151243.5, p-value = 0.02776
alternative hypothesis: true location shift is greater than 0

# what happens if alternative="less"?
> wilcox.test(ERA_width, ERB_width, alternative="less")

        Wilcoxon rank sum test with continuity correction

data:  ERA_width and ERB_width
W = 151243.5, p-value = 0.9723
alternative hypothesis: true location shift is less than 0

Dangers of "hypothesis fiddling"
• This should never be used to get a "twice as good P-value" without some justification
• Homework for Tuesday: read BMJ 1994;309:248 (linked at the web page) for discussion on when this is appropriate

Testing many samples at once
• Sometimes, we have a table of groups, and we want to see if any of the groups are different in means:

        treatment1  treatment2  treatment3
plant1
plant2
plant3
etc

• This is the typical one-way ANOVA test situation

boxplot(Sepal.Width~Species)

Is there a real difference? Last time we checked this, we just tested setosa vs versicolor.

In R, one-way ANOVA tests can be made easy by the oneway.test() command.
oneway.test() works by giving it a formula, in our case Sepal.Width~Species. This will basically give the function a "table" with three columns of values, one for each species. It will then test if the means of any of these columns are different from the others.

Test the flowers set

> d<-read.table("flowers.txt", h=T)
> attach(d)
> oneway.test(Sepal.Width~Species)

        One-way analysis of means (not assuming equal variances)

data:  Sepal.Width and Species
F = 45.012, num df = 2.000, denom df = 97.402, p-value = 1.433e-14

As expected, very, very significant.

Is the difference in length of sepals (not widths) even more significant? Test it!
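A possible answer sketch for the sepal-length question (the same formula construct, with the other column):

> oneway.test(Sepal.Length~Species)
# compare the resulting F statistic and P-value with the ones for Sepal.Width above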
Non-parametric alternative to ANOVA
As with t-tests, ANOVA tests assume that each of the columns has a normal-like distribution. This is often not the case.
The Kruskal-Wallis test is analogous to the Wilcoxon test - a non-parametric variant. Practically, it works exactly in the same way as oneway.test().

                              2 data sets     >2 data sets
Normal distribution           t.test()        oneway.test()
Not normal distribution
(non-parametric tests)        wilcox.test()   kruskal.test()

Larger exercise
In the data folder, there is a file gene_lengths.txt, showing gene lengths from three different chromosomes.
Is there a significant difference in the center points (means or medians) of gene lengths for the different chromosomes?
What is the method of choice?
Which two chromosomes have the most/least significant difference?

> d<-read.table("gene_lengths.txt", h=T)
> attach(d)
> names(d)
[1] "id"          "chr"         "gene_length"

# first - is the gene length distribution normal-shaped?
> hist(gene_length)
# nope! => non-parametric tests needed
# many more than two groups, and a non-parametric test needed => Kruskal-Wallis test
> kruskal.test(gene_length~chr)

        Kruskal-Wallis rank sum test

data:  gene_length by chr
Kruskal-Wallis chi-squared = 131.7681, df = 2, p-value < 2.2e-16

# very significant difference!
# are all chromosome pairs significant?
# test case: use wilcox.test on two chromosomes
> wilcox.test(gene_length[chr=="chr2"], gene_length[chr=="chr3"])

        Wilcoxon rank sum test with continuity correction

data:  gene_length[chr == "chr2"] and gene_length[chr == "chr3"]
W = 1068584, p-value = 0.1822
alternative hypothesis: true location shift is not equal to 0

# so, all pairs are not significantly different!

      chr1  chr2     chr3
chr1  1     2.2E-16  2.2E-16
chr2        1        0.1822
chr3                 1

# for us who are lazy - we can do all-vs-all tests like this:
> pairwise.wilcox.test(gene_length, g=chr)

        Pairwise comparisons using Wilcoxon rank sum test

data:  gene_length and chr

     chr1   chr2
chr2 <2e-16
chr3 <2e-16 0.18

P value adjustment method: holm

Small homework I
Reading about one- and two-sided tests: what is horribly wrong in this (published) study?

Small homework II
What is the statistical problem with making these pairwise tests en masse, for instance making 1000 t-tests and reporting all cases where P<0.05?
Hint: go back to the slides about what the 5% P-value cutoff really means.

Homework and examination
The big homework for examination goes online this evening!

Examination
• Three homeworks during the course
  - These will together give 50% of the final points
  - Main focus here is technical skill (how to do analysis X)
  - Theoretically optional. Very strict deadlines.
• One large homework at the end of the course
  - Will give 50% of the final points
  - Overlaps substantially with the three first, but is more high-level
  - Main focus here is to merge technical skill with biologically motivated analysis - real science
• Do the math... the three smaller homeworks pay off
• Note: a high level of activity is needed from the start of the course. It will be demanding!

The homeworks are the examination!
• The points of homework:
  - Encourage study during the course
  - It is easy to think that you understood something, but when meeting the real thing...
  - The only realistic way of training
  - Evaluation

Rules for homework
• Working together in 2-3 person teams is allowed and encouraged, BUT
  - Each person hands in a separate report
  - ...which must be unique. This is usually not a problem if you write it yourself...
  - If you work with others, write who you worked with on the report ("I worked with X and Y on this homework!")
• Cases of copying between students WILL be reported. The BEST thing that can happen is that you will get lower scores.

Using material from others
• Copying text/images from other sources - including fellow students - is considered cheating
• You can quote sources, given that you cite them, but this has a limit

Questions?

Correlations
• Sometimes, we do not want to show that two things are different, but that they are correlated
• For instance, weight and height for individuals are clearly different, but also highly correlated
• This is a positive correlation - but we could also have a negative correlation
• Correlations are dangerous things - it depends on how we ask questions. Easily the most misused and misunderstood part of statistics
• Correlation != Causation

Consider: why is it that...
Truth: Compared to the general population, there is a strong correlation between being an American football player and risk for testicle cancer.
Interpretation: American football leads to cancer!

Consider: why is it that...
Truth: There is an inverse correlation between yearly income and running speed.
Interpretation: carrying money makes you slow!
Time to explore
Open the height.txt file in R, and look at the different data types.
Try to plot various columns against each other, like

plot(mean_height, year)

What columns correlate? Can we break up the data even more? Try it out!

> plot(mean_weight, year)
> plot(mean_height, year)
> plot(mean_height, mean_weight)

Hmmm... obviously there are TWO datasets here! How can we break this up?

Visualize the groups with color

plot(mean_weight, year)
lines(mean_weight[gender=="F"], year[gender=="F"], col="red")
lines(mean_weight[gender=="M"], year[gender=="M"], col="blue")
# a typical case of OVER-plotting

Assessing correlations
• First: plot a standard x-y plot
• Second, make a correlation test:
• cor(x, y) gives the Pearson correlation coefficient between x and y, which goes from -1 (perfect negative correlation) to +1 (perfect positive correlation). 0 = no correlation

Test it...

> cor(mean_weight, year)
[1] 0.1688499
# BAD correlation - close to zero

What happens if we break it up by gender as in the last plot? Test it!

> cor(mean_weight[gender=="M"], year[gender=="M"])
[1] 0.9527568
> cor(mean_weight[gender=="F"], year[gender=="F"])
[1] 0.931249
# very, very correlated!

Important: the Pearson correlation gives a "score", not a significance value. Obviously, it is easier to get a good correlation if you only have a few values!
Often, we also want to TEST if the correlation is there. The null hypothesis is then 0 correlation.
This is done with the cor.test() command:

> cor.test(mean_weight, year)

        Pearson's product-moment correlation

data:  mean_weight and year
t = 0.5934, df = 12, p-value = 0.5639
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3973253  0.6419208
sample estimates:
      cor
0.1688499

> cor.test(mean_weight[gender=="F"], year[gender=="F"])

        Pearson's product-moment correlation

data:  mean_weight[gender == "F"] and year[gender == "F"]
t = 5.7147, df = 5, p-value = 0.002293
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5965244 0.9900208
sample estimates:
     cor
0.931249

• Note that cor() and cor.test() are not the same thing
• cor() gives a coefficient - how well is the data described by a linear correlation?
• cor.test() tests whether that coefficient is 0 or not. If you have enough data, you don't need a nice cor value to get a small P-value. So cor.test() does not test whether the correlation is linear or not.
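Side note (beyond the slides): cor() can also compute a rank-based (Spearman) coefficient, which does not assume that the relationship is linear - a sketch:

> cor(mean_weight, year, method="spearman")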
Making a simple linear model fit
• If there is a correlation, we often want to fit a simple model to it - the classical y = kx + m
• R does this with the lm() function, which can also do a lot more complex things
• Here, we need to use the formula construct that we used before, so

lm(y~x)   # make a model where x describes the outcome y

• This will return an lm "object", which holds all kinds of information - the model, how well it fitted the data, etc.

Some caveats
Here, we deal with linear regression - one of the simplest usages of lm(). lm() can do a lot of other funky things that are "advanced manual".

Some necessary math
In linear regression, we are fitting the response (y) to the data (x) by a linear function

y = kx + m

where k is the "slope", and m the intercept (high school math!).

How to make a linear regression between mean weight and years, only females:

> lm(year[gender=="F"] ~ mean_weight[gender=="F"])

Call:
lm(formula = year[gender == "F"] ~ mean_weight[gender == "F"])

Coefficients:
               (Intercept)  mean_weight[gender == "F"]
                  1616.551                       6.297

(in y = kx + m terms: m = the intercept, 1616.551; k = the slope, 6.297)

Plotting a fit

# first, plot the actual data (x, then y):
plot(mean_weight[gender=="F"], year[gender=="F"], col="red")
# then make the fit - note the switch of year and mean_weight (y~x)!
my_model<-lm(year[gender=="F"] ~ mean_weight[gender=="F"])
abline(my_model, col="red")

Challenge...
We made this plot previously - weight vs year (females in red, males in blue).
Now, make one linear fit for each gender and display them in the same figure (example on next page).

> attach(d)
> plot(mean_weight, year)
> lines(mean_weight[gender=="F"], year[gender=="F"], col="red")    # draws on top of the plot
> lines(mean_weight[gender=="M"], year[gender=="M"], col="blue")   # draws on top of the plot
> females_model<-lm(year[gender=="F"] ~ mean_weight[gender=="F"])
> males_model<-lm(year[gender=="M"] ~ mean_weight[gender=="M"])
> abline(females_model, col="orange")
> abline(males_model, col="cyan")

How good is a fit?
A whole science in itself! Here, we will focus on R²:
R = the Pearson correlation coefficient from before (-1 to +1).
R² is just R squared (so, from 0 to 1).
We can access these, and a LOT of other things, by saying summary(my_fitting).

A horrible amount of data...

> summary(females_model)

Call:
lm(formula = year[gender == "F"] ~ mean_weight[gender == "F"])

Residuals:
      1       2       3       4       5       6       7
-2.3856  2.4661 -4.0161 -0.7568 -1.2754  0.7246  5.2432

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                1616.551     66.311  24.378 2.17e-06 ***
mean_weight[gender == "F"]    6.297      1.102   5.715  0.00229 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.409 on 5 degrees of freedom
Multiple R-Squared: 0.8672,     Adjusted R-squared: 0.8407
F-statistic: 32.66 on 1 and 5 DF,  p-value: 0.002293
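Side sketch: the pieces of that summary can also be pulled out programmatically, for example the R² value:

> summary(females_model)$r.squared   # the "Multiple R-Squared" shown above, ~0.8672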
Count statistics
• Often, we have ONLY categorical data, such as "number of genes of category Y"
• Typically, these are situations where you have a table in which all the cells add up to all cases or experiments
• This is called a contingency table, or a matrix

Matrices
• A matrix is almost the same as a data frame - a table
• The only difference is that all data is of only ONE DATATYPE - only numbers, only characters, or only logicals
• Here, we will mostly deal with numbers
• How to create a matrix from some values:

matrix(c(2,10, 5, 8,7,4), nrow=2)   # nrow = the number of rows

> matrix(c(2,10, 5, 8,7,4), nrow=2)
     [,1] [,2] [,3]
[1,]    2    5    7
[2,]   10    8    4

By default, matrix() fills in the matrix column by column...

> matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T)
     [,1] [,2] [,3]
[1,]    2   10    5
[2,]    8    7    4

...but it can do it by row also!

Want an extra row or column?
• The cbind() command adds an extra column to a matrix. Very handy at times! It can be understood as "bind a column to my matrix".
So, if we want to add another column to our matrix:

> m<-matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T)   # just as before
> m
     [,1] [,2] [,3]
[1,]    2   10    5
[2,]    8    7    4
> m2<-cbind(m, c(9,7))
> m2
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7

Challenge
1)
• Figure out what the rbind() command does compared to cbind()
• Using these two commands, go from this matrix

     [,1] [,2] [,3]
[1,]    2   10    5
[2,]    8    7    4

to this:

     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5

2)
• If your final matrix is called m3, test the following:

rbind(m3, 1)
rbind(m3, 1:3)

• What is happening and why?

Albin's try

> m2<-cbind(m, c(9,7))   # add the column = a vector of size 2
> m2
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
> m3<-rbind(m2, c(2,1,4,5))   # add a row - this is now of size 4, since we added a column above
> m3
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5
> rbind(m3, 1)
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5
[4,]    1    1    1    1
> rbind(m3, 1:3)
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5
[4,]    1    2    3    1
Warning message:
number of columns of result is not a multiple of vector length (arg 2) in: rbind(m3, 1:3)

This is a dangerous feature of many R commands - the vector you supply is reused if possible!! In example 1, the 1 is used four times. In example 2, 1:3 is used 1.33 times (hence the warning).

Indexes in matrices
• Work exactly as in data frames!
• m2[x, y] gives the cell in the x-th row and y-th column
• m2[, y] gives the whole column y
• Etc...

We can also add annotation to the matrix...
It is often useful to give rows and columns names, similarly to a dataframe:

> m<-matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T, dimnames=list(c("first_player", "second_player"), c("score1", "score2", "score3")))
> m
              score1 score2 score3
first_player       2     10      5
second_player      8      7      4

The list() construct we have not used so far - view it just as syntax at this stage.
This matrix is called a 2x3 contingency table.

2x2 contingency table - a common situation in biology
• Say that we have two types of genes, and two types of annotation features, for instance "related to cancer". Is one of the groups more cancer-related?

                 Cancer-related  Not cancer-related
Genes of type A  40              60
Genes of type B  6               4

A (40%) vs B (60%)... but the B group is very small. Are the proportions significantly different - could this arise just by chance?

Important!
• Contingency tables must have MUTUALLY EXCLUSIVE cells!
• As in all other tests, we are assuming that samples are independently drawn!

Fisher's Exact Test
• Tests "proportions" in a 2x2 table by permutations
• Actually, the null hypothesis is "no relation between rows and columns"

> my_matrix<-matrix(c(40, 60, 6, 4), nrow=2, byrow=T)
> fisher.test(my_matrix)

        Fisher's Exact Test for Count Data

data:  my_matrix
p-value = 0.3152
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.08719973 2.02538340
sample estimates:
odds ratio
 0.4478183

NOT significant... Maybe the size of the B set is too small?
• Test the following tables:

                 Cancer-related  Not cancer-related
Genes of type A  40              60
Genes of type B  60              40

P = 0.00706

                 Cancer-related  Not cancer-related
Genes of type A  5100            4900
Genes of type B  4900            5900

P < 2.2E-16

Bonus take-home message
• Significance is due to
  - Actual signal strength (like 60% vs 40%)
  - How much data you have
• So, you can get significance by
  - Having a big signal on small data
  - Having a weak signal but lots of data
  - Or some combination
• This is an issue with genome-wide data, where you often have TONS of data

Going from typical vectors to contingency tables
• Say that we have typical data files with categorical data ("female", "male"):

Gender  Car_owner
M       N
F       Y
M       Y
F       Y
M       N
M       Y
M       Y

• We often want to reshape this into a contingency table: how many males have a car, how many do not, etc:

           Owns car (Y)  Does not own car (N)
Female(F)  2             0
Male(M)    3             2

The table() function
...makes this very easy:

> gender<-c("M", "F", "M", "F", "M", "M", "M")
> car_owner<-c("N", "Y", "Y", "Y", "N", "Y", "Y")
> table(gender, car_owner)   # important: gender and car should not be continuous values
      car_owner
gender N Y
     F 0 2
     M 2 3

The chi-square test
• chisq.test()
• Similar to the Fisher exact test, but less computationally intensive: we can use it for any size of matrix
• Sensitive to small counts, especially zeroes - it will warn about this
• Almost identical in usage to fisher.test()

Let's add some dimensions...
Is the blood group distribution significantly different between Europeans and Africans? Test this with both fisher.test() and chisq.test().

Blood group  Africa  Europe
A            58      90
B            34      16
AB           9       8
0            100     86

# horrible mistake ahead... can you spot it?
> a<-matrix(c(58, 34, 9, 100, 90, 16, 8, 86), nrow=2)
> chisq.test(a)

        Pearson's Chi-squared test

data:  a
X-squared = 194.352, df = 3, p-value < 2.2e-16

> a<-matrix(c(58, 34, 9, 100, 90, 16, 8, 86), ncol=2)
> a
     [,1] [,2]
[1,]   58   90
[2,]   34   16
[3,]    9    8
[4,]  100   86
> chisq.test(a)

        Pearson's Chi-squared test

data:  a
X-squared = 14.5091, df = 3, p-value = 0.002288

> fisher.test(a)

        Fisher's Exact Test for Count Data

data:  a
p-value = 0.002048
alternative hypothesis: two.sided

When to use Fisher/Chi?
• The Fisher exact test is "exact" - it tests all possible permutations, while chi-square is an approximation
• Chi-square has problems with 0-count bins
• Chi-square is less computationally demanding - fisher will run out of memory if the set is too large

Some handy functions on matrices: apply()
Applies a function to all rows or columns in a matrix. It works like this:

apply(X, MARGIN, FUN, ...)

X is the matrix
MARGIN is rows or columns (1=rows, 2=columns, c(1,2)=both)
FUN is the function, like mean

Some examples

> a
     [,1] [,2]
[1,]   58   90
[2,]   34   16
[3,]    9    8
[4,]  100   86
> apply(a, 2, sum)   # column-wise
[1] 201 200
> apply(a, 1, sum)   # row-wise
[1] 148  50  17 186

> apply(a, 2, sort)
     [,1] [,2]
[1,]    9    8
[2,]   34   16
[3,]   58   86
[4,]  100   90
> apply(a, 1, sort)
     [,1] [,2] [,3] [,4]
[1,]   58   16    8   86
[2,]   90   34    9  100

Note that we lost the order between columns here!
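Side note (beyond the slides): row and column sums are so common that R has dedicated shortcut functions:

> colSums(a)   # same result as apply(a, 2, sum)
> rowSums(a)   # same result as apply(a, 1, sum)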
Writing custom functions in R
• As you have found out, sometimes we have to write a lot in R to get things done
• We often save the things we write in a separate file and copy-paste
• Sometimes this gets awkward - especially if you are repeating something very many times with small changes

Example 1
• Say that we have 30 experiments, where each is a vector in a dataframe
• We want to have the mean, max and min value of each of those, and plot the distribution of the values
• That's a LOT of typing
• Instead, we'll write a function that does this for us

Defining your own function in R
• Let's make a function that calculates x^2 + x + 1 for any given x:

my_function<-function(x) {   # "function" tells R that we define a function;
                             # my_function is the function name, x the argument
  x^2 + x + 1                # the "code" of the function - anything within the { }
}

# define it
my_function<-function(x) { x^2 + x + 1 }
# run it
> my_function(10)
[1] 111
> my_function(5)
[1] 31

Challenges
1) Remake the function so that it calculates x^3 + x^2 + x
2) Make a new function that calculates the sum 1+2+3+... up to x
3) Remake this function to calculate the sum from y to x (where y also is an input)

> my_function<-function(x) { x^3 + x^2 + x + 1 }
> # run it
> my_function(10)
[1] 1111

> my_function<-function(x) { sum(1:x) }
> # run it
> my_function(10)
[1] 55

> my_function<-function(x,y) { sum(y:x) }
> # run it
> my_function(5,10)
[1] 45

> my_function<-function(y,x) { sum(y:x) }
> my_function(10)
Error in my_function(10) : element 1 is empty;
   the part of the args list of ':' being evaluated was: (y, x)
> my_function(5,10)
[1] 45
> my_function(9,10)
[1] 19

Arguments
• Sometimes we want default inputs
• We can set defaults for arguments, like this:

> my_function<-function(y=5, x=10) { sum(y:x) }
> my_function()
[1] 45
> my_function(5,10)
[1] 45
> my_function(5,11)
[1] 56

Iterators/loops
Often, we want to "walk over" a set of numbers or a set of vectors, and repeat an analysis:
- like applying the mean_and_max function to 30 experiments, one by one
- or, in the flowers.txt file, plotting all columns, each in a separate picture

The for() loop
Best described by example - this will print all values in the vector 1:5:

for(i in 1:5){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Challenge
• Rewrite this program to write out the square root of each value in the vector x:

my_function<-function(x){
  for(i in x){
    print(i)
  }
}

my_function<-function(x){
  for(i in x){
    print(sqrt(i))
  }
}
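Side note (beyond the slides): most arithmetic functions in R are vectorized, so for a simple case like this the loop is not strictly needed:

> sqrt(c(2, 3, 4))
[1] 1.414214 1.732051 2.000000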
Return values
Most functions in R return a value (or something more complex) instead of writing on the screen.
There are two ways of doing this:
1) Per default, R will return the last statement evaluated
2) We can instead force returning something by saying return(whatever). If R returns something, it will also end the function (!)

Examples

my_function<-function(x){
  x^2 + x + 1
}
# run it
> my_function(10)
[1] 111
> my_function(5)
[1] 31

my_function<-function(x) {
  my_val <- x^2 + x + 1
  return(my_val)
}
# run it
> my_function(10)
[1] 111
> my_function(5)
[1] 31

Hardest challenge so far...
• Rewrite this program to RETURN - not print - a vector which holds the square root values of x (where x also is a vector):

my_function<-function(x){
  for(i in x){
    print(sqrt(i))
  }
}

# look at the next slide before continuing

Why is this the wrong solution?

my_function<-function(x){
  for(i in x){
    return(sqrt(i))
  }
}

> my_function(c(2,3,4))
[1] 1.414214
# we should get three values - return() ends the function on the first lap of the loop

We need to somehow build a new vector, and fill that in with the square roots as the for loop progresses.
Hint: we can append some value to a vector by saying

my_vector<-c(my_vector, somevalue)

my_function<-function(x){
  my_vector<-c()   # empty vector
  for(i in x){
    my_value<-sqrt(i)
    # append this to the vector, and repeat
    my_vector<-c(my_vector, my_value)
  }
  return(my_vector)
}

> my_function(c(2,3,4))
[1] 1.414214 1.732051 2.000000

Extensions to R
• R has many "add-on" packages for specific analyses
• Some of these you already have, but they are not accessible unless you "ask" for them
• Others you need to download
• Bioconductor, for array analysis and more, is such an add-on. We will use this later in the course

Testing out lattice graphics
• A different type of plotting method: basically a "conditional plot"
• This is a rewrite of R's standard plot() command - called xyplot()

Testing...

my_data<-read.table("flowers.txt", h=T)
names(my_data)
attach(my_data)
# complex command ahead!
xyplot(Sepal.Length~Sepal.Width|Species)   # won't work! Why?

We need to "activate" the library first:

library(lattice)
xyplot(Sepal.Length~Sepal.Width|Species)

help(package="lattice") will give an overview.

xyplot(Sepal.Length~Sepal.Width|Species) is interpreted as: plot length as a function of width, broken up by species.

Free-form exercise/homework
Using the reference sheet and exploration of the drop-down menus in R, figure out how to get, install and run the barplot2() command in the gplots package (it is NOT on your computer).
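One possible route (a sketch - the exercise asks you to discover this yourself first):

> install.packages("gplots")   # download and install the package from CRAN
> library(gplots)              # activate it
> ?barplot2                    # read its help page, then try it out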
Now what?
• With the commands and concepts we have looked at so far, you can do quite a lot
• But we have only scratched the R surface
• Best to view the content so far as a "driving license": now you are capable of exploring by yourself
• We will now go over to using R in a genomics context

© Copyright 2024