Introduction to R

Mikkel Meyer Andersen
Department of Mathematical Sciences
Aalborg University, Denmark
Apr 21, 2015

About me

- Assistant professor in applied statistics
- Main research focus: evidential weight of lineage markers (so far: Y-STR); more will come after this introduction to R

Prerequisites

- R
- (RStudio?)

What is R?

Advanced (and basic!) calculator

    2 + 2
    ## [1] 4
    cos(2*pi/3)
    ## [1] -0.5

Graph illustrator

    curve(x^3 + 3*x^2 - 6*x - 8, from = -5, to = 3)

[Figure: curve of x^3 + 3 * x^2 - 6 * x - 8 against x, for x from -5 to 3]

Vectors

    1:10
    ##  [1]  1  2  3  4  5  6  7  8  9 10
    c(1, 5, 6)
    ## [1] 1 5 6
    c(1:3, 6:8)
    ## [1] 1 2 3 6 7 8

Matrices

    A <- matrix(1:9, nrow = 3)
    A
    ##      [,1] [,2] [,3]
    ## [1,]    1    4    7
    ## [2,]    2    5    8
    ## [3,]    3    6    9
    A[2,2] <- 1000
    A
    ##      [,1] [,2] [,3]
    ## [1,]    1    4    7
    ## [2,]    2 1000    8
    ## [3,]    3    6    9

Programming language

    x <- 0
    for (i in 1:10) {
      x <- x + i
    }
    x
    ## [1] 55
    sum(1:10)
    ## [1] 55
    double_number <- function(x) 2*x
    double_number(4)
    ## [1] 8

Statistical tool

    head(cars)
    ##   speed dist
    ## 1     4    2
    ## 2     4   10
    ## 3     7    4
    ## 4     7   22
    ## 5     8   16
    ## 6     9   10

    plot(dist ~ speed, cars)

[Figure: scatter plot of dist against speed]

    fit <- lm(dist ~ speed, cars)
    summary(fit)
    ##
    ## Call:
    ## lm(formula = dist ~ speed, data = cars)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -29.069  -9.525  -2.272   9.215  43.201
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
    ## speed         3.9324     0.4155   9.464 1.49e-12 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 15.38 on 48 degrees of freedom
    ## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438

    plot(dist ~ speed, cars)
    lines(cars$speed, fit$fitted.values)
    #abline(a = coef(fit)[1], b = coef(fit)[2])

[Figure: scatter plot of dist against speed with the fitted regression line]

Combining features

    a <- sample(1:10)
    a
    ##  [1]  3 10  8  5  9  1  6  2  4  7
    a[1]
    ## [1] 3
    a[-(1:5)]
    ## [1] 1 6 2 4 7
    sample(1:10)
    ##  [1]  6  3 10  1  2  9  4  8  7  5
    sample(1:10)
    ##  [1]  8  4  2 10  7  9  3  6  5  1

Extensible

A lot of packages (or libraries) exist.

Reproducible documents

Reproducible documents (reports, presentations, etc.): Rmarkdown. Demonstration later.

Is R difficult?

Yes. The learning curve can be steep. So is learning to talk, and we still make the climb.

My usage

- Research (simulation studies)
- Reproducible research and reports: yes, "final" datasets do change

Hands-on

Install packages

R is extensible: install a package called disclapmix

    install.packages("disclapmix")

or RStudio: Tools -> Install Packages

Search for help

- If you know the function name: ?sum
- If you do not know the function name: help.search("search phrase")
- Cheat sheet (e.g. http://cran.r-project.org/doc/contrib/Short-refcard.pdf)
- http://tryr.codeschool.com

Good practices

- Save your commands in script files (also see the History tab in RStudio)

Loading data

- Loading Excel files can be difficult to set up (but can be done)
- Use Excel to collect, edit, prepare etc. data, then export to CSV (I prefer tab as field delimiter, others comma)
- Import CSV in R
- Data cleaning can be time-consuming

Example

Supplementary Table 2 from paper: http://www.sciencedirect.com/science/article/pii/S0379073805000095 (note that the table caption is included in the file; remove it)

RStudio: Tools -> Import Dataset -> From text file (remember to check the input, options, and output, e.g. the heading)

(Command can be saved to script file for future automated loading.)
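
The import step can also be scripted directly. The following is only a minimal sketch under stated assumptions: the file is a temporary stand-in written by the example itself, its values are made up, and the name den_demo is hypothetical. It also skips the caption line with skip = 1 instead of deleting it by hand, which is one possible alternative to the manual removal suggested above.

```r
# Hypothetical stand-in for the exported file: a caption line, a header row,
# then tab-separated data (all values are made up for illustration).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Supplementary Table 2 (caption line, to be skipped)",
  "Ht\tDYS19\tDYS389 I\tN",
  "1\t13\t13\t1",
  "2\t14\t13\t2",
  "3\t15\t12\t1"
), path)

# read.delim() expects a header row and tab as field delimiter by default;
# skip = 1 drops the caption line before the header is read.
den_demo <- read.delim(path, skip = 1)

str(den_demo)    # check that the columns were read as numbers
names(den_demo)  # "DYS389 I" is turned into the valid name "DYS389.I"
sum(den_demo$N)  # total number of individuals in the toy file
```

Saving such a command in a script file makes the loading step repeatable when the "final" dataset changes.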
View dataset

RStudio: View(den)

Simple R:

    head(den)
    ##   Ht DYS19 DYS385a.b DYS389.I DYS389.II DYS390 DYS391 DYS392
    ## 1  1    13     13–16       13        29     22     10     15
    ## 2  2    13     13–19       13        30     25     10     14
    ## 3  3    13     14–23       11        27     23     11     14
    ## 4  4    13        16       14        32     24     10     11
    ## 5  5    13     16–18       13        30     24     10     11
    ## 6  6    14     10–15       13        30     23     11     13
    ##   DYS438 DYS439 N  Freq.
    ## 1     11     12 1 0.0054
    ## 2     12     12 1 0.0054
    ## 3     12     12 1 0.0054
    ## 4     10     12 1 0.0054
    ## 5     10     12 2 0.0108

Descriptive statistics

    sum(den$N)
    ## [1] 185
    dys19 <- rep(den$DYS19, den$N)
    table(dys19)
    ## dys19
    ##  13  14  15  16  17
    ##   6 113  45  18   3
    summary(dys19)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##   13.00   14.00   14.00   14.45   15.00   17.00

Locus dependency

    tab <- xtabs(N ~ DYS19 + DYS390, den)
    tab
    ##      DYS390
    ## DYS19 21 22 23 24 25 26
    ##    13  0  1  1  3  1  0
    ##    14  1 30 41 35  4  2
    ##    15  0  8  9 12 12  4
    ##    16  0  1  3  4  9  1
    ##    17  0  0  0  0  3  0
    chisq.test(tab)
    ## Warning in chisq.test(tab): Chi-squared approximation may be incorrect
    ##
    ##  Pearson's Chi-squared test
    ##
    ## data:  tab
    ## X-squared = 58.8064, df = 20, p-value = 1.088e-05

Plotting

Histogram:

    hist(dys19)

[Figure: "Histogram of dys19", Frequency against dys19 (13 to 17)]

Bar plot:

    plot(table(dys19))

[Figure: bar plot of table(dys19) against dys19 (13 to 17)]

Rmarkdown

RStudio: File -> New File -> R Markdown

Tips and tricks

- Horn of plenty: overwhelming functionality (logistic regression, support vector machines, random forests, ...)
- Print cheat sheet (e.g. http://cran.r-project.org/doc/contrib/Short-refcard.pdf)
- Plotting: ggplot2
- Data cleaning and wrangling: tidyr and dplyr
- Questions? R help, Google (R sort data frame) or Stack Overflow at http://stackoverflow.com/questions/tagged/r
- RStudio Support about learning R: https://support.rstudio.com/hc/en-us/categories/200098757-Learn-R
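
To try the descriptive-statistics and locus-dependency steps without the dataset at hand, here is a small base R sketch. The data frame hap and its values are made up for illustration (they are not the Danish data), and all object names are hypothetical. The last step answers the example search phrase "R sort data frame" using order().

```r
# Hypothetical haplotype counts (made-up values, not the dataset from the paper).
hap <- data.frame(
  DYS19  = c(14, 13, 15, 14),
  DYS390 = c(23, 24, 25, 24),
  N      = c(3, 1, 2, 4)
)

# Expand the counts to one allele per individual, then tabulate.
dys19_demo <- rep(hap$DYS19, hap$N)
table(dys19_demo)

# Cross-tabulate two loci, weighting each row by its count N.
tab_demo <- xtabs(N ~ DYS19 + DYS390, hap)
tab_demo

# Sort a data frame: order() gives the row order, here by decreasing N.
hap_sorted <- hap[order(hap$N, decreasing = TRUE), ]
hap_sorted
```

The same pattern, rep() to expand counts and xtabs() with a count on the left-hand side, is what the slides apply to den.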