Introduction to R Aedín Culhane

[email protected]
http://bcb.dfci.harvard.edu/~aedin
http://www.hsph.harvard.edu/research/aedin-culhane/
Jan 2009
Data Analysts Captivated by R’s Power
“R is really important to the point that it’s hard to overvalue it,” said Daryl
Pregibon, a research scientist at Google, which uses the software widely. “It
allows statisticians to do very intricate and complicated analyses without knowing
the blood and guts of computing systems.”
Nov 10 2010
Names You Need to Know in 2011: R Data Analysis Software
"R is rapidly augmenting or replacing other statistical analysis
packages at universities"
▫ Open source development: flexible, extensible
▫ Large number of statistical and numerical methods
▫ High quality visualization and graphical tools
▫ Extended by a very large collection of rapidly
developing packages
R
• Why is it called R?
▫ The name is partly based on the (first) names of the
first two R authors and partly a play on the name of
the Bell Labs language ‘S’
▫ Initially written by Robert Gentleman and Ross Ihaka,
Dept of Statistics, University of Auckland, New
Zealand (1996)
Short R History
1991: Ross Ihaka and Robert Gentleman begin work
on a project that will become R
1993: The first announcement of R
1995: R available by ftp
1996: A mailing list is started and maintained by
Martin Maechler at ETH
1997: The R core group is formed
2000: R 1.0.0 is released
Short R History Continued
2001: Bioconductor for the analysis and
comprehension of genomic data using R
2008: The Omegahat project to enable
connectivity between R and other languages
2010: A co-founder and former employees of SPSS
found Revolution Analytics, a company that
offers a commercial package around R.
2011: The RStudio project provides a free, open source
integrated development environment (IDE) for
R
R
R project (v2.15 April 2012)
pre v2.15 biannual release (April, October)
post v2.15 annual release (April)
Download core and contributed packages from
CRAN
Link: R Task Views
R Interface
• Default R interface
• RStudio
▫ www.rstudio.org
▫ Cross platform, Windows/Mac/Linux
• Others
▫ Notepad++, Tinn-R, Rcmdr, etc.
RStudio
• 4 windows
-Editor, Console, History,
Files/plots
• Code completion
• Easy access to help (F1)
• One step Sweave pdf generation
• Searchable history
• Keyboard Shortcuts
▫ http://www.rstudio.org/docs/using/keyboard_shortcuts
Starting with R
• The R environment is controlled by hidden files in the
startup directory: .RData, .Rhistory and .Rprofile
(optional). These are very useful.
• .Rhistory means all the commands you type can be saved
automatically
• .RData saves everything in memory (can be large, so be
careful)
• Best to save these under descriptive names using
▫ save.image(file="S01_GeneProjectMay2012.RData")
▫ save(myVec, file="S01_GeneProjectMay2012.RData")
▫ savehistory(file="S01_GeneProjectMay2012.Rhistory")
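A minimal sketch of the save/restore cycle, using base R only. The file name and the contents of myVec are made up for illustration, and a temporary directory is used so no real project files are touched.

```r
# Work in a temporary directory so nothing in a real project is overwritten.
rdata_file <- file.path(tempdir(), "S01_DemoProject.RData")  # hypothetical name

myVec <- c(2.5, 3.1, 4.8)   # illustrative data

# Save one named object rather than the whole workspace.
save(myVec, file = rdata_file)

# Remove the object, then restore it from the file.
rm(myVec)
loaded_names <- load(rdata_file)  # load() returns the names of restored objects
print(loaded_names)
print(myVec)
```

Saving named objects with save() keeps files small; save.image() writes everything currently in memory.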
Tips for projects management
• Save commands to a script myscript.R
## In R
source("myscript.R")
## Or from the command line
R CMD BATCH myscript.R
• Save scripts, S01_xxxDate.R, S02_xxxDate.R,
etc where xxx is project name
• Use Folders or Projects in RStudio
getwd()
setwd()
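The workflow above might look like the following sketch. The script name S01_demo.R and its two lines are hypothetical, and a temporary directory stands in for a project folder.

```r
# Write a tiny script into a temporary folder (stand-in for a project directory).
project_dir <- tempdir()
script_file <- file.path(project_dir, "S01_demo.R")   # hypothetical script name
writeLines(c("x <- 1:10", "total <- sum(x)"), script_file)

# Move into the project folder, run the script, then go back.
old_wd <- setwd(project_dir)   # setwd() returns the previous working directory
source("S01_demo.R")           # runs the script; x and total now exist
setwd(old_wd)

print(total)
```

Keeping the value returned by setwd() makes it easy to return to the original directory afterwards.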
Overview of Bioconductor
Bioconductor
Releases coincide with R releases.
Current: Bioconductor 2.10
(released to coincide with R 2.15)
To install, use the script on the Bioconductor website:
source("http://www.bioconductor.org/biocLite.R")
biocLite()
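Note that biocLite() is the installer for the Bioconductor 2.10 era covered by these slides. In later Bioconductor releases (3.8 onward), installation moved to the BiocManager package on CRAN. The sketch below is not part of the original slides; the install lines are commented out so that running it changes nothing, and the package names are only examples.

```r
# Sketch for newer R/Bioconductor releases (an addition, not from the slides).
# Installation lines are commented out so the block has no side effects.

# install.packages("BiocManager")             # one-time install from CRAN
# BiocManager::install()                      # update installed Bioconductor packages
# BiocManager::install(c("affy", "limma"))    # install specific packages

# Check whether BiocManager is already available in this R library.
has_biocmanager <- requireNamespace("BiocManager", quietly = TRUE)
if (has_biocmanager) {
  print(packageVersion("BiocManager"))
} else {
  message("BiocManager is not installed; see the commented lines above.")
}
```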
What Packages do I need?
Specific to your data and analysis pipeline, but for
examples see:
• Bioconductor Workshops
• Bioconductor Workflows
Packages Overview
Bioconductor web site
• Bioconductor BiocViews Task view
Software
Annotation Data
Experimental Data
Main types of Annotation Packages
• Gene centric AnnotationDbi packages:
▫ Organism: org.Mm.eg.db.
▫ Technology/Platform: hgu133plus2.db.
▫ GeneSets and Pathway (biology level): GO.db or KEGG.db
▫ .db packages can be queried with SQL or accessed using the annotation
package (toTable, get, mget)
• Genome centric GenomicFeatures packages:
▫ Transcriptome level: TxDb.Hsapiens.UCSC.hg19.knownGene
▫ Generic features: Can generate via GenomicFeatures
• biomaRt:
▫ Query web-based `biomart' resource for genes, sequence, SNPs,
etc.
• See http://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/AnnotationSlidesBioc2011.pdf
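A hedged sketch of gene centric access, assuming the org.Mm.eg.db package is installed (it is skipped otherwise). The Entrez Gene IDs are taken from the package itself rather than hard-coded, and the choice of the SYMBOL and GENENAME columns is only an illustration.

```r
# Sketch only: needs the org.Mm.eg.db annotation package (and AnnotationDbi).
have_db <- requireNamespace("org.Mm.eg.db", quietly = TRUE)

if (have_db) {
  library(org.Mm.eg.db)

  # Take a few Entrez Gene IDs straight from the package.
  ids <- head(AnnotationDbi::keys(org.Mm.eg.db, keytype = "ENTREZID"), 3)

  # Map-style access: mget() and toTable() on the Entrez-to-symbol map.
  print(mget(ids, org.Mm.egSYMBOL, ifnotfound = NA))
  print(head(toTable(org.Mm.egSYMBOL)))

  # Table-style access with select().
  print(AnnotationDbi::select(org.Mm.eg.db, keys = ids,
                              columns = c("SYMBOL", "GENENAME"),
                              keytype = "ENTREZID"))
} else {
  message("org.Mm.eg.db is not installed; skipping the annotation lookup.")
}
```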
Bioconductor resources
• Mailing List (sign up for daily digest)
• Documentation, workshop/course material online
▫ Slides from talks, pdf of tutorials, R code
• Help available for each software package
▫ Each package MUST contain a vignette (howto)
• Other resources: www.rseek.org, www.r-bloggers.com
Vignette
• Tutorials that provide a worked example of a package
• Required in Bioconductor packages
• Written in Sweave (Leisch, 2002).
▫ LaTeX dynamic reports in which R code is embedded
and executable
▫ All R code in vignette is checked (and executed) by R
CMD check
▫ http://www.bioconductor.org/docs/vignettes.html
# Load package of interest
library("Biobase")
library("GOstats")
openVignette()
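Vignettes can also be listed and opened with functions from base R's utils package. A small sketch follows; the grid package is used only because it ships with a standard R installation, and the commented lines open a viewer or browser, so they are left for interactive use.

```r
# List the vignettes shipped with one installed package.
v <- vignette(package = "grid")
print(v$results[, c("Item", "Title"), drop = FALSE])

# Open a vignette by name (opens a PDF/HTML viewer):
# vignette("grid", package = "grid")

# Browse all vignettes of a package in the web browser:
# browseVignettes(package = "Biobase")
```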
Getting Data into R &
Bioconductor
Simple Excel SpreadSheet data
• Simple table
▫ read.table()
▫ read.csv()
▫ scan()
• However, many data types have more specialized readers. See Technologies on
BiocViews.
▫ http://www.bioconductor.org/packages/release/BiocViews.html
• Large data files: also see
http://www.revolutionanalytics.com
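A minimal sketch of the three base R readers listed above. The gene and sample names and the values are invented, and the small file written to a temporary location stands in for a sheet exported from Excel as text.

```r
# Create a small tab-delimited file (stand-in for an exported spreadsheet).
expr_file <- tempfile(fileext = ".txt")
writeLines(c("gene\tsample1\tsample2",
             "geneA\t5.2\t6.1",
             "geneB\t7.8\t7.5"), expr_file)

# read.table(): state the header and separator explicitly.
expr <- read.table(expr_file, header = TRUE, sep = "\t",
                   row.names = 1, stringsAsFactors = FALSE)
print(expr)
print(dim(expr))   # 2 rows (genes) by 2 columns (samples)

# read.csv(): the same idea with comma-separated defaults.
csv_file <- tempfile(fileext = ".csv")
write.csv(expr, csv_file)
expr2 <- read.csv(csv_file, row.names = 1)

# scan(): reads a flat vector; here the fields of the first line as text.
header <- scan(expr_file, what = character(), nlines = 1,
               sep = "\t", quiet = TRUE)
print(header)
```

Setting row.names = 1 uses the first column (gene identifiers) as row names, which leaves a purely numeric table of samples.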
May 2011
Some common data types
• Microarray
• SNP
• NGS
A Microarray Overview
Reading Affymetrix Data
library(affy)
require(affy) # Alternative
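# Read the CEL files into an AffyBatch object
affybatch <- ReadAffy(celfile.path="[Location of your data]")
# Or read the CEL files in the working directory and compute RMA
# expression values in one step
eSet <- justRMA()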
Sample R code
Other Arrays
• Illumina
▫ lumi package
• 2 color spotted arrays
▫ limma package
• Other arrays
▫ http://www.bioconductor.org/help/workflows/oligo-arrays/
Next Generation Sequencing Data
Public Microarray Data
ArrayExpress
▫ 21,997 Studies (622,617 profiles)
GEO
▫ 22,735 Studies (558,074 profiles)
Statistics May 2011
R Code
More on GEOquery
require(GEOquery)
Let's try to load the GDS810 dataset which contains data on
Alzheimer's disease at various stages of severity.
GDS810<-getGEO("GDS810")
The getGEO function returns an object of class GEOData. You can
get a description of this class like this:
help("GEOData-class")
Meta(GDS810)
Columns(GDS810)
head(Table(GDS810))
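A GDS object can be converted to an ExpressionSet with GEOquery's GDS2eSet(). The sketch below wraps the steps in a helper; the name gds_to_eset and the choice of do.log2 = TRUE are assumptions made for illustration. The call itself is left commented out because it needs the GEOquery package and network access.

```r
# Sketch only: needs the GEOquery and Biobase packages plus network access.
# gds_to_eset is a hypothetical helper name, not part of GEOquery.
gds_to_eset <- function(gds_id = "GDS810") {
  if (!requireNamespace("GEOquery", quietly = TRUE)) {
    stop("The GEOquery package is required for this sketch.")
  }
  gds <- GEOquery::getGEO(gds_id)           # download and parse the GDS record
  GEOquery::GDS2eSet(gds, do.log2 = TRUE)   # convert to an ExpressionSet
}

# Example use (downloads data, so not run here):
# eset <- gds_to_eset("GDS810")
# dim(Biobase::exprs(eset))         # features by samples
# head(Biobase::pData(eset))        # sample annotation
```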
Assessing Data Quality
ExpressionSet Class in R
R basics: Getting help
• To get help
▫ ?mean
▫ help(mean)
• help.search("mean")
• apropos("mean")
• example(mean)
• http://www.bioconductor.org/help/