Tutorial
How to work with large R projects
Alex Zolotovitski, www.zolot.us
Medio Inc, www.medio.com
How to work with large R projects.
Alex Zolotovitski, [email protected]
UseR! 2013, Albacete, Spain
Contents
1. Workflow of R Projects
2. Reporting and Literate Programming. Packages knitr, highlight, brew, R2HTML.
Code2HTML(); ReleaseOut()
3. IDE: Eclipse+StatET, RStudio and other (vim, ess,..).
4. Naming and style conventions
5. Structure of a project directory. Package ProjectTemplate.
CreateProject()
6. Helper functions to work with a number of large projects.
7. R Work Journal:
Code2HTML(); MakeRWJournals(); createRWJalbum()
8. ToDo
9. References
1. Workflow of R Projects
CRISP-DM: lifecycle for a data mining project
[Diagram; surviving labels: Model Monitoring, Model Deployment]
Many projects:
Objective: reduce the overhead of the cycle "start → leave → find → return → remember → continue".
Key points
Each Project:
2. Reporting and Literate Programming. Packages knitr, highlight, brew, R2HTML.
Code2HTML(); ReleaseOut()
Literate programming: Self-reported / Self-explaining code [1-3]. Reproducible Research.
David Smith [http://www.r-bloggers.com/how-to-set-up-a-reproducible-r-project/]:
• Treat data as read-only files: do the data munging in R code, but always start with the source data.
• Consider output artifacts (figures and tables) as disposable: the data plus the R script is the canonical source.
Rich Fitzjohn [http://nicercode.github.io/blog/2013-04-05-projects/]:
1. Treat data as read only
In my mind, this is probably the most important goal of setting up a project. Data are typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel) where they can be modified means you are never sure of where the data came from, or how they have been modified. My suggestion is to put your data into the data directory and treat it as read only. Within your scripts you might generate derived data sets either temporarily (in an R session only) or semi-permanently (as a file in output/), but the original data is always left in an untouched state.
2. Treat generated output as disposable
In this approach, files in directories figs/ and output/ are all generated by the scripts. A nice thing about this approach is that if the filenames of generated files change (e.g., changing from phylogeny.pdf to mammal-phylogeny.pdf) files with the old names may still stick around, but because they're in this directory you know you can always delete them.
Before submitting a paper, I will go through and delete all the generated files and rerun the analysis to make sure that I
can create all the analyses and figures from the data.
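The two rules quoted above (read-only data, disposable output) can be sketched in a few lines of base R. This is only an illustration: the project name, the file names and the functions clean_generated() and run_analysis() are hypothetical and do not come from the cited posts. It assumes a layout with data/, output/ and figs/ and builds everything inside a temporary directory.

```r
# Hypothetical project layout inside a temporary directory (all names are assumptions).
proj <- file.path(tempdir(), "demo-project")
for (d in file.path(proj, c("data", "output", "figs"))) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Stand-in for collected source data; in a real project this file already exists.
raw_file <- file.path(proj, "data", "measurements.csv")
if (!file.exists(raw_file)) {
  write.csv(data.frame(id = 1:5, value = c(2.5, 3.1, 4.8, 1.2, 3.9)),
            raw_file, row.names = FALSE)
  Sys.chmod(raw_file, mode = "0444")  # mark as read-only (support is limited on Windows)
}

# Everything under output/ and figs/ is generated, so it can be wiped at any time.
clean_generated <- function(proj) {
  generated <- list.files(file.path(proj, c("output", "figs")), full.names = TRUE)
  unlink(generated)
  invisible(length(generated))
}

# The analysis only reads from data/ and only writes to output/ and figs/.
run_analysis <- function(proj) {
  raw <- read.csv(file.path(proj, "data", "measurements.csv"))
  derived <- transform(raw, value_centered = value - mean(value))
  write.csv(derived, file.path(proj, "output", "derived.csv"), row.names = FALSE)
  pdf(file.path(proj, "figs", "values.pdf"))
  plot(derived$id, derived$value, type = "b")
  dev.off()
  invisible(derived)
}

clean_generated(proj)          # delete all generated files ...
derived <- run_analysis(proj)  # ... and rebuild them from data/ alone
```

Running clean_generated() followed by run_analysis() is the "delete everything and rerun" check described in the quote.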
What is Sweave [15]?
• Sweave is a tool that lets you embed R code in (a kind of) LaTeX document.
• The document contains both documentation parts (written in LaTeX) and code parts (written in R).
• The code is evaluated in R.
• The resulting console output, figures and tables are automatically inserted into the final document.
• This produces a .tex file that LaTeX can then process: .Rnw → .tex → .pdf
R packages: highlight, brew, knitr, R2HTML::RweaveHTML
Package knitr:
.R → .Rmd → .md → .html
R code → document (PDF, HTML) to publish.
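The first two arrows of the knitr chain can be tried with knitr::spin() and knitr::knit(). This is a minimal sketch, assuming knitr is installed; the script content and file names are made up for illustration. The last step (.md → .html) is left out because several converters can do it, for example rmarkdown::render() applied to the .Rmd.

```r
# Assumes the knitr package is installed; the whole demo is skipped otherwise.
if (requireNamespace("knitr", quietly = TRUE)) {
  work_dir <- file.path(tempdir(), "knitr-demo")
  dir.create(work_dir, showWarnings = FALSE)
  r_file <- file.path(work_dir, "analysis.R")

  # A plain R script; lines starting with #' become prose in the report.
  writeLines(c(
    "#' # Demo report",
    "#' The mean of the first ten integers:",
    "x <- 1:10",
    "mean(x)"
  ), r_file)

  # .R -> .Rmd (with knit = FALSE, spin() only converts; the code is not run yet)
  knitr::spin(r_file, knit = FALSE)
  rmd_file <- sub("\\.R$", ".Rmd", r_file)

  # .Rmd -> .md (the code chunks are evaluated in this step)
  md_file <- sub("\\.R$", ".md", r_file)
  knitr::knit(rmd_file, output = md_file, quiet = TRUE)
}
```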
Problems:
• Changing the input data (rare) or modifying the .R code (often) requires rerunning the whole chain to recreate the .html.
• Navigating through the code.
Need:
1. Additional research.
2. Forks in the environment/workspace → data to save and restore.
3. After modifying the .R code, rerunning all the code to recreate the .html takes a long time.
4. Dynamic HTML with JavaScript navigation.
Need: R code ↔ R work journal → document (PDF, HTML) to publish.
Similar to the journal in JMP or the notebook in IPython.
3. IDE: Eclipse+StatET, RStudio and other (vim, ess,..).
Eclipse + StatET vs RStudio [7, 8]
Pro:
• Ctrl+R, V
• Search
• Multi-window
• Other useful Eclipse plugins may be: Mylyn (tasks), SVN, git, Python, Java, Toad for clouds (Hive), ...
• Full screen view on click
Contra:
• Installation [7].
Solution: a portable R + Eclipse package for Windows that does not require installation: http://dl.dropbox.com/u/37458038/REclipse-J.zip. Just download it, unzip it in any folder and click REclipse.bat. It assumes that the JRE is installed in the default location.
4. Naming and style conventions
• Not a point for discussion here.
• The attached code can easily be modified to follow your own naming and style conventions.
• Among equivalent alternatives I prefer the shorter one, e.g. x = 1 where it is equivalent to x <- 1, and 'a' where it is equivalent to "a".
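A quick check of the equivalences mentioned above. The caveat in the last lines is standard R behaviour, not something stated on the slides: inside a function call, = names an argument instead of assigning, which is why the slide says "if it is equivalent".

```r
x = 1    # top-level assignment with =
y <- 1   # the same assignment with <-
same_value <- identical(x, y)

same_string <- identical('a', "a")  # both quote styles create the same string

# Caveat: inside a call, = names an argument and does not assign.
z <- 0
median(x = 1:10)   # x is the argument name here; the global x is untouched
x_after_call <- x
median(z <- 1:10)  # <- inside a call does assign z in the calling environment
```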
5. Structure of a project directory. Package ProjectTemplate. CreateProject()
http://projecttemplate.net/architecture.html
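For orientation, this is roughly how a project skeleton is created with the ProjectTemplate package itself. It is a sketch that assumes ProjectTemplate is installed; the project name is hypothetical, and the author's own CreateProject() wrapper is not shown because its implementation is not part of these slides.

```r
# Assumes the ProjectTemplate package is installed; the demo is skipped otherwise.
if (requireNamespace("ProjectTemplate", quietly = TRUE)) {
  old_wd <- setwd(tempdir())
  proj_name <- "demo-template-project"        # hypothetical project name
  unlink(proj_name, recursive = TRUE)         # start from a clean slate
  ProjectTemplate::create.project(proj_name)  # builds the standard directory skeleton
  created <- list.files(proj_name)
  print(created)
  setwd(old_wd)
}
```

In day-to-day use one would then change into the project directory, attach the package with library(ProjectTemplate) and call load.project().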
file:///C:/z/eclipse/work/R-svn-ass/00_commonR/71_TestProjTemplate/zProj2-min/
file:///T:/work/UseR-2013/lib/newProjTemplName/
file:///M:/88_XBox.LTV
Template Folder:
Newly created project:
After 2nd ReleaseOut():
6. Helper Functions. Wrappers, one-liners, aliases
1. do something
2. print reminder
3. print hint / template for the next step
4. mnemonics

Function            Wrapper for
== Create Project
CreateProject()     Create new Project
libra()             install.packages + library()
theFile             global variable - current R code
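The table describes libra() as install.packages + library(). A helper of that kind might look like the following. This is a guess at the idea, not the author's code; the name libra_sketch and the repos default are assumptions.

```r
# Hypothetical libra()-style helper: install the package if it is missing, then attach it.
libra_sketch <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # character.only = TRUE lets library() take the package name from a variable
  library(pkg, character.only = TRUE)
  invisible(pkg %in% loadedNamespaces())
}

ok <- libra_sketch("stats")  # ships with R, so nothing is downloaded here
```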
#' == Remind objects:
DT()                strftime(Sys.time()) - current Date and Time
st({})              system.time + play sound - for long executed blocks
hee()               nrow + head
sg()                dev.print
srm()               save & remove
##                  copy output from console to the code
^RV                 save graphics to .png file
#' == Save state:
sa()                save.image, save
Code2HTML()         R code theFile to html R Work Journal
MakeRWJournals(),
createRWJalbum()    the same for many R files to create albums of galleries
ReleaseOut()        move R code and all output to a DateTime-version folder before new data
#' - ///            exit location - mark place in the file
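ReleaseOut() is described as moving the R code and all output to a DateTime-version folder. The sketch below shows one way such a step could work. It is not the author's implementation: the function name, the file pattern and the folder naming are assumptions, and it copies instead of moving so that the working files stay in place.

```r
# Hypothetical ReleaseOut()-style helper: archive code and outputs in a timestamped folder.
release_out_sketch <- function(proj_dir,
                               pattern = "\\.(R|png|pdf|html|csv)$",
                               stamp = format(Sys.time(), "%Y-%m-%d_%H%M%S")) {
  files <- list.files(proj_dir, pattern = pattern, full.names = TRUE)
  release_dir <- file.path(proj_dir, paste0("release_", stamp))
  dir.create(release_dir, showWarnings = FALSE)
  # The slide says "move"; copying is used here so nothing is lost in the demo.
  file.copy(files, release_dir, overwrite = TRUE)
  release_dir
}

# Usage with two throwaway files in a temporary directory
demo_dir <- file.path(tempdir(), "release-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("x <- 1", file.path(demo_dir, "analysis.R"))
writeLines("id,value", file.path(demo_dir, "result.csv"))
released <- release_out_sketch(demo_dir)
list.files(released)
```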
#' == Restore state:
rmall()             rm all
rmDF()              rm data frames and lists
init                initialise environment
loo();              find saved data;
gff('saved:')       find saved locations
lo()                load saved data
lsDF()              ls data frames
#' Convenience, aliases
tocsv()             write.csv
totsv()             write.table
suss()              subset + grep
gre2()              grep
df()                data.frame
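To make the save/restore helpers more concrete, here are guesses at what lsDF(), srm() and lo() could look like, based only on the "Wrapper for" column. The bodies are assumptions, and the _sketch suffix is used to keep them apart from the author's functions.

```r
# lsDF()-like: names of all data frames in an environment
ls_df_sketch <- function(envir = globalenv()) {
  nms <- ls(envir)
  is_df <- vapply(nms, function(n) is.data.frame(get(n, envir = envir)), logical(1))
  nms[is_df]
}

# srm()-like: save one object to an .RData file, then remove it from the environment
srm_sketch <- function(name, dir = tempdir(), envir = globalenv()) {
  file <- file.path(dir, paste0(name, ".RData"))
  save(list = name, file = file, envir = envir)
  rm(list = name, envir = envir)
  message("saved: ", file)  # reminder of where the object went
  invisible(file)
}

# lo()-like: load a saved object back into an environment
lo_sketch <- function(file, envir = globalenv()) {
  invisible(load(file, envir = envir))
}

# Usage: work in a private environment so the demo does not touch the global workspace
ws <- new.env()
assign("sales", data.frame(month = 1:3, amount = c(10, 12, 9)), envir = ws)
assign("n_iter", 100, envir = ws)

before <- ls_df_sketch(ws)                     # "sales"
saved_file <- srm_sketch("sales", envir = ws)  # object is now on disk only
after_rm <- ls_df_sketch(ws)                   # character(0)
lo_sketch(saved_file, envir = ws)
after_load <- ls_df_sketch(ws)                 # "sales"
```

Printing the file location on save follows the "print reminder" idea from the list of helper goals.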
7. R Work Journal:
Code2HTML(); MakeRWJournals(); createRWJalbum()
Code2HTML() features:
1. Transforms an .R file into a self-documented .html file containing all the R code with output pictures, headers, a table of contents and a gallery.
2. The titles in the body and in the contents are clickable, to navigate from the contents to the body and back.
3. The pictures are clickable to resize.
4. The HTML file has partial R syntax highlighting. Full R syntax highlighting of the resulting HTML is possible, but the file becomes almost twice as large.
5. Parts of the resulting HTML file can be folded.
6. If you fold the TOC in the browser, select all, copy and paste from the browser into a text editor, you should get the original R file unchanged.
7. After modifying the .R code, recreating the .html is fast.
8. It is not a replacement for knitr or Sweave, because the output is not a document to print but rather an R work journal.
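Features 1, 2 and 6 can be illustrated with a very small converter. This is not the original Code2HTML(); it is a sketch under one stated assumption, namely that section headers in the script are comment lines of the form "# == Title". Pictures, folding and highlighting are left out.

```r
# Minimal sketch, NOT the original Code2HTML().
# Assumption: section headers are comment lines that start with "# == ".
code2html_sketch <- function(r_file, html_file = sub("\\.R$", ".r.html", r_file)) {
  lines <- readLines(r_file)
  esc <- function(s) {
    s <- gsub("&", "&amp;", s, fixed = TRUE)
    s <- gsub("<", "&lt;", s, fixed = TRUE)
    gsub(">", "&gt;", s, fixed = TRUE)
  }

  is_head <- grepl("^# == ", lines)
  ids <- cumsum(is_head)  # running section number for every line
  titles <- sub("^# == ", "", lines[is_head])

  # Table of contents: one link per header, pointing into the body
  toc <- sprintf('<li><a id="toc%d" href="#sec%d">%s</a></li>',
                 seq_along(titles), seq_along(titles), esc(titles))

  # Body: every source line is kept (escaped); headers link back to the TOC
  body <- esc(lines)
  body[is_head] <- sprintf('<a id="sec%d" href="#toc%d"><b>%s</b></a>',
                           ids[is_head], ids[is_head], body[is_head])

  html <- c("<html><body>", "<ul>", toc, "</ul>",
            "<pre>", body, "</pre>", "</body></html>")
  writeLines(html, html_file)
  invisible(html_file)
}

# Usage on a throwaway script
demo_r <- file.path(tempdir(), "journal-demo.R")
writeLines(c("# == Load data",
             "x <- c(1, 2, 3)",
             "# == Summarise",
             "if (mean(x) < 3) print('small')"), demo_r)
out <- code2html_sketch(demo_r)
```

Because the body keeps every source line inside a single pre block, copying the body text from a browser gives back the script, which is the idea behind feature 6.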
To show:
1. RAlbum 42d
2. 95_ABC_LTV
3. 97_tutorial-demo
8. To do:
Max: execute code from r.html
Min: navigate between r.html and .R views in Eclipse
9. References
1. David Smith, How to set up a reproducible R project www.r-bloggers.com/how-to-set-up-a-reproducible-r-project/
2. Carl Boettiger, My research workflow, based on Github http://carlboettiger.info/2012/05/06/research-workflow.html
3. Rich Fitzjohn, Nice R Code http://nicercode.github.io/blog/2013-04-05-projects
4. Daniel Falster, Why I want to write nice R code http://nicercode.github.io/blog/2013-04-05-why-nice-code
5. William Stafford Noble, A Quick Guide to Organizing Computational Biology Projects
www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424
6. Data: Data Management http://software-carpentry.org/4_0/data/mgmt.html
7. Eclipse and StatET 2.0 Install For Running R http://lukemiller.org/index.php/2012/01/eclipse-and-statet-2-0-install-for-running-r
8. Eclipse Platform Runtime Binary http://download.eclipse.org/eclipse/downloads/drops4/S-4.3RC4-201306052000/#RCPRuntime
9. StatET Installation. www.walware.de/?page=/index.mframe
10. Installation & Update of the Eclipse Plug-in StatET www.walware.de/?page=/it/statet/installation.html
11. Longhow Lam, A guide to Eclipse and the R plug-in StatET, www.splusbook.com/RIntro/R_Eclipse_StatET.pdf
12. http://en.wikipedia.org/wiki/Literate_Programming
13. http://en.wikipedia.org/wiki/Noweb
14. CRAN Task View: Reproducible Research http://cran.r-project.org/web/views/ReproducibleResearch.html
15. Nicola Sartori, An Sweave tutorial. www.cepe.ethz.ch/education/NPecoHS2010/Sartori-Sweave.pdf
16. Package Knitr http://yihui.name/knitr/, http://cran.r-project.org/web/packages/knitr/index.html