Thursday, August 20, 2009

Reproducible Research

In applied data analysis, it is important to publish not only the results but also the code that was used to create them. In the 'reproducible research' approach, one creates a dynamic document that contains both the text of the final manuscript and the R code used to create all of the tables and figures in the manuscript. While this was proposed some time ago, until recently it was rather difficult to do in practice. However, the freely available LyX editor has now been extended to support documents containing a mixture of LaTeX and R code, so it is now much easier to create these dynamic documents.

Such documents are particularly valuable not only for improving communication with others about exactly what was done, but also for providing a detailed record, for later use, of exactly how the analyses were done. This integrated compendium of text and code is much easier to understand and revisit in the future. As Gentleman (2005) wrote, "New researchers are able to quickly and relatively easily determine what the previous investigator had done. Extension, improvement, or simply use will be easier than if no protocol has been used."

Here are some relevant links:

Robert Gentleman's slides on reproducible research can be found at:

http://gentleman.fhcrc.org/Fld-talks/RGRepRes.pdf

He also wrote a nice paper about this,

Gentleman R. Reproducible research: a bioinformatics case study. Stat Appl Genet Mol Biol. 2005;4:Article2. Epub 2005 Jan 11. PubMed PMID: 16646837.

which can be found here.

This approach is implemented using the Sweave command in R. Instructions for using LyX together with R and Sweave can be found at:
http://wiki.lyx.org/LyX/LyxWithRThroughSweave

See also the links on:
http://gregor.gorjanc.googlepages.com/lyx-sweave

as well as the article:

Using Sweave with LyX
How to lower the LaTeX/Sweave learning curve
by Gregor Gorjanc

which is the first article in: R News, Volume 8/1, May 2008
http://www.r-project.org/doc/Rnews/Rnews_2008-1.pdf

The Sweave manual and FAQ can be found here:

http://www.statistik.lmu.de/~leisch/Sweave/
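
For readers who have not yet seen a Sweave file, here is a minimal sketch of what the approach looks like from the command line, outside of LyX. The file name test.Rnw, the chunk label 'scatter' and the use of a scratch directory are my own choices for illustration; the only real command is R CMD Sweave, and the sketch skips it if R is not installed.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
workdir="$(mktemp -d)"

# Write a tiny Sweave document: LaTeX text plus one R chunk that draws a figure.
# The quoted 'EOF' keeps the shell from interpreting backslashes or << >>.
cat > "$workdir/test.Rnw" <<'EOF'
\documentclass{article}
\begin{document}
A plot produced by the R code embedded in this document:

<<scatter, echo=FALSE, fig=TRUE>>=
x <- 1:10
plot(x)
@
\end{document}
EOF

# Run Sweave on it if R is available; this should write test.tex next to test.Rnw.
if command -v R >/dev/null 2>&1; then
    (cd "$workdir" && R CMD Sweave test.Rnw)
else
    echo "R not found on the PATH; wrote $workdir/test.Rnw only"
fi
```

The resulting test.tex can then be typeset with LaTeX in the usual way; LyX automates these steps behind the scenes.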


NOTE:
While it is wonderful to be able to use Sweave/R within LyX, in my experience it can be difficult to debug one's R code while working inside LyX: LyX runs all of its R computations in a temporary directory and is not yet very good at passing R's error messages back to the user, so when a LyX document fails to typeset, the cause can be hard to find. Here is an example of how one might track down an error in the R code while working in LyX:

Suppose your chunk of R code in the LyX editor window contains an error like this:

<<2, echo=FALSE, fig=TRUE>>=
plot(x
@

Now when you try to typeset it, you will get a message that states:

"An error occured whilst running R CMD Sweave 'test.Rnw'"

To track this down, you can look at the intermediate temporary files that LyX generated.

To do this, open a Terminal window, and then type

cd /tmp
ls

You should see a temporary directory with a name like lyx_tmpdir4466f8u8JU.
Move into that directory by typing

cd lyx_tmpdir4466f8u8JU
ls

Now you should see a temporary directory with a name like lyx_tmpbuf0.
Move into that directory by typing

cd lyx_tmpbuf0
ls

Now you should see all the temporary files that LyX generated, including one with a name like 'test.Rnw' which contains the Sweave code that LyX uses to generate the document. To see why the R command failed, type

R CMD Sweave test.Rnw

When I do this, I see:

Writing to file test.tex
Processing code chunks ...
1 : echo term verbatim (label=myFirstChunkInLyX)
2 : term verbatim eps pdf (label=2)

Error: chunk 2 (label=2)
Error in parse(text = chunk) : unexpected end of input in "plot(x"
Execution halted

This output shows exactly what caused the error: chunk 2 could not be parsed, because the call plot(x is missing its closing parenthesis.
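
The manual steps above could be wrapped in a small shell helper. This is only a sketch: the function name find_lyx_rnw and the LYX_TMP_ROOT variable are made up for this example, and it assumes that LyX keeps its temporary files under /tmp in directories named lyx_tmpdir*/lyx_tmpbuf*, as on my machine. Other systems or LyX versions may put them elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified .Rnw file found in
# any lyx_tmpdir*/lyx_tmpbuf* directory below the given root (default /tmp).
find_lyx_rnw() {
    root="${1:-/tmp}"
    # ls -t lists newest first; errors are hidden when nothing matches.
    ls -t "$root"/lyx_tmpdir*/lyx_tmpbuf*/*.Rnw 2>/dev/null | head -n 1
}

rnw="$(find_lyx_rnw "${LYX_TMP_ROOT:-/tmp}")"
if [ -n "$rnw" ] && command -v R >/dev/null 2>&1; then
    # Rerun Sweave in the directory LyX used, so R's own messages are shown.
    (cd "$(dirname "$rnw")" && R CMD Sweave "$(basename "$rnw")")
else
    echo "No LyX-generated .Rnw file found (or R is not on the PATH)"
fi
```

Run it right after the typesetting failure, while LyX is still open, since the temporary directory may be removed when LyX exits.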

Thompson. Statistical Inference from Genetic Data on Pedigrees book available in JSTOR

Elizabeth Thompson's whole book:

Thompson. Statistical Inference from Genetic Data on Pedigrees. NSF-CBMS Regional Conference Series in Probability and Statistics (2000) vol. 6 pp. i-xiv+1-169

is available through JSTOR here.
