Wednesday, December 9, 2009

Open access Autism 10K genome-scan data set in NIH GEO

The Affymetrix 10K genome-scan data set from the Autism Genome Project is available in the NIH GEO repository (open access):

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6754

"This is a large linkage study undertaken by The Autism Genome Project (AGP) Consortium to search for candidate genes underlying the etiology of autism. 1168 Muliplex families (= 2 affected individuals) consisting of 7600 individuals were genotyped using Affymetrix 10K whole genome mapping arrays. Copy number analysis was performed using DNA Chip (dChip) Analyzer."

Something just didn't seem right: noticing the unexpected!

The secret of good data analysis is noticing the things that just don't seem right:

'Wildlife Conservation Officer Cory Bentzoni noticed something suspicious on a Luzerne County highway -- a pickup truck with a heaping pile of doughnuts, breads and pastries spilling over its bed.

"Being that we were so close to bear season, seeing that person drive by with an unusual amount of pastries was like watching an individual go down a row of parked vehicles testing each handle to see if it were open," Bentzoni said in a written statement. "Something just didn't seem right."'

http://www.post-gazette.com/pg/09343/1019403-454.stm

Monday, December 7, 2009

The R Inferno

Here is the link to a PDF file entitled "The R Inferno", which is not only an excellent guide to efficient and good programming practice in R, but is also very funny:

http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

Here's the Abstract:

"If you are using R and you think you’re in hell, this is a map for you."

Tuesday, October 27, 2009

The Elements of Statistical Learning

A complete PDF of this very good book:

Hastie et al. The elements of statistical learning: data mining, inference, and prediction. (2009) pp. 745

is available from the authors' web site here.

Monday, October 26, 2009

Humor: Large-scale science

"A BIG COMPUTER, A COMPLEX ALGORITHM AND A LONG TIME DOES NOT EQUAL SCIENCE."

ROBERT GENTLEMAN, SSC 2003, HALIFAX (JUNE 2003)

(From the One R Tip a Day blog http://onertipaday.blogspot.com/)


And should we add

"and a p-value of 10^-8"?

Humor: How to write a scientific paper

This article on "How to Write a Scientific Paper" is amusing.

HapMap BioMart interface

One can retrieve allele frequencies and genotype frequencies for all the HapMap populations in a nice tabular format using HapMart, their BioMart-based interface, available here:

http://hapmap.ncbi.nlm.nih.gov/biomart/martview

This is a very nice interface, which is relatively easy to use.

The same interface also generates Perl code for the same query, which can then easily be reused in a program if needed.

Thursday, August 20, 2009

Reproducible Research

In applied data analysis, it is important to publish not only the results but also the code that was used to create those results. In the 'reproducible research' approach, one creates a dynamic document that contains not only the text of the final manuscript but also the R code used to create all of the tables and figures in the manuscript. While this approach was proposed some time ago, until recently it was rather difficult to put into practice. However, the freely available LyX editor has now been extended to support documents containing a mixture of LaTeX and R code, making it much easier to create such dynamic documents.
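To make the idea concrete, here is a minimal sketch of such a dynamic (Sweave) document; the chunk name and its contents are purely illustrative:

```latex
\documentclass{article}
\begin{document}

% An R code chunk: the analysis code lives inside the manuscript itself.
% Sweave runs it and (because fig=TRUE) inserts the resulting figure.
<<myFigure, echo=FALSE, fig=TRUE>>=
x <- rnorm(100)
hist(x)
@

% \Sexpr{} splices a computed value directly into the manuscript text.
The sample mean was \Sexpr{round(mean(x), 2)}.

\end{document}
```

Running `R CMD Sweave` on a file like this produces a .tex file in which the figure and the computed value have been filled in, so the numbers in the text can never drift out of sync with the analysis.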

Such documents are particularly valuable not only for communicating to others exactly what was done, but also for providing a detailed record, for later use, of exactly how the analyses were performed. This integrated compendium of text and code is much easier to understand and revisit in the future. As Gentleman (2005) wrote, "New researchers are able to quickly and relatively easily determine what the previous investigator had done. Extension, improvement, or simply use will be easier than if no protocol has been used."

Here are some relevant links:

Robert Gentleman's slides on reproducible research can be found at:

http://gentleman.fhcrc.org/Fld-talks/RGRepRes.pdf

He also wrote a nice paper about this,

1: Gentleman R. Reproducible research: a bioinformatics case study. Stat Appl Genet Mol Biol. 2005;4:Article2. Epub 2005 Jan 11. PubMed PMID: 16646837.

which can be found here.

This approach is implemented using the Sweave command in R. Instructions for using LyX together with R and Sweave can be found at:
http://wiki.lyx.org/LyX/LyxWithRThroughSweave

See also the links on:
http://gregor.gorjanc.googlepages.com/lyx-sweave

as well as the article:

Using Sweave with LyX
How to lower the LATEX/Sweave learning curve
by Gregor Gorjanc

which is the first article in: R News, Volume 8/1, May 2008
http://www.r-project.org/doc/Rnews/Rnews_2008-1.pdf

The Sweave manual and FAQ can be found here:

http://www.statistik.lmu.de/~leisch/Sweave/


NOTE:
While it is wonderful to be able to use Sweave/R within LyX, in my experience it can be difficult to debug one's R code while working inside of LyX, as LyX does all of its R computations in a temporary directory and isn't very good yet at returning the R error messages back to the LyX user. Here is an example of how one might track down an error in the R code while working in LyX:

When a LyX document fails to typeset, it can be difficult to track down the error.

Suppose your chunk of R code in the LyX editor window contains an error like this:

<<2, echo=FALSE, fig=TRUE>>=
plot(x
@

Now when you try to typeset it, you will get a message that states:

"An error occured whilst running R CMD Sweave 'test.Rnw'"

To track this down, you can look at the intermediate temporary files that LyX generated.

To do this, open a Terminal window, and then type

cd /tmp
ls

You should see a temporary directory with a name like lyx_tmpdir4466f8u8JU
Move into that directory by typing

cd lyx_tmpdir4466f8u8JU
ls

Now you should see a temporary directory with a name like lyx_tmpbuf0
Move into that directory by typing

cd lyx_tmpbuf0
ls

Now you should see all the temporary files that LyX generated, including one with a name like 'test.Rnw' which contains the Sweave code that LyX uses to generate the document. To see why the R command failed, type

R CMD Sweave test.Rnw

When I do this, I see:

Writing to file test.tex
Processing code chunks ...
1 : echo term verbatim (label=myFirstChunkInLyX)
2 : term verbatim eps pdf (label=2)

Error: chunk 2 (label=2)
Error in parse(text = chunk) : unexpected end of input in "plot(x"
Execution halted

This output indicates exactly what the cause of the error is.
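The directory-hopping above can be condensed into a single find command. Here is a sketch, assuming LyX's default /tmp/lyx_tmpdir*/lyx_tmpbuf* naming; the mock directory created below (lyx_tmpdirMOCK, a made-up name) is only so the example is self-contained — with a real LyX session you would run only the find command:

```shell
# Build a mock LyX temp layout for illustration; with a real LyX session
# these directories already exist and you would skip straight to find.
mkdir -p /tmp/lyx_tmpdirMOCK/lyx_tmpbuf0
touch /tmp/lyx_tmpdirMOCK/lyx_tmpbuf0/test.Rnw

# List every Sweave file LyX has left in its temporary directories;
# then run "R CMD Sweave <file>" on the result to see the real error.
find /tmp/lyx_tmpdir*/lyx_tmpbuf* -name '*.Rnw'

# Remove the mock layout
rm -rf /tmp/lyx_tmpdirMOCK
```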

Thompson. Statistical Inference from Genetic Data on Pedigrees book available in JSTOR

Elizabeth Thompson's whole book:

Thompson. Statistical Inference from Genetic Data on Pedigrees. NSF-CBMS Regional Conference Series in Probability and Statistics (2000) vol. 6 pp. i-xiv+1-169

is available through JSTOR here.

Friday, July 31, 2009

Constructing an interpolated genetic map

It is often difficult to construct a good genetic map for a large number of SNPs. Typically this involves interpolation between a genetic map and the physical map of the SNPs. 
 
The Cartographer web site at 
 
https://apps.bioinfo.helsinki.fi/software/cartographer.aspx 
 
looks like it might ease this task. 
 
Here is a description of what Cartographer does from a recent paper: 
 
"Cartographer retrieves the physical location of the markers from the University of California Santa Cruz (UCSC) database and orders the markers based on the sequence information. The genetic location of each marker is defined using the published deCODE genetic map [51], which are also stored in the UCSC database. For markers that were not included in the deCODE genetic map, we used linear interpolation for obtaining estimates of the genetic locations of these markers by using the physical and the genetic locations of the immediately flanking deCODE markers. Also, if the physical location from the UCSC database and the genetic location from the deCODE genetic map for a given marker were in disagreement, we obtained an estimate of the genetic location via interpolation using the nearest flanking deCODE markers that are in agreement with the sequence information. For those markers that were not found in the UCSC database, we retrieved PCR primers from UniSTS and used these to map the marker using UCSC in-silico PCR, and performed the linear interpolation to obtain an estimate of the genetic location as described previously [14]." 
 
From: 
Combined Genome Scans for Body Stature in 6,602 European Twins: Evidence for Common Caucasian Loci Perola M, Sammalisto S, Hiekkalinna T, Martin NG, Visscher PM, et al. PLoS Genetics Vol. 3, No. 6, e97 doi:10.1371/journal.pgen.0030097 
 
----------- 
For an alternative approach to constructing interpolated genetic maps, see David Duffy's discussion
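The linear-interpolation step described in the quotation above can be sketched in a few lines of awk. The physical and genetic positions below are made-up illustrative values, not real deCODE markers:

```shell
# Linear interpolation of a SNP's genetic position from its physical
# position, given the two flanking mapped markers (hypothetical values).
awk 'BEGIN {
  p1 = 1000000; g1 = 2.0   # left flanking marker: physical bp, genetic cM
  p2 = 3000000; g2 = 6.0   # right flanking marker: physical bp, genetic cM
  p  = 2000000             # physical position of the unmapped SNP
  # Assume genetic distance grows linearly with physical distance:
  g = g1 + (p - p1) * (g2 - g1) / (p2 - p1)
  printf "%.3f cM\n", g    # prints: 4.000 cM
}'
```

In practice one would read the flanking positions for each SNP from the genetic-map and physical-map files rather than hard-coding them.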

10 Simple Rules for Making Good Presentations

The PLoS Computational Biology journal has a very interesting series of '10 Simple Rules' papers that are quite good. Here is one of the more recent ones (which contains references to the earlier ones): 
 
Bourne PE (2007) Ten Simple Rules for Making Good Oral Presentations. PLoS Comput Biol 3(4): e77 doi:10.1371/journal.pcbi.0030077 
 
 
See also: 
 
Bourne PE (2005) Ten simple rules for getting published. PLoS Comp Biol 1: e57. 
 
Bourne PE, Chalupa LM (2006) Ten simple rules for getting grants. PLoS Comp Biol 2: e12. 
 
Bourne PE, Korngreen A (2006) Ten simple rules for reviewers. PLoS Comp Biol 2: e110. 
 
Bourne PE, Friedberg I (2006) Ten simple rules for selecting a postdoctoral fellowship. PLoS Comp Biol 2: e121. 
 
Vicens Q, Bourne PE (2007) Ten simple rules for a successful collaboration. PLoS Comp Biol 3: e44. 

Excellent slides on R

Here are some excellent slides on R (pdf), which were made by Dr. Thomas Lumley for a Two-day Tutorial in "R" Statistical Computing that he gave in February 2006. 
 
These are by far the best slides I've seen so far on R, with lots of very useful advice and insights into how to use R optimally. 

Quotations from a scientific article

"Perhaps men are just simpler than women." 
 
From: Atwood LD, Heard-Costa NL, Fox CS, Jaquish CE, Cupples LA. Sex and age specific effects of chromosomal regions linked to body mass index in the Framingham Study. BMC Genet. 2006 Jan 26;7:7. 
 
For more quotations, please see the Statistical Genetics Humor Page

Science is poetry

"Basic science" reasoning is often as cloudy, beautiful, and impossible to refute as poetry. 
 
From: Ioannidis JP. Journals should publish all "null" results and should sparingly publish "positive" results. Cancer Epidemiol Biomarkers Prev. 2006 Jan;15(1):186. 
 
For more quotations, please see the Statistical Genetics Humor Page

Creating TIFF files from pdfs

R easily creates graphic files in pdf format, but many journals require that one submit high-resolution TIFF files. Johanna figured out how to convert from pdf to TIFF:

Ghostscript can be used to convert between formats; the '-r' option sets the resolution in d.p.i. The command below writes the output to a file (you can also view it in X11 using ghostview). It is also possible to convert to and from many other formats.

The command used:

gs -q -dNOPAUSE -dBATCH -sDEVICE=tifflzw -r600 -sOutputFile=name.tiff name.pdf

When using Preview to convert from pdf to TIFF, it appears to save the figure at only 72 d.p.i.

This looks very nice in Word even if zooming more than 200%.

NOTES:
A) The 'tifflzw' option results in a compressed file that is much, much smaller than the corresponding uncompressed TIFF file. And it is a lossless compression, so there is no loss of quality.

B) The 'tifflzw' option produces monochrome output. If you need an LZW-compressed color TIFF file, you can create one using a gs color device followed by the tiffcp command from the libtiff library:

gs -q -dNOPAUSE -dBATCH -sDEVICE=tiff24nc -r600 -sOutputFile=figure4a.TIFF figure_4a.pdf
tiffcp -c lzw figure4a.TIFF figure4aLZW.TIFF

C) For some reason, importing pdf figures into Microsoft Word causes the figures to become fuzzy. To avoid this, convert your pdf figures to compressed TIFF and then import the TIFF files into Word.

D) When creating plots within R, one can directly create a TIFF file using the bitmap() or dev2bitmap() functions. For example, this command:

dev2bitmap(file="test.tiff",type="tifflzw",res=600)

copied the plot I had already drawn in the R graphics window to a 600 DPI TIFF file named "test.tiff".
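To convert a whole directory of figures at once, the gs command above can be wrapped in a small shell loop. The sketch below is written as a dry run: it only prints the command for each file; delete the echo to actually perform the conversions (which requires Ghostscript to be installed):

```shell
# Print the Ghostscript command for every PDF in the current directory.
# Dry run: delete "echo" to actually run the conversions.
for f in *.pdf; do
  [ -e "$f" ] || continue            # no PDFs present: do nothing
  out="${f%.pdf}.tiff"               # figure.pdf -> figure.tiff
  echo gs -q -dNOPAUSE -dBATCH -sDEVICE=tifflzw -r600 -sOutputFile="$out" "$f"
done
```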

Old Blog entries

All of the entries from my old blog can be found in this pdf file:

OldBlog.pdf

The pdf file contains bookmarks, which might be helpful in navigating through the document.

Using R to reshape data

Sometimes marker data are sent to us in this format, where each row only contains one genotype for a particular marker and person: 
 
ID Marker A1 A2 
1 M1 1 2 
2 M1 2 2 
3 M1 2 2 
1 M2 1 1 
3 M2 1 2 
 
However, in order to put these marker data into LINKAGE format, we need to reshape the data so that each row contains all the marker data for a specific person. This can easily be done using the ‘reshape’ command in R: 
 
 
> a <- read.table("marker.txt",header=T) 
> a 
ID Marker A1 A2 
1 1 M1 1 2 
2 2 M1 2 2 
3 3 M1 2 2 
4 1 M2 1 1 
5 3 M2 1 2 
> attach(a) 
> b <- reshape(a,idvar="ID",direction="wide",timevar="Marker") 
> b 
ID A1.M1 A2.M1 A1.M2 A2.M2 
1 1 1 2 1 1 
2 2 2 2 NA NA 
3 3 2 2 1 2 
 

Model cluster script

*** IMPORTANT: Please also read the Comment. *** 
 
Principles of a ‘good’ cluster script: 
1) Should only copy over ONE file at the beginning & only ONE file back at the end. 
2) Should show which node and directory it is running in, so that we can figure out which nodes are not running. 
3) Should clean up after the run is done, erasing files from the compute node. 
4) Should keep stderr and stdout output to a minimum, by piping it to /dev/null 
 
A) Make a ‘files.list’ file containing a list of all the files you need to copy over. 
 
% more files.list 
ccrel.BATCH 
ccrel.awk 
crun.sh 
datain.dat 
loop.sh 
out.txt 
seed 
simdata.dat 
simped.dat 
slinkin.dat 
unmake.awk 
 
B) Create a tar file ‘files.tgz’ based on this list: 
 
tar zcvfT files.tgz files.list 
 
IMPORTANT: Do steps A and B OUTSIDE of the script that will be submitted to the cluster. 
 
Do not use zip instead of tar. Use of tar as described here creates ONE compressed file; copying only one file over and back keeps creation of the associated metadata to a minimum. 
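Steps A and B can be sanity-checked before submitting the job by listing the archive contents. A sketch using two dummy input files (your real files.list will name the actual inputs, e.g. ccrel.BATCH, datain.dat, and so on):

```shell
# Create two dummy input files purely for illustration.
printf 'dummy\n' > datain.dat
printf 'dummy\n' > simped.dat
printf 'datain.dat\nsimped.dat\n' > files.list

tar zcvfT files.tgz files.list   # step B: ONE compressed archive from the list
tar ztf files.tgz                # verify: lists datain.dat and simped.dat
```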
 
C) Use qsub to submit a file like this to the cluster.
 
 
#!/bin/csh -f 
# Purpose: 

# Steps: 

# Helper scripts: 

# Uses: 
# ============================================================================== 
#$ -cwd 
#$ -m e 
#$ -j y 
#$ -N CCRELsim 
date 
# This will tell you which host it is running on. 
echo JOB_ID: $JOB_ID JOB_NAME: $JOB_NAME HOSTNAME: $HOSTNAME 
unalias cp 
# ============================================================================== 
echo Making directory /tmp/$$ 
mkdir /tmp/$$/ 
if (! -e files.tgz) then 
echo ERROR The files.tgz archive does not exist 
echo Please create it prior to running this script 
exit(1) 
endif 
# Copy the needed files over to the compute node 
cp files.tgz /tmp/$$/ 
# Set an alias to your current directory 
set HomeDir = `pwd` 
# Move into the temporary directory on the compute node 
cd /tmp/$$ 
# Extract the files from your compressed tar file 
tar zxvf files.tgz >& /dev/null 
# ============================================================================== 
# Do the needed computations. In this case, these are done by my 'loop.sh' script file that I copied over 
./loop.sh >& /dev/null 
# ============================================================================== 
# Copy results back to working directory 
unalias cp 
cd .. 
tar zcf results_$JOB_ID.tgz $$ 
cp results_$JOB_ID.tgz $HomeDir 
# ============================================================================== 
# Enter working directory 
cd $HomeDir 
# Remove all the files on compute node 
rm -rf /tmp/$$ 
rm -f /tmp/results_$JOB_ID.tgz 
date 
# ================================ 
 

About Me

Pittsburgh, PA, United States