Friday, July 31, 2009

Constructing an interpolated genetic map

It is often difficult to construct a good genetic map for a large number of SNPs. Typically this involves interpolating between a published genetic map and the physical positions of the SNPs. 
 
The Cartographer web site at 
 
https://apps.bioinfo.helsinki.fi/software/cartographer.aspx 
 
looks like it might ease this task. 
 
Here is a description of what Cartographer does from a recent paper: 
 
"Cartographer retrieves the physical location of the markers from the University of California Santa Cruz (UCSC) database and orders the markers based on the sequence information. The genetic location of each marker is defined using the published deCODE genetic map [51], which are also stored in the UCSC database. For markers that were not included in the deCODE genetic map, we used linear interpolation for obtaining estimates of the genetic locations of these markers by using the physical and the genetic locations of the immediately flanking deCODE markers. Also, if the physical location from the UCSC database and the genetic location from the deCODE genetic map for a given marker were in disagreement, we obtained an estimate of the genetic location via interpolation using the nearest flanking deCODE markers that are in agreement with the sequence information. For those markers that were not found in the UCSC database, we retrieved PCR primers from UniSTS and used these to map the marker using UCSC in-silico PCR, and performed the linear interpolation to obtain an estimate of the genetic location as described previously [14]." 
 
From: 
Perola M, Sammalisto S, Hiekkalinna T, Martin NG, Visscher PM, et al. Combined Genome Scans for Body Stature in 6,602 European Twins: Evidence for Common Caucasian Loci. PLoS Genetics 3(6): e97. doi:10.1371/journal.pgen.0030097 
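 
The linear-interpolation step itself is easy to reproduce in R using the built-in approx() function. Here is a minimal sketch; the marker positions below are made-up numbers for illustration only: 
 
# Physical (bp) and genetic (cM) positions of deCODE framework markers
# on one chromosome (illustrative values, not real data)
decode.bp <- c(1000000, 5000000, 9000000, 14000000)
decode.cM <- c(0.0, 4.5, 10.2, 18.7)

# Physical positions of the SNPs whose genetic locations we want to estimate
snp.bp <- c(2500000, 7200000, 12000000)

# Linearly interpolate each SNP's genetic position from the flanking markers
snp.cM <- approx(x = decode.bp, y = decode.cM, xout = snp.bp)$y
 
Note that approx() returns NA for SNPs that fall outside the range of the framework markers; those would have to be extrapolated or dropped. 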
 
----------- 
For an alternative approach to constructing interpolated genetic maps, see David Duffy's discussion.

10 Simple Rules for Making Good Presentations

The PLoS Computational Biology journal publishes a series of 'Ten Simple Rules' papers that are quite good. Here is one of the more recent ones (it contains references to the earlier ones): 
 
Bourne PE (2007) Ten Simple Rules for Making Good Oral Presentations. PLoS Comput Biol 3(4): e77. doi:10.1371/journal.pcbi.0030077 
 
 
See also: 
 
Bourne PE (2005) Ten simple rules for getting published. PLoS Comput Biol 1: e57. 
 
Bourne PE, Chalupa LM (2006) Ten simple rules for getting grants. PLoS Comput Biol 2: e12. 
 
Bourne PE, Korngreen A (2006) Ten simple rules for reviewers. PLoS Comput Biol 2: e110. 
 
Bourne PE, Friedberg I (2006) Ten simple rules for selecting a postdoctoral fellowship. PLoS Comput Biol 2: e121. 
 
Vicens Q, Bourne PE (2007) Ten simple rules for a successful collaboration. PLoS Comput Biol 3: e44. 

Excellent slides on R

Here are some excellent slides on R (pdf), which were made by Dr. Thomas Lumley for a two-day tutorial on statistical computing in R that he gave in February 2006. 
 
These are the best slides on R I have seen so far, with lots of very useful advice and insights into how to use R effectively. 

Quotations from a scientific article

"Perhaps men are just simpler than women." 
 
From: Atwood LD, Heard-Costa NL, Fox CS, Jaquish CE, Cupples LA. Sex and age specific effects of chromosomal regions linked to body mass index in the Framingham Study. BMC Genet. 2006 Jan 26;7:7. 
 
For more quotations, please see the Statistical Genetics Humor Page.

Science is poetry

"Basic science" reasoning is often as cloudy, beautiful, and impossible to refute as poetry. 
 
From: Ioannidis JP. Journals should publish all "null" results and should sparingly publish "positive" results. Cancer Epidemiol Biomarkers Prev. 2006 Jan;15(1):186. 
 
For more quotations, please see the Statistical Genetics Humor Page.

Creating TIFF files from pdfs

R easily creates graphics files in pdf format, but many journals require that one submit high-resolution TIFF files. Johanna figured out how to convert from pdf to TIFF:

Ghostscript can be used to convert between formats; the '-r' option specifies how many d.p.i. you want. This is the command to write to a file (you can also display the output on X11 using ghostview). It is also possible to convert to and from many other formats.

The command used:

gs -q -dNOPAUSE -dBATCH -sDEVICE=tifflzw -r600 -sOutputFile=name.tiff name.pdf

I think that when using Preview to convert from pdf to TIFF, it automatically saves the figure at 72 d.p.i.

The resulting TIFF looks very nice in Word, even when zoomed to more than 200%.

NOTES:
A) The 'tifflzw' option results in a compressed file that is much smaller than the corresponding uncompressed TIFF file. And LZW is a lossless compression, so there is no loss of quality.

B) The 'tifflzw' option produces monochrome output. If you need LZW-compressed color TIFF files, you can instead write an uncompressed color TIFF with Ghostscript and then compress it with the tiffcp command from the libtiff library:

gs -q -dNOPAUSE -dBATCH -sDEVICE=tiff24nc -r600 -sOutputFile=figure4a.TIFF figure_4a.pdf
tiffcp -c lzw figure4a.TIFF figure4aLZW.TIFF

C) For some reason, importing pdf figures into Microsoft Word causes the figures to become fuzzy. To avoid this, convert your pdf figures to compressed TIFF and then import the TIFF files into Word.

D) When creating plots within R, one can directly create a TIFF file using the bitmap() or dev2bitmap() functions. For example, this command:

dev2bitmap(file="test.tiff",type="tifflzw",res=600)

copied the plot I had already drawn in the R graphics window to a 600 DPI TIFF file named "test.tiff".
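
As a self-contained sketch of that workflow (assuming Ghostscript is installed and on your PATH, since dev2bitmap() calls Ghostscript internally):

# Draw something on the screen device first; dev2bitmap() copies
# the current plot, so a device must already be open
plot(rnorm(100), main = "Example plot")

# Copy the current plot to a 600 d.p.i. LZW-compressed TIFF.
# Remember that 'tifflzw' output is monochrome (see Note B);
# for color, use type = "tiff24nc" and compress with tiffcp.
dev2bitmap(file = "test.tiff", type = "tifflzw", res = 600)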

Old Blog entries

All of the entries from my old blog can be found in this pdf file:

OldBlog.pdf

The pdf file contains bookmarks, which might be helpful in navigating through the document.

Using R to reshape data

Sometimes marker data are sent to us in this format, where each row contains the genotype for only one marker and one person: 
 
ID Marker A1 A2 
1 M1 1 2 
2 M1 2 2 
3 M1 2 2 
1 M2 1 1 
3 M2 1 2 
 
However, in order to put these marker data into LINKAGE format, we need to reshape them so that each row contains all the marker data for a specific person. This can easily be done using the ‘reshape’ command in R: 
 
 
> a <- read.table("marker.txt",header=T) 
> a 
ID Marker A1 A2 
1 1 M1 1 2 
2 2 M1 2 2 
3 3 M1 2 2 
4 1 M2 1 1 
5 3 M2 1 2 
> attach(a) 
> b <- reshape(a,idvar="ID",direction="wide",timevar="Marker") 
> b 
ID A1.M1 A2.M1 A1.M2 A2.M2 
1 1 1 2 1 1 
2 2 2 2 NA NA 
3 3 2 2 1 2 
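 
Once the data are in this wide format, the genotype columns can be written out for further processing. Here is a minimal sketch; note that a real LINKAGE pedigree file would also need family, parent, and sex columns in front of the genotypes, and the output file name here is just a placeholder: 
 
# LINKAGE codes a missing allele as 0, so recode the NAs first
b[is.na(b)] <- 0

# Write the wide-format genotypes to a whitespace-delimited file
write.table(b, file = "markers_wide.txt", quote = FALSE, row.names = FALSE)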
 

Model cluster script

*** IMPORTANT: Please also read the Comment. *** 
 
Principles of a ‘good’ cluster script: 
1) Should only copy over ONE file at the beginning & only ONE file back at the end. 
2) Should show which node and directory it is running in, so that we can figure out which nodes are not running. 
3) Should clean up after the run is done, erasing files from the compute node. 
4) Should keep stderr and stdout output to a minimum, by redirecting it to /dev/null. 
 
A) Make a ‘files.list’ file containing a list of all the files you need to copy over. 
 
% more files.list 
ccrel.BATCH 
ccrel.awk 
crun.sh 
datain.dat 
loop.sh 
out.txt 
seed 
simdata.dat 
simped.dat 
slinkin.dat 
unmake.awk 
 
B) Create a tar file ‘files.tgz’ based on this list: 
 
tar zcvfT files.tgz files.list 
 
IMPORTANT: Do steps A and B OUTSIDE of the script that will be submitted to the cluster. 
 
Do not use zip instead of tar. Using tar as described here creates ONE compressed file; copying only one file over/back keeps creation of the associated meta-data to a minimum. 
 
C) Use qsub to submit a file like this to the cluster (a sample submission command is shown after the script):
 
 
#!/bin/csh -f 
# Purpose: 

# Steps: 

# Helper scripts: 

# Uses: 
# ============================================================================== 
#$ -cwd 
#$ -m e 
#$ -j y 
#$ -N CCRELsim 
date 
# This will tell you which host it is running on. 
echo JOB_ID: $JOB_ID JOB_NAME: $JOB_NAME HOSTNAME: $HOSTNAME 
unalias cp 
# ============================================================================== 
echo Making directory /tmp/$$ 
mkdir /tmp/$$/ 
if (! -e files.tgz) then 
echo ERROR The files.tgz archive does not exist 
echo Please create it prior to running this script 
exit(1) 
endif 
# Copy the needed files over to the compute node 
cp files.tgz /tmp/$$/ 
# Set an alias to your current directory 
set HomeDir = `pwd` 
# Move into the temporary directory on the compute node 
cd /tmp/$$ 
# Extract the files from your compressed tar file 
tar zxvf files.tgz >& /dev/null 
# ============================================================================== 
# Do the needed computations. In this case, these are done by my 'loop.sh' script file that I copied over 
./loop.sh >& /dev/null 
# ============================================================================== 
# Copy results back to working directory 
unalias cp 
cd .. 
tar zcf results_$JOB_ID.tgz $$ 
cp results_$JOB_ID.tgz $HomeDir 
# ============================================================================== 
# Enter working directory 
cd $HomeDir 
# Remove all the files on compute node 
rm -rf /tmp/$$ 
rm -f /tmp/results_$JOB_ID.tgz 
date 
# ================================ 
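 
For example, if the script above were saved as 'ccrelsim.csh' (a hypothetical file name), it could be submitted with:

qsub ccrelsim.csh

The '#$' lines near the top are Sun Grid Engine directives that qsub reads from the script itself, so no additional command-line options are needed here.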
 
