ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Contamination




Revision as of 14:35, 27 June 2014

ANGSD can estimate contamination, but only for chromosomes that exist in a single copy (e.g. chrX in males). The method requires a list of polymorphic sites along with their allele frequencies, and we also recommend discarding regions with low mappability.

We have included mappability and HapMap files for chrX; these are found in the RES subfolder of the ANGSD source package. So if you are working with human data and your sample is male, you can estimate the contamination in the following two steps.

  • First we generate a binary count file for chrX for a single BAM file (ANGSD C program)
  • Then we run Fisher's exact test to obtain a p-value, and a jackknife to obtain an estimate of the contamination (R program)


An example is shown below:

#run angsd on chrX of a single BAM file (writes angsdput.icnts.gz)
./angsd -i my.bam -r X: -doCounts 1 -iCounts 1 -minMapQ 30 -minQ 20
#run the Fisher test and jackknife in R
Rscript contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

The contamination.R program is found in the R/ subfolder, and the resource files are found in the RES folder. The jackknife procedure can be quite slow, so the example allocates 24 cores for this analysis (mc.cores=24).

Output

The output from the above command is shown below:

Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

Loading required package: parallel

-----------------------
Doing Fisher exact test for Method1:
           SNP site adjacent site
minor base      616          3554
major base   198492       1589087

	Fisher's Exact Test for Count Data

data:  mat
p-value = 5.286e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.271632 1.512213
sample estimates:
odds ratio 
  1.387606 


-----------------------
Doing Fisher exact test for Method2:
           SNP site adjacent site
minor base      114           654
major base    37983        304122

	Fisher's Exact Test for Count Data

data:  mat2
p-value = 0.001532
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.133367 1.705751
sample estimates:
odds ratio 
  1.395672 


-----------------------
 major and minor bases - Method1:
               -4     -3     -2     -1 SNP site      1      2      3      4
minor base    427    417    475    437      616    486    439    427    446
major base 198651 198715 198656 198645   198492 198500 198681 198693 198546

-----------------------
 major and minor bases - Method2:
              -4    -3    -2    -1 SNP site     1     2     3     4
minor base    75    76    96    73      114    86    79    80    89
major base 38022 38021 38001 38024    37983 38011 38018 38017 38008

----------------------
Running jackknife for Method1 (could be slow)
Running jackknife for Method2 (could be slow)
$est
              Method1     Method2    
Contamination 0.03837625  0.03380983 
llh           1034.078    483.5145   
SE            0.002630455 0.003900376
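
When reading the tables above, it helps to see how they relate. In this output, the "adjacent site" column of each Fisher table equals the sum of the eight flanking positions (-4 to -1 and 1 to 4) in the corresponding "major and minor bases" table. The sketch below checks this for Method1 using counts copied from the output; the variable and function names are made up for illustration and are not part of ANGSD or contamination.R.

```shell
# Method1 counts at the flanking positions -4..-1 and 1..4, copied from the
# "major and minor bases" table above (the SNP site column is left out).
flank_minor="427 417 475 437 486 439 427 446"
flank_major="198651 198715 198656 198645 198500 198681 198693 198546"

# Sum a whitespace-separated list of integers (hypothetical helper).
sum_counts() {
    echo "$1" | awk '{ for (i = 1; i <= NF; i++) s += $i } END { print s }'
}

echo "adjacent minor: $(sum_counts "$flank_minor")"
echo "adjacent major: $(sum_counts "$flank_major")"
```

This prints 3554 and 1589087, matching the "adjacent site" column of the Method1 Fisher table. The same check on the Method2 counts gives 654 and 304122.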


Interpretation of output files

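Both methods show a highly significant p-value (5.286e-13 for Method1 and 0.001532 for Method2) and estimate the level of contamination to be approximately 3 to 4% (0.038 for Method1 and 0.034 for Method2).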
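
As a rough sanity check on the Fisher tables, one can compare the fraction of reads carrying the minor base at the SNP sites with the fraction at the adjacent sites. On a single-copy chromosome the adjacent sites give a baseline error rate, so an excess at the known polymorphic sites is what contamination would be expected to produce. The sketch below only does this arithmetic on the counts printed in the output; it is not the estimator used by contamination.R, and the helper name minor_pct is an assumption made for illustration.

```shell
# Percentage of reads showing the minor base: 100 * minor / (minor + major).
# minor_pct is a hypothetical helper, not part of ANGSD.
minor_pct() {
    awk -v minor="$1" -v major="$2" \
        'BEGIN { printf "%.3f\n", 100 * minor / (minor + major) }'
}

# Counts copied from the two Fisher tables in the output.
echo "Method1 SNP site:      $(minor_pct 616 198492) %"
echo "Method1 adjacent site: $(minor_pct 3554 1589087) %"
echo "Method2 SNP site:      $(minor_pct 114 37983) %"
echo "Method2 adjacent site: $(minor_pct 654 304122) %"
```

For this sample the minor-base fraction is about 0.309% at SNP sites against 0.223% at adjacent sites for Method1, and 0.299% against 0.215% for Method2, which agrees with the direction of the odds ratios reported by the Fisher tests.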


Method

The method is described in the supplementary material of Rasmussen2011.