ANGSD: Analysis of next generation Sequencing Data


Contamination

Revision as of 14:29, 27 June 2014

ANGSD can estimate contamination, but only for chromosomes that exist in a single copy (e.g. chrX in males). The method requires a list of polymorphic sites along with their frequencies, and we also recommend discarding regions with low mappability.

We have included mappability and HapMap files for chrX; these are found in the RES subfolder of the ANGSD source package. So if you are working with humans and your sample is male, you can estimate the contamination with the following two commands.

  • First we generate a binary count file for chrX for a single BAM file (ANGSD C program)
  • Then we do a Fisher exact test to obtain a p-value, and a jackknife to get an estimate of contamination (R program)


An example is shown below:

# Step 1: run angsd to count bases on chrX (writes angsdput.icnts.gz)
./angsd -i my.bam -r X: -doCounts 1 -iCounts 1 -minMapQ 30 -minQ 20
# Step 2: do the Fisher test and jackknife in R
Rscript contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

The contamination.R program is found in the R/ subfolder, and the resource files are found in the RES folder. The jackknife procedure can be quite slow, so the example above allocates 24 cores to the analysis with mc.cores=24.
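
The two steps can also be assembled programmatically, for instance when several samples are processed in turn. The following is a minimal sketch in Python, not part of ANGSD: the function name build_commands and its parameters are hypothetical, and the flags and file names are copied from the example above. It only builds and prints the command lines; the paths to angsd, contamination.R and the resource files are assumptions that must be adjusted to your installation.

```python
import shlex


def build_commands(bam, cores=24,
                   r_script="contamination.R",
                   map_file="map100.chrX.bz2",
                   hap_file="hapMapCeuXlift.map.bz2",
                   count_file="angsdput.icnts.gz"):
    """Return (angsd_cmd, r_cmd) as argument lists; nothing is executed.

    Hypothetical helper: all flags and default file names are taken from
    the example commands on this page.
    """
    # Step 1: per-site base counts on chrX for a single BAM file.
    angsd_cmd = ["./angsd", "-i", bam, "-r", "X:",
                 "-doCounts", "1", "-iCounts", "1",
                 "-minMapQ", "30", "-minQ", "20"]
    # Step 2: Fisher test and jackknife in R on the count file from step 1.
    r_cmd = ["Rscript", r_script,
             f"mapFile={map_file}", f"hapFile={hap_file}",
             f"countFile={count_file}", f"mc.cores={cores}"]
    return angsd_cmd, r_cmd


angsd_cmd, r_cmd = build_commands("my.bam")
print(shlex.join(angsd_cmd))
print(shlex.join(r_cmd))
```

To actually run the steps, each list could be passed to subprocess.run(cmd, check=True), so that a failure in the angsd step stops the script before the R step starts.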

Output

The output from the R command above is shown below:

Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

Loading required package: parallel

-----------------------
Doing Fisher exact test for Method1:
           SNP site adjacent site
minor base      616          3554
major base   198492       1589087

	Fisher's Exact Test for Count Data

data:  mat
p-value = 5.286e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.271632 1.512213
sample estimates:
odds ratio 
  1.387606 


-----------------------
Doing Fisher exact test for Method2:
           SNP site adjacent site
minor base      114           654
major base    37983        304122

	Fisher's Exact Test for Count Data

data:  mat2
p-value = 0.001532
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.133367 1.705751
sample estimates:
odds ratio 
  1.395672 


-----------------------
 major and minor bases - Method1:
               -4     -3     -2     -1 SNP site      1      2      3      4
minor base    427    417    475    437      616    486    439    427    446
major base 198651 198715 198656 198645   198492 198500 198681 198693 198546

-----------------------
 major and minor bases - Method2:
              -4    -3    -2    -1 SNP site     1     2     3     4
minor base    75    76    96    73      114    86    79    80    89
major base 38022 38021 38001 38024    37983 38011 38018 38017 38008

----------------------
Running jackknife for Method1 (could be slow)
Running jackknife for Method2 (could be slow)
$est
              Method1     Method2    
Contamination 0.03837625  0.03380983 
llh           1034.078    483.5145   
SE            0.002630455 0.003900376


Interpretation of output files

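Both methods show a highly significant p-value (5.286e-13 for Method1 and 0.001532 for Method2), and the jackknife estimates put the level of contamination at approximately 3 to 4% (0.038 for Method1 and 0.034 for Method2).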
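
To make the tables in the output easier to read, the sketch below (Python, standard library only, not part of contamination.R) recomputes a few quantities from the printed counts. It shows that the "adjacent site" column of each Fisher table is the sum of the eight flanking positions (-4 to -1 and 1 to 4), and that the minor base is seen more often at the SNP site than at adjacent sites, which is the signal the test picks up. The function names are hypothetical. The odds ratio here is the plain cross-product ratio, which is close to but not identical to the conditional estimate that R's fisher.test reports. The interval is a normal approximation (estimate ± 1.96 SE), which is an assumption of this sketch and not something the script prints.

```python
# Counts copied from the "major and minor bases" tables in the output above.
METHOD1 = {
    "snp_minor": 616, "snp_major": 198492,
    "flank_minor": [427, 417, 475, 437, 486, 439, 427, 446],
    "flank_major": [198651, 198715, 198656, 198645,
                    198500, 198681, 198693, 198546],
}
METHOD2 = {
    "snp_minor": 114, "snp_major": 37983,
    "flank_minor": [75, 76, 96, 73, 86, 79, 80, 89],
    "flank_major": [38022, 38021, 38001, 38024,
                    38011, 38018, 38017, 38008],
}


def summarize(counts):
    """Rebuild the 2x2 table and derive simple summary statistics."""
    adj_minor = sum(counts["flank_minor"])   # "adjacent site" minor count
    adj_major = sum(counts["flank_major"])   # "adjacent site" major count
    snp_minor, snp_major = counts["snp_minor"], counts["snp_major"]
    return {
        "adj_minor": adj_minor,
        "adj_major": adj_major,
        # Fraction of reads carrying the minor base at each site class.
        "snp_rate": snp_minor / (snp_minor + snp_major),
        "adj_rate": adj_minor / (adj_minor + adj_major),
        # Sample (cross-product) odds ratio, not the conditional MLE.
        "odds_ratio": (snp_minor * adj_major) / (snp_major * adj_minor),
    }


def normal_interval(estimate, se, z=1.96):
    """Approximate 95% interval, assuming a normal sampling distribution."""
    return estimate - z * se, estimate + z * se


m1 = summarize(METHOD1)
m2 = summarize(METHOD2)
ci1 = normal_interval(0.03837625, 0.002630455)  # Method1 estimate and SE
ci2 = normal_interval(0.03380983, 0.003900376)  # Method2 estimate and SE

for name, m, ci in (("Method1", m1, ci1), ("Method2", m2, ci2)):
    print(name, m["adj_minor"], m["adj_major"],
          round(m["snp_rate"], 5), round(m["adj_rate"], 5),
          round(m["odds_ratio"], 4), [round(x, 4) for x in ci])
```

Under the normal approximation, the two intervals overlap, consistent with both methods pointing at a contamination level of a few percent.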


Method

The method is described in the supplementary material of the following paper (PMID 23803765):

Orlando, L, Ginolhac, A, Zhang, G, Froese, D, Albrechtsen, A, Stiller, M, Schubert, M, Cappellini, E, Petersen, B, Moltke, I, Johnson, PL, Fumagalli, M, Vilstrup, JT, Raghavan, M, Korneliussen, T, Malaspinas, AS, Vogt, J, Szklarczyk, D, Kelstrup, CD, Vinther, J, Dolocan, A, Stenderup, J, Velazquez, AM, Cahill, J, Rasmussen, M, Wang, X, Min, J, Zazula, GD, Seguin-Orlando, A, Mortensen, C, Magnussen, K, Thompson, JF, Weinstock, J, Gregersen, K, Røed, KH, Eisenmann, V, Rubin, CJ, Miller, DC, Antczak, DF, Bertelsen, MF, Brunak, S, Al-Rasheid, KA, Ryder, O, Andersson, L, Mundy, J, Krogh, A, Gilbert, MT, Kjær, K, Sicheritz-Ponten, T, Jensen, LJ, Olsen, JV, Hofreiter, M, Nielsen, R, Shapiro, B, Wang, J, Willerslev, E (2013). Recalibrating Equus evolution using the genome sequence of an early Middle Pleistocene horse. Nature, 499, 7456:74-8.