ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Contamination

From angsd
Revision as of 13:06, 27 June 2014 by Albrecht (talk | contribs)
Jump to navigation Jump to search

Angsd can estimate contamination, but only for chromosomes that exists in one genecopy (eg chrX for males). This method requires a list of polymorphic sites along with their frequency and we also recommend to discard regions with low mappability.

We have included a mappability and HapMap files for chrX these are found in the RES subfolder of the angsd source package. So if you are working with humans, and your sample is a male then you can estimate the contamination with the follow two commands.

  • First we generate a binary count file for chrX for a single BAM file (ANGSD cprogram)
  • Then we do a fisher test for finding a p-value, and jackknife to get an estimate of contamination (Rprogram)


An example are found below:

#run angsd
./angsd -i my.bam -r X: -doCounts 1  -iCounts 1 -minMapQ 30 -minQ 20
#do jackKnife in R
Rscript contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

The contamination.R program is found in the R/ subfolder, and the resource files are found in the RES folder. The jackknive procedure can be quite slow, so we allocate 10 cores for this analysis mc.cores=10.

Output

The output from the above command is shown below

Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

Loading required package: parallel

-----------------------
Doing Fisher exact test for Method1:
           SNP site adjacent site
minor base      616          3554
major base   198492       1589087

	Fisher's Exact Test for Count Data

data:  mat
p-value = 5.286e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.271632 1.512213
sample estimates:
odds ratio 
  1.387606 


-----------------------
Doing Fisher exact test for Method2:
           SNP site adjacent site
minor base      114           654
major base    37983        304122

	Fisher's Exact Test for Count Data

data:  mat2
p-value = 0.001532
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.133367 1.705751
sample estimates:
odds ratio 
  1.395672 


-----------------------
 major and minor bases - Method1:
               -4     -3     -2     -1 SNP site      1      2      3      4
minor base    427    417    475    437      616    486    439    427    446
major base 198651 198715 198656 198645   198492 198500 198681 198693 198546

-----------------------
 major and minor bases - Method2:
              -4    -3    -2    -1 SNP site     1     2     3     4
minor base    75    76    96    73      114    86    79    80    89
major base 38022 38021 38001 38024    37983 38011 38018 38017 38008

----------------------
Running jackknife for Method1 (could be slow)
Running jackknife for Method2 (could be slow)
$est
              Method1     Method2    
Contamination 0.03837625  0.03380983 
llh           1034.078    483.5145   
SE            0.002630455 0.003900376


Interpretation of outputfiles

Both methods shows a highly significant pvalue, and estimate the level of contamination to be approx 3%.