Revision as of 13:00, 27 June 2014

ANGSD can estimate contamination, but only for chromosomes that exist in a single copy (e.g. chrX in males). The method requires a list of HapMap sites along with their frequencies, and we also recommend discarding regions with low mappability.

We have included mappability and HapMap files for chrX; these are found in the RES subfolder of the ANGSD source package. So if you are working with humans and your sample is male, you can estimate the contamination with the following two commands.

First we generate a binary count file for chrX for a single BAM file (ANGSD C program).
Then we run Fisher's exact test to obtain a p-value, and a jackknife to get an estimate of contamination (R program).

An example is shown below:

./angsd -i my.bam -r X: -doCounts 1  -iCounts 1 -minMapQ 30 -minQ 20
Rscript contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

The contamination.R program is found in the R/ subfolder, and the resource files are found in the RES folder. The jackknife procedure can be quite slow, so we allocate 24 cores to this analysis with mc.cores=24.

Output

The output from the Rscript command above is shown below:

Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24
Loading required package: multicore

-----------------------
Doing Fisher exact test for Method1:
      [,1]   [,2]
[1,]   246    157
[2,] 17700 143407

        Fisher's Exact Test for Count Data

data:  mat
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 10.34000 15.61672
sample estimates:
odds ratio 
   12.6959 


-----------------------
Doing Fisher exact test for Method2:
     [,1]  [,2]
[1,]   91    55
[2,] 7355 59513

        Fisher's Exact Test for Count Data

data:  mat2
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  9.466476 19.085589
sample estimates:
odds ratio 
  13.38675 

----------------------
Running jackknife for Method1 (could be slow)
Running jackknife for Method2 (could be slow)
$est
              Method1     Method2    
Contamination 0.03837625  0.03380983 
llh           1034.078    483.5145   
SE            0.002630455 0.003900376

$err
[1] 0.01370779

$c
[1] 0.001093589

$est
              Method1     Method2    
Contamination 0.03837625  0.03380983 
llh           1034.078    483.5145   
SE            0.002630455 0.003900376

$err
[1] 0.01370779

$c
[1] 0.001093589

Interpretation of output files

Both methods show a highly significant p-value and estimate the level of contamination to be approximately 3% (0.038 for Method1 and 0.034 for Method2).
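
As a small sanity check on the numbers, the sketch below recomputes two per-column rates from the Method1 table printed in the output. It assumes, without confirmation from the program itself, that each column of the table is a separate group of sites and that the top row counts the reads of interest, so that the rate for a column is top / (top + bottom). The helper name `rate` is made up for this illustration and is not part of ANGSD or contamination.R.

```shell
# Hypothetical helper (not part of ANGSD): fraction of a column's reads
# that fall in the top row of the 2x2 table, printed to 7 significant digits.
rate() {
    # usage: rate TOP BOTTOM  ->  TOP / (TOP + BOTTOM)
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.7g\n", a / (a + b) }'
}

# Counts copied from the "Method1" table in the output above.
rate 246 17700      # column 1: 246 / (246 + 17700)
rate 157 143407     # column 2: 157 / (157 + 143407)
```

Under that assumption the two rates agree with the `$err` and `$c` values in the output, which suggests (but does not prove) that these entries are the top-row fractions of column 1 and column 2 of the Method1 table.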
