ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Contamination: Difference between revisions

From angsd
Jump to navigation Jump to search
Line 21: Line 21:
<pre>
<pre>
Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24
Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24
Loading required package: multicore
 
Loading required package: parallel


-----------------------
-----------------------
Doing Fisher exact test for Method1:
Doing Fisher exact test for Method1:
      [,1]  [,2]
          SNP site adjacent site
[1,]   246    157
minor base      616          3554
[2,] 17700 143407
major base   198492      1589087


        Fisher's Exact Test for Count Data
Fisher's Exact Test for Count Data


data:  mat
data:  mat
p-value < 2.2e-16
p-value = 5.286e-13
alternative hypothesis: true odds ratio is not equal to 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
95 percent confidence interval:
  10.34000 15.61672
  1.271632 1.512213
sample estimates:
sample estimates:
odds ratio  
odds ratio  
  12.6959
  1.387606




-----------------------
-----------------------
Doing Fisher exact test for Method2:
Doing Fisher exact test for Method2:
     [,1]  [,2]
          SNP site adjacent site
[1,]  91   55
minor base     114          654
[2,] 7355 59513
major base   37983        304122


        Fisher's Exact Test for Count Data
Fisher's Exact Test for Count Data


data:  mat2
data:  mat2
p-value < 2.2e-16
p-value = 0.001532
alternative hypothesis: true odds ratio is not equal to 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
95 percent confidence interval:
  9.466476 19.085589
1.133367 1.705751
sample estimates:
sample estimates:
odds ratio  
odds ratio  
   13.38675
   1.395672
 
 
-----------------------
major and minor bases - Method1:
              -4    -3    -2    -1 SNP site      1      2      3      4
minor base    427    417    475    437      616    486    439    427    446
major base 198651 198715 198656 198645  198492 198500 198681 198693 198546
 
-----------------------
major and minor bases - Method2:
              -4    -3    -2    -1 SNP site    1    2    3    4
minor base    75    76    96    73      114    86    79    80    89
major base 38022 38021 38001 38024    37983 38011 38018 38017 38008


----------------------
----------------------
Line 67: Line 81:
SE            0.002630455 0.003900376
SE            0.002630455 0.003900376


$err
[1] 0.01370779
$c
[1] 0.001093589
$est
              Method1    Method2   
Contamination 0.03837625  0.03380983
llh          1034.078    483.5145 
SE            0.002630455 0.003900376
$err
[1] 0.01370779
$c
[1] 0.001093589


</pre>
</pre>

Revision as of 12:59, 27 June 2014

Angsd can estimate contamination, but only for chromosomes that exists in one genecopy (eg chrX for males). This method requires a list of HapMap sites along with their frequency and we also recommend to discard regions with low mappability.

We have included a mappability and HapMap files for chrX these are found in the RES subfolder of the angsd source package. So if you are working with humans, and your sample is a male then you can estimate the contamination with the follow two commands.

  • First we generate a binary count file for chrX for a single BAM file (ANGSD cprogram)
  • Then we do a fisher test for finding a p-value, and jackknife to get an estimate of contamination (Rprogram)


An example are found below:

./angsd -i my.bam -r X: -doCounts 1  -iCounts 1 -minMapQ 30 -minQ 20
Rscript contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

The contamination.R program is found in the R/ subfolder, and the resource files are found in the RES folder. The jackknive procedure can be quite slow, so we allocate 24 cores for this analysis mc.cores=24.

Output

The output from the above command is shown below

Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

Loading required package: parallel

-----------------------
Doing Fisher exact test for Method1:
           SNP site adjacent site
minor base      616          3554
major base   198492       1589087

	Fisher's Exact Test for Count Data

data:  mat
p-value = 5.286e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.271632 1.512213
sample estimates:
odds ratio 
  1.387606 


-----------------------
Doing Fisher exact test for Method2:
           SNP site adjacent site
minor base      114           654
major base    37983        304122

	Fisher's Exact Test for Count Data

data:  mat2
p-value = 0.001532
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.133367 1.705751
sample estimates:
odds ratio 
  1.395672 


-----------------------
 major and minor bases - Method1:
               -4     -3     -2     -1 SNP site      1      2      3      4
minor base    427    417    475    437      616    486    439    427    446
major base 198651 198715 198656 198645   198492 198500 198681 198693 198546

-----------------------
 major and minor bases - Method2:
              -4    -3    -2    -1 SNP site     1     2     3     4
minor base    75    76    96    73      114    86    79    80    89
major base 38022 38021 38001 38024    37983 38011 38018 38017 38008

----------------------
Running jackknife for Method1 (could be slow)
Running jackknife for Method2 (could be slow)
$est
              Method1     Method2    
Contamination 0.03837625  0.03380983 
llh           1034.078    483.5145   
SE            0.002630455 0.003900376


Interpretation of outputfiles

Both methods shows a highly significant pvalue, and estimate the level of contamination to be approx 3%.