ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Contamination: Difference between revisions

From angsd
Jump to navigation Jump to search
No edit summary
No edit summary
 
(21 intermediate revisions by 2 users not shown)
Line 1: Line 1:
Angsd can estimate contamination, but only for chromosomes that exists in one genecopy (eg chrX for males). This method requires a list of HapMap sites along with their frequency and we also recommend to discard regions with low mappability.
Angsd can estimate contamination, but only for chromosomes that exists in one genecopy (eg chrX for males). This method requires a list of polymorphic sites along with their frequency and we also recommend to discard regions with low mappability.


We have included a mappability and HapMap files for chrX these are found in the '''RES''' subfolder of the angsd source package.
We have included a mappability and HapMap files for chrX these are found in the '''RES''' subfolder of the angsd source package (using hg19).
So if you are working with humans, and your sample is a male then you can estimate the contamination with the follow two commands.  
So if you are working with humans, and your sample is a male then you can estimate the contamination with the follow two commands.  


* First we generate a binary count file for chrX for a single BAM file (ANGSD cprogram)
* First we generate a binary count file for chrX for a single BAM file (ANGSD cprogram)
* Then we do a fisher test for finding a p-value, and jackknife to get an estimate of contamination (Rprogram)
* Then we do a Fisher's exact test for finding a p-value, and jackknife to get an estimate of contamination (Rprogram)




An example are found below:
An example are found below:
<pre>
<pre>
./angsd -i my.bam -r X: -doCounts 1  -iCounts 1 -minMapQ 30 -minQ 20
#run angsd
Rscript ../R/contamination.R mapFile="../RES/map100.chrX.bz2" hapFile="../RES/hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24
./angsd -i my.bam -r X:5000000-154900000 -doCounts 1  -iCounts 1 -minMapQ 30 -minQ 20
#do jackKnife in R
Rscript R/contamination.R mapFile="RES/map100.chrX.gz" hapFile="RES/hapMapCeuXlift.map.gz" countFile="angsdput.icnts.gz" mc.cores=24
#or use the command below for a newer version of the resourcefiles.
Rscript R/contamination.R mapFile="RES/chrX.unique.gz" hapFile="RES/HapMapChrX.gz" countFile="angsdput.icnts.gz" mc.cores=24
##or the fancy fast new c++ program
misc/contamination -a angsdput.icnts.gz -h RES/HapMapChrX.gz
</pre>
</pre>


The '''contamination.R''' program is found in the '''R/''' subfolder, and the resource files are found in the '''RES''' folder. The jackknive procedure can be quite slow, so we allocate 24 cores for this analysis '''mc.cores=24'''.


outputfiles
 
The output from the above command is shown below
 
=R=
 
<pre>
Rscript R/contamination.R
Loading required package: parallel
->  Needed arguments:
hapFile : HapMapfile
countFile : Count file
->  Optional arguments (defaults):
mapFile  ( NA ) : Mappability file
minDepth  ( 2 ) : Minimum depth
maxDepth  ( 20 ) : Maximium depth
mc.cores  ( 10 ) : Number of cores
fixed  ( TRUE ) : Use fixed version of likelihood
jack  ( TRUE ) : Jacknive to get confidence intervals
minmaf  ( 0.05 ) : minimum maf
startPos  ( 5e+06 ) : start position
stopPos  ( 154900000 ) : stop position
seed  ( NA ) : set a seed (supply int value)
</pre>


<pre>
<pre>
Rscript ../R/contamination.R mapFile="../RES/map100.chrX.bz2" hapFile="../RES/hapMapCeuXlift.map.bz2" countFile="/space/anders/ida/idaSjov/kostenkitest/contamination/out/V1countKostinki.USER.bam.X.gz" mc.cores=24
Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24
Loading required package: multicore
 
Loading required package: parallel


-----------------------
-----------------------
Doing Fisher exact test for Method1:
Doing Fisher exact test for Method1:
      [,1]  [,2]
          SNP site adjacent site
[1,]   246    157
minor base      616          3554
[2,] 17700 143407
major base   198492      1589087


        Fisher's Exact Test for Count Data
Fisher's Exact Test for Count Data


data:  mat
data:  mat
p-value < 2.2e-16
p-value = 5.286e-13
alternative hypothesis: true odds ratio is not equal to 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
95 percent confidence interval:
  10.34000 15.61672
  1.271632 1.512213
sample estimates:
sample estimates:
odds ratio  
odds ratio  
  12.6959
  1.387606




-----------------------
-----------------------
Doing Fisher exact test for Method2:
Doing Fisher exact test for Method2:
     [,1]  [,2]
          SNP site adjacent site
[1,]  91   55
minor base     114          654
[2,] 7355 59513
major base   37983        304122


        Fisher's Exact Test for Count Data
Fisher's Exact Test for Count Data


data:  mat2
data:  mat2
p-value < 2.2e-16
p-value = 0.001532
alternative hypothesis: true odds ratio is not equal to 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
95 percent confidence interval:
  9.466476 19.085589
1.133367 1.705751
sample estimates:
sample estimates:
odds ratio  
odds ratio  
   13.38675
   1.395672
 
 
-----------------------
major and minor bases - Method1:
              -4    -3    -2    -1 SNP site      1      2      3      4
minor base    427    417    475    437      616    486    439    427    446
major base 198651 198715 198656 198645  198492 198500 198681 198693 198546
 
-----------------------
major and minor bases - Method2:
              -4    -3    -2    -1 SNP site    1    2    3    4
minor base    75    76    96    73      114    86    79    80    89
major base 38022 38021 38001 38024    37983 38011 38018 38017 38008


----------------------
----------------------
Line 65: Line 108:
SE            0.002630455 0.003900376
SE            0.002630455 0.003900376


$err
[1] 0.01370779


$c
</pre>
[1] 0.001093589
==Interpretation of output==
 
Both methods shows a highly significant pvalue, and estimate the level of contamination to be approx 3%.
 
=C=
Later versions of angsd gives a much nice output format. Below is based on a different file from above.
<pre>
misc/contamination
-> Must supply -h hapmapfile -a angsd.icnts.gz file
-> Other options: -m minaf -b startpos -c stoppos -d mindepth -e maxdepth -f skiptrans -p nthreads -s seed -j maxjackknife
</pre>


$est
              Method1    Method2   
Contamination 0.03837625  0.03380983
llh          1034.078    483.5145 
SE            0.002630455 0.003900376


$err
[1] 0.01370779


$c
<pre>
[1] 0.001093589
misc/contamination -a angsdput.icnts.gz -h RES/HapMapChrX.gz -p 100 -s 100
-----------------
hapmap:RES/HapMapChrX.gz counts:angsdput.icnts.gz minMaf:0.050000 startPos:5000000 stopPos:154900000 minDepth:2 maxDepth:200 skiptrans:0 nthreads:100 seed:100
Method2 is subject to fluctuations due to random sampling
Seed value of 0 (zero) will use time as seed
-----------------
[readhap] We now have: 58190 snpSites after filtering based on hapMapfile
[readicnts] fname:angsdput.icnts.gz minDepth:2 maxDepth:200
[readicnts] Has read:70325301 sites,  28883428 sites (after depfilter) from ANGSD icnts file
  After removing SNP sites with no data in 5bp surrounding region`
  We have nSNP sites: 9809, with flanking: 88281
--------
MAIN RESULTS: Fisher exact test:
Method n11 n12 n21 n22 prob left right twotail
method1 316 882 25792 207906 4.60682e-49 1 7.02673e-49 9.33613e-49
method2 107 318 9702 78154 2.18987e-16 1 3.43772e-16 4.01568e-16
Mismatch_rate_for_flanking:0.004224 MisMatch_rate_for_snpsite:0.012104


Method1: old_llh Version: MoM:0.020657 SE(MoM):1.898369e-03 ML:0.020983 SE(ML):1.847108e-03
Method1: new_llh Version: MoM:0.020538 SE(MoM):1.908704e-03 ML:0.020909 SE(ML):1.839107e-03
Method2: old_llh Version: MoM:0.018192 SE(MoM):2.877245e-03 ML:0.018908 SE(ML):2.799355e-03
Method2: new_llh Version: MoM:0.018087 SE(MoM):2.892190e-03 ML:0.018832 SE(ML):2.787229e-03
</pre>
</pre>
==Method==
The method is described in the supplementary of [[Rasmussen2011]]

Latest revision as of 14:51, 11 July 2015

Angsd can estimate contamination, but only for chromosomes that exists in one genecopy (eg chrX for males). This method requires a list of polymorphic sites along with their frequency and we also recommend to discard regions with low mappability.

We have included a mappability and HapMap files for chrX these are found in the RES subfolder of the angsd source package (using hg19). So if you are working with humans, and your sample is a male then you can estimate the contamination with the follow two commands.

  • First we generate a binary count file for chrX for a single BAM file (ANGSD cprogram)
  • Then we do a Fisher's exact test for finding a p-value, and jackknife to get an estimate of contamination (Rprogram)


An example are found below:

#run angsd
./angsd -i my.bam -r X:5000000-154900000 -doCounts 1  -iCounts 1 -minMapQ 30 -minQ 20
#do jackKnife in R
 Rscript R/contamination.R mapFile="RES/map100.chrX.gz" hapFile="RES/hapMapCeuXlift.map.gz" countFile="angsdput.icnts.gz" mc.cores=24
#or use the command below for a newer version of the resourcefiles.
Rscript R/contamination.R mapFile="RES/chrX.unique.gz" hapFile="RES/HapMapChrX.gz" countFile="angsdput.icnts.gz" mc.cores=24
##or the fancy fast new c++ program
misc/contamination -a angsdput.icnts.gz -h RES/HapMapChrX.gz 

The contamination.R program is found in the R/ subfolder, and the resource files are found in the RES folder. The jackknive procedure can be quite slow, so we allocate 24 cores for this analysis mc.cores=24.


The output from the above command is shown below

R

 Rscript R/contamination.R
Loading required package: parallel
->  Needed arguments:
	 hapFile : HapMapfile 
	 countFile : Count file 
->  Optional arguments (defaults):
	 mapFile  ( NA ) : Mappability file 
	 minDepth  ( 2 ) : Minimum depth 
	 maxDepth  ( 20 ) : Maximium depth 
	 mc.cores  ( 10 ) : Number of cores 
	 fixed  ( TRUE ) : Use fixed version of likelihood 
	 jack  ( TRUE ) : Jacknive to get confidence intervals 
	 minmaf  ( 0.05 ) : minimum maf 
	 startPos  ( 5e+06 ) : start position 
	 stopPos  ( 154900000 ) : stop position 
	 seed  ( NA ) : set a seed (supply int value) 
Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

Loading required package: parallel

-----------------------
Doing Fisher exact test for Method1:
           SNP site adjacent site
minor base      616          3554
major base   198492       1589087

	Fisher's Exact Test for Count Data

data:  mat
p-value = 5.286e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.271632 1.512213
sample estimates:
odds ratio 
  1.387606 


-----------------------
Doing Fisher exact test for Method2:
           SNP site adjacent site
minor base      114           654
major base    37983        304122

	Fisher's Exact Test for Count Data

data:  mat2
p-value = 0.001532
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.133367 1.705751
sample estimates:
odds ratio 
  1.395672 


-----------------------
 major and minor bases - Method1:
               -4     -3     -2     -1 SNP site      1      2      3      4
minor base    427    417    475    437      616    486    439    427    446
major base 198651 198715 198656 198645   198492 198500 198681 198693 198546

-----------------------
 major and minor bases - Method2:
              -4    -3    -2    -1 SNP site     1     2     3     4
minor base    75    76    96    73      114    86    79    80    89
major base 38022 38021 38001 38024    37983 38011 38018 38017 38008

----------------------
Running jackknife for Method1 (could be slow)
Running jackknife for Method2 (could be slow)
$est
              Method1     Method2    
Contamination 0.03837625  0.03380983 
llh           1034.078    483.5145   
SE            0.002630455 0.003900376


Interpretation of output

Both methods shows a highly significant pvalue, and estimate the level of contamination to be approx 3%.

C

Later versions of angsd gives a much nice output format. Below is based on a different file from above.

misc/contamination 
	-> Must supply -h hapmapfile -a angsd.icnts.gz file
	-> Other options: -m minaf -b startpos -c stoppos -d mindepth -e maxdepth -f skiptrans -p nthreads -s seed -j maxjackknife


 misc/contamination -a angsdput.icnts.gz -h RES/HapMapChrX.gz -p 100 -s 100
-----------------
hapmap:RES/HapMapChrX.gz counts:angsdput.icnts.gz minMaf:0.050000 startPos:5000000 stopPos:154900000 minDepth:2 maxDepth:200 skiptrans:0 nthreads:100 seed:100
Method2 is subject to fluctuations due to random sampling
Seed value of 0 (zero) will use time as seed
-----------------
[readhap] We now have: 58190 snpSites after filtering based on hapMapfile
[readicnts] fname:angsdput.icnts.gz minDepth:2 maxDepth:200
[readicnts] Has read:70325301 sites,  28883428 sites (after depfilter) from ANGSD icnts file
  After removing SNP sites with no data in 5bp surrounding region`
  We have nSNP sites: 9809, with flanking: 88281
--------
MAIN RESULTS: Fisher exact test:
Method	 n11 n12 n21 n22 prob left right twotail
method1	316	882	25792	207906	4.60682e-49	1	7.02673e-49	9.33613e-49
method2	107	318	9702	78154	2.18987e-16	1	3.43772e-16	4.01568e-16
Mismatch_rate_for_flanking:0.004224 MisMatch_rate_for_snpsite:0.012104 

Method1: old_llh Version: MoM:0.020657 SE(MoM):1.898369e-03 ML:0.020983 SE(ML):1.847108e-03
Method1: new_llh Version: MoM:0.020538 SE(MoM):1.908704e-03 ML:0.020909 SE(ML):1.839107e-03
Method2: old_llh Version: MoM:0.018192 SE(MoM):2.877245e-03 ML:0.018908 SE(ML):2.799355e-03
Method2: new_llh Version: MoM:0.018087 SE(MoM):2.892190e-03 ML:0.018832 SE(ML):2.787229e-03

Method

The method is described in the supplementary of Rasmussen2011