ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Contamination: Difference between revisions
(→Output) |
(→Output) |
||
Line 21: | Line 21: | ||
<pre> | <pre> | ||
Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24 | Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24 | ||
Loading required package: | |||
Loading required package: parallel | |||
----------------------- | ----------------------- | ||
Doing Fisher exact test for Method1: | Doing Fisher exact test for Method1: | ||
SNP site adjacent site | |||
minor base 616 3554 | |||
major base 198492 1589087 | |||
Fisher's Exact Test for Count Data | |||
data: mat | data: mat | ||
p-value | p-value = 5.286e-13 | ||
alternative hypothesis: true odds ratio is not equal to 1 | alternative hypothesis: true odds ratio is not equal to 1 | ||
95 percent confidence interval: | 95 percent confidence interval: | ||
1.271632 1.512213 | |||
sample estimates: | sample estimates: | ||
odds ratio | odds ratio | ||
1.387606 | |||
----------------------- | ----------------------- | ||
Doing Fisher exact test for Method2: | Doing Fisher exact test for Method2: | ||
SNP site adjacent site | |||
minor base 114 654 | |||
major base 37983 304122 | |||
Fisher's Exact Test for Count Data | |||
data: mat2 | data: mat2 | ||
p-value | p-value = 0.001532 | ||
alternative hypothesis: true odds ratio is not equal to 1 | alternative hypothesis: true odds ratio is not equal to 1 | ||
95 percent confidence interval: | 95 percent confidence interval: | ||
1.133367 1.705751 | |||
sample estimates: | sample estimates: | ||
odds ratio | odds ratio | ||
1.395672 | |||
----------------------- | |||
major and minor bases - Method1: | |||
-4 -3 -2 -1 SNP site 1 2 3 4 | |||
minor base 427 417 475 437 616 486 439 427 446 | |||
major base 198651 198715 198656 198645 198492 198500 198681 198693 198546 | |||
----------------------- | |||
major and minor bases - Method2: | |||
-4 -3 -2 -1 SNP site 1 2 3 4 | |||
minor base 75 76 96 73 114 86 79 80 89 | |||
major base 38022 38021 38001 38024 37983 38011 38018 38017 38008 | |||
---------------------- | ---------------------- | ||
Line 67: | Line 81: | ||
SE 0.002630455 0.003900376 | SE 0.002630455 0.003900376 | ||
</pre> | </pre> |
Revision as of 12:59, 27 June 2014
Angsd can estimate contamination, but only for chromosomes that exists in one genecopy (eg chrX for males). This method requires a list of HapMap sites along with their frequency and we also recommend to discard regions with low mappability.
We have included a mappability and HapMap files for chrX these are found in the RES subfolder of the angsd source package. So if you are working with humans, and your sample is a male then you can estimate the contamination with the follow two commands.
- First we generate a binary count file for chrX for a single BAM file (ANGSD cprogram)
- Then we do a fisher test for finding a p-value, and jackknife to get an estimate of contamination (Rprogram)
An example are found below:
./angsd -i my.bam -r X: -doCounts 1 -iCounts 1 -minMapQ 30 -minQ 20 Rscript contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24
The contamination.R program is found in the R/ subfolder, and the resource files are found in the RES folder. The jackknive procedure can be quite slow, so we allocate 24 cores for this analysis mc.cores=24.
Output
The output from the above command is shown below
Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24 Loading required package: parallel ----------------------- Doing Fisher exact test for Method1: SNP site adjacent site minor base 616 3554 major base 198492 1589087 Fisher's Exact Test for Count Data data: mat p-value = 5.286e-13 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 1.271632 1.512213 sample estimates: odds ratio 1.387606 ----------------------- Doing Fisher exact test for Method2: SNP site adjacent site minor base 114 654 major base 37983 304122 Fisher's Exact Test for Count Data data: mat2 p-value = 0.001532 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 1.133367 1.705751 sample estimates: odds ratio 1.395672 ----------------------- major and minor bases - Method1: -4 -3 -2 -1 SNP site 1 2 3 4 minor base 427 417 475 437 616 486 439 427 446 major base 198651 198715 198656 198645 198492 198500 198681 198693 198546 ----------------------- major and minor bases - Method2: -4 -3 -2 -1 SNP site 1 2 3 4 minor base 75 76 96 73 114 86 79 80 89 major base 38022 38021 38001 38024 37983 38011 38018 38017 38008 ---------------------- Running jackknife for Method1 (could be slow) Running jackknife for Method2 (could be slow) $est Method1 Method2 Contamination 0.03837625 0.03380983 llh 1034.078 483.5145 SE 0.002630455 0.003900376
Interpretation of outputfiles
Both methods shows a highly significant pvalue, and estimate the level of contamination to be approx 3%.