ANGSD: Analysis of next generation Sequencing Data
Thetas,Tajima,Neutrality tests
Revision as of 10:37, 6 March 2014
This method estimates several thetas (population-scaled mutation rates) and, based on these thetas, can calculate Tajima's D and various other neutrality test statistics. The method is described in Korneliussen2013.
- NB Information on this website is for version 0.551 or higher.
- NB Korneliussen2013 covers two methods:
- using an ML method
- using the empirical Bayes (EB) method. The information on this page relates to the EB method.
For performing the ML method, you should use the SFS estimation method and define the region of interest.
Example
Below is a chain of commands used for calculating the statistics. It is a 3-step procedure:
- Estimate a site frequency spectrum. The output is the out.sfs file, which is used as the -pest argument in step 2.
- Calculate per-site thetas. Output is a .thetas.gz file.
- Calculate neutrality test statistics. Output is a .thetas.gz.pestPG file.
First estimate the site allele frequency likelihood
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 2 -P 24 -out out
Obtain the maximum likelihood estimate of the SFS
misc/emOptim2 out.saf 20 -P 24 > out.sfs
Calculate the thetas
./angsd -bam bam.filelist -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc chimpHg19.fa -GL 2
Estimate Tajima's D
#create a binary version of theta.thetas.gz
misc/thetaStat make_bed theta.thetas.gz
#calculate Tajima's D
misc/thetaStat do_stat theta.thetas.gz -nChr 20
Remember that you need to supply the ancestral state for the SFS estimation, and that you should try to filter out the worst data with -minMapQ and -minQ.
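The steps above can also be assembled from a script. The following is a minimal sketch, not part of ANGSD: `build_commands` is a hypothetical helper that only reuses the flags and file names shown in the example. It assumes that the thetas file is named after the -out prefix (out.thetas.gz), whereas the example above refers to theta.thetas.gz, so adjust the name to whatever your run actually produces.

```python
def build_commands(bamlist="bam.filelist", anc="chimpHg19.fa",
                   n_chr=20, threads=24, out="out"):
    """Return the argument lists for the commands of the example, in order.

    Hypothetical helper: nothing here is run, the lists are only built.
    Assumption: the thetas file is called <out>.thetas.gz.
    """
    thetas_file = f"{out}.thetas.gz"  # assumed name, see lead-in
    saf = ["./angsd", "-bam", bamlist, "-doSaf", "1", "-anc", anc,
           "-GL", "2", "-P", str(threads), "-out", out]
    # emOptim2 prints the SFS on stdout; redirect it to <out>.sfs yourself.
    sfs = ["misc/emOptim2", f"{out}.saf", str(n_chr), "-P", str(threads)]
    thetas = ["./angsd", "-bam", bamlist, "-out", out, "-doThetas", "1",
              "-doSaf", "1", "-pest", f"{out}.sfs", "-anc", anc, "-GL", "2"]
    make_bed = ["misc/thetaStat", "make_bed", thetas_file]
    do_stat = ["misc/thetaStat", "do_stat", thetas_file, "-nChr", str(n_chr)]
    return [saf, sfs, thetas, make_bed, do_stat]


if __name__ == "__main__":
    for argv in build_commands():
        print(" ".join(argv))
```

Keeping the number of chromosomes in one variable makes it harder to pass different values to emOptim2 and to -nChr by accident.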
Sliding Window example
We can easily do a sliding window analysis by adding the -win and -step arguments to the last command:
misc/thetaStat do_stat theta.thetas.gz -nChr 20 -win 50000 -step 10000
This will calculate the test statistic using a window size of 50kb and a step size of 10kb.
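To see how a window size and a step size interact, the sketch below enumerates the windows for a given range. It is an illustration only: `sliding_windows` is a hypothetical function, and it assumes that windows start at the first position given and that only full windows are kept. thetaStat may align or truncate windows differently.

```python
def sliding_windows(first_pos, last_pos, win=50000, step=10000):
    """Yield (start, stop, center) for windows of width `win` moved by `step`.

    Assumption: the first window starts at `first_pos` and partial windows
    at the end of the range are dropped.
    """
    start = first_pos
    while start + win <= last_pos:
        yield start, start + win, start + win // 2
        start += step


if __name__ == "__main__":
    for window in sliding_windows(0, 100000):
        print(window)
```

With a 50kb window and a 10kb step, consecutive windows overlap by 40kb, so neighbouring statistics are not independent of each other.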
Example Output
.thetas.gz is
chr10	1	-7.041327	-7.362126	-6.399318	-8.788662	-7.840050
chr10	2	-8.337345	-8.921354	-7.358819	-11.052126	-9.502293
chr10	3	-9.283203	-9.945923	-8.207280	-12.281002	-10.546671
chr10	4	-1.105216	-0.617477	-35.602360	-0.730974	-0.672616
chr10	5	-7.705427	-8.374287	-6.622234	-10.726430	-8.976529
chr10	6	-11.683662	-12.369866	-10.578860	-14.766857	-12.975926
chr10	7	-11.688400	-12.374647	-10.583545	-14.771753	-12.980717
chr10	8	-12.989104	-13.675391	-11.884201	-16.072602	-14.281470
chr10	9	-13.682495	-14.368910	-12.577435	-16.766463	-14.975017
chr10	10	-1.105216	-0.655518	-37.649835	-0.982402	-0.805662
chr10	11	-14.433381	-15.119890	-13.328206	-17.517696	-15.726019
chr10	12	-14.410507	-15.097004	-13.305348	-17.494774	-15.703129
chr10	13	-14.963982	-15.650467	-13.858836	-18.048209	-16.256590
chr10	14	-15.657183	-16.343698	-14.552002	-18.741516	-16.949827
chr10	15	-15.762558	-16.449085	-14.657361	-18.846938	-17.055218
chr10	16	-1.105216	-0.766882	-12.373455	-1.539395	-1.080326
chr10	17	-15.876465	-16.563007	-14.771249	-18.960901	-17.169143
chr10	18	-17.256944	-17.943487	-16.151728	-20.341382	-18.549623
- 1. chromosome
- 2. position
- 3. ThetaWatterson
- 4. ThetaD (nucleotide diversity)
- 5. ThetaSingleton (singleton category; corresponds to the FuLi estimator, tF, in the .pestPG output)
- 6. ThetaH
- 7. ThetaL
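A short sketch of how the seven columns listed above could be read in a script. The names `parse_thetas_line` and `sum_thetas` are hypothetical, and the short column keys (tW, tP, tF, tH, tL) are borrowed from the .pestPG format line. All values in the example are negative, which is consistent with per-site thetas stored on a log scale; that is an assumption of this sketch and is not stated on this page, so check it against your own output before relying on the sums.

```python
import math

# chr, pos, then the five theta columns in the order listed above
COLUMNS = ("chr", "pos", "tW", "tP", "tF", "tH", "tL")


def parse_thetas_line(line):
    """Split one line of a .thetas.gz file into a dict keyed by COLUMNS."""
    fields = line.split()
    record = {"chr": fields[0], "pos": int(fields[1])}
    for name, value in zip(COLUMNS[2:], fields[2:7]):
        record[name] = float(value)
    return record


def sum_thetas(lines, log_scaled=True):
    """Sum each theta column over the given lines.

    Assumption: values are natural-log scaled, so they are exponentiated
    before summing. Pass log_scaled=False if your file is on a linear scale.
    """
    totals = dict.fromkeys(COLUMNS[2:], 0.0)
    for line in lines:
        record = parse_thetas_line(line)
        for name in totals:
            value = record[name]
            totals[name] += math.exp(value) if log_scaled else value
    return totals


if __name__ == "__main__":
    example = ["chr10 1 -7.041327 -7.362126 -6.399318 -8.788662 -7.840050",
               "chr10 2 -8.337345 -8.921354 -7.358819 -11.052126 -9.502293"]
    print(sum_thetas(example))
```

For a real file, the lines can be read with `gzip.open(path, "rt")` from the Python standard library and passed to `sum_thetas` directly.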
.thetas.gz.pestPG
The .pestPG file is a 14-column, tab-separated file. The first column contains information about the region. The second and third columns are the reference name and the center of the window.
We then have 5 different estimators of theta: Watterson, pairwise, FuLi, fayH, L. These are followed by 5 different neutrality test statistics: Tajima's D, Fu & Li's F, Fu & Li's D, Fay's H, Zeng's E. The final column is the effective number of sites with data in the window.
(59999,69999)(60000,70000)(60000,70000)	chr1	65000	2349.039592	2008.865974	2791.401569	3817.828656	2913.347320	-0.545594	-0.626967	-0.486984	-0.617873	0.195337	10000
(69999,79999)(70000,80000)(70000,80000)	chr1	75000	2349.113388	1993.792014	2764.051812	3979.987797	2986.889940	-0.569871	-0.617112	-0.456779	-0.678388	0.220762	10000
(79999,89999)(80000,90000)(80000,90000)	chr1	85000	2349.154140	2035.577279	2649.132059	3902.254435	2968.915852	-0.502912	-0.491556	-0.330221	-0.637555	0.214522	10000
(89999,99999)(90000,100000)(90000,100000)	chr1	95000	2349.462773	2048.143641	2533.193917	3881.554872	2964.849262	-0.483190	-0.388552	-0.202228	-0.626111	0.212980	10000
(99999,109999)(100000,110000)(100000,110000)	chr1	105000	2349.306947	2103.402129	2608.611593	3738.658529	2921.030347	-0.394355	-0.404727	-0.285429	-0.558478	0.197881	10000
(109999,119999)(110000,120000)(110000,120000)	chr1	115000	2348.965451	1867.325681	2725.815492	4491.310734	3179.318214	-0.772512	-0.687843	-0.414876	-0.896283	0.287438	10000
(119999,129999)(120000,130000)(120000,130000)	chr1	125000	2349.437816	2077.636124	2623.517860	3755.631838	2916.633993	-0.435861	-0.437286	-0.301676	-0.573043	0.196304	10000
Format is:
(indexStart,indexStop)(posStart,posStop)(regStart,regStop) chrname wincenter tW tP tF tH tL tajD fulif fuliD fayH zengsE numSites
Most likely you are only interested in the wincenter (column 3) and in column 9, which is the Tajima's D statistic.
The first 3 columns relate to the region. The next 5 columns are 5 different estimators of theta, and the 5 columns after that are neutrality test statistics. The final column is the number of sites with data in the region.
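The 14-column layout described above maps naturally onto a named record. The following is a minimal sketch: `PestPG` and `parse_pestpg_line` are hypothetical names, and the field names are copied from the format line on this page. It assumes whitespace-separated fields and no header line in the input.

```python
from typing import NamedTuple


class PestPG(NamedTuple):
    region: str      # the leading ()()() field, kept as text
    chrname: str
    wincenter: int
    tW: float
    tP: float
    tF: float
    tH: float
    tL: float
    tajD: float
    fulif: float
    fuliD: float
    fayH: float
    zengsE: float
    numSites: int


def parse_pestpg_line(line):
    """Turn one .pestPG line into a PestPG record."""
    fields = line.split()
    if len(fields) != 14:
        raise ValueError(f"expected 14 columns, got {len(fields)}")
    return PestPG(fields[0], fields[1], int(fields[2]),
                  *map(float, fields[3:13]), int(fields[13]))


if __name__ == "__main__":
    row = parse_pestpg_line(
        "(59999,69999)(60000,70000)(60000,70000) chr1 65000 "
        "2349.039592 2008.865974 2791.401569 3817.828656 2913.347320 "
        "-0.545594 -0.626967 -0.486984 -0.617873 0.195337 10000")
    print(row.wincenter, row.tajD)
```

Accessing `row.tajD` by name avoids counting columns by hand when only the window center and Tajima's D are needed.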
The leading ()()() field is mainly used for debugging the sliding window program. The interpretation is:
- posStart and posStop are the first and last physical positions of the sites included in the analysis.
- regStart and regStop delimit the physical region for which the analysis is performed. Therefore posStart and posStop always lie within regStart and regStop.
- indexStart and indexStop are the positions within the internal array.
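The three pairs in the region field can be pulled apart with a regular expression, which also makes it possible to check the containment property described above. This is a sketch under the assumption that the field always has exactly the shape shown in the example output; `parse_region` and `pos_within_reg` are hypothetical names.

```python
import re

# Three "(int,int)" pairs back to back: index, pos, reg
REGION_RE = re.compile(r"\((\d+),(\d+)\)\((\d+),(\d+)\)\((\d+),(\d+)\)")


def parse_region(field):
    """Split '(a,b)(c,d)(e,f)' into index, pos and reg pairs."""
    match = REGION_RE.fullmatch(field)
    if match is None:
        raise ValueError(f"unexpected region field: {field!r}")
    v = [int(x) for x in match.groups()]
    return {"index": (v[0], v[1]), "pos": (v[2], v[3]), "reg": (v[4], v[5])}


def pos_within_reg(region):
    """True if the positions with data lie inside the analysed region."""
    pos_start, pos_stop = region["pos"]
    reg_start, reg_stop = region["reg"]
    return reg_start <= pos_start and pos_stop <= reg_stop


if __name__ == "__main__":
    region = parse_region("(59999,69999)(60000,70000)(60000,70000)")
    print(region, pos_within_reg(region))
```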
Unknown ancestral state (folded sfs)
- Below is for version 0.556 and above
If you don't have the ancestral states, you can still calculate the Watterson and Tajima thetas, which means you can still compute the Tajima's D neutrality test statistic. This requires you to use the folded SFS. The output files will have the same format, but only thetaW, thetaD and Tajima's D are meaningful.
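The reason Tajima's D survives folding is that it only depends on the Watterson and pairwise estimators. The sketch below computes it with the textbook formula from Tajima (1989). It is not taken from the thetaStat source, `tajimas_d` is a hypothetical name, and it assumes that the number of segregating sites can be recovered from the window sum of Watterson's theta as S = tW * a1.

```python
import math


def tajimas_d(theta_w, theta_pi, n_chr):
    """Tajima's D from window sums of Watterson's theta and pairwise theta.

    Textbook constants from Tajima (1989). Assumption: S = theta_w * a1.
    """
    a1 = sum(1.0 / i for i in range(1, n_chr))
    a2 = sum(1.0 / i ** 2 for i in range(1, n_chr))
    b1 = (n_chr + 1) / (3.0 * (n_chr - 1))
    b2 = 2.0 * (n_chr ** 2 + n_chr + 3) / (9.0 * n_chr * (n_chr - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n_chr + 2) / (a1 * n_chr) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = theta_w * a1                      # segregating sites (assumed)
    variance = e1 * s + e2 * s * (s - 1.0)
    return (theta_pi - theta_w) / math.sqrt(variance)


if __name__ == "__main__":
    # tW and tP of a window; the number of chromosomes is your -nChr value
    print(tajimas_d(2349.039592, 2008.865974, 40))
```

A pairwise theta below Watterson's theta gives a negative D, and equal estimators give exactly zero.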
Below is an example based on the earlier one, where the analysis now uses the folded spectrum. Notice the -fold 1 and that the second parameter to emOptim2 is now 20 instead of 40, while -nChr in the last command is 40.
#(estimate an SFS)
../angsd0.557/angsd -bam pop1.list -out bingo -doSaf 1 -fold 1
../angsd0.557/misc/emOptim2 bingo.saf 20 -P 24 >bingo.em.ml
#(calculate thetas)
../angsd0.557/angsd -bam pop1.list -out bongo -doThetas 1 -doSaf 1 -pest bingo.em.ml -fold 1
#(calculate Tajimas.)
../angsd0.557/misc/thetaStat make_bed bongo.thetas.gz
../angsd0.557/misc/thetaStat do_stat bongo.thetas.gz -nChr 40
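To make concrete what folding does to the spectrum, here is a small sketch of the usual definition: category i and category n-i are merged because, without an ancestral state, the two cannot be told apart. With n = 40 chromosomes the unfolded spectrum has categories 0..40 and the folded one has categories 0..20. `fold_sfs` is a hypothetical helper and does not reproduce how ANGSD or emOptim2 handle folding internally.

```python
def fold_sfs(sfs):
    """Fold an unfolded SFS with categories 0..n into categories 0..n//2.

    Each folded category i holds sfs[i] + sfs[n - i]; when n is even the
    middle category has no partner and is kept as is.
    """
    n = len(sfs) - 1  # number of chromosomes
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        folded.append(sfs[i] if i == j else sfs[i] + sfs[j])
    return folded


if __name__ == "__main__":
    unfolded = [1, 2, 3, 4, 5]   # toy spectrum for n = 4 chromosomes
    print(fold_sfs(unfolded))
```

The total count is unchanged by folding, which gives a quick sanity check on any folded spectrum you build this way.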