ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Thetas,Tajima,Neutrality tests: Difference between revisions

From angsd
Jump to navigation Jump to search
 
(124 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This method will estimate different thetas (population scaled mutation rate) and can based on these thetas calculate Tajima's D and various other neutrality test statistics. Method is described in [[Korneliussen2013]]
This method will estimate different thetas (population scaled mutation rate) and can based on these thetas calculate Tajima's D and various other neutrality test statistics. Method is described in [[Korneliussen2013]].
 
* NB Information on this website is for version  0.917-33-g6d2aec8 or higher.
* NB The [[Korneliussen2013]] covers two methods,
#  using an ML method
#  using the emperical Bayes (EB) method. The information on this page relates to the EB method.
For performing the ML method, you should the use the [[SFS Estimation]] method and define the region af interest.
 
=Quick Example=
Below is a chain of commands used for caculating statistics. These are based on the test files that can be dowloaded on the [[Quick Start ]] page.


=Example=
Below is a chain of commands used for caculating statistics.
Its a 3 step procedure
Its a 3 step procedure
# Estimate an site frequency spectrum. Output is '''.em.ml''' file.
# Estimate an site frequency spectrum. Output is '''out.sfs''' file. This is what is being used as the '''-pest ''' argument in step2.
# Calculate per-site thetas. Output is a '''.thetas.gz''' file.
# Calculate per-site thetas. Output is a '''.thetas.idx/.thetas.gz''' files. This contains the binary persite estimates of the thetas.
# Caclculate neutrality tests statistics. Output is a '''.thetas.gz.pestPG file.
# Calculate neutrality tests statistics. Output is a '''.thetas.idx.pestPG file.
==Full command list for below examples==
Here is the chain of commands required to do estimate the thetas, and perform neutrality test statistics. These different commands are described in great detail in the following '''step 1,... step 3b''' sub sections. If you do not have the ancestral state you can simply use the assembly you have mapped agains, but remember to add -fold 1 in the 'realSFS' and 'realSFS sf2theta' step.
<pre>
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
#for unfolded
./misc/realSFS out.saf.idx -P 24 > out.sfs
./misc/realSFS saf2theta out.saf.idx -outname out -sfs out.sfs
#for folded
./misc/realSFS out.saf.idx -P 24 -fold 1 > out.sfs
./misc/realSFS saf2theta out.saf.idx -outname out -sfs out.sfs -fold 1
#Estimate for every Chromosome/scaffold
./misc/thetaStat do_stat out.thetas.idx
#Do a sliding window analysis based on the output from the make_bed command.
./misc/thetaStat do_stat out.thetas.idx -win 50000 -step 10000  -outnames theta.thetasWindow.gz
</pre>
 
==Step 1: Finding a 'global estimate' of the SFS==
 
First estimate the site allele frequency likelihood
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
<pre class="mw-collapsible-content">


        -> Reading fasta: chimpHg19.fa
        -> Parsing 10 number of samples
        -> Printing at chr: 20 pos:14095817 chunknumber 3500
        -> Done reading data waiting for calculations to finish
        -> Calling destroy
        -> Done waiting for threads
        -> Output filenames:
                ->"out.arg"
                ->"out.saf"
                ->"out.saf.pos.gz"
        -> Mon Jun 30 12:02:58 2014
        -> Arguments and parameters for all analysis are located in .arg file
        [ALL done] cpu-time used =  47.19 sec
        [ALL done] walltime used =  43.00 sec
</pre>
</div>
Obtain the maximum likelihood estimate of the SFS using the '''realSFS''' program found in the misc subfolder. (See more here [[realSFS]])
<pre>
<pre>
#(estimate an SFS)
./misc/realSFS out.saf.idx -P 24 > out.sfs
../angsd0.551/angsd -bam pop1.list -out bingo -realSFS 1
</pre>
../angsd0.551/misc/emOptim2 bingo.sfs 40 -P 24 >bingo.em.ml
Or if want to calculate the folded spectrum.
#(calculate thetas)
<pre>
../angsd0.551/angsd -bam pop1.list -out bongo -doThetas 1 -realSFS 1 -pest bingo.em.ml
./misc/realSFS out.saf.idx -P 24 -fold 1 > out.sfs
../angsd0.551/misc/bgid make_bed taj/aout2.thetas.gz
#(calculate Tajimas.)
../angsd0.551/misc/bgid do_stat taj/aout2.thetas.gz -nChr 40
</pre>
</pre>


To plot the SFS in R :
<pre>
s<-scan('out.sfs')
s<-s[-c(1,length(s))]
s<-s/sum(s)
barplot(s,names=1:length(s),main='SFS')
</pre>


The prior is an estimate of the .sfs which can be found using the [[RealSFS]] method in angsd.
==Step 2: Calculate the thetas for each site==
The output is a file called .thetas.gz
<pre>
To perform the Tajimas D and other related neutralitet tests you should feed the .thetas.gz file into a seperate program called calcstat which can be found in the misc dir of the sourcecode
realSFS saf2theta out.saf.idx -sfs out.sfs -outname out
</pre>
The output from the above command are two files out.thetas.gz and out.thetas.idx. A formal description of these files can be found in the doc/formats.pdf in the angsd package. It is possible to extract the logscale persite thetas using the ./thetaStat print program.


<div class="toccolours mw-collapsible mw-collapsed">
thetaStat print out.thetas.idx 2>/dev/null |head
<pre class="mw-collapsible-content">
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -10.339284 -12.069325 -9.000927 -15.852173 -12.739969
1 14000033 -10.437878 -12.185619 -9.080596 -16.001343 -12.856984
1 14000034 -10.373872 -12.110464 -9.028572 -15.905591 -12.781380
1 14000035 -10.528192 -12.290763 -9.154920 -16.133823 -12.962708
1 14000036 -10.322074 -12.051400 -8.985016 -15.834049 -12.722040
1 14000037 -10.304955 -12.028814 -8.973260 -15.800330 -12.699204
1 14000038 -10.108563 -11.791546 -8.819884 -15.486384 -12.460146
1 14000039 -10.542117 -12.306631 -9.166698 -16.153168 -12.978650
1 14000040 -10.688401 -12.473763 -9.290272 -16.358398 -13.146564
</pre>
</div>
Per default the print command will also output the contents of the index file to the stderr.


<code>
==Step 3a: Estimate Tajimas D and other statistics==
./calcStat24 -tinput file.thetas.gz -step 50000 -win 50000
<pre>
</code>
#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.idx
</pre>


NB: There is a minor bug in version24 that might cause an entire chromosome to be skipped if there are no datapoints in the first physical window. Use a higher version than 25
<div class="toccolours mw-collapsible mw-collapsed">
cat out.thetas.idx.pestPG
<pre class="mw-collapsible-content">
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop)  Chr    WinCenter      tW      tP      tF      tH      tL      Tajima  fuf    fud    fayh    zeng    nSites
(0,98316)(14000032,14100082)(0,14100082)        1      7050041 51.002623      46.171402      64.683834      51.290955      48.731178      -0.392892      -0.647071      -0.595302      -0.099654      -0.048444      98316
(0,98474)(13999910,14100060)(0,14100060)        2      7050030 92.689100      88.806005      101.768262      122.422498      105.614255      -0.174701      -0.252477      -0.220588      -0.360944      0.152373        98474
(0,93269)(14000529,14100095)(0,14100095)        3      7050047 70.757874      76.248087      75.447438      68.354514      72.301301      0.322902        0.020330        -0.148419      0.110921        0.023794        93269
(0,96339)(13999912,14100064)(0,14100064)        4      7050032 99.748624      107.898618      94.265208      130.283528      119.091076      0.340878        0.247030        0.123956        -0.223386      0.211971        96339
(0,99659)(13999926,14100063)(0,14100063)        5      7050031 120.941697      132.667821      86.726667      163.908351      148.288088      0.404945        0.688320        0.639821        -0.257254      0.247395        99659
(0,99541)(13999918,14100103)(0,14100103)        6      7050051 96.666344      112.146685      69.740992      143.403712      127.775201      0.667988        0.792499        0.627735        -0.321842      0.351730        99541
(0,99786)(13999926,14100047)(0,14100047)        7      7050023 93.164548      92.023886      92.742574      142.413716      117.218807      -0.051058      -0.013928      0.010201        -0.538288      0.282133        99786
(0,98759)(13999923,14100082)(0,14100082)        8      7050041 133.567125      177.157879      72.197498      204.069028      190.613463      1.363708        1.425567        1.040517        -0.200700      0.467490        98759
(0,97855)(14001686,14100094)(0,14100094)        9      7050047 88.777475      102.853333      64.660948      104.749694      103.801516      0.660983        0.776148        0.611265        -0.021256      0.184875        97855
(0,98031)(13999906,14100096)(0,14100096)        10      7050048 129.583334      134.877160      88.135115      213.231615      174.054390      0.170681        0.654145        0.724072        -0.602284      0.375595        98031
(0,99220)(13999900,14100060)(0,14100060)        11      7050030 66.349155      79.423643      60.194045      68.903312      74.163477      0.819589        0.520022        0.207421        0.157614        0.128409        99220
(0,99861)(13999913,14100078)(0,14100078)        12      7050039 86.461303      81.630083      96.156392      110.974922      96.302507      -0.232902      -0.302980      -0.252190      -0.337701      0.124323        99861
(0,98258)(13999943,14100097)(0,14100097)        16      7050048 83.191170      99.392421      77.561510      106.148748      102.770584      0.811500        0.472922        0.152079        -0.080798      0.257008        98258
(0,99428)(13999902,14100095)(0,14100095)        17      7050047 90.254620      99.816352      65.610351      113.328929      106.572638      0.441707        0.683942        0.614609        -0.148988      0.197530        99428
(0,97118)(13999898,14100071)(0,14100071)        18      7050035 79.843256      75.282296      86.844252      67.720321      71.501308      -0.237958      -0.260778      -0.196888      0.094212        -0.114062      97118
(0,93783)(13999895,14100089)(0,14100089)        19      7050044 54.311523      49.839190      64.913940      72.868913      61.354048      -0.341795      -0.495649      -0.434079      -0.421111      0.141133        93783
(0,98938)(13999916,14100091)(0,14100091)        20      7050045 68.148147      63.323800      78.463736      56.040370      59.682084      -0.294508      -0.398845      -0.338673      0.106250        -0.135474      98938
</pre>
</div>


NB: Use the bgid program found in the misc folder.
==Step 3b: Sliding Window example==
We can easily do a sliding window analysis by adding -win/-step arguments to the last command. [[ thetaStat ]]
<pre>
thetaStat do_stat out.thetas.idx -win 50000 -step 10000  -outnames theta.thetasWindow.gz
</pre>
This will calculate the test statistic using a window size of 50kb and a step size of 10kb.


=Example Output=
=Example Output=
==.thetas.gz is==
<pre>
<pre>
chr10  1      -7.041327      -7.362126      -6.399318      -8.788662       -7.840050
- Output in the ./thetaStat print thetas.idx are the log scaled per site estimates of the thetas
chr10  2       -8.337345       -8.921354      -7.358819       -11.052126     -9.502293
- Output in the pestPG file are the sum of the per site estimates for a region
chr10  3       -9.283203       -9.945923      -8.207280       -12.281002     -10.546671
</pre>
chr10  4      -1.105216       -0.617477       -35.602360     -0.730974       -0.672616
==./thetaStat print angsdput.thetas.idx==
chr10  5       -7.705427       -8.374287      -6.622234       -10.726430     -8.976529
<pre>
chr10  6       -11.683662      -12.369866     -10.578860      -14.766857     -12.975926
#Chromo Pos    Watterson       Pairwise        thetaSingleton  thetaH  thetaL
chr10  7       -11.688400      -12.374647     -10.583545      -14.771753     -12.980717
1       14000032        -9.457420       -10.372069      -8.319252       -13.025778     -10.997194
chr10  8       -12.989104      -13.675391      -11.884201      -16.072602     -14.281470
1       14000033        -9.463637       -10.379368      -8.324414       -13.035780     -11.004670
chr10  9      -13.682495     -14.368910      -12.577435     -16.766463      -14.975017
1      14000034        -9.463740       -10.379488     -8.324500       -13.035942      -11.004793
chr10  10      -1.105216       -0.655518       -37.649835     -0.982402       -0.805662
1       14000035        -9.463603       -10.379328      -8.324386       -13.035725     -11.004629
chr10  11      -14.433381      -15.119890     -13.328206      -17.517696     -15.726019
1       14000036        -9.323246      -10.218453     -8.204848      -12.826627     -10.840519
chr10  12      -14.410507      -15.097004      -13.305348      -17.494774      -15.703129
1       14000037        -9.179270      -10.048883     -8.086425      -12.596436     -10.666670
chr10  13      -14.963982      -15.650467     -13.858836      -18.048209     -16.256590
1       14000038        -9.004664      -9.845473      -7.941453      -12.328274     -10.458416
chr10  14      -15.657183      -16.343698     -14.552002      -18.741516     -16.949827
1      14000039        -9.327033       -10.222983     -8.207914      -12.833007     -10.845176
chr10  15      -15.762558      -16.449085     -14.657361      -18.846938     -17.055218
1      14000040        -9.621554       -10.557563     -8.461745       -13.262415      -11.185971
chr10  16      -1.105216       -0.766882       -12.373455     -1.539395       -1.080326
1      14000041        -9.617449      -10.552869     -8.458225      -13.256257     -11.181185
chr10  17      -15.876465      -16.563007     -14.771249      -18.960901     -17.169143
1      14000042        -7.337841      -8.161756      -204.045433    -5.457443      -6.085818
chr10  18      -17.256944      -17.943487     -16.151728      -20.341382     -18.549623
1      14000043        -9.570405      -10.502160     -8.415195      -13.197596     -11.129976
1      14000044        -9.511097      -10.434558     -8.364249      -13.110037     -11.061100
1      14000045        -9.563664      -10.494371     -8.409489      -13.187203     -11.122022
1      14000046        -9.617690       -10.555402     -8.456395       -13.265004      -11.184107
1      14000047        -9.563722      -10.494438     -8.409538      -13.187292     -11.122090
1      14000048        -9.856578      -10.819096     -8.669691      -13.587898     -11.451396
</pre>
</pre>
;1. chromosome
;1. chromosome
Line 60: Line 168:
;4. ThetaD (nucleotide diversity)
;4. ThetaD (nucleotide diversity)
;5. Theta? (singleton category)
;5. Theta? (singleton category)
;6. ThetaL
;6. ThetaH
;7. ThetaH
;7. ThetaL
==.thetas.gz.pestPG==
 
==.thetas.idx.pestPG==
The .pestPG file is a 14 column file (tab seperated). The first column contains information about the region. The second and third column is the reference name and the center of the window.
 
We then have 5 different estimators of theta, these are: Watterson, pairwise, FuLi, fayH, L.
And we have 5 different neutrality test statistics: Tajima's D, Fu&Li F's, Fu&Li's D, Fay's H, Zeng's E.
The final column is the effetive number of sites with data in the window.


<pre>
<pre>
(979699,1029444)(1000000,1050000)       chr10   1025000 -0.111223       -0.414967      -0.395241      -0.036406      -0.012747      183.897443     179.566460
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
(1029444,1078839)(1050000,1100000)     chr10  1075000 0.007024       -0.229418       -0.191991       -0.077881       0.047612        152.483006      152.709929
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStopChr    WinCenter       tW      tP      tF      tH      tL     Tajima  fuf    fud    fayh    zeng    nSites
(1078839,1128837)(1100000,1150000)      chr10  1125000 -0.723711       -0.855712       -1.006656       -0.322342       -0.033026       197.304744      167.075081
(0,98316)(14000032,14100082)(0,14100082)       1      7050041 51.002623       46.171402       64.683834       51.290955      48.731178      -0.392892       -0.647071       -0.595302       -0.099654       -0.048444       98316
(1128837,1175623)(1150000,1200000)     chr10  1175000 -0.806392      -1.301255       -1.416623       0.113205        -0.312597       147.828056     122.568361
(0,98474)(13999910,14100060)(0,14100060)        2       7050030 92.689100       88.806005       101.768262     122.422498      105.614255     -0.174701       -0.252477       -0.220588       -0.360944       0.152373        98474
(1175623,1225623)(1200000,1250000)     chr10  1225000 -0.691517       -0.808718       -0.954130       -0.016786       -0.201603      151.001106      128.876508
(0,93269)(14000529,14100095)(0,14100095)        3      7050047 70.757874       76.248087       75.447438       68.354514      72.301301       0.322902        0.020330        -0.148419       0.110921        0.023794       93269
(1225623,1271896)(1250000,1300000)     chr10  1275000 -0.338848       -0.482030       -0.540336       0.007672        -0.108077       165.844115      153.941040
(0,96339)(13999912,14100064)(0,14100064)        4       7050032 99.748624       107.898618      94.265208       130.283528     119.091076     0.340878       0.247030       0.123956       -0.223386      0.211971       96339
(1271896,1320928)(1300000,1349963)      chr10  1324981 -0.387835      -0.070543      -0.210244       -0.245692      0.024921       215.743283      198.033441
(0,99659)(13999926,14100063)(0,14100063)       5      7050031 120.941697      132.667821      86.726667       163.908351     148.288088     0.404945       0.688320       0.639821       -0.257254      0.247395       99659
(1320928,1368926)(1349963,1400000)     chr10  1374981 -0.176277       -0.131296       -0.179671      -0.104470       0.007090        140.876813     135.613769
(0,99541)(13999918,14100103)(0,14100103)        6       7050051 96.666344       112.146685      69.740992       143.403712     127.775201     0.667988       0.792499       0.627735       -0.321842       0.351730       99541
(1368926,1418197)(1400000,1450000)     chr10  1425000 0.499577       0.217479       0.378439       0.233619        0.016264       328.809655      363.547699
(0,99786)(13999926,14100047)(0,14100047)       7      7050023 93.164548      92.023886      92.742574       142.413716      117.218807     -0.051058       -0.013928      0.010201       -0.538288       0.282133       99786
(1418197,1466420)(1450000,1500000)     chr10  1475000 0.305267       0.467985        0.515463        0.272164        -0.065638       220.966737     235.242873
(0,98759)(13999923,14100082)(0,14100082)       8      7050041 133.567125     177.157879      72.197498      204.069028      190.613463      1.363708       1.425567       1.040517       -0.200700      0.467490       98759
(1466420,1516380)(1500000,1550000)     chr10  1525000 0.391637       0.610786       0.670051       0.146180        0.034386       163.579256      177.149431
 
(1516380,1564813)(1550000,1600000)     chr10  1575000 -0.176713      -0.113138       -0.164435       0.043673        -0.079513       145.362704     139.919271
(1564813,1614333)(1600000,1650000)     chr10  1625000 0.463855       0.914256       0.955478       -0.376847       0.361592       113.119044      124.249808
(1614333,1662807)(1650000,1700000)     chr10  1675000 0.124714       0.972551        0.874178        -0.489383       0.323134        73.666310      75.620133
(1662807,1710569)(1700000,1750000)     chr10  1725000 -0.095232       0.493424        0.382792       -0.088652       0.022494       55.189016      54.068586
(1710569,1756421)(1750000,1800000)      chr10  1775000 0.527467       0.554005       0.674397       0.052715        0.130473       143.169177      159.172914
</pre>
</pre>
Format is:
Format is:


<code>
<code>
(indexStart,indexStop)(posStart,posStop) chrname  wincenter tajD fuliD fuliF fayH zengsE thetaW thetaPi
(indexStart,indexStop)(posStart,posStop)(regStat,regStop) chrname  wincenter tW tP tF tH tL  tajD fulif fuliD fayH zengsE numSites
</code>
</code>
Most likely you are just interest in the wincenter (column 3) and the column 9 which is the Tajima's D statistic.
The first 3 columns relates to the region. The next 5 columns are 5 different estimators of theta, and the next 5 columns are neutrality test statistics. The final column is the number of sites with data in the region.
The first '''()()()''' er mainly used for debugging the sliding window program. The interpretation is:
* The posStart and posStop is the first physical position, and last physical postion of sites included in the analysis.
* The regStat and regStop is the physical region for which the analysis is performed.  Therefore the posStat and posStop is always included within the regStart and regStop
* The indexStart and IndexStop is the position within the internal array.
=Unknown ancestral state (folded sfs)=
If you don't have the ancestral states, you can still calculate the Watterson and Tajima theta, which means you can perform the Tajima's D neutrality test statistic. But this requires you to use the folded sfs. The output files will have the same format, but only the thetaW and thetaD, and tajimas D is meaningful.
There was previously an example below that showed how to perform this analysis. This information has now been added to the examples above (notice the -fold 1) step in realSFS.


=Citation=
=Citation=
[[Korneliussen2013]]
[[Korneliussen2013]]

Latest revision as of 13:40, 27 August 2020

This method will estimate different thetas (population scaled mutation rate) and can based on these thetas calculate Tajima's D and various other neutrality test statistics. Method is described in Korneliussen2013.

  • NB Information on this website is for version 0.917-33-g6d2aec8 or higher.
  • NB The Korneliussen2013 covers two methods,
  1. using an ML method
  2. using the emperical Bayes (EB) method. The information on this page relates to the EB method.

For performing the ML method, you should the use the SFS Estimation method and define the region af interest.

Quick Example

Below is a chain of commands used for caculating statistics. These are based on the test files that can be dowloaded on the Quick Start page.

Its a 3 step procedure

  1. Estimate an site frequency spectrum. Output is out.sfs file. This is what is being used as the -pest argument in step2.
  2. Calculate per-site thetas. Output is a .thetas.idx/.thetas.gz files. This contains the binary persite estimates of the thetas.
  3. Calculate neutrality tests statistics. Output is a .thetas.idx.pestPG file.

Full command list for below examples

Here is the chain of commands required to do estimate the thetas, and perform neutrality test statistics. These different commands are described in great detail in the following step 1,... step 3b sub sections. If you do not have the ancestral state you can simply use the assembly you have mapped agains, but remember to add -fold 1 in the 'realSFS' and 'realSFS sf2theta' step.

./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out 
#for unfolded
./misc/realSFS out.saf.idx -P 24 > out.sfs
./misc/realSFS saf2theta out.saf.idx -outname out -sfs out.sfs
#for folded
./misc/realSFS out.saf.idx -P 24 -fold 1 > out.sfs
./misc/realSFS saf2theta out.saf.idx -outname out -sfs out.sfs -fold 1
#Estimate for every Chromosome/scaffold
./misc/thetaStat do_stat out.thetas.idx
#Do a sliding window analysis based on the output from the make_bed command.
./misc/thetaStat do_stat out.thetas.idx -win 50000 -step 10000  -outnames theta.thetasWindow.gz

Step 1: Finding a 'global estimate' of the SFS

First estimate the site allele frequency likelihood

./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out


        -> Reading fasta: chimpHg19.fa
        -> Parsing 10 number of samples 
        -> Printing at chr: 20 pos:14095817 chunknumber 3500
        -> Done reading data waiting for calculations to finish
        -> Calling destroy
        -> Done waiting for threads
        -> Output filenames:
                ->"out.arg"
                ->"out.saf"
                ->"out.saf.pos.gz"
        -> Mon Jun 30 12:02:58 2014
        -> Arguments and parameters for all analysis are located in .arg file
        [ALL done] cpu-time used =  47.19 sec
        [ALL done] walltime used =  43.00 sec




Obtain the maximum likelihood estimate of the SFS using the realSFS program found in the misc subfolder. (See more here realSFS)

./misc/realSFS out.saf.idx -P 24 > out.sfs

Or if want to calculate the folded spectrum.

./misc/realSFS out.saf.idx -P 24 -fold 1 > out.sfs

To plot the SFS in R :

s<-scan('out.sfs')
s<-s[-c(1,length(s))]
s<-s/sum(s)
barplot(s,names=1:length(s),main='SFS')
 

Step 2: Calculate the thetas for each site

realSFS saf2theta out.saf.idx -sfs out.sfs -outname out

The output from the above command are two files out.thetas.gz and out.thetas.idx. A formal description of these files can be found in the doc/formats.pdf in the angsd package. It is possible to extract the logscale persite thetas using the ./thetaStat print program.

thetaStat print out.thetas.idx 2>/dev/null |head

#Chromo	Pos	Watterson	Pairwise	thetaSingleton	thetaH	thetaL
1	14000032	-10.339284	-12.069325	-9.000927	-15.852173	-12.739969
1	14000033	-10.437878	-12.185619	-9.080596	-16.001343	-12.856984
1	14000034	-10.373872	-12.110464	-9.028572	-15.905591	-12.781380
1	14000035	-10.528192	-12.290763	-9.154920	-16.133823	-12.962708
1	14000036	-10.322074	-12.051400	-8.985016	-15.834049	-12.722040
1	14000037	-10.304955	-12.028814	-8.973260	-15.800330	-12.699204
1	14000038	-10.108563	-11.791546	-8.819884	-15.486384	-12.460146
1	14000039	-10.542117	-12.306631	-9.166698	-16.153168	-12.978650
1	14000040	-10.688401	-12.473763	-9.290272	-16.358398	-13.146564

Per default the print command will also output the contents of the index file to the stderr.

Step 3a: Estimate Tajimas D and other statistics

#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.idx

cat out.thetas.idx.pestPG

## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop)   Chr     WinCenter       tW      tP      tF      tH      tL      Tajima  fuf     fud     fayh    zeng    nSites
(0,98316)(14000032,14100082)(0,14100082)        1       7050041 51.002623       46.171402       64.683834       51.290955       48.731178       -0.392892       -0.647071       -0.595302       -0.099654       -0.048444       98316
(0,98474)(13999910,14100060)(0,14100060)        2       7050030 92.689100       88.806005       101.768262      122.422498      105.614255      -0.174701       -0.252477       -0.220588       -0.360944       0.152373        98474
(0,93269)(14000529,14100095)(0,14100095)        3       7050047 70.757874       76.248087       75.447438       68.354514       72.301301       0.322902        0.020330        -0.148419       0.110921        0.023794        93269
(0,96339)(13999912,14100064)(0,14100064)        4       7050032 99.748624       107.898618      94.265208       130.283528      119.091076      0.340878        0.247030        0.123956        -0.223386       0.211971        96339
(0,99659)(13999926,14100063)(0,14100063)        5       7050031 120.941697      132.667821      86.726667       163.908351      148.288088      0.404945        0.688320        0.639821        -0.257254       0.247395        99659
(0,99541)(13999918,14100103)(0,14100103)        6       7050051 96.666344       112.146685      69.740992       143.403712      127.775201      0.667988        0.792499        0.627735        -0.321842       0.351730        99541
(0,99786)(13999926,14100047)(0,14100047)        7       7050023 93.164548       92.023886       92.742574       142.413716      117.218807      -0.051058       -0.013928       0.010201        -0.538288       0.282133        99786
(0,98759)(13999923,14100082)(0,14100082)        8       7050041 133.567125      177.157879      72.197498       204.069028      190.613463      1.363708        1.425567        1.040517        -0.200700       0.467490        98759
(0,97855)(14001686,14100094)(0,14100094)        9       7050047 88.777475       102.853333      64.660948       104.749694      103.801516      0.660983        0.776148        0.611265        -0.021256       0.184875        97855
(0,98031)(13999906,14100096)(0,14100096)        10      7050048 129.583334      134.877160      88.135115       213.231615      174.054390      0.170681        0.654145        0.724072        -0.602284       0.375595        98031
(0,99220)(13999900,14100060)(0,14100060)        11      7050030 66.349155       79.423643       60.194045       68.903312       74.163477       0.819589        0.520022        0.207421        0.157614        0.128409        99220
(0,99861)(13999913,14100078)(0,14100078)        12      7050039 86.461303       81.630083       96.156392       110.974922      96.302507       -0.232902       -0.302980       -0.252190       -0.337701       0.124323        99861
(0,98258)(13999943,14100097)(0,14100097)        16      7050048 83.191170       99.392421       77.561510       106.148748      102.770584      0.811500        0.472922        0.152079        -0.080798       0.257008        98258
(0,99428)(13999902,14100095)(0,14100095)        17      7050047 90.254620       99.816352       65.610351       113.328929      106.572638      0.441707        0.683942        0.614609        -0.148988       0.197530        99428
(0,97118)(13999898,14100071)(0,14100071)        18      7050035 79.843256       75.282296       86.844252       67.720321       71.501308       -0.237958       -0.260778       -0.196888       0.094212        -0.114062       97118
(0,93783)(13999895,14100089)(0,14100089)        19      7050044 54.311523       49.839190       64.913940       72.868913       61.354048       -0.341795       -0.495649       -0.434079       -0.421111       0.141133        93783
(0,98938)(13999916,14100091)(0,14100091)        20      7050045 68.148147       63.323800       78.463736       56.040370       59.682084       -0.294508       -0.398845       -0.338673       0.106250        -0.135474       98938

Step 3b: Sliding Window example

We can easily do a sliding window analysis by adding -win/-step arguments to the last command. thetaStat

thetaStat do_stat out.thetas.idx -win 50000 -step 10000  -outnames theta.thetasWindow.gz

This will calculate the test statistic using a window size of 50kb and a step size of 10kb.

Example Output

- Output in the ./thetaStat print thetas.idx are the log scaled per site estimates of the thetas
- Output in the pestPG file are the sum of the per site estimates for a region

./thetaStat print angsdput.thetas.idx

#Chromo Pos     Watterson       Pairwise        thetaSingleton  thetaH  thetaL
1       14000032        -9.457420       -10.372069      -8.319252       -13.025778      -10.997194
1       14000033        -9.463637       -10.379368      -8.324414       -13.035780      -11.004670
1       14000034        -9.463740       -10.379488      -8.324500       -13.035942      -11.004793
1       14000035        -9.463603       -10.379328      -8.324386       -13.035725      -11.004629
1       14000036        -9.323246       -10.218453      -8.204848       -12.826627      -10.840519
1       14000037        -9.179270       -10.048883      -8.086425       -12.596436      -10.666670
1       14000038        -9.004664       -9.845473       -7.941453       -12.328274      -10.458416
1       14000039        -9.327033       -10.222983      -8.207914       -12.833007      -10.845176
1       14000040        -9.621554       -10.557563      -8.461745       -13.262415      -11.185971
1       14000041        -9.617449       -10.552869      -8.458225       -13.256257      -11.181185
1       14000042        -7.337841       -8.161756       -204.045433     -5.457443       -6.085818
1       14000043        -9.570405       -10.502160      -8.415195       -13.197596      -11.129976
1       14000044        -9.511097       -10.434558      -8.364249       -13.110037      -11.061100
1       14000045        -9.563664       -10.494371      -8.409489       -13.187203      -11.122022
1       14000046        -9.617690       -10.555402      -8.456395       -13.265004      -11.184107
1       14000047        -9.563722       -10.494438      -8.409538       -13.187292      -11.122090
1       14000048        -9.856578       -10.819096      -8.669691       -13.587898      -11.451396
1. chromosome
2. position
3. ThetaWatterson
4. ThetaD (nucleotide diversity)
5. Theta? (singleton category)
6. ThetaH
7. ThetaL

.thetas.idx.pestPG

The .pestPG file is a 14 column file (tab seperated). The first column contains information about the region. The second and third column is the reference name and the center of the window.

We then have 5 different estimators of theta, these are: Watterson, pairwise, FuLi, fayH, L. And we have 5 different neutrality test statistics: Tajima's D, Fu&Li F's, Fu&Li's D, Fay's H, Zeng's E. The final column is the effetive number of sites with data in the window.

## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop)   Chr     WinCenter       tW      tP      tF      tH      tL      Tajima  fuf     fud     fayh    zeng    nSites
(0,98316)(14000032,14100082)(0,14100082)        1       7050041 51.002623       46.171402       64.683834       51.290955       48.731178       -0.392892       -0.647071       -0.595302       -0.099654       -0.048444       98316
(0,98474)(13999910,14100060)(0,14100060)        2       7050030 92.689100       88.806005       101.768262      122.422498      105.614255      -0.174701       -0.252477       -0.220588       -0.360944       0.152373        98474
(0,93269)(14000529,14100095)(0,14100095)        3       7050047 70.757874       76.248087       75.447438       68.354514       72.301301       0.322902        0.020330        -0.148419       0.110921        0.023794        93269
(0,96339)(13999912,14100064)(0,14100064)        4       7050032 99.748624       107.898618      94.265208       130.283528      119.091076      0.340878        0.247030        0.123956        -0.223386       0.211971        96339
(0,99659)(13999926,14100063)(0,14100063)        5       7050031 120.941697      132.667821      86.726667       163.908351      148.288088      0.404945        0.688320        0.639821        -0.257254       0.247395        99659
(0,99541)(13999918,14100103)(0,14100103)        6       7050051 96.666344       112.146685      69.740992       143.403712      127.775201      0.667988        0.792499        0.627735        -0.321842       0.351730        99541
(0,99786)(13999926,14100047)(0,14100047)        7       7050023 93.164548       92.023886       92.742574       142.413716      117.218807      -0.051058       -0.013928       0.010201        -0.538288       0.282133        99786
(0,98759)(13999923,14100082)(0,14100082)        8       7050041 133.567125      177.157879      72.197498       204.069028      190.613463      1.363708        1.425567        1.040517        -0.200700       0.467490        98759

Format is:

(indexStart,indexStop)(posStart,posStop)(regStat,regStop) chrname wincenter tW tP tF tH tL tajD fulif fuliD fayH zengsE numSites

Most likely you are just interest in the wincenter (column 3) and the column 9 which is the Tajima's D statistic.

The first 3 columns relates to the region. The next 5 columns are 5 different estimators of theta, and the next 5 columns are neutrality test statistics. The final column is the number of sites with data in the region.


The first ()()() er mainly used for debugging the sliding window program. The interpretation is:

  • The posStart and posStop is the first physical position, and last physical postion of sites included in the analysis.
  • The regStat and regStop is the physical region for which the analysis is performed. Therefore the posStat and posStop is always included within the regStart and regStop
  • The indexStart and IndexStop is the position within the internal array.

Unknown ancestral state (folded sfs)

If you don't have the ancestral states, you can still calculate the Watterson and Tajima theta, which means you can perform the Tajima's D neutrality test statistic. But this requires you to use the folded sfs. The output files will have the same format, but only the thetaW and thetaD, and tajimas D is meaningful.

There was previously an example below that showed how to perform this analysis. This information has now been added to the examples above (notice the -fold 1) step in realSFS.

Citation

Korneliussen2013