angsd - User contributions [en]

Installation

2022-07-16T12:04:42Z

Thorfinn:

There has been some confusion about the versions of ANGSD.

* Even versions are freezes of the last odd git version.

* Odd versions are git versions. Once there have been enough commits, we increment the version and make a release.
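
The even/odd rule can be checked mechanically. The following is a minimal Python sketch, assuming versions are written like 0.938 and that only the parity of the final digit matters. The function name angsd_version_kind is hypothetical and not part of ANGSD.

```python
def angsd_version_kind(version):
    """Classify an ANGSD version string by the parity of its last digit.

    Assumption: the version looks like '0.938'. This helper is purely
    illustrative and is not shipped with ANGSD.
    """
    last = version.strip()[-1:]
    if not last.isdigit():
        raise ValueError("version must end in a digit: %r" % version)
    # Even last digit: frozen tarball release; odd: git (development) version.
    return "release" if int(last) % 2 == 0 else "git"


print(angsd_version_kind("0.938"))
print(angsd_version_kind("0.939"))
```

The rule is the one stated for current versions; older entries in the change log may not follow it.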

=Download and Installation=
To download and use ANGSD you need both the htslib and the angsd source folders.

You can either download angsd0.938.tar.gz, which contains both:
[http://popgen.dk/software/download/angsd/angsd0.938.tar.gz]

Or you can use github for the latest version of both htslib and angsd.

Earlier versions are available here: http://popgen.dk/software/download/angsd/
and here: https://github.com/ANGSD/angsd/releases

=Install=
Download and unpack the tarball, enter the directory and type make. Users on a Mac can use curl instead of wget.

===Unix===
The software can be compiled using make.
<pre>
wget http://popgen.dk/software/download/angsd/angsd0.938.tar.gz
tar xf angsd0.938.tar.gz
cd htslib;make;cd ..
cd angsd
make HTSSRC=../htslib
cd ..
</pre>
The executable is then located in '''angsd/angsd'''.

=Install from github=
For CRAM support you also need to install htslib. Both htslib and angsd can be installed from github using the following commands:

<pre>
git clone --recursive https://github.com/samtools/htslib.git
git clone https://github.com/ANGSD/angsd.git
cd htslib;make;cd ../angsd ;make HTSSRC=../htslib
</pre>

=Systemwide installation of htslib?=
If htslib is already installed systemwide, you just type make in the angsd directory.

Thorfinn

2022-07-16T12:04:07Z

Thorfinn:

=How to deploy=
This page describes how a new version should be put on the wiki and github.
==Make a combined angsd htslib to put on wiki download==
Clone the latest sources, make a new annotated tag, and copy the tarball to the wiki download area.
<pre>
VERSION=0.938
mkdir delme
cd delme
git clone --depth=1 https://github.com/ANGSD/angsd
git clone --depth=1 --recursive https://github.com/SAMtools/htslib
cd angsd; git tag -a ${VERSION} -m "time for version ${VERSION}"
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
cd ..
tar --exclude='.git' -czvf angsd${VERSION}.tar.gz angsd htslib/
scp angsd${VERSION}.tar.gz software@popgen.dk:/home/software/download/angsd/
</pre>

==Make a new github version to put on github==

<pre>
VERSION=0.939
cd angsd/
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
git commit Makefile -m "Preparing new ${VERSION} version"
git tag -a ${VERSION} -m "time for a new version ${VERSION}"
git push
git push --tags
</pre>

Topbar is here:
http://www.popgen.dk/angsd/index.php/MediaWiki:Sitenotice

MediaWiki:Sitenotice

2022-07-16T12:03:46Z

Thorfinn:

ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is 0.938 (0.939 on github); see [[Change_log]] for changes, and download it [[Download and installation | here]].

Thorfinn

2022-07-16T12:01:13Z

Thorfinn:

Change log

2022-07-16T11:59:14Z

Thorfinn:

=Latest=
Odd versions are github versions...
*0.937 https://github.com/ANGSD/angsd/compare/0.937...0.939
*0.935 https://github.com/ANGSD/angsd/compare/0.935...0.937
*0.933 https://github.com/ANGSD/angsd/compare/0.933...0.935
*0.931 https://github.com/ANGSD/angsd/compare/0.931...0.933
*0.929 https://github.com/ANGSD/angsd/compare/0.929...0.931
*0.927 https://github.com/ANGSD/angsd/compare/0.925...0.927
*0.925 https://github.com/ANGSD/angsd/compare/0.923...0.925
*0.923 https://github.com/ANGSD/angsd/compare/0.921...0.923
*0.921 https://github.com/ANGSD/angsd/compare/0.918...0.921
*0.918 https://github.com/ANGSD/angsd/compare/0.916...0.918
*0.916 https://github.com/ANGSD/angsd/compare/0.914...0.916
*0.914 https://github.com/ANGSD/angsd/compare/0.912...0.914
*0.912 https://github.com/ANGSD/angsd/compare/0.911...0.912
*0.911 https://github.com/ANGSD/angsd/compare/0.910...0.911
*0.910 https://github.com/ANGSD/angsd/compare/0.901...0.910
*0.901 https://github.com/ANGSD/angsd/compare/0.900...0.901
*0.900 https://github.com/ANGSD/angsd/compare/0.800...0.900
*0.800 https://github.com/ANGSD/angsd/compare/0.700...0.800
*0.700 https://github.com/ANGSD/angsd/compare/0.615...0.70

=0.6***=
*0.614 https://github.com/ANGSD/angsd/compare/0.614...0.615
*0.613 https://github.com/ANGSD/angsd/compare/0.613...0.614
*0.613 https://github.com/ANGSD/angsd/compare/0.612a...0.613
*0.612 https://github.com/ANGSD/angsd/compare/0.610...0.612a
*0.610 https://github.com/ANGSD/angsd/compare/0.609...0.610a
*0.609 https://github.com/ANGSD/angsd/compare/0.608...0.609
*0.608 not super useful but we now compile with knetfile, so users can use remote .fa files. I really recommend that users download the fasta instead.
*0.607 changed name of all abstract base classes to the more reasonable abc*.cpp. Included contamination and the iCounts format. Added a templated class so users can see how to access the internal datastructures.
*0.606 added more info in the thetaStat subprogram, renamed all analysis classes to abc*.cpp.
*0.605 continued updating the Eunjung code
*0.604 added a 'job' array for the analysis classes which should greatly reduce the number of needed function calls. I doubt that will make any noticeable speed difference. Has copied and modified some code from Eunjung (from jnovembre lab) for a banded approach in the saf calculation. This is not working yet (-doSaf 2).
*0.603 1) fixed wrong branching if users used simfiles. 2) fixed 2 bugs in the inbreeding part of angsd_realSFS.cpp. Very small changes
*0.602 1) programs output the actual chromosome, if there is a mismatch between the fasta and bamheader 2) added a check that the index files (fai vs fa, bai vs bam) are newer than the files they index. 3) Program now validates that the .so and include files used are the same.
*0.601 1) started the move to includeflags/discardflags, outcommented in this version 2) validates that data has been generated for the different samples by looking at difference between values, instead of comparing against -0.0. 3) MAF estimates now skips estimation for a site if the updated GL shows that GLs are noninformative.
*0.600 fixed a change in default FLAGS when using bam files. If you hadn't removed the unmapped reads from your bamfiles these would have been included in the analysis.

=0.589 to 0.599=
* 0.589 removed soap/sim/glf/glfclean(bin and text) and tglf.
* 0.590 added text mpileup as new input format. Very useful
* 0.591 refactored the file reading, moved the arguments to multi reader, such that all file reading is done from multi reader. FREEZE version. We will only allow bug fixes in the next many versions.
* 0.592 -pest output in angsd_realSFS.cpp. smartcounts. fixed a bug in hetplas. edited the printout msg for 'if you really want angsd to exit'. Maybe fixed an unknown bug. line 2 in .arg or screen output is now commandline used, and some other visual stuff. Program didn't complain if -doMaf but no -domajorminor. if chromosome name contained ':' it wouldn't work (strchr->strrchr)
* 0.593 fixed a strange bug, where the program would crash if no analysis was chosen. fixed the 'shouldbeone' bug.
* 0.594 no changes
* 0.595 no changes
* 0.596 fixed a bug in persite depth counter (double count of C alleles). fixed a bug in smartcount subprogram.
* 0.597 treemix input file generation from smart counts
* 0.598 fixed a wrong compile flag in one of the utility programs 'smartCount'
* 0.599 much more informative messages if users forgot to add the -fai argument. Fixed a bug in parsing of arguments with -doSaf 4.

=0.571 to 0.588=
* 0.571 removed bf's from maf class, negative values of -domaf disabled dumping of files, inbreeding has been added
* 0.572 Some funky new approach for the makefile is now being used, minor bug fixes (minDepth -> setMinDepth, extra header column in thetas.gz has been fixed)
* 0.573 added better info if bamfiles have different headers. Added check if length of supplied reference/ancestral doesn't match bamheader. autosize in emOptim2 for 2dsfs. Fixed subtle issue if very large coverage between bamfiles, now the 'biggest (file size)' is used to select region instead of the first bamfile. Netaccess is now deprecated.
* 0.574 modified emOptim2 so now compiles on mac
* 0.575 smaller fixes to the inbreeding parsing, added p-value in analysisMaf instead of the raw llh.
* 0.576 -doSNP and -minLRT now deprecated, please use -SNP_pval instead
* 0.577 bugfix for -SNP_pval if value was 1. doHWE now called -HWE_pval and can be used for filtering.
* 0.578 speedup in HWE stuff
* 0.579 fixed an extremely rare assertion error (program was working, assertion was off). Redid all strcmp to strcasecmp. Fixed a bug in -doMaf 2 with -snp_pval
* 0.580 no changes...
* 0.581 if trimming has been enabled, N's will be plugged in instead of the bases. A number of small changes.
* 0.582 no changes...
* 0.583 1) changed bugfix when using counts based estimator for major/minor 2) keepsites is now using the effective number of samples in all cases 3) changed output of maf to a 'nicer' format
* 0.584 updated internal testing scripts.
* 0.585 1) fixed 'baq complains even though -ref was supplied' 2) fixed -doMajorMinor 4 and doMajorMinor 5 (sites not discarded) 3) added tremendously better information for the -sites 4) added some check for -doPost 5) program can now exit uncleanly if ctrl+c is pressed 3 times. 6) added an else to catch wrong arg in thetaStat.
* 0.586 1) fixed parsing of pars if input is -sim1 2) fixed a bug in -doFasta 3 (the ebd one) 3) fixed a printout problem in -doFasta
* 0.587 1) fixed a number of minor instances where memory wasn't being freed/deleted (mostly for keeping valgrind silent) 2) fixed a memleak if -sites files contained 4 columns but -doMajorMinor not 3. 3) fixed a memory leak that could occur if -doPost 2 and -doMaf 0.
* 0.588 1) fixed small printout error that could cause segfaults in rare cases. 2) change stderr to pos+1 3) changed some checks of user supplied pars 4) fixed a stack overflow if a very long -rf file was supplied.

=old=
==Dirty==
* 0.16 Is now bundled with SAMtools-0.1.17 and the mpileup (and friends) commands can be used for passing data to dirty.
* 0.17 added extra options -minInd and -minMaf, for only printing and using sites above a threshold
* 0.18 added option to pass reference and ancestral allele as fasta files (using faidx format). (doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE)
* 0.19 added support for tglf inputfiles, -tglf -posfile see runexamples, also added the likeratio test for snp calling
* 0.20 Added the check for missing data, before the major/minor. included -realSFS, changed the deallocation of the -doMAF results, such that it is properly cleaned up.
* 0.21 refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that prevented -samglf and samglfclean from working)
* 0.22 Well this update was a mixture of edits from [[user:albrecht]] and BGI so it's difficult to give a concise description
* 0.23 Program can now read simulated files (single pop only). An example can be seen in "full example ... sfs" and input types.
* 0.24 added the tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso
* 0.25 the depth is now being populated when using mpileup -g. The program can now get the counts from mpileup
==ANGSD==
=<0.5=
* 0.01.a - 0.01.b The bfgs now supports threading, maybe anders implemented a heterozygosity estimator.
* 0.01.c A problem if we didn't observe any llh, caused the MAF estimator to 'nan'.
* 0.02 Fixed small bug in bfgs optimization of sfs optimization. When choosing a region bigger than what was covered by the .sfs file the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
* 0.03 added and documented genotypecaller, can dump counts,-realSFS 1 dumps positions, -realSFS 2 is deprecated,S,pi and tajima has been added to sfstools along with possibility to do prior
* 0.3 clean version with less features. The lost features will be reintroduced later.
* 0.43 first very clean version, everything should be included
* 0.441 rewrote the SOAPsnp GL model, -L and -maxQ is not needed anymore. Also added an option to choose an output dir for the recalibration matrix
* 0.4471 error estimation is now working, the fasta reading is now threadsafe. all GLs are now likeratios.
* 0.512 After 0.500 we have changed the internal structure such that each chunk is enforced to be on the same chr. version c) fixes a problem of hardclipping
* 0.515 A lot of legacy code has been removed from mUppile.cpp. Program can now use remote files, built on code from SAMtools
* 0.520 A lot more legacy code has been removed from mUpPile.cpp. Program now does baq and adjustment of mapQ similar to -C in samtools. Also compiles on osx, but this is not supported
* 0.535 Bug in internal representation of mapQ's (only problematic for mapQ>128), we now use the flag to determine if a read has mapped. calcstat is now deprecated, users should use the bgid program now.
* 0.538 changed position output in association part. Fixed incorrect assert assumption in mUpPile.cpp. Added some downsampling options for errorEst and changed internal buffering when reading beagle files to allow for >10k individuals.
* 0.549 too many changes to remember

=0.551 to 0.570=
* 0.551 tajima paper is now published, so the emOptim2 and bgid has now been properly documented. plink output is now supported and some snp filters can be outputted.
* 0.552 minor bug when calling genotypes without defining postcutoff -> missingness couldn't occur. removed the optimSFS and emOptim from the default compilelist
* 0.553 uint removed from code.
* 0.554 plugged in sfstools functionality into main angsd, (ability to output log posts)
* 0.555 anders added some consensus stuff
* 0.556 updated the filtering (if binary rep of keep file is incomplete it is removed again. It checks timestamps to see if file has been updated), folded spectra analysis should now be working
* 0.557 There was a bug in the realsfs part of the code that was introduced in the 0.556 version. 0.557 is simply a fix of this, and the removal of a warning compiler flag in the msToGlf subprogram. We only observed the problematic compiler flag on an osx machine
* 0.558 tempversion,from this version bgid is now called thetaStat
* 0.559 abbababa
* 0.560 analysisCount.cpp has been updated to the nice standard of the wiki
* 0.561 program now compiles on clang, many small compiler warnings have been fixed.
* 0.562 Merge of forked versions, abba-baba fasta.
* 0.563 minQ filter has been moved to a much earlier step. Previously it was downstream classes that checked this. Now a base will be set to 'n' if it is below the threshold
* 0.564 cleaned up funky pars maf/asso such that all results are in ->extras[]
* 0.565 moved file reading stuff from shared to analysisFunction in namespace ail::
* 0.566 cleaned up small things, added a newer version of hetplas
* 0.567 cleaned up small things again. Started to add single pars e.g. -P -b
* 0.568 refactored compile order in general.cpp
* 0.569 added saf genotype calling, changed name from -realsfs to -doSaf
* 0.570 modified emOptim2 to estimate nSites and tell how much memory it will use, fixed empty -bam file

RealSFS

2022-06-26T11:20:36Z

Thorfinn:

* This program will estimate the (multi) SFS based on a .saf file generated from the '''./angsd [options] -doSaf '''.

* It can also print the saf files as text.

* It can also convert to the old format.

* You can also specify regions of analysis using -r chromoname:start-stop

* You can also estimate fst and pbs with realSFS, see [[Fst PCA]].

* You can also supply -sites directly to the realSFS subprogram for choosing a subset of sites. This could be useful if you are interested in the spectra for different functional categories.

* You can also merge saf files. Then you can run angsd on the separate chromosomes and merge afterwards.

See also [[SFS Estimation]] and [[2d SFS Estimation]].
=Brief overview=
=Options=
<pre>
-> ---./realSFS------
-> EXAMPLES FOR ESTIMATING THE (MULTI) SFS:

-> Estimate the SFS for entire genome??
-> ./realSFS afile.saf.idx

-> 1) Estimate the SFS for entire chromosome 22 ??
-> ./realSFS afile.saf.idx -r chr22

-> 2) Estimate the 2d-SFS for entire chromosome 22 ??
-> ./realSFS afile1.saf.idx afile2.saf.idx -r chr22

-> 3) Estimate the SFS for the first 500megabases (this will span multiple chromosomes) ??
-> ./realSFS afile.saf.idx -nSites 500000000

-> 4) Estimate the SFS around a gene ??
-> ./realSFS afile.saf.idx -r chr2:135000000-140000000

-> Other options [-P nthreads -tole tolerence_for_breaking_EM -maxIter max_nr_iterations -bootstrap number_of_replications]

-> See realSFS print for possible print options
-> Use realSFS print_header for printing the header

->------------------
-> NB: Output is now counts of sites instead of log probs!!
-> NB: You can print data with ./realSFS print afile.saf.idx !!
-> NB: Higher order SFS's can be estimated by simply supplying multiple .saf.idx files!!
-> NB: Program uses accelerated EM, to use standard EM supply -m 0
</pre>
==Estimating SFS==
<pre>
#1d sfs
./realSFS afile.saf.idx [-sfs FNAME -P nThreads -tole tole -maxIter -nSites ]
#2dsfs
./realSFS pop1.saf.idx pop2.saf.idx [-sfs FNAME -P nThreads -tole tole -maxIter -nSites]
#3dsfs
./realSFS pop1.saf.idx pop2.saf.idx pop3.saf.idx [-sfs FNAME -P nThreads -tole tole -maxIter -nSites]
</pre>
The saf files are generated using [[SFS estimation| ./angsd -doSaf]].
;-start is a file containing the expected values of the SFS that can be used as the start point for the EM optimisation.
;-tole When the difference in successive likelihood values in the EM algorithm gets below this value the optimisation will stop
;-P number of threads to allocate to program
;-nSites Limit the optimisation to a region of this size. If nothing is supplied the program will use the entire saf file
;-maxIter maximum number of iterations in the EM algorithm
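
To make the roles of -tole, -maxIter and -start concrete, here is a minimal Python sketch of a plain (non-accelerated) EM loop for a 1d SFS. It assumes the per-site likelihoods are already available as plain numbers in a list of lists. realSFS itself reads them from the .saf files and uses an accelerated EM by default, so this only illustrates the stopping rules and is not the program's implementation. The function name em_sfs and its arguments are made up for this example.

```python
import math


def em_sfs(site_likes, tole=1e-8, max_iter=100, start=None):
    """Plain EM for a 1d SFS (illustrative sketch, not the realSFS code).

    site_likes: list of per-site lists; site_likes[s][k] is the likelihood
                of the data at site s given category k (assumed to be plain
                likelihoods, not logs, with a positive weighted sum).
    start:      optional starting SFS (plays the role of -start).
    Returns the expected number of sites in each category.
    """
    n_sites = len(site_likes)
    n_cat = len(site_likes[0])
    sfs = list(start) if start is not None else [1.0] * n_cat
    total = sum(sfs)
    sfs = [x / total for x in sfs]  # work with proportions internally

    old_llh = -math.inf
    for _ in range(max_iter):  # corresponds to -maxIter
        posterior_sum = [0.0] * n_cat
        llh = 0.0
        for likes in site_likes:
            weighted = [l * p for l, p in zip(likes, sfs)]
            norm = sum(weighted)
            llh += math.log(norm)
            for k, w in enumerate(weighted):
                posterior_sum[k] += w / norm  # E-step: posterior per site
        sfs = [x / n_sites for x in posterior_sum]  # M-step: average
        if abs(llh - old_llh) < tole:  # corresponds to -tole
            break
        old_llh = llh
    return [p * n_sites for p in sfs]


# Toy input: four sites, each pointing unambiguously at one category.
toy = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(em_sfs(toy))
```

The loop stops either when the log-likelihood changes by less than the tolerance between two iterations or when the iteration limit is reached, whichever comes first.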

You can also specify a region to use for estimating the SFS
<pre>
./realSFS pop1.saf.idx pop2.saf.idx -r chr22:10000000-20000000
</pre>

==Printing the SAF==
<pre>
#single population
./realSFS print pop1.saf.idx
#two populations
./realSFS print pop1.saf.idx pop2.saf.idx
#two populations chr2 from 100mb to 110mb
./realSFS print pop1.saf.idx pop2.saf.idx -r chr2:100000000-110000000
</pre>

And you can convert to the old <0.800 format using -oldout 1.

=Estimating 1d SFS=
<pre>
realSFS sfstest.saf.idx -P 4 >sfs.em
</pre>

The '''realSFS''' program will read in a block of the genome (from the .saf file), and for this region it will estimate the SFS.

The size of the block can be chosen using the -nSites argument; otherwise it will try to read in the entire saf file.

If your .saf file contains more sites than -nSites (you can check the number of sites in the .saf.pos file), then the program will loop over the genome and output the results for each block.
So each line in your output file is an SFS for a region.
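
When the output holds one SFS per block, each line can be read back separately. Below is a minimal Python sketch, assuming each line contains whitespace-separated expected counts, as in the sfs.em file written by the command in this section. The function name read_sfs_blocks and the numbers in the example are made up.

```python
def read_sfs_blocks(text):
    """Parse realSFS-style output with one whitespace-separated SFS per line.

    Returns one dict per block with the raw expected counts, their sum, and
    the spectrum as proportions (None if the counts sum to zero).
    """
    blocks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        counts = [float(x) for x in fields]
        total = sum(counts)
        proportions = [c / total for c in counts] if total > 0 else None
        blocks.append({"counts": counts, "n_sites": total,
                       "proportions": proportions})
    return blocks


example = "2 1 1\n6 3 3\n"  # made-up numbers: two blocks, three categories
for i, block in enumerate(read_sfs_blocks(example)):
    print(i, block["n_sites"], block["proportions"])
```

Since the values are expected counts of sites, the sum of a line should be close to the number of sites in that block, which is a quick sanity check on the block size.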

=Estimating 2dsfs=
<pre>
./realSFS pop1.saf.idx pop2.saf.idx [-start FNAME -P nThreads -tole tole -maxIter -nSites ]
</pre>

=Output=
The main results are printed to stdout. These are the expected values.
For 2dsfs the result is a single line. Assuming we have n categories in population1 and m categories in population2, the first m values will be the SFS for the first category in population1, the next m values for the second category, and so on.
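
The flat 2dsfs line can be turned back into an n by m table by cutting it into consecutive chunks of m values. A minimal Python sketch of that layout follows. The function name reshape_2dsfs and the numbers are made up, and n and m have to be supplied by the user.

```python
def reshape_2dsfs(line, n, m):
    """Reshape a flat 2dsfs line into n rows (population1) of m values
    (population2), following the row-by-row layout of the output."""
    values = [float(x) for x in line.split()]
    if len(values) != n * m:
        raise ValueError("expected %d values, got %d" % (n * m, len(values)))
    return [values[i * m:(i + 1) * m] for i in range(n)]


flat = "1 2 3 4 5 6"  # made-up example with n=2 and m=3
matrix = reshape_2dsfs(flat, n=2, m=3)
pop1_marginal = [sum(row) for row in matrix]        # sum over population2
pop2_marginal = [sum(col) for col in zip(*matrix)]  # sum over population1
print(matrix)
print(pop1_marginal, pop2_marginal)
```

Summing over rows or columns gives the marginal spectrum of each population, which is a convenient check that n and m were not swapped.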

=NB=
Use as many sites as possible, for more reliable estimates.
=-nSites=
The -nSites option sets the maximum number of sites that should be used for the optimization. Using more sites will give you more reliable estimates. If you don't specify anything it will try to load all sites into memory.

=Using NGStools/NGSpopgen=
The software from Matteo Fumagalli [https://github.com/mfumagalli] expects the old format saf files, and these can be generated using realSFS

<pre>
#example using three pops
realSFS print pop1.saf.idx pop2.saf.idx pop3.saf.idx -oldout 1
#example using two pops
realSFS print pop1.saf.idx pop2.saf.idx -oldout 1
</pre>

This will then generate a single '''shared.pos.gz''' file, and a '''.saf''' file for each saf file. The output will only be generated for the sites that exist in all populations.

=Merge SAF files=
<pre>
./realSFS cat
-> This will cat together .saf files from angsd
-> regions has to be disjoint between saf files. This WONT be checked (alot) !
-> This has only been tested on safs for different chrs !
-> outnames: '(null)' number of safe:0
</pre>

Vcf

2022-05-23T10:32:26Z

Thorfinn:

Newer versions of angsd (master, 27 April 2015) support basic vcf output. This will only include the GL and GP tags, which can be useful for certain external programs.

Supply
;-doVcf 1

Which is simply a wrapper around -gl -dopost -domajorminor -domaf.

==Example==
A full example commandline is given below:

<pre>
./angsd -b list.list -dovcf 1 -gl 1 -dopost 1 -domajorminor 1 -domaf 1 -snp_pval 1e-6
</pre>

==Output files==
This will generate a vcf file called angsdput.vcf.gz
<div class="toccolours mw-collapsible mw-collapsed">
angsdput.vcf.gz
<pre class="mw-collapsible-content">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="scaled Genotype Likelihoods (these are really llh eventhough they sum to one)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8 ind9 ind10 ind11 ind12 ind13 ind14 ind15 ind16 ind17 ind18 ind19 ind20 ind21 ind22 ind23 ind24 ind25 ind26 ind27 ind28 ind29 ind30 ind31 ind32
1 14000202 . G A . PASS . GL:GP 0.013409,0.986591,0.000001:0.009959,0.990038,0.000003 0.532745,0.394297,0.072957:0.333333,0.333333,0.333333 0.532745,0.394297,0.072957:0.333333,0.333333,0.333333 0.729804,0.270070,0.000126:0.666110,0.333052,0.000839 0.532745,0.394297,0.072957:0.333333,0.333333,0.333333 0.013409,0.986589,0.000003:0.009959,0.990026,0.000015 0.843814,0.156129,0.000057:0.799685,0.199918,0.000397 0.003405,0.996582,0.000013:0.002523,0.997410,0.000068 0.915318,0.084679,0.000003:0.888870,0.111106,0.000025 0.843862,0.156138,0.000000:0.800003,0.199997,0.000000 0.001243,0.728985,0.269772:0.000420,0.333191,0.666388 0.955789,0.044211,0.000000:0.941178,0.058822,0.000000 0.843860,0.156137,0.000003:0.799987,0.199993,0.000020 0.000001,0.999999,0.000000:0.000001,0.999999,0.000000 0.824856,0.152621,0.022523:0.689947,0.172484,0.137569 0.701963,0.259767,0.038270:0.526842,0.263419,0.209740 0.532745,0.394297,0.072957:0.333333,0.333333,0.333333 0.004272,0.995611,0.000117:0.003164,0.996204,0.000632 0.532745,0.394297,0.072957:0.333333,0.333333,0.333333 0.021392,0.791754,0.186854:0.008712,0.435643,0.555645 0.843681,0.156104,0.000215:0.798816,0.199700,0.001484 0.729804,0.270070,0.000126:0.666110,0.333052,0.000839 0.002698,0.996374,0.000928:0.001990,0.993012,0.004997 0.977395,0.022605,0.000000:0.969698,0.030302,0.000000 0.532745,0.394297,0.072957:0.333333,0.333333,0.333333 0.002152,0.997847,0.000002:0.001593,0.998397,0.000010 0.915213,0.084669,0.000118:0.888149,0.111016,0.000835 0.032903,0.965645,0.001452:0.024405,0.967730,0.007865 0.000006,0.999994,0.000000:0.000005,0.999995,0.000000 0.701963,0.259767,0.038270:0.526842,0.263419,0.209740 0.843846,0.156135,0.000019:0.799896,0.199971,0.000133 0.907127,0.083921,0.008952:0.835380,0.104420,0.060201 0.332066,0.644735,0.023200:0.241926,0.634652,0.123421
1 14000873 . G A . PASS . GL:GP 0.000000,0.124151,0.875849:0.000000,0.030302,0.969698 0.692531,0.305335,0.002134:0.659698,0.329846,0.010456 0.900720,0.099279,0.000000:0.888891,0.111109,0.000000 0.993158,0.006842,0.000000:0.992249,0.007751,0.000000 0.999140,0.000860,0.000000:0.999024,0.000976,0.000000 0.000091,0.994381,0.005528:0.000078,0.975325,0.024596 0.900720,0.099279,0.000000:0.888891,0.111109,0.000000 0.819369,0.180627,0.000004:0.799988,0.199993,0.000018 0.000000,0.124151,0.875849:0.000000,0.030302,0.969698 0.693938,0.305955,0.000107:0.666316,0.333155,0.000529 0.986410,0.013590,0.000000:0.984616,0.015384,0.000000 0.693919,0.305947,0.000135:0.666224,0.333109,0.000666 0.007101,0.988541,0.004358:0.006172,0.974344,0.019484 0.000000,1.000000,0.000000:0.000000,1.000000,0.000000 0.030451,0.672875,0.296674:0.013127,0.328956,0.657917 0.973184,0.026816,0.000000:0.969698,0.030302,0.000000 0.973184,0.026816,0.000000:0.969698,0.030302,0.000000 0.993158,0.006842,0.000000:0.992249,0.007751,0.000000 0.004535,0.995465,0.000000:0.004001,0.995998,0.000000 0.000395,0.693734,0.305871:0.000167,0.333276,0.666557 0.900720,0.099279,0.000000:0.888891,0.111109,0.000000 0.693965,0.305967,0.000068:0.666446,0.333220,0.000334 0.947768,0.052232,0.000000:0.941178,0.058822,0.000000 0.000000,0.220881,0.779119:0.000000,0.058822,0.941178 0.900720,0.099279,0.000000:0.888891,0.111109,0.000000 0.000000,0.017410,0.982590:0.000000,0.003891,0.996109 0.000046,0.999954,0.000000:0.000040,0.999960,0.000000 0.000001,0.999999,0.000000:0.000001,0.999999,0.000000 0.947768,0.052232,0.000000:0.941178,0.058822,0.000000 0.310721,0.689279,0.000000:0.284441,0.715559,0.000000 0.900720,0.099279,0.000000:0.888891,0.111109,0.000000 0.973184,0.026816,0.000000:0.969698,0.030302,0.000000 0.000992,0.693320,0.305688:0.000420,0.333191,0.666388
1 14001018 . T C . PASS . GL:GP 0.000000,0.069163,0.930837:0.000000,0.015384,0.984616 0.826258,0.173742,0.000000:0.800002,0.199997,0.000002 0.159620,0.840380,0.000000:0.137752,0.862248,0.000000 0.950058,0.049942,0.000000:0.941178,0.058822,0.000000 0.904865,0.095135,0.000000:0.888891,0.111109,0.000000 0.826258,0.173742,0.000000:0.800003,0.199997,0.000000 0.495544,0.416810,0.087646:0.333333,0.333333,0.333333 0.993472,0.006528,0.000000:0.992249,0.007751,0.000000 0.001055,0.703204,0.295741:0.000420,0.333191,0.666388 0.703933,0.296042,0.000025:0.666580,0.333287,0.000133 0.826258,0.173742,0.000000:0.800003,0.199997,0.000000 0.703933,0.296042,0.000025:0.666580,0.333287,0.000133 0.000420,0.703651,0.295929:0.000167,0.333276,0.666557 0.000008,0.999992,0.000000:0.000006,0.999994,0.000000 0.085860,0.913986,0.000153:0.073175,0.926086,0.000739 0.904865,0.095135,0.000000:0.888891,0.111109,0.000000 0.495544,0.416810,0.087646:0.333333,0.333333,0.333333 0.993472,0.006528,0.000000:0.992249,0.007751,0.000000 0.703731,0.295957,0.000313:0.665554,0.332774,0.001672 0.000007,0.543141,0.456852:0.000002,0.199997,0.800001 0.904865,0.095135,0.000000:0.888891,0.111109,0.000000 0.904865,0.095135,0.000000:0.888891,0.111109,0.000000 0.904865,0.095135,0.000000:0.888891,0.111109,0.000000 0.000007,0.543141,0.456852:0.000002,0.199997,0.800001 0.904865,0.095135,0.000000:0.888892,0.111108,0.000000 0.000000,0.229117,0.770883:0.000000,0.058822,0.941178 0.045152,0.954211,0.000637:0.038160,0.958794,0.003045 0.000599,0.999394,0.000006:0.000504,0.999465,0.000031 0.974389,0.025611,0.000000:0.969698,0.030302,0.000000 0.000005,0.543142,0.456853:0.000002,0.199997,0.800002 0.974389,0.025611,0.000000:0.969698,0.030302,0.000000 0.703916,0.296035,0.000050:0.666492,0.333243,0.000265 0.495544,0.416810,0.087646:0.333333,0.333333,0.333333
1 14001867 . A G . PASS . GL:GP 0.000407,0.698622,0.300971:0.000167,0.333276,0.666557 0.902773,0.097227,0.000000:0.888891,0.111109,0.000000 0.000645,0.698456,0.300899:0.000265,0.333243,0.666492 0.902773,0.097227,0.000000:0.888891,0.111109,0.000000 0.902773,0.097227,0.000000:0.888891,0.111109,0.000000 0.000006,0.998276,0.001719:0.000005,0.992066,0.007929 0.986717,0.013283,0.000000:0.984616,0.015384,0.000000 0.822776,0.177224,0.000000:0.800002,0.199997,0.000001 0.000003,0.537165,0.462832:0.000001,0.199997,0.800002 0.948903,0.051097,0.000000:0.941178,0.058822,0.000000 0.948903,0.051097,0.000000:0.941178,0.058822,0.000000 0.986717,0.013283,0.000000:0.984616,0.015384,0.000000 0.000000,0.537166,0.462833:0.000000,0.199997,0.800003 0.000060,0.999940,0.000000:0.000051,0.999949,0.000000 0.000079,0.999921,0.000000:0.000068,0.999932,0.000000 0.902773,0.097227,0.000000:0.888891,0.111109,0.000000 0.902773,0.097227,0.000000:0.888891,0.111109,0.000000 0.999159,0.000841,0.000000:0.999024,0.000976,0.000000 0.004641,0.995359,0.000000:0.004001,0.995999,0.000000 0.000645,0.698456,0.300899:0.000265,0.333243,0.666492 0.948903,0.051097,0.000000:0.941178,0.058822,0.000000 0.822774,0.177223,0.000002:0.799994,0.199995,0.000011 0.996646,0.003354,0.000000:0.996109,0.003891,0.000000 0.000000,0.008985,0.991015:0.000000,0.001949,0.998051 0.822776,0.177224,0.000000:0.800002,0.199997,0.000001 0.000000,0.367208,0.632792:0.000000,0.111109,0.888891 0.000000,0.999135,0.000865:0.000000,0.995998,0.004001 0.005821,0.994177,0.000002:0.005019,0.994970,0.000011 0.993314,0.006686,0.000000:0.992249,0.007751,0.000000 0.000000,0.999998,0.000002:0.000000,0.999992,0.000008 0.948903,0.051097,0.000000:0.941178,0.058822,0.000000 0.822776,0.177224,0.000001:0.800000,0.199997,0.000003 0.007316,0.992667,0.000017:0.006309,0.993611,0.000079
[capped]
</pre>
</div>
Notice that the sample ID is simply ind followed by an integer. These relate to the samples from the -b filelist.
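
The per-sample GL:GP fields can be split with a few lines of code. Below is a minimal Python sketch, assuming the record layout shown in the example output (nine fixed columns followed by one GL:GP entry per sample). The function names are made up, and the example record is a shortened two-sample version of the first record above. For a real file you would first decompress it, for instance with Python's gzip module.

```python
def parse_angsd_vcf_line(line):
    """Split one VCF record into site info and per-sample GL/GP lists."""
    fields = line.split()
    keys = fields[8].split(":")  # FORMAT column, e.g. ['GL', 'GP']
    samples = []
    for entry in fields[9:]:
        values = {}
        for key, block in zip(keys, entry.split(":")):
            values[key] = [float(x) for x in block.split(",")]
        samples.append(values)
    return {"chrom": fields[0], "pos": int(fields[1]),
            "ref": fields[3], "alt": fields[4], "samples": samples}


def most_probable_genotype(gp):
    """Index (0, 1 or 2) of the largest posterior, in the order written."""
    return max(range(len(gp)), key=lambda i: gp[i])


record = ("1 14000202 . G A . PASS . GL:GP "
          "0.013409,0.986591,0.000001:0.009959,0.990038,0.000003 "
          "0.729804,0.270070,0.000126:0.666110,0.333052,0.000839")
site = parse_angsd_vcf_line(record)
for i, sample in enumerate(site["samples"]):
    print("ind%d" % i, most_probable_genotype(sample["GP"]))
```

Picking the largest posterior is shown only to illustrate how the fields are accessed. Samples with a flat GP such as 0.333333,0.333333,0.333333 carry no information, and the sketch does not treat them specially.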
==References==

DistAngsd

2022-03-23T07:41:43Z

Thorfinn: Created page with "There are methods for inferring the genetic distance based on genotype likelihoods using proper models within molecular evolution. distAngsd has its very own website https://..."

There are methods for inferring the genetic distance based on genotype likelihoods using proper models within molecular evolution.

distAngsd has its very own website https://github.com/lz398/distAngsd

Mismatch

2022-03-02T21:28:15Z

Thorfinn:

ANGSD can output a matrix of mismatch counts relative to the given reference, broken down by:

;- Distance from the beginning of the read (posi)
;- Distance from the end of the read (isop)
;- Strand
;- Qscore

Run like:

<pre>
./angsd -i my.cram -ref hg19.fa -doMisMatch 1
</pre>

Output looks like:

<pre>
posi isop qs strand Ref A C G T
0 24 13 0 1 0 2 0 0
0 24 13 0 3 0 0 0 1
0 24 13 1 0 1 0 0 0
0 24 13 1 3 0 0 0 1
0 24 14 0 0 152 1 4 0
0 24 14 0 1 0 128 0 0
0 24 14 0 2 11 2 196 1
0 24 14 0 3 1 0 2 12
0 24 14 1 0 20 1 2 2
0 24 14 1 1 1 200 1 16
0 24 14 1 2 0 5 118 6
0 24 14 1 3 0 6 3 164
0 24 15 0 0 186 1 2 0
0 24 15 0 1 4 91 0 0
0 24 15 0 2 8 0 90 0
0 24 15 0 3 0 1 2 7
0 24 15 1 0 14 1 0 0
0 24 15 1 1 0 93 1 2
0 24 15 1 2 0 2 79 2
</pre>
Columns 1 and 2 (posi and isop) are the distances from the beginning and the end of the read.

qs is the quality score on the numeric phred scale.

strand is the sequencing strand, coded as 0 or 1.

Ref is the reference base, coded as A,C,G,T = 0,1,2,3.

The last 4 columns are the counts of each observed base (A, C, G, T) across all reads where the distance from the beginning is posi, the distance from the end is isop, the quality score is qs, and the strand is as given.
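
The table can be collapsed into an observed mismatch rate per quality score. Below is a minimal Python sketch, assuming the nine-column layout shown above and counting every base that differs from Ref as a mismatch. The function name mismatch_rate_by_qs is made up, and the rows in the example are copied from the output above.

```python
def mismatch_rate_by_qs(table_text):
    """Return {qs: mismatches / total bases} from -doMisMatch style output."""
    totals = {}  # qs -> [mismatches, total bases]
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) != 9 or fields[0] == "posi":
            continue  # skip the header and malformed lines
        qs = int(fields[2])
        ref = int(fields[4])  # 0,1,2,3 = A,C,G,T
        if ref not in (0, 1, 2, 3):
            continue  # this sketch ignores any other reference code
        counts = [int(x) for x in fields[5:9]]
        acc = totals.setdefault(qs, [0, 0])
        acc[0] += sum(counts) - counts[ref]  # bases differing from Ref
        acc[1] += sum(counts)
    return {qs: m / t for qs, (m, t) in totals.items() if t > 0}


table = """posi isop qs strand Ref A C G T
0 24 14 0 0 152 1 4 0
0 24 14 0 1 0 128 0 0
0 24 15 0 0 186 1 2 0
"""
for qs, rate in sorted(mismatch_rate_by_qs(table).items()):
    print(qs, round(rate, 4))
```

The same loop can be keyed on posi or isop instead of qs to look at mismatches along the read.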

Genotype calling

2022-03-02T21:14:43Z

Thorfinn: /* Options */

We really don't recommend doing analysis based on called genotypes; instead, incorporate the uncertainty directly into the analysis you want to perform. But we recognise that many methods still rely on called genotypes, and have therefore implemented a basic genotype caller in angsd.

Genotype calling in ANGSD is based on calculating the posterior probability of the genotypes. The '''-doGeno''' is therefore a simple wrapper around the '''-doPost''' along with some extra filtering options. See [[Allele Frequencies]] for more information.

=Brief Overview=
<pre>
./angsd -dogeno -> Wed Mar 2 12:39:19 2016
-----------------
abcCallGenotypes.cpp:

-doGeno 0
1: write major and minor
2: write the called genotype encoded as -1,0,1,2, -1=not called
4: write the called genotype directly: eg AA,AC etc
8: write the posterior probability of all possible genotypes
16: write the posterior probability of called genotype
32: write the posterior probabilities of the 3 gentypes as binary
-> A combination of the above can be choosen by summing the values, EG write 0,1,2 types with majorminor as -doGeno 3
-postCutoff=0.333333 (Only genotype to missing if below this threshold)
-geno_minDepth=-1 (-1 indicates no cutof)
-geno_maxDepth=-1 (-1 indicates no cutof)
-geno_minMM=-1.000000 (minimum fraction af major-minor bases)
-minInd=0 (only keep sites if you call genotypes from this number of individuals)

NB When writing the posterior the -postCutoff is not used
NB geno_minDepth requires -doCounts
NB geno_maxDepth requires -doCounts

</pre>

ANGSD can also use the full information of the sample allele frequencies when calling genotypes; see [[SFS Estimation]].
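
The help text above says that -doGeno values are combined by summing them. The following sketch decodes such a sum back into its parts. The table and the function name <code>decode_dogeno</code> are made up for this illustration and are not part of ANGSD.

```python
# Flag values and descriptions copied from the -doGeno help text
DOGENO_FLAGS = {
    1: "major and minor",
    2: "called genotype encoded as -1,0,1,2",
    4: "called genotype written directly (AA, AC, ...)",
    8: "posterior probability of all possible genotypes",
    16: "posterior probability of the called genotype",
    32: "posterior probabilities of the 3 genotypes as binary",
}


def decode_dogeno(value):
    """List the outputs requested by a summed -doGeno value."""
    if value < 0 or value >= 64:
        raise ValueError("expected a sum of the flags 1,2,4,8,16,32")
    return [text for bit, text in sorted(DOGENO_FLAGS.items()) if value & bit]


for item in decode_dogeno(5):  # 5 = 1 + 4
    print(item)
```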
==Options==
;-doGeno [int]
1: print out major minor

2: print the called genotype as -1,0,1,2 (count of minor)

4: print the called genotype as AA, AC, AG, ...

8: print all 3 posts (major,major),(major,minor),(minor,minor)

16: print the posterior of the called genotype

32: unlike the options above, this dumps the posterior probabilities for all samples in binary, encoded as 3*nind doubles

Use the sum of the above to give the output you want. For example, -doGeno 5 (1+4) prints the major and minor allele followed by the genotype (AA, AC ...) for each individual.

; -doPost [int]
1: estimate the posterior genotype probability based on the allele frequency as a prior

2: estimate the posterior genotype probability assuming a uniform prior

; -geno_minDepth [int]
set genotypes to missing if the individual depth is less than [int]

; -geno_maxDepth [int]
set genotypes to missing if the individual depth is larger than [int]

; -geno_minMM [float]
set genotypes to missing if less than [float] of the bases are the major or minor (likely a triallelic site). E.g. 0.1 means that the genotype is set to missing if less than 10% of the reads in this individual are either the major or the minor allele.

; -postCutoff [float]
Call only a genotype with a posterior above this threshold.

NB if the raw posterior dump is requested the -postCutoff is not used
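
How a prior and -postCutoff interact can be sketched in Python. This is an illustration only, not ANGSD's code. It assumes that the frequency-based prior follows Hardy-Weinberg proportions and that a call is kept when its posterior is at least the cutoff; the function names are hypothetical.

```python
def genotype_posterior(likes, freq=None):
    """Posterior of (major,major), (major,minor), (minor,minor).

    likes: genotype likelihoods on the natural (not log) scale.
    freq:  minor allele frequency used as prior (as in -doPost 1);
           None means a uniform prior (as in -doPost 2).
    """
    if freq is None:
        prior = (1 / 3, 1 / 3, 1 / 3)
    else:
        # ASSUMPTION: Hardy-Weinberg proportions for the prior
        prior = ((1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2)
    weighted = [l * p for l, p in zip(likes, prior)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("likelihoods and prior give zero total probability")
    return [w / total for w in weighted]


def call_genotype(post, post_cutoff=1 / 3):
    """Return the count of minor alleles (0,1,2) or -1 for 'not called'."""
    best = max(range(3), key=lambda g: post[g])
    return best if post[best] >= post_cutoff else -1


likes = (0.2, 0.7, 0.1)
print(call_genotype(genotype_posterior(likes), 0.5))        # uniform prior
print(call_genotype(genotype_posterior(likes, 0.05), 0.5))  # frequency prior
print(call_genotype(genotype_posterior(likes, 0.05), 0.95)) # strict cutoff
```

With a low minor allele frequency the prior pulls the call towards the homozygous major genotype, and a strict cutoff turns an uncertain call into missing.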

==Examples==
===Allele frequency as prior===
<pre>
./angsd -bam bam.filelist -GL 1 -out outfile -doMaf 2 -doMajorMinor 1 -SNP_pval 0.000001 -doGeno 5 -doPost 1 -postCutoff 0.95
</pre>

gives output like this:

<pre>
1 14000202 G A GG NN NN GA NN
1 14000873 G A GG GG GG AA GA
1 14001018 T C NN NN NN CC NN
1 14001867 A G NN AA AA NN NN
1 14002342 C T CC CC CC CC CC
1 14002422 A T AA NN NN NN NN
1 14002474 T C TC TT TT TT TT
1 14003581 C T CC CC NN NN CT
1 14004623 T C TT TT TT NN TC
1 14005069 A G AA AA AA AA AA
</pre>
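
Output of this form is easy to post-process. As a sketch, the snippet below computes the fraction of sites set to missing (NN) per individual. It assumes the column order shown above (chromosome, position, major, minor, then one genotype per individual); the function name <code>missingness</code> is hypothetical.

```python
def missingness(lines):
    """Fraction of sites with genotype 'NN' for each individual."""
    missing = None
    nsites = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        genos = fields[4:]  # columns after chr, pos, major, minor
        if missing is None:
            missing = [0] * len(genos)
        if len(genos) != len(missing):
            raise ValueError("inconsistent number of individuals")
        for i, g in enumerate(genos):
            if g == "NN":
                missing[i] += 1
        nsites += 1
    return [m / nsites for m in missing] if nsites else []


# First three lines of the example output above
example = [
    "1 14000202 G A GG NN NN GA NN",
    "1 14000873 G A GG GG GG AA GA",
    "1 14001018 T C NN NN NN CC NN",
]
print(missingness(example))
```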
===Sample allele frequency with SFS as prior===
1. First get an estimate of the site frequency spectrum
<pre>
./angsd -dosaf 1 -anc ../hg19ancNoChr.fa.gz -gl 1 -b list
./realSFS angsdput.saf.idx >angsdput.saf.idx.ml
</pre>
2. Now calculate the diallelic genotype posterior probabilities with
<pre>
./angsd -dopost 3 -b list -gl 1 -domajorminor 1 -domaf 1 -pest angsdput.saf.idx.ml -dogeno 2 -r 1 -out angsdput2
</pre>

Installation

2022-02-08T11:47:03Z

Thorfinn:

There has been some confusion about the versions of ANGSD.

* Even versions are freezes from the last odd git version.

* Odd versions are git versions. Once there have been enough commits we will increment the version and make a release.

=Download and Installation=
To download and use ANGSD you need both the htslib and the angsd source folders.

You can either download angsd0.936.tar.gz, which contains both:
[http://popgen.dk/software/download/angsd/angsd0.936.tar.gz]

Or you can use github for the latest version of both htslib and angsd

Earlier versions from here: http://popgen.dk/software/download/angsd/
And here: https://github.com/ANGSD/angsd/releases

=Install=
Download and unpack the tarball, enter the directory and type make. Users on a Mac can use curl instead of wget.

===Unix===
The software can be compiled using make.
<pre>
wget http://popgen.dk/software/download/angsd/angsd0.936.tar.gz
tar xf angsd0.936.tar.gz
cd htslib;make;cd ..
cd angsd
make HTSSRC=../htslib
cd ..
</pre>
The executable is then located in '''angsd/angsd'''.

=Install from github=
To get CRAM support you also need to install htslib, which can be done using the following commands:

<pre>
git clone --recursive https://github.com/samtools/htslib.git
git clone https://github.com/ANGSD/angsd.git
cd htslib;make;cd ../angsd ;make HTSSRC=../htslib
</pre>

=Systemwide installation of htslib?=
If htslib is already installed systemwide, you just type make in the angsd directory.

Change log

2022-02-08T11:24:35Z

Thorfinn:

=Latest=
Odd versions are github versions...
*0.935 https://github.com/ANGSD/angsd/compare/0.935...0.937
*0.933 https://github.com/ANGSD/angsd/compare/0.933...0.935
*0.931 https://github.com/ANGSD/angsd/compare/0.931...0.933
*0.929 https://github.com/ANGSD/angsd/compare/0.929...0.931
*0.927 https://github.com/ANGSD/angsd/compare/0.925...0.927
*0.925 https://github.com/ANGSD/angsd/compare/0.923...0.925
*0.923 https://github.com/ANGSD/angsd/compare/0.921...0.923
*0.921 https://github.com/ANGSD/angsd/compare/0.918...0.921
*0.918 https://github.com/ANGSD/angsd/compare/0.916...0.918
*0.916 https://github.com/ANGSD/angsd/compare/0.914...0.916
*0.914 https://github.com/ANGSD/angsd/compare/0.912...0.914
*0.912 https://github.com/ANGSD/angsd/compare/0.911...0.912
*0.911 https://github.com/ANGSD/angsd/compare/0.910...0.911
*0.910 https://github.com/ANGSD/angsd/compare/0.901...0.910
*0.901 https://github.com/ANGSD/angsd/compare/0.900...0.901
*0.900 https://github.com/ANGSD/angsd/compare/0.800...0.900
*0.800 https://github.com/ANGSD/angsd/compare/0.700...0.800
*0.700 https://github.com/ANGSD/angsd/compare/0.615...0.70

=0.6***=
*0.614 https://github.com/ANGSD/angsd/compare/0.614...0.615
*0.613 https://github.com/ANGSD/angsd/compare/0.613...0.614
*0.613 https://github.com/ANGSD/angsd/compare/0.612a...0.613
*0.612 https://github.com/ANGSD/angsd/compare/0.610...0.612a
*0.610 https://github.com/ANGSD/angsd/compare/0.609...0.610a
*0.609 https://github.com/ANGSD/angsd/compare/0.608...0.609
*0.608 not super useful, but we now compile with knetfile, so users can use remote .fa files. I really recommend that users download the fasta instead.
*0.607 changed name of all abstract base classes to the more reasonable abc*.cpp. Included contamination and the iCounts format. Added a templated class so users can see how to access the internal datastructures.
*0.606 added more info in the thetaStat subprogram renamed all analysis classes to abc*.cpp.
*0.605 continued updating the Eunjung code
*0.604 added a 'job' array for the analysis classes which should greatly reduce the number of needed function calls. I doubt that will make any noticeable speed difference. Has copied and modified some code from Eunjung (from jnovembre lab) for a banded approach in the saf calculation (-doSaf 2). This is not working yet.
*0.603 1) fixed wrong branching if users used simfiles. 2) fixed 2 bugs in the inbreeeded angsd_realSFS.cpp. Very small changes
*0.602 1) programs output the actual chromosome, if there is a mismatch between the fasta and bamheader 2) added a check that the index files (fai vs fa, bai vs bam) are newer than the files they index. 3) Program now validates that the .so and include files used are the same.
*0.601 1) started the move to includeflags/discardflags, outcommented in this version 2) validates that data has been generated for the different samples by looking at difference between values, instead of comparing against -0.0. 3) MAF estimates now skips estimation for a site if the updated GL shows that GLs are noninformative.
*0.600 fixed a change in default FLAGS when using bam files. If you hadn't removed the unmapped reads from your bamfiles these would have been included in the analysis.

=0.589 to 599=
* 0.589 removed soap/sim/glf/glfclean(bin and text) and tglf.
* 0.590 added text mpileup as new input format. Very useful
* 0.591 refactored the file reading, moved the arguments to multi reader, such that all file reading is done from multi reader. FREEZE version. We will only allow bug fixes in the next many versions.
* 0.592 -pest output in angsd_realSFS.cpp. smartcounts. fixed a bug in hetplas. edited the printout msg for 'if you really want angsd to exit'. Maybe fixed an unknown bug. line 2 in .arg or screen output is now commandline used, and some other visual stuff. Program didn't complain if -doMaf but no -domajorminor. if chromosome name contained ':' it wouldn't work strchr->strrchr
* 0.593 fixed a strange bug, where the program would crash if, no analysis was chosen. fixed the 'shouldbeone' bug.
* 0.594 nochanges
* 0.595 nochanges
* 0.596 fixed a bug in persite depth counter (double count of C alleles). fixed a bug in smartcount subprogram.
* 0.597 treemix input file generation from smart counts
* 0.598 fixed a wrong compile flag in one of the utility programs 'smartCount'
* 0.599 much more informative messages if users forgot to add -fai argument. Fixed a bug in parsing of arguments with -doSaf 4.

=<0.570 and <0.588=
* 0.571 removed bf's from maf class, negative values of -domaf disabled dumping of files, inbreeding has been added
* 0.572 Some funky new approach for the makefile is now being used, minor bug fixes (minDepth -> setMinDepth, extra header column in thetas.gz has been fixed)
* 0.573 added better info if bamfiles have different headers. Added check if length of supplied reference/ancestral doesn't match bamheader. autosize in emOptim2 for 2dsfs. Fixed subtle issue if very large coverage differences between bamfiles, now the 'biggest (file size)' is used to select region instead of the first bamfile. Netaccess is now deprecated.
* 0.574 modified emOptim2 so now compiles on mac
* 0.575 smaller fixes to the inbreeding parsing, added p-value in analysisMaf instead of the raw llh.
* 0.576 -doSNP and -minLRT now deprecated, please use -SNP_pval instead
* 0.577 bugfix for -SNP_pval if the value was 1. doHWE now called -HWE_pval and can be used for filtering.
* 0.578 speedup in HWE stuff
* 0.579 fixed an extremely rare assertion error (program was working, assertion was off). Redid all strcmp to strcasecmp. Fixed a bug in -doMaf 2 with -snp_pval
* 0.580 nochanges...
* 0.581 if trimming has been enabled, N's will be plugged in instead of the bases. A number of small changes.
* 0.582 nochanges...
* 0.583 1) changed bugfix when using counts based estimator for major/minor 2) keepsites is now using the effective number of samples in all cases 3) changed output of maf to a 'nicer' format
* 0.584 updated internal testing scripts.
* 0.585 1) fixed 'baq complains even though -ref was supplied' 2) fixed -doMajorMinor 4 and doMajorMinor 5 (sites not discarded) 3) added tremendously better information for the -sites 4) added some check for -doPost 5) program can now exit uncleanly if ctrl+c is pressed 3 times. 6) added an else to catch wrong arg in thetaStat.
*0.586 1) fixed parsing of pars if input is -sim1 2) fixed a bug in -doFasta 3 (the ebd one) 3) fixed a printout problem in -doFasta
*0.587 1) fixed a number of minor instances where memory wasn't being freed/deleted (mostly for keeping valgrind silent) 2) fixed a memleak if -sites files contained 4 columns but -doMajorMinor not 3. 3) fixed a memory leak that could occur if -doPost 2 and -doMaf 0.
* 0.588 1) fixed small printout error that could cause segfaults in rare cases. 2) change stderr to pos+1 3) changed some checks of user supplied pars 4) fixed a stack overflow if a very long -rf file was supplied.

=old=
==Dirty==
* 0.16 Is now bundled with SAMtools-0.1.17 and the mpileup (and friends) commands can be used for passing data to dirty.
* 0.17 added extra options -minInd and -minMaf, for only printing and using sites above a threshold
* 0.18 added option to pass reference and ancestral allele as fasta files.(using faidx format) (doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE)
* 0.19 added support for tglf inputfiles, -tglf -posfile see runexamples, also added the likeratio test for snp calling
* 0.20 Added the check for missing data, before the major/minor. included -realSFS, changed the deallocation of the -doMAF results, such that its proper cleaned up.
* 0.21 refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that prevented -samglf and samglfclean from working)
* 0.22 Well this update was a mixture of edits from [[user:albrecht]] and BGI so it's difficult to give a concise description
* 0.23 Program can now read simulated files (single pop only). An example can be seen in "full example ... sfs" and input types.
* 0.24 added the tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso
* 0.25 the depth is now being populated when using mpileup -g. The program can now get the counts from mpileup
==ANGSD==
=<0.5=
* 0.01.a - 0.01.b The bfgs now supports threading, maybe anders implemented a heterozygosity estimator.
* 0.01.c A problem if we didn't observe any llh, caused the MAF estimator to 'nan'.
* 0.02 Fixed small bug in bfgs optimization of sfs optimization. When choosing a region bigger than what was covered by the .sfs file the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
* 0.03 added and documented genotypecaller, can dump counts,-realSFS 1 dumps positions, -realSFS 2 is deprecated,S,pi and tajima has been added to sfstools along with possibility to do prior
* 0.3 clean version with less features. The lost features will be reintroduced later.
* 0.43 first very clean version, everything should be included
* 0.441 rewrote the SOAPsnp GL model, -L and -maxQ is not needed anymore. Also added an option to choose an output dir for the recalibration matrix
* 0.4471 error estimation is now working, the fasta reading is now threadsafe. all GLs are now likeratios.
* 0.512 After 0.500 we have changed the internal structure such that each chunk is enforced to be on the same chr. version c) fixes a problem of hardclipping
* 0.515 A lot of legacy code has been removed from mUppile.cpp. Program can now use remote files, built on code from SAMtools
* 0.520 A lot more legacy code has been removed from mUpPile.cpp. Program now does baq and adjustment of mapQ similar to -C in samtools. Also compiles on osx, but this is not supported
* 0.535 Bug in internal representation of mapQ's (only problematic for mapQ>128), we now use the flag to determine if a read has mapped. calcstat is now deprecated, users should use the bgid program now.
* 0.538 changed position output in association part. Fixed incorrect assert assumption in mUpPile.cpp. Added some downsampling options for errorEst and changed internal buffering when reading beagle files to allow for >10k individuals.
* 0.549 too many changes to remember

=<0.5 and <0.570=
* 0.551 tajima paper is now published, so the emOptim2 and bgid has now been properly documented. plink output is now supported and some snp filters can be outputted.
* 0.552 minor bug when calling genotypes without defining postcutoff -> missingness couldnt occur. removed the optimSFS and emOptim from the default compilelist
* 0.553 uint removed from code.
* 0.554 plugged in sfstools functionality into main angsd, (ability to output log posts)
* 0.555 anders added some consensus stuff
* 0.556 updated the filtering (if binary rep of keep file is incomplete it is removed again. It checks timestamps to see if file has been updated), folded spectra analysis should now be working
* 0.557 There was a bug in the realsfs part of the code, that was created in the 0.556 version. 0.557 is simply a fix of this, and the removal of a warning compiler flag in the msToGlf subprogram. We only observed the problematic compiler flag on a osx machine
* 0.558 tempversion,from this version bgid is now called thetaStat
* 0.559 abbababa
* 0.560 analysisCount.cpp has been updated to the nice standard of the wiki
* 0.561 program now compiles on clang, many small compiler warnings has been fixed.
* 0.562 Merge of forked versions, abba-baba fasta.
* 0.563 minQ filter has been moved to a much earlier step. Previously it was downstream classes that checked this. Now a base will be set to 'n' if it is below the threshold
* 0.564 cleaned up funky pars maf/asso such that all results are in ->extras[]
* 0.565 moved file reading stuff from shared to analysisFunction in namespace ail::
* 0.566 cleaned up small things, added a newer version of hetplas
* 0.567 cleaned up small things again. Started to add single pars e.g. -P -b
* 0.568 refactored compile order in general.cpp
* 0.569 added saf genotype calling, changed name from -realsfs to -doSaf
* 0.570 modified emOptim2 to estimate nSites and tell how much memory it will use, fixed empty -bam file

MediaWiki:Sitenotice

2022-02-08T11:23:51Z

Thorfinn:

ANGSD: Analysis of next generation Sequencing Data

The latest tar.gz version is 0.936 (0.937 on github); see [[Change_log]] for changes, and download it [[Download and installation | here]].

Thorfinn

2022-02-08T11:23:11Z

Thorfinn:

=How to deploy=
This page describes how a new version should be put on the wiki and github.
==Make a combined angsd htslib to put on wiki download==
copy latest and make a new annotated tag and push to wiki
<pre>
VERSION=0.936
mkdir delme
cd delme
git clone --depth=1 https://github.com/ANGSD/angsd
git clone --depth=1 --recursive https://github.com/SAMtools/htslib
cd angsd; git tag -a ${VERSION} -m "time for version ${VERSION}"
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
cd ..
tar --exclude='.git' -czvf angsd${VERSION}.tar.gz angsd htslib/
scp angsd${VERSION}.tar.gz software@popgen.dk:/home/software/download/angsd/
</pre>

==Make a new github version to put on github==

<pre>
VERSION=0.937
cd angsd/
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
git commit Makefile -m "Preparing new ${VERSION} version"
git tag -a ${VERSION} -m "time for a new version ${VERSION}"
git push
git push --tags
</pre>

SFS Estimation

2021-08-18T09:10:19Z

Thorfinn: /* Brief Overview */

Latest version can now do bootstrapping. Folding should now be done in realSFS and not in the saf file generation.

=Quick Start=
The process of estimating the SFS and the multidimensional SFS has improved a lot in the newer versions.

Assuming you have a bam/cram file list in the file 'file.list' and you have your ancestral state in ancestral.fasta, then the process is:

<pre>
#no filtering
./angsd -b file.list -gl 1 -anc ancestral.fasta -dosaf 1
#or a lot of filtering
./angsd -b file.list -gl 1 -anc ancestral.fasta -dosaf 1 -baq 1 -C 50 -minMapQ 30 -minQ 20

#this will generate 3 files:
#1) angsdput.saf.idx 2) angsdput.saf.pos.gz 3) angsdput.saf.gz
#these are binary files that are formally defined in https://github.com/ANGSD/angsd/blob/newsaf/doc/formats.pdf

#To find the global SFS based on the run from above simply do
./realSFS angsdput.saf.idx
##or only use chromosome 22
./realSFS angsdput.saf.idx -r 22

## or specific regions
./realSFS angsdput.saf.idx -r 22:100000-150000000

##or limit to a fixed number of sites
./realSFS angsdput.saf.idx -r 17 -nSites 10000000

##or you can find the 2dim sfs by
./realSFS ceu.saf.idx yri.saf.idx
##NB the program will find the intersect internally. No need for multiple runs with angsd main program.

##or you can find the 3dim sfs by
./realSFS ceu.saf.idx yri.saf.idx MEX.saf.idx
</pre>

=SFS=
This method will estimate the site frequency spectrum. The method is described in [[Nielsen2012]], and the theory behind the model is briefly described [[realSFSmethod|here]].

This is a 2 step procedure: first generate a ".saf" file (site allele frequency likelihood), then optimize over the .saf file to estimate the site frequency spectrum (SFS).

For the optimization we have implemented 2 different approaches, both found in the misc folder. The diagram below shows how the method goes from raw bam files to the SFS.

You can also estimate a [[2d SFS Estimation| 2dsfs]] or even higher if you want to.
<pre>
* NB the ancestral state needs to be supplied for the full SFS, but you can use -fold 1 to estimate the folded SFS and then use the reference as ancestral.
* NB the output from -doSaf 2 is not sample allele frequency likelihoods but sample allele frequency posteriors.
Applying realSFS to this output therefore does NOT give the ML estimate of the SFS described in the Nielsen 2012 paper,
but the estimate described in the 'Incorporating deviations from Hardy-Weinberg Equilibrium (HWE)' section of that paper.

</pre>
<classdiagram type="dir:LR">
[sequence data{bg:orange}]->GL[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]
[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]->doSaf[.saf file{bg:blue}]
[.saf file{bg:blue}]->optimize('realSFS')[.saf.ml file{bg:red}]
</classdiagram>

=Brief Overview=
<pre>
./angsd -dosaf
-> angsd version: 0.935-44-g02a07fc-dirty (htslib: 1.12-1-g9672589) build(Jul 8 2021 08:04:55)
-> ./angsd -dosaf
-> Analysis helpbox/synopsis information:
-> Wed Aug 18 11:09:03 2021
-> doMcall=0
--------------
abcSaf.cpp:
-doSaf 0
1: perform multisample GL estimation
2: use an inbreeding version
3: calculate genotype probabilities (use -doPost 3 instead)
4: Assume genotype posteriors as input (still beta)
-underFlowProtect 0
-anc (null) (ancestral fasta)
-noTrans 0 (remove transitions)
-pest (null) (prior SFS)
-isHap 0 (is haploid beta!)
-doPost 0 (doPost 3,used for accesing saf based variables)
NB:
If -pest is supplied in addition to -doSaf then the output will then be posterior probability of the sample allelefrequency for each site
</pre>

<pre>
misc/realSFS
./realSFS afile.saf.idx [-start FNAME -P nThreads -tole tole -maxIter -nSites ]
</pre>
For information and parameters concerning the realSFS subprogram go here: [[realSFS]]

=Options=
;-doSaf 1: Calculate the Site allele frequency likelihood based on individual genotype likelihoods assuming HWE

;-doSaf 2:(version above 0.503) Calculate per site posterior probabilities of the site allele frequencies based on individual genotype likelihoods while taking into account individual inbreeding coefficients. This is implemented by Filipe G. Vieira. You need to supply a file containing all the inbreeding coefficients. -indF. Consider if you want to either get the MAP estimate by using all sites, or get the standardized values by conditioning on the called snpsites. See bottom of this page for examples.

;-doSaf 3: Calculate the genotype posterior probabilities for all samples for all sites, using an estimate of the sfs (sample allele frequency distribution). This needs a prior distribution of the SFS (which can be obtained from -doSaf 1/realSFS).

;-doSaf 4: Calculate the posterior probabilities of the sample allele frequency distribution for each site based on genotype probabilities. The genotype probabilities should be provided using the -beagle option. Often the genotype probabilities will be obtained by haplotype imputation.

;-underFlowProtect [INT]
0: (default) no underflow protection. 1: use underflow protection. For large data sets (large number of individuals) underflow protection is needed.

=Output file=
The output file from ''-doSaf'' is described in detail in angsd/doc/formats.pdf. These binary files can be printed with
<pre>
realSFS print myfile.saf.idx
#or
realSFS print myfile.saf.idx -r chr1:10000-20000
</pre>
==Example==
A full example is shown below where we use the test data that can be found on the [[quick start]] page. In this example we use GATK genotype likelihoods.

First generate the .saf file with 4 threads:
<pre>
./angsd -bam bam.filelist -doSaf 1 -out small -anc chimpHg19.fa -GL 2 -P 4
</pre>
We always recommend that you filter out the bad qscore bases and meaningless mapQ reads. eg '''-minMapQ 1 -minQ 20'''. So the above analysis with these filters can be written as:
<pre>
./angsd -bam bam.filelist -doSaf 1 -out small -anc chimpHg19.fa -GL 2 -P 4 -minMapQ 1 -minQ 20
</pre>
Obtain a maximum likelihood estimate of the SFS using the EM algorithm:
<pre>
misc/realSFS small.saf.idx -maxIter 100 -P 4 >small.sfs
</pre>

[[File:SfsSmall.png|thumb]]

A plot of the estimated SFS is shown on the right. The jaggedness is due to the very low number of sites in this small dataset.
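
The EM step used above can be illustrated with a toy version in Python. This is a minimal sketch of the standard EM update for a frequency spectrum given per-site likelihoods, not the realSFS implementation. The function name, the plain list input and the uniform starting point are assumptions of the sketch; <code>n_iter</code> and <code>tol</code> only play a role similar to -maxIter and -tole.

```python
def em_sfs(site_likes, n_iter=100, tol=1e-8):
    """Estimate SFS proportions from per-site likelihood vectors.

    site_likes: list of sites; each site is a list with the likelihood
                (natural scale) of every sample allele frequency category.
    """
    k = len(site_likes[0])
    sfs = [1.0 / k] * k  # ASSUMPTION: start from a uniform spectrum
    for _ in range(n_iter):
        new = [0.0] * k
        for likes in site_likes:
            weighted = [l * p for l, p in zip(likes, sfs)]
            norm = sum(weighted)
            if norm <= 0:
                raise ValueError("site has zero probability under current SFS")
            for j in range(k):
                new[j] += weighted[j] / norm  # posterior of category j
        new = [x / len(site_likes) for x in new]
        change = max(abs(a - b) for a, b in zip(new, sfs))
        sfs = new
        if change < tol:
            break
    return sfs


# Toy data: four sites, three categories
toy = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
print(em_sfs(toy))
```

The sketch returns proportions; multiplying by the number of sites gives expected counts per category.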

=Interpretation of the output file=
Each row corresponds to a region of the genome (see below) and contains the expected values of the SFS for that region.
==NB==
The .saf file contains a saf for each site, whereas the optimization requires information for a whole region of the genome. The optimization will therefore use large amounts of memory.

=Folded spectra=
If you don't have the ancestral state, you can instead estimate the folded SFS. This is done by supplying the -anc with the reference genome and applying -fold 1 to realSFS.

The above example would then be

<pre>
#first generate .saf file
./angsd -bam bam.filelist -doSaf 1 -out smallFolded -anc chimpHg19.fa -GL 2
#now try the EM optimization with 4 threads
misc/realSFS smallFolded.saf.idx -maxIter 100 -P 4 -fold 1 >smallFolded.sfs
#in R
sfs<-scan("smallFolded.sfs")
barplot(sfs[-1])
</pre>
[[File:SmallFolded.png|thumb]]
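
What folding means can be shown on an already estimated unfolded spectrum. The sketch below uses the textbook definition (category i is merged with category n-i); the function name is hypothetical and the exact layout written by realSFS with -fold 1 may differ.

```python
def fold_sfs(sfs):
    """Fold an unfolded SFS with categories 0..n into categories 0..n/2."""
    n = len(sfs) - 1  # number of chromosomes, 2*nInd for diploids
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        # the middle category (i == j) has no partner and is kept as it is
        folded.append(sfs[i] if i == j else sfs[i] + sfs[j])
    return folded


# 2 diploid individuals: 2*2+1 = 5 unfolded categories, 2+1 = 3 folded ones
print(fold_sfs([10, 4, 3, 2, 1]))
```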

=Posterior of the per-site distributions of the sample allele frequency=
If you supply a prior for the SFS (which can be obtained from the -doSaf/realSFS analysis), the output of the .saf file will no longer be site allele frequency likelihoods but will instead be the log posterior probability of the sample allele frequency for each site.
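
The relation between the likelihoods, the prior and the log posterior for a single site can be sketched as below. This is an illustration of Bayes' rule in log space, not ANGSD's code; the function name and the list based input are assumptions.

```python
import math


def log_posterior(log_likes, sfs_prior):
    """Log posterior of each sample allele frequency category for one site.

    log_likes: log likelihood of every category (as in a .saf site).
    sfs_prior: estimated SFS (counts or proportions) used as prior.
    """
    total = sum(sfs_prior)
    terms = [
        ll + math.log(p / total) if p > 0 else float("-inf")
        for ll, p in zip(log_likes, sfs_prior)
    ]
    top = max(terms)
    if top == float("-inf"):
        raise ValueError("all categories have zero prior probability")
    # log-sum-exp keeps the normalisation stable for very small likelihoods
    log_norm = top + math.log(sum(math.exp(t - top) for t in terms))
    return [t - log_norm for t in terms]


post = log_posterior([math.log(0.5), math.log(0.25), math.log(0.25)], [2, 1, 1])
print([round(math.exp(x), 4) for x in post])
```

Working in log space means that sites with extremely small likelihoods still give a posterior that sums to one.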

=Format specification of binary .saf* files=
This can be found in the angsd/doc/formats.pdf

* If the -fold 1 has been set, then the dimension is no longer 2*nInd+1 but nInd+1 (this is deprecated)
* If the -pest parameter has been supplied the output is no longer likelihoods but log posterior site allele frequencies

=Bootstrapping=
We have recently added the possibility to bootstrap the SFS, which can be very useful for getting confidence intervals of the estimated SFS.

This is done by:

<pre>
realSFS pop.saf.idx -bootstrap 100 -P number_of_cores
</pre>
The program will then get you 100 estimates of SFS, based on data that has been subsampled with replacement.
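
One way to turn the bootstrap estimates into confidence intervals is to take percentiles per SFS category. The sketch below assumes that the output has been saved to a text file with one SFS estimate per line; both function names are hypothetical.

```python
def read_sfs_rows(text):
    """Parse whitespace-separated SFS estimates, one estimate per line."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def bootstrap_intervals(rows, lower=0.025, upper=0.975):
    """Percentile interval for every SFS category across replicates."""
    def quantile(sorted_vals, q):
        # linear interpolation between the two closest ranks
        pos = q * (len(sorted_vals) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(sorted_vals) - 1)
        frac = pos - lo
        return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

    intervals = []
    for column in zip(*rows):  # one column per SFS category
        vals = sorted(column)
        intervals.append((quantile(vals, lower), quantile(vals, upper)))
    return intervals


# Made-up replicates with two categories, for illustration only
rows = read_sfs_rows("1 10\n2 20\n3 30\n4 40\n5 50\n")
print(bootstrap_intervals(rows, 0.25, 0.75))
```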

=How to plot=
Assuming that we have obtained a single global SFS (only one line in the output) from the '''realSFS''' program, and this is located in '''small.sfs''', then we can plot the results simply like:
<pre>
sfs<-(scan("small.sfs")) #read in the sfs
barplot(sfs[-c(1,length(sfs))]) #plot variable sites
</pre>
[[File:SfsSmall.png|thumb]]
We can make it more fancy like below:

<pre>
#function to normalize
norm <- function(x) x/sum(x)
#read data and normalize the expected counts to proportions
sfs <- norm(scan("small.sfs"))
#the variability as a percentage
pvar<- (1-sfs[1]-sfs[length(sfs)])*100
#the variable categories of the sfs
sfs<-norm(sfs[-c(1,length(sfs))])
barplot(sfs,legend=paste("Variability:= ",round(pvar,3),"%"),xlab="Chromosomes",
names=1:length(sfs),ylab="Proportions",main="mySFS plot",col='blue')
</pre>
[[File:SfsSmallFine.png|thumb]]

If your output from '''realSFS''' contains more than one line, it is because you have estimated multiple local SFS's. Then you can't use the above commands directly but should first pick a specific row.

<pre>
sfs<-(as.numeric(read.table("multiple.sfs")[1,])) #first region.
#do the above
sfs<-(as.numeric(read.table("multiple.sfs")[2,])) #second region.
</pre>

=Which genotype likelihood model should I choose ?=
It depends on the data. As shown in this example [[Glcomparison]], there was a huge difference between '''-GL 1''' and '''-GL 2''' for older 1000genomes BAM files, but little difference for newer bam files.
=Validation=
The validation is based on the pre 0.900 version
==-doSaf 1==
<pre>
cd misc;
./supersim -outfiles test -npop 1 -nind 12 -pvar 0.9 -nsites 50000
echo testchr1 100000 >test.fai
../angsd -fai test.fai -glf test.glf.gz -nind 12 -doSaf 1 -issim 1
./realSFS angsdput.saf 24 2>/dev/null >res
cat res
31465.429798 4938.453115 2568.586388 1661.227445 1168.891114 975.302535 794.727537 632.691896 648.223566 546.293853 487.936192 417.178505 396.200026 409.813797 308.434836 371.699254 245.585920 322.293532 282.980046 292.584975 212.845183 196.682483 221.802128 236.221205 197.914673
</pre>

==-doSaf 2==
<pre>
ngsSim=../ngsSim/ngsSim
angsd=./angsd
realSFS=./misc/realSFS

$ngsSim -npop 1 -nind 24 -nsites 1000000 -depth 4 -F 0.0 -outfiles testF0.0
$ngsSim -npop 1 -nind 24 -nsites 1000000 -depth 4 -F 0.9 -outfiles testF0.9

for i in `seq 24`;do echo 0.9;done >indF
echo testchr1 250000000 >test.fai
$angsd -fai test.fai -issim 1 -glf testF0.0.glf.gz -nind 24 -out noF -dosaf 1
$angsd -fai test.fai -issim 1 -glf testF0.9.glf.gz -nind 24 -out withF -dosaf 2 -domajorminor 1 -domaf 1 -indF indF
$angsd -fai test.fai -issim 1 -glf testF0.9.glf.gz -nind 24 -out withFsnp -dosaf 2 -domajorminor 1 -domaf 1 -indF indF -snp_pval 1e-4

$realSFS noF.saf 48 >noF.sfs
$realSFS withF.saf 48 >withF.sfs

#in R
norm <- function(x) x/sum(x) #used for the last plot below
trueNoF<-scan("testF0.0.frq")
trueWithF<-scan("testF0.9.frq")
pdf("sfsFcomparison.pdf",width=14)
par(mfrow=c(1,2))
barplot(trueNoF[-1],main='true sfs F=0.0')
barplot(trueWithF[-1],main='true sfs F=0.9')

estWithF<-scan("withF.sfs")
estNoF<-scan("noF.sfs")

barplot(rbind(trueNoF,estNoF)[,-1],main="true vs est SFS F=0 (ML) (all sites)",be=T,col=1:2)
barplot(rbind(trueWithF,estWithF)[,-1],main='true vs est sfs=0.9 (MAP) (all sites)',be=T,col=1:2)

readBjoint <- function(file=NULL,nind=10,nsites=10){
ff <- gzfile(file,"rb")
m<-matrix(readBin(ff,double(),(2*nind+1)*nsites),ncol=(2*nind+1),byrow=TRUE)
close(ff)
return(m)
}

m <- exp(readBjoint("withF.saf",nind=24,5e6))
barplot(rbind(trueWithF,colMeans(m))[,-1],main='true vs est sfs F=0.9 (colmean of site pp) (all sites)',be=T,col=1:2)
m <- exp(readBjoint("withFsnp.saf",nind=24,5e6))
m <- colMeans(m)*nrow(m)
##m contains SFS for absolute frequencies
m[1] <-1e6-sum(m[-1])
##m now contains a corrected estimate containing the zero category
barplot(rbind(trueWithF,norm(m))[,-1],main='true vs est sfs F=0.9 (colmean of site pp) (called snp sites)',be=T,col=1:2)

dev.off()

</pre>
See results from above here: http://www.popgen.dk/angsd/sfsFcomparison.pdf

=safv3 comparison=
Between 0.800 and 0.900 I decided to move to a better format than the raw saf files. This new format takes up half the storage, allows for easy random access and generalizes to up to 5-dimensional SFS. A comparison can be found here: [[safv3]]
=Using NGStools=
See [[realSFS]] for how to convert the new safformat to the old safformat if you use NGStools.

MsToGlf

2021-07-26T14:01:49Z

Thorfinn:

For the [[Korneliussen2013]] paper, we simulated data according to genotypes simulated from ms/msms output. For this we used the msToGlf program found in the 'misc/' subfolder of the angsd source directory.

This program assumes that the user is generating diploid samples, which means that the user should supply a msfile containing 2xNind haplotypes.
=Brief Overview=
<pre>
./msToGlf
Probs with args, supply -in -out
also -err -depth -depthFile -singleOut -regLen -nind
</pre>
;-in ms/msms outputfilename
;-out prefix output filename
;-regLen [int] Number of base pairs the ms/msms output is supposed to represent. This is for each repetition.
;-singleOut [zero or one] ms/msms can generate multiple replicates of the same scenario; '-singleOut 1' will generate a single output file
;-depth average sequencing depth
;-nind Number of individuals in the ms/msms file (only needed in combination with -depthfile)
;-err errorrate, a value 0.005 corresponds to a 0.5% errorrate.
;-depthFile filename, This is useful if you want to force a different mean depth between individuals, remember to also use -nind if you use this option.
; -pileup [int] 0 print GLF, 1 print mpileup format that can be read my ANGSD using the -pileup option
; -Nsites [int] 0 normal ms output, 1 msms -N [INT] output

=Output format=
The program will dump a binary compressed file. It will calculate all 10 possible genotype likelihoods for each individual for all sites. The genotypes are in the order AA,AC,AG,AT,CC,CG,CT,GG,GT,TT.
These are encoded as ctype 'double'. So the size requirements for a single site for N individuals are 'N*10*sizeof(double)'.
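The size formula above can be turned into a quick sanity check on a generated file. The sketch below is only an illustration: the function names are made up for this example, and it assumes that sizeof(double) is 8 bytes, which holds on common 64-bit platforms but is not stated on this page.

```shell
#!/bin/sh
# Assumption: sizeof(double) is 8 bytes (typical, but platform dependent).
BYTES_PER_DOUBLE=8
# Number of genotypes per individual: AA,AC,AG,AT,CC,CG,CT,GG,GT,TT
GENOTYPES=10

# glf_site_bytes NIND
# Bytes used by one site for NIND individuals (hypothetical helper).
glf_site_bytes() {
    echo $(( $1 * GENOTYPES * BYTES_PER_DOUBLE ))
}

# glf_nsites NIND TOTAL_BYTES
# Number of sites contained in TOTAL_BYTES of uncompressed data.
glf_nsites() {
    echo $(( $2 / ($1 * GENOTYPES * BYTES_PER_DOUBLE) ))
}
```

For example, the number of sites in a file for 25 individuals could be checked with something like glf_nsites 25 "$(gzip -dc msoutputNoInvar.gl.glf.gz | wc -c)".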

=Examples=
==Standard neutral model==
This ms/msms command will generate haplotypes assuming human recombination/mutation rates for a 1mb region.
We will make 50 haplotypes (25 diploids) and do 14 repetitions.
<pre>
msms -ms 50 14 -t 900 -r 400 -oTPi 0.05 0.05 -oAFS >msoutput
</pre>
Now we will simulate genotype likelihoods assuming an error rate of 1.5% and a sequencing depth of 8x, but only for the variable/informative sites contained in the msoutput file.

<pre>
./msToGlf -in msoutput -out msoutputNoInvar.gl -err 0.015 -depth 8 -nind 25 -singleOut 1
</pre>

The output is a single, very small file called 'msoutputNoInvar.gl.glf.gz'.

Now let's do a more realistic example, where we don't limit ourselves to the informative sites but also simulate all the invariable sites for our 1mb region.
<pre>
./msToGlf -in msoutput -out msoutputWithInvar.gl -err 0.015 -depth 8 -nind 25 -singleOut 1 -regLen 1000000
</pre>
These can be fed into angsd using the -glf argument as input:
<pre>
../angsd -glf msoutputNoInvar.gl.glf.gz -nind 25 -doMajorMinor 1 -doMaf 1 -fai hg19.fai -isSim 1
</pre>
If you do sample allele frequency based analysis '-doSaf' then the ancestral states are assumed to be 'A'.

==With Selection==
The command below will generate 100 replicates of a scenario with strong positive selection in the center of a 1mb region, assuming 25 diploids.

<pre>
msms -ms 50 100 -t 900 -r 400 -SAA 1000 -SaA 500 -N 10000 -SF 0 -Sp .5 -oTPi 0.05 0.05 -oAFS >msoutput
</pre>

Let's generate genotype likelihoods corresponding to the above command. This will take some time and fill up considerable amounts of disk space, because it is the full data for a 100mb region for 25 samples. We here assume 2x data with 0.5% errors.

<pre>
./msToGlf -in msoutput -out withselection.gl -err 0.005 -depth 2 -nind 25 -singleOut 0 -regLen 1000000
</pre>

==Two populations==
This will generate msoutput for 20 diploids in total, doing 10 repetitions, each based on a 1mb region and using human mutation/recombination rates.
These parameters are supposed to mimic a population bottleneck followed by rapid expansion, similar to European and African populations. We have 12 individuals in population1 and 8 individuals from population2.

;Not really sure where I got this command.
<pre>
msms -ms 40 10 -t 930 -r 400 -I 2 24 16 0 -g 1 9.70406 -n 1 2 -n 2 1 -ma x 0.0 0.0 x -ej 0.07142857 2 1 >msoutput
</pre>

Let's run the msToGlf command:
<pre>
./msToGlf -in msoutput -out raw -singleOut 1 -regLen 0 -depth 6 -err 0.005
</pre>
We here specify a mean sequencing depth of 6, and an error rate of 0.5%. We only generate genotype likelihoods for the informative sites (-regLen 0), and generate a single output file.

We now slice out the two populations into separate files:
<pre>
angsd/misc/splitgl raw.glf.gz 20 1 12 >pop1.glf.gz
angsd/misc/splitgl raw.glf.gz 20 13 20 >pop2.glf.gz
</pre>

And we run -doSaf 1 on both files
<pre>
./angsd -glf pop1.glf.gz -nInd 12 -doSaf 1 -out pop1 -fai hg19.fai -isSim 1
./angsd -glf pop2.glf.gz -nInd 8 -doSaf 1 -out pop2 -fai hg19.fai -isSim 1
</pre>

And finally let's estimate the 2dsfs using the full ML method included in '''realSFS''':
<pre>
realSFS 2dsfs pop1.saf pop2.saf 24 16 -P 4 >pop.em.ml
</pre>
The output is a 25x17 matrix.
You can then read the data into R and barplot the marginals:

<pre>
a<-matrix(scan("pop.em.ml"),17)
barplot(colSums(a))
barplot(rowSums(a))
</pre>
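The 25x17 shape of the two-population example follows from the sample sizes: a sample of n diploid individuals has 2n+1 possible allele count categories (0 to 2n), so 12 and 8 individuals give 25 and 17. A minimal sketch of that arithmetic is shown here; the helper names are hypothetical and are not part of angsd or realSFS.

```shell
#!/bin/sh
# sfs2d_dim N1 N2
# Dimensions of the 2D SFS for N1 and N2 diploid individuals (hypothetical helper).
sfs2d_dim() {
    echo "$(( 2 * $1 + 1 ))x$(( 2 * $2 + 1 ))"
}

# sfs2d_cells N1 N2
# Total number of entries in that 2D SFS.
sfs2d_cells() {
    echo $(( (2 * $1 + 1) * (2 * $2 + 1) ))
}
```

Assuming pop.em.ml holds one value per entry, comparing the result of sfs2d_cells 12 8 with the output of wc -w <pop.em.ml is a cheap check that the estimation finished and that the expected sample sizes were used.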

Change log

2021-03-15T09:47:55Z

Thorfinn:

=Latest=
Odd versions are github versions...
*0.933 https://github.com/ANGSD/angsd/compare/0.933...0.935
*0.931 https://github.com/ANGSD/angsd/compare/0.931...0.933
*0.929 https://github.com/ANGSD/angsd/compare/0.929...0.931
*0.927 https://github.com/ANGSD/angsd/compare/0.925...0.927
*0.925 https://github.com/ANGSD/angsd/compare/0.923...0.925
*0.923 https://github.com/ANGSD/angsd/compare/0.921...0.923
*0.921 https://github.com/ANGSD/angsd/compare/0.918...0.921
*0.918 https://github.com/ANGSD/angsd/compare/0.916...0.918
*0.916 https://github.com/ANGSD/angsd/compare/0.914...0.916
*0.914 https://github.com/ANGSD/angsd/compare/0.912...0.914
*0.912 https://github.com/ANGSD/angsd/compare/0.911...0.912
*0.911 https://github.com/ANGSD/angsd/compare/0.910...0.911
*0.910 https://github.com/ANGSD/angsd/compare/0.901...0.910
*0.901 https://github.com/ANGSD/angsd/compare/0.900...0.901
*0.900 https://github.com/ANGSD/angsd/compare/0.800...0.900
*0.800 https://github.com/ANGSD/angsd/compare/0.700...0.800
*0.700 https://github.com/ANGSD/angsd/compare/0.615...0.70

=0.6***=
*0.614 https://github.com/ANGSD/angsd/compare/0.614...0.615
*0.613 https://github.com/ANGSD/angsd/compare/0.613...0.614
*0.613 https://github.com/ANGSD/angsd/compare/0.612a...0.613
*0.612 https://github.com/ANGSD/angsd/compare/0.610...0.612a
*0.610 https://github.com/ANGSD/angsd/compare/0.609...0.610a
*0.609 https://github.com/ANGSD/angsd/compare/0.608...0.609
*0.608 not super useful, but we now compile with knetfile, so users can use remote .fa files. I really recommend that users download the fasta instead.
*0.607 changed name of all abstract base classes to the more reasonable abc*.cpp. Included contamination and the iCounts format. Added a templated class so users can see how to access the internal datastructures.
*0.606 added more info in the thetaStat subprogram renamed all analysis classes to abc*.cpp.
*0.605 continued updating the Eunjung code
*0.604 added a 'job' array for the analysis classes which should greatly reduce the number of needed function calls. I doubt that will make any noticeable speed difference. Copied and modified some code from Eunjung (from jnovembre lab) for a banded approach in the saf calculation (-doSaf 2). This is not working yet.
*0.603 1) fixed wrong branching if users used simfiles. 2) fixed 2 bugs in the inbreeeded angsd_realSFS.cpp. Very small changes
*0.602 1) programs output the actual chromosome, if there is a mismatch between the fasta and bamheader 2) added a check that the index files (fa vs fai, bam vs bai) are newer than the files they index. 3) Program now validates that the .so and include files used are the same.
*0.601 1) started the move to includeflags/discardflags, outcommented in this version 2) validates that data has been generated for the different samples by looking at the difference between values, instead of comparing against -0.0. 3) MAF estimates now skip estimation for a site if the updated GL shows that GLs are noninformative.
*0.600 fixed a change in default FLAGS when using bam files. If you hadn't removed the unmapped reads from your bamfiles these would have been included in the analysis.

=0.589 to 599=
* 0.589 removed soap/sim/glf/glfclean(bin and text) and tglf.
* 0.590 added text mpileup as new input format. Very useful
* 0.591 refactored the file reading, moved the arguments to multi reader, such that all file reading is done from multi reader. FREEZE version. We will only allow bug fixes in the next many versions.
* 0.592 -pest output in angsd_realSFS.cpp. smartcounts. fixed a bug in hetplas. edited the printout msg for 'if you really want angsd to exit'. Maybe fixed an unknown bug. line 2 in .arg or screen output is now the commandline used, and some other visual stuff. Program didn't complain if -doMaf but no -domajorminor. if the chromosome name contained ':' it wouldn't work (strchr->strrchr)
* 0.593 fixed a strange bug, where the program would crash if, no analysis was chosen. fixed the 'shouldbeone' bug.
* 0.594 nochanges
* 0.595 nochanges
* 0.596 fixed a bug in persite depth counter (double count of C alleles). fixed a bug in smartcount subprogram.
* 0.597 treemix input file generation from smart counts
* 0.598 fixed a wrong compile flag in one of the utility programs 'smartCount'
* 0.599 much more informative messages if users forgot to add the -fai argument. Fixed a bug in parsing of arguments with -doSaf 4.

=<0.570 and <0.588=
* 0.571 removed bf's from maf class, negative values of -domaf disabled dumping of files, inbreeding has been added
* 0.572 Some funky new approach for the makefile is now being used, minor bug fixes (minDepth -> setMinDepth, extra header column in thetas.gz has been fixed)
* 0.573 added better info if bamfiles have different headers. Added check if length of supplied reference/ancestral doesn't match bamheader. autosize in emOptim2 for 2dsfs. Fixed subtle issue if very large coverage between bamfiles, now the 'biggest (file size)' is used to select region instead of the first bamfile. Netaccess is now deprecated.
* 0.574 modified emOptim2 so it now compiles on mac
* 0.575 smaller fixes to the inbreeding parsing, added p-value in analysisMaf instead of the raw llh.
* 0.576 -doSNP and -minLRT now deprecated, please use -SNP_pval instead
* 0.577 bugfix for -SNP_pval if value was 1. doHWE now called -HWE_pval and can be used for filtering.
* 0.578 speedup in HWE stuff
* 0.579 fixed an extremely rare assertion error (program was working, assertion was off). Redid all strcmp to strcasecmp. Fixed a bug in -doMaf 2 with -snp_pval
* 0.580 nochanges...
* 0.581 if trimming has been enabled, N's will be plugged in instead of the bases. A number of small changes.
* 0.582 nochanges...
* 0.583 1) changed bugfix when using counts based estimator for major/minor 2) keepsites is now using the effective number of samples in all cases 3) changed output of maf to a 'nicer' format
* 0.584 updated internal testing scripts.
* 0.585 1) fixed 'baq complains even though -ref was supplied' 2) fixed -doMajorMinor 4 and doMajorMinor 5 (sites not discarded) 3) added tremendously better information for the -sites 4) added some check for -doPost 5) program can now exit uncleanly if ctrl+c is pressed 3 times. 6) added an else to catch wrong arg in thetaStat.
*0.586 1) fixed parsing of pars if input is -sim1 2) fixed a bug in -doFasta 3 (the ebd one) 3) fixed a printout problem in -doFasta
*0.587 1) fixed a number of minor instances where memory wasn't being freed/deleted (mostly for keeping valgrind silent) 2) fixed a memleak if -sites files contained 4 columns but -doMajorMinor not 3. 3) fixed a memory leak that could occur if -doPost 2 and -doMaf 0.
* 0.588 1) fixed small printout error that could cause segfaults in rare cases. 2) change stderr to pos+1 3) changed some checks of user supplied pars 4) fixed a stack overflow if a very long -rf file was supplied.

=old=
==Dirty==
* 0.16 Is now bundled with SAMtools-0.1.17, and the mpileup (and friends) commands can be used for passing data to dirty.
* 0.17 added extra options -minInd and -minMaf, for only printing and using sites above a threshold
* 0.18 added option to pass reference and ancestral allele as fasta files.(using faidx format) (doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE)
* 0.19 added support for tglf inputfiles, -tglf -posfile see runexamples, also added the likeratio test for snp calling
* 0.20 Added the check for missing data, before the major/minor. included -realSFS, changed the deallocation of the -doMAF results, such that its proper cleaned up.
* 0.21 refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that prevented -samglf and samglfclean from working)
* 0.22 Well this update was a mixture of edits from [[user:albrecht]] and BGI so its difficult to give a concise description
* 0.23 Program can now read simulated files (single pop only). An example can be seen in "full example ... sfs" and input types.
* 0.24 added the tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso
* 0.25 the depth is now being populated when using mpileup -g. The program can now get the counts from mpileup
==ANGSD==
=<0.5=
* 0.01.a - 0.01.b The bfgs now supports threading, maybe anders implemented a heterozygosity estimator.
* 0.01.c A problem if we didn't observe any llh, caused the MAF estimator to 'nan'.
* 0.02 Fixed small bug in bfgs optimization of sfs optimization. When choosing a region bigger than what was covered by the .sfs file the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
* 0.03 added and documented genotypecaller, can dump counts,-realSFS 1 dumps positions, -realSFS 2 is deprecated,S,pi and tajima has been added to sfstools along with possibility to do prior
* 0.3 clean version with less features. The lost features will be reintroduced later.
* 0.43 first very clean version, everything should be included
* 0.441 rewrote the SOAPsnp GL model, -L and -maxQ is not needed anymore. Also added an option to choose an output dir for the recalibration matrix
* 0.4471 error estimation is now working, the fasta reading is now threadsafe. all GLs are now likeratios.
* 0.512 After 0.500 we have changed the internal structure such that each chunk is enforced to be on the same chr. version c) fixes a problem of hardclipping
* 0.515 A lot of legacy code has been removed from mUpPile.cpp. Program can now use remote files, built on code from SAMtools
* 0.520 A lot more legacy code has been removed from mUpPile.cpp. Program now does baq and adjustment of mapQ similar to -C in samtools. Also compiles on osx, but this is not supported
* 0.535 Bug in internal representation of mapQ's (only problematic for mapQ>128), we now use the flag to determine if a read has mapped. calcstat is now deprecated, users should use the bgid program now.
* 0.538 changed position output in association part. Fixed incorrect assert assumption in mUpPile.cpp. Added some downsampling options for errorEst and changed internal buffering when reading beagle files to allow for >10k individuals.
* 0.549 too many changes to remember

=<0.5 and <0.570=
* 0.551 tajima paper is now published, so the emOptim2 and bgid has now been properly documented. plink output is now supported and some snp filters can be outputted.
* 0.552 minor bug when calling genotypes without defining postcutoff -> missingness couldn't occur. removed the optimSFS and emOptim from the default compilelist
* 0.553 uint removed from code.
* 0.554 plugged in sfstools functionality into main angsd, (ability to output log posts)
* 0.555 anders added some consensus stuff
* 0.556 updated the filtering (if binary rep of keep file is incomplete it is removed again. It checks timestamps to see if file has been updated), folded spectra analysis should now be working
* 0.557 There was a bug in the realsfs part of the code, that was created in the 0.556 version. 0.557 is simply a fix of this, and the removal of a warning compiler flag in the msToGlf subprogram. We only observed the problematic compiler flag on a osx machine
* 0.558 tempversion,from this version bgid is now called thetaStat
* 0.559 abbababa
* 0.560 analysisCount.cpp has been updated to the nice standard of the wiki
* 0.561 program now compiles on clang, many small compiler warnings have been fixed.
* 0.562 Merge of forked versions, abba-baba fasta.
* 0.563 minQ filter has been moved to a much earlier step. Previously it was downstream classes that checked this. Now a base will be set to 'n' if it is below the threshold
* 0.564 cleaned up funky pars maf/asso such that all results are in ->extras[]
* 0.565 moved file reading stuff from shared to analysisFunction in namespace ail::
* 0.566 cleaned up small things, added a newer version of hetplas
* 0.567 cleaned up small things again. Started to add single pars e.g. -P -b
* 0.568 refactored compile order in general.cpp
* 0.569 added saf genotype calling, changed name from -realsfs to -doSaf
* 0.570 modified emOptim2 to estimate nSites and tell how much memory it will use, fixed empty -bam file

MediaWiki:Sitenotice

2021-03-15T09:47:19Z

Thorfinn:

ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.934/0.935 on github), see [[Change_log]] for changes, and download it [[Download and installation | here]].

Installation

2021-03-15T09:45:55Z

Thorfinn:

There has been some confusion about the versions of ANGSD.

* Even versions are freezes from the last odd git version

* Odd versions are git versions. Once there has been enough commits we will increment and make a release.

=Download and Installation=
To download and use ANGSD you need to download the htslib and the angsd source folder

You can either download the angsd0.934.tar.gz which contains both.
[http://popgen.dk/software/download/angsd/angsd0.934.tar.gz]

Or you can use github for the latest version of both htslib and angsd

Earlier versions from here: http://popgen.dk/software/download/angsd/
And here: https://github.com/ANGSD/angsd/releases

=Install=
Download and unpack the tarball, enter the directory and type make. Users on a Mac can use curl instead of wget.

===Unix===
The software can be compiled using make.
<pre>
wget http://popgen.dk/software/download/angsd/angsd0.934.tar.gz
tar xf angsd0.934.tar.gz
cd htslib;make;cd ..
cd angsd
make HTSSRC=../htslib
cd ..
</pre>
The executable is then located in '''angsd/angsd'''.

=Install from github=
For CRAM support you also need to install htslib, which can be done using the following commands:

<pre>
git clone --recursive https://github.com/samtools/htslib.git
git clone https://github.com/ANGSD/angsd.git
cd htslib;make;cd ../angsd ;make HTSSRC=../htslib
</pre>

=Systemwide installation of htslib?=
If htslib is installed systemwide, you just type make in the angsd directory.

Thorfinn

2021-03-15T09:39:23Z

Thorfinn: /* Make a new github version to put on github */

=How to deploy=
This page describes how a new version should be put on the wiki and github
==Make a combined angsd htslib to put on wiki download==
copy latest and make a new annotated tag and push to wiki
<pre>
VERSION=0.934
mkdir delme
cd delme
git clone --depth=1 https://github.com/ANGSD/angsd
git clone --depth=1 --recursive https://github.com/SAMtools/htslib
cd angsd; git tag -a ${VERSION} -m "time for version ${VERSION}"
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
cd ..
tar --exclude='.git' -cvf angsd${VERSION}.tar.gz angsd htslib/
scp angsd${VERSION}.tar.gz software@popgen.dk:/home/software/download/angsd/
</pre>

==Make a new github version to put on github==

<pre>
VERSION=0.935
cd angsd/
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
git commit Makefile -m "Preparing new ${VERSION} version"
git tag -a ${VERSION} -m "time for a new version ${VERSION}"
git push
git push --tags
</pre>

Thorfinn

2021-03-15T09:38:08Z

Thorfinn: /* Make a combined angsd htslib to put on wiki download */

=How to deploy=
This page describes how a new version should be put on the wiki and github
==Make a combined angsd htslib to put on wiki download==
copy latest and make a new annotated tag and push to wiki
<pre>
VERSION=0.934
mkdir delme
cd delme
git clone --depth=1 https://github.com/ANGSD/angsd
git clone --depth=1 --recursive https://github.com/SAMtools/htslib
cd angsd; git tag -a ${VERSION} -m "time for version ${VERSION}"
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
cd ..
tar --exclude='.git' -cvf angsd${VERSION}.tar.gz angsd htslib/
scp angsd${VERSION}.tar.gz software@popgen.dk:/home/software/download/angsd/
</pre>

==Make a new github version to put on github==

<pre>
VERSION=0.933
cd angsd/
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
git commit Makefile -m "Preparing new ${VERSION} version"
git tag -a ${VERSION} -m "time for a new version ${VERSION}"
git push
git push --tags
</pre>

Haploid calling

2020-10-27T09:21:25Z

Thorfinn: /* Major bug in version 0.911 (not in <0.911) */

Simple haploid output based on sampling or consensus. The latest github version of angsd has a small utility program in the misc folder that converts to plink output (tfam/tped).

__TOC__

<classdiagram type="dir:LR">
[BAM files{bg:orange}]->[Sequence data|Random base;Consensus base]
[sequence data]->[*.haplo.gz|single base file{bg:blue}]
</classdiagram>

=Brief Overview=
<pre>
> ./angsd -doHaploCall
-> angsd version: 0.910-45-g2b2b4f0-dirty (htslib: 1.2.1-192-ge7e2b3d) build(Jan 3 2016 14:45:41)
-> Analysis helpbox/synopsis information:
-> Command:
./angsd -doHaploCall -> Sun Jan 3 15:18:15 2016
--------------
abcHaploCall.cpp:
-doHaploCall 0
(Sampling strategies)
0: no haploid calling
1: (Sample single base)
2: (Concensus base)
-doCounts 0 Must choose -doCount 1
Optional
-minMinor 0 Minimum observed minor alleles
-maxMis -1 Maximum missing bases (per site)

</pre>

This function outputs a base for each individual for each site

=Options=
;-doHaploCall [int]
1; sample a random base
2; most frequent base. Random base for ties
; -doCounts 1
use -doCounts 1 in order to count the bases at each site after filters.
;-minMinor [int]
Minimum observed minor alleles; only prints sites with more than minMinor sampled alleles (across individuals).
; -maxMis [int]
maximum allowed missing alleles (across individuals). -maxMis 0 means only sites without missing alleles are printed

=Output=
;*.haplo.gz
Output: Each line represents a site: chromosome name (column 1), position (column 2), major allele (column 3), followed by one column for each individual with the sampled allele.

==Example==
Sample a random base for each individual at every site on chromosome 1, printing only sites that pass the -minMinor 1 filter.

<pre>
./angsd -bam bam.filelist -dohaplocall 1 -doCounts 1 -r 1: -minMinor 1
</pre>

=Output=

<pre>
chr pos major ind0 ind1 ind2 ind3 ind4 ind5 ind6
1 14000170 C T T C N C C C
1 14000202 A A N G A N N G
1 14000457 G G G G G G N A
1 14000459 G G G G G A N N
1 14000774 G T G G G G G T
1 14002083 C G N C C C C C
1 14002351 A A C C A C N A
1 14002950 A T A A A T N T
1 14004832 G G G A G G A G
1 14006543 G T G G G G G G
1 14006631 A C N A N A N A
1 14007068 G T T T G G G N
1 14009284 A A C C C N A N
1 14009775 G G G G G C G C
1 14009787 T T T G T G T T
1 14009791 A G G A G A G A
1 14009794 A A A A N N A A
1 14009800 A G A A G N G A
1 14010748 A G N A G A A A
</pre>
columns are

; chr
chromosome
; pos
position
; major
major allele (most common of the sampled alleles)
; ind0
first individual - same order as in the input files

Thetas,Tajima,Neutrality tests

2020-08-27T12:40:14Z

Thorfinn: /* Full command list for below examples */

This method will estimate different thetas (population scaled mutation rate) and, based on these thetas, can calculate Tajima's D and various other neutrality test statistics. The method is described in [[Korneliussen2013]].

* NB Information on this website is for version 0.917-33-g6d2aec8 or higher.
* NB The [[Korneliussen2013]] paper covers two methods:
# using an ML method
# using the empirical Bayes (EB) method. The information on this page relates to the EB method.
For performing the ML method, you should use the [[SFS Estimation]] method and define the region of interest.

=Quick Example=
Below is a chain of commands used for calculating the statistics. These are based on the test files that can be downloaded from the [[Quick Start ]] page.

It's a 3 step procedure
# Estimate a site frequency spectrum. Output is the '''out.sfs''' file. This is what is passed as the '''-sfs''' argument in step 2.
# Calculate per-site thetas. Output is the '''.thetas.idx/.thetas.gz''' files. These contain the binary per-site estimates of the thetas.
# Calculate neutrality test statistics. Output is a '''.thetas.idx.pestPG''' file.
==Full command list for below examples==
Here is the chain of commands required to estimate the thetas and calculate the neutrality test statistics. These commands are described in detail in the following '''step 1,... step 3b''' subsections. If you do not have the ancestral state, you can simply use the assembly you have mapped against, but remember to add -fold 1 in the 'realSFS' and 'realSFS saf2theta' steps.
<pre>
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
#for unfolded
./misc/realSFS out.saf.idx -P 24 > out.sfs
./misc/realSFS saf2theta out.saf.idx -outname out -sfs out.sfs
#for folded
./misc/realSFS out.saf.idx -P 24 -fold 1 > out.sfs
./misc/realSFS saf2theta out.saf.idx -outname out -sfs out.sfs -fold 1
#Estimate for every Chromosome/scaffold
./misc/thetaStat do_stat out.thetas.idx
#Do a sliding window analysis based on the per-site thetas in out.thetas.idx
./misc/thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>

==Step 1: Finding a 'global estimate' of the SFS==

First estimate the site allele frequency likelihood
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
<pre class="mw-collapsible-content">

-> Reading fasta: chimpHg19.fa
-> Parsing 10 number of samples
-> Printing at chr: 20 pos:14095817 chunknumber 3500
-> Done reading data waiting for calculations to finish
-> Calling destroy
-> Done waiting for threads
-> Output filenames:
->"out.arg"
->"out.saf"
->"out.saf.pos.gz"
-> Mon Jun 30 12:02:58 2014
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 47.19 sec
[ALL done] walltime used = 43.00 sec

</pre>
</div>

Obtain the maximum likelihood estimate of the SFS using the '''realSFS''' program found in the misc subfolder. (See more here [[realSFS]])
<pre>
./misc/realSFS out.saf.idx -P 24 > out.sfs
</pre>
Or, if you want to calculate the folded spectrum:
<pre>
./misc/realSFS out.saf.idx -P 24 -fold 1 > out.sfs
</pre>

To plot the SFS in R :
<pre>
s<-scan('out.sfs')
s<-s[-c(1,length(s))]
s<-s/sum(s)
barplot(s,names=1:length(s),main='SFS')
</pre>

==Step 2: Calculate the thetas for each site==
<pre>
realSFS saf2theta out.saf.idx -sfs out.sfs -outname out
</pre>
The above command outputs two files: out.thetas.gz and out.thetas.idx. A formal description of these files can be found in doc/formats.pdf in the angsd package. It is possible to extract the log-scaled per-site thetas using the ./thetaStat print program.

<div class="toccolours mw-collapsible mw-collapsed">
thetaStat print out.thetas.idx 2>/dev/null |head
<pre class="mw-collapsible-content">
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -10.339284 -12.069325 -9.000927 -15.852173 -12.739969
1 14000033 -10.437878 -12.185619 -9.080596 -16.001343 -12.856984
1 14000034 -10.373872 -12.110464 -9.028572 -15.905591 -12.781380
1 14000035 -10.528192 -12.290763 -9.154920 -16.133823 -12.962708
1 14000036 -10.322074 -12.051400 -8.985016 -15.834049 -12.722040
1 14000037 -10.304955 -12.028814 -8.973260 -15.800330 -12.699204
1 14000038 -10.108563 -11.791546 -8.819884 -15.486384 -12.460146
1 14000039 -10.542117 -12.306631 -9.166698 -16.153168 -12.978650
1 14000040 -10.688401 -12.473763 -9.290272 -16.358398 -13.146564
</pre>
</div>
Per default the print command will also output the contents of the index file to the stderr.
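
Because the printed values are on a log scale, they have to be exponentiated before they can be summed over a region. The sketch below does this for the Watterson and Pairwise columns. It assumes the values are natural logarithms and that header lines start with '#'; the function name is hypothetical and is not part of thetaStat.

```shell
#!/bin/sh
# theta_linear_sum
# Read the text output of "thetaStat print" on stdin and print the sum over
# sites of exp(Watterson) and exp(Pairwise), i.e. columns 3 and 4.
# Assumption: the per-site values are natural logs.
theta_linear_sum() {
    awk '!/^#/ && NF >= 4 {
        tw += exp($3)
        tp += exp($4)
    }
    END { printf "%.6f %.6f\n", tw, tp }'
}
```

It could be used as thetaStat print out.thetas.idx 2>/dev/null | theta_linear_sum. For real analyses the do_stat subprogram described in the next step is the intended tool; this is just a way to inspect the per-site values.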

==Step 3a: Estimate Tajimas D and other statistics==
<pre>
#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.idx
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
cat out.thetas.idx.pestPG
<pre class="mw-collapsible-content">
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759
(0,97855)(14001686,14100094)(0,14100094) 9 7050047 88.777475 102.853333 64.660948 104.749694 103.801516 0.660983 0.776148 0.611265 -0.021256 0.184875 97855
(0,98031)(13999906,14100096)(0,14100096) 10 7050048 129.583334 134.877160 88.135115 213.231615 174.054390 0.170681 0.654145 0.724072 -0.602284 0.375595 98031
(0,99220)(13999900,14100060)(0,14100060) 11 7050030 66.349155 79.423643 60.194045 68.903312 74.163477 0.819589 0.520022 0.207421 0.157614 0.128409 99220
(0,99861)(13999913,14100078)(0,14100078) 12 7050039 86.461303 81.630083 96.156392 110.974922 96.302507 -0.232902 -0.302980 -0.252190 -0.337701 0.124323 99861
(0,98258)(13999943,14100097)(0,14100097) 16 7050048 83.191170 99.392421 77.561510 106.148748 102.770584 0.811500 0.472922 0.152079 -0.080798 0.257008 98258
(0,99428)(13999902,14100095)(0,14100095) 17 7050047 90.254620 99.816352 65.610351 113.328929 106.572638 0.441707 0.683942 0.614609 -0.148988 0.197530 99428
(0,97118)(13999898,14100071)(0,14100071) 18 7050035 79.843256 75.282296 86.844252 67.720321 71.501308 -0.237958 -0.260778 -0.196888 0.094212 -0.114062 97118
(0,93783)(13999895,14100089)(0,14100089) 19 7050044 54.311523 49.839190 64.913940 72.868913 61.354048 -0.341795 -0.495649 -0.434079 -0.421111 0.141133 93783
(0,98938)(13999916,14100091)(0,14100091) 20 7050045 68.148147 63.323800 78.463736 56.040370 59.682084 -0.294508 -0.398845 -0.338673 0.106250 -0.135474 98938
</pre>
</div>

==Step 3b: Sliding Window example==
We can easily do a sliding window analysis by adding -win/-step arguments to the last command. [[ thetaStat ]]
<pre>
thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>
This will calculate the test statistic using a window size of 50kb and a step size of 10kb.

=Example Output=
<pre>
- Output in the ./thetaStat print thetas.idx are the log scaled per site estimates of the thetas
- Output in the pestPG file are the sum of the per site estimates for a region
</pre>
==./thetaStat print angsdput.thetas.idx==
<pre>
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -9.457420 -10.372069 -8.319252 -13.025778 -10.997194
1 14000033 -9.463637 -10.379368 -8.324414 -13.035780 -11.004670
1 14000034 -9.463740 -10.379488 -8.324500 -13.035942 -11.004793
1 14000035 -9.463603 -10.379328 -8.324386 -13.035725 -11.004629
1 14000036 -9.323246 -10.218453 -8.204848 -12.826627 -10.840519
1 14000037 -9.179270 -10.048883 -8.086425 -12.596436 -10.666670
1 14000038 -9.004664 -9.845473 -7.941453 -12.328274 -10.458416
1 14000039 -9.327033 -10.222983 -8.207914 -12.833007 -10.845176
1 14000040 -9.621554 -10.557563 -8.461745 -13.262415 -11.185971
1 14000041 -9.617449 -10.552869 -8.458225 -13.256257 -11.181185
1 14000042 -7.337841 -8.161756 -204.045433 -5.457443 -6.085818
1 14000043 -9.570405 -10.502160 -8.415195 -13.197596 -11.129976
1 14000044 -9.511097 -10.434558 -8.364249 -13.110037 -11.061100
1 14000045 -9.563664 -10.494371 -8.409489 -13.187203 -11.122022
1 14000046 -9.617690 -10.555402 -8.456395 -13.265004 -11.184107
1 14000047 -9.563722 -10.494438 -8.409538 -13.187292 -11.122090
1 14000048 -9.856578 -10.819096 -8.669691 -13.587898 -11.451396
</pre>
;1. chromosome
;2. position
;3. ThetaWatterson
;4. ThetaD (nucleotide diversity)
;5. Theta? (singleton category)
;6. ThetaH
;7. ThetaL
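
The relation between the two outputs (log-scaled per-site values here, per-region sums in the pestPG file) can be sketched as follows. This is only an illustration, not the thetaStat implementation: it assumes the log scale is the natural logarithm, and the function name <code>sum_thetas</code> is made up for this example.

```python
import math

# A few rows copied from the example output above (log-scale values).
EXAMPLE = """\
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -9.457420 -10.372069 -8.319252 -13.025778 -10.997194
1 14000033 -9.463637 -10.379368 -8.324414 -13.035780 -11.004670
1 14000042 -7.337841 -8.161756 -204.045433 -5.457443 -6.085818
"""


def sum_thetas(text):
    """Sum exp(value) per theta column over all data rows.

    Assumption: values are natural-log scaled and the first line is the
    '#Chromo Pos ...' header shown above.
    """
    names = None
    totals = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("#"):
            names = fields[2:]  # skip the Chromo and Pos columns
            continue
        values = [math.exp(float(v)) for v in fields[2:]]
        if totals is None:
            totals = [0.0] * len(values)
        totals = [t + v for t, v in zip(totals, values)]
    if names is None or totals is None:
        raise ValueError("expected a header line and at least one data row")
    return dict(zip(names, totals))


totals = sum_thetas(EXAMPLE)
for name, total in totals.items():
    print(f"{name}\t{total:.6g}")
```

Very negative log values, such as the thetaSingleton entry at position 14000042, contribute essentially nothing to the sum.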

==.thetas.idx.pestPG==
The .pestPG file is a 14-column, tab-separated file. The first column contains information about the region. The second and third columns are the reference name and the center of the window.

We then have 5 different estimators of theta: Watterson, pairwise, FuLi, fayH, L.
And we have 5 different neutrality test statistics: Tajima's D, Fu & Li's F, Fu & Li's D, Fay's H, Zeng's E.
The final column is the effective number of sites with data in the window.

<pre>
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759

</pre>
Format is:

<code>
(indexStart,indexStop)(posStart,posStop)(regStart,regStop) chrname wincenter tW tP tF tH tL tajD fulif fuliD fayH zengsE numSites
</code>

Most likely you are just interested in the wincenter (column 3) and column 9, which is the Tajima's D statistic.

The first 3 columns relate to the region. The next 5 columns are 5 different estimators of theta, and the next 5 columns are neutrality test statistics. The final column is the number of sites with data in the region.

The first '''()()()''' is mainly used for debugging the sliding window program. The interpretation is:
* posStart and posStop are the first and last physical positions of the sites included in the analysis.
* regStart and regStop delimit the physical region for which the analysis is performed. Therefore posStart and posStop always lie within regStart and regStop.
* indexStart and indexStop are the positions within the internal array.
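
A minimal sketch of reading one pestPG line into named fields, following the format line above. The function name <code>parse_pestpg_line</code> and the dictionary keys for the three bracketed pairs are made up for this example; the remaining names are taken from the format description.

```python
import re

# Column names after the region column, as given in the format line above.
COLUMNS = ["chrname", "wincenter", "tW", "tP", "tF", "tH", "tL",
           "tajD", "fulif", "fuliD", "fayH", "zengsE", "numSites"]

# Matches "(a,b)(c,d)(e,f)" in the first column.
REGION = re.compile(r"\((\d+),(\d+)\)\((\d+),(\d+)\)\((\d+),(\d+)\)")


def parse_pestpg_line(line):
    """Parse one data line of a .pestPG file into a dict."""
    fields = line.split()
    if len(fields) != 14:
        raise ValueError(f"expected 14 columns, got {len(fields)}")
    match = REGION.fullmatch(fields[0])
    if match is None:
        raise ValueError(f"unexpected region column: {fields[0]!r}")
    nums = [int(x) for x in match.groups()]
    record = {
        "index": (nums[0], nums[1]),  # position within the internal array
        "pos": (nums[2], nums[3]),    # first/last position with data
        "reg": (nums[4], nums[5]),    # region analysed
    }
    for name, value in zip(COLUMNS, fields[1:]):
        if name == "chrname":
            record[name] = value
        elif name in ("wincenter", "numSites"):
            record[name] = int(value)
        else:
            record[name] = float(value)
    return record


# First data line of the example output above.
LINE = ("(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 "
        "46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 "
        "-0.595302 -0.099654 -0.048444 98316")

rec = parse_pestpg_line(LINE)
print(rec["chrname"], rec["wincenter"], rec["tajD"])
```

Skipping lines that start with <code>#</code> before calling the function is enough to handle the two header lines.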

=Unknown ancestral state (folded sfs)=

If you don't have the ancestral states, you can still calculate the Watterson and Tajima thetas, which means you can still compute Tajima's D. This requires you to use the folded SFS. The output files will have the same format, but only thetaW, thetaD and Tajima's D are meaningful.

There was previously an example below that showed how to perform this analysis. This information has now been added to the examples above (notice the -fold 1 option in the realSFS step).

=Citation=
[[Korneliussen2013]]

Allele Counts

2020-05-02T04:34:54Z

Thorfinn: /* Depth Distribution */

__TOC__
Sometimes we want or need the frequency of the different bases. This is what -doCounts does.

You can refine which bases are included using the filter parameters '''-minMapQ/-minQ/-trim'''. Based on the total depth at each site, you can discard sites from further analysis if the total depth is below/above some threshold ('''-setMinDepth/-setMaxDepth'''), and you can discard a site if the effective sample size is below some threshold ('''-minInd''').

You can output summary statistics such as the Q score distribution '''-doQsDist''', the depth distribution '''-doDepth''', or various per-site counts '''-dumpCounts'''. All output files have a header, which should make the interpretation straightforward.

=Brief Overview=
<pre>
./angsd -doCounts
-> angsd version: 0.560 build(Dec 4 2013 13:27:02)
-> Analysis helpbox/synopsis information:
---------------
analysisCount.cpp:
-doCounts 0 (Count the number A,C,G,T. All sites, All samples)
-minQ 13 (remove bases with qscore<minQ)
-minQfile (null) file with individuals quality score threshold)
-setMaxDepth -1 (If total depth is larger then site is removed from analysis.
-1 indicates no filtering)
-setMinDepth -1 (If total depth is smaller then site is removed from analysis.
-1 indicates no filtering)
-trim 0 (trim ends of reads)
-minInd 0 (Discard site if effective sample size below value.
0 indicates no filtering)
Filedumping:
-doDepth 0 (dump distribution of seqdepth) .depthSample,.depthGlobal
-maxDepth 100 (bin together high depths)
-doQsDist 0 (dump distribution of qscores) .qs
-dumpCounts 0
1: total seqdepth for site .pos.gz
2: seqdepth persample .pos.gz,.counts.gz
3: A,C,G,T sum all samples .pos.gz,.counts.gz
4: A,C,G,T sum every sample .pos.gz,.counts.gz
</pre>

=Options=
==Filtering==
;-minQ [int]
Default 13. Discard bases with a qscore below this threshold.
;-trim [int]
Default 0. Trim [int] bases at both ends of the reads. Useful for ancient DNA.
;-setMinDepth [int]
Default -1. If the total depth is below this value, the site is discarded.
;-setMaxDepth [int]
Default -1. If the total depth is above this value, the site is discarded.
;-minQfile [fileName]
Default NULL. File with per-individual base quality score thresholds. The number of rows should match the number of individuals, and the number of columns should be either 1 or 4. If four columns are given, a separate quality threshold is used for each base (A C G T). Both spaces and tabs are acceptable as delimiters.
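
Before running ANGSD it can be handy to check that a threshold file has the shape described for '''-minQfile'''. The sketch below is a stand-alone check, not ANGSD code. The function name <code>parse_minq_file</code> is made up, thresholds are assumed to be integers, and expanding a single column to all four bases is an assumption of this example.

```python
def parse_minq_file(text, n_individuals):
    """Validate the layout of a -minQfile and return one dict per individual.

    Rules taken from the option description: one row per individual,
    1 or 4 whitespace-separated columns (A C G T order for 4 columns).
    """
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # accepts both spaces and tabs
        if not fields:
            continue
        if len(fields) not in (1, 4):
            raise ValueError(f"line {lineno}: expected 1 or 4 columns")
        values = [int(f) for f in fields]
        if len(values) == 1:
            values = values * 4  # assumption: one threshold for all bases
        rows.append(dict(zip("ACGT", values)))
    if len(rows) != n_individuals:
        raise ValueError(
            f"{len(rows)} rows but {n_individuals} individuals")
    return rows


thresholds = parse_minq_file("20 23 23 20\n30\n", n_individuals=2)
print(thresholds)
```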

==output summary==

; -dumpCounts [int]
Default 0. See examples below. Output files are called '''.pos.gz,.counts.gz'''.
; -doQsDist [int]
Default 0. Output the distribution of quality scores. Output files are called '''.qs'''.
; -doDepth [int]
Default 0. Output the distribution of sequencing depths. Sites with depth above '''-maxDepth''' will be binned. Output files are called '''.depthSample,.depthGlobal'''.
;-maxDepth [int]
Default 100. See '''-doDepth''' parameter.

=Output formats=
==Printing Counts per site==
; -dumpCounts [int]
1: Prints the overall depth in the .pos file. This depth is the sum of reads covering a site across all individuals. The first column is the chromosome, the second is the position, and the third is the total depth.
<pre>
chr pos totDepth
1 13999902 1
1 13999903 1
1 13999904 1
1 13999905 2
1 13999906 2
1 13999907 2
1 13999908 2
1 13999909 2
1 13999910 2
</pre>

2: Prints the depth of each individual. The example shows five individuals. Each line corresponds to the same line in the position file.
<pre>
ind0TotDepth ind1TotDepth ind2TotDepth ind3TotDepth ind4TotDepth
0 0 0 7 0
0 3 0 0 0
0 0 4 4 0
0 0 0 0 1
5 0 0 0 0
0 0 10 0 0
0 0 0 0 1
0 4 0 10 0
0 0 0 2 0

</pre>
3: Prints the depth for each of the four bases across all individuals. Each line corresponds to the same line in the position file.
<pre>
totA totC totG totT
1 0 0 0
0 0 1 0
0 1 0 0
0 0 0 2
2 0 0 0
0 2 0 0
0 0 0 2
0 2 0 0
0 0 2 0
0 0 2 0
0 0 2 0
2 0 0 0
0 0 2 0
</pre>

4: Prints the depth for each of the four bases for each individual at each site. In the example, the first four columns belong to the first individual and hold the counts of A, C, G and T. Only three individuals are shown. Each line corresponds to the same line in the position file.
<pre>
ind0_A ind0_C ind0_G ind0_T ind1_A ind1_C ind1_G ind1_T ind2_A ind2_C ind2_G ind2_T
0 1 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0 1 0
0 1 0 0 0 0 0 0 0 0 0 0
</pre>
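
Going by the descriptions above, the coarser outputs can be derived from the '''-dumpCounts 4''' layout by summing. The sketch below illustrates that relation on one line of the example. It is not ANGSD code, and the function name <code>collapse_counts</code> is made up.

```python
def collapse_counts(line):
    """Collapse one '-dumpCounts 4' line (A C G T per individual).

    Returns (depth per individual, A/C/G/T totals, total depth), which
    have the shapes described for -dumpCounts 2, 3 and 1.
    """
    counts = [int(x) for x in line.split()]
    if len(counts) % 4 != 0:
        raise ValueError("expected four columns per individual")
    per_ind = [counts[i:i + 4] for i in range(0, len(counts), 4)]
    depth_per_ind = [sum(c) for c in per_ind]
    base_totals = [sum(c[b] for c in per_ind) for b in range(4)]
    return depth_per_ind, base_totals, sum(counts)


# Seventh data line of the example above: a T in ind0 and a T in ind2.
depth_per_ind, base_totals, total = collapse_counts("0 0 0 1 0 0 0 0 0 0 0 1")
print(depth_per_ind, base_totals, total)
```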
===Example===
Print the individuals depth from bam files
<pre>
./angsd -out out -doCounts 1 -dumpCounts 2 -bam bam.filelist
</pre>

Print the individuals depth from bam files but filter away low quality bases
<pre>
./angsd -out out -doCounts 1 -dumpCounts 2 -bam bam.filelist -minQ 20
</pre>

Print the per-individual depth from bam files, but filter away low quality bases using a different threshold per individual and base type
<pre>
./angsd -out out -doCounts 1 -dumpCounts 2 -bam bam.filelist -minQ 20 -minQfile qThres.txt
</pre>

qThres.txt:
<pre>
20 23 23 20
30 34 34 30
30 34 34 30
30 34 34 30
30 34 34 30
20 23 23 20
30 34 34 30
30 34 34 30
20 30 30 20
20 23 23 20
</pre>
For individual 1, the above analysis removes A and T bases with a Q score below 20, and C and G bases with a Q score below 23. The other individuals use different thresholds.

==qscore Distribution==
Column 1 is the qscore value, and column 2 is the corresponding count.
<pre>
qscore counts
13 87501
14 102888
15 113625
16 130494
17 145577
18 163049
19 180678
20 209447
21 247044
22 279325
23 332391
24 401459
25 484744
26 554127
27 609758
28 772123
29 1041218
30 1204349
31 1516248
32 1934112
33 2210498
34 2269812
35 2083536
36 1901735
37 1151146
38 441422
39 78625
40 21617
41 5870
42 1577
43 551
44 183
45 55
46 23
47 13
48 2
</pre>
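
A short sketch of how the '''.qs''' table can be summarised, for example to see which fraction of bases would survive a given '''-minQ'''. The function name <code>fraction_at_least</code> is made up, and the three rows are an excerpt of the table above.

```python
def fraction_at_least(text, threshold):
    """Fraction of counted bases with qscore >= threshold."""
    total = kept = 0
    for line in text.splitlines():
        fields = line.split()
        # Skip the 'qscore counts' header and anything malformed.
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        qscore, count = int(fields[0]), int(fields[1])
        total += count
        if qscore >= threshold:
            kept += count
    if total == 0:
        raise ValueError("no data rows found")
    return kept / total


EXCERPT = """\
qscore counts
13 87501
20 209447
30 1204349
"""
print(f"{fraction_at_least(EXCERPT, 20):.3f}")
```

On a real file, pass the full file contents instead of the excerpt.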

==Depth Distribution==
Column 1 in the '''.depthSample''' and '''.depthGlobal''' files contains the number of sites with a sequencing depth of 0, column 2 the number of sites with a sequencing depth of 1, etc.

The '''.depthSample''' file contains the depth per sample. Line 1 corresponds to individual 1, line 2 to individual 2, etc.
<pre>
29403 87426 162912 229726 267115 259774 222153 170894 114295 71777 41654 22149 11030 5305 2425 1037 419 257 84 60 31 18 19 16 25 16 10
26318 88728 171544 244276 275342 263071 217952 162616 107571 65839 37466 20070 10150 4828 2237 1110 531 253 111 31 3 0 0 0 0 0 00
211936 393333 422459 322225 191564 95488 39672 15427 5220 1658 460 157 90 71 53 38 24 60 32 7 1 2 2 4 2
</pre>

The '''.depthGlobal''' file contains the depth distribution across all individuals.

<pre>
395 4299 7207 13203 23358 37489 56976 80588 107748 131669 150595 160482 161650 153690 138321 118217 96207 75735 57501 41561 29112 19549 12818 8200 5114 3247 1936 1123 646 378 238 165 105 75 71 43 43 33 27 19 15 17 17 21 24 11 7 7 14 5 1 3 3 3 1 1 3 2 3 1 1 5 4 5 6 11 4 2 1 2 0
</pre>
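
Since column ''i'' holds the number of sites with depth ''i''-1, each line is a histogram and can be summarised directly. The sketch below computes the mean depth and the fraction of sites without data for one line. The function name <code>depth_summary</code> is made up, and because depths above '''-maxDepth''' are binned together, the mean should be read as a lower bound.

```python
def depth_summary(line):
    """Return (mean depth, fraction of sites with depth 0) for one line."""
    counts = [int(x) for x in line.split()]
    n_sites = sum(counts)
    if n_sites == 0:
        raise ValueError("histogram is empty")
    # enumerate() gives the depth (0, 1, 2, ...) for each column.
    mean = sum(depth * c for depth, c in enumerate(counts)) / n_sites
    return mean, counts[0] / n_sites


# Third line of the .depthSample example above.
SAMPLE3 = ("211936 393333 422459 322225 191564 95488 39672 15427 5220 1658 "
           "460 157 90 71 53 38 24 60 32 7 1 2 2 4 2")
mean, zero_fraction = depth_summary(SAMPLE3)
print(f"mean depth {mean:.2f}, sites without data {zero_fraction:.1%}")
```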

Change log

2020-04-27T16:38:13Z

Thorfinn: /* Latests */

=Latests=
Odd versions are github versions...
*0.931 https://github.com/ANGSD/angsd/compare/0.931...0.933
*0.929 https://github.com/ANGSD/angsd/compare/0.929...0.931
*0.927 https://github.com/ANGSD/angsd/compare/0.925...0.927
*0.925 https://github.com/ANGSD/angsd/compare/0.923...0.925
*0.923 https://github.com/ANGSD/angsd/compare/0.921...0.923
*0.921 https://github.com/ANGSD/angsd/compare/0.918...0.921
*0.918 https://github.com/ANGSD/angsd/compare/0.916...0.918
*0.916 https://github.com/ANGSD/angsd/compare/0.914...0.916
*0.914 https://github.com/ANGSD/angsd/compare/0.912...0.914
*0.912 https://github.com/ANGSD/angsd/compare/0.911...0.912
*0.911 https://github.com/ANGSD/angsd/compare/0.910...0.911
*0.910 https://github.com/ANGSD/angsd/compare/0.901...0.910
*0.901 https://github.com/ANGSD/angsd/compare/0.900...0.901
*0.900 https://github.com/ANGSD/angsd/compare/0.800...0.900
*0.800 https://github.com/ANGSD/angsd/compare/0.700...0.800
*0.700 https://github.com/ANGSD/angsd/compare/0.615...0.70

=0.6***=
*0.614 https://github.com/ANGSD/angsd/compare/0.614...0.615
*0.613 https://github.com/ANGSD/angsd/compare/0.613...0.614
*0.613 https://github.com/ANGSD/angsd/compare/0.612a...0.613
*0.612 https://github.com/ANGSD/angsd/compare/0.610...0.612a
*0.610 https://github.com/ANGSD/angsd/compare/0.609...0.610a
*0.609 https://github.com/ANGSD/angsd/compare/0.608...0.609
*0.608 not super useful but we now compile with knetfile, so users can use remote .fa files. I really recommend that users download the fasta instead.
*0.607 changed name of all abstract base classes to the more reasonable abc*.cpp. Included contamination and the iCounts format. Added a templated class so users can see how to access the internal datastructures.
*0.606 added more info in the thetaStat subprogram renamed all analysis classes to abc*.cpp.
*0.605 continued updating the Eunjung code
*0.604 added a 'job' array for the analysis classes which should greatly reduce the number of needed function call. I doubt that will make any noticeable speed difference. Has copied and modified some code from Eunjung (from jnovembre lab) for a banded approach in the saf calculation. This is not working yet -doSaf 2
*0.603 1) fixed wrong branching if users used simfiles. 2) fixed 2 bugs in the inbreeeded angsd_realSFS.cpp. Very small changes
*0.602 1) program outputs the actual chromosome if there is a mismatch between the fasta and bamheader 2) added a check that the index files (fai for fa, bai for bam) are newer than the files they index. 3) Program now validates that the .so and include files used are the same.
*0.601 1) started the move to includeflags/discardflags, commented out in this version 2) validates that data has been generated for the different samples by looking at the difference between values, instead of comparing against -0.0. 3) MAF estimation now skips a site if the updated GL shows that GLs are noninformative.
*0.600 fixed a change in default FLAGS when using bam files. If you hadn't removed the unmapped reads from your bamfiles these would have been included in the analysis.

=0.589 to 599=
* 0.589 removed soap/sim/glf/glfclean(bin and text) and tglf.
* 0.590 added text mpileup as new input format. Very useful
* 0.591 refactored the file reading, moved the arguments to multi reader, such that all file reading is done from multi reader. FREEZE version. We will only allow bug fixes in the next many versions.
* 0.592 -pest output in angsd_realSFS.cpp. smartcounts. fixed a bug in hetplas. edited the printout msg for 'if you really want angsd to exit'. Maybe fixed an unknown bug. line 2 in .arg or screen output is now commandline used, and some other visual stuff. Program didn't complain if -doMaf but no -domajorminor. if chromosome name contained ':' it wouldn't work strchr->strrchr
* 0.593 fixed a strange bug, where the program would crash if, no analysis was chosen. fixed the 'shouldbeone' bug.
* 0.594 nochanges
* 0.595 nochanges
* 0.596 fixed a bug in persite depth counter (double count of C alleles). fixed a bug in smartcount subprogram.
* 0.597 treemix input file generation from smart counts
* 0.598 fixed a wrong compile flag in one of the utility programs 'smartCount'
* 0.599 much more informative messages if users forgot to add -fai argument. Fixed a bug in parsing of arguments with -doSaf 4.

=<0.570 and <0.588=
* 0.571 removed bf's from maf classs, negative values of -domaf disabled dumping of files, inbreeding has been added
* 0.572 Some funky new approach for the makefile is now being used, minor bug fixes (minDepth -> setMinDepth, extra header column in thetas.gz has been fixed)
* 0.573 added better info if bamfiles have different header. Added check if length of supplied reference/ancestral doesn't match bamheader. autosize in emOptim2 for 2dsfs. Fixed subtle issue if very large coverage difference between bamfiles, now the 'biggest (file size)' is used to select region instead of the first bamfile. Netaccess is now deprecated.
* 0.574 modifed emOptim2 so now compiles on mac
* 0.575 smaller fixes to the inbreeding parsing, added p-value in analysisMaf instead of the raw llh.
* 0.576 -doSNP and -minLRT now deprecated, please use -SNP_pval instead
* 0.577 bugfix for -SNP_pval if value was 1. doHWE now called -HWE_pval and can be used for filtering.
* 0.578 speedup in HWE stuff
* 0.579 fixed an extremely rare assertion error (program was working, assertion was off). Redid all strcmp to strcasecmp. Fixed a bug in -doMaf 2 with -snp_pval
* 0.580 nochanges...
* 0.581 if trimming has been enabled, N's will be plugged in instead of the bases. A number of small changes.
* 0.582 nochanges...
* 0.583 1) changed bugfix when using counts based estimator for major/minor 2) keepsites is now using the effecive number of samples in all cases 3) changed output of maf to a 'nicer' format
* 0.584 updated internal testing scripts.
* 0.585 1) fixed 'baq complains even though -ref was supplied' 2) fixed -doMajorMinor 4 and -doMajorMinor 5 (sites not discarded) 3) added tremendously better information for the -sites 4) added some check for -doPost 5) program can now exit uncleanly if ctrl+c is pressed 3 times. 6) added an else to catch wrong arg in thetaStat.
*0.586 1) fixed parsing of pars if input is -sim1 2) fixed a bug in -doFasta 3 (the ebd one) 3) fixed a printout problem in -doFasta
*0.587 1) fixed a number of minor instances where memory wasn't being freed/deleted (mostly for keeping valgrind silent) 2) fixed a memleak if -sites files contained 4 columns but -doMajorMinor not 3. 3) fixed a memory leak that could occur if -doPost 2 and -doMaf 0.
* 0.588 1) fixed small printout error that could cause segfaults in rare cases. 2) change stderr to pos+1 3) changed some checks of user supplied pars 4) fixed a stack overflow if a very long -rf file was supplied.

=old=
==Dirty==
* 0.16 Is now bundled with SAMtools-0.1.17 and the mpileup (and friends) command be used for passing data to dirty.
* 0.17 added extra options -minInd and -minMaf, for only printing and using sites above a threshold
* 0.18 added option to pass reference and ancestral allele as fasta files.(using faidx format) (doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE)
* 0.19 added support for tglf inputfiles, -tglf -posfile see runexamples, also added the likeratio test for snp calling
* 0.20 Added the check for missing data, before the major/minor. included -realSFS, changed the deallocation of the -doMAF results, such that its proper cleaned up.
* 0.21 refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that prevented -samglf and samglfclean from working)
* 0.22 Well this update was a mixture of edits from [[user:albrecht]] and BGI so it's difficult to give a concise description
* 0.23 Program can now read simulated files (single pop only). An example can be seen in "full example ... sfs" and input types.
* 0.24 added the tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso
* 0.25 the depth is now being populated when using mpileup -g. The program can now get the counts from mpileup
==ANGSD==
=<0.5=
* 0.01.a - 0.01.b The bfgs now supports threading, maybe anders implemented a heterozygosity estimator.
* 0.01.c A problem if we didn't observe any llh, caused the MAF estimator to 'nan'.
* 0.02 Fixed small bug in bfgs optimization of sfs optimization. When choosing a region bigger than what was covered by the .sfs file the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
* 0.03 added and documented genotypecaller, can dump counts,-realSFS 1 dumps positions, -realSFS 2 is deprecated,S,pi and tajima has been added to sfstools along with possibility to do prior
* 0.3 clean version with less features. The lost features will be reintroduced later.
* 0.43 first very clean version, everything should be included
* 0.441 rewrote the SOAPsnp GL model, -L and -maxQ is not needed anymore. Also added an option to choose an output dir for the recalibration matrix
* 0.4471 error estimation is now working, the fasta reading is now threadsafe. all GLs are now likeratios.
* 0.512 After 0.500 we have changed the internal structure such that each chunk is enforced to be on the same chr. version c) fixes a problem of hardclipping
* 0.515 A lot of legacy code has been removed from mUppile.cpp. Program can now use remote files, build on code from SAMtools
* 0.520 A lot more legacy code has been removed from mUpPile.cpp. Program now does baq and adjustment of mapQ similar to -C in samtools. Also compiles on osx, but this is not supported
* 0.535 Bug in internal representation of mapQ's (only problematic for mapQ>128), we now use the flag to determine if a read has mapped. calcstat is now deprecated, users should use the bgid program now.
* 0.538 changed position output in association part. Fixed incorrect assert assumption in mUpPile.cpp. Added some downsampling options for errorEst and changed internal buffering when reading beagle files to allow for >10k individuals.
* 0.549 too many changes to remember

=<0.5 and <0.570=
* 0.551 tajima paper is now published, so the emOptim2 and bgid has now been properly documented. plink output is now supported and some snp filters can be outputted.
* 0.552 minor bug when calling genotypes without defining postcutoff -> missingness couldnt occur. removed the optimSFS and emOptim from the default compilelist
* 0.553 uint removed from code.
* 0.554 plugged in sfstools functionality into main angsd, (ability to output log posts)
* 0.555 anders added some consensus stuff
* 0.556 updated the filtering (if binary rep of keep file is incomplete it is removed again. It checks timestamps to see if file has been updated), folded spectra analysis should now be working
* 0.557 There was a bug in the realsfs part of the code, that was created in the 0.556 version. 0.557 is simply a fix of this, and the removal of a warning compiler flag in the msToGlf subprogram. We only observed the problematic compiler flag on a osx machine
* 0.558 tempversion,from this version bgid is now called thetaStat
* 0.559 abbababa
* 0.560 analysisCount.cpp has been updated to the nice standard of the wiki
* 0.561 program now compiles on clang, many small compiler warnings has been fixed.
* 0.562 Merge of forked versions, abba-baba fasta.
* 0.563 minQ filter has been moved to a much earlier step. Previously it was downstream classes that checked this. Now a base will be set to 'n' if it is below the threshold
* 0.564 cleaned up funky pars maf/asso such that all results are in ->extras[]
* 0.565 moved file reading stuff from shared to analysisFunction in namespace ail::
* 0.566 cleaned up small things, added a newer version of hetplas
* 0.567 cleaned up small things again. Started to add single pars e.g. -P -b
* 0.568 refactored compile order in general.cpp
* 0.569 added saf genotype calling, changed name from -realsfs to -doSaf
* 0.570 modified emOptim2 to estimate nSites and tell how much memory it will use, fixed empty -bam file

Thorfinn

2020-04-27T15:37:58Z

Thorfinn:

=How to deploy=
This page describes how a new version should be put on the wiki and github
==Make a combined angsd htslib to put on wiki download==
Copy the latest sources, make a new annotated tag, and push the tarball to the wiki
<pre>
VERSION=0.932
mkdir delme
cd delme
git clone --depth=1 https://github.com/ANGSD/angsd
git clone --depth=1 https://github.com/SAMtools/htslib
cd angsd; git tag -a ${VERSION} -m "time for version ${VERSION}"
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
cd ..
tar --exclude='.git' -czvf angsd${VERSION}.tar.gz angsd htslib/
scp angsd${VERSION}.tar.gz software@popgen.dk:/home/software/download/angsd/
</pre>

==Make a new github version to put on github==

<pre>
VERSION=0.933
cd angsd/
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
git commit Makefile -m "Preparing new ${VERSION} version"
git tag -a ${VERSION} -m "time for a new version ${VERSION}"
git push
git push --tags
</pre>

Thetas,Tajima,Neutrality tests

2020-04-01T11:06:34Z

Thorfinn: /* Step 1: Finding a 'global estimate' of the SFS */

This method will estimate different thetas (population scaled mutation rate) and can, based on these thetas, calculate Tajima's D and various other neutrality test statistics. The method is described in [[Korneliussen2013]].

* NB Information on this website is for version 0.917-33-g6d2aec8 or higher.
* NB [[Korneliussen2013]] covers two methods:
# using an ML method
# using the empirical Bayes (EB) method. The information on this page relates to the EB method.
To perform the ML method, you should use the [[SFS Estimation]] method and define the region of interest.

=Quick Example=
Below is a chain of commands used for calculating statistics. These are based on the test files that can be downloaded from the [[Quick Start ]] page.

It's a 3 step procedure
# Estimate a site frequency spectrum. Output is the '''out.sfs''' file. This is what is passed as the '''-sfs''' argument in step 2.
# Calculate per-site thetas. Output is the '''.thetas.idx/.thetas.gz''' files. These contain the binary per-site estimates of the thetas.
# Calculate neutrality test statistics. Output is a '''.thetas.idx.pestPG''' file.
==Full command list for below examples==
Here is the chain of commands required to estimate the thetas and calculate the neutrality test statistics. These commands are described in detail in the following '''step 1,... step 3b''' subsections. If you do not have the ancestral state you can simply use the assembly you have mapped against, but remember to add -fold 1 in the 'realSFS' and 'realSFS saf2theta' steps.
<pre>
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
./misc/realSFS out.saf.idx -P 24 > out.sfs
#use -fold 1 in the above command if you dont have ancestral state.
./misc/realSFS saf2theta out.saf.idx -outname out -sfs out.sfs
#Estimate for every Chromosome/scaffold
./misc/thetaStat do_stat out.thetas.idx
#Do a sliding window analysis of the per-site thetas.
./misc/thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>
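
The chain above can also be assembled programmatically, which makes it easy to switch the folded variant on and off consistently. The sketch below only builds and prints the command lines and does not run anything. The function name <code>theta_commands</code> and its parameters are made up for this example, and all flags are copied from the commands above.

```python
import shlex


def theta_commands(bamlist="bam.filelist", anc="chimpHg19.fa",
                   out="out", threads=24, fold=False):
    """Return a list of (argv, stdout_file) pairs for the steps above.

    stdout_file is None unless the step's output is redirected to a file.
    With fold=True, '-fold 1' is added to both realSFS steps.
    """
    fold_args = ["-fold", "1"] if fold else []
    saf = f"{out}.saf.idx"
    return [
        (["./angsd", "-bam", bamlist, "-doSaf", "1", "-anc", anc,
          "-GL", "1", "-P", str(threads), "-out", out], None),
        (["./misc/realSFS", saf, "-P", str(threads)] + fold_args,
         f"{out}.sfs"),
        (["./misc/realSFS", "saf2theta", saf, "-outname", out,
          "-sfs", f"{out}.sfs"] + fold_args, None),
        (["./misc/thetaStat", "do_stat", f"{out}.thetas.idx"], None),
    ]


for argv, stdout_file in theta_commands(fold=True):
    command = " ".join(shlex.quote(a) for a in argv)
    if stdout_file is not None:
        command += " > " + shlex.quote(stdout_file)
    print(command)
```

Each <code>argv</code> list can be handed to <code>subprocess.run</code>, opening <code>stdout_file</code> for writing where it is set.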

==Step 1: Finding a 'global estimate' of the SFS==

First estimate the site allele frequency likelihood
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
<pre class="mw-collapsible-content">

-> Reading fasta: chimpHg19.fa
-> Parsing 10 number of samples
-> Printing at chr: 20 pos:14095817 chunknumber 3500
-> Done reading data waiting for calculations to finish
-> Calling destroy
-> Done waiting for threads
-> Output filenames:
->"out.arg"
->"out.saf"
->"out.saf.pos.gz"
-> Mon Jun 30 12:02:58 2014
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 47.19 sec
[ALL done] walltime used = 43.00 sec

</pre>
</div>

Obtain the maximum likelihood estimate of the SFS using the '''realSFS''' program found in the misc subfolder. (See more here [[realSFS]])
<pre>
./misc/realSFS out.saf.idx -P 24 > out.sfs
</pre>
Or, if you want to calculate the folded spectrum:
<pre>
./misc/realSFS out.saf.idx -P 24 -fold 1 > out.sfs
</pre>

To plot the SFS in R :
<pre>
s<-scan('out.sfs')        # read the estimated SFS
s<-s[-c(1,length(s))]     # drop the first and last categories
s<-s/sum(s)               # normalize to proportions
barplot(s,names=1:length(s),main='SFS')
</pre>

==Step 2: Calculate the thetas for each site==
<pre>
realSFS saf2theta out.saf.idx -sfs out.sfs -outname out
</pre>
The output from the above command is two files, out.thetas.gz and out.thetas.idx. A formal description of these files can be found in doc/formats.pdf in the angsd package. It is possible to extract the log-scale per-site thetas using the ./thetaStat print program.

<div class="toccolours mw-collapsible mw-collapsed">
thetaStat print out.thetas.idx 2>/dev/null |head
<pre class="mw-collapsible-content">
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -10.339284 -12.069325 -9.000927 -15.852173 -12.739969
1 14000033 -10.437878 -12.185619 -9.080596 -16.001343 -12.856984
1 14000034 -10.373872 -12.110464 -9.028572 -15.905591 -12.781380
1 14000035 -10.528192 -12.290763 -9.154920 -16.133823 -12.962708
1 14000036 -10.322074 -12.051400 -8.985016 -15.834049 -12.722040
1 14000037 -10.304955 -12.028814 -8.973260 -15.800330 -12.699204
1 14000038 -10.108563 -11.791546 -8.819884 -15.486384 -12.460146
1 14000039 -10.542117 -12.306631 -9.166698 -16.153168 -12.978650
1 14000040 -10.688401 -12.473763 -9.290272 -16.358398 -13.146564
</pre>
</div>
By default, the print command will also output the contents of the index file to stderr.

==Step 3a: Estimate Tajimas D and other statistics==
<pre>
#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.idx
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
cat out.thetas.idx.pestPG
<pre class="mw-collapsible-content">
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759
(0,97855)(14001686,14100094)(0,14100094) 9 7050047 88.777475 102.853333 64.660948 104.749694 103.801516 0.660983 0.776148 0.611265 -0.021256 0.184875 97855
(0,98031)(13999906,14100096)(0,14100096) 10 7050048 129.583334 134.877160 88.135115 213.231615 174.054390 0.170681 0.654145 0.724072 -0.602284 0.375595 98031
(0,99220)(13999900,14100060)(0,14100060) 11 7050030 66.349155 79.423643 60.194045 68.903312 74.163477 0.819589 0.520022 0.207421 0.157614 0.128409 99220
(0,99861)(13999913,14100078)(0,14100078) 12 7050039 86.461303 81.630083 96.156392 110.974922 96.302507 -0.232902 -0.302980 -0.252190 -0.337701 0.124323 99861
(0,98258)(13999943,14100097)(0,14100097) 16 7050048 83.191170 99.392421 77.561510 106.148748 102.770584 0.811500 0.472922 0.152079 -0.080798 0.257008 98258
(0,99428)(13999902,14100095)(0,14100095) 17 7050047 90.254620 99.816352 65.610351 113.328929 106.572638 0.441707 0.683942 0.614609 -0.148988 0.197530 99428
(0,97118)(13999898,14100071)(0,14100071) 18 7050035 79.843256 75.282296 86.844252 67.720321 71.501308 -0.237958 -0.260778 -0.196888 0.094212 -0.114062 97118
(0,93783)(13999895,14100089)(0,14100089) 19 7050044 54.311523 49.839190 64.913940 72.868913 61.354048 -0.341795 -0.495649 -0.434079 -0.421111 0.141133 93783
(0,98938)(13999916,14100091)(0,14100091) 20 7050045 68.148147 63.323800 78.463736 56.040370 59.682084 -0.294508 -0.398845 -0.338673 0.106250 -0.135474 98938
</pre>
</div>

==Step 3b: Sliding Window example==
We can easily do a sliding window analysis by adding -win/-step arguments to the last command. [[ thetaStat ]]
<pre>
thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>
This will calculate the test statistic using a window size of 50kb and a step size of 10kb.
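
As a rough illustration of what a 50kb window with a 10kb step means, the following Python sketch enumerates overlapping windows over a region. It is not taken from thetaStat: the function name, the alignment of window starts to multiples of the step size, and the half-open intervals are all assumptions made for illustration, and the windows actually reported by thetaStat may be placed differently.

```python
def sliding_windows(first_pos, last_pos, win=50000, step=10000):
    """Return half-open (start, stop) windows of width `win`, placed every `step` bases.

    Assumption: window starts are aligned to multiples of `step`.
    """
    windows = []
    start = (first_pos // step) * step
    while start + win <= last_pos:
        windows.append((start, start + win))
        start += step
    return windows


# Hypothetical region boundaries, roughly matching the test data on this page.
wins = sliding_windows(14000000, 14100000)
print(len(wins), wins[0], wins[-1])
```

With a step smaller than the window, neighbouring windows share most of their sites, so the statistics of adjacent windows are not independent.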

=Example Output=
<pre>
- Output from ./thetaStat print thetas.idx is the log-scaled per-site estimates of the thetas
- Output in the pestPG file is the sum of the per-site estimates for a region
</pre>
==./thetaStat print angsdput.thetas.idx==
<pre>
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -9.457420 -10.372069 -8.319252 -13.025778 -10.997194
1 14000033 -9.463637 -10.379368 -8.324414 -13.035780 -11.004670
1 14000034 -9.463740 -10.379488 -8.324500 -13.035942 -11.004793
1 14000035 -9.463603 -10.379328 -8.324386 -13.035725 -11.004629
1 14000036 -9.323246 -10.218453 -8.204848 -12.826627 -10.840519
1 14000037 -9.179270 -10.048883 -8.086425 -12.596436 -10.666670
1 14000038 -9.004664 -9.845473 -7.941453 -12.328274 -10.458416
1 14000039 -9.327033 -10.222983 -8.207914 -12.833007 -10.845176
1 14000040 -9.621554 -10.557563 -8.461745 -13.262415 -11.185971
1 14000041 -9.617449 -10.552869 -8.458225 -13.256257 -11.181185
1 14000042 -7.337841 -8.161756 -204.045433 -5.457443 -6.085818
1 14000043 -9.570405 -10.502160 -8.415195 -13.197596 -11.129976
1 14000044 -9.511097 -10.434558 -8.364249 -13.110037 -11.061100
1 14000045 -9.563664 -10.494371 -8.409489 -13.187203 -11.122022
1 14000046 -9.617690 -10.555402 -8.456395 -13.265004 -11.184107
1 14000047 -9.563722 -10.494438 -8.409538 -13.187292 -11.122090
1 14000048 -9.856578 -10.819096 -8.669691 -13.587898 -11.451396
</pre>
;1. chromosome
;2. position
;3. ThetaWatterson
;4. ThetaD (nucleotide diversity)
;5. Theta? (singleton category)
;6. ThetaH
;7. ThetaL
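
To relate the log-scaled per-site values to the per-region sums in the pestPG file, the Python sketch below exponentiates each per-site value and adds them up per estimator. It assumes the values are natural logarithms and that the input is the whitespace-separated text printed by thetaStat print; the function name `sum_thetas` is hypothetical, and only the first three sites from the listing above are used.

```python
import math

# First three sites copied from the listing above.
PRINT_OUTPUT = """\
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -9.457420 -10.372069 -8.319252 -13.025778 -10.997194
1 14000033 -9.463637 -10.379368 -8.324414 -13.035780 -11.004670
1 14000034 -9.463740 -10.379488 -8.324500 -13.035942 -11.004793
"""


def sum_thetas(text):
    """Sum exp(log theta) over all sites, separately for each estimator."""
    lines = text.strip().splitlines()
    names = lines[0].lstrip("#").split()[2:]  # skip the Chromo and Pos columns
    totals = dict.fromkeys(names, 0.0)
    n_sites = 0
    for line in lines[1:]:
        fields = line.split()
        for name, value in zip(names, fields[2:]):
            # Assumption: the printed values are natural logs.
            totals[name] += math.exp(float(value))
        n_sites += 1
    return totals, n_sites


totals, n_sites = sum_thetas(PRINT_OUTPUT)
print(n_sites, totals["Watterson"], totals["Pairwise"])
```

A very negative entry, such as the -204.045433 singleton value at position 14000042 above, contributes essentially nothing to the sum.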

==.thetas.idx.pestPG==
The .pestPG file is a 14 column file (tab separated). The first column contains information about the region. The second and third columns are the reference name and the center of the window.

We then have 5 different estimators of theta: Watterson, pairwise, FuLi, fayH, L.
And we have 5 different neutrality test statistics: Tajima's D, Fu&Li's F, Fu&Li's D, Fay's H, Zeng's E.
The final column is the effective number of sites with data in the window.

<pre>
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759

</pre>
Format is:

<code>
(indexStart,indexStop)(posStart,posStop)(regStart,regStop) chrname wincenter tW tP tF tH tL tajD fulif fuliD fayH zengsE numSites
</code>

Most likely you are only interested in the wincenter (column 3) and column 9, which is the Tajima's D statistic.

The first 3 columns relate to the region. The next 5 columns are 5 different estimators of theta, and the next 5 columns are neutrality test statistics. The final column is the number of sites with data in the region.

The first '''()()()''' is mainly used for debugging the sliding window program. The interpretation is:
* posStart and posStop are the first and last physical positions of the sites included in the analysis.
* regStart and regStop delimit the physical region for which the analysis is performed. Therefore posStart and posStop always lie within regStart and regStop.
* indexStart and indexStop are the positions within the internal array.
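
The column layout described above can be read with a few lines of Python. This is a minimal sketch, assuming whitespace-separated fields in the order given in the format line; the function name `parse_pestpg_line` and the dictionary keys are hypothetical, and the example line is copied from the output shown earlier.

```python
import re

# Names of columns 2-14, following the format line above.
COLUMNS = ["chr", "wincenter", "tW", "tP", "tF", "tH", "tL",
           "tajD", "fulif", "fuliD", "fayH", "zengsE", "numSites"]


def parse_pestpg_line(line):
    """Split one data line of a .pestPG file into a dictionary."""
    fields = line.split()
    # Column 1 holds three "(a,b)" pairs: index, position and region ranges.
    pairs = [(int(a), int(b)) for a, b in re.findall(r"\((\d+),(\d+)\)", fields[0])]
    rec = {"index": pairs[0], "pos": pairs[1], "reg": pairs[2]}
    rec["chr"] = fields[1]
    rec["wincenter"] = int(fields[2])
    for name, value in zip(COLUMNS[2:12], fields[3:13]):
        rec[name] = float(value)
    rec["numSites"] = int(fields[13])
    return rec


line = ("(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 "
        "64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 "
        "-0.048444 98316")
rec = parse_pestpg_line(line)
print(rec["chr"], rec["wincenter"], rec["tajD"])
# The thetas are sums over the region; dividing by numSites gives a per-site value.
print(rec["tW"] / rec["numSites"])
```

Header lines in the file start with '#', so a real reader would skip those before calling the parser.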

=Unknown ancestral state (folded sfs)=

If you don't have the ancestral states, you can still calculate the Watterson and Tajima thetas, which means you can still calculate the Tajima's D neutrality test statistic. This requires you to use the folded SFS. The output files will have the same format, but only thetaW, thetaD and Tajima's D are meaningful.

There was previously an example below that showed how to perform this analysis. This information has now been added to the examples above (notice the -fold 1 step in realSFS).

=Citation=
[[Korneliussen2013]]

Thetas,Tajima,Neutrality tests

2020-04-01T11:05:30Z

Thorfinn: /* Full command list for below examples */

This method estimates different thetas (population scaled mutation rate) and, based on these thetas, calculates Tajima's D and various other neutrality test statistics. The method is described in [[Korneliussen2013]].

* NB Information on this website is for version 0.917-33-g6d2aec8 or higher.
* NB [[Korneliussen2013]] covers two methods:
# using an ML method
# using the empirical Bayes (EB) method. The information on this page relates to the EB method.
To perform the ML method, you should use the [[SFS Estimation]] method and define the region of interest.

=Quick Example=
Below is a chain of commands used for calculating the statistics. These are based on the test files that can be downloaded on the [[Quick Start ]] page.

It's a 3-step procedure:
# Estimate a site frequency spectrum. Output is the '''out.sfs''' file. This is what is used as the '''-sfs''' argument in step 2.
# Calculate per-site thetas. Output is the '''.thetas.idx/.thetas.gz''' files. These contain the binary per-site estimates of the thetas.
# Calculate neutrality test statistics. Output is a '''.thetas.idx.pestPG''' file.
==Full command list for below examples==
Here is the chain of commands required to estimate the thetas and calculate the neutrality test statistics. These commands are described in detail in the following '''step 1,... step 3b''' subsections. If you do not have the ancestral state you can simply use the assembly you have mapped against, but remember to add -fold 1 in the 'realSFS' and 'realSFS saf2theta' steps.
<pre>
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
./misc/realSFS out.saf.idx -P 24 > out.sfs
#use -fold 1 in the above command if you dont have ancestral state.
./misc/realSFS saf2theta out.saf.idx -outname out -sfs out.sfs
#Estimate for every Chromosome/scaffold
./misc/thetaStat do_stat out.thetas.idx
#Do a sliding window analysis based on the output from the make_bed command.
./misc/thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>

==Step 1: Finding a 'global estimate' of the SFS==

First estimate the site allele frequency likelihood
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
<pre class="mw-collapsible-content">

-> Reading fasta: chimpHg19.fa
-> Parsing 10 number of samples
-> Printing at chr: 20 pos:14095817 chunknumber 3500
-> Done reading data waiting for calculations to finish
-> Calling destroy
-> Done waiting for threads
-> Output filenames:
->"out.arg"
->"out.saf"
->"out.saf.pos.gz"
-> Mon Jun 30 12:02:58 2014
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 47.19 sec
[ALL done] walltime used = 43.00 sec

</pre>
</div>

Obtain the maximum likelihood estimate of the SFS using the '''realSFS''' program found in the misc subfolder. (See more here [[realSFS]])
<pre>
./misc/realSFS out.saf.idx -P 24 > out.sfs
</pre>

To plot the SFS in R :
<pre>
s<-scan('out.sfs')
s<-s[-c(1,length(s))]
s<-s/sum(s)
barplot(s,names=1:length(s),main='SFS')
</pre>

==Step 2: Calculate the thetas for each site==
<pre>
realSFS saf2theta out.saf.idx -sfs out.sfs -outname out
</pre>
The above command outputs two files: out.thetas.gz and out.thetas.idx. A formal description of these files can be found in doc/formats.pdf in the angsd package. The log-scale per-site thetas can be extracted using the ./thetaStat print program.

<div class="toccolours mw-collapsible mw-collapsed">
thetaStat print out.thetas.idx 2>/dev/null |head
<pre class="mw-collapsible-content">
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -10.339284 -12.069325 -9.000927 -15.852173 -12.739969
1 14000033 -10.437878 -12.185619 -9.080596 -16.001343 -12.856984
1 14000034 -10.373872 -12.110464 -9.028572 -15.905591 -12.781380
1 14000035 -10.528192 -12.290763 -9.154920 -16.133823 -12.962708
1 14000036 -10.322074 -12.051400 -8.985016 -15.834049 -12.722040
1 14000037 -10.304955 -12.028814 -8.973260 -15.800330 -12.699204
1 14000038 -10.108563 -11.791546 -8.819884 -15.486384 -12.460146
1 14000039 -10.542117 -12.306631 -9.166698 -16.153168 -12.978650
1 14000040 -10.688401 -12.473763 -9.290272 -16.358398 -13.146564
</pre>
</div>
By default the print command will also output the contents of the index file to stderr.

==Step 3a: Estimate Tajimas D and other statistics==
<pre>
#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.idx
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
cat out.thetas.idx.pestPG
<pre class="mw-collapsible-content">
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759
(0,97855)(14001686,14100094)(0,14100094) 9 7050047 88.777475 102.853333 64.660948 104.749694 103.801516 0.660983 0.776148 0.611265 -0.021256 0.184875 97855
(0,98031)(13999906,14100096)(0,14100096) 10 7050048 129.583334 134.877160 88.135115 213.231615 174.054390 0.170681 0.654145 0.724072 -0.602284 0.375595 98031
(0,99220)(13999900,14100060)(0,14100060) 11 7050030 66.349155 79.423643 60.194045 68.903312 74.163477 0.819589 0.520022 0.207421 0.157614 0.128409 99220
(0,99861)(13999913,14100078)(0,14100078) 12 7050039 86.461303 81.630083 96.156392 110.974922 96.302507 -0.232902 -0.302980 -0.252190 -0.337701 0.124323 99861
(0,98258)(13999943,14100097)(0,14100097) 16 7050048 83.191170 99.392421 77.561510 106.148748 102.770584 0.811500 0.472922 0.152079 -0.080798 0.257008 98258
(0,99428)(13999902,14100095)(0,14100095) 17 7050047 90.254620 99.816352 65.610351 113.328929 106.572638 0.441707 0.683942 0.614609 -0.148988 0.197530 99428
(0,97118)(13999898,14100071)(0,14100071) 18 7050035 79.843256 75.282296 86.844252 67.720321 71.501308 -0.237958 -0.260778 -0.196888 0.094212 -0.114062 97118
(0,93783)(13999895,14100089)(0,14100089) 19 7050044 54.311523 49.839190 64.913940 72.868913 61.354048 -0.341795 -0.495649 -0.434079 -0.421111 0.141133 93783
(0,98938)(13999916,14100091)(0,14100091) 20 7050045 68.148147 63.323800 78.463736 56.040370 59.682084 -0.294508 -0.398845 -0.338673 0.106250 -0.135474 98938
</pre>
</div>

==Step 3b: Sliding Window example==
We can easily do a sliding window analysis by adding -win/-step arguments to the last command. [[ thetaStat ]]
<pre>
thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>
This will calculate the test statistic using a window size of 50kb and a step size of 10kb.

=Example Output=
<pre>
- Output from ./thetaStat print thetas.idx is the log-scaled per-site estimates of the thetas
- Output in the pestPG file is the sum of the per-site estimates for a region
</pre>
==./thetaStat print angsdput.thetas.idx==
<pre>
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -9.457420 -10.372069 -8.319252 -13.025778 -10.997194
1 14000033 -9.463637 -10.379368 -8.324414 -13.035780 -11.004670
1 14000034 -9.463740 -10.379488 -8.324500 -13.035942 -11.004793
1 14000035 -9.463603 -10.379328 -8.324386 -13.035725 -11.004629
1 14000036 -9.323246 -10.218453 -8.204848 -12.826627 -10.840519
1 14000037 -9.179270 -10.048883 -8.086425 -12.596436 -10.666670
1 14000038 -9.004664 -9.845473 -7.941453 -12.328274 -10.458416
1 14000039 -9.327033 -10.222983 -8.207914 -12.833007 -10.845176
1 14000040 -9.621554 -10.557563 -8.461745 -13.262415 -11.185971
1 14000041 -9.617449 -10.552869 -8.458225 -13.256257 -11.181185
1 14000042 -7.337841 -8.161756 -204.045433 -5.457443 -6.085818
1 14000043 -9.570405 -10.502160 -8.415195 -13.197596 -11.129976
1 14000044 -9.511097 -10.434558 -8.364249 -13.110037 -11.061100
1 14000045 -9.563664 -10.494371 -8.409489 -13.187203 -11.122022
1 14000046 -9.617690 -10.555402 -8.456395 -13.265004 -11.184107
1 14000047 -9.563722 -10.494438 -8.409538 -13.187292 -11.122090
1 14000048 -9.856578 -10.819096 -8.669691 -13.587898 -11.451396
</pre>
;1. chromosome
;2. position
;3. ThetaWatterson
;4. ThetaD (nucleotide diversity)
;5. Theta? (singleton category)
;6. ThetaH
;7. ThetaL

==.thetas.idx.pestPG==
The .pestPG file is a 14 column file (tab separated). The first column contains information about the region. The second and third columns are the reference name and the center of the window.

We then have 5 different estimators of theta: Watterson, pairwise, FuLi, fayH, L.
And we have 5 different neutrality test statistics: Tajima's D, Fu&Li's F, Fu&Li's D, Fay's H, Zeng's E.
The final column is the effective number of sites with data in the window.

<pre>
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759

</pre>
Format is:

<code>
(indexStart,indexStop)(posStart,posStop)(regStart,regStop) chrname wincenter tW tP tF tH tL tajD fulif fuliD fayH zengsE numSites
</code>

Most likely you are only interested in the wincenter (column 3) and column 9, which is the Tajima's D statistic.

The first 3 columns relate to the region. The next 5 columns are 5 different estimators of theta, and the next 5 columns are neutrality test statistics. The final column is the number of sites with data in the region.

The first '''()()()''' is mainly used for debugging the sliding window program. The interpretation is:
* posStart and posStop are the first and last physical positions of the sites included in the analysis.
* regStart and regStop delimit the physical region for which the analysis is performed. Therefore posStart and posStop always lie within regStart and regStop.
* indexStart and indexStop are the positions within the internal array.

=Unknown ancestral state (folded sfs)=

If you don't have the ancestral states, you can still calculate the Watterson and Tajima thetas, which means you can still calculate the Tajima's D neutrality test statistic. This requires you to use the folded SFS. The output files will have the same format, but only thetaW, thetaD and Tajima's D are meaningful.

There was previously an example below that showed how to perform this analysis. This information has now been added to the examples above (notice the -fold 1 step in realSFS).
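
To make the term "folded" concrete, the Python sketch below folds an unfolded spectrum by merging each derived-allele count i with its complement n-i, which is the standard definition when the ancestral allele is unknown. It only illustrates the concept and is not how realSFS applies -fold 1 internally; the function name and the example numbers are hypothetical.

```python
def fold_sfs(sfs):
    """Fold an unfolded SFS with categories 0..n into categories 0..n//2."""
    n = len(sfs) - 1
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        # The middle category (i == n - i) has no partner and is kept as is.
        folded.append(sfs[i] if i == j else sfs[i] + sfs[j])
    return folded


# Hypothetical unfolded spectrum for 2 diploid samples (categories 0..4).
unfolded = [900.0, 50.0, 25.0, 15.0, 10.0]
folded = fold_sfs(unfolded)
print(folded)
```

Folding discards the distinction between low and high derived-allele frequencies, which is why the estimators that rely on that distinction are not meaningful in this mode.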

=Citation=
[[Korneliussen2013]]

Thetas,Tajima,Neutrality tests

2020-04-01T11:03:25Z

Thorfinn: /* Full command list for below examples */

This method estimates different thetas (population scaled mutation rate) and, based on these thetas, calculates Tajima's D and various other neutrality test statistics. The method is described in [[Korneliussen2013]].

* NB Information on this website is for version 0.917-33-g6d2aec8 or higher.
* NB [[Korneliussen2013]] covers two methods:
# using an ML method
# using the empirical Bayes (EB) method. The information on this page relates to the EB method.
To perform the ML method, you should use the [[SFS Estimation]] method and define the region of interest.

=Quick Example=
Below is a chain of commands used for calculating the statistics. These are based on the test files that can be downloaded on the [[Quick Start ]] page.

It's a 3-step procedure:
# Estimate a site frequency spectrum. Output is the '''out.sfs''' file. This is what is used as the '''-sfs''' argument in step 2.
# Calculate per-site thetas. Output is the '''.thetas.idx/.thetas.gz''' files. These contain the binary per-site estimates of the thetas.
# Calculate neutrality test statistics. Output is a '''.thetas.idx.pestPG''' file.
==Full command list for below examples==
Here is the chain of commands required to estimate the thetas and calculate the neutrality test statistics. These commands are described in detail in the following '''step 1,... step 3b''' subsections.
<pre>
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
./misc/realSFS out.saf.idx -P 24 > out.sfs
#use -fold 1 in the above command if you dont have ancestral state.
./misc/realSFS saf2theta out.saf.idx -outname out -sfs out.sfs
#Estimate for every Chromosome/scaffold
./misc/thetaStat do_stat out.thetas.idx
#Do a sliding window analysis based on the output from the make_bed command.
./misc/thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>

==Step 1: Finding a 'global estimate' of the SFS==

First estimate the site allele frequency likelihood
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
<pre class="mw-collapsible-content">

-> Reading fasta: chimpHg19.fa
-> Parsing 10 number of samples
-> Printing at chr: 20 pos:14095817 chunknumber 3500
-> Done reading data waiting for calculations to finish
-> Calling destroy
-> Done waiting for threads
-> Output filenames:
->"out.arg"
->"out.saf"
->"out.saf.pos.gz"
-> Mon Jun 30 12:02:58 2014
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 47.19 sec
[ALL done] walltime used = 43.00 sec

</pre>
</div>

Obtain the maximum likelihood estimate of the SFS using the '''realSFS''' program found in the misc subfolder. (See more here [[realSFS]])
<pre>
./misc/realSFS out.saf.idx -P 24 > out.sfs
</pre>

To plot the SFS in R :
<pre>
s<-scan('out.sfs')
s<-s[-c(1,length(s))]
s<-s/sum(s)
barplot(s,names=1:length(s),main='SFS')
</pre>

==Step 2: Calculate the thetas for each site==
<pre>
realSFS saf2theta out.saf.idx -sfs out.sfs -outname out
</pre>
The above command outputs two files: out.thetas.gz and out.thetas.idx. A formal description of these files can be found in doc/formats.pdf in the angsd package. The log-scale per-site thetas can be extracted using the ./thetaStat print program.

<div class="toccolours mw-collapsible mw-collapsed">
thetaStat print out.thetas.idx 2>/dev/null |head
<pre class="mw-collapsible-content">
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -10.339284 -12.069325 -9.000927 -15.852173 -12.739969
1 14000033 -10.437878 -12.185619 -9.080596 -16.001343 -12.856984
1 14000034 -10.373872 -12.110464 -9.028572 -15.905591 -12.781380
1 14000035 -10.528192 -12.290763 -9.154920 -16.133823 -12.962708
1 14000036 -10.322074 -12.051400 -8.985016 -15.834049 -12.722040
1 14000037 -10.304955 -12.028814 -8.973260 -15.800330 -12.699204
1 14000038 -10.108563 -11.791546 -8.819884 -15.486384 -12.460146
1 14000039 -10.542117 -12.306631 -9.166698 -16.153168 -12.978650
1 14000040 -10.688401 -12.473763 -9.290272 -16.358398 -13.146564
</pre>
</div>
By default the print command will also output the contents of the index file to stderr.

==Step 3a: Estimate Tajimas D and other statistics==
<pre>
#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.idx
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
cat out.thetas.idx.pestPG
<pre class="mw-collapsible-content">
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759
(0,97855)(14001686,14100094)(0,14100094) 9 7050047 88.777475 102.853333 64.660948 104.749694 103.801516 0.660983 0.776148 0.611265 -0.021256 0.184875 97855
(0,98031)(13999906,14100096)(0,14100096) 10 7050048 129.583334 134.877160 88.135115 213.231615 174.054390 0.170681 0.654145 0.724072 -0.602284 0.375595 98031
(0,99220)(13999900,14100060)(0,14100060) 11 7050030 66.349155 79.423643 60.194045 68.903312 74.163477 0.819589 0.520022 0.207421 0.157614 0.128409 99220
(0,99861)(13999913,14100078)(0,14100078) 12 7050039 86.461303 81.630083 96.156392 110.974922 96.302507 -0.232902 -0.302980 -0.252190 -0.337701 0.124323 99861
(0,98258)(13999943,14100097)(0,14100097) 16 7050048 83.191170 99.392421 77.561510 106.148748 102.770584 0.811500 0.472922 0.152079 -0.080798 0.257008 98258
(0,99428)(13999902,14100095)(0,14100095) 17 7050047 90.254620 99.816352 65.610351 113.328929 106.572638 0.441707 0.683942 0.614609 -0.148988 0.197530 99428
(0,97118)(13999898,14100071)(0,14100071) 18 7050035 79.843256 75.282296 86.844252 67.720321 71.501308 -0.237958 -0.260778 -0.196888 0.094212 -0.114062 97118
(0,93783)(13999895,14100089)(0,14100089) 19 7050044 54.311523 49.839190 64.913940 72.868913 61.354048 -0.341795 -0.495649 -0.434079 -0.421111 0.141133 93783
(0,98938)(13999916,14100091)(0,14100091) 20 7050045 68.148147 63.323800 78.463736 56.040370 59.682084 -0.294508 -0.398845 -0.338673 0.106250 -0.135474 98938
</pre>
</div>

==Step 3b: Sliding Window example==
We can easily do a sliding window analysis by adding -win/-step arguments to the last command. [[ thetaStat ]]
<pre>
thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>
This will calculate the test statistic using a window size of 50kb and a step size of 10kb.

=Example Output=
<pre>
- Output from ./thetaStat print thetas.idx is the log-scaled per-site estimates of the thetas
- Output in the pestPG file is the sum of the per-site estimates for a region
</pre>
==./thetaStat print angsdput.thetas.idx==
<pre>
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -9.457420 -10.372069 -8.319252 -13.025778 -10.997194
1 14000033 -9.463637 -10.379368 -8.324414 -13.035780 -11.004670
1 14000034 -9.463740 -10.379488 -8.324500 -13.035942 -11.004793
1 14000035 -9.463603 -10.379328 -8.324386 -13.035725 -11.004629
1 14000036 -9.323246 -10.218453 -8.204848 -12.826627 -10.840519
1 14000037 -9.179270 -10.048883 -8.086425 -12.596436 -10.666670
1 14000038 -9.004664 -9.845473 -7.941453 -12.328274 -10.458416
1 14000039 -9.327033 -10.222983 -8.207914 -12.833007 -10.845176
1 14000040 -9.621554 -10.557563 -8.461745 -13.262415 -11.185971
1 14000041 -9.617449 -10.552869 -8.458225 -13.256257 -11.181185
1 14000042 -7.337841 -8.161756 -204.045433 -5.457443 -6.085818
1 14000043 -9.570405 -10.502160 -8.415195 -13.197596 -11.129976
1 14000044 -9.511097 -10.434558 -8.364249 -13.110037 -11.061100
1 14000045 -9.563664 -10.494371 -8.409489 -13.187203 -11.122022
1 14000046 -9.617690 -10.555402 -8.456395 -13.265004 -11.184107
1 14000047 -9.563722 -10.494438 -8.409538 -13.187292 -11.122090
1 14000048 -9.856578 -10.819096 -8.669691 -13.587898 -11.451396
</pre>
;1. chromosome
;2. position
;3. ThetaWatterson
;4. ThetaD (nucleotide diversity)
;5. Theta? (singleton category)
;6. ThetaH
;7. ThetaL

==.thetas.idx.pestPG==
The .pestPG file is a 14 column file (tab separated). The first column contains information about the region. The second and third columns are the reference name and the center of the window.

We then have 5 different estimators of theta: Watterson, pairwise, FuLi, fayH, L.
And we have 5 different neutrality test statistics: Tajima's D, Fu&Li's F, Fu&Li's D, Fay's H, Zeng's E.
The final column is the effective number of sites with data in the window.

<pre>
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759

</pre>
Format is:

<code>
(indexStart,indexStop)(posStart,posStop)(regStart,regStop) chrname wincenter tW tP tF tH tL tajD fulif fuliD fayH zengsE numSites
</code>

Most likely you are only interested in the wincenter (column 3) and column 9, which is the Tajima's D statistic.

The first 3 columns relate to the region. The next 5 columns are 5 different estimators of theta, and the next 5 columns are neutrality test statistics. The final column is the number of sites with data in the region.

The first '''()()()''' is mainly used for debugging the sliding window program. The interpretation is:
* posStart and posStop are the first and last physical positions of the sites included in the analysis.
* regStart and regStop delimit the physical region for which the analysis is performed. Therefore posStart and posStop always lie within regStart and regStop.
* indexStart and indexStop are the positions within the internal array.

=Unknown ancestral state (folded sfs)=

If you don't have the ancestral states, you can still calculate the Watterson and Tajima thetas, which means you can still calculate the Tajima's D neutrality test statistic. This requires you to use the folded SFS. The output files will have the same format, but only thetaW, thetaD and Tajima's D are meaningful.

There was previously an example below that showed how to perform this analysis. This information has now been added to the examples above (notice the -fold 1 step in realSFS).

=Citation=
[[Korneliussen2013]]

Thetas,Tajima,Neutrality tests

2020-04-01T10:56:52Z

Thorfinn: /* Unknown ancestral state (folded sfs) */

This method estimates different thetas (population scaled mutation rate) and, based on these thetas, calculates Tajima's D and various other neutrality test statistics. The method is described in [[Korneliussen2013]].

* NB Information on this website is for version 0.917-33-g6d2aec8 or higher.
* NB [[Korneliussen2013]] covers two methods:
# using an ML method
# using the empirical Bayes (EB) method. The information on this page relates to the EB method.
To perform the ML method, you should use the [[SFS Estimation]] method and define the region of interest.

=Quick Example=
Below is a chain of commands used for calculating the statistics. These are based on the test files that can be downloaded from the [[Quick Start]] page.

It is a 3-step procedure:
# Estimate a site frequency spectrum. The output is the '''out.sfs''' file, which is used as the '''-pest''' argument in step 2.
# Calculate per-site thetas. The output is the '''.thetas.idx''' and '''.thetas.gz''' files, which contain the binary per-site estimates of the thetas.
# Calculate neutrality test statistics. The output is a '''.thetas.idx.pestPG''' file.
==Full command list for below examples==
Here is the chain of commands required to estimate the thetas and calculate the neutrality test statistics. These commands are described in detail in the following '''step 1,... step 3b''' subsections.
<pre>
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
./misc/realSFS out.saf.idx -P 24 > out.sfs
./angsd -bam bam.filelist -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc chimpHg19.fa -GL 1
#Estimate for every Chromosome/scaffold
./misc/thetaStat do_stat out.thetas.idx
#Do a sliding window analysis based on the output from the make_bed command.
./misc/thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>

==Step 1: Finding a 'global estimate' of the SFS==

First estimate the site allele frequency likelihood
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
<pre class="mw-collapsible-content">

-> Reading fasta: chimpHg19.fa
-> Parsing 10 number of samples
-> Printing at chr: 20 pos:14095817 chunknumber 3500
-> Done reading data waiting for calculations to finish
-> Calling destroy
-> Done waiting for threads
-> Output filenames:
->"out.arg"
->"out.saf"
->"out.saf.pos.gz"
-> Mon Jun 30 12:02:58 2014
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 47.19 sec
[ALL done] walltime used = 43.00 sec

</pre>
</div>

Obtain the maximum likelihood estimate of the SFS using the '''realSFS''' program found in the misc subfolder. (See more here [[realSFS]])
<pre>
./misc/realSFS out.saf.idx -P 24 > out.sfs
</pre>

To plot the SFS in R:
<pre>
s<-scan('out.sfs')
s<-s[-c(1,length(s))]
s<-s/sum(s)
barplot(s,names=1:length(s),main='SFS')
</pre>
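
The same normalisation can be sketched outside R. The Python snippet below is only an illustration: it assumes the SFS is a single line of whitespace-separated expected counts, and the function name <code>normalise_variable_sfs</code> and the toy numbers are made up for this example.

```python
def normalise_variable_sfs(sfs_line):
    """Drop the two fixed categories (first and last bin) and rescale to sum to 1."""
    counts = [float(x) for x in sfs_line.split()]
    variable = counts[1:-1]  # same idea as s[-c(1,length(s))] in the R code
    total = sum(variable)
    return [c / total for c in variable]


# Toy spectrum with hypothetical numbers (not real ANGSD output)
example = "900 50 30 20 100"
print(normalise_variable_sfs(example))
```

Removing the first and last bins leaves only the variable categories, so the plotted proportions are not dominated by invariant sites.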

==Step 2: Calculate the thetas for each site==
<pre>
realSFS saf2theta out.saf.idx -sfs out.sfs -outname out
</pre>
The above command produces two files: out.thetas.gz and out.thetas.idx. A formal description of these files can be found in doc/formats.pdf in the angsd package. The log-scale per-site thetas can be extracted with the ./thetaStat print program.

<div class="toccolours mw-collapsible mw-collapsed">
thetaStat print out.thetas.idx 2>/dev/null |head
<pre class="mw-collapsible-content">
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -10.339284 -12.069325 -9.000927 -15.852173 -12.739969
1 14000033 -10.437878 -12.185619 -9.080596 -16.001343 -12.856984
1 14000034 -10.373872 -12.110464 -9.028572 -15.905591 -12.781380
1 14000035 -10.528192 -12.290763 -9.154920 -16.133823 -12.962708
1 14000036 -10.322074 -12.051400 -8.985016 -15.834049 -12.722040
1 14000037 -10.304955 -12.028814 -8.973260 -15.800330 -12.699204
1 14000038 -10.108563 -11.791546 -8.819884 -15.486384 -12.460146
1 14000039 -10.542117 -12.306631 -9.166698 -16.153168 -12.978650
1 14000040 -10.688401 -12.473763 -9.290272 -16.358398 -13.146564
</pre>
</div>
By default, the print command also writes the contents of the index file to stderr.
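
Since the printed values are on a log scale, they usually need to be converted before being summed or compared. The following Python sketch shows one way to do this for a single output line. It assumes the values are natural logarithms and that the columns are whitespace separated; <code>parse_theta_line</code> is a hypothetical helper, not part of ANGSD.

```python
import math


def parse_theta_line(line):
    """Split one line of `thetaStat print` output into (chromosome, position, thetas)."""
    fields = line.split()
    chrom, pos = fields[0], int(fields[1])
    # Assumption: the values are natural logarithms, so exp() gives the linear scale.
    thetas = [math.exp(float(x)) for x in fields[2:]]
    return chrom, pos, thetas


line = "1 14000032 -10.339284 -12.069325 -9.000927 -15.852173 -12.739969"
chrom, pos, thetas = parse_theta_line(line)
print(chrom, pos, ["%.3e" % t for t in thetas])
```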

==Step 3a: Estimate Tajimas D and other statistics==
<pre>
#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.idx
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
cat out.thetas.idx.pestPG
<pre class="mw-collapsible-content">
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759
(0,97855)(14001686,14100094)(0,14100094) 9 7050047 88.777475 102.853333 64.660948 104.749694 103.801516 0.660983 0.776148 0.611265 -0.021256 0.184875 97855
(0,98031)(13999906,14100096)(0,14100096) 10 7050048 129.583334 134.877160 88.135115 213.231615 174.054390 0.170681 0.654145 0.724072 -0.602284 0.375595 98031
(0,99220)(13999900,14100060)(0,14100060) 11 7050030 66.349155 79.423643 60.194045 68.903312 74.163477 0.819589 0.520022 0.207421 0.157614 0.128409 99220
(0,99861)(13999913,14100078)(0,14100078) 12 7050039 86.461303 81.630083 96.156392 110.974922 96.302507 -0.232902 -0.302980 -0.252190 -0.337701 0.124323 99861
(0,98258)(13999943,14100097)(0,14100097) 16 7050048 83.191170 99.392421 77.561510 106.148748 102.770584 0.811500 0.472922 0.152079 -0.080798 0.257008 98258
(0,99428)(13999902,14100095)(0,14100095) 17 7050047 90.254620 99.816352 65.610351 113.328929 106.572638 0.441707 0.683942 0.614609 -0.148988 0.197530 99428
(0,97118)(13999898,14100071)(0,14100071) 18 7050035 79.843256 75.282296 86.844252 67.720321 71.501308 -0.237958 -0.260778 -0.196888 0.094212 -0.114062 97118
(0,93783)(13999895,14100089)(0,14100089) 19 7050044 54.311523 49.839190 64.913940 72.868913 61.354048 -0.341795 -0.495649 -0.434079 -0.421111 0.141133 93783
(0,98938)(13999916,14100091)(0,14100091) 20 7050045 68.148147 63.323800 78.463736 56.040370 59.682084 -0.294508 -0.398845 -0.338673 0.106250 -0.135474 98938
</pre>
</div>

==Step 3b: Sliding Window example==
We can easily do a sliding window analysis by adding -win/-step arguments to the last command. [[ thetaStat ]]
<pre>
thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>
This will calculate the test statistic using a window size of 50kb and a step size of 10kb.

=Example Output=
<pre>
- The output of ./thetaStat print thetas.idx is the log-scaled per-site estimates of the thetas
- The output in the pestPG file is the sum of the per-site estimates for a region
</pre>
==./thetaStat print angsdput.thetas.idx==
<pre>
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -9.457420 -10.372069 -8.319252 -13.025778 -10.997194
1 14000033 -9.463637 -10.379368 -8.324414 -13.035780 -11.004670
1 14000034 -9.463740 -10.379488 -8.324500 -13.035942 -11.004793
1 14000035 -9.463603 -10.379328 -8.324386 -13.035725 -11.004629
1 14000036 -9.323246 -10.218453 -8.204848 -12.826627 -10.840519
1 14000037 -9.179270 -10.048883 -8.086425 -12.596436 -10.666670
1 14000038 -9.004664 -9.845473 -7.941453 -12.328274 -10.458416
1 14000039 -9.327033 -10.222983 -8.207914 -12.833007 -10.845176
1 14000040 -9.621554 -10.557563 -8.461745 -13.262415 -11.185971
1 14000041 -9.617449 -10.552869 -8.458225 -13.256257 -11.181185
1 14000042 -7.337841 -8.161756 -204.045433 -5.457443 -6.085818
1 14000043 -9.570405 -10.502160 -8.415195 -13.197596 -11.129976
1 14000044 -9.511097 -10.434558 -8.364249 -13.110037 -11.061100
1 14000045 -9.563664 -10.494371 -8.409489 -13.187203 -11.122022
1 14000046 -9.617690 -10.555402 -8.456395 -13.265004 -11.184107
1 14000047 -9.563722 -10.494438 -8.409538 -13.187292 -11.122090
1 14000048 -9.856578 -10.819096 -8.669691 -13.587898 -11.451396
</pre>
;1. chromosome
;2. position
;3. ThetaWatterson
;4. ThetaD (nucleotide diversity)
;5. Theta? (singleton category)
;6. ThetaH
;7. ThetaL

==.thetas.idx.pestPG==
The .pestPG file is a 14-column file (tab separated). The first column contains information about the region. The second and third columns are the reference name and the center of the window.

We then have 5 different estimators of theta: Watterson, pairwise, FuLi, fayH, L.
And we have 5 different neutrality test statistics: Tajima's D, Fu & Li's F, Fu & Li's D, Fay's H, Zeng's E.
The final column is the effective number of sites with data in the window.

<pre>
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759

</pre>
Format is:

<code>
(indexStart,indexStop)(posStart,posStop)(regStart,regStop) chrname wincenter tW tP tF tH tL tajD fulif fuliD fayH zengsE numSites
</code>

Most likely you are just interested in the wincenter (column 3) and column 9, which is the Tajima's D statistic.

The first 3 columns relate to the region. The next 5 columns are 5 different estimators of theta, and the next 5 columns are neutrality test statistics. The final column is the number of sites with data in the region.

The first '''()()()''' is mainly used for debugging the sliding window program. The interpretation is:
* posStart and posStop are the first and last physical positions of the sites included in the analysis.
* regStart and regStop delimit the physical region for which the analysis is performed. The posStart and posStop positions therefore always lie within regStart and regStop.
* indexStart and indexStop are the positions within the internal array.
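
The column layout described above can be read programmatically. The Python sketch below parses one data line of a .pestPG file into named fields. The function name <code>parse_pestpg_line</code> and the dictionary keys are hypothetical choices for this example; only the column order is taken from the format line above.

```python
import re


def parse_pestpg_line(line):
    """Parse one data line of a .pestPG file into a dictionary."""
    fields = line.split()
    # The first field holds three "(start,stop)" pairs.
    pairs = [(int(a), int(b)) for a, b in re.findall(r"\((\d+),(\d+)\)", fields[0])]
    return {
        "index": pairs[0],
        "pos": pairs[1],
        "region": pairs[2],
        "chr": fields[1],
        "wincenter": int(fields[2]),
        "tajima_d": float(fields[8]),  # column 9
        "n_sites": int(fields[13]),    # column 14
    }


line = ("(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 "
        "64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 "
        "-0.048444 98316")
record = parse_pestpg_line(line)
print(record["chr"], record["wincenter"], record["tajima_d"])
```

Header lines starting with <code>#</code> would need to be skipped before calling such a function.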

=Unknown ancestral state (folded sfs)=

If you don't have the ancestral states, you can still calculate the Watterson and Tajima thetas, which means you can still compute the Tajima's D neutrality test statistic. This requires the folded SFS. The output files have the same format, but only thetaW, thetaD and Tajima's D are meaningful.

There was previously an example below that showed how to perform this analysis. This information has now been added to the examples above (notice the -fold 1 step in realSFS).

=Citation=
[[Korneliussen2013]]

Thetas,Tajima,Neutrality tests

2020-04-01T10:52:30Z

Thorfinn: /* Step 2: Calculate the thetas for each site */

This method estimates different thetas (population-scaled mutation rates) and, based on these thetas, calculates Tajima's D and various other neutrality test statistics. The method is described in [[Korneliussen2013]].

* NB Information on this website is for version 0.917-33-g6d2aec8 or higher.
* NB [[Korneliussen2013]] covers two methods:
# using an ML method
# using the empirical Bayes (EB) method. The information on this page relates to the EB method.
For performing the ML method, you should use the [[SFS Estimation]] method and define the region of interest.

=Quick Example=
Below is a chain of commands used for calculating the statistics. These are based on the test files that can be downloaded from the [[Quick Start]] page.

It is a 3-step procedure:
# Estimate a site frequency spectrum. The output is the '''out.sfs''' file, which is used as the '''-pest''' argument in step 2.
# Calculate per-site thetas. The output is the '''.thetas.idx''' and '''.thetas.gz''' files, which contain the binary per-site estimates of the thetas.
# Calculate neutrality test statistics. The output is a '''.thetas.idx.pestPG''' file.
==Full command list for below examples==
Here is the chain of commands required to estimate the thetas and calculate the neutrality test statistics. These commands are described in detail in the following '''step 1,... step 3b''' subsections.
<pre>
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
./misc/realSFS out.saf.idx -P 24 > out.sfs
./angsd -bam bam.filelist -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc chimpHg19.fa -GL 1
#Estimate for every Chromosome/scaffold
./misc/thetaStat do_stat out.thetas.idx
#Do a sliding window analysis based on the output from the make_bed command.
./misc/thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>

==Step 1: Finding a 'global estimate' of the SFS==

First estimate the site allele frequency likelihood
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 1 -P 24 -out out
<pre class="mw-collapsible-content">

-> Reading fasta: chimpHg19.fa
-> Parsing 10 number of samples
-> Printing at chr: 20 pos:14095817 chunknumber 3500
-> Done reading data waiting for calculations to finish
-> Calling destroy
-> Done waiting for threads
-> Output filenames:
->"out.arg"
->"out.saf"
->"out.saf.pos.gz"
-> Mon Jun 30 12:02:58 2014
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 47.19 sec
[ALL done] walltime used = 43.00 sec

</pre>
</div>

Obtain the maximum likelihood estimate of the SFS using the '''realSFS''' program found in the misc subfolder. (See more here [[realSFS]])
<pre>
./misc/realSFS out.saf.idx -P 24 > out.sfs
</pre>

To plot the SFS in R:
<pre>
s<-scan('out.sfs')
s<-s[-c(1,length(s))]
s<-s/sum(s)
barplot(s,names=1:length(s),main='SFS')
</pre>

==Step 2: Calculate the thetas for each site==
<pre>
realSFS saf2theta out.saf.idx -sfs out.sfs -outname out
</pre>
The above command produces two files: out.thetas.gz and out.thetas.idx. A formal description of these files can be found in doc/formats.pdf in the angsd package. The log-scale per-site thetas can be extracted with the ./thetaStat print program.

<div class="toccolours mw-collapsible mw-collapsed">
thetaStat print out.thetas.idx 2>/dev/null |head
<pre class="mw-collapsible-content">
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -10.339284 -12.069325 -9.000927 -15.852173 -12.739969
1 14000033 -10.437878 -12.185619 -9.080596 -16.001343 -12.856984
1 14000034 -10.373872 -12.110464 -9.028572 -15.905591 -12.781380
1 14000035 -10.528192 -12.290763 -9.154920 -16.133823 -12.962708
1 14000036 -10.322074 -12.051400 -8.985016 -15.834049 -12.722040
1 14000037 -10.304955 -12.028814 -8.973260 -15.800330 -12.699204
1 14000038 -10.108563 -11.791546 -8.819884 -15.486384 -12.460146
1 14000039 -10.542117 -12.306631 -9.166698 -16.153168 -12.978650
1 14000040 -10.688401 -12.473763 -9.290272 -16.358398 -13.146564
</pre>
</div>
By default, the print command also writes the contents of the index file to stderr.

==Step 3a: Estimate Tajimas D and other statistics==
<pre>
#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.idx
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
cat out.thetas.idx.pestPG
<pre class="mw-collapsible-content">
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759
(0,97855)(14001686,14100094)(0,14100094) 9 7050047 88.777475 102.853333 64.660948 104.749694 103.801516 0.660983 0.776148 0.611265 -0.021256 0.184875 97855
(0,98031)(13999906,14100096)(0,14100096) 10 7050048 129.583334 134.877160 88.135115 213.231615 174.054390 0.170681 0.654145 0.724072 -0.602284 0.375595 98031
(0,99220)(13999900,14100060)(0,14100060) 11 7050030 66.349155 79.423643 60.194045 68.903312 74.163477 0.819589 0.520022 0.207421 0.157614 0.128409 99220
(0,99861)(13999913,14100078)(0,14100078) 12 7050039 86.461303 81.630083 96.156392 110.974922 96.302507 -0.232902 -0.302980 -0.252190 -0.337701 0.124323 99861
(0,98258)(13999943,14100097)(0,14100097) 16 7050048 83.191170 99.392421 77.561510 106.148748 102.770584 0.811500 0.472922 0.152079 -0.080798 0.257008 98258
(0,99428)(13999902,14100095)(0,14100095) 17 7050047 90.254620 99.816352 65.610351 113.328929 106.572638 0.441707 0.683942 0.614609 -0.148988 0.197530 99428
(0,97118)(13999898,14100071)(0,14100071) 18 7050035 79.843256 75.282296 86.844252 67.720321 71.501308 -0.237958 -0.260778 -0.196888 0.094212 -0.114062 97118
(0,93783)(13999895,14100089)(0,14100089) 19 7050044 54.311523 49.839190 64.913940 72.868913 61.354048 -0.341795 -0.495649 -0.434079 -0.421111 0.141133 93783
(0,98938)(13999916,14100091)(0,14100091) 20 7050045 68.148147 63.323800 78.463736 56.040370 59.682084 -0.294508 -0.398845 -0.338673 0.106250 -0.135474 98938
</pre>
</div>

==Step 3b: Sliding Window example==
We can easily do a sliding window analysis by adding -win/-step arguments to the last command. [[ thetaStat ]]
<pre>
thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gz
</pre>
This will calculate the test statistic using a window size of 50kb and a step size of 10kb.
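
To make the relationship between window size and step size concrete, the Python sketch below enumerates overlapping windows over a region. This is a generic sliding-window illustration: the function <code>sliding_windows</code> is hypothetical, and the exact way thetaStat places its first window may differ from what is shown here.

```python
def sliding_windows(start, stop, win, step):
    """Yield (window_start, window_stop) pairs that fit entirely inside [start, stop)."""
    pos = start
    while pos + win <= stop:
        yield pos, pos + win
        pos += step


# A 100 kb region with 50 kb windows moved in 10 kb steps
for window in sliding_windows(0, 100000, 50000, 10000):
    print(window)
```

With a step smaller than the window, neighbouring windows share most of their sites, so the statistics in adjacent rows are not independent.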

=Example Output=
<pre>
- The output of ./thetaStat print thetas.idx is the log-scaled per-site estimates of the thetas
- The output in the pestPG file is the sum of the per-site estimates for a region
</pre>
==./thetaStat print angsdput.thetas.idx==
<pre>
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
1 14000032 -9.457420 -10.372069 -8.319252 -13.025778 -10.997194
1 14000033 -9.463637 -10.379368 -8.324414 -13.035780 -11.004670
1 14000034 -9.463740 -10.379488 -8.324500 -13.035942 -11.004793
1 14000035 -9.463603 -10.379328 -8.324386 -13.035725 -11.004629
1 14000036 -9.323246 -10.218453 -8.204848 -12.826627 -10.840519
1 14000037 -9.179270 -10.048883 -8.086425 -12.596436 -10.666670
1 14000038 -9.004664 -9.845473 -7.941453 -12.328274 -10.458416
1 14000039 -9.327033 -10.222983 -8.207914 -12.833007 -10.845176
1 14000040 -9.621554 -10.557563 -8.461745 -13.262415 -11.185971
1 14000041 -9.617449 -10.552869 -8.458225 -13.256257 -11.181185
1 14000042 -7.337841 -8.161756 -204.045433 -5.457443 -6.085818
1 14000043 -9.570405 -10.502160 -8.415195 -13.197596 -11.129976
1 14000044 -9.511097 -10.434558 -8.364249 -13.110037 -11.061100
1 14000045 -9.563664 -10.494371 -8.409489 -13.187203 -11.122022
1 14000046 -9.617690 -10.555402 -8.456395 -13.265004 -11.184107
1 14000047 -9.563722 -10.494438 -8.409538 -13.187292 -11.122090
1 14000048 -9.856578 -10.819096 -8.669691 -13.587898 -11.451396
</pre>
;1. chromosome
;2. position
;3. ThetaWatterson
;4. ThetaD (nucleotide diversity)
;5. Theta? (singleton category)
;6. ThetaH
;7. ThetaL

==.thetas.idx.pestPG==
The .pestPG file is a 14-column file (tab separated). The first column contains information about the region. The second and third columns are the reference name and the center of the window.

We then have 5 different estimators of theta: Watterson, pairwise, FuLi, fayH, L.
And we have 5 different neutrality test statistics: Tajima's D, Fu & Li's F, Fu & Li's D, Fay's H, Zeng's E.
The final column is the effective number of sites with data in the window.

<pre>
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(0,98316)(14000032,14100082)(0,14100082) 1 7050041 51.002623 46.171402 64.683834 51.290955 48.731178 -0.392892 -0.647071 -0.595302 -0.099654 -0.048444 98316
(0,98474)(13999910,14100060)(0,14100060) 2 7050030 92.689100 88.806005 101.768262 122.422498 105.614255 -0.174701 -0.252477 -0.220588 -0.360944 0.152373 98474
(0,93269)(14000529,14100095)(0,14100095) 3 7050047 70.757874 76.248087 75.447438 68.354514 72.301301 0.322902 0.020330 -0.148419 0.110921 0.023794 93269
(0,96339)(13999912,14100064)(0,14100064) 4 7050032 99.748624 107.898618 94.265208 130.283528 119.091076 0.340878 0.247030 0.123956 -0.223386 0.211971 96339
(0,99659)(13999926,14100063)(0,14100063) 5 7050031 120.941697 132.667821 86.726667 163.908351 148.288088 0.404945 0.688320 0.639821 -0.257254 0.247395 99659
(0,99541)(13999918,14100103)(0,14100103) 6 7050051 96.666344 112.146685 69.740992 143.403712 127.775201 0.667988 0.792499 0.627735 -0.321842 0.351730 99541
(0,99786)(13999926,14100047)(0,14100047) 7 7050023 93.164548 92.023886 92.742574 142.413716 117.218807 -0.051058 -0.013928 0.010201 -0.538288 0.282133 99786
(0,98759)(13999923,14100082)(0,14100082) 8 7050041 133.567125 177.157879 72.197498 204.069028 190.613463 1.363708 1.425567 1.040517 -0.200700 0.467490 98759

</pre>
Format is:

<code>
(indexStart,indexStop)(posStart,posStop)(regStart,regStop) chrname wincenter tW tP tF tH tL tajD fulif fuliD fayH zengsE numSites
</code>

Most likely you are just interested in the wincenter (column 3) and column 9, which is the Tajima's D statistic.

The first 3 columns relate to the region. The next 5 columns are 5 different estimators of theta, and the next 5 columns are neutrality test statistics. The final column is the number of sites with data in the region.

The first '''()()()''' is mainly used for debugging the sliding window program. The interpretation is:
* posStart and posStop are the first and last physical positions of the sites included in the analysis.
* regStart and regStop delimit the physical region for which the analysis is performed. The posStart and posStop positions therefore always lie within regStart and regStop.
* indexStart and indexStop are the positions within the internal array.
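
Because the theta columns in the pestPG file are sums over the sites in a region, windows with different amounts of data are easier to compare after dividing by the number of sites. The Python sketch below does this for the tW and tP values of the first example row; <code>per_site_theta</code> is a hypothetical helper written for this illustration.

```python
def per_site_theta(theta_sum, n_sites):
    """Average per-site theta for a window; None when the window has no data."""
    if n_sites == 0:
        return None
    return theta_sum / n_sites


# tW, tP and nSites taken from the first example row (chromosome 1)
tw, tp, n_sites = 51.002623, 46.171402, 98316
print(per_site_theta(tw, n_sites), per_site_theta(tp, n_sites))
```

Returning None for empty windows avoids a division by zero when a sliding window contains no sites with data.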

=Unknown ancestral state (folded sfs)=
* Below is for version 0.556 and above

If you don't have the ancestral states, you can still calculate the Watterson and Tajima thetas, which means you can still compute the Tajima's D neutrality test statistic. This requires the folded SFS. The output files have the same format, but only thetaW, thetaD and Tajima's D are meaningful.

Below is an example based on the earlier example where we now base our analysis on the folded spectrum. Notice the -fold 1

First estimate the folded site allele frequency likelihood
<pre>
./angsd -bam bam.filelist -doSaf 1 -anc hg19.fa -GL 1 -P 24 -out outFold -fold 1
</pre>

Obtain the maximum likelihood estimate of the SFS
<pre>
misc/realSFS outFold.saf.idx -P 24 > outFold.sfs
</pre>

Calculate the thetas (remember to fold)
<pre>
./angsd -bam bam.filelist -out outFold -doThetas 1 -doSaf 1 -pest outFold.sfs -anc hg19.fa -GL 1 -fold 1
</pre>

Estimate Tajimas D
<pre>
thetaStat do_stat outFold.thetas.idx -nChr 10
</pre>
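
To illustrate what folding means, the Python sketch below folds an unfolded spectrum by merging each derived-allele category with its mirror category. This is a conceptual illustration with made-up counts, and <code>fold_sfs</code> is a hypothetical function; it is not a description of how the -fold option is implemented in ANGSD.

```python
def fold_sfs(unfolded):
    """Fold an unfolded SFS with 2n+1 categories into n+1 minor-allele categories."""
    last = len(unfolded) - 1  # index of the highest category (2n)
    folded = []
    for i in range(last // 2 + 1):
        j = last - i
        # The middle category (i == j) is its own mirror image and is counted once.
        folded.append(unfolded[i] if i == j else unfolded[i] + unfolded[j])
    return folded


# Toy unfolded spectrum for 2 diploid individuals (5 categories, hypothetical counts)
print(fold_sfs([10, 4, 3, 2, 1]))  # prints [11, 6, 3]
```

After folding, a category no longer says which allele is derived, which is why only statistics that do not depend on the ancestral state remain meaningful.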

=Citation=
[[Korneliussen2013]]

Filters

2019-11-15T06:49:14Z

Thorfinn: /* Number of non missing individuals */

We allow for filtering at many different levels.

# Read level, MapQ, unique mapped reads etc
# Base level, qscore
# Sequencing depth
# Regions (using BAM indexing (active lookup))
# Single sites (passive lookup, also allows for forcing major and minor) [[Sites |-sites]]
# Filtering based on downstream analysis. minimum MAF, LRT for SNP calling etc.
# Trimming out the ends of the reads
# etc

It follows that some filters will select a subset of the data, and some filters will discard certain sites. If multiple filters have been chosen, the analysis is limited to the data that pass the whole chain of filters.

=Filters for reads in Bam files=

We allow for filtering and manipulation at the read level. These filters include minimum mapping and base quality, paired reads and others. Additionally, specific regions can be analysed. All of the filters for BAM files are described in [[Input#BAM_files]].

=Selected Sites=
For analysing specific regions see [[Input#BAM_files]]. If you are interested in running your analysis at individual sites that are distributed throughout the entire genome, it might be faster to simply loop over the entire data, but only analyse the data at specific positions. This can be done by supplying the [[Sites | -sites]] argument. This approach also allows for forcing the major/minor alleles using external information.

=Allele frequencies=
; -minMaf [float]: only work with sites with a maf above [float]

Requires [[Allele Frequency estimation | -doMaf]].

=Polymorphic sites=

; -SNP_pval [float]: only work with sites with a p-value less than [float]

Requires [[Allele Frequency estimation | -doMaf]].

=Number of non missing individuals=

; -minInd [int]: Only keep sites where at least [int] individuals each have a depth of at least minIndDepth (default is 1)
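
The logic of this filter can be sketched in a few lines of Python. This is an illustration of the rule as described above, not ANGSD's actual code; the function name <code>passes_min_ind</code>, its parameter names and the depths are made up for the example.

```python
def passes_min_ind(depths, min_ind, min_ind_depth=1):
    """True if at least min_ind individuals have depth >= min_ind_depth at this site."""
    n_with_data = sum(1 for d in depths if d >= min_ind_depth)
    return n_with_data >= min_ind


# Hypothetical per-individual depths at one site for 10 individuals
site_depths = [0, 3, 1, 0, 2, 0, 0, 5, 1, 0]
print(passes_min_ind(site_depths, min_ind=5))  # five individuals have depth >= 1
print(passes_min_ind(site_depths, min_ind=6))
```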

=Extra=
;-setMinDepth [int]:
Discard site if total sequencing depth (all individuals added together) is below [int].
Requires [[Alleles counts | -doCounts]]

;-setMaxDepth [int]:
Discard site if total sequencing depth (all individuals added together) is above [int].
Requires [[Alleles counts | -doCounts]]

;-setMinDepthInd [int]:
Discard individual if sequencing depth for an individual is below [int]. This filter is only applied to analyses that are based on counts of alleles, i.e. analyses that use [[Alleles counts | -doCounts]].

;-setMaxDepthInd [int]:
Discard individual if sequencing depth for an individual is above [int]. This filter is only applied to analyses that are based on counts of alleles, i.e. analyses that use [[Alleles counts | -doCounts]].

;-geno_minDepth [int]:
Only call genotypes if the depth is at least [int] for that individual.

This requires [[Alleles counts | -doCounts]] and [[Genotype calling |-doGeno ]]
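
The total-depth filters above amount to a simple range check on the summed depth. The Python sketch below illustrates that check under the assumption that per-individual depths are available as a list; <code>passes_total_depth</code> is a hypothetical function and the thresholds are arbitrary.

```python
def passes_total_depth(depths, min_depth=None, max_depth=None):
    """Keep a site only if the depth summed over all individuals is within the bounds."""
    total = sum(depths)
    if min_depth is not None and total < min_depth:
        return False
    if max_depth is not None and total > max_depth:
        return False
    return True


# Hypothetical per-individual depths at one site (total depth is 12)
site_depths = [0, 3, 1, 0, 2, 0, 0, 5, 1, 0]
print(passes_total_depth(site_depths, min_depth=10, max_depth=100))
print(passes_total_depth(site_depths, min_depth=20))
```

An upper bound is typically used to exclude sites with unusually high coverage, while the lower bound removes sites with too little data.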

=Examples=

First we do a run with no filters

<pre>
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c TSK.mafs.gz | head
<pre class="mw-collapsible-content">
chromo position major minor unknownEM nInd
1 13999919 A C 0.000006 1
1 13999920 G A 0.000006 1
1 13999921 G A 0.000006 1
1 13999922 C A 0.000006 1
1 13999923 A C 0.000006 1
1 13999924 G A 0.000006 1
1 13999925 G A 0.000006 1
1 13999926 A C 0.000006 1
1 13999927 G A 0.000006 1
</pre>
</div>

Now we apply a MAF cutoff of 1%

<pre>
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c TSK.mafs.gz | head
<pre class="mw-collapsible-content">
chromo position major minor unknownEM nInd
1 14000003 G A 0.032285 9
1 14000013 G A 0.058291 9
1 14000019 G T 0.013709 9
1 14000023 C A 0.025033 9
1 14000170 C T 0.031133 10
1 14000176 G A 0.028189 10
1 14000200 C A 0.075946 7
1 14000202 G A 0.257007 7
1 14000774 G T 0.030039 10
</pre>
</div>

Similarly, if we only want sites with information for at least 5 samples
<pre>
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minInd 5
</pre>
<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c TSK.mafs.gz | head
<pre class="mw-collapsible-content">
chromo position major minor unknownEM nInd
1 13999972 G A 0.000003 5
1 13999973 C A 0.000002 5
1 13999974 G A 0.000002 5
1 13999975 C A 0.000002 5
1 13999976 C A 0.000002 5
1 13999977 A C 0.000000 5
1 13999978 C A 0.000000 5
1 13999979 T A 0.000000 5
1 13999980 G A 0.000001 5
</pre>
</div>

If we are only interested in sites with a p-value below 10^(-6) for being variable
<pre>
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -SNP_pval 1e-6
</pre>
<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c TSK.mafs.gz | head
<pre class="mw-collapsible-content">
chromo position major minor unknownEM pu-EM nInd
1 14000873 G A 0.282476 0.000000e+00 10
1 14001018 T C 0.259890 7.494005e-14 9
1 14001867 A G 0.272099 6.361578e-14 10
1 14002422 A T 0.377890 0.000000e+00 9
1 14003581 C T 0.194393 5.551115e-16 9
1 14004623 T C 0.259172 2.424727e-13 10
1 14007493 A G 0.297176 5.114086e-07 9
1 14007558 C T 0.381770 0.000000e+00 8
1 14007649 G A 0.220547 1.054967e-11 9
</pre>
</div>

Filters

2019-11-15T06:48:52Z

Thorfinn: /* Number of non missing individuals */

We allow for filtering at many different levels.

# Read level, MapQ, unique mapped reads etc
# Base level, qscore
# Sequencing depth
# Regions (using BAM indexing (active lookup))
# Single sites (passive lookup, also allows for forcing major and minor) [[Sites |-sites]]
# Filtering based on downstream analysis. minimum MAF, LRT for SNP calling etc.
# Trimming out the ends of the reads
# etc

It follows that some filters will select a subset of the data, and some filters will discard certain sites. If multiple filters have been chosen, the analysis is limited to the data that pass the whole chain of filters.

=Filters for reads in Bam files=

We allow for filtering and manipulation at the read level. These filters include minimum mapping and base quality, paired reads and others. Additionally, specific regions can be analysed. All of the filters for BAM files are described in [[Input#BAM_files]].

=Selected Sites=
For analysing specific regions see [[Input#BAM_files]]. If you are interested in running your analysis at individual sites that are distributed throughout the entire genome, it might be faster to simply loop over the entire data, but only analyse the data at specific positions. This can be done by supplying the [[Sites | -sites]] argument. This approach also allows for forcing the major/minor alleles using external information.

=Allele frequencies=
; -minMaf [float]: only work with sites with a maf above [float]

Requires [[Allele Frequency estimation | -doMaf]].

=Polymorphic sites=

; -SNP_pval [float]: only work with sites with a p-value less than [float]

Requires [[Allele Frequency estimation | -doMaf]].
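The p-value used by -SNP_pval comes from a likelihood ratio test (LRT) of whether a site is variable. As a rough sketch, assuming the test statistic is compared against a chi-square distribution with one degree of freedom (an assumption that is not stated on this page), an LRT value can be converted to a p-value as shown below. The function name is hypothetical and not part of ANGSD.

```python
import math

def lrt_to_pvalue(lrt):
    """Upper tail probability of a chi-square variable with 1 df at `lrt`.

    For one degree of freedom the survival function reduces to
    erfc(sqrt(x / 2)), so no external library is needed.
    """
    if lrt < 0:
        raise ValueError("LRT statistic must be non-negative")
    return math.erfc(math.sqrt(lrt / 2.0))

if __name__ == "__main__":
    # A few illustrative LRT values (hypothetical, not ANGSD output)
    for lrt in (3.84, 10.83, 24.0):
        print(lrt, lrt_to_pvalue(lrt))
```

Under this assumption, a cutoff of -SNP_pval 1e-6 corresponds to an LRT statistic of roughly 24.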

=Number of non missing individuals=

; -minInd [int]: Only keep sites where at least [int] individuals have a depth of at least -setMinDepthInd (default is 1).

; -setMinDepthInd [int]: Only works with the -minInd filter. Change the minimum depth the individuals must have in order to keep the site. Default is 1.
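The interaction between -minInd and -setMinDepthInd can be sketched as follows. This is only an illustration of the rule as described above, not ANGSD's implementation; the function name and the depth values are hypothetical.

```python
def passes_min_ind(depths, min_ind, min_depth_ind=1):
    """Return True if at least `min_ind` individuals have depth >= `min_depth_ind`.

    `depths` holds the sequencing depth of each individual at one site.
    """
    n_with_data = sum(1 for d in depths if d >= min_depth_ind)
    return n_with_data >= min_ind

# Hypothetical per-individual depths at one site for 10 samples
site_depths = [0, 3, 1, 0, 2, 5, 0, 1, 4, 0]

# Six individuals have depth >= 1, so the site is kept with -minInd 5
print(passes_min_ind(site_depths, min_ind=5))
# Only four individuals have depth >= 2, so the site is dropped
print(passes_min_ind(site_depths, min_ind=5, min_depth_ind=2))
```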

=Extra=
;-setMinDepth [int]:
Discard site if total sequencing depth (all individuals added together) is below [int].
Requires [[Alleles counts | -doCounts]]

;-setMaxDepth [int]:
Discard site if total sequencing depth (all individuals added together) is above [int].
Requires [[Alleles counts | -doCounts]]

;-setMinDepthInd [int]:
Discard individual if sequencing depth for an individual is below [int]. This filter is only applied to analyses that are based on counts of alleles, i.e. analyses that use [[Alleles counts | -doCounts]].

;-setMaxDepthInd [int]:
Discard individual if sequencing depth for an individual is above [int]. This filter is only applied to analyses that are based on counts of alleles, i.e. analyses that use [[Alleles counts | -doCounts]].

;-geno_minDepth [int]:
Only call genotypes if the depth is at least [int] for that individual.

This requires [[Alleles counts | -doCounts]] and [[Genotype calling |-doGeno ]]
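A minimal sketch of how the site-level and individual-level depth filters above could be combined for one site. It assumes that the total depth is summed over all individuals before any individual is discarded, which is an assumption and not confirmed by this page; the function and argument names are hypothetical.

```python
def filter_site_by_depth(depths, set_min_depth=None, set_max_depth=None,
                         set_min_depth_ind=None, set_max_depth_ind=None):
    """Apply the documented depth rules to the per-individual depths of one site.

    Returns (keep_site, kept_individuals). `kept_individuals` lists the
    indices of individuals whose depth is within the per-individual bounds.
    """
    total = sum(depths)
    keep_site = True
    if set_min_depth is not None and total < set_min_depth:
        keep_site = False
    if set_max_depth is not None and total > set_max_depth:
        keep_site = False
    kept_individuals = [
        i for i, d in enumerate(depths)
        if (set_min_depth_ind is None or d >= set_min_depth_ind)
        and (set_max_depth_ind is None or d <= set_max_depth_ind)
    ]
    return keep_site, kept_individuals

# Hypothetical depths for 10 individuals at one site (total depth 16)
example_depths = [0, 3, 1, 0, 2, 5, 0, 1, 4, 0]
print(filter_site_by_depth(example_depths, set_min_depth=10, set_max_depth=100))
print(filter_site_by_depth(example_depths, set_min_depth_ind=2, set_max_depth_ind=4))
```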

=Examples=

First we do a run with no filters

<pre>
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c TSK.mafs.gz | head
<pre class="mw-collapsible-content">
chromo position major minor unknownEM nInd
1 13999919 A C 0.000006 1
1 13999920 G A 0.000006 1
1 13999921 G A 0.000006 1
1 13999922 C A 0.000006 1
1 13999923 A C 0.000006 1
1 13999924 G A 0.000006 1
1 13999925 G A 0.000006 1
1 13999926 A C 0.000006 1
1 13999927 G A 0.000006 1
</pre>
</div>

Now we filter with a MAF cutoff of 1%:

<pre>
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01
</pre>

<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c TSK.mafs.gz | head
<pre class="mw-collapsible-content">
chromo position major minor unknownEM nInd
1 14000003 G A 0.032285 9
1 14000013 G A 0.058291 9
1 14000019 G T 0.013709 9
1 14000023 C A 0.025033 9
1 14000170 C T 0.031133 10
1 14000176 G A 0.028189 10
1 14000200 C A 0.075946 7
1 14000202 G A 0.257007 7
1 14000774 G T 0.030039 10
</pre>
</div>

Similarly, if we only want sites with information for at least 5 samples:
<pre>
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minInd 5
</pre>
<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c TSK.mafs.gz | head
<pre class="mw-collapsible-content">
chromo position major minor unknownEM nInd
1 13999972 G A 0.000003 5
1 13999973 C A 0.000002 5
1 13999974 G A 0.000002 5
1 13999975 C A 0.000002 5
1 13999976 C A 0.000002 5
1 13999977 A C 0.000000 5
1 13999978 C A 0.000000 5
1 13999979 T A 0.000000 5
1 13999980 G A 0.000001 5
</pre>
</div>

If we are only interested in sites with a p-value below 10^(-6) for being variable:
<pre>
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -SNP_pval 1e-6
</pre>
<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c TSK.mafs.gz | head
<pre class="mw-collapsible-content">
chromo position major minor unknownEM pu-EM nInd
1 14000873 G A 0.282476 0.000000e+00 10
1 14001018 T C 0.259890 7.494005e-14 9
1 14001867 A G 0.272099 6.361578e-14 10
1 14002422 A T 0.377890 0.000000e+00 9
1 14003581 C T 0.194393 5.551115e-16 9
1 14004623 T C 0.259172 2.424727e-13 10
1 14007493 A G 0.297176 5.114086e-07 9
1 14007558 C T 0.381770 0.000000e+00 8
1 14007649 G A 0.220547 1.054967e-11 9
</pre>
</div>
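The .mafs.gz files shown above can also be inspected after the run. The sketch below reads such a file and applies the same kind of thresholds to the output columns. It assumes whitespace-separated columns with the header names shown above (unknownEM, pu-EM, nInd); the helper names are hypothetical, and the demo file contains four rows copied from the -SNP_pval example.

```python
import gzip
import os
import tempfile

def read_mafs(path):
    """Read a gzipped .mafs table into a list of dicts keyed by column name."""
    rows = []
    with gzip.open(path, "rt") as handle:
        header = handle.readline().split()
        for line in handle:
            fields = line.split()
            if fields:
                rows.append(dict(zip(header, fields)))
    return rows

def filter_mafs(rows, min_maf=0.0, min_ind=0, max_pval=None):
    """Keep rows that pass every requested threshold (the filters are chained)."""
    kept = []
    for row in rows:
        if float(row["unknownEM"]) < min_maf:
            continue
        if int(row["nInd"]) < min_ind:
            continue
        # The pu-EM column is only present when the run used -SNP_pval
        if max_pval is not None and float(row["pu-EM"]) >= max_pval:
            continue
        kept.append(row)
    return kept

# Demo data: four rows copied from the -SNP_pval output shown above
DEMO = (
    "chromo\tposition\tmajor\tminor\tunknownEM\tpu-EM\tnInd\n"
    "1\t14000873\tG\tA\t0.282476\t0.000000e+00\t10\n"
    "1\t14001018\tT\tC\t0.259890\t7.494005e-14\t9\n"
    "1\t14007493\tA\tG\t0.297176\t5.114086e-07\t9\n"
    "1\t14007558\tC\tT\t0.381770\t0.000000e+00\t8\n"
)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "TSK.mafs.gz")
    with gzip.open(demo_path, "wt") as handle:
        handle.write(DEMO)
    demo_rows = read_mafs(demo_path)

# Sites with data for at least 9 individuals
print(len(filter_mafs(demo_rows, min_ind=9)))
```

Filtering inside angsd is generally preferable because discarded sites never reach the downstream analyses; a post-hoc filter like this one only changes which output rows you look at.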

SFS Estimation

2019-10-25T15:52:00Z

Thorfinn:

Latest version can now do bootstrapping. Folding should now be done in realSFS and not in the saf file generation.

=Quick Start=
The process of estimating the SFS and the multidimensional SFS has improved a lot in the newer versions.

Assuming you have a bam/cram file list in the file 'file.list' and you have your ancestral state in ancestral.fasta, then the process is:

<pre>
#no filtering
./angsd -b file.list -gl 1 -anc ancestral.fasta -dosaf 1
#or a lot of filtering
./angsd -b file.list -gl 1 -anc ancestral.fasta -dosaf 1 -baq 1 -C 50 -minMapQ 30 -minQ 20

#this will generate 3 files:
# 1) angsdput.saf.idx 2) angsdput.saf.pos.gz 3) angsdput.saf.gz
#these are binary files that are formally defined in https://github.com/ANGSD/angsd/blob/newsaf/doc/formats.pdf

#To find the global SFS based on the run from above simply do
./realSFS angsdput.saf.idx
##or only use chromosome 22
./realSFS angsdput.saf.idx -r 22

## or specific regions
./realSFS angsdput.saf.idx -r 22:100000-150000000

##or limit to a fixed number of sites
./realSFS angsdput.saf.idx -r 17 -nSites 10000000

##or you can find the 2dim sfs by
./realSFS ceu.saf.idx yri.saf.idx
##NB the program will find the intersect internally. No need for multiple runs with angsd main program.

##or you can find the 3dim sfs by
./realSFS ceu.saf.idx yri.saf.idx MEX.saf.idx
</pre>

=SFS=
This method estimates the site frequency spectrum (SFS); the method is described in [[Nielsen2012]]. The theory behind the model is briefly described [[realSFSmethod|here]].

This is a 2-step procedure: first generate a ".saf" file (site allele frequency likelihood), then optimize over the .saf file to estimate the site frequency spectrum (SFS).

For the optimization we have implemented 2 different approaches, both found in the misc folder. The diagram below shows how the method goes from raw BAM files to the SFS.

You can also estimate a [[2d SFS Estimation| 2dsfs]] or even higher if you want to.
<pre>
* NB the ancestral state needs to be supplied for the full SFS, but you can use -fold 1 to estimate the folded SFS and then use the reference as ancestral.
* NB the output from -doSaf 2 is not sample allele frequency likelihoods but sample allele posteriors.
Applying realSFS to this output therefore does NOT give the ML estimate of the SFS as described in the Nielsen 2012 paper,
but the estimate described in the 'Incorporating deviations from Hardy-Weinberg Equilibrium (HWE)' section of that paper.

</pre>
<classdiagram type="dir:LR">
[sequence data{bg:orange}]->GL[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]
[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]->doSaf[.saf file{bg:blue}]
[.saf file{bg:blue}]->optimize('realSFS')[.saf.ml file{bg:red}]
</classdiagram>

=Brief Overview=
<pre>
./angsd -doSaf
-> angsd version: 0.910-76-gad32889 (htslib: 1.3-32-gecdc348) build(Mar 2 2016 12:38:33)
-> Analysis helpbox/synopsis information:
-> Command:
./angsd -doSaf -> Wed Mar 2 12:47:13 2016
--------------
abcSaf.cpp:
-doSaf 0
1: perform multisample GL estimation
2: use an inbreeding version
3: calculate genotype probabilities (use -doPost 3 instead)
4: Assume genotype posteriors as input (still beta)
-doThetas 0 (calculate thetas)
-underFlowProtect 0
-fold 0 (deprecated)
-anc (null) (ancestral fasta)
-noTrans 0 (remove transitions)
-pest (null) (prior SFS)
-isHap 0 (is haploid beta!)
-doPost 0 (doPost 3,used for accesing saf based variables)
NB:
If -pest is supplied in addition to -doSaf then the output will then be posterior probability of the sample allelefrequency for each site

</pre>

<pre>
misc/realSFS
./realSFS afile.saf.idx [-start FNAME -P nThreads -tole tole -maxIter -nSites ]
</pre>
For information and parameters concerning the realSFS subprogram go here: [[realSFS]]

=Options=
;-doSaf 1: Calculate the Site allele frequency likelihood based on individual genotype likelihoods assuming HWE

;-doSaf 2:(version above 0.503) Calculate per site posterior probabilities of the site allele frequencies based on individual genotype likelihoods while taking into account individual inbreeding coefficients. This is implemented by Filipe G. Vieira. You need to supply a file containing all the inbreeding coefficients with -indF. Consider whether you want the MAP estimate obtained by using all sites, or the standardized values obtained by conditioning on the called SNP sites. See the bottom of this page for examples.

;-doSaf 3: Calculate the genotype posterior probabilities for all samples for all sites, using an estimate of the SFS (sample allele frequency distribution). This needs a prior distribution of the SFS (which can be obtained from -doSaf 1/realSFS).

;-doSaf 4: Calculate the posterior probabilities of the sample allele frequency distribution for each site based on genotype probabilities. The genotype probabilities should be provided by the user using the -beagle option. Often the genotype probabilities will be obtained by haplotype imputation.

;-underFlowProtect [INT]
0: (default) no underflow protection. 1: use underflow protection. For large data sets (large number of individuals) underflow protection is needed.

=Output file=
The output file from ''-doSaf'' is described in detail in angsd/doc/formats.pdf. These binary files can be printed with
<pre>
realSFS print myfile.saf.idx
#or
realSFS print myfile.saf.idx -r chr1:10000-20000
</pre>
==Example==
A full example is shown below where we use the test data that can be found on the [[quick start]] page. In this example we use GATK genotype likelihoods.

First generate the .saf file with 4 threads:
<pre>
./angsd -bam bam.filelist -doSaf 1 -out small -anc chimpHg19.fa -GL 2 -P 4
</pre>
We always recommend that you filter out bases with bad qscores and reads with meaningless mapQ, e.g. '''-minMapQ 1 -minQ 20'''. The above analysis with these filters can be written as:
<pre>
./angsd -bam bam.filelist -doSaf 1 -out small -anc chimpHg19.fa -GL 2 -P 4 -minMapQ 1 -minQ 20
</pre>
Obtain a maximum likelihood estimate of the SFS using the EM algorithm:
<pre>
misc/realSFS small.saf.idx -maxIter 100 -P 4 >small.sfs
</pre>

[[File:SfsSmall.png|thumb]]

A plot of the estimated SFS is shown on the right. The jaggedness is due to the very low number of sites in this small dataset.

=Interpretation of the output file=
Each row corresponds to a region of the genome (see below) and contains the expected values of the SFS for that region.
==NB==
The .saf file contains a saf for each site, whereas the optimization requires information for a whole region of the genome. The optimization will therefore use large amounts of memory.
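As an illustration of how one row of the realSFS output can be read, the sketch below parses a row, checks that it has 2*nInd+1 categories, and computes the proportion of variable sites. The row is copied from the Validation section of this page (12 simulated individuals, 50000 sites); the function name is hypothetical.

```python
def summarize_sfs(line, n_ind):
    """Parse one row of realSFS output for `n_ind` diploid individuals.

    The row is expected to hold 2*n_ind+1 expected counts, one per
    derived allele count from 0 to 2*n_ind.
    """
    counts = [float(x) for x in line.split()]
    expected_bins = 2 * n_ind + 1
    if len(counts) != expected_bins:
        raise ValueError(
            "expected %d categories, found %d" % (expected_bins, len(counts))
        )
    total = sum(counts)
    proportions = [c / total for c in counts]
    # Sites that are neither fixed ancestral (first bin) nor fixed derived (last bin)
    variable_fraction = 1.0 - proportions[0] - proportions[-1]
    return {
        "n_sites": total,
        "proportions": proportions,
        "variable_fraction": variable_fraction,
    }

# Row taken from the -doSaf 1 validation run further down this page
ROW = (
    "31465.429798 4938.453115 2568.586388 1661.227445 1168.891114 975.302535 "
    "794.727537 632.691896 648.223566 546.293853 487.936192 417.178505 "
    "396.200026 409.813797 308.434836 371.699254 245.585920 322.293532 "
    "282.980046 292.584975 212.845183 196.682483 221.802128 236.221205 197.914673"
)

summary = summarize_sfs(ROW, n_ind=12)
print(round(summary["n_sites"]), round(summary["variable_fraction"], 4))
```

The expected counts in a row should sum to the number of sites used in the optimization, which is a quick sanity check on the output.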

=Folded spectra=
If you don't have the ancestral state, you can instead estimate the folded SFS. This is done by supplying the -anc with the reference genome and applying -fold 1 to realSFS.

The above example would then be

<pre>
#first generate .saf file
./angsd -bam bam.filelist -doSaf 1 -out smallFolded -anc chimpHg19.fa -GL 2
#now run the EM optimization with 4 threads, folding the spectrum
misc/realSFS smallFolded.saf.idx -maxIter 100 -P 4 -fold 1 >smallFolded.sfs
#in R
sfs<-scan("smallFolded.sfs")
barplot(sfs[-1])
</pre>
[[File:SmallFolded.png|thumb]]
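To illustrate what folding means, the sketch below folds an already estimated unfolded spectrum by adding together the categories for i and 2N-i derived alleles. This only shows the concept: realSFS performs the folding inside the optimization when -fold 1 is given, and the layout of its folded output may differ from the list returned here. The function name and the example numbers are hypothetical.

```python
def fold_sfs(unfolded):
    """Fold an unfolded 1D SFS with 2N+1 categories into N+1 categories.

    Category i of the folded spectrum is the sum of categories i and 2N-i
    of the unfolded spectrum; the middle category is only counted once.
    """
    n_chrom = len(unfolded) - 1  # number of chromosomes, 2N
    folded = []
    for i in range(n_chrom // 2 + 1):
        j = n_chrom - i
        if i == j:
            folded.append(unfolded[i])
        else:
            folded.append(unfolded[i] + unfolded[j])
    return folded

# Hypothetical unfolded spectrum for 2 diploid individuals (5 categories)
print(fold_sfs([10.0, 4.0, 3.0, 2.0, 1.0]))
```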

=Posterior of the per-site distributions of the sample allele frequency=
If you supply a prior for the SFS (which can be obtained from the -doSaf/realSFS analysis), the output of the .saf file will no longer be site allele frequency likelihoods but will instead be the log posterior probability of the sample allele frequency for each site.

=Format specification of binary .saf* files=
This can be found in the angsd/doc/formats.pdf

* If the -fold 1 has been set, then the dimension is no longer 2*nInd+1 but nInd+1 (this is deprecated)
* If the -pest parameter has been supplied the output is no longer likelihoods but log posterior site allele frequencies

=Bootstrapping=
We have recently added the possibility to bootstrap the SFS, which can be very useful for getting confidence intervals of the estimated SFS.

This is done by:

<pre>
realSFS pop.saf.idx -bootstrap 100 -P number_of_cores
</pre>
The program will then give you 100 estimates of the SFS, based on data that has been resampled with replacement.
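One way to turn the bootstrap replicates into confidence intervals is to take percentiles per SFS category. The sketch below assumes each replicate is one row of expected counts (as in the regular realSFS output) and uses a simple nearest-rank percentile; the function names and the toy numbers are hypothetical.

```python
def percentile(sorted_values, q):
    """Nearest-rank style percentile of an already sorted list, 0 <= q <= 1."""
    idx = int(round(q * (len(sorted_values) - 1)))
    return sorted_values[idx]

def bootstrap_intervals(replicates, lower=0.025, upper=0.975):
    """Return one (low, high) interval per SFS category.

    `replicates` is a list of bootstrap estimates, each a list of
    expected counts with the same number of categories.
    """
    n_bins = len(replicates[0])
    intervals = []
    for k in range(n_bins):
        values = sorted(rep[k] for rep in replicates)
        intervals.append((percentile(values, lower), percentile(values, upper)))
    return intervals

# Toy example: 5 replicates of a 2-category spectrum (hypothetical numbers)
toy = [[10, 5], [12, 4], [11, 6], [9, 5], [13, 3]]
print(bootstrap_intervals(toy))
```

With only a handful of replicates the interval is just the observed range; with 100 replicates as in the command above, the 2.5% and 97.5% percentiles give an approximate 95% interval.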

=How to plot=
Assuming that we have obtained a single global SFS (only one line in the output) from the '''realSFS''' program, and this is located in '''small.sfs''', then we can plot the results simply like:
<pre>
sfs<-(scan("small.sfs")) #read in the sfs
barplot(sfs[-c(1,length(sfs))]) #plot variable sites
</pre>
[[File:SfsSmall.png|thumb]]
We can make it more fancy like below:

<pre>
#function to normalize
norm <- function(x) x/sum(x)
#read data and normalize so that the categories sum to 1
sfs <- norm(scan("small.sfs"))
#the variability as a percentage
pvar<- (1-sfs[1]-sfs[length(sfs)])*100
#the variable categories of the sfs
sfs<-norm(sfs[-c(1,length(sfs))])
barplot(sfs,legend=paste("Variability:= ",round(pvar,3),"%"),xlab="Chromosomes",
names=1:length(sfs),ylab="Proportions",main="mySFS plot",col='blue')
</pre>
[[File:SfsSmallFine.png|thumb]]

If your output from '''realSFS''' contains more than one line, it is because you have estimated multiple local SFS's. Then you can't use the above commands directly but should first pick a specific row.

<pre>
sfs<-(as.numeric(read.table("multiple.sfs")[1,])) #first region.
#do the above
sfs<-(as.numeric(read.table("multiple.sfs")[2,])) #second region.
</pre>

=Which genotype likelihood model should I choose ?=
It depends on the data. As shown in this example [[Glcomparison]], there was a huge difference between '''-GL 1''' and '''-GL 2''' for older 1000genomes BAM files, but little difference for newer BAM files.
=Validation=
The validation is based on the pre 0.900 version
==-doSaf 1==
<pre>
cd misc;
./supersim -outfiles test -npop 1 -nind 12 -pvar 0.9 -nsites 50000
echo testchr1 100000 >test.fai
../angsd -fai test.fai -glf test.glf.gz -nind 12 -doSaf 1 -issim 1
./realSFS angsdput.saf 24 2>/dev/null >res
cat res
31465.429798 4938.453115 2568.586388 1661.227445 1168.891114 975.302535 794.727537 632.691896 648.223566 546.293853 487.936192 417.178505 396.200026 409.813797 308.434836 371.699254 245.585920 322.293532 282.980046 292.584975 212.845183 196.682483 221.802128 236.221205 197.914673
</pre>

==-doSaf 2==
<pre>
ngsSim=../ngsSim/ngsSim
angsd=./angsd
realSFS=./misc/realSFS

$ngsSim -npop 1 -nind 24 -nsites 1000000 -depth 4 -F 0.0 -outfiles testF0.0
$ngsSim -npop 1 -nind 24 -nsites 1000000 -depth 4 -F 0.9 -outfiles testF0.9

for i in `seq 24`;do echo 0.9;done >indF
echo testchr1 250000000 >test.fai
$angsd -fai test.fai -issim 1 -glf testF0.0.glf.gz -nind 24 -out noF -dosaf 1
$angsd -fai test.fai -issim 1 -glf testF0.9.glf.gz -nind 24 -out withF -dosaf 2 -domajorminor 1 -domaf 1 -indF indF
$angsd -fai test.fai -issim 1 -glf testF0.9.glf.gz -nind 24 -out withFsnp -dosaf 2 -domajorminor 1 -domaf 1 -indF indF -snp_pval 1e-4

$realSFS noF.saf 48 >noF.sfs
$realSFS withF.saf 48 >withF.sfs

#in R
norm <- function(x) x/sum(x) #used for the last plot below
trueNoF<-scan("testF0.0.frq")
trueWithF<-scan("testF0.9.frq")
pdf("sfsFcomparison.pdf",width=14)
par(mfrow=c(1,2))
barplot(trueNoF[-1],main='true sfs F=0.0')
barplot(trueWithF[-1],main='true sfs F=0.9')

estWithF<-scan("withF.sfs")
estNoF<-scan("noF.sfs")

barplot(rbind(trueNoF,estNoF)[,-1],main="true vs est SFS F=0 (ML) (all sites)",be=T,col=1:2)
barplot(rbind(trueWithF,estWithF)[,-1],main='true vs est sfs=0.9 (MAP) (all sites)',be=T,col=1:2)

readBjoint <- function(file=NULL,nind=10,nsites=10){
ff <- gzfile(file,"rb")
m<-matrix(readBin(ff,double(),(2*nind+1)*nsites),ncol=(2*nind+1),byrow=TRUE)
close(ff)
return(m)
}

m <- exp(readBjoint("withF.saf",nind=24,5e6))
barplot(rbind(trueWithF,colMeans(m))[,-1],main='true vs est sfs F=0.9 (colmean of site pp) (all sites)',be=T,col=1:2)
m <- exp(readBjoint("withFsnp.saf",nind=24,5e6))
m <- colMeans(m)*nrow(m)
##m contains SFS for absolute frequencies
m[1] <-1e6-sum(m[-1])
##m now contains a corrected estimate containing the zero category
barplot(rbind(trueWithF,norm(m))[,-1],main='true vs est sfs F=0.9 (colmean of site pp) (called snp sites)',be=T,col=1:2)

dev.off()

</pre>
See results from above here: http://www.popgen.dk/angsd/sfsFcomparison.pdf

=safv3 comparison=
Between 0.800 and 0.900 I decided to move to a better format than the raw saf files. This new format takes up half the storage, allows for easy random access and generalizes to up to 5-dimensional SFS. A comparison can be found here: [[safv3]]
=Using NGStools=
See [[realSFS]] for how to convert the new safformat to the old safformat if you use NGStools.

Installation

2019-08-26T10:12:32Z

Thorfinn:

There has been some confusion about the versions of ANGSD.

* Even versions are freezes from the last odd git version

* Odd versions are git versions. Once there have been enough commits we will increment and make a release.
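The even/odd rule above can be expressed as a small helper that looks at the last digit of a version string. This is only a sketch of the naming convention as described here; the function name is hypothetical, and older version strings with letter suffixes (such as 0.612a) are not handled.

```python
def version_kind(version):
    """Classify a version string such as '0.930' by the parity of its last digit.

    Returns 'release' for even versions (freezes) and 'git' for odd versions.
    """
    last = version.strip()[-1]
    if not last.isdigit():
        raise ValueError("version must end in a digit: %r" % version)
    return "release" if int(last) % 2 == 0 else "git"

print(version_kind("0.930"))
print(version_kind("0.931"))
```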

=Download and Installation=
To download and use ANGSD you need to download the htslib and the angsd source folders.

You can either download angsd0.930.tar.gz, which contains both:
[http://popgen.dk/software/download/angsd/angsd0.930.tar.gz]

Or you can use github for the latest version of both htslib and angsd.

Earlier versions from here: http://popgen.dk/software/download/angsd/
And here: https://github.com/ANGSD/angsd/releases

=Install=
Download and unpack the tarball, enter the directory and type make. Users on a Mac can use curl instead of wget.

===Unix===
The software can be compiled using make.
<pre>
wget http://popgen.dk/software/download/angsd/angsd0.930.tar.gz
tar xf angsd0.930.tar.gz
cd htslib;make;cd ..
cd angsd
make HTSSRC=../htslib
cd ..
</pre>
The executable is then located in '''angsd/angsd'''.

=Install from github=
To get CRAM support you also need to install htslib. This can be done using the following commands:

<pre>
git clone https://github.com/samtools/htslib.git
git clone https://github.com/ANGSD/angsd.git
cd htslib;make;cd ../angsd ;make HTSSRC=../htslib
</pre>

=Systemwide installation of htslib?=
If htslib is installed systemwide, you just type make in the angsd directory.

Change log

2019-08-26T10:10:21Z

Thorfinn:

=Latests=
Odd versions are github versions...
*0.931 https://github.com/ANGSD/angsd/compare/0.929...0.931
*0.929 https://github.com/ANGSD/angsd/compare/0.925...0.929
*0.925 https://github.com/ANGSD/angsd/compare/0.923...0.925
*0.923 https://github.com/ANGSD/angsd/compare/0.921...0.923
*0.921 https://github.com/ANGSD/angsd/compare/0.918...0.921
*0.918 https://github.com/ANGSD/angsd/compare/0.916...0.918
*0.916 https://github.com/ANGSD/angsd/compare/0.914...0.916
*0.914 https://github.com/ANGSD/angsd/compare/0.912...0.914
*0.912 https://github.com/ANGSD/angsd/compare/0.911...0.912
*0.911 https://github.com/ANGSD/angsd/compare/0.910...0.911
*0.910 https://github.com/ANGSD/angsd/compare/0.901...0.910
*0.901 https://github.com/ANGSD/angsd/compare/0.900...0.901
*0.900 https://github.com/ANGSD/angsd/compare/0.800...0.900
*0.800 https://github.com/ANGSD/angsd/compare/0.700...0.800
*0.700 https://github.com/ANGSD/angsd/compare/0.615...0.70

=0.6***=
*0.615 https://github.com/ANGSD/angsd/compare/0.614...0.615
*0.614 https://github.com/ANGSD/angsd/compare/0.613...0.614
*0.613 https://github.com/ANGSD/angsd/compare/0.612a...0.613
*0.612 https://github.com/ANGSD/angsd/compare/0.610...0.612a
*0.610 https://github.com/ANGSD/angsd/compare/0.609...0.610a
*0.609 https://github.com/ANGSD/angsd/compare/0.608...0.609
*0.608 not super useful but we now compile with knetfile, so users can use remote .fa files. I really recommend that users download the fasta instead.
*0.607 changed name of all abstract base classes to the more reasonable abc*.cpp. Included contamination and the iCounts format. Added a templated class so users can see how to access the internal datastructures.
*0.606 added more info in the thetaStat subprogram renamed all analysis classes to abc*.cpp.
*0.605 continued updating the Eunjung code
*0.604 added a 'job' array for the analysis classes which should greatly reduce the number of needed function calls. I doubt that will make any noticeable speed difference. Copied and modified some code from Eunjung (from jnovembre lab) for a banded approach in the saf calculation. This is not working yet (-doSaf 2)
*0.603 1) fixed wrong branching if users used simfiles. 2) fixed 2 bugs in the inbreeding version of angsd_realSFS.cpp. Very small changes
*0.602 1) programs output the actual chromosome, if there is a mismatch between the fasta and bamheader 2) added a check that the index files (fa vs fai,bam vs bai) are newer than the files they index. 3) Program now validates that the .so and include files used are the same.
*0.601 1) started the move to includeflags/discardflags, outcommented in this version 2) validates that data has been generated for the different samples by looking at difference between values, instead of comparing against -0.0. 3) MAF estimates now skips estimation for a site if the updated GL shows that GLs are noninformative.
*0.600 fixed a change in default FLAGS when using bam files. If you hadn't removed the unmapped reads from your bamfiles these would have been included in the analysis.

=0.589 to 599=
* 0.589 removed soap/sim/glf/glfclean(bin and text) and tglf.
* 0.590 added text mpileup as new input format. Very useful
* 0.591 refactored the file reading, moved the arguments to multi reader, such that all file reading is done from multi reader. FREEZE version. We will only allow bug fixes in the next many versions.
* 0.592 -pest output in angsd_realSFS.cpp. smartcounts. fixed a bug in hetplas. edited the printout msg for 'if you really want angsd to exit'. Maybe fixed an unknown bug. line 2 in .arg or screen output is now commandline used, and some other visual stuff. Program didn't complain if -doMaf but no -domajorminor. if chromosome name contained ':' it wouldn't work strchr->strrchr
* 0.593 fixed a strange bug, where the program would crash if, no analysis was chosen. fixed the 'shouldbeone' bug.
* 0.594 nochanges
* 0.595 nochanges
* 0.596 fixed a bug in persite depth counter (double count of C alleles). fixed a bug in smartcount subprogram.
* 0.597 treemix input file generation from smart counts
* 0.598 fixed a wrong compile flag in one of the utility programs 'smartCount'
* 0.599 much more informative messages if users forgot to add -fai argument. Fixed a bug in parsing of arguments with -doSaf 4.

=<0.570 and <0.588=
* 0.571 removed bf's from maf class, negative values of -domaf disabled dumping of files, inbreeding has been added
* 0.572 Some funky new approach for the makefile is now being used, minor bug fixes (minDepth -> setMinDepth, extra header column in thetas.gz has been fixed)
* 0.573 added better info if bamfiles have different header. Added check if length of supplied reference/ancestral doesn't match bamheader. autosize in emOptim2 for 2dsfs. Fixed subtle issue if very large coverage difference between bamfiles, now the 'biggest (file size)' is used to select region instead of the first bamfile. Netaccess is now deprecated.
* 0.574 modified emOptim2 so now compiles on mac
* 0.575 smaller fixes to the inbreeding parsing, added p-value in analysisMaf instead of the raw llh.
* 0.576 -doSNP and -minLRT now deprecated, please use -SNP_pval instead
* 0.577 bugfix for -SNP_pval if value was 1. doHWE now called -HWE_pval and can be used for filtering.
* 0.578 speedup in HWE stuff
* 0.579 fixed an extremely rare assertion error (program was working, assertion was off). Redid all strcmp to strcasecmp. Fixed a bug in -doMaf 2 with -snp_pval
* 0.580 nochanges...
* 0.581 if trimming has been enabled, N's will be plugged in instead of the bases. A number of small changes.
* 0.582 nochanges...
* 0.583 1) changed bugfix when using counts based estimator for major/minor 2) keepsites is now using the effective number of samples in all cases 3) changed output of maf to a 'nicer' format
* 0.584 updated internal testing scripts.
* 0.585 1) fixed 'baq complains even though -ref was supplied' 2) fixed -doMajorMinor 4 and doMajorMinor 5 (sites not discarded) 3) added tremendously better information for the -sites 4) added some check for -doPost 5) program can now exit uncleanly if ctrl+c is pressed 3 times. 6) added an else to catch wrong arg in thetaStat.
*0.586 1) fixed parsing of pars if input is -sim1 2) fixed a bug in -doFasta 3 (the ebd one) 3) fixed a printout problem in -doFasta
*0.587 1) fixed a number of minor instances where memory wasn't being freed/deleted (mostly for keeping valgrind silent) 2) fixed a memleak if -sites files contained 4 columns but -doMajorMinor not 3. 3) fixed a memory leak that could occur if -doPost 2 and -doMaf 0.
* 0.588 1) fixed small printout error that could cause segfaults in rare cases. 2) change stderr to pos+1 3) changed some checks of user supplied pars 4) fixed a stack overflow if a very long -rf file was supplied.

=old=
==Dirty==
* 0.16 Is now bundled with SAMtools-0.1.17 and the mpileup (and friends) commands can be used for passing data to dirty.
* 0.17 added extra options -minInd and -minMaf, for only printing and using sites above a threshold
* 0.18 added option to pass reference and ancestral allele as fasta files.(using faidx format) (doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE)
* 0.19 added support for tglf inputfiles, -tglf -posfile see runexamples, also added the likeratio test for snp calling
* 0.20 Added the check for missing data, before the major/minor. included -realSFS, changed the deallocation of the -doMAF results, such that its proper cleaned up.
* 0.21 refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that prevented -samglf and samglfclean from working)
* 0.22 Well this update was a mixture of edits from [[user:albrecht]] and BGI so its difficult to give a concise description
* 0.23 Program can now read simulated files (single pop only). An example can be seen in "full example ... sfs" and input types.
* 0.24 added the tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso
* 0.25 the depth is now being populated when using mpileup -g. The program can now get the counts from mpileup
==ANGSD==
=<0.5=
* 0.01.a - 0.01.b The bfgs now supports threading, maybe anders implemented a heterozygosity estimator.
* 0.01.c A problem if we didn't observe any llh, caused the MAF estimator to 'nan'.
* 0.02 Fixed small bug in bfgs optimization of sfs optimization. When choosing a region bigger than what was covered by the .sfs file the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
* 0.03 added and documented genotypecaller, can dump counts,-realSFS 1 dumps positions, -realSFS 2 is deprecated,S,pi and tajima has been added to sfstools along with possibility to do prior
* 0.3 clean version with less features. The lost features will be reintroduced later.
* 0.43 first very clean version, everything should be included
* 0.441 rewrote the SOAPsnp GL model, -L and -maxQ is not needed anymore. Also added an option to choose an output dir for the recalibration matrix
* 0.4471 error estimation is now working, the fasta reading is now threadsafe. all GLs are now likeratios.
* 0.512 After 0.500 we have changed the internal structure such that each chunk is enforced to be on the same chr. version c) fixes a problem of hardclipping
* 0.515 Alot of legacy code has been removed from mUppile.cpp. Program can now use remote files, build on code from SAMtools
* 0.520 Alot more legacy code has been removed from mUpPile.cpp. Program now does baq and adjustment of mapQ similar to -C in samtools. Also compiles on osx, but this is not supported
* 0.535 Bug in internal representation of mapQ's (only problematic for mapQ>128), we now use the flag to determine if a read has mapped. calcstat is now deprecated, users should use the bgid program now.
* 0.538 changed position output in association part. Fixed incorrect assert assumption in mUpPile.cpp. Added some downsampling options for errorEst and changed internal buffering when reading beagle files to allow for >10k individuals.
* 0.549 too many changes to remember

=<0.5 and <0.570=
* 0.551 tajima paper is now published, so the emOptim2 and bgid has now been properly documented. plink output is now supported and some snp filters can be outputted.
* 0.552 minor bug when calling genotypes without defining postcutoff -> missingness couldnt occur. removed the optimSFS and emOptim from the default compilelist
* 0.553 uint removed from code.
* 0.554 plugged in sfstools functionality into main angsd, (ability to output log posts)
* 0.555 anders added some consensus stuff
* 0.556 updated the filtering (if binary rep of keep file is incomplete it is removed again. It checks timestamps to see if file has been updated), folded spectra analysis should now be working
* 0.557 There was a bug in the realsfs part of the code, that was created in the 0.556 version. 0.557 is simply a fix of this, and the removal of a warning compiler flag in the msToGlf subprogram. We only observed the problematic compiler flag on a osx machine
* 0.558 tempversion,from this version bgid is now called thetaStat
* 0.559 abbababa
* 0.560 analysisCount.cpp has been updated to the nice standard of the wiki
* 0.561 program now compiles on clang, many small compiler warnings has been fixed.
* 0.562 Merge of forked versions, abba-baba fasta.
* 0.563 minQ filter has been moved to a much earlier step. Previously it was downstream classes that checked this. Now a base will be set to 'n' if it is below the threshold
* 0.564 cleaned up funky pars maf/asso such that all results are in ->extras[]
* 0.565 moved file reading stuff from shared to analysisFunction in namespace ail::
* 0.566 cleaned up small things, added a newer version of hetplas
* 0.567 cleaned up small things again. Started to add single pars e.g. -P -b
* 0.568 refactored compile order in general.cpp
* 0.569 added saf genotype calling, changed name from -realsfs to -doSaf
* 0.570 modified emOptim2 to estimate nSites and tell how much memory it will use, fixed empty -bam file

MediaWiki:Sitenotice

2019-08-26T10:09:53Z

Thorfinn:

ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.930/0.931 on github), see [[Change_log]] for changes, and download it [[Download and installation | here]].

Thorfinn

2019-08-26T10:09:00Z

Thorfinn:

=How to deploy=
This page describes how a new version should be put on wiki and github
==Make a combined angsd htslib to put on wiki download==
Copy latest, make a new annotated tag and push to wiki
<pre>
VERSION=0.930
mkdir delme
cd delme
git clone --depth=1 https://github.com/ANGSD/angsd
git clone --depth=1 https://github.com/SAMtools/htslib
cd angsd; git tag -a ${VERSION} -m "time for version ${VERSION}"
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
cd ..
tar --exclude='.git' -czvf angsd${VERSION}.tar.gz angsd htslib/
scp angsd${VERSION}.tar.gz software@popgen.dk:/home/software/download/angsd/
</pre>

==Make a new github version to put on github==

<pre>
VERSION=0.931
cd angsd/
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
git commit Makefile -m "Preparing new ${VERSION} version"
git tag -a ${VERSION} -m "time for a new version ${VERSION}"
git push
git push --tags
</pre>

Fst

2019-08-17T01:38:16Z

Thorfinn: /* Note about fst for folded spectra */

Our program can estimate fst between populations, and has been generalized to give all pairwise fst estimates if you supply the command with multiple populations.

;If you supply 3 populations, the program will also output the pbs statistic.
;NB we have removed the not very useful unweighted fst estimator from the output, and have included a header. The output example below will be updated at some point.
The procedure is

- Use angsd for calculating '''saf''' files for each population

- Use realSFS to calculate 2d sfs for each pair

- Use the above calculated 2dsfs as priors jointly with all '''safs''' from step1 to calculate '''fst''' binary files

- Use realSFS to extract the fst values from the '''fst''' files

NB:
In the latest github version there is a different fst estimator which should be preferable for small sample sizes. Feel free to try that out with:
<pre>
./realSFS fst index [saf.idx's] -whichFst 1
</pre>

=Note about fst for folded spectra=
* Earlier versions of angsd/realSFS could output folded 1d sample allele frequencies, which are useful for single-population neutrality tests such as Tajima's D. These are however not appropriate for calculating fst, since the folding was done within each population.

* We have therefore added a proper folding procedure for the optimization based on the UNFOLDED .saf.idx files generated by -doSaf. These are the ones that should be used for calculating fst.
Therefore please remember to add -fold 1 if you want angsd (the realSFS subfunction) to perform fst and pbs estimation using the folded spectra.

=Two Populations real data=
<pre>
#this is with 2pops
#first calculate per pop saf for each population
../angsd -b list1 -anc hg19ancNoChr.fa -out pop1 -dosaf 1 -gl 1
../angsd -b list2 -anc hg19ancNoChr.fa -out pop2 -dosaf 1 -gl 1
#calculate the 2dsfs prior
../misc/realSFS pop1.saf.idx pop2.saf.idx >pop1.pop2.ml
#prepare the fst for easy window analysis etc
../misc/realSFS fst index pop1.saf.idx pop2.saf.idx -sfs pop1.pop2.ml -fstout here
#get the global estimate
../misc/realSFS fst stats here.fst.idx
-> FST.Unweight:0.069395 Fst.Weight:0.042349
#below is not tested that much, but seems to work
../misc/realSFS fst stats2 here.fst.idx -win 50000 -step 10000 >slidingwindow
</pre>

=3 Populations real data=
In the commands below I am using 24 threads, because this is what I have. Adjust accordingly.
<pre>
#this is with 3pops
#first calculate per pop saf for each population
./angsd -b list10 -anc hg19ancNoChr.fa -out pop1 -dosaf 1 -gl 1
./angsd -b list11 -anc hg19ancNoChr.fa -out pop2 -dosaf 1 -gl 1
./angsd -b list12 -anc hg19ancNoChr.fa -out pop3 -dosaf 1 -gl 1
#calculate all pairwise 2dsfs's
./misc/realSFS pop1.saf.idx pop2.saf.idx -P 24 >pop1.pop2.ml
./misc/realSFS pop1.saf.idx pop3.saf.idx -P 24 >pop1.pop3.ml
./misc/realSFS pop2.saf.idx pop3.saf.idx -P 24 >pop2.pop3.ml
#prepare the fst for easy analysis etc
./misc/realSFS fst index pop1.saf.idx pop2.saf.idx pop3.saf.idx -sfs pop1.pop2.ml -sfs pop1.pop3.ml -sfs pop2.pop3.ml -fstout here
#get the global estimate
./misc/realSFS fst stats here.fst.idx
-> Assuming idxname:here.fst.idx
-> Assuming .fst.gz file: here.fst.gz
-> FST.Unweight[nObs:1666245]:0.022063 Fst.Weight:0.034513
0.022063 0.034513
-> FST.Unweight[nObs:1666245]:0.026867 Fst.Weight:0.031989
0.026867 0.031989
-> FST.Unweight[nObs:1666245]:0.025324 Fst.Weight:0.021118
0.025324 0.021118
-> pbs.pop1 0.023145
-> pbs.pop2 0.005088
-> pbs.pop3 0.009367
#below is not tested that much, but seems to work
./misc/realSFS fst stats2 here.fst.idx -win 50000 -step 10000 >slidingwindow
</pre>
In the presence of 3 populations, the program will also calculate the population branch statistic (pbs).
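For reference, the population branch statistic is commonly defined (Yi et al. 2010) from the three pairwise fst values through the transform T = -log(1 - fst). The sketch below follows that textbook definition. The function name <code>pbs</code> is made up for illustration, and it is an assumption, not something stated on this page, that realSFS computes the statistic in exactly this way.

```python
import math

def pbs(fst_12, fst_13, fst_23):
    """Return (pbs1, pbs2, pbs3) from three pairwise fst values.

    Assumed textbook definition: T = -log(1 - fst) and
    pbs1 = (T12 + T13 - T23) / 2, and likewise for the other populations.
    """
    t12 = -math.log(1.0 - fst_12)
    t13 = -math.log(1.0 - fst_13)
    t23 = -math.log(1.0 - fst_23)
    return ((t12 + t13 - t23) / 2.0,
            (t12 + t23 - t13) / 2.0,
            (t13 + t23 - t12) / 2.0)

# Pairwise fst values used purely as illustration
pbs1, pbs2, pbs3 = pbs(0.034513, 0.031989, 0.021118)
print(f"{pbs1:.6f} {pbs2:.6f} {pbs3:.6f}")
```

A large value for one population indicates that its branch is long relative to the other two.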
==Sliding Window output==
The sliding window seems to work so we have documented it here:
;Second column is chromosome, third is center of window followed by:
;fst.unweight(pop1,pop2) fst.weight(pop1,pop2) fst.unweight(pop1,pop3) fst.weight(pop1,pop3) fst.unweight(pop2,pop3) fst.weight(pop2,pop3)
<pre>
(9133,58895)(14010000,14059999)(14010000,14060000) 1 14035000 0.022099 0.016387 0.026686 0.027731 0.025311 0.047920 -0.002231 0.035045 0.030353
(19114,68881)(14020000,14069999)(14020000,14070000) 1 14045000 0.022096 0.019076 0.026777 0.024238 0.025290 0.052793 -0.005220 0.041969 0.029757
(28951,78655)(14030000,14079999)(14030000,14080000) 1 14055000 0.022043 0.021025 0.026915 0.023368 0.025342 0.056975 -0.006884 0.046840 0.030530
(38928,88632)(14040000,14089999)(14040000,14090000) 1 14065000 0.022083 0.016525 0.026846 0.029560 0.025345 0.053421 -0.004116 0.039898 0.034122
(48917,98170)(14050000,14099999)(14050000,14100000) 1 14075000 0.022132 0.022082 0.026742 0.025564 0.025262 0.037071 0.005226 0.024827 0.020671
(74,49191)(14000000,14049999)(14000000,14050000) 10 14025000 0.022704 0.101955 0.026479 0.095713 0.025102 0.001924 0.103108 -0.048378 -0.002500
(9734,58555)(14010000,14059999)(14010000,14060000) 10 14035000 0.022779 0.102670 0.026425 0.088015 0.025118 0.002721 0.098870 -0.043342 -0.006738
</pre>
The last 3 columns are the population branch statistic for population1, population2 and population3.
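A minimal sketch of how such lines could be read in Python is given here. It assumes the layout of the example lines above (region field, chromosome, window center, six fst values, three pbs values); newer versions may print a header and fewer columns. The function name <code>parse_window_line</code> is hypothetical, and the three number pairs in the first field are kept as they are, without interpreting them.

```python
import re

def parse_window_line(line):
    """Split one sliding-window line (3 populations) into named fields."""
    fields = line.split()
    # First field: three "(a,b)" pairs, stored as integer tuples
    region = [(int(a), int(b))
              for a, b in re.findall(r"\((\d+),(\d+)\)", fields[0])]
    values = [float(v) for v in fields[3:]]
    return {
        "region": region,
        "chr": fields[1],
        "center": int(fields[2]),
        "fst": values[:-3],   # unweighted/weighted pairs, in the order listed above
        "pbs": values[-3:],   # one value per population
    }

line = ("(9133,58895)(14010000,14059999)(14010000,14060000) 1 14035000 "
        "0.022099 0.016387 0.026686 0.027731 0.025311 0.047920 "
        "-0.002231 0.035045 0.030353")
rec = parse_window_line(line)
print(rec["chr"], rec["center"], rec["pbs"])
```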
==Relative window positions?==

We allow 3 different ways of defining window positions; these are chosen with the '''-type''' argument in realSFS:

;-type 2 Use pos=1 as the leftmost position of the first window, even if there isn't any data there.

;-type 1 Use the first position with data as the leftmost position of the first window.

;-type 0 Split the genome into blocks, and use the first window that has data for the entire window. The window centers will then be the same across datasets.
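The difference between the three choices can be sketched as follows. This is only an illustration: the helper <code>first_window_start</code> is hypothetical, and for -type 0 it assumes that block boundaries are multiples of the step size, which is not stated on this page.

```python
def first_window_start(window_type, first_pos_with_data, step):
    """Leftmost position of the first window for a given -type value.

    Hypothetical helper; the rounding used for type 0 is an assumption.
    """
    if window_type == 2:
        return 1                      # always start at pos=1
    if window_type == 1:
        return first_pos_with_data    # start where the data starts
    if window_type == 0:
        # assumed: next multiple of the step size after the first data position
        return (first_pos_with_data // step + 1) * step
    raise ValueError("window_type must be 0, 1 or 2")

# Made-up first data position, step as in the examples above
for t in (0, 1, 2):
    print(t, first_window_start(t, 14000202, 10000))
```

Under these assumptions only -type 0 gives window starts that do not depend on exactly where the data begins, which is why the window centers can be compared across datasets.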
=realSFS fst print=
You can print out the precalculated A and B with

''./realSFS fst print pop1.pop2.fst.idx''

Assuming we have pop1.saf.idx, pop2.saf.idx.
<pre>
./realSFS pop1.saf.idx pop2.saf.idx >pop1.pop2.saf.idx.ml
./realSFS fst index pop1.saf.idx pop2.saf.idx -fstout pop1.pop2 -sfs pop1.pop2.saf.idx.ml
./realSFS fst print pop1.pop2.fst.idx
</pre>

The weighted fst for a region is the ratio between the sum of the As and the sum of the Bs. The unweighted fst is the mean of the per-site ratios A/B.

A is the alpha from Reynolds 1983 (or Bhatia), and B is alpha + beta.
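The two definitions can be written out in a few lines of Python. This is a sketch with made-up per-site values, not output from realSFS; the function name <code>fst_from_ab</code> is hypothetical, and every B is assumed to be non-zero.

```python
def fst_from_ab(a_values, b_values):
    """Return (unweighted, weighted) fst for one region from per-site A and B."""
    ratios = [a / b for a, b in zip(a_values, b_values)]
    unweighted = sum(ratios) / len(ratios)      # mean of the per-site ratios
    weighted = sum(a_values) / sum(b_values)    # ratio of the sums
    return unweighted, weighted

# Made-up per-site values, for illustration only
a = [0.02, 0.00, 0.03]
b = [0.10, 0.20, 0.30]
unweighted, weighted = fst_from_ab(a, b)
print(f"{unweighted:.6f} {weighted:.6f}")
```

The two estimates differ because the weighted version lets sites with a large B contribute more, whereas the unweighted version gives every site the same weight.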

Fst

2019-08-17T01:31:00Z

Thorfinn:

Our program can estimate fst between populations, and has been generalized to give all pairwise fst estimates if you supply the command with multiple populations.

;If you supply 3 populations, the program will also output the pbs statistic.
;NB we have removed the not very useful unweighted fst estimator from the output, and have included a header. The output example below will be updated at some point.
The procedure is:

- Use angsd for calculating '''saf''' files for each population

- Use realSFS to calculate 2d sfs for each pair

- Use the above calculated 2dsfs as priors jointly with all '''safs''' from step1 to calculate '''fst''' binary files

- Use realSFS to extract the fst values from the '''fst''' files

NB:
In the latest github version there is a different fst estimator which should be preferable for small sample sizes. Feel free to try that out with:
<pre>
./realSFS fst index [saf.idx's] -whichFst 1
</pre>

=Note about fst for folded spectra=

=Two Populations real data=
<pre>
#this is with 2pops
#first calculate per pop saf for each population
../angsd -b list1 -anc hg19ancNoChr.fa -out pop1 -dosaf 1 -gl 1
../angsd -b list2 -anc hg19ancNoChr.fa -out pop2 -dosaf 1 -gl 1
#calculate the 2dsfs prior
../misc/realSFS pop1.saf.idx pop2.saf.idx >pop1.pop2.ml
#prepare the fst for easy window analysis etc
../misc/realSFS fst index pop1.saf.idx pop2.saf.idx -sfs pop1.pop2.ml -fstout here
#get the global estimate
../misc/realSFS fst stats here.fst.idx
-> FST.Unweight:0.069395 Fst.Weight:0.042349
#below is not tested that much, but seems to work
../misc/realSFS fst stats2 here.fst.idx -win 50000 -step 10000 >slidingwindow
</pre>

=3 Populations real data=
In the commands below I am using 24 threads, because this is what I have. Adjust accordingly.
<pre>
#this is with 3pops
#first calculate per pop saf for each population
./angsd -b list10 -anc hg19ancNoChr.fa -out pop1 -dosaf 1 -gl 1
./angsd -b list11 -anc hg19ancNoChr.fa -out pop2 -dosaf 1 -gl 1
./angsd -b list12 -anc hg19ancNoChr.fa -out pop3 -dosaf 1 -gl 1
#calculate all pairwise 2dsfs's
./misc/realSFS pop1.saf.idx pop2.saf.idx -P 24 >pop1.pop2.ml
./misc/realSFS pop1.saf.idx pop3.saf.idx -P 24 >pop1.pop3.ml
./misc/realSFS pop2.saf.idx pop3.saf.idx -P 24 >pop2.pop3.ml
#prepare the fst for easy analysis etc
./misc/realSFS fst index pop1.saf.idx pop2.saf.idx pop3.saf.idx -sfs pop1.pop2.ml -sfs pop1.pop3.ml -sfs pop2.pop3.ml -fstout here
#get the global estimate
./misc/realSFS fst stats here.fst.idx
-> Assuming idxname:here.fst.idx
-> Assuming .fst.gz file: here.fst.gz
-> FST.Unweight[nObs:1666245]:0.022063 Fst.Weight:0.034513
0.022063 0.034513
-> FST.Unweight[nObs:1666245]:0.026867 Fst.Weight:0.031989
0.026867 0.031989
-> FST.Unweight[nObs:1666245]:0.025324 Fst.Weight:0.021118
0.025324 0.021118
-> pbs.pop1 0.023145
-> pbs.pop2 0.005088
-> pbs.pop3 0.009367
#below is not tested that much, but seems to work
./misc/realSFS fst stats2 here.fst.idx -win 50000 -step 10000 >slidingwindow
</pre>
In the presence of 3 populations, the program will also calculate the population branch statistic (pbs).
==Sliding Window output==
The sliding window seems to work so we have documented it here:
;Second column is chromosome, third is center of window followed by:
;fst.unweight(pop1,pop2) fst.weight(pop1,pop2) fst.unweight(pop1,pop3) fst.weight(pop1,pop3) fst.unweight(pop2,pop3) fst.weight(pop2,pop3)
<pre>
(9133,58895)(14010000,14059999)(14010000,14060000) 1 14035000 0.022099 0.016387 0.026686 0.027731 0.025311 0.047920 -0.002231 0.035045 0.030353
(19114,68881)(14020000,14069999)(14020000,14070000) 1 14045000 0.022096 0.019076 0.026777 0.024238 0.025290 0.052793 -0.005220 0.041969 0.029757
(28951,78655)(14030000,14079999)(14030000,14080000) 1 14055000 0.022043 0.021025 0.026915 0.023368 0.025342 0.056975 -0.006884 0.046840 0.030530
(38928,88632)(14040000,14089999)(14040000,14090000) 1 14065000 0.022083 0.016525 0.026846 0.029560 0.025345 0.053421 -0.004116 0.039898 0.034122
(48917,98170)(14050000,14099999)(14050000,14100000) 1 14075000 0.022132 0.022082 0.026742 0.025564 0.025262 0.037071 0.005226 0.024827 0.020671
(74,49191)(14000000,14049999)(14000000,14050000) 10 14025000 0.022704 0.101955 0.026479 0.095713 0.025102 0.001924 0.103108 -0.048378 -0.002500
(9734,58555)(14010000,14059999)(14010000,14060000) 10 14035000 0.022779 0.102670 0.026425 0.088015 0.025118 0.002721 0.098870 -0.043342 -0.006738
</pre>
The last 3 columns are the population branch statistic for population1, population2 and population3.
==Relative window positions?==

We allow 3 different ways of defining window positions; these are chosen with the '''-type''' argument in realSFS:

;-type 2 Use pos=1 as the leftmost position of the first window, even if there isn't any data there.

;-type 1 Use the first position with data as the leftmost position of the first window.

;-type 0 Split the genome into blocks, and use the first window that has data for the entire window. The window centers will then be the same across datasets.
=realSFS fst print=
You can print out the precalculated A and B with

''./realSFS fst print pop1.pop2.fst.idx''

Assuming we have pop1.saf.idx, pop2.saf.idx.
<pre>
./realSFS pop1.saf.idx pop2.saf.idx >pop1.pop2.saf.idx.ml
./realSFS fst index pop1.saf.idx pop2.saf.idx -fstout pop1.pop2 -sfs pop1.pop2.saf.idx.ml
./realSFS fst print pop1.pop2.fst.idx
</pre>

The weighted fst for a region is the ratio between the sum of the As and the sum of the Bs. The unweighted fst is the mean of the per-site ratios A/B.

A is the alpha from Reynolds 1983 (or Bhatia), and B is alpha + beta.

2d SFS Estimation

2019-08-17T01:26:23Z

Thorfinn:

Angsd can estimate a 2d site frequency spectrum. This is an extension of the 1d site frequency spectrum [[SFS Estimation|method]].
* Newer versions of ANGSD can estimate even higher dimensions (up to 4).
* From 17 August 2019 the program can do a proper folding of the 2dsfs, which is done by supplying it with the UNFOLDED saf.idx files generated by -doSaf 1.

This is best explained by a full example.
==Example==
* Assume you have 12 bamfiles for population 1 listed in the file '''pop1.list'''
* Assume you have 14 bamfiles for population 2 listed in the file '''pop2.list'''
* Assume you have a fastafile containing the ancestral state in the file '''anc.fa'''

Let's start by finding the positions for which we have data in population1 and population2
<pre>
# as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data.
angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -doSaf 1
angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -doSaf 1
</pre>

==1 dimensional frequency spectra==
If we were interested in estimating the 1d sfs for each population we could do it like this using the [[realSFS]] program. (See more on [[SFS Estimation |page]] )
<pre>
#sfs for pop1
realSFS pop1.saf.idx -P 24 >pop1.saf.sfs
#sfs for pop2
realSFS pop2.saf.idx -P 24 >pop2.saf.sfs
#2d sfs for pop1 and pop2
realSFS pop1.saf.idx pop2.saf.idx -P 24 >2dsfs.sfs
</pre>
The output is then located in a flattened matrix format (25x29, i.e. 2*12+1 by 2*14+1 categories for diploid samples) in the file '''2dsfs.sfs'''. Good luck visualising it; some people are using dadi, and we have been using heat maps in R.
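Before plotting, the flattened values have to be put back into matrix form. The sketch below shows one way to do that in plain Python. The helper names are hypothetical, the samples are assumed to be diploid, and it is an assumption that the values are stored row by row with the first population along the rows.

```python
def reshape_2dsfs(flat, n_ind1, n_ind2):
    """Turn a flattened 2d sfs into a list of rows.

    Assumes diploid samples, giving (2*n_ind1+1) rows and (2*n_ind2+1)
    columns, and assumes row-by-row storage with population 1 as rows.
    """
    n_rows, n_cols = 2 * n_ind1 + 1, 2 * n_ind2 + 1
    if len(flat) != n_rows * n_cols:
        raise ValueError(f"expected {n_rows * n_cols} values, got {len(flat)}")
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def read_2dsfs(path, n_ind1, n_ind2):
    """Read whitespace-separated values from a file and reshape them."""
    with open(path) as handle:
        flat = [float(x) for x in handle.read().split()]
    return reshape_2dsfs(flat, n_ind1, n_ind2)

# Tiny made-up example: 1 and 2 diploid individuals give a 3x5 matrix
demo = reshape_2dsfs(list(range(15)), 1, 2)
print(len(demo), len(demo[0]))
```

For the example on this page the call would be <code>read_2dsfs("2dsfs.sfs", 12, 14)</code>, giving 25 rows of 29 values.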

==2d sfs (folded)==
<pre>
#2d sfs for pop1 and pop2 doing proper folding
realSFS pop1.saf.idx pop2.saf.idx -P 24 -fold 1 >2dsfs.sfs
</pre>

2d SFS Estimation

2019-08-17T01:25:47Z

Thorfinn:

Angsd can estimate a 2d site frequency spectrum. This is an extension of the 1d site frequency spectrum [[SFS Estimation|method]]. Newer versions of ANGSD can estimate even higher dimensions (up to 4). From 17 August 2019 the program can do a proper folding, which is done by supplying it with the UNFOLDED saf.idx files generated by -doSaf 1.

This is best explained by a full example.
==Example==
* Assume you have 12 bamfiles for population 1 listed in the file '''pop1.list'''
* Assume you have 14 bamfiles for population 2 listed in the file '''pop2.list'''
* Assume you have a fastafile containing the ancestral state in the file '''anc.fa'''

Let's start by finding the positions for which we have data in population1 and population2
<pre>
# as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data.
angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -doSaf 1
angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -doSaf 1
</pre>

==1 dimensional frequency spectra==
If we were interested in estimating the 1d sfs for each population we could do it like this using the [[realSFS]] program. (See more on [[SFS Estimation |page]] )
<pre>
#sfs for pop1
realSFS pop1.saf.idx -P 24 >pop1.saf.sfs
#sfs for pop2
realSFS pop2.saf.idx -P 24 >pop2.saf.sfs
#2d sfs for pop1 and pop2
realSFS pop1.saf.idx pop2.saf.idx -P 24 >2dsfs.sfs
</pre>
The output is then located in a flattened matrix format (25x29, i.e. 2*12+1 by 2*14+1 categories for diploid samples) in the file '''2dsfs.sfs'''. Good luck visualising it; some people are using dadi, and we have been using heat maps in R.

==2d sfs (folded)==
<pre>
#2d sfs for pop1 and pop2 doing proper folding
realSFS pop1.saf.idx pop2.saf.idx -P 24 -fold 1 >2dsfs.sfs
</pre>

2d SFS Estimation

2019-08-17T01:24:37Z

Thorfinn:

2d SFS Estimation

2019-08-17T01:23:51Z

Thorfinn:

Angsd can estimate a 2d site frequency spectrum. This is an extension of the 1d site frequency spectrum [[SFS Estimation|method]]. Newer versions of ANGSD can estimate even higher dimensions (up to 4). From 17 August 2019 the program can do a proper folding, which is done by supplying it with the UNFOLDED saf.idx files generated by -doSaf 1.

This is best explained by a full example.
==Example==
* Assume you have 12 bamfiles for population 1 listed in the file '''pop1.list'''
* Assume you have 14 bamfiles for population 2 listed in the file '''pop2.list'''
* Assume you have a fastafile containing the ancestral state in the file '''anc.fa'''

Let's start by finding the positions for which we have data in population1 and population2
<pre>
# as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data.
angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -doSaf 1
angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -doSaf 1
</pre>

If we were interested in estimating the 1d sfs for each population we could do it like this using the [[realSFS]] program. (See more on [[SFS Estimation |page]] )
<pre>
#sfs for pop1
realSFS pop1.saf.idx -P 24 >pop1.saf.sfs
#sfs for pop2
realSFS pop2.saf.idx -P 24 >pop2.saf.sfs
#2d sfs for pop1 and pop2
realSFS pop1.saf.idx pop2.saf.idx -P 24 >2dsfs.sfs
</pre>
The output is then located in a flattened matrix format (25x29, i.e. 2*12+1 by 2*14+1 categories for diploid samples) in the file '''2dsfs.sfs'''. Good luck visualising it; some people are using dadi, and we have been using heat maps in R.

Plink

2019-04-30T15:26:57Z

Thorfinn:

From version 0.549 we support plink output files. Currently we only support the transposed '''.tfam/.tped''' files, but we are working on a native '''.bed/.bim/.fam''' output.

There are 3 approaches for making plink files. The first is based on calling genotypes with angsd and outputting the results as a .tped file. The second is based on doing a pseudohaploid call at sites for a list of individuals, followed by a subprogram that converts this into a .tped file. The final approach is based on generating a fasta file (which also contains the pseudohaploid consensus) for each individual, followed by a subprogram that extracts the reference bases for specific sites, which are then merged into a tped file.

The first method, described here, is essentially a wrapper around the existing genotype caller, and all options for the genotype caller can therefore be used for the plink formatted output file.

See [[Genotype calling]] for options relating to calling genotypes.
For dumping the plink file you should supply:

;-doPlink 2

==Brief Overview==
<pre>
------------------------
writePlink.cpp:
-doPlink 0
1: binary fam/bim/bed format (still beta, not really working)
2: tfam/tped format

NB This is a wrapper around -doGeno see more information for that option
</pre>

==Example==
A full example commandline is given below:

<pre>
./angsd -bam bam.filelist -out outnames -doPlink 2 -doGeno -4 -doPost 1 -doMajorMinor 1 -GL 1 -doCounts 1 -doMaf 2 -postCutoff 0.99 -SNP_pval 1e-6 -geno_minDepth 4
</pre>

Notice the extra minus in the '''-doGeno -4''' argument; this will suppress the -doGeno output.

==Output files==
The above command will generate '''.tfam''' and '''.tped''' files.
<div class="toccolours mw-collapsible mw-collapsed">
output.tfam
<pre class="mw-collapsible-content">
1 1 0 0 0 -9
2 1 0 0 0 -9
3 1 0 0 0 -9
4 1 0 0 0 -9
5 1 0 0 0 -9
6 1 0 0 0 -9
7 1 0 0 0 -9
8 1 0 0 0 -9
9 1 0 0 0 -9
10 1 0 0 0 -9
11 1 0 0 0 -9
12 1 0 0 0 -9
13 1 0 0 0 -9
14 1 0 0 0 -9
15 1 0 0 0 -9
[capped]
</pre>
</div>
<div class="toccolours mw-collapsible mw-collapsed">
output.tped
<pre class="mw-collapsible-content">
1 1_14000202 0 14000202 G G G G G G G A G G G G G A G A G A G G G G G A G G G G G A G A G G G G G G G A G G G G G A G G G G G G G G G G G A G G G A G A G G
1 1_14000873 0 14000873 G G G G G G A A G A G G G G G G G A G G G A G G G A G G G G A A G G G A G G G A G A G G G A G G G G A A A A G G G A G G G A G G G G
1 1_14001018 0 14001018 T T T T T T C C T C T T T T T T T C T T T C T T T T T T T T C C T T T C T T T T T C T T T T T C T T T C T C T T T C T T T C T T T T
1 1_14001867 0 14001867 A A A A A A A G A G A A A A A A A G A A A G A A A G A A A A G G A A A G A A A G A G A A A G A G A A A G G G A A A G A A A G A A A A
</pre>
</div>
Notice that the family id is simply an incrementing integer, and that the SNP id is built from the chromosome and the genomic position (e.g. 1_14000202).
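For downstream scripting, a .tped line can be split into its four leading columns (chromosome, SNP id, genetic distance, base-pair position) followed by two alleles per individual. The sketch below is a minimal reader under that standard layout; the function name <code>parse_tped_line</code> and the shortened input line are made up for illustration.

```python
def parse_tped_line(line):
    """Parse one .tped line into chromosome, SNP id, position and genotypes."""
    fields = line.split()
    chrom, snp_id, _genetic_dist, bp_pos = fields[:4]
    alleles = fields[4:]
    if len(alleles) % 2:
        raise ValueError("expected two alleles per individual")
    # Pair consecutive alleles: one (allele1, allele2) tuple per individual
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return chrom, snp_id, int(bp_pos), genotypes

# Shortened, made-up line in the same layout as the output above
chrom, snp_id, pos, genotypes = parse_tped_line(
    "1 1_14000202 0 14000202 G G G A A A")
print(chrom, snp_id, pos, genotypes)
```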

==NB==
We highly recommend that users don't perform analyses on called genotypes unless they have high depth, since calling genotypes is likely to cause bias in the downstream analysis.

==References==
[http://pngu.mgh.harvard.edu/~purcell/plink/]

Installation

2019-02-22T16:40:46Z

Thorfinn:

There has been some confusion about the versions of ANGSD.

* Even versions are freezes of the last odd git version.

* Odd versions are git versions. Once there have been enough commits we will increment the version and make a release.
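The convention can be checked mechanically. The snippet below is a small sketch with a hypothetical helper name; it assumes purely numeric version strings such as 0.928 and would need adjusting for names with a letter suffix.

```python
def is_release_version(version):
    """True for even versions (tar.gz freezes), False for odd (git) versions."""
    last_digit = int(version.strip()[-1])
    return last_digit % 2 == 0

for v in ("0.928", "0.929"):
    kind = "tar.gz release" if is_release_version(v) else "github version"
    print(v, kind)
```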

=Download and Installation=
To download and use ANGSD you need to download the htslib and the angsd source folder

You can either download the angsd0.928.tar.gz which contains both.
[http://popgen.dk/software/download/angsd/angsd0.928.tar.gz]

Or you can use github for the latest version of both htslib and angsd

Earlier versions from here: http://popgen.dk/software/download/angsd/
And here: https://github.com/ANGSD/angsd/releases

=Install=
Download and unpack the tarball, enter the directory and type make. Users on a Mac can use curl instead of wget.

===Unix===
The software can be compiled using make.
<pre>
wget http://popgen.dk/software/download/angsd/angsd0.928.tar.gz
tar xf angsd0.928.tar.gz
cd htslib;make;cd ..
cd angsd
make HTSSRC=../htslib
cd ..
</pre>
The executable is then located in '''angsd/angsd'''.

=Install from github=
To get CRAM support you also need to install htslib; this can be done using the following commands:

<pre>
git clone https://github.com/samtools/htslib.git
git clone https://github.com/ANGSD/angsd.git
cd htslib;make;cd ../angsd ;make HTSSRC=../htslib
</pre>

=Systemwide installation of htslib?=
If htslib is installed systemwide, you just type make in the angsd directory.

Change log

2019-02-22T16:39:20Z

Thorfinn: /* Latests */

=Latests=
Odd versions are github versions...
*0.929 https://github.com/ANGSD/angsd/compare/0.925...0.929
*0.925 https://github.com/ANGSD/angsd/compare/0.923...0.925
*0.923 https://github.com/ANGSD/angsd/compare/0.921...0.923
*0.921 https://github.com/ANGSD/angsd/compare/0.918...0.921
*0.918 https://github.com/ANGSD/angsd/compare/0.916...0.918
*0.916 https://github.com/ANGSD/angsd/compare/0.914...0.916
*0.914 https://github.com/ANGSD/angsd/compare/0.912...0.914
*0.912 https://github.com/ANGSD/angsd/compare/0.911...0.912
*0.911 https://github.com/ANGSD/angsd/compare/0.910...0.911
*0.910 https://github.com/ANGSD/angsd/compare/0.901...0.910
*0.901 https://github.com/ANGSD/angsd/compare/0.900...0.901
*0.900 https://github.com/ANGSD/angsd/compare/0.800...0.900
*0.800 https://github.com/ANGSD/angsd/compare/0.700...0.800
*0.700 https://github.com/ANGSD/angsd/compare/0.615...0.70

=0.6***=
*0.614 https://github.com/ANGSD/angsd/compare/0.614...0.615
*0.613 https://github.com/ANGSD/angsd/compare/0.613...0.614
*0.613 https://github.com/ANGSD/angsd/compare/0.612a...0.613
*0.612 https://github.com/ANGSD/angsd/compare/0.610...0.612a
*0.610 https://github.com/ANGSD/angsd/compare/0.609...0.610a
*0.609 https://github.com/ANGSD/angsd/compare/0.608...0.609
*0.608 not super useful, but we now compile with knetfile, so users can use remote .fa files. I really recommend that users download the fasta instead.
*0.607 changed name of all abstract base classes to the more reasonable abc*.cpp. Included contamination and the iCounts format. Added a templated class so users can see how to access the internal datastructures.
*0.606 added more info in the thetaStat subprogram renamed all analysis classes to abc*.cpp.
*0.605 continued updating the Eunjung code
*0.604 added a 'job' array for the analysis classes which should greatly reduce the number of needed function call. I doubt that will make any noticeable speed difference. Has copied and modified some code from Eunjung (from jnovembre lab) for a banded approach in the saf calculation. This is not working yet -doSaf 2
*0.603 1) fixed wrong branching if users used simfiles. 2) fixed 2 bugs in the inbreeeded angsd_realSFS.cpp. Very small changes
*0.602 1) programs output the actual chromosome, if there is a mismatch between the fasta and bamheader 2) added a check that the index files (fa vs fai,bam vs bai) is newer than the index. 3) Program now validates that the .so and include files used are the same.
*0.601 1) started the move to includeflags/discardflags, commented out in this version 2) validates that data has been generated for the different samples by looking at the difference between values, instead of comparing against -0.0. 3) MAF estimates now skip estimation for a site if the updated GL shows that GLs are noninformative.
*0.600 fixed a change in default FLAGS when using bam files. If you hadn't removed the unmapped reads from your bamfiles these would have been included in the analysis.

=0.589 to 599=
* 0.589 removed soap/sim/glf/glfclean(bin and text) and tglf.
* 0.590 added text mpileup as new input format. Very useful
* 0.591 refactored the file reading, moved the arguments to multi reader, such that all file reading is done from multi reader. FREEZE version. We will only allow bug fixes in the next many versions.
* 0.592 -pest output in angsd_realSFS.cpp. smartcounts. fixed a bug in hetplas. edited the printout msg for 'if you really want angsd to exit'. Maybe fixed an unknown bug. line 2 in .arg or screen output is now the commandline used, and some other visual stuff. Program didn't complain if -doMaf but no -doMajorMinor. if chromosome name contained ':' it wouldn't work (strchr->strrchr)
* 0.593 fixed a strange bug, where the program would crash if, no analysis was chosen. fixed the 'shouldbeone' bug.
* 0.594 nochanges
* 0.595 nochanges
* 0.596 fixed a bug in persite depth counter (double count of C alleles). fixed a bug in smartcount subprogram.
* 0.597 treemix input file generation from smart counts
* 0.598 fixed a wrong compile flag in one of the utility programs 'smartCount'
* 0.599 many much more informative information if users forgot to add -fai argument. Fixed a bug in parsing of arguments with -doSaf 4.

=<0.570 and <0.588=
* 0.571 removed bf's from maf classs, negative values of -domaf disabled dumping of files, inbreeding has been added
* 0.572 Some funky new approach for the makefile is now being used, minor bug fixes (minDepth -> setMinDepth, extra header column in thetas.gz has been fixed)
* 0.573 added better info if bamfiles have different headers. Added check if length of supplied reference/ancestral doesn't match bamheader. autosize in emOptim2 for 2dsfs. Fixed subtle issue if very large difference in coverage between bamfiles, now the 'biggest (file size)' is used to select region instead of the first bamfile. Netaccess is now deprecated.
* 0.574 modifed emOptim2 so now compiles on mac
* 0.575 smaller fixes to the inbreeding parsing, added p-value in analysisMaf instead of the raw llh.
* 0.576 -doSNP and -minLRT now deprecated, please use -SNP_pval instead
* 0.577 bugfix for -SNP_pval if value was one 1. doHWE now called -HWE_pval and can be used for filtering.
* 0.578 speedup in hwe stuff
* 0.579 fixed an extremely rare assertion error (program was working, assertion was off). Redid all strcmp to strcasecmp. Fixed a bug in -doMaf 2 with -snp_pval
* 0.580 nochanges...
* 0.581 if trimming has been enabled, N's will be plugged in instead of the bases. A number of small changes.
* 0.582 nochanges...
* 0.583 1) changed bugfix when using counts based estimator for major/minor 2) keepsites is now using the effective number of samples in all cases 3) changed output of maf to a 'nicer' format
* 0.584 updated internal testing scripts.
* 0.585 1) fixed 'baq complains even though -ref was supplied' 2) fixed -doMajorMinor 4 and doMajorMinor 5 (sites not discarded) 3) added tremendously better information for the -sites 4) added some check for -doPost 5) program can now exit uncleanly if ctrl+c is pressed 3 times. 6) added an else to catch wrong arg in thetaStat.
*0.586 1) fixed parsing of pars if input is -sim1 2) fixed a bug in -doFasta 3 (the ebd one) 3) fixed a printout problem in -doFasta
*0.587 1) fixed a number of minor instances where memory wasn't being freed/deleted (mostly for keeping valgrind silent) 2) fixed a memleak if -sites files contained 4 columns but -doMajorMinor not 3. 3) fixed a memory leak that could occur if -doPost 2 and -doMaf 0.
* 0.588 1) fixed small printout error that could cause segfaults in rare cases. 2) change stderr to pos+1 3) changed some checks of user supplied pars 4) fixed a stack overflow if a very long -rf file was supplied.

=old=
==Dirty==
* 0.16 Is now bundled with SAMtools-0.1.17 and the mpileup (and friends) command be used for passing data to dirty.
* 0.17 added extra options -minInd and -minMaf, for only printing and using sites above a threshold
* 0.18 added option to pass reference and ancestral allele as fasta files.(using faidx format) (doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE)
* 0.19 added support for tglf inputfiles, -tglf -posfile see runexamples, also added the likeratio test for snp calling
* 0.20 Added the check for missing data, before the major/minor. included -realSFS, changed the deallocation of the -doMAF results, such that its proper cleaned up.
* 0.21 refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that prevented -samglf and samglfclean from working)
* 0.22 Well this update was a mixture of edits from [[user:albrecht]] and BGI so its difficult to give a concise description
* 0.23 Program can now read simulated files (single pop only). An example can be seen in "full example ... sfs" and input types.
* 0.24 added the tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso
* 0.25 the depth is now being populated when using mpileup -g. The program can now get the counts from mpileup
==ANGSD==
=<0.5=
* 0.01.a - 0.01.b The bfgs now supports threading, maybe anders implemented a heterozygosity estimator.
* 0.01.c A problem if we didn't observe any llh, caused the MAF estimator to 'nan'.
* 0.02 Fixed small bug in bfgs optimization of sfs optimization. When choosing a region bigger than what was covered by the .sfs file the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
* 0.03 added and documented genotypecaller, can dump counts,-realSFS 1 dumps positions, -realSFS 2 is deprecated,S,pi and tajima has been added to sfstools along with possibility to do prior
* 0.3 clean version with less features. The lost features will be reintroduced later.
* 0.43 first very clean version, everything should be included
* 0.441 rewrote the SOAPsnp GL model, -L and -maxQ is not needed anymore. Also added an option to choose an output dir for the recalibration matrix
* 0.4471 error estimation is now working, the fasta reading is now threadsafe. all GLs are now likeratios.
* 0.512 After 0.500 we have changed the internal structure such that each chunk is enforced to be on the same chr. version c) fixes a problem of hardclipping
* 0.515 A lot of legacy code has been removed from mUppile.cpp. Program can now use remote files, built on code from SAMtools
* 0.520 A lot more legacy code has been removed from mUpPile.cpp. Program now does baq and adjustment of mapQ similar to -C in samtools. Also compiles on osx, but this is not supported
* 0.535 Bug in internal representation of mapQ's (only problematic for mapQ>128), we now use the flag to determine if a read has mapped. calcstat is now deprecated, users should use the bgid program now.
* 0.538 changed position output in association part. Fixed incorrect assert assumption in mUpPile.cpp. Added some downsampling options for errorEst and changed internal buffering when reading beagle files to allow for >10k individuals.
* 0.549 too many changes to remember

=<0.5 and <0.570=
* 0.551 tajima paper is now published, so the emOptim2 and bgid has now been properly documented. plink output is now supported and some snp filters can be outputted.
* 0.552 minor bug when calling genotypes without defining postcutoff -> missingness couldnt occur. removed the optimSFS and emOptim from the default compilelist
* 0.553 uint removed from code.
* 0.554 plugged in sfstools functionality into main angsd, (ability to output log posts)
* 0.555 anders added some consensus stuff
* 0.556 updated the filtering (if binary rep of keep file is incomplete it is removed again. It checks timestamps to see if file has been updated), folded spectra analysis should now be working
* 0.557 There was a bug in the realsfs part of the code, that was created in the 0.556 version. 0.557 is simply a fix of this, and the removal of a warning compiler flag in the msToGlf subprogram. We only observed the problematic compiler flag on a osx machine
* 0.558 tempversion,from this version bgid is now called thetaStat
* 0.559 abbababa
* 0.560 analysisCount.cpp has been updated to the nice standard of the wiki
* 0.561 program now compiles on clang, many small compiler warnings have been fixed.
* 0.562 Merge of forked versions, abba-baba fasta.
* 0.563 minQ filter has been moved to a much earlier step. Previously it was downstream classes that checked this. Now a base will be set to 'n' if it is below the threshold
* 0.564 cleaned up funky pars maf/asso such that all results are in ->extras[]
* 0.565 moved file reading stuff from shared to analysisFunction in namespace ail::
* 0.566 cleaned up small things, added a newer version of hetplas
* 0.567 cleaned up small things again. Started to add single pars e.g. -P -b
* 0.568 refactored compile order in general.cpp
* 0.569 added saf genotype calling, changed name from -realsfs to -doSaf
* 0.570 modified emOptim2 to estimate nSites and tell how much memory it will use, fixed empty -bam file

MediaWiki:Sitenotice

2019-02-22T16:38:48Z

Thorfinn:

ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is 0.928 (0.929 on github), see [[Change_log]] for changes, and download it [[Download and installation | here]].

Thorfinn

2019-02-22T16:37:58Z

Thorfinn: /* Make a new github version to put on github */

=How to deploy=
This page describes how a new version should be put on the wiki and on github
==Make a combined angsd htslib to put on wiki download==
Clone the latest angsd and htslib, make a new annotated tag, set the version in the Makefile, and copy the combined tarball to the wiki download area
<pre>
VERSION=0.928
mkdir delme
cd delme
git clone --depth=1 https://github.com/ANGSD/angsd
git clone --depth=1 https://github.com/SAMtools/htslib
cd angsd; git tag -a ${VERSION} -m "time for a new version"
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
cd ..
# -z gzips the archive so that its content matches the .tar.gz name
tar --exclude='.git' -czvf angsd${VERSION}.tar.gz angsd htslib/
scp angsd${VERSION}.tar.gz software@popgen.dk:/home/software/download/angsd/
</pre>
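
Before the scp step it can be worth checking that the tarball really contains both source trees and no git metadata. The following is only a sketch: the function name check_tarball is hypothetical and not part of the ANGSD repository, and it assumes the archive was built with the tar command above.

```shell
# Hypothetical helper (not part of ANGSD): sanity-check a release tarball.
# Prints "ok" and returns 0 if the archive holds angsd/ and htslib/
# and no .git directories; otherwise reports the problem and returns 1.
check_tarball() {
    archive=$1
    # tar -t lists member names; it auto-detects gzip compression when reading.
    listing=$(tar -tf "$archive") || return 1

    # Both source trees must be present at the top level.
    printf '%s\n' "$listing" | grep -q '^angsd/' || {
        echo "missing angsd/" >&2
        return 1
    }
    printf '%s\n' "$listing" | grep -q '^htslib/' || {
        echo "missing htslib/" >&2
        return 1
    }

    # --exclude='.git' should have kept repository metadata out.
    if printf '%s\n' "$listing" | grep -q '/\.git/'; then
        echo "archive contains .git" >&2
        return 1
    fi
    echo "ok"
}
```

Usage would be something like <code>check_tarball angsd${VERSION}.tar.gz && scp ...</code>, so that the upload only happens when the check passes.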

==Make a new github version to put on github==

<pre>
VERSION=0.929
cd angsd/
sed "s/PACKAGE_VERSION = .*/PACKAGE_VERSION = ${VERSION}/" Makefile >tmp
diff Makefile tmp
mv tmp Makefile
git commit Makefile -m "Preparing new ${VERSION} version"
git tag -a ${VERSION} -m "time for a new version"
git push
git push --tags
</pre>
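
A small guard can confirm that the Makefile carries the intended version before the tag is created. This is a sketch under one assumption: the Makefile has a line starting with "PACKAGE_VERSION = ", which is what the sed command above rewrites. The function names makefile_version and version_matches are hypothetical.

```shell
# Hypothetical helpers (not part of ANGSD).

# Print the version recorded in the given Makefile.
# Assumes a line of the form "PACKAGE_VERSION = <version>".
makefile_version() {
    sed -n 's/^PACKAGE_VERSION = *//p' "$1" | head -n 1
}

# Succeed only if the Makefile already carries the expected version.
version_matches() {
    [ "$(makefile_version "$1")" = "$2" ]
}
```

For example, <code>version_matches Makefile "${VERSION}" && git tag -a ${VERSION} -m "time for a new version"</code> would refuse to tag if the sed step had not been applied.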