ANGSD: Analysis of next generation Sequencing Data
SFS Estimation
This method estimates the site frequency spectrum (SFS); the method is described in Nielsen2012.
This is a two-step procedure: first generate a ".sfs" file, then optimize the .sfs file to estimate the site frequency spectrum. For the optimization we have implemented two different approaches, both found in the misc subdirectory of the root directory. This is shown in the diagram below.
NB: the ancestral state needs to be supplied for this method. The information on this page relates to versions 0.551 or higher.
[Workflow diagram: sequence data -> genotype likelihoods (SAMtools; GATK; SOAPsnp; Kim et al.) -> [realSFS] .sfs file -> [optimize] .sfs.ml / .sfs.em.ml file]
- -realSFS 1 : an .sfs file will be generated.
- -realSFS 2 : (version 0.503 and above) takes per-individual inbreeding coefficients into account. This is the work of Filipe G. Vieira.
- -realSFS 4 : SNP calling (not implemented in this version of angsd).
- -realSFS 8 : genotype calling (not implemented in this version of angsd).
For the inbreeding version you need to supply a file containing all the inbreeding coefficients, using the -indF option.
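A minimal sketch of an inbreeding run is shown below. The file name is hypothetical, and we are assuming the -indF file simply lists one inbreeding coefficient per line, in the same order as the individuals in the bam filelist; check your version's documentation for the exact format.
#inbreed.indF: one coefficient per individual, e.g. three individuals:
#0.00
#0.25
#0.10
./angsd -bam smallBam.filelist -realSFS 2 -indF inbreed.indF -out small -anc hg19.fa -GL 2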
Options
- -underFlowProtect [INT] : a very basic underflow protection (see the example below).
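A sketch of how the option would be added to a run, assuming a value of 1 switches the protection on:
./angsd -bam smallBam.filelist -realSFS 1 -underFlowProtect 1 -out small -anc hg19.fa -GL 2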
Example
A full example is shown below; here we use GATK genotype likelihoods and hg19.fa as the ancestral sequence. The optimSFS and emOptim programs can be found in the misc subfolder.
#first generate .sfs file
./angsd -bam smallBam.filelist -realSFS 1 -out small -anc hg19.fa -GL 2
#now try the EM optimization with 4 threads
./emOptim.g++ -binput small.sfs -nChr 20 -maxIter 100 -nThread 4
#let's also try the optimization that uses derivatives (bfgs)
./optimSFS.gcc small.sfs -nChr 20 -nThreads 4
NB: If you have, say, 10 diploid samples, then you should use -nChr 20; if you have 12 diploid samples, then you should use -nChr 24. In general, -nChr is twice the number of diploid individuals.
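If each line of the bam filelist corresponds to one diploid individual (an assumption for this sketch), you can derive the value for -nChr directly:
#number of chromosomes = 2 x number of diploid individuals
echo $((2 * $(wc -l < smallBam.filelist)))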
The output files are then called small.sfs.em.ml and small.sfs.ml
0.995120 0.001202 0.000469 0.000255 0.000239 0.000254 0.000125 #capped
This is to be interpreted as:
column1 := probability of sampling zero derived alleles
column2 := probability of sampling one derived allele
column3 := probability of sampling two derived alleles
column4 := probability of sampling three derived alleles
etc.
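Since the columns are probabilities over the 2N+1 possible derived allele counts, they should sum to approximately one. A quick sanity check (a sketch, assuming the .sfs.ml file holds a single line of probabilities):
awk '{s=0; for(i=1;i<=NF;i++) s+=$i; print "sum:", s}' small.sfs.ml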
NB: The generation of the .sfs file is done on a per-site basis, whereas the optimization requires information for a region of the genome. The optimization will therefore use large amounts of memory.
NB: The optimSFS and emOptim programs are now deprecated; their functionality is now in emOptim2.
./emOptim2 input.sfs 10 -P 20 >input.sfs.em.ml
The output in input.sfs.em.ml is in log scale. Here we are assuming 10 chromosomes (5 diploid samples) and using 20 threads.
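To inspect the estimates on the probability scale, you can exponentiate the log-scale values; a sketch using awk:
#convert log-probabilities back to probabilities
awk '{for(i=1;i<=NF;i++) printf "%f ", exp($i); print ""}' input.sfs.em.ml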
The program defaults to 50 megabase regions and will loop over the genome in 50 megabase chunks. You can increase this by adding -nSites 500000000, which will then use 500 megabase regions.
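For example, to run the same optimization with 500 megabase chunks:
./emOptim2 input.sfs 10 -P 20 -nSites 500000000 >input.sfs.em.ml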