ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
RealSFSmethod
We will try to elaborate on the theory behind the methods. Below is only a preliminary version of the theory. This method is described in detail in Nielsen2012.
SFS definition
For $n$ diploid samples, the site frequency spectrum (SFS) is the vector of length $2n+1$ containing the proportion of sites carrying $k$ mutations, for $k = 0, \ldots, 2n$. The first element of the SFS is the proportion of sites where we observe no mutations, the second is the proportion of sites where we observe 1 mutation, and the last is the proportion of sites where we observe only mutations. It follows that the first and last elements are the invariable categories. Assuming that the SFS contains relative frequencies, and writing $\eta_k$ for its $k$-th element, the variability in the sample can be estimated by:

$$1 - \eta_0 - \eta_{2n}$$
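As a small illustration of the definition above, the following Python sketch turns a vector of per-category site counts into relative frequencies and computes the proportion of variable sites as one minus the first and last elements. The function names and the example counts are made up for illustration and are not part of ANGSD.

```python
def normalize(counts):
    """Convert per-category site counts into relative frequencies."""
    total = float(sum(counts))
    return [c / total for c in counts]


def variability(sfs):
    """Proportion of variable sites for an SFS of relative frequencies.

    The first and last categories (0 and 2n mutations) are invariable,
    so everything in between counts as variable.
    """
    if abs(sum(sfs) - 1.0) > 1e-9:
        raise ValueError("SFS must contain relative frequencies summing to 1")
    return 1.0 - sfs[0] - sfs[-1]


# Hypothetical counts for n = 2 diploid samples: 2n + 1 = 5 categories.
counts = [900, 40, 30, 20, 10]
sfs = normalize(counts)
print(variability(sfs))
```

If the SFS holds raw counts rather than proportions, it has to be normalized first, which is why the sketch rejects input that does not sum to 1.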
Sample allele frequency/Multisample GL
When the -doSaf 1 flag is supplied, ANGSD calculates the likelihood of the sample allele frequency for each site and dumps these into the .saf file. In this context, the likelihood of the sample allele frequency is the likelihood of sampling $k$ derived alleles. It is computed from the 10 possible genotype likelihoods of all individuals by summing over all combinations of genotypes, using the recursive algorithm described in Nielsen2012. We write this as $P(X^s \mid j)$, meaning the likelihood of sampling $j$ derived alleles for site $s$. For the folded spectrum, where the categories $j$ and $2n-j$ cannot be told apart, the folded likelihood is calculated by combining $P(X^s \mid j)$ and $P(X^s \mid 2n-j)$ for $0 \le j \le n$.
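The recursion can be sketched in a few lines of Python. This is a simplified illustration, not the ANGSD implementation: it assumes the ancestral and derived alleles are already known, so each individual contributes only three genotype likelihoods (for 0, 1 or 2 derived alleles) instead of the 10 mentioned above. Individuals are added one at a time, and the result is divided by the binomial coefficient that counts the ways of placing $j$ derived alleles among $2n$ chromosomes. The function name `saf_likelihoods` is hypothetical.

```python
from math import comb


def saf_likelihoods(gls):
    """Likelihood of the data given j derived alleles in the sample.

    gls: one (g0, g1, g2) tuple per individual, giving the likelihood of
         that individual's data given 0, 1 or 2 derived alleles.
    Returns a list of length 2n + 1, indexed by j = 0..2n.
    """
    h = [1.0]  # zero individuals: only j = 0 is possible
    for g0, g1, g2 in gls:
        new = [0.0] * (len(h) + 2)
        for j, hj in enumerate(h):
            new[j] += g0 * hj            # individual adds 0 derived alleles
            new[j + 1] += 2.0 * g1 * hj  # heterozygote: two chromosome orderings
            new[j + 2] += g2 * hj        # individual adds 2 derived alleles
        h = new
    two_n = 2 * len(gls)
    # Normalize by the number of ways to place j derived alleles on 2n chromosomes.
    return [h[j] / comb(two_n, j) for j in range(two_n + 1)]


# Two individuals whose data support only the heterozygous genotype.
print(saf_likelihoods([(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]))
```

The recursion costs on the order of $n^2$ operations per site, whereas enumerating all $3^n$ genotype combinations directly would be exponential. For large samples the products become very small, so a practical version would work with rescaled or log-scale values.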
Likelihood of the SFS
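The likelihood of the SFS is then given as:

$$P(X \mid \eta) = \prod_{s=1}^{S} \sum_{j=0}^{2n} P(X^s \mid j)\, \eta_j$$

Here $\eta$ is our SFS, $S$ is the number of sites, and $P(X^s \mid j)$ is the likelihood of sampling $j$ derived alleles at site $s$. In the case of the folded SFS, we use $n$ instead of $2n$ in the summation. We can find the MLE of the SFS using either a BFGS approach that uses derivatives or an EM algorithm. Both are implemented in the realSFS program.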
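To make the EM approach concrete, here is a minimal Python sketch of the standard EM update for mixture proportions applied to per-site sample allele frequency likelihoods. It illustrates the idea and is not the realSFS code. The function names, the uniform starting point, the stopping rule and the toy input are all assumptions made for the example, and every site is assumed to have positive likelihood under the current estimate.

```python
from math import log


def log_likelihood(saf, eta):
    """Log-likelihood of the SFS: sum over sites of log(sum_j P(X^s|j) * eta_j)."""
    return sum(log(sum(p * e for p, e in zip(site, eta))) for site in saf)


def em_sfs(saf, n_iter=100, tol=1e-10):
    """Estimate the SFS by EM.

    saf: list of sites; saf[s][j] is the likelihood of site s given
         j derived alleles in the sample.
    Returns the estimated SFS as a list of relative frequencies.
    """
    k = len(saf[0])
    eta = [1.0 / k] * k  # assumed starting point: uniform SFS
    for _ in range(n_iter):
        post_sum = [0.0] * k
        for site in saf:
            # E-step: posterior probability of each category at this site.
            weighted = [p * e for p, e in zip(site, eta)]
            denom = sum(weighted)
            for j in range(k):
                post_sum[j] += weighted[j] / denom
        # M-step: the new SFS is the average posterior over sites.
        new_eta = [x / len(saf) for x in post_sum]
        diff = max(abs(a - b) for a, b in zip(new_eta, eta))
        eta = new_eta
        if diff < tol:
            break
    return eta


# Toy input: three sites that support j = 0 only, one that supports j = 1 only.
saf = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
eta_hat = em_sfs(saf)
print(eta_hat, log_likelihood(saf, eta_hat))
```

Each EM iteration cannot decrease the likelihood, and the estimate always remains a valid set of proportions. A gradient-based method such as BFGS instead has to enforce the constraints that the elements are non-negative and sum to 1.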