ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

SFS Estimation

This method estimates the site frequency spectrum (SFS). The method is described in Nielsen2012.

This is a two-step procedure: first generate a ".sfs" file, then run an optimization on the .sfs file, which estimates the site frequency spectrum.

For the optimization we have implemented two different approaches, both found in the misc subdirectory of the source root. The workflow is shown in the diagram below.

NB: the ancestral state needs to be supplied for this method. The information on this page applies to versions 0.551 and higher.

<classdiagram type="dir:LR">

[sequence data{bg:orange}]->GL[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]

[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]->realSFS[.sfs file{bg:blue}] [.sfs file{bg:blue}]->optimize('emOptim2')[.sfs.ml file{bg:red}]

</classdiagram>
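As a rough illustration of what the optimization step does, the sketch below runs a plain EM update on per-site log-scaled likelihoods of the kind stored in the .sfs file. It is not the emOptim2 source: the function name em_sfs, the toy data, the flat starting point and the convergence tolerance are all assumptions made for illustration.

```python
import math

def em_sfs(site_loglikes, max_iter=100, tol=1e-8):
    """Estimate an SFS from per-site log-scaled likelihoods with a plain EM loop.

    site_loglikes: list of sites, each a list of 2n+1 log-scaled values
    (a hypothetical in-memory form of the .sfs content).
    Returns a list of 2n+1 probabilities that sum to 1.
    """
    n_cat = len(site_loglikes[0])
    # Convert to linear scale once; the values are assumed to be <= 0.
    likes = [[math.exp(v) for v in site] for site in site_loglikes]
    sfs = [1.0 / n_cat] * n_cat  # flat starting point (assumption)
    for _ in range(max_iter):
        new = [0.0] * n_cat
        for site in likes:
            # E-step: posterior probability of each category at this site.
            post = [s * l for s, l in zip(sfs, site)]
            total = sum(post)
            for j in range(n_cat):
                new[j] += post[j] / total
        # M-step: average the posteriors over all sites.
        new = [x / len(likes) for x in new]
        change = max(abs(a - b) for a, b in zip(new, sfs))
        sfs = new
        if change < tol:
            break
    return sfs

# Toy input: 3 sites, 1 diploid individual (3 categories).
toy = [[0.0, -5.0, -10.0], [0.0, -4.0, -9.0], [-6.0, 0.0, -6.0]]
estimate = em_sfs(toy)
print([round(x, 3) for x in estimate])
```

The real program works on far more sites and runs multithreaded; this sketch only shows the shape of the iteration.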

Brief Overview

--------------
angsd_realSFS.cpp:
	-realSFS		0
	1: perform multisample GL estimation
	2: use an inbreeding version
	-doThetas		0 (calculate thetas)
	-underFlowProtect	0
	-fold			0 (deprecated)
	-anc			(null) (ancestral fasta)
	-noTrans		0 (remove transitions)
	-doSFS		0	(Using genotype posteriors (untested))
	-pest			(null) (prior SFS)

options

-realSFS 1
An .sfs file will be generated.
-realSFS 2
(versions above 0.503) Takes per-individual inbreeding coefficients into account. This is the work of Filipe G. Vieira.
-realSFS 4
SNP calling (not implemented in this angsd)
-realSFS 8
Genotype calling (not implemented in this angsd)

For the inbreeding version you need to supply a file containing all the inbreeding coefficients, using -indF.


-underFlowProtect [INT]

A very basic underflow protection.


Example

A full example is shown below. Here we use GATK genotype likelihoods (-GL 2) and hg19.fa as the ancestral sequence. The emOptim2 program can be found in the misc subfolder.

#first generate .sfs file
./angsd -bam smallBam.filelist -realSFS 1 -out small -anc  hg19.fa -GL 2 [options]
#now try the EM optimization with 4 threads
./emOptim2 small.sfs 20 -maxIter 100 -P 4 >small.sfs.em.ml

We always recommend filtering out bases with low quality scores and reads with meaningless mapping quality, e.g. -minMapQ 1 -minQ 20.

If you have, say, 10 diploid samples, then you should use -nChr 20.
If you have, say, 12 diploid samples, then you should use -nChr 24.

Interpretation of output file

The output file is then called small.sfs.em.ml. It contains the SFS in log scale, and is to be interpreted as:

column1 := probability of sampling zero derived alleles

column2 := probability of sampling one derived allele

column3 := probability of sampling two derived alleles

column4 := probability of sampling three derived alleles

etc
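To make the column interpretation concrete, here is a minimal sketch that converts log-scaled values back to probabilities. It assumes the output is a single whitespace-separated line of natural-log values; the function name parse_log_sfs and the example numbers are made up for illustration and are not real ANGSD output.

```python
import math

def parse_log_sfs(text):
    """Turn one line of log-scaled SFS values into probabilities."""
    log_values = [float(tok) for tok in text.split()]
    probs = [math.exp(v) for v in log_values]
    total = sum(probs)
    # Renormalize so that rounding in the file does not leave the sum off 1.
    return [p / total for p in probs]

# Hypothetical file content for one diploid individual (3 columns).
example_line = "-0.105361 -2.302585 -6.907755"
sfs = parse_log_sfs(example_line)
for k, p in enumerate(sfs):
    print(f"P({k} derived alleles) = {p:.4f}")
```

For a real run, the string passed in would be the content of the optimized output file rather than example_line.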

NB

The generation of the .sfs file is done on a per-site basis, whereas the optimization requires information for a whole region of the genome. The optimization will therefore use large amounts of memory. The program defaults to 50 megabase regions, and will loop over the genome in 50 megabase chunks. You can increase this by adding -nSites 500000000, which will then use 500 megabase regions.
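A back-of-the-envelope memory estimate can help when choosing -nSites. The sketch below assumes the optimizer holds (2n+1) doubles of 8 bytes for every site of a region in memory at once, which this page suggests but does not state exactly, so treat the result as a rough upper bound. The function name region_bytes is hypothetical.

```python
def region_bytes(n_sites, n_individuals, bytes_per_value=8):
    """Bytes needed to hold (2n+1) doubles for each of n_sites sites."""
    return n_sites * (2 * n_individuals + 1) * bytes_per_value

GIB = 1024 ** 3
for n_sites in (50_000_000, 500_000_000):
    size = region_bytes(n_sites, n_individuals=10)
    print(f"{n_sites} sites, 10 individuals: about {size / GIB:.1f} GiB")
```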

Format specification of binary .sfs file

The information in this section is only useful for people who want to work with the "multisample GL"/"sample allele frequencies" for the individual sites.

Assuming 'n' individuals, we have (2n+1) categories for each site. Each value is encoded as a C double, which on 'all known' platforms occupies 8 bytes. The (2n+1) values are log-scaled likelihood ratios, which means that the most likely category will have a value of 0.

The information for the first site is the first (2n+1)*sizeof(double) bytes. The information for the next site is the next (2n+1)*sizeof(double) bytes, etc. The positions are given by the ASCII file called '.sfs.pos'.


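// sample code to read the first site of a .sfs file in c/c++, assuming 10 individuals
#include <stdio.h>
int main(void){
  enum { nInd = 10 };            // compile-time constant, so the array below is valid in C++ too
  double persite[2*nInd+1];
  FILE *fp = fopen("mySFS.sfs","rb");
  if(fp==NULL || fread(persite,sizeof(double)*(2*nInd+1),1,fp)!=1)
    return 1;                    // could not open the file or read one full site
  for(int i=0;i<2*nInd+1;i++)
    fprintf(stderr,"cat: %d = %f\n",i,persite[i]);
  fclose(fp);
  return 0;
}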
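The C sample above reads a single site. As a complementary sketch, the Python code below iterates over every site in a file with the layout described in this section. It assumes native byte order and 8-byte doubles. The helper names read_sfs_sites and most_likely_category are hypothetical, and the demo writes its own small temporary file rather than using real ANGSD output.

```python
import os
import struct
import tempfile

def read_sfs_sites(path, n_individuals):
    """Yield one tuple of 2n+1 doubles per site from a binary .sfs file."""
    n_cat = 2 * n_individuals + 1
    record = struct.Struct(f"={n_cat}d")  # native byte order, 8-byte doubles
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(record.size)
            if len(chunk) < record.size:
                break  # end of file (or a truncated trailing record)
            yield record.unpack(chunk)

def most_likely_category(site):
    """Index of the category with the highest log-scaled value."""
    return max(range(len(site)), key=lambda j: site[j])

# Demo: write two made-up sites for one individual, then read them back.
demo_sites = [(0.0, -3.5, -9.0), (-2.0, 0.0, -4.0)]
with tempfile.NamedTemporaryFile(suffix=".sfs", delete=False) as tmp:
    for site in demo_sites:
        tmp.write(struct.pack("=3d", *site))
sites = list(read_sfs_sites(tmp.name, n_individuals=1))
os.remove(tmp.name)
print([most_likely_category(s) for s in sites])
```

Reading site by site keeps memory use small, at the cost of one read call per site.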

Folded spectra

The following applies to versions 0.556 and above.

If you don't have the ancestral state, you can instead estimate the folded SFS. This is done by supplying the reference to -anc and also supplying -fold 1. Remember to then change the second parameter of emOptim2 to the number of individuals instead of the number of chromosomes.

The above example would then be:

#first generate .sfs file
./angsd -bam smallBam.filelist -realSFS 1 -out small -anc  hg19.fa -GL 2 -fold 1 [options]
#now try the EM optimization with 4 threads
./emOptim2 small.sfs 10 -maxIter 100 -P 4 >small.sfs.em.ml
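The only difference between the two emOptim2 calls on this page is the second parameter. The sketch below captures that rule for diploid samples and assembles the argument list, using only the options shown in the examples above. The helper names are hypothetical and nothing here is part of ANGSD itself.

```python
def em_optim_size_argument(n_diploid, folded=False):
    """Second emOptim2 parameter: chromosomes if unfolded, individuals if folded."""
    return n_diploid if folded else 2 * n_diploid

def em_optim_command(sfs_file, n_diploid, folded=False, threads=4, max_iter=100):
    """Build the emOptim2 argument list as used in the examples on this page."""
    size = em_optim_size_argument(n_diploid, folded)
    return ["./emOptim2", sfs_file, str(size),
            "-maxIter", str(max_iter), "-P", str(threads)]

print(" ".join(em_optim_command("small.sfs", 10)))
print(" ".join(em_optim_command("small.sfs", 10, folded=True)))
```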