ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

SFS Estimation: Difference between revisions

From angsd
Jump to navigation Jump to search
No edit summary
No edit summary
Line 1: Line 1:
==realSFS==
==realSFS==
This method will estimate the site frequency spectrum, the method is described in [[Nielsen2012]]
This method will estimate the site frequency spectrum, the method is described in [[Nielsen2012]].
 
This is a 2 step procedure first generate a ".sfs" file using -realSFS 1. And then do an optimization of the .sfs file which will estimate the Site frequency spectrum.
 
<classdiagram>
[sequence data]->GL[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]
[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]->realSFS[.sfs file]
[.sfs file]->optimize[.ml file]
</classdiagram>
 


From version 0.20 we can now estimate the joint likelihood of being in frequency j.


;-realSFS 1:  an sfs file will be generated.
;-realSFS 1:  an sfs file will be generated.
Line 9: Line 17:


;-realSFS 4: genotypecalling (not implemented, int this angsd)
;-realSFS 4: genotypecalling (not implemented, int this angsd)
= Full example for simulating nextgen data and estimating SFS (somewhat deprecated should use sfstools now) =
First we simulate a small dataset with very high probability of being variabe.
Number of individuals =15,number of sites=10000, pvar=0.9
<pre>
./simnextgen.gcc -nind 15 -nsites 10000 -outfiles small -pvar 0.9
->Using args: -nind 15 -errate 0.007500 -depth 5.000000 -pvar 0.900000 -nsites 10000 -F 0.000000 -model 1
->Dumping files: sequencefile: small.seq.gz glffile: small.glf.gz truefreq: small.frq args:small.args geno:small.geno
</pre>
The .frq file contains the folded spectra, we use getUnfolded on the "true" genotypes, to get the unfolded spectra
We now run dirty on the glf.gz file.
<pre>
./angsd -sim1 misc/small.glf.gz -nInd 15 -outfiles results -realSFS 1
##screen output removed
</pre>
And we now run the optimSFS program to get the spectra, we supply the filename, the number of chromosomes, and the number of threads we want to use
<pre>
./optimSFS.gcc -binput ../results.sfs -ncat 30 -nThread 4
##screen output removed
</pre>
The final output is in results.sfs.ml
We start R and plot our estimated SFS along with the true SFS
<pre>
mytrue <- as.numeric(read.table("small.geno.unfolded",skip=1))
myest <-scan("../results.sfs.ml")
barplot(rbind(mytrue,myest),beside=T)
</pre>
[[Image:SfsExample.png]]

Revision as of 14:07, 10 October 2012

realSFS

This method will estimate the site frequency spectrum, the method is described in Nielsen2012.

This is a 2 step procedure first generate a ".sfs" file using -realSFS 1. And then do an optimization of the .sfs file which will estimate the Site frequency spectrum.

<classdiagram>

[sequence data]->GL[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]

[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]->realSFS[.sfs file] [.sfs file]->optimize[.ml file]

</classdiagram>


-realSFS 1
an sfs file will be generated.
-realSFS 2
snpcalling (not implemented, in this angsd)
-realSFS 4
genotypecalling (not implemented, int this angsd)