ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
2d SFS Estimation: Difference between revisions
Line 13: | Line 13: | ||
<pre> | <pre> | ||
# as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data. | # as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data. | ||
angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 | angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -doSaf 1 | ||
angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 | angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -doSaf 1 | ||
</pre> | </pre> | ||
Each run will generate 2 files of interest: '''pop1.saf,pop1.saf.pos''' and '''pop2.saf,pop2.saf.pos''' | Each run will generate 2 files of interest: '''pop1.saf,pop1.saf.pos''' and '''pop2.saf,pop2.saf.pos''' | ||
Line 32: | Line 32: | ||
<pre> | <pre> | ||
# as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data. | # as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data. | ||
angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -sites intersect.txt | angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -sites intersect.txt -doSaf 1 | ||
angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -sites intersect.txt | angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -sites intersect.txt -doSaf 1 | ||
</pre> | </pre> | ||
Notice that the last 2 commands will overwrite the: '''pop1.saf,pop1.saf.pos''' and '''pop2.saf,pop2.saf.pos''' files. | Notice that the last 2 commands will overwrite the: '''pop1.saf,pop1.saf.pos''' and '''pop2.saf,pop2.saf.pos''' files. |
Revision as of 19:06, 9 September 2014
Angsd can estimate a 2d site frequency spectrum. This is an extension of the 1d site frequency spectrum method.
The method works by calculating population specific sample allele frequencies. A minor annoyance in the current implementation is that you will need to limit the analysis to the sites that has coverage in both populations. This in effect means that you will need to do two passes for each population.
And is best explained by a full example.
Example
- Assume you have a 12 bamfiles for population in the file pop1.list
- Assume you have a 14 bamfiles for population in the file pop2.list
- Assume you have a fastafile containing the ancestral state in the anc.fa
- Assume we are only interested in chr1
Let's start by finding the positions for which we have data in population1 and population2
# as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data. angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -doSaf 1 angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -doSaf 1
Each run will generate 2 files of interest: pop1.saf,pop1.saf.pos and pop2.saf,pop2.saf.pos
If we were interested in estimating the 1d sfs for each population we could do it like this using the realSFS program. (See more on page )
realSFS pop1.saf 24 -P 24 >pop1.saf.sfs realSFS pop2.saf 28 -P 24 >pop2.saf.sfs #first argument is saf file, second argument is the number of chromosomes, -P 24 is the number of cores we want to use
Now we find the positions that occurs both in population1 and population2 using the uniq POSIX program.
gunzip -c pop1.saf.pos.gz pop2.saf.pos.gz|sort -S 50%|uniq -d|sort -k1,1 -S 50% >intersect.txt
And now we redo the angsd sample allelefrequence calculation by conditioning on the sites that occur in both populations
# as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data. angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -sites intersect.txt -doSaf 1 angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -sites intersect.txt -doSaf 1
Notice that the last 2 commands will overwrite the: pop1.saf,pop1.saf.pos and pop2.saf,pop2.saf.pos files.
And we now estimate the joint site frequency spectra by using the realSFS program
realSFS 2dsfs pop1.saf pop2.saf 24 28 -P 24 >2dsfs.sfs
The output is then located in a nice matrix format(25x29) in the file: 2dsfs.sfs. Good luck visualising it, some people are using dadi, we have been using heat maps in R.