ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Fasta: Difference between revisions

From angsd
Jump to navigation Jump to search
(Created page with "This option create a fasta file from a sequencing data file. <classdiagram type="dir:LR"> [One bam file{bg:orange}]->[sequencing data|random base (-doFasta 1);consensus base...")
 
No edit summary
Line 1: Line 1:
This option create a fasta file from a sequencing data file.
This option create a fasta file from a sequencing data file. The function uses genome information in the bam header to determing the length and chromosome names. For the sites without data an "N" is written.  


<classdiagram type="dir:LR">
<classdiagram type="dir:LR">
Line 8: Line 8:
=Brief Overview=
=Brief Overview=
<pre>
<pre>
> ./angsd -doFasta
--------------
--------------
analysisFasta.cpp:
analysisFasta.cpp:
Line 17: Line 18:


=options=
=options=
;-realSFS 1: an sfs file will be generated.
;-doFasta 1: sample a random base


;-realSFS 2:(version above 0.503) Taking into account perIndividual inbreeding coefficients. This is the work of Filipe G. Vieira
;-doFasta 2: use the most common base. In the case of ties a random base is chosen among the bases with the same counts


;-realSFS 4: snpcalling (not implemented, in this angsd)
;-minQ [INT]  
 
minimum base quality score
;-realSFS 8: genotypecalling (not implemented, int this angsd)
 
For the inbreeding version you need to supply a file containing all the inbreeding coefficients. -indF
 
 
;-underFlowProtect [INT]  
a very basic underflowprotection




==Example==
==Example==
A full example is shown below, here we use GATK genotype likelihoods and hg19.fa as the ancestral. The [[emOptim2]] can be found in the misc subfolder.
Create a fasta file bases on a random samples of bases
 
<pre>
#first generate .sfs file
./angsd -bam smallBam.filelist -realSFS 1 -out small -anc  hg19.fa -GL 2 [options]
#now try the EM optimization with 4 threads
./emOptim2 small.sfs 20 -maxIter 100 -P 4 >small.sfs.em.ml
</pre>
We always recommend that you filter out the bad qscore bases and meaningless mapQ reads. eg '''-minMapQ 1 -minQ 20'''.
<pre>
If you have say 10 diploid samples, then you should do -nChr 20
if you have say 12 diploid samples, then you should do -nChr 24.
</pre>
 
=Interpretation of output file=
 
The outpiles are then called small.em.ml. This will be the SFS in logscale.
This is to be interpreted as:
 
column1 := probabilty of sampling zero derived alleles
 
column2 := probabilty of sampling one derived allele
 
column3 := probabilty of sampling two derived allele
 
column4 := probabilty of sampling three derived allele
 
etc


==NB==
The generation of the .sfs file is done on a persite basis, whereas the optimization requires information for a region of the genome. The optimization will therefore use large amounts of memory.
The program defaults to 50megabase regions, and will loop over the genome using 50 megebase chunks. You can increase this adding -nSites 500000000. Which will then use 500megabase regions.
=Folded spectra=
<pre>
<pre>
Below is for version 0.556 and above
./angsd -i smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1
</pre>
If you don't have the ancestral state, you can instead estimate the folded SFS. This is done by supplying the -anc with the reference and also supply -fold 1.
Then you should remember to change the second parameter to the emOptim2 function as the number of individuals instead of the number of chromosomes.
 
The above example would then be
 
<pre>
<pre>
#first generate .sfs file
./angsd -bam smallBam.filelist -realSFS 1 -out small -anc  hg19.fa -GL 2 -fold 1 [options]
#now try the EM optimization with 4 threads
./emOptim2 small.sfs 10 -maxIter 100 -P 4 >small.sfs.em.ml
</pre>
=Posterior per-site distributions of the sample allele frequency=
This is assuming version 0.556 or higher.
If you supply a prior for the -realSFS analysis, the output of the .sfs file will no longer be loglikeratios of the different categories but log posteriors of the different categories.
=Format specification of binary .sfs file=
The information in this section is only usefull for people who wants to work with the "multisample GL"/"sample allele frequencies" for the individual sites.
Assuming 'n' individuals we have (2n+1) categories for each site, each value is encoded as ctype double which on 'all known' platforms occupies 8 bytes. The (2n+1) values are log scaled like ratios, which means that the most likely category will have a value of 0.
The information for the first site is the first (2n+1)sizeof(double) bytes. The information for the next site is the next (2n+1) bytes etc. The positions are given by the ascii file called '.sfs.pos'
<pre>
#sample code to read .sfs in c/c++ assuming 10 individuals
FILE *fp = fopen("mySFS.sfs","rb");
int nInd = 10;
double persite[2*nInd+1];
fread(persite,sizeof(double)*(2*nInd+1),1,fp);
for(int i=0;i<2*nind+1;i++)
  fprintf(stderr,"cat: %d = %f\n",i,persite[i]);
</pre>
* If the -fold 1 has been set, then the dimension is no longer 2*nInd+1 but nInd+1
* If the -pest parameter has been supplied the output is no longer like ratios to the most likely but log posts of the different categories.
=Theory=
<pre>
We will try to elaborate on the theory behind the methods. Below is only a preliminary version of the theory.
</pre>
This method is described in [[Nielsen2012]].
==SFS definition==
For 'n' diploid samples, the site frequency spectrum '''(SFS)''' is the (2n+1) vector containing the proportion of site carrying 'k'-mutations. This means that the first element in the SFS is the proportion of sites where we don't observe any mutations, The second value is the proportion of sites where we observe 1 mutations. The last value is the proportion of sites we only observe mutations. It follows that the first and last column are the invariable categories and assuming that the SFS contains relative frequencies the variability in the sample can be estimated by:
<math>pvar=1-sfs_0-sfs_{2n}=\sum_{i=1}^{2n-1}sfs_i</math>
==Sample allele frequency/Multisample GL==
By supplying the -realSFS 1, flag to angsd. Angsd will calculate the likelihood of the sample allele frequency for each site and dump these into the .sfs file. The likelihood of the sample allele frequency are in this context the likelihood of sampling k-derived alleles. This is estimated on the basis of the 10 possible genotype likelihoods for all individuals by summing over all combinations. This is done using the recursive algorithm described in [[Nielsen2012]]. This we write as <math>p(X^s\mid j)</math> meaning the likelihood of sampling j derived alleles for site s. And we calculate the folded as
<math>
p_{fold}(x^s\mid j) =p(x^s\mid j) + p(x^s\mid 2n- j),\qquad j\in\{0,1,3,\ldots,n-1\},
</math>
<math>
p_{fold}(x^s\mid j) =2p(x^s\mid j) ,\qquad j=n
</math>
==Likelihood of the SFS==
The likelihood of the sfs is then given as:
<math>
p(X|\theta) = \prod_{s=0}^S\sum_{i=0}^{2n} p(X^s\mid i )\theta_i
</math>
Here <math>\theta</math> is our sfs. In the case of the folded sfs, we use n instead of 2n in the summation. We can find the MLE of the SFS by using either an BFGS approach that uses derivatives or by using en EM algorithm. Both is implemented in the emOptim2 program.

Revision as of 18:43, 27 November 2013

This option create a fasta file from a sequencing data file. The function uses genome information in the bam header to determing the length and chromosome names. For the sites without data an "N" is written.

<classdiagram type="dir:LR">

[One bam file{bg:orange}]->[sequencing data|random base (-doFasta 1);consensus base (-doFasta 2)]

[sequencing data]->doFasta[fasta file{bg:blue}]

</classdiagram>

Brief Overview

> ./angsd -doFasta
--------------
analysisFasta.cpp:
	-doFasta	0
	1: use a random base
	2: use the most common base (needs -doCounts 1)
	-minQ		13	(remove bases with qscore<minQ)

options

-doFasta 1
sample a random base
-doFasta 2
use the most common base. In the case of ties a random base is chosen among the bases with the same counts
-minQ [INT]

minimum base quality score


Example

Create a fasta file bases on a random samples of bases

./angsd -i smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1