Revision as of 16:18, 8 February 2016

Available from version 0.559+.

Newer GitHub versions limit the output to the chromosomes with data. Earlier versions printed "N" for chromosomes without data.

This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.
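The "N" filling described above can be pictured with a small Python sketch. It is not ANGSD's code: the function name `build_sequence` and the 0-based position dictionary are assumptions made for illustration only.

```python
def build_sequence(length, called_bases):
    """Return a sequence of `length` characters, "N" wherever no base was called.

    `called_bases` maps a 0-based position to a single base (hypothetical layout).
    """
    seq = ["N"] * length              # start with "no data" everywhere
    for pos, base in called_bases.items():
        if 0 <= pos < length:         # ignore positions outside the chromosome
            seq[pos] = base
    return "".join(seq)


# A chromosome of length 6 with data at two sites only.
print(build_sequence(6, {1: "A", 4: "g"}))
```

The chromosome length would come from the BAM header, which is why the output covers the whole chromosome regardless of how many sites have data.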

[Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3)]

[sequence data]->doFasta[fasta file{bg:blue}]


This can be used as input for the ANGSD analysis:

Brief Overview

> ./angsd -doFasta
--------------
analysisFasta.cpp:
	-doFasta	0
	1: use a random base
	2: use the most common base (needs -doCounts 1)
	3: use the base with highest ebd (under development) 
	-basesPerLine	50	(Number of bases perline in output file)
	-explode	0	(Should we include chrs with no data?)
	-rmTrans	0	(remove transitions (as different from -ref bases)?)
	-ref	(null)	(reference fasta, only used with -rmTrans 1)

This function writes a fasta file using the full header information from the SAM/BAM file. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.

Options

-doFasta 1: sample a random base at each position.

-doFasta 2: use the most common base. In the case of ties, a random base is chosen among the bases with the same maximum count. The "-doCounts 1" option for allele counts is needed in order to determine the most common base.

-doFasta 3: use the base with the highest effective base depth (EBD).
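As a rough illustration of the first two modes, the following Python sketch picks a base for one site from the observed bases. It is a sketch under stated assumptions, not ANGSD's implementation: the function names are hypothetical, and returning "N" for a site without data follows the description at the top of this page.

```python
import random
from collections import Counter

VALID_BASES = set("ACGT")


def random_base(bases, rng=random):
    """Mode 1 (sketch): pick one observed base at random, "N" if there is no data."""
    return rng.choice(bases) if bases else "N"


def consensus_base(bases, rng=random):
    """Mode 2 (sketch): most common base; ties are broken at random."""
    counts = Counter(b.upper() for b in bases if b.upper() in VALID_BASES)
    if not counts:
        return "N"
    top = max(counts.values())
    tied = sorted(b for b, c in counts.items() if c == top)
    return rng.choice(tied)           # a single candidate when there is no tie


print(consensus_base("AAC"))   # A is the only base with the maximum count
print(consensus_base("ACAC"))  # tie between A and C, either may be printed
```

Counting is done on uppercased bases here, which matches the statement in the Output section that the consensus fasta contains only capital letters.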

-basesPerLine [INT]

Number of bases per line in the output fasta file (default is 50).
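The effect of this option is plain line wrapping, which the Python sketch below illustrates. The function name `format_fasta_record` is hypothetical, and the exact header line ANGSD writes is assumed here to be ">" followed by the chromosome name.

```python
def format_fasta_record(name, sequence, bases_per_line=50):
    """Return one fasta record with the sequence wrapped at `bases_per_line`."""
    lines = [">" + name]
    for start in range(0, len(sequence), bases_per_line):
        lines.append(sequence[start:start + bases_per_line])
    return "\n".join(lines) + "\n"


# 120 bases with the default of 50 per line give lines of 50, 50 and 20 bases.
print(format_fasta_record("chr1", "N" * 120), end="")
```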

-explode [INT]

0 (default): only output chromosomes with data. 1: write out all chromosomes.
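The selection made by -explode can be sketched in a few lines of Python. The function name and the arguments (a mapping of header chromosome names to lengths, and a set of chromosomes that have data) are assumptions for illustration, not ANGSD internals.

```python
def chromosomes_to_write(header_lengths, with_data, explode=0):
    """Return chromosome names to output, in header order.

    explode=0: only chromosomes that have data; explode=1: every header entry.
    """
    if explode:
        return list(header_lengths)
    return [name for name in header_lengths if name in with_data]


header = {"chr1": 100, "chr2": 50}          # hypothetical BAM header entries
print(chromosomes_to_write(header, {"chr2"}))             # only chr2
print(chromosomes_to_write(header, {"chr2"}, explode=1))  # chr1 and chr2
```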

For filters see Filters

Output

The output is an ordinary fasta file. For -doFasta 1 the sequence mixes uppercase and lowercase letters, because the bases are copied directly from the sequencing data and the case corresponds to the strand of the original data. For the consensus fasta all letters are uppercase.

Example

Create a fasta file from a random sample of bases.

./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1

EBD

There are four EBD values, one for each base. Each EBD is built from the product of the mapping quality and the base quality score for the base under consideration. The EBD is the effective base depth, as defined by [1] (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3638139/):

$EBD_{A}=\sum _{b_{i}=A,C,G,T}^{N}(phred(mapq_{i})*phred(qscore_{i})),\qquad phred(q)=10^{-q/10}$
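The formula can be followed literally in a short Python sketch. Two things here are assumptions rather than facts from this page: the sum is read as running over the reads whose observed base equals the base under consideration, and the input layout (tuples of base, mapping quality and base quality) and function names are hypothetical.

```python
def phred(q):
    """phred(q) = 10^(-q/10), as written in the formula above."""
    return 10 ** (-q / 10)


def effective_base_depth(reads):
    """Return a dict base -> EBD for one site.

    `reads` is an iterable of (base, mapq, qscore) tuples (assumed layout).
    """
    ebd = {base: 0.0 for base in "ACGT"}
    for base, mapq, qscore in reads:
        base = base.upper()
        if base in ebd:                       # skip N and other symbols
            ebd[base] += phred(mapq) * phred(qscore)
    return ebd


site = [("A", 10, 10), ("A", 20, 10), ("C", 10, 20)]
ebd = effective_base_depth(site)
best = max(ebd, key=ebd.get)                  # base with the highest EBD
print(ebd, best)
```

With -doFasta 3 the base with the highest EBD is the one written to the fasta file; how ties are handled is not described on this page, and `max` simply returns the first maximum here.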

