ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Fasta
Revision as of 16:18, 8 February 2016
Available from version 0.559+.
Newer GitHub versions limit the output to the chromosomes with data. Earlier versions printed "N" for chromosomes without data.
This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.
<classdiagram type="dir:LR">
[Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3)]
[sequence data]->doFasta[fasta file{bg:blue}]
</classdiagram>
This can be used as input for the ANGSD analysis:
Brief Overview
> ./angsd -doFasta
--------------
analysisFasta.cpp:
    -doFasta        0
        1: use a random base
        2: use the most common base (needs -doCounts 1)
        3: use the base with highest ebd (under development)
    -basesPerLine   50      (Number of bases perline in output file)
    -explode        0       (Should we include chrs with no data?)
    -rmTrans        0       (remove transitions (as different from -ref bases)?)
    -ref            (null)  (reference fasta, only used with -rmTrans 1)
This function dumps a fasta file using the full header information from the SAM/BAM file. This means that a fasta is generated for the entire chromosome even if '-r/-rf -sites' is used.
Options
- -doFasta 1
- sample a random base at each position.
- -doFasta 2
- use the most common base. In the case of ties, a random base is chosen among the bases with the same maximum count. The "-doCounts 1" option for allele counts is needed in order to determine the most common base.
- -doFasta 3
- use the base with the highest effective base depth (EBD).
- -basesPerLine [INT]
Number of bases per line in the output fasta file (default is 50)
- -explode [INT]
0 (default): only output chromosomes with data. 1: write out all chromosomes.
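The base choices behind -doFasta 1 and -doFasta 2, and the line wrapping controlled by -basesPerLine, can be sketched as below. This is a minimal illustration, not ANGSD's code: the function names are hypothetical, and it assumes the random base is drawn uniformly from the bases observed at the site and that a site without data yields "N".

```python
import random
from collections import Counter


def random_base(bases, rng=random):
    """-doFasta 1 analogue: return one observed base, chosen at random.

    Assumption: the draw is uniform over the observed bases.
    The letter case of the chosen base is kept as observed.
    """
    if not bases:
        return "N"  # no data at this site
    return rng.choice(bases)


def most_common_base(bases, rng=random):
    """-doFasta 2 analogue: return the most frequent base, in upper case.

    Ties are broken at random among the bases sharing the maximum count.
    """
    if not bases:
        return "N"  # no data at this site
    counts = Counter(b.upper() for b in bases)
    top = max(counts.values())
    tied = sorted(b for b, n in counts.items() if n == top)
    return rng.choice(tied)


def wrap(seq, bases_per_line=50):
    """-basesPerLine analogue: split a sequence into fixed-width lines."""
    return [seq[i:i + bases_per_line] for i in range(0, len(seq), bases_per_line)]


if __name__ == "__main__":
    rng = random.Random(1)
    site = ["A", "a", "A", "c"]  # bases observed at one position
    print(random_base(site, rng), most_common_base(site, rng))
    print(wrap("ACGT" * 30, 50))
```

Passing a seeded random.Random instance makes the random choices reproducible, which helps when comparing runs.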
For filters see Filters
Output
Output is an ordinary fasta file. For -doFasta 1 the letters are a mix of upper and lower case, because the bases are copied directly from the sequencing data and the case indicates the strand of the original read. For the consensus fasta all letters are upper case.
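To check how much of an output fasta is upper case, lower case, or "N", the sequences can be tallied with a few lines of Python. This is a minimal sketch using only the standard library; the helper names and the example record are made up for illustration, and the file path passed to tally_file is whatever fasta.gz file your run produced.

```python
import gzip
import io


def parse_fasta(handle):
    """Yield (name, sequence) pairs from an open text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def tally(seq):
    """Count upper-case bases, lower-case bases and N (sites without data)."""
    n = sum(1 for c in seq if c in "Nn")
    upper = sum(1 for c in seq if c.isupper() and c != "N")
    lower = sum(1 for c in seq if c.islower() and c != "n")
    return {"upper": upper, "lower": lower, "N": n}


def tally_file(path):
    """Tally every sequence in a gzipped fasta file."""
    with gzip.open(path, "rt") as handle:
        return {name: tally(seq) for name, seq in parse_fasta(handle)}


if __name__ == "__main__":
    example = ">chr1\nNNACgtN\nacGT\n"  # made-up record for illustration
    for name, seq in parse_fasta(io.StringIO(example)):
        print(name, tally(seq))
```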
Example
Create a fasta file from a random sample of bases.
./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1
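EBD
The EBD is the effective base depth, as defined by [1] (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3638139/). Each of the four bases has its own EBD: for every read showing that base, the transformed mapping quality is multiplied by the transformed base quality score, and these products are summed:
<math>
EBD_b = \sum_{i=1,\; b_i=b}^{N} phred(mapq_i) \cdot phred(qscore_i), \qquad b \in \{A,C,G,T\}, \qquad phred(q) = 10^{-q/10}
</math>
Here b_i is the base of read i among the N reads covering the site.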
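The per-base sum can be worked through numerically. The sketch below is a literal transcription under stated assumptions: it takes phred(q) = 10^(-q/10) as given in the EBD definition and assumes each base's sum runs over the reads showing that base. The function names and the example reads are made up for illustration, and this is not ANGSD's implementation.

```python
def phred(q):
    """Transform a phred-scaled score q as 10^(-q/10), as in the EBD definition."""
    return 10 ** (-q / 10)


def ebd(reads):
    """Return one EBD value per base.

    reads: iterable of (base, mapq, qscore) tuples for a single site.
    Assumption: a read contributes only to the EBD of the base it shows;
    anything other than A, C, G or T is skipped.
    """
    totals = {b: 0.0 for b in "ACGT"}
    for base, mapq, qscore in reads:
        b = base.upper()
        if b in totals:
            totals[b] += phred(mapq) * phred(qscore)
    return totals


if __name__ == "__main__":
    # Made-up reads at one site: (base, mapping quality, base quality)
    reads = [("A", 10, 10), ("A", 20, 10), ("c", 10, 20), ("N", 30, 30)]
    for base, value in ebd(reads).items():
        print(base, value)
```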