ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Fasta
This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.
<classdiagram type="dir:LR">
[Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3); write iupac (-doFasta 4)]
[sequence data]->doFasta[fasta file{bg:blue}]
</classdiagram>
<classdiagram type="dir:LR">
[Multiple BAM files{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);write iupac (-doFasta 4)]
[sequence data]->doFasta[fasta file{bg:blue}]
</classdiagram>
This can be used as input for the ANGSD analysis:
The iupac output code was kindly provided by Kristian Ullrich.
Brief Overview
./angsd -dofasta -> Tue Sep 26 17:02:07 2017 -------------- abcWriteFasta.cpp: -doFasta 0 1: use a random (non N) base (needs -doCounts 1) 2: use the most common (non N) base (needs -doCounts 1) 3: use the base with highest ebd (under development) 4: output iupac codes (under development) -basesPerLine 50 (Number of bases perline in output file) -explode 0 print chromosome where we have no data (0:no,1:yes) -rmTrans 0 remove transitions as different from -ref bases (0:no,1:yes) -ref (null) reference fasta, only used with -rmTrans 1 -seed 0 use non random seed of value 1
This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.
Options
- -doFasta 1
- sample a random base at each position. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
- -doFasta 2
- use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
- -doFasta 3
- use the base with thie highest effective depth (EBD). This only works for one individual
- -basesPerLine [INT]
Number of bases perline in output fasta file (default is 50)
- -explode [INT]
0 (default) only output chromosomes with data. 1: write out all chromosomes
- -rmTrans [INT]
0 (default) all sites are used. 1: Remove transition. Here transitions are determined using a fasta file such as a reference genome.
- -ref [fileName]
a fasta file used to determine if a site is a transitions (needed when using -rmTrans 1 is used)
- -seed [INT]
Use a seed in order to replicate results
For filters see Filters
Output
Output is a fasta file, a normal looking fast file. Nothing special about this. For -doFasta 1, sometimes its big letters sometime small letters. This is due to the results being copied directly from the sequencing data. So small/big letters correspond to which strand for the original data. For the consensus fasta all letters are capital letters.
Example
Create a fasta file bases from a random samples of bases.
./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1
EBD
For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration. The EBD is the effective base depth, as defined by [1]:
where is a certain base, is the number of reads with base
Ancestral fasta
If you have outgroup species map to your reference genome and you want to use them to make a fasta file with ancestral alleles. You can use one or more outgroup individuals e.g. for human you could have a four outgroup bam file from a chimp, a bonobo, a gorrilla and a orangotan. Assuming you want to make a fasta file where the alleles is the same for all outgroup species then you can use a command like
./angsd -b fourOutgroup.bamlist -out myFasta -doCounts 1 -snp_pval 0.01 -domaf 1 -domajorminor 1 -gl 2 -rmSNPs 1 -minind 4 -setMinDepthInd 10 -explode 1
-b fourOutgroup.bamlist contains the bam files for four outgroup individuals.
-out myFasta the output name
-doCounts 1 counts bases accross individuals to determine the concensus allele
-snp_pval 0.01 p-value threshold for defining a SNP. A lower threshold will need more evidence to call a SNPs.
-domaf 1 estimate allele frequency (use to call SNPs) with that the major and minor alleles inferred from data
-domajorminor 1 infer the major and minor allele from data
-gl 2 use genotype likelihoods based on the GATK model
-rmSNPs 1 remove polymorphic sites. instead of keeping sites that are polymorphic then we remove them such that all outgroups have the same allele.
-minind 4 remove site where you don't have data for all four individuals
-setMinDepthInd 10 require at least 10 read for each individual
-explode 1 make the fasta file for the whole genome not just the chromosomes/scaffolds were you have data
The sites that are polymorphic or do not have enough data will be labeled as 'N' in the fasta file