Latest revision as of 17:45, 6 February 2023

This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.

[Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3); write iupac (-doFasta 4)]

[sequence data]->doFasta[fasta file{bg:blue}]

</classdiagram>

[Multiple BAM files{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);write iupac (-doFasta 4)]

[sequence data]->doFasta[fasta file{bg:blue}]

</classdiagram>

This can be used as input for the ANGSD analysis:

The iupac output code was kindly provided by Kristian Ullrich.

Brief Overview

./angsd -dofasta 	-> Tue Sep 26 17:02:07 2017
--------------
abcWriteFasta.cpp:
	-doFasta	0
	1: use a random (non N) base (needs -doCounts 1)
	2: use the most common (non N) base (needs -doCounts 1)
	3: use the base with highest ebd (under development) 
	4: output iupac codes (under development) 
	-basesPerLine	50	(Number of bases perline in output file)
	-explode	0	 print chromosome where we have no data (0:no,1:yes)
	-rmTrans	0	 remove transitions as different from -ref bases (0:no,1:yes)
	-ref	(null)	 reference fasta, only used with -rmTrans 1
	-seed	0	 use non random seed of value 1

This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.

Options

-doFasta 1: sample a random base at each position. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.

-doFasta 2: use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.

-doFasta 3: use the base with thie highest effective depth (EBD). This only works for one individual

-basesPerLine [INT]

Number of bases perline in output fasta file (default is 50)

-explode [INT]

0 (default) only output chromosomes with data. 1: write out all chromosomes

-rmTrans [INT]

0 (default) all sites are used. 1: Remove transition. Here transitions are determined using a fasta file such as a reference genome.

-ref [fileName]

a fasta file used to determine if a site is a transitions (needed when using -rmTrans 1 is used)

-seed [INT]

Use a seed in order to replicate results ( relevant when using random sample -dofasta 1 )

For filters see Filters

Output

Output is a fasta file, a normal looking fast file. Nothing special about this. For -doFasta 1, sometimes its big letters sometime small letters. This is due to the results being copied directly from the sequencing data. So small/big letters correspond to which strand for the original data. For the consensus fasta all letters are capital letters.

Example

Create a fasta file bases from a random samples of bases.

./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1

EBD

For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration. The EBD is the effective base depth, as defined by [1]:

$EBD_{b}=\sum _{i=1}^{N_{b}}(phred(mapq_{i})*phred(qscore_{i})),\qquad phred(q)=10^{-q/10}\qquad b\in \{A,C,G,T\}$

where $b$ is a certain base, $N_{b}$ is the number of reads with base $b$

Ancestral fasta using multiple outgroups

If you have outgroup species map to your reference genome and you want to use them to make a fasta file with ancestral alleles. You can use one or more outgroup individuals e.g. for human you could have a four outgroup bam file from a chimp, a bonobo, a gorrilla and a orangotan. Assuming you want to make a fasta file where the alleles is the same for all outgroup species then you can use a command like

./angsd -b fourOutgroup.bamlist -out myFasta -doCounts 1 -snp_pval 0.01 -domaf 1 -domajorminor 1 -gl 2 -rmSNPs 1 -minind 4 -setMinDepthInd 10 -explode 1

-b fourOutgroup.bamlist contains the bam files for four outgroup individuals.

-out myFasta the output name

-doCounts 1 counts bases accross individuals to determine the concensus allele

-snp_pval 0.01 p-value threshold for defining a SNP. A lower threshold will need more evidence to call a SNPs.

-domaf 1 estimate allele frequency (use to call SNPs) with that the major and minor alleles inferred from data

-domajorminor 1 infer the major and minor allele from data

-gl 2 use genotype likelihoods based on the GATK model

-rmSNPs 1 remove polymorphic sites. instead of keeping sites that are polymorphic then we remove them such that all outgroups have the same allele.

-minind 4 remove site where you don't have data for all four individuals

-setMinDepthInd 10 require at least 10 read for each individual

-explode 1 make the fasta file for the whole genome not just the chromosomes/scaffolds were you have data

The sites that are polymorphic or do not have enough data will be labeled as 'N' in the fasta file

Fasta: Difference between revisions

Latest revision as of 17:45, 6 February 2023

Contents

Brief Overview

Options

Output

Example

EBD

Ancestral fasta using multiple outgroups

Navigation menu

@@ Line 1: / Line 1: @@
-availeble from version 0.559+
-This option creates a fasta file from a sequencing data file (bam file). The function uses genome information in the bam header to determine the length and chromosome names. For the sites without data an "N" is written.
+This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.
 <classdiagram type="dir:LR">
-  [One bam file{bg:orange}]->[sequencing data|random base (-doFasta 1);consensus base (-doFasta 2)]
+  [Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3); write iupac (-doFasta 4)]
-[sequencing data]->doFasta[fasta file{bg:blue}]
+[sequence data]->doFasta[fasta file{bg:blue}]
   </classdiagram>
+<classdiagram type="dir:LR">
+ [Multiple BAM files{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);write iupac (-doFasta 4)]
+[sequence data]->doFasta[fasta file{bg:blue}]
+ </classdiagram>
+This can be used as input for the ANGSD analysis:
+# [[Error estimation]]
+# [[ABBA-BABA]]
+The iupac output code was kindly provided by Kristian Ullrich.
 =Brief Overview=
 <pre>
-> ./angsd -doFasta
+./angsd -dofasta 	-> Tue Sep 26 17:02:07 2017
 --------------
-analysisFasta.cpp:
+abcWriteFasta.cpp:
 	-doFasta	0
-: use a random base
+: use a random (non N) base (needs -doCounts 1)
-: use the most common base (needs -doCounts 1)
+: use the most common (non N) base (needs -doCounts 1)
-	-minQ		13	(remove bases with qscore<minQ)
+: use the base with highest ebd (under development)
+: output iupac codes (under development)
+	-basesPerLine	50	(Number of bases perline in output file)
+	-explode	0	 print chromosome where we have no data (0:no,1:yes)
+	-rmTrans	0	 remove transitions as different from -ref bases (0:no,1:yes)
+	-ref	(null)	 reference fasta, only used with -rmTrans 1
+	-seed	0	 use non random seed of value 1
 </pre>
-=options=
+This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.
-;-doFasta 1: sample a random base at each position
+=Options=
+;-doFasta 1: sample a random base at each position. N's or filtered based are ignored. The "-doCounts 1" options for [[Alleles_counts|allele counts]] is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
-;-doFasta 2: use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. The "-doCounts 1" options for [[Alleles_counts|allele counts]] is needed in order to determine the most common base
+;-doFasta 2: use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. N's or filtered based are ignored. The "-doCounts 1" options for [[Alleles_counts|allele counts]] is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
-;-minQ [INT]
+;-doFasta 3: use the base with thie highest effective depth (EBD). This only works for one individual
-minimum base quality score
+;-basesPerLine	[INT]
+Number of bases perline in output fasta file (default is 50)
+;-explode	[INT]
+(default) only output chromosomes with data. 1: write out all chromosomes
+;-rmTrans [INT]
+(default) all sites are used. 1: Remove transition. Here transitions are determined using a fasta file such as a reference genome.
+;-ref [fileName]
+a fasta file used to determine if a site is a transitions (needed when using -rmTrans 1 is used)
+;-seed [INT]
+Use a seed in order to replicate results ( relevant when using random sample -dofasta 1 )
+For filters see [[Filters]]
+=Output=
+Output is a fasta file, a normal looking fast file. Nothing special about this. For -doFasta 1, sometimes its big letters sometime small letters. This is due to the results being copied directly from the sequencing data. So small/big letters correspond to which strand for the original data. For the consensus fasta all letters are capital letters.
 ==Example==
-Create a fasta file bases from a random samples of bases
+Create a fasta file bases from a random samples of bases.
 <pre>
-./angsd -i smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1
+./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1
 </pre>
+=EBD=
+For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration.
+The EBD is the effective base depth, as defined by [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3638139/]:
+<math>
+EBD_b = \sum_{i=1}^{N_b} (phred(mapq_i)*phred(qscore_i)),\qquad phred(q) =10^{-q/10} \qquad b \in \{A,C,G,T\}
+</math>
+where <math>b</math> is a certain base, <math>N_b</math> is the number of reads with base <math>b</math>
+=Ancestral fasta using multiple outgroups=
+If you have outgroup species map to your reference genome and you want to use them to make a fasta file with ancestral alleles. You can use one or more outgroup individuals e.g. for human you could have a four outgroup bam file from a chimp, a bonobo, a gorrilla and a orangotan. Assuming you want to make a fasta file where the alleles is the same for all outgroup species then you can use a command like
+<pre>
+./angsd -b fourOutgroup.bamlist -out myFasta -doCounts 1 -snp_pval 0.01 -domaf 1 -domajorminor 1 -gl 2 -rmSNPs 1 -minind 4 -setMinDepthInd 10 -explode 1
+</pre>
+'''-b fourOutgroup.bamlist''' contains the bam files for four outgroup individuals.
+'''-out myFasta''' the output name
+'''-doCounts 1''' counts bases accross individuals to determine the concensus allele
+'''-snp_pval 0.01''' p-value threshold for defining a SNP. A lower threshold will need more evidence to call a SNPs.
+''' -domaf 1''' estimate allele frequency (use to call SNPs) with that the major and minor alleles inferred from  data
+'''-domajorminor 1''' infer the major and minor allele from data
+'''-gl 2'''  use genotype likelihoods based on the GATK model
+''' -rmSNPs 1''' remove polymorphic sites. instead of keeping sites that are polymorphic then we remove them such that all outgroups have the same allele.
+''' -minind 4''' remove site where you don't have data for all four individuals
+'''-setMinDepthInd 10''' require at least 10 read for each individual
+''' -explode 1''' make the fasta file for the whole genome not just the chromosomes/scaffolds were you have data
+The sites that are polymorphic or do not have enough data will be labeled as 'N' in the fasta file