ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Fasta: Difference between revisions

From angsd
Jump to navigation Jump to search
 
(31 intermediate revisions by 2 users not shown)
Line 1: Line 1:
Available from version 0.559+.


This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.  
This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.  


<classdiagram type="dir:LR">
<classdiagram type="dir:LR">
  [Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2)]
  [Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3); write iupac (-doFasta 4)]
[sequence data]->doFasta[fasta file{bg:blue}]
[sequence data]->doFasta[fasta file{bg:blue}]
  </classdiagram>
  </classdiagram>


<classdiagram type="dir:LR">
[Multiple BAM files{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);write iupac (-doFasta 4)]
[sequence data]->doFasta[fasta file{bg:blue}]
</classdiagram>
This can be used as input for the ANGSD analysis:
# [[Error estimation]]
# [[ABBA-BABA]]
The iupac output code was kindly provided by Kristian Ullrich.
=Brief Overview=
=Brief Overview=
<pre>
<pre>
> ./angsd -doFasta
./angsd -dofasta -> Tue Sep 26 17:02:07 2017
--------------
--------------
nalysisFasta.cpp:
abcWriteFasta.cpp:
-doFasta 0
-doFasta 0
1: use a random base
1: use a random (non N) base (needs -doCounts 1)
2: use the most common base (needs -doCounts 1)
2: use the most common (non N) base (needs -doCounts 1)
3: use the base with highest ebd (under development)  
3: use the base with highest ebd (under development)  
-minQ 13 (remove bases with qscore<minQ)
4: output iupac codes (under development)  
-basesPerLine 50 (Number of bases perline in output file)
-basesPerLine 50 (Number of bases perline in output file)
-explode 0 print chromosome where we have no data (0:no,1:yes)
-rmTrans 0 remove transitions as different from -ref bases (0:no,1:yes)
-ref (null) reference fasta, only used with -rmTrans 1
-seed 0 use non random seed of value 1
</pre>
</pre>


This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for ALL entries in the header even if '-r/-rf -filter' is used.
This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.


The EBD is the effective base depth, as defined by [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3638139/]:
=Options=
;-doFasta 1: sample a random base at each position. N's or filtered based are ignored. The "-doCounts 1" options for [[Alleles_counts|allele counts]] is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.  


<math>
;-doFasta 2: use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. N's or filtered based are ignored. The "-doCounts 1" options for [[Alleles_counts|allele counts]] is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
EBD_A = \sum_{b_i=A}^N (phred(mapq_i)*phred(qscore_i)),\qquad phred(q) =10^{-q/10} 
</math>


For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration.
;-doFasta 3: use the base with thie highest effective depth (EBD). This only works for one individual


=Options=
;-basesPerLine [INT]
;-doFasta 1: sample a random base at each position.
Number of bases perline in output fasta file (default is 50)


;-doFasta 2: use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. The "-doCounts 1" options for [[Alleles_counts|allele counts]] is needed in order to determine the most common base.
;-explode [INT]
0 (default) only output chromosomes with data. 1: write out all chromosomes
;-rmTrans [INT]
0 (default) all sites are used. 1: Remove transition. Here transitions are determined using a fasta file such as a reference genome.  
;-ref [fileName]
a fasta file used to determine if a site is a transitions (needed when using -rmTrans 1 is used)
;-seed [INT]
Use a seed in order to replicate results ( relevant when using random sample -dofasta 1 )


;-minQ [INT]  
For filters see [[Filters]]
minimum base quality score.


=Output=
=Output=
Line 46: Line 67:


<pre>
<pre>
./angsd -i smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1
./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1
</pre>
 
 
 
=EBD=
For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration.
The EBD is the effective base depth, as defined by [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3638139/]:
 
<math>
EBD_b = \sum_{i=1}^{N_b} (phred(mapq_i)*phred(qscore_i)),\qquad phred(q) =10^{-q/10} \qquad b \in \{A,C,G,T\}
</math>
 
where <math>b</math> is a certain base, <math>N_b</math> is the number of reads with base <math>b</math>
 
 
=Ancestral fasta using multiple outgroups=
If you have outgroup species map to your reference genome and you want to use them to make a fasta file with ancestral alleles. You can use one or more outgroup individuals e.g. for human you could have a four outgroup bam file from a chimp, a bonobo, a gorrilla and a orangotan. Assuming you want to make a fasta file where the alleles is the same for all outgroup species then you can use a command like
 
<pre>
./angsd -b fourOutgroup.bamlist -out myFasta -doCounts 1 -snp_pval 0.01 -domaf 1 -domajorminor 1 -gl 2 -rmSNPs 1 -minind 4 -setMinDepthInd 10 -explode 1
 
</pre>
</pre>
'''-b fourOutgroup.bamlist''' contains the bam files for four outgroup individuals.
'''-out myFasta''' the output name
'''-doCounts 1''' counts bases accross individuals to determine the concensus allele
'''-snp_pval 0.01''' p-value threshold for defining a SNP. A lower threshold will need more evidence to call a SNPs.
''' -domaf 1''' estimate allele frequency (use to call SNPs) with that the major and minor alleles inferred from  data
'''-domajorminor 1''' infer the major and minor allele from data
'''-gl 2'''  use genotype likelihoods based on the GATK model
''' -rmSNPs 1''' remove polymorphic sites. instead of keeping sites that are polymorphic then we remove them such that all outgroups have the same allele. 
''' -minind 4''' remove site where you don't have data for all four individuals
'''-setMinDepthInd 10''' require at least 10 read for each individual
''' -explode 1''' make the fasta file for the whole genome not just the chromosomes/scaffolds were you have data
The sites that are polymorphic or do not have enough data will be labeled as 'N' in the fasta file

Latest revision as of 17:45, 6 February 2023

This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.

<classdiagram type="dir:LR">

[Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3); write iupac (-doFasta 4)]

[sequence data]->doFasta[fasta file{bg:blue}]

</classdiagram>


<classdiagram type="dir:LR">

[Multiple BAM files{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);write iupac (-doFasta 4)]

[sequence data]->doFasta[fasta file{bg:blue}]

</classdiagram>


This can be used as input for the ANGSD analysis:

  1. Error estimation
  2. ABBA-BABA

The iupac output code was kindly provided by Kristian Ullrich.

Brief Overview

./angsd -dofasta 	-> Tue Sep 26 17:02:07 2017
--------------
abcWriteFasta.cpp:
	-doFasta	0
	1: use a random (non N) base (needs -doCounts 1)
	2: use the most common (non N) base (needs -doCounts 1)
	3: use the base with highest ebd (under development) 
	4: output iupac codes (under development) 
	-basesPerLine	50	(Number of bases perline in output file)
	-explode	0	 print chromosome where we have no data (0:no,1:yes)
	-rmTrans	0	 remove transitions as different from -ref bases (0:no,1:yes)
	-ref	(null)	 reference fasta, only used with -rmTrans 1
	-seed	0	 use non random seed of value 1

This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.

Options

-doFasta 1
sample a random base at each position. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
-doFasta 2
use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
-doFasta 3
use the base with thie highest effective depth (EBD). This only works for one individual
-basesPerLine [INT]

Number of bases perline in output fasta file (default is 50)

-explode [INT]

0 (default) only output chromosomes with data. 1: write out all chromosomes

-rmTrans [INT]

0 (default) all sites are used. 1: Remove transition. Here transitions are determined using a fasta file such as a reference genome.

-ref [fileName]

a fasta file used to determine if a site is a transitions (needed when using -rmTrans 1 is used)

-seed [INT]

Use a seed in order to replicate results ( relevant when using random sample -dofasta 1 )

For filters see Filters

Output

Output is a fasta file, a normal looking fast file. Nothing special about this. For -doFasta 1, sometimes its big letters sometime small letters. This is due to the results being copied directly from the sequencing data. So small/big letters correspond to which strand for the original data. For the consensus fasta all letters are capital letters.

Example

Create a fasta file bases from a random samples of bases.

./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1


EBD

For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration. The EBD is the effective base depth, as defined by [1]:

where is a certain base, is the number of reads with base


Ancestral fasta using multiple outgroups

If you have outgroup species map to your reference genome and you want to use them to make a fasta file with ancestral alleles. You can use one or more outgroup individuals e.g. for human you could have a four outgroup bam file from a chimp, a bonobo, a gorrilla and a orangotan. Assuming you want to make a fasta file where the alleles is the same for all outgroup species then you can use a command like

./angsd -b fourOutgroup.bamlist -out myFasta -doCounts 1 -snp_pval 0.01 -domaf 1 -domajorminor 1 -gl 2 -rmSNPs 1 -minind 4 -setMinDepthInd 10 -explode 1

-b fourOutgroup.bamlist contains the bam files for four outgroup individuals.

-out myFasta the output name

-doCounts 1 counts bases accross individuals to determine the concensus allele

-snp_pval 0.01 p-value threshold for defining a SNP. A lower threshold will need more evidence to call a SNPs.

-domaf 1 estimate allele frequency (use to call SNPs) with that the major and minor alleles inferred from data

-domajorminor 1 infer the major and minor allele from data

-gl 2 use genotype likelihoods based on the GATK model

-rmSNPs 1 remove polymorphic sites. instead of keeping sites that are polymorphic then we remove them such that all outgroups have the same allele.

-minind 4 remove site where you don't have data for all four individuals

-setMinDepthInd 10 require at least 10 read for each individual

-explode 1 make the fasta file for the whole genome not just the chromosomes/scaffolds were you have data

The sites that are polymorphic or do not have enough data will be labeled as 'N' in the fasta file