ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Allele Frequencies

Latest revision as of 11:16, 8 June 2023

The allele frequency is the relative frequency of an allele at a site. It can be polarized as major/minor, reference/non-reference or ancestral/derived. The choice of allele frequency estimator is therefore closely tied to the choice of which alleles are segregating (see Inferring_Major_and_Minor_alleles).

We allow for frequency estimation from different input data:

  1. Genotype Likelihoods
  2. Genotype posterior probabilities
  3. Counts of bases

The allele frequency estimator based on genotype likelihoods is described in the suYeon publication, and the base counts method in the Li2010 publication.

The genotype likelihood based methods allow for deviations from Hardy-Weinberg equilibrium: users can supply a file containing an inbreeding coefficient for each individual (-indFname).
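To illustrate what an inbreeding coefficient changes, the sketch below computes genotype prior probabilities for a diallelic site given a minor allele frequency f and an individual inbreeding coefficient F. It uses the standard population-genetic parameterization; that ANGSD applies exactly this form internally is an assumption, and the function name genotype_prior is hypothetical.

```python
def genotype_prior(f, F=0.0):
    """Return (P(G=0), P(G=1), P(G=2)), where G counts minor alleles.

    f is the minor allele frequency, F the inbreeding coefficient.
    F = 0 gives Hardy-Weinberg proportions; F = 1 removes heterozygotes.
    This is the textbook formula, assumed here for illustration only.
    """
    if not (0.0 <= f <= 1.0 and 0.0 <= F <= 1.0):
        raise ValueError("f and F must both lie in [0, 1]")
    excess = f * (1.0 - f) * F               # homozygote excess due to inbreeding
    hom_major = (1.0 - f) ** 2 + excess
    het = 2.0 * f * (1.0 - f) * (1.0 - F)
    hom_minor = f ** 2 + excess
    return hom_major, het, hom_minor


# Example: same allele frequency, with and without inbreeding.
hwe = genotype_prior(0.2)           # Hardy-Weinberg proportions
inbred = genotype_prior(0.2, 0.5)   # fewer heterozygotes than under HWE
```

The three probabilities always sum to one, so increasing F only moves probability mass from the heterozygote to the two homozygotes.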

Brief Overview

 ./angsd -doMaf
abcFreq.cpp:
-doMaf	0 (Calculate persite frequencies '.mafs.gz')
	1: Frequency (fixed major and minor)
	2: Frequency (fixed major unknown minor)
	4: Frequency from genotype probabilities
	8: AlleleCounts based method (known major minor)
	NB. Filedumping is supressed if value is negative
-doPost	0	(Calculate posterior prob 3xgprob)
	1: Using frequency as prior
	2: Using uniform prior
	3: Using SFS as prior (still in development)
	4: Using reference panel as prior (still in development), requires a site file with chr pos major minor af ac an
Filters:
	-minMaf  	-1.000000	(Remove sites with MAF below)
	-SNP_pval	0.317311	(Remove sites with a pvalue larger)
	-rmSNPs 	0	(Remove infered SNPs instead of keeping them (pval > SNP_pval)
	-rmTriallelic	0.000000	(Remove sites with a pvalue lower)
	-forceMaf	0	(Write .mafs file when running -doAsso (by default does not output .mafs file with -doAsso))
	-skipMissing	1	(Set post to 0.33 if missing (do not use freq as prior))
Extras:
	-ref	(null)	(Filename for fasta reference)
	-anc	(null)	(Filename for fasta ancestral)
	-eps	0.001000 [Only used for -doMaf &8]
	-beagleProb	0 (Dump beagle style postprobs)
	-indFname	(null) (file containing individual inbreedcoeficients)
	-underFlowProtect	0 (file containing individual inbreedcoeficients)
NB These frequency estimators requires major/minor -doMajorMinor

Allele Frequency estimation

The major and minor alleles are first inferred from the data or given by the user (see Inferring_Major_and_Minor_alleles). User-supplied information can be both the major and minor allele, a reference genome (for the major allele), or an ancestral genome.

-doMaf [int]

1: Known major and known minor. Both the major and the minor allele are assumed to be known (inferred or given by the user). The allele frequency is then estimated from the genotype likelihoods. The estimator is from the suYeon publication, but uses the EM algorithm; it is briefly described on the SYKmaf page.

2: Known major, unknown minor. The major allele is assumed to be known (inferred or given by the user), but the minor allele is not determined. Instead we sum over the 3 possible minor alleles, weighted by their probabilities. The estimator is from the suYeon publication, but uses the EM algorithm; it is briefly described on the SYKmaf page.

4: Frequency based on genotype posterior probabilities. If genotype probabilities are used as input to ANGSD, the allele frequency is estimated directly from these by summing over the probabilities (see postFreq).

8: Frequency based on base counts. This method does not rely on genotype likelihoods or probabilities but instead infers the allele frequency directly from the base counts. The base counts method is from the Li2010 publication.

Multiple estimators can be used simultaneously by summing the above numbers. Thus -doMaf 7 (1+2+4) will use the first three estimators. If the allele frequencies are estimated from the genotype likelihoods, you also need to infer the major and minor allele (-doMajorMinor).
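Because the estimator numbers are powers of two, a summed -doMaf value can be read as a set of bit flags (the help text's "-doMaf &8" points the same way). The sketch below decodes such a value. The name decode_domaf and the description strings are hypothetical, and taking the absolute value for negative input is an assumption based on the help text saying that a negative value only suppresses file dumping.

```python
# Estimator numbers as listed for -doMaf; descriptions are paraphrased.
DOMAF_ESTIMATORS = {
    1: "known major, known minor (genotype likelihoods)",
    2: "known major, unknown minor (genotype likelihoods)",
    4: "genotype posterior probabilities",
    8: "base counts",
}


def decode_domaf(value):
    """Return the list of estimator numbers contained in a -doMaf value."""
    value = abs(value)  # assumption: the sign only controls file dumping
    return [bit for bit in sorted(DOMAF_ESTIMATORS) if value & bit]


for bit in decode_domaf(7):
    print(bit, DOMAF_ESTIMATORS[bit])
```

For -doMaf 7 this lists estimators 1, 2 and 4, matching the 1+2+4 decomposition given above.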

NB: -doMaf 4 is only supported if the posteriors are supplied as external files, since estimating genotype posteriors itself requires a MAF estimator.

Example

From genotype likelihood

This example estimates the allele frequencies both assuming known major and minor alleles (-doMaf 1) and while accounting for the uncertainty in the inferred minor allele (-doMaf 2), hence -doMaf 3. The major and minor alleles are inferred directly from the genotype likelihoods.

./angsd -out out -doMajorMinor 1 -doMaf 3 -bam bam.filelist -GL 2
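As a rough picture of what a genotype likelihood based estimate with known major and minor allele involves, here is a minimal EM sketch for a single diallelic site. It is a textbook EM under Hardy-Weinberg proportions and not ANGSD's own code; the function name em_maf, its parameters and the input layout (one triple of likelihoods per individual for 0, 1 and 2 copies of the minor allele) are all assumptions made for illustration.

```python
def em_maf(gls, f_init=0.1, tol=1e-8, max_iter=200):
    """Estimate the minor allele frequency at one site by EM.

    gls: list of (L0, L1, L2) per individual, the likelihood of the data
         given 0, 1 or 2 copies of the minor allele (not log-scaled).
    Assumes Hardy-Weinberg proportions and at least one positive
    likelihood per individual.
    """
    if not gls:
        raise ValueError("need at least one individual")
    f = f_init
    for _ in range(max_iter):
        prior = ((1.0 - f) ** 2, 2.0 * f * (1.0 - f), f ** 2)
        expected_minor = 0.0
        for gl in gls:
            # E-step: posterior genotype weights for this individual.
            w = [gl[g] * prior[g] for g in range(3)]
            total = sum(w)
            expected_minor += (w[1] + 2.0 * w[2]) / total
        # M-step: expected minor allele count over 2N chromosomes.
        f_new = expected_minor / (2.0 * len(gls))
        # Keep f strictly inside (0, 1) so every prior term stays positive.
        f_new = min(max(f_new, 1e-12), 1.0 - 1e-12)
        if abs(f_new - f) < tol:
            return f_new
        f = f_new
    return f


# Four individuals whose genotypes are known with certainty: 0, 1, 2, 0.
certain = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
estimate = em_maf(certain)
```

When the likelihoods pin down each genotype, the estimate reduces to simple allele counting; with flat likelihoods the data carry no information and the estimate stays at its starting value.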

From genotype probabilities

Example using a genotype probability file, such as the output from beagle.

./angsd -out out -doMaf 4 -beagle beagle.file.gz
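Summing over genotype probabilities can be sketched as follows: the expected number of minor alleles per individual is P(G=1) + 2 P(G=2), and the frequency is the total expected count divided by the number of chromosomes. This is an illustrative reading of "summing over the probabilities"; the function name maf_from_posteriors and the input layout are hypothetical, and the exact computation on the postFreq page may differ in detail.

```python
def maf_from_posteriors(posts):
    """Minor allele frequency from genotype posterior probabilities.

    posts: list of (P(G=0), P(G=1), P(G=2)) per individual, where G is
           the number of minor alleles and each triple sums to one.
    """
    if not posts:
        raise ValueError("need at least one individual")
    expected_minor = sum(p1 + 2.0 * p2 for _, p1, p2 in posts)
    return expected_minor / (2.0 * len(posts))


# Three individuals: one uncertain between 0 and 1 copies, two certain.
posteriors = [(0.5, 0.5, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
freq = maf_from_posteriors(posteriors)
```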


Estimator from base counts

The allele frequencies can be inferred directly from the sequencing data (Li2010). This method uses "counts" of alleles and should be invoked like this:


./angsd -out out -doMajorMinor 2 -doMaf 8 -bam bam.filelist -doCounts 1


Output data

.mafs.gz

chromo	position	major	minor	ref	knownEM	unknownEM	nInd
21      9719788 T       A       0.000001        -0.000012       3
21      9719789 G       A       0.000000        -0.000001       3
21      9719790 A       C       0.000000        -0.000004       3
21      9719791 G       A       0.000000        -0.000001       3
21      9719792 G       A       0.000000        -0.000002       3
21      9719793 G       T       0.498277        41.932766       3
21      9719794 T       A       0.000000        -0.000001       3
21      9719795 T       A       0.000000        -0.000001       3

chromo: chromosome name

position: position

major: major allele

minor: minor allele

knownEM: frequency using -doMaf 1

unknownEM: frequency using -doMaf 2

phat: frequency using -doMaf 8

nInd: number of individuals with data

pK-EM: p-value for the frequency of the (known) minor allele (-doSNPStat 1 -doMaf 1)

pu-EM: p-value for the frequency of the (unknown) minor allele (-doSNPStat 1 -doMaf 2)
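For downstream work, a .mafs.gz file can be read by column name so the code keeps working whichever optional columns are present. The sketch below assumes a gzip-compressed, tab-separated file with a header line. The sample content is made up purely so the example is self-contained, and read_mafs is a hypothetical helper, not part of ANGSD.

```python
import csv
import gzip
import io


def read_mafs(handle):
    """Yield one dict per site from a tab-separated .mafs text stream."""
    for row in csv.DictReader(handle, delimiter="\t"):
        row["position"] = int(row["position"])
        row["nInd"] = int(row["nInd"])
        yield row


# Made-up sample content (not real ANGSD output), compressed in memory.
sample = (
    "chromo\tposition\tmajor\tminor\tknownEM\tnInd\n"
    "1\t100\tG\tT\t0.250000\t10\n"
    "1\t101\tA\tC\t0.010000\t8\n"
)
compressed = gzip.compress(sample.encode("ascii"))

# With a real file, pass its path to gzip.open instead of the BytesIO.
with gzip.open(io.BytesIO(compressed), "rt", newline="") as fh:
    sites = list(read_mafs(fh))

# Keep sites whose knownEM estimate is at least 0.05.
common = [s for s in sites if float(s["knownEM"]) >= 0.05]
```

Accessing columns by header name rather than by index avoids breakage when extra columns such as ref or anc are present.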