ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Allele Frequencies
Revision as of 11:15, 18 June 2012
Allele Frequency estimation
- -doMaf [int]
INT=1 BFGS known minor
INT=2 EM known minor
INT=4 BFGS unknown minor
INT=8 EM unknown minor
INT=16 frequencies from genotype probabilities
Multiple estimators can be used simultaneously by summing the numbers above. Thus -doMaf 7 (1+2+4) will use the first three estimators.
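The way a summed -doMaf value selects estimators can be illustrated with a small bit-flag decoder. This is only a sketch of the arithmetic described above, not ANGSD code; the names `DOMAF_ESTIMATORS` and `decode_domaf` are hypothetical.

```python
# Flag values copied from the -doMaf option list above.
# The dictionary and function names are illustrative, not part of ANGSD.
DOMAF_ESTIMATORS = {
    1: "BFGS known minor",
    2: "EM known minor",
    4: "BFGS unknown minor",
    8: "EM unknown minor",
    16: "frequencies from genotype probabilities",
}


def decode_domaf(value):
    """Return the estimator names selected by a summed -doMaf value."""
    if value < 0 or value > sum(DOMAF_ESTIMATORS):
        raise ValueError("value is not a sum of the listed flags")
    # Each flag is a distinct power of two, so a bitwise AND tests membership.
    return [name for flag, name in sorted(DOMAF_ESTIMATORS.items()) if value & flag]


if __name__ == "__main__":
    for name in decode_domaf(7):
        print(name)
```

Because every flag is a distinct power of two, each sum maps to exactly one set of estimators; `decode_domaf(7)` lists the entries for 1, 2 and 4.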
Allele frequencies from genotype likelihoods
The allele frequency estimators are described in citation. For testing purposes two optimization methods are available: BFGS and the EM algorithm. The EM algorithm is much faster than BFGS. The allele frequencies are estimated under the assumption that the site is diallelic. The major and minor alleles can either be inferred prior to the estimation, or the uncertainty about the minor allele can be incorporated into the model.
ML estimator with known minor
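First infer the Major and Minor allele and then use BFGS optimization to estimate the allele frequencies.
<math>
L(D|f) \propto \prod_{i=1}^N p(D_i|f) = \prod_{i=1}^N \sum_{g\in\{0,1,2\}}p(D_i|G=g)p(G=g|f)
</math>
<math>
\hat{f}=\operatorname{argmax}_{f} L(D|f)
</math>
Let <math>\{M,m\}</math> denote the two possible alleles at the diallelic site. The maximum likelihood estimate of this pair is found using the likelihood function
<math>
P(D|\{m,M\}) = \prod_i P(D_i|\{m,M\}) = \prod_i \sum_{A_1,A_2 \in \{m,M\}} P(D_i|G=A_1A_2)p(G=A_1A_2|\{m,M\})
</math>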
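As a rough illustration of the known-minor estimator, the sketch below evaluates the site likelihood as a product over individuals of genotype likelihoods weighted by genotype priors, and picks the frequency that maximizes it. Two assumptions are made that the text does not state: the prior p(G=g|f) is taken to follow Hardy-Weinberg proportions, and a plain grid search is swapped in for BFGS to keep the example dependency-free. All function names are hypothetical.

```python
import math


def hwe_prior(f):
    """p(G=g|f) for g = 0, 1, 2 minor alleles (assumption: Hardy-Weinberg)."""
    return ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)


def log_likelihood(gls, f):
    """log L(D|f): sum over individuals of log sum_g p(D_i|G=g) p(G=g|f)."""
    prior = hwe_prior(f)
    total = 0.0
    for gl in gls:
        site = sum(l * p for l, p in zip(gl, prior))
        if site <= 0.0:
            return float("-inf")  # this f is impossible given the data
        total += math.log(site)
    return total


def ml_frequency(gls, steps=1000):
    """Grid-search stand-in for BFGS: return the best f on an even grid."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda f: log_likelihood(gls, f))


if __name__ == "__main__":
    # One tuple per individual: p(D_i|G=0), p(D_i|G=1), p(D_i|G=2).
    gls = [(0.9, 0.1, 0.0), (0.2, 0.7, 0.1), (0.05, 0.25, 0.7)]
    print(ml_frequency(gls))
```

A gradient-based optimizer such as BFGS would replace the grid in practice; the grid only makes the shape of the objective easy to inspect.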
ML estimator with unknown minor
First infer the Major and Minor allele and then use the EM algorithm to estimate the allele frequencies.
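The EM step mentioned above can be sketched for a fixed major/minor pair as follows. This is a minimal illustration, not the ANGSD implementation: it assumes Hardy-Weinberg genotype priors, assumes every individual has at least one non-zero genotype likelihood, and does not model uncertainty about which allele is the minor one. The function name and its parameters are hypothetical.

```python
def em_frequency(gls, f_start=0.2, max_iter=200, tol=1e-8):
    """EM estimate of the minor allele frequency from genotype likelihoods.

    gls holds one (p(D_i|G=0), p(D_i|G=1), p(D_i|G=2)) tuple per individual.
    Assumes Hardy-Weinberg priors and 0 < f_start < 1.
    """
    f = f_start
    for _ in range(max_iter):
        prior = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
        expected_minor = 0.0
        for gl in gls:
            # E-step: posterior genotype weights for this individual.
            weights = [l * p for l, p in zip(gl, prior)]
            total = sum(weights)
            expected_minor += (weights[1] + 2 * weights[2]) / total
        # M-step: expected minor allele count divided by 2N chromosomes.
        f_new = expected_minor / (2 * len(gls))
        if abs(f_new - f) < tol:
            return f_new
        f = f_new
    return f


if __name__ == "__main__":
    gls = [(0.9, 0.1, 0.0), (0.2, 0.7, 0.1), (0.05, 0.25, 0.7)]
    print(em_frequency(gls))
```

Each iteration only needs one pass over the individuals and no gradient, which is one plausible reason an EM scheme can be cheaper per step than a general-purpose optimizer.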
Estimator from genotype probabilities
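If the genotype probabilities are known, the frequencies can be estimated by summing up the posterior probabilities <math>p(G=g|D_i)</math>, where <math>D_i</math> is the sequencing data for individual <math>i</math> and <math>g\in\{0,1,2\}</math> is the allele count of the minor allele. The frequency estimate is
<math>
\hat{f}=\frac{1}{2N}\sum_{i=1}^N \left(2p(G=2|D_i)+p(G=1|D_i)\right)
</math>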
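A short sketch of this estimator: each individual contributes its expected number of minor alleles, and the total is divided by the number of chromosomes, 2N. The function name and the input layout (one tuple of three posterior probabilities per individual) are assumptions made for illustration.

```python
def frequency_from_posteriors(posteriors):
    """Estimate the minor allele frequency from genotype posteriors.

    posteriors holds one (p(G=0|D_i), p(G=1|D_i), p(G=2|D_i)) tuple per
    individual; each tuple is expected to sum to 1.
    """
    n = len(posteriors)
    if n == 0:
        raise ValueError("at least one individual is required")
    # Expected minor allele count per individual is p(G=1) + 2 * p(G=2).
    expected_minor = sum(p1 + 2 * p2 for _p0, p1, p2 in posteriors)
    return expected_minor / (2 * n)


if __name__ == "__main__":
    posteriors = [(0.8, 0.2, 0.0), (0.1, 0.6, 0.3), (0.0, 0.1, 0.9)]
    print(frequency_from_posteriors(posteriors))
```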