We allow the major and minor alleles to be determined from the counts of nucleotides, inferred from genotype likelihoods, or specified by the ancestral or reference allele.

NB: version 505 or higher is required for -doMajorMinor 4 and -doMajorMinor 5.

arguments

-doMajorMinor 1 (major and minor determined from GL)
-doMajorMinor 2 (major and minor determined from counts of nucs)
-doMajorMinor 3 (major and minor determined from filter list)
-doMajorMinor 4 (major is reference (minor from GL))
-doMajorMinor 5 (major is ancestral (minor from GL))

Inferring Major and Minor alleles

The inference method is chosen based on the data input.

From alignment data

-doMajorMinor 2
-doCount 1

If you input sequencing data, such as BAM files, you can infer the major and minor alleles by picking the two most frequently observed bases across individuals. This is the approach from here: citation. To use this approach, choose the two options listed above.
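The counting approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual implementation: the function name `major_minor_from_counts`, the input layout (one dict of base counts per individual), the pooling of counts across individuals, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def major_minor_from_counts(per_individual_counts):
    """Return (major, minor) for one site.

    per_individual_counts: list with one dict per individual, mapping a
    base ('A', 'C', 'G', 'T') to the number of reads showing that base.
    """
    # Pool the read counts over all individuals (assumption: simple sum).
    total = Counter({base: 0 for base in BASES})
    for counts in per_individual_counts:
        for base in BASES:
            total[base] += counts.get(base, 0)

    # Rank bases by descending pooled count; ties are broken by the
    # order A, C, G, T (an arbitrary choice made for this sketch).
    ranked = sorted(BASES, key=lambda b: (-total[b], BASES.index(b)))
    return ranked[0], ranked[1]
```

For example, three individuals with counts `{"A": 5, "G": 1}`, `{"A": 3, "G": 2}` and `{"A": 4, "C": 1}` give pooled counts A=12, G=3, C=1, so A is called major and G minor.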

From genotype likelihood data

-doMajorMinor 1

For sequencing data such as BAM files, or genotype likelihood data such as glfv3 files, the major and minor alleles can be inferred directly from the genotype likelihoods. We use a maximum likelihood approach to choose the major and minor alleles. Details of the method can be found here, and for citation use this publication: Skotte2012.

From genotype probability data

-doMajorMinor 3

Currently, only genotype probability data in beagle output format is allowed. This format already contains the major and minor alleles.

Theory

From Genotype Likelihoods

Assuming that the considered site is diallelic, we infer those two alleles using the genotype likelihoods. Let $\{M,m\}$ denote the two possible alleles at the diallelic site, then the maximum likelihood estimate of this pair is found using the likelihood function

$P(D|\{m,M\})=\prod _{i}P(D_{i}|\{m,M\})=\prod _{i}\sum _{A_{1},A_{2}\in \{m,M\}}P(D_{i}|G=A_{1}A_{2})p(G=A_{1}A_{2}|\{m,M\}),$

where $P(D_{i}|G=A_{1}A_{2})$ is the genotype likelihood. We then assume that the two alleles within an individual are independent and randomly drawn from the set $\{m,M\}$ with equal probability, ignoring the fact that the two alleles at a diallelic site are not observed equally frequently. This gives us $p(G=A_{1}A_{2}|\{m,M\})=1/4$ for all four possible combinations of $A_{1},A_{2}\in \{m,M\}$. Because this factor is the same constant for every candidate pair, it does not affect the maximization. Therefore we estimate the two possible alleles at the diallelic site by

$\operatorname{argmax}_{\{m,M\}}\prod _{i}\sum _{A_{1},A_{2}\in \{m,M\}}P(D_{i}|G=A_{1}A_{2}).$
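The maximization above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the program's code: the function name `infer_allele_pair` and the input layout (one dict per individual mapping each of the ten unordered genotypes, written like 'AG', to its genotype likelihood) are hypothetical. The sum over the four ordered combinations of $A_{1},A_{2}$ counts the heterozygote twice, and the product over individuals is computed on the log scale to avoid underflow.

```python
import math
from itertools import combinations

BASES = "ACGT"

def genotype_key(a, b):
    """Unordered genotype label, e.g. ('G', 'A') -> 'AG'."""
    return "".join(sorted((a, b)))

def infer_allele_pair(genotype_likelihoods):
    """Return the allele pair maximizing the likelihood, and its log value.

    genotype_likelihoods: list with one dict per individual, mapping each
    unordered genotype ('AA', 'AC', ..., 'TT') to P(D_i | G).
    """
    best_pair, best_loglik = None, -math.inf
    for a, b in combinations(BASES, 2):      # the 6 candidate pairs
        loglik = 0.0
        for gl in genotype_likelihoods:
            # Sum over ordered genotypes aa, ab, ba, bb.
            s = gl[a + a] + 2.0 * gl[genotype_key(a, b)] + gl[b + b]
            loglik += math.log(s) if s > 0 else -math.inf
        if loglik > best_loglik:
            best_pair, best_loglik = (a, b), loglik
    return best_pair, best_loglik
```

The returned pair is unordered at this stage; deciding which of the two alleles is the minor one is a separate step.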

To infer which of these two alleles is the minor allele, we estimate the allele frequencies and take the allele with the lower estimated frequency as the minor allele (only one iteration of the EM algorithm is needed if the starting point is frequencies of 0.5).
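A single EM update of this kind can be sketched as follows. This is a hedged illustration only: it assumes Hardy-Weinberg genotype priors for the frequency estimate, the function names `em_step` and `assign_major_minor` are hypothetical, the input layout is one dict per individual mapping unordered genotypes such as 'AG' to their likelihoods, and a frequency of exactly 0.5 is resolved arbitrarily in favour of the first allele.

```python
def genotype_key(a, b):
    """Unordered genotype label, e.g. ('G', 'A') -> 'AG'."""
    return "".join(sorted((a, b)))

def em_step(genotype_likelihoods, a, b, freq_b=0.5):
    """One EM update of the frequency of allele b, given the pair {a, b}."""
    expected_b = 0.0
    for gl in genotype_likelihoods:
        # Genotype likelihood times Hardy-Weinberg prior (assumption).
        w_aa = gl[a + a] * (1.0 - freq_b) ** 2
        w_ab = gl[genotype_key(a, b)] * 2.0 * freq_b * (1.0 - freq_b)
        w_bb = gl[b + b] * freq_b ** 2
        norm = w_aa + w_ab + w_bb            # assumed to be positive
        # Posterior expected number of b alleles carried by this individual.
        expected_b += (w_ab + 2.0 * w_bb) / norm
    return expected_b / (2.0 * len(genotype_likelihoods))

def assign_major_minor(genotype_likelihoods, a, b):
    """Return (major, minor) after one EM step started at 0.5."""
    freq_b = em_step(genotype_likelihoods, a, b, freq_b=0.5)
    return (a, b) if freq_b <= 0.5 else (b, a)
```

Calling `assign_major_minor` with the pair in either order should give the same labelling, since the update for one allele is one minus the update for the other.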
