ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Major Minor: Difference between revisions

=Theory=
==From Genotype Likelihoods==
Assuming that the site under consideration is diallelic, we infer its two alleles from the genotype likelihoods. Let <math>\{m,M\}</math> denote the two alleles at the diallelic site; the maximum likelihood estimate of this pair is found using the likelihood function
<math>
  P(D|\{m,M\}) =  \prod_i P(D_i|\{m,M\})
=\prod_i \sum_{A_1,A_2 \in \{m,M\}} P(D_i|G=A_1A_2)p(G=A_1A_2|\{m,M\}),
</math>
where the product runs over individuals <math>i</math> and <math>P(D_i|G=A_1A_2)</math> is the genotype likelihood. We then assume that the two alleles within an individual are independent and drawn at random from the set <math>\{m,M\}</math> with equal probability, ignoring the fact that the two alleles at a diallelic site are generally not equally frequent. This gives us <math>p(G=A_1A_2|\{m,M\})=1/4</math> for all four possible ordered combinations of <math>A_1,A_2 \in \{m,M\}</math>. Therefore we estimate the two alleles at the diallelic site by
<math>
\underset{\{m,M\}}{\operatorname{argmax}} \prod_i \sum_{A_1,A_2 \in \{m,M\}} P(D_i|G=A_1A_2)p(G=A_1A_2|\{m,M\}).
</math>
To decide which of these two alleles is the minor allele, we estimate the allele frequencies with the EM algorithm; a single iteration suffices if the starting point is frequencies of 0.5.
This is the approach described in [[Skotte2012]].
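
The two steps above can be sketched in Python. This is only an illustration under stated assumptions, not ANGSD's implementation: the function names, the input layout (one dict per individual mapping the ten unordered genotypes such as "AA" or "AG" to a strictly positive likelihood) and the Hardy-Weinberg genotype weights used in the EM step are all assumptions made for the example.

```python
from itertools import combinations
import math

BASES = "ACGT"


def gl(gls_i, a, b):
    """Genotype likelihood P(D_i | G = ab); genotypes are unordered, so 'GA' is looked up as 'AG'."""
    return gls_i["".join(sorted(a + b))]


def pair_log_likelihood(gls, m, M):
    """log P(D | {m, M}) with p(G = A1A2 | {m, M}) = 1/4 for each of the four ordered genotypes."""
    total = 0.0
    for gls_i in gls:
        site = sum(gl(gls_i, a, b) for a in (m, M) for b in (m, M)) / 4.0
        total += math.log(site)  # assumes likelihoods are strictly positive
    return total


def infer_major_minor(gls):
    """Return (major, minor) for one site from per-individual genotype likelihoods."""
    # Step 1: choose the allele pair that maximises the likelihood.
    a, b = max(combinations(BASES, 2),
               key=lambda pair: pair_log_likelihood(gls, *pair))

    # Step 2: one EM iteration for the frequency of allele b, starting at 0.5.
    # Hardy-Weinberg genotype weights are an assumption of this sketch.
    f = 0.5
    expected_b = 0.0
    for gls_i in gls:
        w_aa = gl(gls_i, a, a) * (1 - f) ** 2
        w_ab = gl(gls_i, a, b) * 2 * f * (1 - f)
        w_bb = gl(gls_i, b, b) * f ** 2
        # Posterior expected number of b alleles carried by this individual.
        expected_b += (w_ab + 2 * w_bb) / (w_aa + w_ab + w_bb)
    freq_b = expected_b / (2 * len(gls))

    # The allele with the lower estimated frequency is the minor allele.
    return (a, b) if freq_b < 0.5 else (b, a)
```

Because the prior is the constant 1/4, it does not affect which pair wins; it is kept in the code only to mirror the formula. Working on the log scale avoids underflow when the product runs over many individuals.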

Revision as of 15:14, 26 February 2014

The major and minor alleles can be inferred from nucleotide counts, inferred from genotype likelihoods, or set from the ancestral or reference allele. Both the major and the minor can also be forced to specific bases, which can be useful if you compare with HapMap data etc.

NB: if you force a major allele with -doMajorMinor 4 or 5 but this allele is neither the estimated major nor the estimated minor, the site will be discarded.


Brief Overview

./angsd -doMajorMinor 
Command:
/home/software/angsd/angsd0.583/angsd -doMajorMinor 
	-> angsd version: 0.580	 build(Feb 26 2014 11:19:53)
	-> Analysis helpbox/synopsis information:
-------------------
analysisMajorMinor.cpp:
	-doMajorMinor	0
	1: Infer major and minor from GL
	2: Infer major and minor from allele counts
	3: use major and minor from a file (requires -sites file.txt)
	4: Use reference allele as major (requires -ref)
	5: Use ancestral allele as major (requires -anc)

Details

From genotype likelihood data

-doMajorMinor 1

Whether the input is sequencing data such as bam files or genotype likelihood data such as glfv3, the major and minor alleles can be inferred directly from the genotype likelihoods. We use a maximum likelihood approach to choose the major and minor alleles. Details of the method can be found in the theory section of this page; for citation, use Skotte2012.


From counts of data

-doMajorMinor 2

If you input sequencing data such as bam files, you can choose to infer the major and minor alleles by picking the two most frequently observed bases across individuals. This is the approach from Li2010. To use this approach, choose -doMajorMinor 2.
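
As a rough illustration of the counting idea, the sketch below pools per-individual base counts and takes the two most frequent bases. The function name, the input layout (one dict of base counts per individual) and the alphabetical tie-breaking are assumptions made for this example; the published method and ANGSD's own code may differ in their details.

```python
from collections import Counter


def major_minor_from_counts(per_individual_counts):
    """Return (major, minor): the two most frequently observed bases pooled across individuals."""
    total = Counter()
    for counts in per_individual_counts:
        total.update(counts)  # adds this individual's base counts to the pool
    # Sort by decreasing pooled count; ties are broken alphabetically (an arbitrary choice).
    ranked = sorted("ACGT", key=lambda base: (-total[base], base))
    return ranked[0], ranked[1]
```

Bases that were never observed simply count as zero, so a site with a single observed base still gets a (nominal) minor allele in this sketch.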


Forcing Major/minor

You can force the major allele to be the reference or ancestral state if you have defined those with -ref/-anc. We first estimate the major/minor from the data using -doMajorMinor 1/-doMajorMinor 2. If the allele we are trying to force as major was estimated as the minor, the two are swapped. If the forced allele is neither the estimated major nor the estimated minor, the site will be discarded from downstream analysis.
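
The swap-or-discard rule described above can be written as a small function. This is a sketch of the logic only; the function name and the use of None to signal a discarded site are hypothetical choices for the example.

```python
def force_major(estimated_major, estimated_minor, forced):
    """Apply a forced major allele (e.g. the reference or ancestral base) to an estimated pair.

    Returns (major, minor), or None if the site should be discarded.
    """
    if forced == estimated_major:
        # The data already agree with the forced allele.
        return estimated_major, estimated_minor
    if forced == estimated_minor:
        # The forced allele was estimated as minor: swap the two.
        return estimated_minor, estimated_major
    # The forced allele is neither of the two estimated alleles: drop the site.
    return None
```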