ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Association: Difference between revisions
| No edit summary | No edit summary | ||
| Line 7: | Line 7: | ||
| We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites, and in the case of alignement data (BAM), we advise that users filter out the low mapping quality reads and the low qscore bases. | We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites, and in the case of alignement data (BAM), we advise that users filter out the low mapping quality reads and the low qscore bases. | ||
| The filtering of the alignment data is described in [[Input]], and the filtering based on frequencies/polymorphic sites are described http://popgen.dk/wiki/index.php/Filters#Allele_frequencies| here] | The filtering of the alignment data is described in [[Input]], and the filtering based on frequencies/polymorphic sites are described [http://popgen.dk/wiki/index.php/Filters#Allele_frequencies| here] | ||
| <pre> | <pre> | ||
| -minQ 20 -minMapQ 30 -minMaf 0.05 | -minQ 20 -minMapQ 30 -minMaf 0.05 | ||
Revision as of 16:04, 28 June 2013
Association can be performed using two approaches.
- Based on testing differences in allele frequencies between cases and controls.
- Based on a generalized linear framework which also allows for quantitative traits and for including additional covariates.
Both methods takes the uncertainty of the genotypes into account, by working directly with genotype likelihoods or posteriors.
We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites, and in the case of alignement data (BAM), we advise that users filter out the low mapping quality reads and the low qscore bases.
The filtering of the alignment data is described in Input, and the filtering based on frequencies/polymorphic sites are described here
-minQ 20 -minMapQ 30 -minMaf 0.05
Either the sites with an allele frequency above some threshold '-minMaf 0.05' or sites that have an p-value of being variable of 1e-6 '-minLRT 24'. Futhermore it is advisable in the case of alignement data (BAM), that users filter low mapping quality reads and low qscore bases.
Case control association using allele frequencies
To test for differences in the allele frequency genotype likelihood need to be provided or estimated. The test is an implimentation of the likelihoods ratio test for differences between cases and controls Kim2011.If alignment files are your input then -GL must be invoked.
- -doAsso [int]
1: The test is performed assuming the minor allele is known 
3: The test is performed summing over all possible minor alleles
- -yBin [file]
a file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes. The file should contain a single phenotype entry per line.
Example of cases control phenotype file
1 0 0 0 1 1 1 1 0 -999 1 0 0 0 0 1
example
./angsd -yBin pheno.ybin -doAsso 1 -GL 1 -out out -doMajorMinor 1 -minLRT 24 -doMaf 2 -doSNP 1 -bam bam.filelist
Score statistic
To perform the test in a generalized linear framework posterior genotype probabilities must be provided or estimated. The approach is published here citation. If alignment files are your input then -doLike, -doMaf, -doPost must be invoked. If input files are genotype likelihoods then -doMaf, -doPost must be invoked. Beagle output files can be used directly.
- -doAsso [int]
2: The test is based on a score statistic from a generalized linear framework
- -yBin [file]
a file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes.
Example of cases control phenotype file
1 0 0 0 1 1 1 1 0 -999 1 0 0 0 0 1
- -yQuant [file]
a file containing the phenotype values.-999 being missing phenotypes. The file should contain a single phenotype entry per line.
Example of quantitative phenotype file
-999 2.06164722761138 -0.091935218675602 -0.287527686061831 -999 -999 -1.20996664036026 0.0188541092307412 -2.1122713873334 -999 -1.32920529536579 -1.10582299663753 -0.391773417823766 -0.501400984567535 -999 1.06014677976046 -1.10582299663753 -999 0.223156127557052 -0.189660869820135
- -cov [file]
a files containing additional covariates in the analysis. Each lines should contain the additional covariates for a single individuals. Thus the number of lines should match the number of individuals and the number of coloums should match the number of additional covariates.
Example of covariate file
1 0 0 1 1 0.1 0 0 2 0 1 0 2 0 1 0 2 0.1 0 1 1 0 0 1 1 0.3 0 0 2 0 0 0 1 0 0 0 2 0.2 0 1 1 0 1 0 1 0 0 0 1 0.1 0 0 1 0 0 0 2 0 0 1 2 0 0 0 2 0 0 0 1 0 0 1 1 0.5 0 0 2 0 0 0
- -minHigh [int]
default = 10 
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that does not have at least [int] number of heterozygoes and homogoes genotype with at least 0.9 probability. This filter avoids the scenario where all individuals are heterozygoes with a high probability. 
- -minCount [int]
default = 10 
The minimum expected minor alleles in the sample. This is the frequency multiplied by to times the number of individuals. Performing association on extremely low minor allele frequencies does not make sence.
- -model [int]
1 (default): additive/log-additive for linear/logistic regression
2: dominant
3: recessive
example
For cases control data for polymorphic sites (LRT>24)
./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -out out -doMajorMinor 1 -minLRT 24 -doMaf 2 -doSNP 1 -bam bam.filelist
For quantitative traits (normal distributed errors)  for polymorphic sites (LRT>24) and additional covariates
./angsd -yQuant pheno.y -doAsso 2 -cov cov.file -GL 1 -doPost 1 -out out -doMajorMinor 1 -minLRT 24 -doMaf 2 -doSNP 1 -bam bam.filelist
output
Score statistic (prefix lrt*)
| Chromosome | Position | major | minor | Frequency | N | LRT | highHe | highHo | 
- Chromosome
The Chromosome
- Position
The physical Position
- major
Often the most common allele. If beagle input is used it might be the minor (see frequency)
- major
Often the minorallele. If beagle input is used it might be the major (see frequency)
- Frequency
The frequency estimate. The choice of estimation is determined by the *doMaf option.
- N
The number of individuals with non-missing data. That is the individuals who have both some sequencing data for the given site and have phenotype data
- LRT
The likelihood ratio statistic. This statistic is chi square distributed with one degree of freedom. Sites that fails one of the filters are given the value -999.000000
- highHe
Number of sites with a heterozygoes posterior probability above 0.9
- highHe
Number of sites with a homozygote posterior probability above 0.9
example:
chr position major minor frequency Nind LRT highHe highHo 1 13999950 G T 0.208311 330 -999.000000 0 20 1 14000023 C A 0.032045 330 4.429480 20 310 1 14000072 G T 0.015167 330 0.262688 10 320 1 14000113 A G 0.015194 330 0.404170 10 320 1 14000202 G A 0.245345 330 0.562326 90 60 1 14000375 T C 0.015211 330 0.263012 10 320 1 14000851 T C 0.015177 330 0.478610 10 320 1 14000873 G A 0.303030 330 1.032727 100 230 1 14001008 T C 0.015158 330 3.831819 10 320 1 14001018 T C 0.300947 330 0.859651 80 240