ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Association
Association can be performed using two approaches.
- Based on testing differences in allele frequencies between cases and controls, using genotype likelihoods
- Based on a generalized linear framework which also allows for quantitative traits and for including additional covariates, using genotype posteriors.
We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites, and in the case of alignement data (BAM), we advise that users filter away the low mapping quality reads and the low qscore bases.
The filtering of the alignment data is described in Input, and filtering based on frequencies/polymorphic sites are described here.
This can be done easily at the command line by adding the below commands
-minQ 20 -minMapQ 30 -SNP_pval 1e-6 #Use polymorphic sites with a p-value of 10^-6 -minQ 20 -minMapQ 30 -minMaf 0.05 #Use sites with a MAF >0.05
Brief Overview
./angsd -doAsso analysisAsso.cpp: -doAsso 0 1: Frequency Test (Known Major and Minor) 2: Score Test 3: Frequency Test (Unknown Minor) Frequency Test Options: -yBin (null) (File containing disease status) Score Test Options: -yBin (null) (File containing disease status) -yQuant (null) (File containing phenotypes) -minHigh 10 (Require atleast minHigh number of high credible genotypes) -minCount 10 (Require this number of minor alleles, estimated from MAF) -cov (null) (File containing additional covariates) -model 1 1: Additive/Log-Additive (Default) 2: Dominant 3: Recessive
Case control association using allele frequencies
To test for differences in the allele frequencies, genotype likelihood needs to be provided or estimated. The test is an implimentation of the likelihoods ratio test for differences between cases and controls described in details in Kim2011.
- -doAsso [int]
1: The test is performed assuming the minor allele is known.
3: The test is performed summing over all possible minor alleles.
- -yBin [Filename]
A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes. The file should contain a single phenotype entry per line.
Example of cases control phenotype file
1 0 0 0 1 1 1 1 0 -999 1 0 0 0 0 1
Example
create a large number of individuals by recycling the example files (500 individuals) and simulate some phentypes (case/control) using R
for i in `seq 1 50`;do cat bam.filelist>>large.filelist;done Rscript -e "write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)"
./angsd -yBin pheno.ybin -doAsso 1 -GL 1 -out out -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -bam large.filelist -r 1: -P 5
Note that because you are reading 500 bam files it takes a little while
gunzip -c out.lrt0.gz | head
Chromosome Position Major Minor Frequency LRT 1 14000003 G A 0.057070 0.016684 1 14000013 G A 0.067886 0.029014 1 14000019 G T 0.052904 0.569061 1 14000023 C A 0.073336 0.184060 1 14000053 T C 0.038903 0.604695 1 14000170 C T 0.050756 0.481033 1 14000176 G A 0.053157 0.424910 1 14000200 C A 0.085332 0.485030 1 14000202 G A 0.257132 0.025047
The LRT is the likelihood ration statistics which is chi square distributed with one degree of freedom.
Dependency Chain
The method is based on estimating frequencies from genotype likelihoods. If alignment data has been supplied you need to specify the following.
If you have supplied genotype likelihood files as input for angsd you can skip 1.
Score statistic
To perform the test in a generalized linear framework posterior genotype probabilities must be provided or estimated. The approach is published here skotte2012.
- -doAsso 2
- -yBin [Filename]
A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes.
Example of cases control phenotype file
1 0 0 0 1 1 1 1 0 -999 1 0 0 0 0 1
- -yQuant [Filename]
File containing the phenotype values.-999 being missing phenotypes. The file should contain a single phenotype entry per line.
Example of quantitative phenotype file
-999 2.06164722761138 -0.091935218675602 -0.287527686061831 -999 -999 -1.20996664036026 0.0188541092307412 -2.1122713873334 -999 -1.32920529536579 -1.10582299663753 -0.391773417823766 -0.501400984567535 -999 1.06014677976046 -1.10582299663753 -999 0.223156127557052 -0.189660869820135
- -cov [Filename]
Files containing additional covariates in the analysis. Each lines should contain the additional covariates for a single individuals. Thus the number of lines should match the number of individuals and the number of coloums should match the number of additional covariates.
Example of covariate file
1 0 0 1 1 0.1 0 0 2 0 1 0 2 0 1 0 2 0.1 0 1 1 0 0 1 1 0.3 0 0 2 0 0 0 1 0 0 0 2 0.2 0 1 1 0 1 0 1 0 0 0 1 0.1 0 0 1 0 0 0 2 0 0 1 2 0 0 0 2 0 0 0 1 0 0 1 1 0.5 0 0 2 0 0 0
- -minHigh [int]
default = 10
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that does not have at least [int] number of heterozygoes and homogoes genotype with at least 0.9 probability. This filter avoids the scenario where all individuals are heterozygoes with a high probability.
- -minCount [int]
default = 10
The minimum expected minor alleles in the sample. This is the frequency multiplied by two times the number of individuals. Performing association on extremely low minor allele frequencies does not make sence.
- -model [int]
- Additive/Log-additive for Linear/Logistic Regression (Default).
- Dominant.
- Recessive.
Example
create a large number of individuals by recycling the example files (500 individuals) and simulate some phentypes (case/control) using R
rm large.filelist for i in `seq 1 50`;do cat bam.filelist>>large.filelist;done Rscript -e "write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)" Rscript -e "write.table(cbind(rnorm(500)),'pheno.yquant',row=F,col=F)" Rscript -e "set.seed(1);write.table(cbind(rbinom(500,1,0.5),rnorm(500)),'cov.file',row=F,col=F)"
For cases control data for polymorphic sites (p-value < 1e-6)
./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1:
For quantitative traits (normal distributed errors) for polymorphic sites (p-value < 1e-6) and additional covariates
./angsd -yQuant pheno.yquant -doAsso 2 -cov cov.file -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1:
Example with imputation (using BEAGLE)
First the polymorphic sites to be analysed needs to be selected (-doMaf 1 -SNP_pval -doMajorMinor) and the genotype likelihoods estimated (-GL 1) for use in the Beagle software (-doGlf 2).
./angsd -GL 1 -out input -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1: -doGlf 2
Perform the imputation
java -Xmx15000m -jar beagle.jar like=input.beagle.gz out=beagleOut
the reference fai can be obtained by indexing the reference genome or by using a bam files header
samtools view -H bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | grep SN |cut -f2,3 | sed 's/SN\://g' | sed 's/LN\://g' > ref.fai
The association can then be performed on the genotype probabilities using the score statistics
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai
Dependency Chain
The method is based on genotype probabilities. If alignment data has been supplied you need to specify the following.
- Genotype likelihood model (-GL).
- Determine Major/Minor (-doMajorMinor).
- Maf estimator (-doMaf).
- Calculate posterior (-doPost).
If you have supplied genotype likelihoods for angsd, then you should skip 1.
If you have supplied genotype probabilities (as beagle output format), there are no dependencies.
Output
The output from the association analysis is a list of files called prefix.lrt. These are tab separated plain text files, with nine columns.
Chromosome | Position | Major | Minor | Frequency | N* | LRT | highHe* | highHo* |
---|
* Indicates that these columns are only used for the score test.
Field | Description |
---|---|
Chromosome | Chromosome. |
Position | Physical Position. |
Major | The Major allele as determined by -doMajorMinor. If posterior genotype files has been supplied as input, this column is not defined. |
Minor | The Minor allele as determined by -doMajorMinor. If posterior genotype files has been supplied as input, this column is not defined. |
Frequency | The Minor allele frequency as determined by -doMaf. |
N* | Number of individuals. That is the number of samples that have both sequencing data and phenotypic data. |
LRT | The likelihood ratio statistic. This statistic is chi square distributed with one degree of freedom. Sites that fails one of the filters are given the value -999.000000. |
highHe* | Number of sites with a heterozygoes posterior probability above 0.9. |
highHo* | Number of sites with a homozygote posterior probability above 0.9. |
Example:
chr position major minor frequency Nind LRT highHe highHo 1 13999950 G T 0.208311 330 -999.000000 0 20 1 14000023 C A 0.032045 330 4.429480 20 310 1 14000072 G T 0.015167 330 0.262688 10 320 1 14000113 A G 0.015194 330 0.404170 10 320 1 14000202 G A 0.245345 330 0.562326 90 60 1 14000375 T C 0.015211 330 0.263012 10 320 1 14000851 T C 0.015177 330 0.478610 10 320 1 14000873 G A 0.303030 330 1.032727 100 230 1 14001008 T C 0.015158 330 3.831819 10 320 1 14001018 T C 0.300947 330 0.859651 80 240