ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Association: Difference between revisions

From angsd
Jump to navigation Jump to search
 
(126 intermediate revisions by 4 users not shown)
Line 1: Line 1:
Association can be performed using two approaches.
Association can be performed using two approaches.
# Based on testing differences in allele frequencies between cases and controls.
# Based on testing differences in allele frequencies between cases and controls, using genotype likelihoods
# Based on a generalized linear framework which also allows for quantitative traits and for including additional covariates.  
# Based on a generalized linear framework which also allows for quantitative traits and binary and for including additional covariates, using genotype posteriors.  


Both methods takes the uncertainty of the genotypes into account, by working directly with genotype likelihoods or posteriors.
__TOC__
__TOC__
We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites, and in the case of alignement data (BAM), we advise that users filter out the low mapping quality reads and the low qscore bases.
We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites, and in the case of alignement data (BAM), we advise that users filter away the low mapping quality reads and the low qscore bases.


The filtering of the alignment data is described in [[Input]], and filtering based on frequencies/polymorphic sites are described [[Filters#Allele_frequencies| here]].
The filtering of the alignment data is described in [[Input]], and filtering based on frequencies/polymorphic sites are described [[Filters#Allele_frequencies| here]].
<div class="toccolours mw-collapsible mw-collapsed">
<div class="toccolours mw-collapsible mw-collapsed">
These can be done easily at the command line by the below commands
This can be done easily at the command line by adding the below commands
<pre class="mw-collapsible-content">
<pre class="mw-collapsible-content">
-minQ 20 -minMapQ 30 -minLRT 24 #Use polymorphic sites with a p-value of 10^-6
-minQ 20 -minMapQ 30 -SNP_pval 1e-6 #Use polymorphic sites with a p-value of 10^-6
-minQ 20 -minMapQ 30 -minMaf 0.05 #Use sites with a MAF >0.05
-minQ 20 -minMapQ 30 -minMaf 0.05 #Use sites with a MAF >0.05
</pre>
</pre>
</div>
</div>
=Brief Overview=
<pre>
./angsd -doAsso
abcAsso.cpp:
        -doAsso 0
        1: Frequency Test (Known Major and Minor)
        2: Score Test
        4: Latent genotype model
        5: Score Test with latent genotype model - hybrid test
        6: Dosage regression
        7: Latent genotype model (wald test) - NOT PROPERLY TESTED YET!
  Frequency Test Options:
        -yBin          (null)  (File containing disease status)
  Score, Latent, Hybrid and Dosage Test Options:
        -yBin          (null)  (File containing disease status)
        -yCount        (null)  (File containing count phenotypes)
        -yQuant        (null)  (File containing phenotypes)
        -cov            (null)  (File containing additional covariates)
        -sampleFile            (null)  (.sample File containing phenotypes and covariates)
        -whichPhe      (null)  Select which phenotypes to analyse, write phenos comma seperated ('phe1,phe2,...'), only works with a .sample file
        -whichCov      (null)  Select which covariates to include, write covs comma seperated ('cov1,cov2,...'), only works with a .sample file
        -model  1
        1: Additive/Log-Additive (Default)
        2: Dominant
        3: Recessive
        -minHigh        10      (Require atleast minHigh number of high credible genotypes)
        -minCount      10      (Require this number of minor alleles, estimated from MAF)
        -assoThres      0.000001        Threshold for logistic regression
        -assoIter      100    Number of iterations for logistic regression
        -emThres        0.000100        Threshold for convergence of EM algorithm in doAsso 4 and 5
        -emIter 40      Number of max iterations for EM algorithm in doAsso 4 and 5
        -doPriming      1      Prime EM algorithm with dosage derived coefficients (0: no, 1: yes - default)
        -Pvalue 0      Prints a P-value instead of a likelihood ratio (0: no - default, 1: yes)
  Hybrid Test Options:
        -hybridThres            0.050000        (p-value value threshold for when to perform latent genotype model)
</pre>
=Case control association using allele frequencies=
=Case control association using allele frequencies=
To test for differences in the allele frequencies,  genotype likelihood needs to be provided or estimated. The test is an implimentation of the likelihoods ratio test for differences between cases and controls [[Kim2011]]. If alignment files are your input then a genotype likelihood model must be specified [[Genotype Likelihods -GL]].
To test for differences in the allele frequencies,  genotype likelihood needs to be provided or [[Genotype_likelihoods_from_alignments | estimated]]. The test is an implimentation of the likelihoods ratio test for differences between cases and controls described in details in [[Kim2011]].


;-doAsso [int]  
;-doAsso [int]  
1: The test is performed assuming the minor allele is known <br>
'''1''': The test is performed assuming the minor allele is known. <br>
3: The test is performed summing over all possible minor alleles
 


;-yBin [file]
;-yBin [Filename]
a file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes. The file should contain a single phenotype entry per line.
A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes. The file should contain a single phenotype entry per line.
<div class="toccolours mw-collapsible mw-collapsed">
<div class="toccolours mw-collapsible mw-collapsed">
Example of cases control phenotype file
Example of cases control phenotype file
Line 47: Line 89:


==Example==
==Example==
create a large number of individuals by recycling the example files (500 individuals) and simulate some phentypes (case/control) using R
<pre>
for i in `seq 1 50`;do cat bam.filelist>>large.filelist;done
Rscript -e "write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)"
</pre>


<pre>
<pre>
./angsd -yBin pheno.ybin -doAsso 1 -GL 1 -out out -doMajorMinor 1 -minLRT 24 -doMaf 2 -doSNP 1 -bam bam.filelist  
./angsd -yBin pheno.ybin -doAsso 1 -GL 1 -out out -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -bam large.filelist -r 1: -P 5
</pre>
</pre>
Note that because you are reading 500 bam files it takes a little while
<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c out.lrt0.gz | head
<pre class="mw-collapsible-content">
Chromosome Position Major Minor Frequency LRT
1 14000003 G A 0.057070 0.016684
1 14000013 G A 0.067886 0.029014
1 14000019 G T 0.052904 0.569061
1 14000023 C A 0.073336 0.184060
1 14000053 T C 0.038903 0.604695
1 14000170 C T 0.050756 0.481033
1 14000176 G A 0.053157 0.424910
1 14000200 C A 0.085332 0.485030
1 14000202 G A 0.257132 0.025047
</pre>
</div>
The LRT is the likelihood ratio statistics which is chi square distributed with one degree of freedom. The P-value can also be obtained instead (by using -Pvalue 1). -Pvalue is accurate upto chisq values of 70, which is equvialent to P-values of 1.1102e-16.
==Dependency Chain==
The method is based on estimating frequencies from genotype likelihoods. If alignment data has been supplied you need to specify the following.
# [[Genotype_likelihoods_from_alignments | Genotype likelihood model (-GL)]].
#[[Inferring_Major_and_Minor_alleles  |Determine Major/Minor (-doMajorMinor)]].
#[[Allele_Frequency_estimation| Maf estimator (-doMaf)]].
If you have supplied genotype likelihood files as input for angsd you can skip 1.


=Score statistic=
=Score statistic=
To perform the test in a generalized linear framework posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach is published here [[skotte2012]].
;-doAsso 2


To perform the test in a generalized linear framework posterior genotype probabilities must be provided or estimated. The approach is published here [[skotte2012|citation]]. If alignment files are your input then -doLike, -doMaf, -doPost must be invoked. If input files are genotype likelihoods then -doMaf, -doPost must be invoked. Beagle output files can be used directly.
;-yBin [Filename]
 
A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes.  
;-doAsso [int]
2: The test is based on a score statistic from a generalized linear framework
;-yBin [file]
a file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes.  
<div class="toccolours mw-collapsible mw-collapsed">
<div class="toccolours mw-collapsible mw-collapsed">
Example of cases control phenotype file
Example of cases control phenotype file
Line 81: Line 156:
</pre>
</pre>
</div>
</div>
;-yQuant [file]
;-yQuant [Filename]
a file containing the phenotype values.-999 being missing phenotypes. The file should contain a single phenotype entry per line.
File containing the phenotype values.-999 being missing phenotypes. The file should contain a single phenotype entry per line.
<div class="toccolours mw-collapsible mw-collapsed">
<div class="toccolours mw-collapsible mw-collapsed">
Example of quantitative phenotype file
Example of quantitative phenotype file
Line 108: Line 183:
</pre>
</pre>
</div>
</div>
;-cov [file]
;-yCount [Filename]
a files containing additional covariates in the analysis. Each lines should contain the additional covariates for a single individuals. Thus the number of lines should match the number of individuals and the number of coloums should match the number of additional covariates.
A file containing the count phenotype data, for doing poission based regression. -999 being missing phenotypes.
 
 
;-cov [Filename]
Files containing additional covariates in the analysis. Each lines should contain the additional covariates for a single individuals. Thus the number of lines should match the number of individuals and the number of coloums should match the number of additional covariates.


<div class="toccolours mw-collapsible mw-collapsed">
<div class="toccolours mw-collapsible mw-collapsed">
Line 138: Line 217:
;-minHigh [int]
;-minHigh [int]
default = 10 <br>
default = 10 <br>
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that does not have at least [int] number of heterozygoes and homogoes genotype with at least 0.9 probability. This filter avoids the scenario where all individuals are heterozygoes with a high probability.  
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that does not have at least [int] number of of homozygous major, heterozygous and homozygous minor genotypes. At least two of the three genotypes categories needs at least [int] individuals with a genotype probability above 0.9. This filter avoids the scenario where all individuals have genotypes with the same probability e.g. all are heterozygous with a high probability or all have 0.33333333 probability for all three genotypes.  
;-minCount [int]  
;-minCount [int]  
default = 10 <br>
default = 10 <br>
The minimum expected minor alleles in the sample. This is the frequency multiplied by to times the number of individuals. Performing association on extremely low minor allele frequencies does not make sence.
The minimum expected minor alleles in the sample. This is the frequency multiplied by two times the number of individuals. Performing association on extremely low minor allele frequencies does not make sence.
;-model [int]
;-model [int]
1 (default): additive/log-additive for linear/logistic regression
# Additive/Log-additive for Linear/Logistic Regression (Default).
# Dominant.
# Recessive.
;-fai [Filename]
A fasta index file (.fai). For human data either on hg19 or hg38 one can just use the file, test/hg19.fa.fai that is in the ANGSD repository and is therefore downloaded when cloning ANGSD from its github. Otherwise the .fai file can be obtained by indexing the reference genome or by using the header of a bam file.
;-sampleFile [Filename]
A .sample File containing phenotypes and covariates for doing the analysis. It is the Oxford sample information file (.sample) format described [https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/sample_file_formats.html here].
;-whichPhe [phe1,phe2,...]
Use this option to select which phenotypes to analyse, write phenos comma seperated ('phe1,phe2,...'), only works with a .sample file.
; -whichCov [cov1,cov2,...]
Use this option to select which covariates to include, write covs comma seperated ('cov1,cov2,...'), only works with a .sample file.
 
 
==Example==
create a large number of individuals by recycling the example files (500 individuals) and simulate some phentypes (case/control) using R
 
<pre>
rm large.filelist
for i in `seq 1 50`;do cat bam.filelist>>large.filelist;done
Rscript -e "write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)"
Rscript -e "write.table(cbind(rnorm(500)),'pheno.yquant',row=F,col=F)"
Rscript -e "set.seed(1);write.table(cbind(rbinom(500,1,0.5),rnorm(500)),'cov.file',row=F,col=F)"
</pre>
 
 
For cases control data for polymorphic sites (p-value < 1e-6)
<pre>
./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1:
</pre>
 
 
For quantitative traits (normal distributed errors)  for polymorphic sites (p-value < 1e-6) and additional covariates
<pre>
./angsd -yQuant pheno.yquant -doAsso 2 -cov cov.file -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1  -bam large.filelist -P 5  -r 1:
</pre>
 
 
 
==Example with imputation (using BEAGLE)==
 
First the polymorphic sites to be analysed needs to be selected (-doMaf 1 -SNP_pval -doMajorMinor) and the genotype likelihoods estimated (-GL 1) for use in [http://faculty.washington.edu/browning/beagle/beagle.html the Beagle software] (-doGlf 2).
 
<pre>
./angsd -GL 1 -out input -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1  -bam large.filelist -P 5  -r 1: -doGlf 2
</pre>
 
Perform the imputation
 
<pre>
java -Xmx15000m -jar beagle.jar like=input.beagle.gz out=beagleOut
</pre>
 
the reference fai can be obtained by indexing the reference genome or by using a bam files header
<pre>
samtools view -H  bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | grep SN |cut -f2,3 | sed 's/SN\://g' |  sed 's/LN\://g' > ref.fai
</pre>
 
The association can then be performed on the genotype probabilities using the score statistics
<pre>
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 2
</pre>
 
==Dependency Chain==
The method is based on genotype probabilities. If alignment data has been supplied you need to specify the following.
 
# [[Genotype_likelihoods_from_alignments | Genotype likelihood model (-GL)]].
#[[Inferring_Major_and_Minor_alleles  |Determine Major/Minor (-doMajorMinor)]].
#[[Allele_Frequency_estimation| Maf estimator (-doMaf)]].
#[[Genotype_calling| Calculate posterior genotype probability (-doPost)]]. If you use the score statistics -doAsso 2 then calculate the posterior using the allele frequency as prior (-doPost 1).
 
If you have supplied genotype likelihoods for angsd, then you should skip 1.<br>
 
If you have supplied genotype probabilities (as beagle output format), there are no dependencies.
 
=Latent genotype model (EM algorithm)=
To perform the test in a generalized linear framework posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach is employing an EM algorithm where the genotype is the introduced as a latent variable and then the likelihood is maximised using weighted least squares regression, similar to the approach in asaMap.
;-doAsso 4
 
Otherwise works exactly like the Score Test, the only thing that has to be changed is the -doAsso flag.
This method has an advantage in that effect sizes are estimated and reported.
 
 
==Example with genotype probabilities==
 
It can be run thus, with a binary phenotype (can also be used for a quantitative phenotype with the -yQuant flag):
<pre>
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 4
</pre>


2: dominant


3: recessive


===Example===
=Hybrid model (Score Test + EM algorithm)=
For cases control data for polymorphic sites (LRT>24)
To perform the test in a generalized linear framework posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach is employing the score test first, and then if the chi-square test statistic is below a certain threshold, also apply the latent genotype model, thereby getting the effect size. The idea behind this, is that the score test is faster, as we do need to apply the EM algorithm, however using the EM algorithm gives us an effect size.
;-doAsso 5
;-hybridThres 0.05        (p-value threshold for when to perform latent genotype)
 
Otherwise works exactly like the score test + latent genotype model, the only thing that has to be changed is the -doAsso flag.
This method has an advantage in that effect sizes are estimated and reported.
 
==Example with genotype probabilities==
 
It can be run thus, with a binary phenotype (can also be used for a quantitative phenotype with the -yQuant flag):
<pre>
<pre>
./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -out out -doMajorMinor 1 -minLRT 24 -doMaf 2 -doSNP 1 -bam bam.filelist
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 5
</pre>
</pre>


=Dosage model=
To perform the test in a generalized linear framework posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach is calculating the dosage or the expected genotype from the genotype probabilities, using the following formula:
E[G|X] = p(G=1|X) + 2*p(G=2|X)
And then doing a normal linear/logistic model with the dosages as the tested variable. This approach is almost as fast as the score test and effect sizes are also estimated.
;-doAsso 6
Otherwise works exactly like the score test + latent genotype model, the only thing that has to be changed is the -doAsso flag.
This method has an advantage in that effect sizes are estimated and reported.
=Input File Formats=
All -doAsso methods can now be run with genotype probabilities stored in either a BEAGLE file [https://faculty.washington.edu/browning/beagle/beagle_3.3.2_31Oct11.pdf], a BGEN file [https://www.well.ox.ac.uk/~gav/bgen_format/] or a BCF/VCF file [https://samtools.github.io/hts-specs/VCFv4.2.pdf].
==BEAGLE files==
The BEAGLE files can be run with a binary, a count or a quantive phenotype and also with covariates. An example of a command with a binary phenotype:
<pre>
./angsd -doMaf 4 -beagle test/assotest/test.beagle -yBin test/assotest/testBin.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai
</pre>
And with a count phenotype (using Poisson regression):
<pre>
./angsd -doMaf 4 -beagle test/assotest/test.beagle -yCount test/assotest/testCount.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai
</pre>
And with a quantitative phenotype (using normal linear regression):
<pre>
./angsd -doMaf 4 -beagle test/assotest/test.beagle -yQuant test/assotest/testQuant.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai
</pre>
==BGEN files==
The BGENfiles can be run with a binary, a count or a quantitative phenotype and also with covariates. Both ZLIB and ZSTD compression is supported, however one needs to add a FLAG when compiling with ZSTD (also one has to have the ZSTD library installed).
<pre>
make HTSSRC=../htslib/ WITH_ZSTD=1
</pre>
Also it is made according to v1.3 of the BGEN file format and only the recommended layout 2 is supported. It can also be run with a .sample file [https://www.well.ox.ac.uk/~gav/qctool/documentation/sample_file_formats.html] with both the phenotype and covariates. An example of a command with a binary phenotype:
<pre>
./angsd -doMaf 4 -bgen test/assotest/test.bgen -sampleFile test/assotest/test.sample -doAsso 4 -out test.res -fai test/hg19.fa.fai
</pre>
The column type line in the .sample will decide if a binary or normal regression model will be run, depending on the type of the phenotype (binary or quantitative).


For quantitative traits (normal distributed errors) for polymorphic sites (LRT>24) and additional covariates
==BCF/VCF files==
The BCF/VCF files can be run with genotype probabilities, specified with the "GP" genotype field, as specified in the VCF manual. It can be run with a binary, a count or a quantitative phenotype and also with covariates. This file format does not need a .fai file. An example of a command with a binary phenotype:
<pre>
./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yBin test/assotest/testBin.phe -doAsso 4 -cov test.cov -out test.res
</pre>
And with a count phenotype (using Poisson regression):
<pre>
./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yCount test/assotest/testCount.phe -doAsso 4 -cov test.cov -out test.res
</pre>
And with a quantitative phenotype (using normal linear regression):
<pre>
<pre>
./angsd -yQuant pheno.y -doAsso 2 -cov cov.file -GL 1 -doPost 1 -out out -doMajorMinor 1 -minLRT 24 -doMaf 2 -doSNP 1 -bam bam.filelist
./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yQuant test/assotest/testQuant.phe -doAsso 4 -cov test.cov -out test.res
</pre>
</pre>


=output=
=Output=
==Score statistic (prefix lrt*)==
==Output format==
{| border="1"
The output from the association analysis is a list of files called '''prefix.lrt'''. These are tab separated plain text files, with nine columns.
|Chromosome || Position || major || minor || Frequency || N || LRT || highHe ||highHo
{| class="wikitable"
|-
! scope="col"| Chromosome
! scope="col"| Position
! scope="col"| Major
! scope="col"| Minor
! scope="col"| Frequency
! scope="col"| N*
! scope="col"| LRT (or P)
! scope="col"| beta^
! scope="col"| SE^
! scope="col"| highHe*
! scope="col"| highHo*
! scope="col"| emIter~
|}
'''*''' Indicates that these columns are only used for the score test, latent genotype model, hybrid model and dosage model.
'''^''' Indicates that these columns are only used for the latent genotype model, hybrid model and dosage model.
'''~''' Indicates that these columns are only used for the latent genotype model and hybrid model.
{| class="wikitable"
|-
! scope="col"| Field
! scope="col"| Description
|-
! scope="row"| Chromosome
| Chromosome.
|-
! scope="row"| Position
| Physical Position.
|-
! scope="row"| Major
| The Major allele as determined by [[MajorMinor |-doMajorMinor]]. If posterior genotype files has been supplied as input, this column is not defined.
|-
! scope="row"| Minor
| The Minor allele as determined by [[MajorMinor |-doMajorMinor]]. If posterior genotype files has been supplied as input, this column is not defined.
|-
! scope="row"| Frequency
| The Minor allele frequency as determined by [[Maf|-doMaf]].
|-
! scope="row"| N*
| Number of individuals. That is the number of samples that have both sequencing data and phenotypic data.
|-
! scope="row"| LRT (or P)
| The likelihood ratio statistic. This statistic is chi square distributed with one degree of freedom. Sites that fails one of the filters are given the value -999.000000. The P-value can also be obtained instead (by using -Pvalue 1), this column will have "P" as its column name the.
|-
! scope="row"| beta
| The estimated effect size. Sites that fails one of the filters are given the value nan.
|-
! scope="row"| SE
| The estimated standard error. Sites that fails one of the filters are given the value nan.
|-
! scope="row"| high_WT/HE/HO*
| Number of individuals with a WE/HE/HO genotype posterior probability above 0.9. WT=major/major,HE=major/minor,HO=minor/minor.
|-
! scope="row"| emIter~
| Number of iterations of EM algorithm for maximising likelihood.
|}
|}
*Chromosome
The Chromosome
*Position
The physical Position
* major
Often the most common allele. If beagle input is used it might be the minor (see frequency)
* major
Often the minorallele. If beagle input is used it might be the major (see frequency)
* Frequency
The frequency estimate. The choice of estimation is determined by the *doMaf option.
*N
The number of individuals with non-missing data. That is the individuals who have both some sequencing data for the given site and have phenotype data
*LRT
The likelihood ratio statistic. This statistic is chi square distributed with one degree of freedom. Sites that fails one of the filters are given the value -999.000000
*highHe
Number of sites with a heterozygoes posterior probability above 0.9
*highHe
Number of sites with a homozygote posterior probability above 0.9




example:
 
Example without effect sizes (beta):
<pre>
<pre>
chr    position        major  minor  frequency      Nind    LRT     highHe  highHo
Chromosome Position Major Minor Frequency N LRT high_WT/HE/HO
1       13999950        G      T      0.208311        330    -999.000000    0      20
1 14000023 C A 0.052976 330 2.863582 250/10/0
1      14000023       C       A       0.032045        330     4.429480        20      310
1 14000072 G T 0.020555 330 1.864555 320/10/0
1       14000072       G       T       0.015167        330     0.262688        10     320
1 14000113 A G 0.019543 330 0.074985 320/10/0
1       14000113       A       G       0.015194        330     0.404170        10     320
1 14000202 G A 0.270106 330 0.181530 50/90/0
1       14000202       G       A       0.245345        330     0.562326        90     60
1 14000375 T C 0.020471 330 1.845881 320/10/0
1       14000375       T       C       0.015211        330     0.263012        10     320
1 14000851 T C 0.016849 330 0.694058 320/10/0
1       14000851       T       C       0.015177        330     0.478610        10     320
1 14000873 G A 0.305990 330 0.684507 140/60/10
1       14000873       G       A       0.303030        330     1.032727        100    230
1 14001008 T C 0.018434 330 0.031631 320/10/0
1       14001008       T       C       0.015158        330     3.831819        10     320
1 14001018 T C 0.296051 330 0.761196 110/40/10
1       14001018       T       C       0.300947        330     0.859651        80      240
</pre>
</pre>
<!--=Citations=
For '''-doAsso 1' and '''-doAsso 3'
{{:Skotte2012}}
-->
==Example with genotype probabilities==
It can be run thus, with a binary phenotype (can also be used for a quantitative phenotype with the -yQuant flag):
<pre>
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 6
</pre>
==Printing mafs files==
By default, -doAsso does not print the mafs file. To print the mafs file, use -forceMaf 1.
 
<pre>
-forceMaf 0 (Write .mafs file when running -doAsso (by default does not output .mafs file with -doAsso))
</pre>
=Problems with inflation of p-values=
You can evaluate the behavior of the tests by making a QQ plot of the LRT or P-values. There are several reasons why it might show signs of inflation
; -doPost (when using doAsso 2, 4, 5 or 6 without the use of posterior input -beagle
if you estimate the posterior genotype probability using a uniform prior (-doPost 2) then small differences in depth between sample will inflate the test statistics (see [[Skotte2012]]). Use the allele frequency as a prior (-doPost 1)
; -minCount/-minHigh
If you set this too low then it will results in inflation of the test statistics.
; -yQuant (when using -doAsso 2, 4, 5 or 6 with a quantitative trait)
If your trait is not continues or the distribution of the trait is skewed or has outliers then you will get inflation of p-values. Same rules apply as for a standard regression. Consider transforming you trait into a normal distribution
; Population structure
If you have population structure then you will have to adjust for it in the regression model (-doAssso 2, 4, 5 or 6). Consider using NGSadmix or PCAngsd and use the results as covariates. Note that the model will still have some issues because it uses the allele frequency as a prior. For the adventurous you can use PCAngsd or NGSadmix to estimate the individual allele frequencies and calculate your own genotype probabilities that take structure into account. These can then be used in angsd using the -beagle input format.
; low N
Usually a GWAS is performed on thousands of samples and we have only tested the use of the score statistics on hundreds of samples. If you have a low number of samples then try to figure out what minor allele frequency you would need in order to have some power. Also be careful with reducing -minCount/-minHigh.

Latest revision as of 14:32, 7 June 2023

Association can be performed using two approaches.

  1. Based on testing differences in allele frequencies between cases and controls, using genotype likelihoods
  2. Based on a generalized linear framework which also allows for quantitative traits and binary and for including additional covariates, using genotype posteriors.

We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites, and in the case of alignement data (BAM), we advise that users filter away the low mapping quality reads and the low qscore bases.

The filtering of the alignment data is described in Input, and filtering based on frequencies/polymorphic sites are described here.

This can be done easily at the command line by adding the below commands

-minQ 20 -minMapQ 30 -SNP_pval 1e-6 #Use polymorphic sites with a p-value of 10^-6
-minQ 20 -minMapQ 30 -minMaf 0.05 #Use sites with a MAF >0.05

Brief Overview

./angsd -doAsso
abcAsso.cpp:
        -doAsso 0
        1: Frequency Test (Known Major and Minor)
        2: Score Test
        4: Latent genotype model
        5: Score Test with latent genotype model - hybrid test
        6: Dosage regression
        7: Latent genotype model (wald test) - NOT PROPERLY TESTED YET!
  Frequency Test Options:
        -yBin           (null)  (File containing disease status)

  Score, Latent, Hybrid and Dosage Test Options:
        -yBin           (null)  (File containing disease status)
        -yCount         (null)  (File containing count phenotypes)
        -yQuant         (null)  (File containing phenotypes)
        -cov            (null)  (File containing additional covariates)
        -sampleFile             (null)  (.sample File containing phenotypes and covariates)
        -whichPhe       (null)  Select which phenotypes to analyse, write phenos comma seperated ('phe1,phe2,...'), only works with a .sample file
        -whichCov       (null)  Select which covariates to include, write covs comma seperated ('cov1,cov2,...'), only works with a .sample file
        -model  1
        1: Additive/Log-Additive (Default)
        2: Dominant
        3: Recessive

        -minHigh        10      (Require atleast minHigh number of high credible genotypes)
        -minCount       10      (Require this number of minor alleles, estimated from MAF)
        -assoThres      0.000001        Threshold for logistic regression
        -assoIter       100     Number of iterations for logistic regression
        -emThres        0.000100        Threshold for convergence of EM algorithm in doAsso 4 and 5
        -emIter 40      Number of max iterations for EM algorithm in doAsso 4 and 5

        -doPriming      1       Prime EM algorithm with dosage derived coefficients (0: no, 1: yes - default)

        -Pvalue 0       Prints a P-value instead of a likelihood ratio (0: no - default, 1: yes)

  Hybrid Test Options:
        -hybridThres            0.050000        (p-value value threshold for when to perform latent genotype model)

Case control association using allele frequencies

To test for differences in the allele frequencies, genotype likelihood needs to be provided or estimated. The test is an implimentation of the likelihoods ratio test for differences between cases and controls described in details in Kim2011.

-doAsso [int]

1: The test is performed assuming the minor allele is known.


-yBin [Filename]

A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes. The file should contain a single phenotype entry per line.

Example of cases control phenotype file

1
0
0
0
1
1
1
1
0
-999
1
0
0
0
0
1

Example

create a large number of individuals by recycling the example files (500 individuals) and simulate some phentypes (case/control) using R

for i in `seq 1 50`;do cat bam.filelist>>large.filelist;done
Rscript -e "write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)"
./angsd -yBin pheno.ybin -doAsso 1 -GL 1 -out out -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -bam large.filelist -r 1: -P 5

Note that because you are reading 500 bam files it takes a little while

gunzip -c out.lrt0.gz | head

Chromosome	Position	Major	Minor	Frequency	LRT
1	14000003	G	A	0.057070	0.016684
1	14000013	G	A	0.067886	0.029014
1	14000019	G	T	0.052904	0.569061
1	14000023	C	A	0.073336	0.184060
1	14000053	T	C	0.038903	0.604695
1	14000170	C	T	0.050756	0.481033
1	14000176	G	A	0.053157	0.424910
1	14000200	C	A	0.085332	0.485030
1	14000202	G	A	0.257132	0.025047

The LRT is the likelihood ratio statistics which is chi square distributed with one degree of freedom. The P-value can also be obtained instead (by using -Pvalue 1). -Pvalue is accurate upto chisq values of 70, which is equvialent to P-values of 1.1102e-16.

Dependency Chain

The method is based on estimating frequencies from genotype likelihoods. If alignment data has been supplied you need to specify the following.

  1. Genotype likelihood model (-GL).
  2. Determine Major/Minor (-doMajorMinor).
  3. Maf estimator (-doMaf).

If you have supplied genotype likelihood files as input for angsd you can skip 1.

Score statistic

To perform the test in a generalized linear framework posterior genotype probabilities must be provided or estimated. The approach is published here skotte2012.

-doAsso 2
-yBin [Filename]

A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes.

Example of cases control phenotype file

1
0
0
0
1
1
1
1
0
-999
1
0
0
0
0
1
-yQuant [Filename]

File containing the phenotype values.-999 being missing phenotypes. The file should contain a single phenotype entry per line.

Example of quantitative phenotype file

-999
2.06164722761138
-0.091935218675602
-0.287527686061831
-999
-999
-1.20996664036026
0.0188541092307412
-2.1122713873334
-999
-1.32920529536579
-1.10582299663753
-0.391773417823766
-0.501400984567535
-999
1.06014677976046
-1.10582299663753
-999
0.223156127557052
-0.189660869820135
-yCount [Filename]

A file containing the count phenotype data, for doing poission based regression. -999 being missing phenotypes.


-cov [Filename]

Files containing additional covariates in the analysis. Each lines should contain the additional covariates for a single individuals. Thus the number of lines should match the number of individuals and the number of coloums should match the number of additional covariates.

Example of covariate file

1 0 0 1 
1 0.1 0 0 
2 0 1 0 
2 0 1 0 
2 0.1 0 1 
1 0 0 1 
1 0.3 0 0 
2 0 0 0 
1 0 0 0 
2 0.2 0 1 
1 0 1 0 
1 0 0 0 
1 0.1 0 0 
1 0 0 0 
2 0 0 1 
2 0 0 0 
2 0 0 0 
1 0 0 1 
1 0.5 0 0 
2 0 0 0
-minHigh [int]

default = 10
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that does not have at least [int] number of of homozygous major, heterozygous and homozygous minor genotypes. At least two of the three genotypes categories needs at least [int] individuals with a genotype probability above 0.9. This filter avoids the scenario where all individuals have genotypes with the same probability e.g. all are heterozygous with a high probability or all have 0.33333333 probability for all three genotypes.

-minCount [int]

default = 10
The minimum expected minor alleles in the sample. This is the frequency multiplied by two times the number of individuals. Performing association on extremely low minor allele frequencies does not make sence.

-model [int]
  1. Additive/Log-additive for Linear/Logistic Regression (Default).
  2. Dominant.
  3. Recessive.
-fai [Filename]

A fasta index file (.fai). For human data either on hg19 or hg38 one can just use the file, test/hg19.fa.fai that is in the ANGSD repository and is therefore downloaded when cloning ANGSD from its github. Otherwise the .fai file can be obtained by indexing the reference genome or by using the header of a bam file.

-sampleFile [Filename]

A .sample File containing phenotypes and covariates for doing the analysis. It is the Oxford sample information file (.sample) format described here.

-whichPhe [phe1,phe2,...]

Use this option to select which phenotypes to analyse, write phenos comma seperated ('phe1,phe2,...'), only works with a .sample file.

-whichCov [cov1,cov2,...]

Use this option to select which covariates to include, write covs comma seperated ('cov1,cov2,...'), only works with a .sample file.


Example

create a large number of individuals by recycling the example files (500 individuals) and simulate some phentypes (case/control) using R

rm large.filelist
for i in `seq 1 50`;do cat bam.filelist>>large.filelist;done
Rscript -e "write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)"
Rscript -e "write.table(cbind(rnorm(500)),'pheno.yquant',row=F,col=F)"
Rscript -e "set.seed(1);write.table(cbind(rbinom(500,1,0.5),rnorm(500)),'cov.file',row=F,col=F)"


For cases control data for polymorphic sites (p-value < 1e-6)

./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1:


For quantitative traits (normal distributed errors) for polymorphic sites (p-value < 1e-6) and additional covariates

./angsd -yQuant pheno.yquant -doAsso 2 -cov cov.file -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1  -bam large.filelist -P 5  -r 1:


Example with imputation (using BEAGLE)

First the polymorphic sites to be analysed needs to be selected (-doMaf 1 -SNP_pval -doMajorMinor) and the genotype likelihoods estimated (-GL 1) for use in the Beagle software (-doGlf 2).

./angsd -GL 1 -out input -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1  -bam large.filelist -P 5  -r 1: -doGlf 2

Perform the imputation

java -Xmx15000m -jar beagle.jar like=input.beagle.gz out=beagleOut

the reference fai can be obtained by indexing the reference genome or by using a bam files header

samtools view -H  bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | grep SN |cut -f2,3 | sed 's/SN\://g' |  sed 's/LN\://g' > ref.fai

The association can then be performed on the genotype probabilities using the score statistics

./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 2 

Dependency Chain

The method is based on genotype probabilities. If alignment data has been supplied you need to specify the following.

  1. Genotype likelihood model (-GL).
  2. Determine Major/Minor (-doMajorMinor).
  3. Maf estimator (-doMaf).
  4. Calculate posterior genotype probability (-doPost). If you use the score statistics -doAsso 2 then calculate the posterior using the allele frequency as prior (-doPost 1).

If you have supplied genotype likelihoods for angsd, then you should skip 1.

If you have supplied genotype probabilities (as beagle output format), there are no dependencies.

Latent genotype model (EM algorithm)

To perform the test in a generalized linear framework posterior genotype probabilities must be provided or estimated. The approach is employing an EM algorithm where the genotype is the introduced as a latent variable and then the likelihood is maximised using weighted least squares regression, similar to the approach in asaMap.

-doAsso 4

Otherwise works exactly like the Score Test, the only thing that has to be changed is the -doAsso flag. This method has an advantage in that effect sizes are estimated and reported.


Example with genotype probabilities

It can be run thus, with a binary phenotype (can also be used for a quantitative phenotype with the -yQuant flag):

./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 4 


Hybrid model (Score Test + EM algorithm)

To perform the test in a generalized linear framework posterior genotype probabilities must be provided or estimated. The approach is employing the score test first, and then if the chi-square test statistic is below a certain threshold, also apply the latent genotype model, thereby getting the effect size. The idea behind this, is that the score test is faster, as we do need to apply the EM algorithm, however using the EM algorithm gives us an effect size.

-doAsso 5
-hybridThres 0.05 (p-value threshold for when to perform latent genotype)

Otherwise works exactly like the score test + latent genotype model, the only thing that has to be changed is the -doAsso flag. This method has an advantage in that effect sizes are estimated and reported.

Example with genotype probabilities

It can be run thus, with a binary phenotype (can also be used for a quantitative phenotype with the -yQuant flag):

./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 5

Dosage model

To perform the test in a generalized linear framework posterior genotype probabilities must be provided or estimated. The approach is calculating the dosage or the expected genotype from the genotype probabilities, using the following formula:

E[G|X] = p(G=1|X) + 2*p(G=2|X)

And then doing a normal linear/logistic model with the dosages as the tested variable. This approach is almost as fast as the score test and effect sizes are also estimated.

-doAsso 6

Otherwise works exactly like the score test + latent genotype model, the only thing that has to be changed is the -doAsso flag. This method has an advantage in that effect sizes are estimated and reported.

Input File Formats

All -doAsso methods can now be run with genotype probabilities stored in either a BEAGLE file [1], a BGEN file [2] or a BCF/VCF file [3].

BEAGLE files

The BEAGLE files can be run with a binary, a count or a quantive phenotype and also with covariates. An example of a command with a binary phenotype:

./angsd -doMaf 4 -beagle test/assotest/test.beagle -yBin test/assotest/testBin.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai

And with a count phenotype (using Poisson regression):

./angsd -doMaf 4 -beagle test/assotest/test.beagle -yCount test/assotest/testCount.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai

And with a quantitative phenotype (using normal linear regression):

./angsd -doMaf 4 -beagle test/assotest/test.beagle -yQuant test/assotest/testQuant.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai

BGEN files

The BGENfiles can be run with a binary, a count or a quantitative phenotype and also with covariates. Both ZLIB and ZSTD compression is supported, however one needs to add a FLAG when compiling with ZSTD (also one has to have the ZSTD library installed).

make HTSSRC=../htslib/ WITH_ZSTD=1 

Also it is made according to v1.3 of the BGEN file format and only the recommended layout 2 is supported. It can also be run with a .sample file [4] with both the phenotype and covariates. An example of a command with a binary phenotype:

./angsd -doMaf 4 -bgen test/assotest/test.bgen -sampleFile test/assotest/test.sample -doAsso 4 -out test.res -fai test/hg19.fa.fai

The column type line in the .sample will decide if a binary or normal regression model will be run, depending on the type of the phenotype (binary or quantitative).

BCF/VCF files

The BCF/VCF files can be run with genotype probabilities, specified with the "GP" genotype field, as specified in the VCF manual. It can be run with a binary, a count or a quantitative phenotype and also with covariates. This file format does not need a .fai file. An example of a command with a binary phenotype:

./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yBin test/assotest/testBin.phe -doAsso 4 -cov test.cov -out test.res

And with a count phenotype (using Poisson regression):

./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yCount test/assotest/testCount.phe -doAsso 4 -cov test.cov -out test.res

And with a quantitative phenotype (using normal linear regression):

./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yQuant test/assotest/testQuant.phe -doAsso 4 -cov test.cov -out test.res

Output

Output format

The output from the association analysis is a list of files called prefix.lrt. These are tab separated plain text files, with nine columns.

Chromosome Position Major Minor Frequency N* LRT (or P) beta^ SE^ highHe* highHo* emIter~

* Indicates that these columns are only used for the score test, latent genotype model, hybrid model and dosage model. ^ Indicates that these columns are only used for the latent genotype model, hybrid model and dosage model. ~ Indicates that these columns are only used for the latent genotype model and hybrid model.

Field Description
Chromosome Chromosome.
Position Physical Position.
Major The Major allele as determined by -doMajorMinor. If posterior genotype files has been supplied as input, this column is not defined.
Minor The Minor allele as determined by -doMajorMinor. If posterior genotype files has been supplied as input, this column is not defined.
Frequency The Minor allele frequency as determined by -doMaf.
N* Number of individuals. That is the number of samples that have both sequencing data and phenotypic data.
LRT (or P) The likelihood ratio statistic. This statistic is chi square distributed with one degree of freedom. Sites that fails one of the filters are given the value -999.000000. The P-value can also be obtained instead (by using -Pvalue 1), this column will have "P" as its column name the.
beta The estimated effect size. Sites that fails one of the filters are given the value nan.
SE The estimated standard error. Sites that fails one of the filters are given the value nan.
high_WT/HE/HO* Number of individuals with a WE/HE/HO genotype posterior probability above 0.9. WT=major/major,HE=major/minor,HO=minor/minor.
emIter~ Number of iterations of EM algorithm for maximising likelihood.


Example without effect sizes (beta):

Chromosome	Position	Major	Minor	Frequency	N	LRT	high_WT/HE/HO
1	14000023	C	A	0.052976	330	2.863582	250/10/0
1	14000072	G	T	0.020555	330	1.864555	320/10/0
1	14000113	A	G	0.019543	330	0.074985	320/10/0
1	14000202	G	A	0.270106	330	0.181530	50/90/0
1	14000375	T	C	0.020471	330	1.845881	320/10/0
1	14000851	T	C	0.016849	330	0.694058	320/10/0
1	14000873	G	A	0.305990	330	0.684507	140/60/10
1	14001008	T	C	0.018434	330	0.031631	320/10/0
1	14001018	T	C	0.296051	330	0.761196	110/40/10

Example with genotype probabilities

It can be run thus, with a binary phenotype (can also be used for a quantitative phenotype with the -yQuant flag):

./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 6 

Printing mafs files

By default, -doAsso does not print the mafs file. To print the mafs file, use -forceMaf 1.

	-forceMaf	0	(Write .mafs file when running -doAsso (by default does not output .mafs file with -doAsso))

Problems with inflation of p-values

You can evaluate the behavior of the tests by making a QQ plot of the LRT or P-values. There are several reasons why it might show signs of inflation

-doPost (when using doAsso 2, 4, 5 or 6 without the use of posterior input -beagle

if you estimate the posterior genotype probability using a uniform prior (-doPost 2) then small differences in depth between sample will inflate the test statistics (see Skotte2012). Use the allele frequency as a prior (-doPost 1)

-minCount/-minHigh

If you set this too low then it will results in inflation of the test statistics.

-yQuant (when using -doAsso 2, 4, 5 or 6 with a quantitative trait)

If your trait is not continues or the distribution of the trait is skewed or has outliers then you will get inflation of p-values. Same rules apply as for a standard regression. Consider transforming you trait into a normal distribution

Population structure

If you have population structure then you will have to adjust for it in the regression model (-doAssso 2, 4, 5 or 6). Consider using NGSadmix or PCAngsd and use the results as covariates. Note that the model will still have some issues because it uses the allele frequency as a prior. For the adventurous you can use PCAngsd or NGSadmix to estimate the individual allele frequencies and calculate your own genotype probabilities that take structure into account. These can then be used in angsd using the -beagle input format.

low N

Usually a GWAS is performed on thousands of samples and we have only tested the use of the score statistics on hundreds of samples. If you have a low number of samples then try to figure out what minor allele frequency you would need in order to have some power. Also be careful with reducing -minCount/-minHigh.