<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.popgen.dk/angsd/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Isin</id>
	<title>angsd - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://www.popgen.dk/angsd/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Isin"/>
	<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php/Special:Contributions/Isin"/>
	<updated>2026-04-07T12:21:36Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.40.1</generator>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Installation&amp;diff=3189</id>
		<title>Installation</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Installation&amp;diff=3189"/>
		<updated>2023-12-08T21:25:32Z</updated>

		<summary type="html">&lt;p&gt;Isin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;There has been some confusion about the versions of ANGSD.&lt;br /&gt;
&lt;br /&gt;
* Even versions are freezes of the last odd git version.&lt;br /&gt;
&lt;br /&gt;
* Odd versions are git versions. Once there have been enough commits, we will increment the version and make a release.&lt;br /&gt;
&lt;br /&gt;
=Download and Installation=&lt;br /&gt;
To download and use ANGSD you need to download both the htslib and the angsd source folders.&lt;br /&gt;
&lt;br /&gt;
You can either download angsd0.938.tar.gz, which contains both:&lt;br /&gt;
[http://popgen.dk/software/download/angsd/angsd0.938.tar.gz]&lt;br /&gt;
&lt;br /&gt;
Or you can use GitHub for the latest versions of both htslib and angsd.&lt;br /&gt;
&lt;br /&gt;
Earlier versions are available here: http://popgen.dk/software/download/angsd/&lt;br /&gt;
And here: https://github.com/ANGSD/angsd/releases&lt;br /&gt;
&lt;br /&gt;
=Installation=&lt;br /&gt;
Download and unpack the tarball, enter the directory, and type make. Users on a Mac can use curl instead of wget.&lt;br /&gt;
&lt;br /&gt;
===Unix===&lt;br /&gt;
The software can be compiled using make.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/angsd0.940.tar.gz&lt;br /&gt;
tar xf angsd0.940.tar.gz&lt;br /&gt;
cd htslib;make;cd ..&lt;br /&gt;
cd angsd&lt;br /&gt;
make HTSSRC=../htslib&lt;br /&gt;
cd ..&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The executable is then located at '''angsd/angsd'''.&lt;br /&gt;
&lt;br /&gt;
=Install from github=&lt;br /&gt;
For CRAM support you also need to build htslib; this can be done using the following commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone --recursive https://github.com/samtools/htslib.git&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git &lt;br /&gt;
cd htslib;make;cd ../angsd ;make HTSSRC=../htslib&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Using htslib submodule=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git &lt;br /&gt;
cd angsd&lt;br /&gt;
make&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Systemwide installation of htslib=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
make HTSSRC=&amp;quot;systemwide&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Installation&amp;diff=3188</id>
		<title>Installation</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Installation&amp;diff=3188"/>
		<updated>2023-12-08T21:24:13Z</updated>

		<summary type="html">&lt;p&gt;Isin: /* Systemwide installation of htslib? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;There has been some confusion about the versions of ANGSD.&lt;br /&gt;
&lt;br /&gt;
* Even versions are freezes of the last odd git version.&lt;br /&gt;
&lt;br /&gt;
* Odd versions are git versions. Once there have been enough commits, we will increment the version and make a release.&lt;br /&gt;
&lt;br /&gt;
=Download and Installation=&lt;br /&gt;
To download and use ANGSD you need to download both the htslib and the angsd source folders.&lt;br /&gt;
&lt;br /&gt;
You can either download angsd0.938.tar.gz, which contains both:&lt;br /&gt;
[http://popgen.dk/software/download/angsd/angsd0.938.tar.gz]&lt;br /&gt;
&lt;br /&gt;
Or you can use GitHub for the latest versions of both htslib and angsd.&lt;br /&gt;
&lt;br /&gt;
Earlier versions are available here: http://popgen.dk/software/download/angsd/&lt;br /&gt;
And here: https://github.com/ANGSD/angsd/releases&lt;br /&gt;
&lt;br /&gt;
=Install=&lt;br /&gt;
Download and unpack the tarball, enter the directory, and type make. Users on a Mac can use curl instead of wget.&lt;br /&gt;
&lt;br /&gt;
===Unix===&lt;br /&gt;
The software can be compiled using make.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/angsd0.940.tar.gz&lt;br /&gt;
tar xf angsd0.940.tar.gz&lt;br /&gt;
cd htslib;make;cd ..&lt;br /&gt;
cd angsd&lt;br /&gt;
make HTSSRC=../htslib&lt;br /&gt;
cd ..&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The executable is then located at '''angsd/angsd'''.&lt;br /&gt;
&lt;br /&gt;
=Install from github=&lt;br /&gt;
For CRAM support you also need to build htslib; this can be done using the following commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone --recursive https://github.com/samtools/htslib.git&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git &lt;br /&gt;
cd htslib;make;cd ../angsd ;make HTSSRC=../htslib&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Systemwide installation of htslib=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
make HTSSRC=&amp;quot;systemwide&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Installation&amp;diff=3187</id>
		<title>Installation</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Installation&amp;diff=3187"/>
		<updated>2023-12-08T21:23:35Z</updated>

		<summary type="html">&lt;p&gt;Isin: /* Systemwide installation of htslib? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;There has been some confusion about the versions of ANGSD.&lt;br /&gt;
&lt;br /&gt;
* Even versions are freezes of the last odd git version.&lt;br /&gt;
&lt;br /&gt;
* Odd versions are git versions. Once there have been enough commits, we will increment the version and make a release.&lt;br /&gt;
&lt;br /&gt;
=Download and Installation=&lt;br /&gt;
To download and use ANGSD you need to download both the htslib and the angsd source folders.&lt;br /&gt;
&lt;br /&gt;
You can either download angsd0.938.tar.gz, which contains both:&lt;br /&gt;
[http://popgen.dk/software/download/angsd/angsd0.938.tar.gz]&lt;br /&gt;
&lt;br /&gt;
Or you can use GitHub for the latest versions of both htslib and angsd.&lt;br /&gt;
&lt;br /&gt;
Earlier versions are available here: http://popgen.dk/software/download/angsd/&lt;br /&gt;
And here: https://github.com/ANGSD/angsd/releases&lt;br /&gt;
&lt;br /&gt;
=Install=&lt;br /&gt;
Download and unpack the tarball, enter the directory, and type make. Users on a Mac can use curl instead of wget.&lt;br /&gt;
&lt;br /&gt;
===Unix===&lt;br /&gt;
The software can be compiled using make.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/angsd0.940.tar.gz&lt;br /&gt;
tar xf angsd0.940.tar.gz&lt;br /&gt;
cd htslib;make;cd ..&lt;br /&gt;
cd angsd&lt;br /&gt;
make HTSSRC=../htslib&lt;br /&gt;
cd ..&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The executable is then located at '''angsd/angsd'''.&lt;br /&gt;
&lt;br /&gt;
=Install from github=&lt;br /&gt;
For CRAM support you also need to build htslib; this can be done using the following commands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone --recursive https://github.com/samtools/htslib.git&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git &lt;br /&gt;
cd htslib;make;cd ../angsd ;make HTSSRC=../htslib&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Systemwide installation of htslib?=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
make HTSSRC=&amp;quot;systemwide&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=SFS_Estimation&amp;diff=3186</id>
		<title>SFS Estimation</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=SFS_Estimation&amp;diff=3186"/>
		<updated>2023-12-08T14:22:15Z</updated>

		<summary type="html">&lt;p&gt;Isin: Minor fix: Add missing /pre&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The latest version can now do bootstrapping. Folding should now be done in realSFS, not in the saf file generation.&lt;br /&gt;
&lt;br /&gt;
=Quick Start=&lt;br /&gt;
The process of estimating the SFS and the multidimensional SFS has improved a lot in the newer versions.&lt;br /&gt;
&lt;br /&gt;
Assuming you have a bam/cram file list in the file 'file.list' and you have your ancestral state in ancestral.fasta, then the process is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#no filtering&lt;br /&gt;
./angsd -bam file.list -gl 1 -anc ancestral.fasta -dosaf 1&lt;br /&gt;
#or a lot of filtering&lt;br /&gt;
./angsd -bam file.list -gl 1 -anc ancestral.fasta -dosaf 1 -baq 1 -C 50 -minMapQ 30 -minQ 20&lt;br /&gt;
&lt;br /&gt;
#this will generate 3 files&lt;br /&gt;
1) angsdput.saf.idx 2) angsdput.saf.pos.gz 3) angsdput.saf.gz&lt;br /&gt;
#these are binary files that are formally defined in https://github.com/ANGSD/angsd/blob/newsaf/doc/formats.pdf&lt;br /&gt;
&lt;br /&gt;
#To find the global SFS based on the run from above simply do&lt;br /&gt;
./realSFS angsdput.saf.idx&lt;br /&gt;
##or only use chromosome 22&lt;br /&gt;
./realSFS angsdput.saf.idx -r 22&lt;br /&gt;
&lt;br /&gt;
## or specific regions&lt;br /&gt;
./realSFS angsdput.saf.idx -r 22:100000-150000000&lt;br /&gt;
&lt;br /&gt;
##or limit to a fixed number of sites&lt;br /&gt;
./realSFS angsdput.saf.idx -r 17 -nSites 10000000&lt;br /&gt;
&lt;br /&gt;
##or you can find the 2dim sf by&lt;br /&gt;
./realSFS ceu.saf.idx yri.saf.idx&lt;br /&gt;
##NB the program will find the intersection internally. No need for multiple runs with the angsd main program.&lt;br /&gt;
&lt;br /&gt;
##or you can find the 3dim sf by&lt;br /&gt;
./realSFS ceu.saf.idx yri.saf.idx MEX.saf.idx&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=SFS=&lt;br /&gt;
This method estimates the site frequency spectrum; the method is described in [[Nielsen2012]]. The theory behind the model is briefly described [[realSFSmethod|here]].&lt;br /&gt;
&lt;br /&gt;
This is a two-step procedure: first generate a &amp;quot;.saf&amp;quot; file (site allele frequency likelihoods), then optimize the .saf file to estimate the site frequency spectrum (SFS).&lt;br /&gt;
&lt;br /&gt;
For the optimization we have implemented 2 different approaches, both found in the misc folder. The diagram below shows how the method goes from raw bam files to the SFS.&lt;br /&gt;
&lt;br /&gt;
You can also estimate a [[2d SFS Estimation| 2dsfs]] or even higher if you want to.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
* NB the ancestral state needs to be supplied for the full SFS, but you can use -fold 1 to estimate the folded SFS and then use the reference as ancestral.&lt;br /&gt;
* NB the output from -doSaf 2 are not sample allele frequency likelihoods but sample allele posteriors.&lt;br /&gt;
Applying realSFS to this output is therefore NOT the ML estimate of the SFS as described in the Nielsen 2012 paper,&lt;br /&gt;
but the 'Incorporating deviations from Hardy-Weinberg Equilibrium (HWE)' section of that paper.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{{#mermaid:graph LR;&lt;br /&gt;
    A[sequence data] --&amp;gt; B[genotype likelihoods&amp;lt;br/&amp;gt;- SAMtools&amp;lt;br/&amp;gt;- GATK&amp;lt;br/&amp;gt;- SOAPsnp&amp;lt;br/&amp;gt;- Kim et.al]&lt;br /&gt;
    B --&amp;gt;|doSaf| C[.saf file]&lt;br /&gt;
    C --&amp;gt;|optimize realSFS| D[.saf.ml file]&lt;br /&gt;
&lt;br /&gt;
    class A sequenceData;&lt;br /&gt;
    class B genotypeLikelihoods;&lt;br /&gt;
    class C safFile;&lt;br /&gt;
    class D safMlFile;&lt;br /&gt;
&lt;br /&gt;
    classDef sequenceData fill:#FFA500;&lt;br /&gt;
    classDef genotypeLikelihoods fill:#FFFFFF;&lt;br /&gt;
    classDef safFile fill:#009FFF;&lt;br /&gt;
    classDef safMlFile fill:#FF0000;&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -dosaf&lt;br /&gt;
	-&amp;gt; angsd version: 0.935-44-g02a07fc-dirty (htslib: 1.12-1-g9672589) build(Jul  8 2021 08:04:55)&lt;br /&gt;
	-&amp;gt; ./angsd -dosaf &lt;br /&gt;
	-&amp;gt; Analysis helpbox/synopsis information:&lt;br /&gt;
	-&amp;gt; Wed Aug 18 11:09:03 2021&lt;br /&gt;
	-&amp;gt; doMcall=0&lt;br /&gt;
--------------&lt;br /&gt;
abcSaf.cpp:&lt;br /&gt;
	-doSaf		0&lt;br /&gt;
	1: perform multisample GL estimation&lt;br /&gt;
	2: use an inbreeding version&lt;br /&gt;
	3: calculate genotype probabilities (use -doPost 3 instead)&lt;br /&gt;
	4: Assume genotype posteriors as input (still beta) &lt;br /&gt;
	-underFlowProtect	0&lt;br /&gt;
	-anc			(null) (ancestral fasta)&lt;br /&gt;
	-noTrans		0 (remove transitions)&lt;br /&gt;
	-pest			(null) (prior SFS)&lt;br /&gt;
	-isHap			0 (is haploid beta!)&lt;br /&gt;
	-doPost			0 (doPost 3,used for accesing saf based variables)&lt;br /&gt;
NB:&lt;br /&gt;
	  If -pest is supplied in addition to -doSaf then the output will then be posterior probability of the sample allelefrequency for each site&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
misc/realSFS&lt;br /&gt;
./realSFS afile.saf.idx [-start FNAME -P nThreads -tole tole -maxIter  -nSites ]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
For information and parameters concerning the realSFS subprogram go here: [[realSFS]]&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doSaf 1: Calculate the Site allele frequency likelihood based on individual genotype likelihoods assuming HWE&lt;br /&gt;
&lt;br /&gt;
;-doSaf 2: (version above 0.503) Calculate per-site posterior probabilities of the site allele frequencies based on individual genotype likelihoods while taking into account individual inbreeding coefficients (implemented by Filipe G. Vieira). You need to supply a file containing all the inbreeding coefficients using -indF. Consider whether you want the MAP estimate by using all sites, or the standardized values by conditioning on the called SNP sites. See the bottom of this page for examples.&lt;br /&gt;
&lt;br /&gt;
;-doSaf 3: Calculate the genotype posterior probabilities for all samples for all sites, using an estimate of the SFS (sample allele frequency distribution). This needs a prior distribution of the SFS (which can be obtained from -doSaf 1/realSFS).&lt;br /&gt;
&lt;br /&gt;
;-doSaf 4: Calculate the posterior probabilities of the sample allele frequency distribution for each site based on genotype probabilities. The genotype probabilities should be provided by the user using the -beagle option. Often the genotype probabilities will be obtained by haplotype imputation.&lt;br /&gt;
&lt;br /&gt;
;-underFlowProtect [INT] &lt;br /&gt;
0: (default) no underflow protection. 1: use underflow protection. For large data sets (a large number of individuals) underflow protection is needed.&lt;br /&gt;
&lt;br /&gt;
=Output file=&lt;br /&gt;
The output file from ''-doSaf'' is described in detail in angsd/doc/formats.pdf. These binary files can be printed with:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
realSFS print myfile.saf.idx&lt;br /&gt;
#or&lt;br /&gt;
realSFS print myfile.saf.idx -r chr1:10000-20000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
==Example==&lt;br /&gt;
A full example is shown below where we use the test data that can be found on the [[quick start]] page. In this example we use GATK genotype likelihoods.&lt;br /&gt;
&lt;br /&gt;
First generate the .saf file with 4 threads:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -bam bam.filelist -doSaf 1 -out small -anc  chimpHg19.fa -GL 2 -P 4&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We always recommend that you filter out bad-quality bases and reads with meaningless mapping quality, e.g. '''-minMapQ 1 -minQ 20'''. The above analysis with these filters can be written as:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -bam bam.filelist -doSaf 1 -out small -anc chimpHg19.fa -GL 2 -P 4 -minMapQ 1 -minQ 20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Obtain a maximum likelihood estimate of the SFS using the EM algorithm:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
misc/realSFS small.saf.idx -maxIter 100 -P 4 &amp;gt;small.sfs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:SfsSmall.png|thumb]]&lt;br /&gt;
&lt;br /&gt;
A plot of this is shown on the right. The jaggedness is due to the very low number of sites in this small dataset.&lt;br /&gt;
&lt;br /&gt;
=Interpretation of the output file=&lt;br /&gt;
Each row corresponds to a region of the genome (see below) and contains the expected values of the SFS.&lt;br /&gt;
==NB==&lt;br /&gt;
The .saf file contains a saf for every site, whereas the optimization requires information for a whole region of the genome. The optimization can therefore use large amounts of memory.&lt;br /&gt;
&lt;br /&gt;
=Folded spectra=&lt;br /&gt;
If you don't have the ancestral state, you can instead estimate the folded SFS. This is done by supplying -anc with the reference genome and applying -fold 1 in realSFS.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The above example would then be&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#first generate .saf file&lt;br /&gt;
./angsd -bam bam.filelist -doSaf 1 -out smallFolded -anc  chimpHg19.fa -GL 2&lt;br /&gt;
#now try the EM optimization with 4 threads&lt;br /&gt;
misc/realSFS smallFolded.saf.idx -fold 1 -maxIter 100 -P 4 &amp;gt;smallFolded.sfs&lt;br /&gt;
#in R&lt;br /&gt;
sfs&amp;lt;-scan(&amp;quot;smallFolded.sfs&amp;quot;)&lt;br /&gt;
barplot(sfs[-1])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[File:SmallFolded.png|thumb]]&lt;br /&gt;
&lt;br /&gt;
=Posterior of the per-site distributions of the sample allele frequency=&lt;br /&gt;
If you supply a prior for the SFS (which can be obtained from the -doSaf/realSFS analysis), the output of the .saf file will no longer be  site allele frequency likelihoods but instead will be the log posterior probability of the sample allele frequency for each site in logspace.&lt;br /&gt;
&lt;br /&gt;
=Format specification of binary .saf* files=&lt;br /&gt;
This can be found in the angsd/doc/formats.pdf&lt;br /&gt;
&lt;br /&gt;
* If the -fold 1 has been set, then the dimension is no longer 2*nInd+1 but nInd+1 (this is deprecated)&lt;br /&gt;
* If the -pest parameter has been supplied the output is no longer likelihoods but log posterior site allele frequencies&lt;br /&gt;
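The per-site dimension is easy to sanity-check by computing it directly. A minimal shell sketch (the value 12 for nind is an arbitrary example, not taken from this page):&lt;br /&gt;

```shell
# number of sample-allele-frequency categories per site for N diploid individuals
nind=12                      # arbitrary example value
unfolded=$((2 * nind + 1))   # full SFS: 2N+1 categories (0..2N derived alleles)
folded=$((nind + 1))         # folded SFS: N+1 categories (minor-allele counts 0..N)
echo "$unfolded $folded"     # prints "25 13"
```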
&lt;br /&gt;
=Bootstrapping=&lt;br /&gt;
We have recently added the possibility of bootstrapping the SFS, which can be very useful for obtaining confidence intervals of the estimated SFS.&lt;br /&gt;
&lt;br /&gt;
This is done by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
realSFS pop.saf.idx -bootstrap 100 -P number_of_cores&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The program will then give you 100 estimates of the SFS, based on data that has been resampled with replacement.&lt;br /&gt;
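The bootstrap replicates can then be summarized into an interval. A minimal post-processing sketch in shell (the file name boot.sfs and the tiny 3-category spectra are made up for illustration; real realSFS bootstrap output has one full SFS per line):&lt;br /&gt;

```shell
# toy stand-in for bootstrap output: one estimated SFS per line
cat > boot.sfs <<'EOF'
100 10 5
100 12 5
100 11 5
100 14 5
EOF
# range of the singleton category (column 2) across replicates
lo=$(awk '{print $2}' boot.sfs | sort -n | head -n 1)
hi=$(awk '{print $2}' boot.sfs | sort -n | tail -n 1)
echo "singleton estimate range: $lo-$hi"   # prints "singleton estimate range: 10-14"
```

With 100 real replicates one would typically take e.g. the 2.5% and 97.5% quantiles instead of the min and max.&lt;br /&gt;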
&lt;br /&gt;
=How to plot=&lt;br /&gt;
Assuming we have obtained a single global SFS (only one line in the output) from the '''realSFS''' program, located in '''small.sfs''', we can plot the results simply like:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sfs&amp;lt;-(scan(&amp;quot;small.sfs&amp;quot;)) #read in the sfs&lt;br /&gt;
barplot(sfs[-c(1,length(sfs))]) #plot variable sites &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[File:SfsSmall.png|thumb]]&lt;br /&gt;
We can make it more fancy like below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#function to normalize&lt;br /&gt;
norm &amp;lt;- function(x) x/sum(x)&lt;br /&gt;
#read data&lt;br /&gt;
sfs &amp;lt;- (scan(&amp;quot;small.sfs&amp;quot;))&lt;br /&gt;
#the variability as percentile&lt;br /&gt;
pvar&amp;lt;- (1-sfs[1]-sfs[length(sfs)])*100&lt;br /&gt;
#the variable categories of the sfs&lt;br /&gt;
sfs&amp;lt;-norm(sfs[-c(1,length(sfs))]) &lt;br /&gt;
barplot(sfs,legend=paste(&amp;quot;Variability:= &amp;quot;,round(pvar,3),&amp;quot;%&amp;quot;),xlab=&amp;quot;Chromosomes&amp;quot;,&lt;br /&gt;
names=1:length(sfs),ylab=&amp;quot;Proportions&amp;quot;,main=&amp;quot;mySFS plot&amp;quot;,col='blue')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[File:SfsSmallFine.png|thumb]]&lt;br /&gt;
&lt;br /&gt;
If your output from '''realSFS''' contains more than one line, it is because you have estimated multiple local SFSs. Then you cannot use the above commands directly but should first pick a specific row.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sfs&amp;lt;-(as.numeric(read.table(&amp;quot;multiple.sfs&amp;quot;)[1,])) #first region.&lt;br /&gt;
#do the above&lt;br /&gt;
sfs&amp;lt;-(as.numeric(read.table(&amp;quot;multiple.sfs&amp;quot;)[2,])) #second region.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Which genotype likelihood model should I choose ?=&lt;br /&gt;
It depends on the data. As shown on this example [[Glcomparison]], there was a huge difference between '''-GL 1''' and '''-GL 2''' for older 1000genomes BAM files, but little difference for newer bam files.&lt;br /&gt;
=Validation=&lt;br /&gt;
The validation is based on the pre-0.900 version.&lt;br /&gt;
==-doSaf 1==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd misc;&lt;br /&gt;
./supersim -outfiles test -npop 1 -nind 12 -pvar 0.9 -nsites 50000&lt;br /&gt;
echo testchr1 100000 &amp;gt;test.fai&lt;br /&gt;
../angsd -fai test.fai -glf test.glf.gz -nind 12 -doSaf 1 -issim 1&lt;br /&gt;
./realSFS angsdput.saf 24 2&amp;gt;/dev/null &amp;gt;res&lt;br /&gt;
cat res&lt;br /&gt;
31465.429798 4938.453115 2568.586388 1661.227445 1168.891114 975.302535 794.727537 632.691896 648.223566 546.293853 487.936192 417.178505 396.200026 409.813797 308.434836 371.699254 245.585920 322.293532 282.980046 292.584975 212.845183 196.682483 221.802128 236.221205 197.914673&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==-doSaf 2==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ngsSim=../ngsSim/ngsSim&lt;br /&gt;
angsd=./angsd&lt;br /&gt;
realSFS=./misc/realSFS&lt;br /&gt;
&lt;br /&gt;
$ngsSim  -npop 1 -nind 24 -nsites 1000000 -depth 4 -F 0.0 -outfiles testF0.0&lt;br /&gt;
$ngsSim  -npop 1 -nind 24 -nsites 1000000 -depth 4 -F 0.9 -outfiles testF0.9&lt;br /&gt;
&lt;br /&gt;
for i in `seq 24`;do echo 0.9;done &amp;gt;indF&lt;br /&gt;
echo testchr1 250000000 &amp;gt;test.fai&lt;br /&gt;
$angsd -fai test.fai -issim 1 -glf testF0.0.glf.gz -nind 24 -out noF -dosaf 1&lt;br /&gt;
$angsd -fai test.fai -issim 1 -glf testF0.9.glf.gz -nind 24 -out withF -dosaf 2 -domajorminor 1 -domaf 1 -indF indF&lt;br /&gt;
$angsd -fai test.fai -issim 1 -glf testF0.9.glf.gz -nind 24 -out withFsnp -dosaf 2 -domajorminor 1 -domaf 1 -indF indF -snp_pval 1e-4&lt;br /&gt;
&lt;br /&gt;
$realSFS noF.saf 48 &amp;gt;noF.sfs&lt;br /&gt;
$realSFS withF.saf 48 &amp;gt;withF.sfs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
#in R&lt;br /&gt;
trueNoF&amp;lt;-scan(&amp;quot;testF0.0.frq&amp;quot;)&lt;br /&gt;
trueWithF&amp;lt;-scan(&amp;quot;testF0.9.frq&amp;quot;)&lt;br /&gt;
pdf(&amp;quot;sfsFcomparison.pdf&amp;quot;,width=14)&lt;br /&gt;
par(mfrow=c(1,2))&lt;br /&gt;
barplot(trueNoF[-1],main='true sfs F=0.0')&lt;br /&gt;
barplot(trueWithF[-1],main='true sfs F=0.9')&lt;br /&gt;
&lt;br /&gt;
estWithF&amp;lt;-scan(&amp;quot;withF.sfs&amp;quot;)&lt;br /&gt;
estNoF&amp;lt;-scan(&amp;quot;noF.sfs&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
barplot(rbind(trueNoF,estNoF)[,-1],main=&amp;quot;true vs est SFS F=0 (ML) (all sites)&amp;quot;,be=T,col=1:2)&lt;br /&gt;
barplot(rbind(trueWithF,estWithF)[,-1],main='true vs est sfs=0.9 (MAP) (all sites)',be=T,col=1:2)&lt;br /&gt;
&lt;br /&gt;
readBjoint &amp;lt;- function(file=NULL,nind=10,nsites=10){&lt;br /&gt;
  ff &amp;lt;- gzfile(file,&amp;quot;rb&amp;quot;)&lt;br /&gt;
  m&amp;lt;-matrix(readBin(ff,double(),(2*nind+1)*nsites),ncol=(2*nind+1),byrow=TRUE)&lt;br /&gt;
  close(ff)&lt;br /&gt;
  return(m)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
m &amp;lt;- exp(readBjoint(&amp;quot;withF.saf&amp;quot;,nind=24,5e6))&lt;br /&gt;
barplot(rbind(trueWithF,colMeans(m))[,-1],main='true vs est sfs F=0.9 (colmean of site pp) (all sites)',be=T,col=1:2)&lt;br /&gt;
m &amp;lt;- exp(readBjoint(&amp;quot;withFsnp.saf&amp;quot;,nind=24,5e6))&lt;br /&gt;
m &amp;lt;- colMeans(m)*nrow(m)&lt;br /&gt;
##m contains SFS for absolute frequencies&lt;br /&gt;
m[1] &amp;lt;-1e6-sum(m[-1])&lt;br /&gt;
##m now contains a corrected estimate containing the zero category&lt;br /&gt;
barplot(rbind(trueWithF,norm(m))[,-1],main='true vs est sfs F=0.9 (colmean of site pp) (called snp sites)',be=T,col=1:2)&lt;br /&gt;
&lt;br /&gt;
dev.off()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
See the results from the above here: http://www.popgen.dk/angsd/sfsFcomparison.pdf&lt;br /&gt;
&lt;br /&gt;
=safv3 comparison=&lt;br /&gt;
Between 0.800 and 0.900 I decided to move to a better format than the raw saf files. The new format takes up half the storage, allows easy random access, and generalizes to up to 5-dimensional SFS. A comparison can be found here: [[safv3]]&lt;br /&gt;
=Using NGStools=&lt;br /&gt;
See [[realSFS]] for how to convert the new saf format to the old saf format if you use NGStools.&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=SFS_Estimation&amp;diff=3185</id>
		<title>SFS Estimation</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=SFS_Estimation&amp;diff=3185"/>
		<updated>2023-12-08T14:21:30Z</updated>

		<summary type="html">&lt;p&gt;Isin: Remove old yuml diagram and replace with mermaid equivalent&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The latest version can now do bootstrapping. Folding should now be done in realSFS, not in the saf file generation.&lt;br /&gt;
&lt;br /&gt;
=Quick Start=&lt;br /&gt;
The process of estimating the SFS and the multidimensional SFS has improved a lot in the newer versions.&lt;br /&gt;
&lt;br /&gt;
Assuming you have a bam/cram file list in the file 'file.list' and you have your ancestral state in ancestral.fasta, then the process is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#no filtering&lt;br /&gt;
./angsd -bam file.list -gl 1 -anc ancestral.fasta -dosaf 1&lt;br /&gt;
#or a lot of filtering&lt;br /&gt;
./angsd -bam file.list -gl 1 -anc ancestral.fasta -dosaf 1 -baq 1 -C 50 -minMapQ 30 -minQ 20&lt;br /&gt;
&lt;br /&gt;
#this will generate 3 files&lt;br /&gt;
1) angsdput.saf.idx 2) angsdput.saf.pos.gz 3) angsdput.saf.gz&lt;br /&gt;
#these are binary files that are formally defined in https://github.com/ANGSD/angsd/blob/newsaf/doc/formats.pdf&lt;br /&gt;
&lt;br /&gt;
#To find the global SFS based on the run from above simply do&lt;br /&gt;
./realSFS angsdput.saf.idx&lt;br /&gt;
##or only use chromosome 22&lt;br /&gt;
./realSFS angsdput.saf.idx -r 22&lt;br /&gt;
&lt;br /&gt;
## or specific regions&lt;br /&gt;
./realSFS angsdput.saf.idx -r 22:100000-150000000&lt;br /&gt;
&lt;br /&gt;
##or limit to a fixed number of sites&lt;br /&gt;
./realSFS angsdput.saf.idx -r 17 -nSites 10000000&lt;br /&gt;
&lt;br /&gt;
##or you can find the 2dim sf by&lt;br /&gt;
./realSFS ceu.saf.idx yri.saf.idx&lt;br /&gt;
##NB the program will find the intersection internally. No need for multiple runs with the angsd main program.&lt;br /&gt;
&lt;br /&gt;
##or you can find the 3dim sf by&lt;br /&gt;
./realSFS ceu.saf.idx yri.saf.idx MEX.saf.idx&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=SFS=&lt;br /&gt;
This method estimates the site frequency spectrum; the method is described in [[Nielsen2012]]. The theory behind the model is briefly described [[realSFSmethod|here]].&lt;br /&gt;
&lt;br /&gt;
This is a two-step procedure: first generate a &amp;quot;.saf&amp;quot; file (site allele frequency likelihoods), then optimize the .saf file to estimate the site frequency spectrum (SFS).&lt;br /&gt;
&lt;br /&gt;
For the optimization we have implemented 2 different approaches, both found in the misc folder. The diagram below shows how the method goes from raw bam files to the SFS.&lt;br /&gt;
&lt;br /&gt;
You can also estimate a [[2d SFS Estimation| 2dsfs]] or even higher if you want to.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
* NB the ancestral state needs to be supplied for the full SFS, but you can use -fold 1 to estimate the folded SFS and then use the reference as ancestral.&lt;br /&gt;
* NB the output from -doSaf 2 are not sample allele frequency likelihoods but sample allele posteriors.&lt;br /&gt;
Applying realSFS to this output is therefore NOT the ML estimate of the SFS as described in the Nielsen 2012 paper,&lt;br /&gt;
but the 'Incorporating deviations from Hardy-Weinberg Equilibrium (HWE)' section of that paper.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
{{#mermaid:graph LR;&lt;br /&gt;
    A[sequence data] --&amp;gt; B[genotype likelihoods&amp;lt;br/&amp;gt;- SAMtools&amp;lt;br/&amp;gt;- GATK&amp;lt;br/&amp;gt;- SOAPsnp&amp;lt;br/&amp;gt;- Kim et.al]&lt;br /&gt;
    B --&amp;gt;|doSaf| C[.saf file]&lt;br /&gt;
    C --&amp;gt;|optimize realSFS| D[.saf.ml file]&lt;br /&gt;
&lt;br /&gt;
    class A sequenceData;&lt;br /&gt;
    class B genotypeLikelihoods;&lt;br /&gt;
    class C safFile;&lt;br /&gt;
    class D safMlFile;&lt;br /&gt;
&lt;br /&gt;
    classDef sequenceData fill:#FFA500;&lt;br /&gt;
    classDef genotypeLikelihoods fill:#FFFFFF;&lt;br /&gt;
    classDef safFile fill:#009FFF;&lt;br /&gt;
    classDef safMlFile fill:#FF0000;&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -dosaf&lt;br /&gt;
	-&amp;gt; angsd version: 0.935-44-g02a07fc-dirty (htslib: 1.12-1-g9672589) build(Jul  8 2021 08:04:55)&lt;br /&gt;
	-&amp;gt; ./angsd -dosaf &lt;br /&gt;
	-&amp;gt; Analysis helpbox/synopsis information:&lt;br /&gt;
	-&amp;gt; Wed Aug 18 11:09:03 2021&lt;br /&gt;
	-&amp;gt; doMcall=0&lt;br /&gt;
--------------&lt;br /&gt;
abcSaf.cpp:&lt;br /&gt;
	-doSaf		0&lt;br /&gt;
	1: perform multisample GL estimation&lt;br /&gt;
	2: use an inbreeding version&lt;br /&gt;
	3: calculate genotype probabilities (use -doPost 3 instead)&lt;br /&gt;
	4: Assume genotype posteriors as input (still beta) &lt;br /&gt;
	-underFlowProtect	0&lt;br /&gt;
	-anc			(null) (ancestral fasta)&lt;br /&gt;
	-noTrans		0 (remove transitions)&lt;br /&gt;
	-pest			(null) (prior SFS)&lt;br /&gt;
	-isHap			0 (is haploid beta!)&lt;br /&gt;
	-doPost			0 (doPost 3,used for accesing saf based variables)&lt;br /&gt;
NB:&lt;br /&gt;
	  If -pest is supplied in addition to -doSaf then the output will then be posterior probability of the sample allelefrequency for each site&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
misc/realSFS&lt;br /&gt;
./realSFS afile.saf.idx [-start FNAME -P nThreads -tole tole -maxIter  -nSites ]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
For information and parameters concerning the realSFS subprogram go here: [[realSFS]]&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doSaf 1: Calculate the Site allele frequency likelihood based on individual genotype likelihoods assuming HWE&lt;br /&gt;
&lt;br /&gt;
;-doSaf 2: (version above 0.503) Calculate per-site posterior probabilities of the site allele frequencies based on individual genotype likelihoods while taking into account individual inbreeding coefficients (implemented by Filipe G. Vieira). You need to supply a file containing all the inbreeding coefficients using -indF. Consider whether you want the MAP estimate by using all sites, or the standardized values by conditioning on the called SNP sites. See the bottom of this page for examples.&lt;br /&gt;
&lt;br /&gt;
;-doSaf 3: Calculate the genotype posterior probabilities for all samples for all sites, using an estimate of the SFS (sample allele frequency distribution). This needs a prior distribution of the SFS (which can be obtained from -doSaf 1/realSFS).&lt;br /&gt;
&lt;br /&gt;
;-doSaf 4: Calculate the posterior probabilities of the sample allele frequency distribution for each site based on genotype probabilities. The genotype probabilities should be provided by the user using the -beagle option. Often the genotype probabilities will be obtained by haplotype imputation. &lt;br /&gt;
&lt;br /&gt;
;-underFlowProtect [INT] &lt;br /&gt;
0: (default) no underflow protection. 1: use underflow protection. For large data sets (large numbers of individuals) underflow protection is needed.&lt;br /&gt;
&lt;br /&gt;
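To see why this matters, note that multiplying many small per-site likelihoods can underflow double precision to zero, while summing their logs stays representable. A minimal Python sketch (illustrative only, not ANGSD code):&lt;br /&gt;

```python
import math

# Multiplying many small likelihoods underflows double precision:
probs = [1e-30] * 500
naive_product = 1.0
for p in probs:
    naive_product *= p          # collapses to exactly 0.0

# Working in log-space keeps the same quantity representable:
log_product = sum(math.log(p) for p in probs)   # about -34539
```

This is the kind of accumulation that -underFlowProtect 1 guards against.&lt;br /&gt;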
=Output file=&lt;br /&gt;
The output files from ''-doSaf'' are described in detail in angsd/doc/formats.pdf. These binary files can be printed with&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
realSFS print myfile.saf.idx&lt;br /&gt;
#or&lt;br /&gt;
realSFS print myfile.saf.idx -r chr1:10000-20000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
==Example==&lt;br /&gt;
A full example is shown below where we use the test data that can be found on the [[quick start]] page. In this example we use GATK genotype likelihoods.&lt;br /&gt;
&lt;br /&gt;
First generate the .saf file with 4 threads&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -bam bam.filelist -doSaf 1 -out small -anc  chimpHg19.fa -GL 2 -P 4&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We always recommend that you filter out bases with low quality scores and reads with low mapping quality, e.g. '''-minMapQ 1 -minQ 20'''. The above analysis with these filters can be written as:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -bam bam.filelist -doSaf 1 -out small -anc chimpHg19.fa -GL 2 -P 4 -minMapQ 1 -minQ 20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Obtain a maximum likelihood estimate of the SFS using the EM algorithm&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
misc/realSFS small.saf.idx -maxIter 100 -P 4 &amp;gt;small.sfs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:SfsSmall.png|thumb]]&lt;br /&gt;
&lt;br /&gt;
A plot of this estimate is shown on the right. The jaggedness is due to the very low number of sites in this small dataset.&lt;br /&gt;
&lt;br /&gt;
=Interpretation of the output file=&lt;br /&gt;
Each row corresponds to a region of the genome (see below) and contains the expected values of the SFS for that region.&lt;br /&gt;
==NB==&lt;br /&gt;
The .saf file contains a SAF for each site, whereas the optimization requires information for an entire region of the genome. The optimization can therefore use large amounts of memory.&lt;br /&gt;
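As a rough back-of-the-envelope sketch (an illustration, not an ANGSD measurement): each site stores 2*nInd+1 double-precision values, so the memory for a region scales as below.&lt;br /&gt;

```python
# Rough memory estimate for holding SAF values for a region in RAM:
# each site stores 2 * nInd + 1 doubles of 8 bytes each.
def saf_region_bytes(n_sites, n_ind):
    return n_sites * (2 * n_ind + 1) * 8

# e.g. 10 million sites and 50 individuals need roughly 7.5 GiB
gib = saf_region_bytes(10_000_000, 50) / 1024**3
```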
&lt;br /&gt;
=Folded spectra=&lt;br /&gt;
If you don't have the ancestral state, you can instead estimate the folded SFS. This is done by supplying -anc with the reference genome and applying -fold 1 to realSFS.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The above example would then be&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#first generate .saf file&lt;br /&gt;
./angsd -bam bam.filelist -doSaf 1 -out smallFolded -anc  chimpHg19.fa -GL 2&lt;br /&gt;
#now try the EM optimization with 4 threads&lt;br /&gt;
misc/realSFS smallFolded.saf.idx -maxIter 100 -P 4 &amp;gt;smallFolded.sfs&lt;br /&gt;
#in R&lt;br /&gt;
sfs&amp;lt;-scan(&amp;quot;smallFolded.sfs&amp;quot;)&lt;br /&gt;
barplot(sfs[-1])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[File:SmallFolded.png|thumb]]&lt;br /&gt;
&lt;br /&gt;
=Posterior of the per-site distributions of the sample allele frequency=&lt;br /&gt;
If you supply a prior for the SFS (which can be obtained from the -doSaf/realSFS analysis), the output in the .saf file will no longer be site allele frequency likelihoods but instead the log posterior probability of the sample allele frequency for each site.&lt;br /&gt;
&lt;br /&gt;
=Format specification of binary .saf* files=&lt;br /&gt;
This can be found in the angsd/doc/formats.pdf&lt;br /&gt;
&lt;br /&gt;
* If -fold 1 has been set, the dimension is no longer 2*nInd+1 but nInd+1 (this is deprecated)&lt;br /&gt;
* If the -pest parameter has been supplied, the output is no longer likelihoods but log posterior site allele frequencies&lt;br /&gt;
&lt;br /&gt;
=Bootstrapping=&lt;br /&gt;
We have recently added the possibility of bootstrapping the SFS, which can be very useful for obtaining confidence intervals of the estimated SFS.&lt;br /&gt;
&lt;br /&gt;
This is done by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
realSFS pop.saf.idx -bootstrap 100 -P number_of_cores&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The program will then give you 100 estimates of the SFS, based on data that has been resampled with replacement.&lt;br /&gt;
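With the replicates in hand, percentile confidence intervals per SFS category can be computed along these lines (a Python sketch assuming one whitespace-separated SFS per line of output; the file name in the comment is hypothetical):&lt;br /&gt;

```python
def bootstrap_ci(lines, lo=0.025, hi=0.975):
    # One whitespace-separated SFS estimate per line.
    reps = [[float(x) for x in ln.split()] for ln in lines if ln.strip()]
    n = len(reps)
    cis = []
    for j in range(len(reps[0])):        # one CI per SFS category
        col = sorted(r[j] for r in reps)
        cis.append((col[int(lo * (n - 1))], col[int(hi * (n - 1))]))
    return cis

# Hypothetical usage:
# with open("pop.bootstraps.sfs") as f:
#     cis = bootstrap_ci(f.readlines())
```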
&lt;br /&gt;
=How to plot=&lt;br /&gt;
Assuming we have obtained a single global SFS (only one line in the output) from the '''realSFS''' program, located in '''small.sfs''', we can plot the results simply like:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sfs&amp;lt;-(scan(&amp;quot;small.sfs&amp;quot;)) #read in the sfs&lt;br /&gt;
barplot(sfs[-c(1,length(sfs))]) #plot variable sites &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[File:SfsSmall.png|thumb]]&lt;br /&gt;
We can make it more fancy like below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#function to normalize&lt;br /&gt;
norm &amp;lt;- function(x) x/sum(x)&lt;br /&gt;
#read data&lt;br /&gt;
sfs &amp;lt;- (scan(&amp;quot;small.sfs&amp;quot;))&lt;br /&gt;
#the variability as percentile&lt;br /&gt;
pvar&amp;lt;- (1-sfs[1]-sfs[length(sfs)])*100&lt;br /&gt;
#the variable categories of the sfs&lt;br /&gt;
sfs&amp;lt;-norm(sfs[-c(1,length(sfs))]) &lt;br /&gt;
barplot(sfs,legend=paste(&amp;quot;Variability:= &amp;quot;,round(pvar,3),&amp;quot;%&amp;quot;),xlab=&amp;quot;Chromosomes&amp;quot;,&lt;br /&gt;
names=1:length(sfs),ylab=&amp;quot;Proportions&amp;quot;,main=&amp;quot;mySFS plot&amp;quot;,col='blue')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[File:SfsSmallFine.png|thumb]]&lt;br /&gt;
&lt;br /&gt;
If your output from '''realSFS''' contains more than one line, it is because you have estimated multiple local SFSs. In that case you can't use the above commands directly; first pick a specific row.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sfs&amp;lt;-(as.numeric(read.table(&amp;quot;multiple.sfs&amp;quot;)[1,])) #first region.&lt;br /&gt;
#do the above&lt;br /&gt;
sfs&amp;lt;-(as.numeric(read.table(&amp;quot;multiple.sfs&amp;quot;)[2,])) #second region.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Which genotype likelihood model should I choose?=&lt;br /&gt;
It depends on the data. As shown in this example [[Glcomparison]], there was a huge difference between '''-GL 1''' and '''-GL 2''' for older 1000 Genomes BAM files, but little difference for newer BAM files.&lt;br /&gt;
=Validation=&lt;br /&gt;
The validation is based on the pre-0.900 version&lt;br /&gt;
==-doSaf 1==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd misc;&lt;br /&gt;
./supersim -outfiles test -npop 1 -nind 12 -pvar 0.9 -nsites 50000&lt;br /&gt;
echo testchr1 100000 &amp;gt;test.fai&lt;br /&gt;
../angsd -fai test.fai -glf test.glf.gz -nind 12 -doSaf 1 -issim 1&lt;br /&gt;
./realSFS angsdput.saf 24 2&amp;gt;/dev/null &amp;gt;res&lt;br /&gt;
cat res&lt;br /&gt;
31465.429798 4938.453115 2568.586388 1661.227445 1168.891114 975.302535 794.727537 632.691896 648.223566 546.293853 487.936192 417.178505 396.200026 409.813797 308.434836 371.699254 245.585920 322.293532 282.980046 292.584975 212.845183 196.682483 221.802128 236.221205 197.914673&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==-doSaf 2==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ngsSim=../ngsSim/ngsSim&lt;br /&gt;
angsd=./angsd&lt;br /&gt;
realSFS=./misc/realSFS&lt;br /&gt;
&lt;br /&gt;
$ngsSim  -npop 1 -nind 24 -nsites 1000000 -depth 4 -F 0.0 -outfiles testF0.0&lt;br /&gt;
$ngsSim  -npop 1 -nind 24 -nsites 1000000 -depth 4 -F 0.9 -outfiles testF0.9&lt;br /&gt;
&lt;br /&gt;
for i in `seq 24`;do echo 0.9;done &amp;gt;indF&lt;br /&gt;
echo testchr1 250000000 &amp;gt;test.fai&lt;br /&gt;
$angsd -fai test.fai -issim 1 -glf testF0.0.glf.gz -nind 24 -out noF -dosaf 1&lt;br /&gt;
$angsd -fai test.fai -issim 1 -glf testF0.9.glf.gz -nind 24 -out withF -dosaf 2 -domajorminor 1 -domaf 1 -indF indF&lt;br /&gt;
$angsd -fai test.fai -issim 1 -glf testF0.9.glf.gz -nind 24 -out withFsnp -dosaf 2 -domajorminor 1 -domaf 1 -indF indF -snp_pval 1e-4&lt;br /&gt;
&lt;br /&gt;
$realSFS noF.saf 48 &amp;gt;noF.sfs&lt;br /&gt;
$realSFS withF.saf 48 &amp;gt;withF.sfs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
#in R&lt;br /&gt;
trueNoF&amp;lt;-scan(&amp;quot;testF0.0.frq&amp;quot;)&lt;br /&gt;
trueWithF&amp;lt;-scan(&amp;quot;testF0.9.frq&amp;quot;)&lt;br /&gt;
pdf(&amp;quot;sfsFcomparison.pdf&amp;quot;,width=14)&lt;br /&gt;
par(mfrow=c(1,2),width=14)&lt;br /&gt;
barplot(trueNoF[-1],main='true sfs F=0.0')&lt;br /&gt;
barplot(trueWithF[-1],main='true sfs F=0.9')&lt;br /&gt;
&lt;br /&gt;
estWithF&amp;lt;-scan(&amp;quot;withF.sfs&amp;quot;)&lt;br /&gt;
estNoF&amp;lt;-scan(&amp;quot;noF.sfs&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
barplot(rbind(trueNoF,estNoF)[,-1],main=&amp;quot;true vs est SFS F=0 (ML) (all sites)&amp;quot;,be=T,col=1:2)&lt;br /&gt;
barplot(rbind(trueWithF,estWithF)[,-1],main='true vs est sfs=0.9 (MAP) (all sites)',be=T,col=1:2)&lt;br /&gt;
&lt;br /&gt;
readBjoint &amp;lt;- function(file=NULL,nind=10,nsites=10){&lt;br /&gt;
  ff &amp;lt;- gzfile(file,&amp;quot;rb&amp;quot;)&lt;br /&gt;
  m&amp;lt;-matrix(readBin(ff,double(),(2*nind+1)*nsites),ncol=(2*nind+1),byrow=TRUE)&lt;br /&gt;
  close(ff)&lt;br /&gt;
  return(m)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
m &amp;lt;- exp(readBjoint(&amp;quot;withF.saf&amp;quot;,nind=24,5e6))&lt;br /&gt;
barplot(rbind(trueWithF,colMeans(m))[,-1],main='true vs est sfs F=0.9 (colmean of site pp) (all sites)',be=T,col=1:2)&lt;br /&gt;
m &amp;lt;- exp(readBjoint(&amp;quot;withFsnp.saf&amp;quot;,nind=24,5e6))&lt;br /&gt;
m &amp;lt;- colMeans(m)*nrow(m)&lt;br /&gt;
##m contains SFS for absolute frequencies&lt;br /&gt;
m[1] &amp;lt;-1e6-sum(m[-1])&lt;br /&gt;
##m now contains a corrected estimate containing the zero category&lt;br /&gt;
barplot(rbind(trueWithF,norm(m))[,-1],main='true vs est sfs F=0.9 (colmean of site pp) (called snp sites)',be=T,col=1:2)&lt;br /&gt;
&lt;br /&gt;
dev.off()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
See results from above here:http://www.popgen.dk/angsd/sfsFcomparison.pdf&lt;br /&gt;
&lt;br /&gt;
=safv3 comparison=&lt;br /&gt;
Between 0.800 and 0.900 I decided to move to a better format than the raw saf files. This new format takes up half the storage, allows for easy random access and generalizes to up to 5-dimensional SFS. A comparison can be found here: [[safv3]]&lt;br /&gt;
=Using NGStools=&lt;br /&gt;
See [[realSFS]] for how to convert the new saf format to the old saf format if you use NGStools.&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Installation&amp;diff=3182</id>
		<title>Installation</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Installation&amp;diff=3182"/>
		<updated>2023-07-04T10:56:24Z</updated>

		<summary type="html">&lt;p&gt;Isin: update to latest release&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;There has been some confusion about the versions of ANGSD.&lt;br /&gt;
&lt;br /&gt;
* Even versions are freezes from the last odd git version&lt;br /&gt;
&lt;br /&gt;
* Odd versions are git versions. Once there has been enough commits we will increment and make a release.&lt;br /&gt;
&lt;br /&gt;
=Download and Installation=&lt;br /&gt;
To download and use ANGSD you need to download both the htslib and the angsd source folders&lt;br /&gt;
&lt;br /&gt;
You can either download angsd0.938.tar.gz, which contains both.&lt;br /&gt;
[http://popgen.dk/software/download/angsd/angsd0.938.tar.gz]&lt;br /&gt;
&lt;br /&gt;
Or you can use GitHub for the latest versions of both htslib and angsd&lt;br /&gt;
&lt;br /&gt;
Earlier versions from here: http://popgen.dk/software/download/angsd/&lt;br /&gt;
And here: https://github.com/ANGSD/angsd/releases&lt;br /&gt;
&lt;br /&gt;
=Install=&lt;br /&gt;
Download and unpack the tarball, enter the directory and type make. Users on a Mac can use curl instead of wget.&lt;br /&gt;
&lt;br /&gt;
===Unix===&lt;br /&gt;
The software can be compiled using make.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/angsd0.940.tar.gz&lt;br /&gt;
tar xf angsd0.940.tar.gz&lt;br /&gt;
cd htslib;make;cd ..&lt;br /&gt;
cd angsd&lt;br /&gt;
make HTSSRC=../htslib&lt;br /&gt;
cd ..&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The executable is then located in '''angsd/angsd'''.&lt;br /&gt;
&lt;br /&gt;
=Install from github=&lt;br /&gt;
For CRAM support you also need to install htslib, which can be done using the following commands&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone --recursive https://github.com/samtools/htslib.git&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git &lt;br /&gt;
cd htslib;make;cd ../angsd ;make HTSSRC=../htslib&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Systemwide installation of htslib?=&lt;br /&gt;
If htslib is installed systemwide, you can simply type make in the angsd directory&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Vcf&amp;diff=3181</id>
		<title>Vcf</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Vcf&amp;diff=3181"/>
		<updated>2023-06-21T21:53:55Z</updated>

		<summary type="html">&lt;p&gt;Isin: /* Output files */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Newer versions of angsd (master 27april2015) support basic VCF output. This will only include the GL and GP tags, which can be useful for certain external programs&lt;br /&gt;
 &lt;br /&gt;
Supply&lt;br /&gt;
;-doVcf 1&lt;br /&gt;
&lt;br /&gt;
This is simply a wrapper around -gl -dopost -domajorminor -domaf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
A full example commandline is given below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 ./angsd -b list.list -dovcf 1 -gl 1 -dopost 1 -domajorminor 1 -domaf 1 -snp_pval 1e-6&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Output files==&lt;br /&gt;
This will generate a VCF file called angsdput.vcf.gz&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
angsdput.vcf.gz&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
##FORMAT=&amp;lt;ID=GL,Number=G,Type=Float,Description=&amp;quot;scaled Genotype Likelihoods (these are really llh eventhough they sum to one)&amp;quot;&amp;gt;&lt;br /&gt;
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	ind0	ind1	ind2	ind3	ind4	ind5	ind6	ind7	ind8	ind9	ind10	ind11	ind12	ind13	ind14	ind15	ind16	ind17	ind18	ind19	ind20	ind21	ind22	ind23	ind24	ind25	ind26	ind27	ind28	ind29	ind30	ind31	ind32&lt;br /&gt;
1	14000202	.	G	A	.	PASS	.	GL:GP	0.013409,0.986591,0.000001:0.009959,0.990038,0.000003	0.532745,0.394297,0.072957:0.333333,0.333333,0.333333	0.532745,0.394297,0.072957:0.333333,0.333333,0.333333	0.729804,0.270070,0.000126:0.666110,0.333052,0.000839	0.532745,0.394297,0.072957:0.333333,0.333333,0.333333	0.013409,0.986589,0.000003:0.009959,0.990026,0.000015	0.843814,0.156129,0.000057:0.799685,0.199918,0.000397	0.003405,0.996582,0.000013:0.002523,0.997410,0.000068	0.915318,0.084679,0.000003:0.888870,0.111106,0.000025	0.843862,0.156138,0.000000:0.800003,0.199997,0.000000	0.001243,0.728985,0.269772:0.000420,0.333191,0.666388	0.955789,0.044211,0.000000:0.941178,0.058822,0.000000	0.843860,0.156137,0.000003:0.799987,0.199993,0.000020	0.000001,0.999999,0.000000:0.000001,0.999999,0.000000	0.824856,0.152621,0.022523:0.689947,0.172484,0.137569	0.701963,0.259767,0.038270:0.526842,0.263419,0.209740	0.532745,0.394297,0.072957:0.333333,0.333333,0.333333	0.004272,0.995611,0.000117:0.003164,0.996204,0.000632	0.532745,0.394297,0.072957:0.333333,0.333333,0.333333	0.021392,0.791754,0.186854:0.008712,0.435643,0.555645	0.843681,0.156104,0.000215:0.798816,0.199700,0.001484	0.729804,0.270070,0.000126:0.666110,0.333052,0.000839	0.002698,0.996374,0.000928:0.001990,0.993012,0.004997	0.977395,0.022605,0.000000:0.969698,0.030302,0.000000	0.532745,0.394297,0.072957:0.333333,0.333333,0.333333	0.002152,0.997847,0.000002:0.001593,0.998397,0.000010	0.915213,0.084669,0.000118:0.888149,0.111016,0.000835	0.032903,0.965645,0.001452:0.024405,0.967730,0.007865	0.000006,0.999994,0.000000:0.000005,0.999995,0.000000	0.701963,0.259767,0.038270:0.526842,0.263419,0.209740	0.843846,0.156135,0.000019:0.799896,0.199971,0.000133	0.907127,0.083921,0.008952:0.835380,0.104420,0.060201	0.332066,0.644735,0.023200:0.241926,0.634652,0.123421&lt;br /&gt;
1	14000873	.	G	A	.	PASS	.	GL:GP	0.000000,0.124151,0.875849:0.000000,0.030302,0.969698	0.692531,0.305335,0.002134:0.659698,0.329846,0.010456	0.900720,0.099279,0.000000:0.888891,0.111109,0.000000	0.993158,0.006842,0.000000:0.992249,0.007751,0.000000	0.999140,0.000860,0.000000:0.999024,0.000976,0.000000	0.000091,0.994381,0.005528:0.000078,0.975325,0.024596	0.900720,0.099279,0.000000:0.888891,0.111109,0.000000	0.819369,0.180627,0.000004:0.799988,0.199993,0.000018	0.000000,0.124151,0.875849:0.000000,0.030302,0.969698	0.693938,0.305955,0.000107:0.666316,0.333155,0.000529	0.986410,0.013590,0.000000:0.984616,0.015384,0.000000	0.693919,0.305947,0.000135:0.666224,0.333109,0.000666	0.007101,0.988541,0.004358:0.006172,0.974344,0.019484	0.000000,1.000000,0.000000:0.000000,1.000000,0.000000	0.030451,0.672875,0.296674:0.013127,0.328956,0.657917	0.973184,0.026816,0.000000:0.969698,0.030302,0.000000	0.973184,0.026816,0.000000:0.969698,0.030302,0.000000	0.993158,0.006842,0.000000:0.992249,0.007751,0.000000	0.004535,0.995465,0.000000:0.004001,0.995998,0.000000	0.000395,0.693734,0.305871:0.000167,0.333276,0.666557	0.900720,0.099279,0.000000:0.888891,0.111109,0.000000	0.693965,0.305967,0.000068:0.666446,0.333220,0.000334	0.947768,0.052232,0.000000:0.941178,0.058822,0.000000	0.000000,0.220881,0.779119:0.000000,0.058822,0.941178	0.900720,0.099279,0.000000:0.888891,0.111109,0.000000	0.000000,0.017410,0.982590:0.000000,0.003891,0.996109	0.000046,0.999954,0.000000:0.000040,0.999960,0.000000	0.000001,0.999999,0.000000:0.000001,0.999999,0.000000	0.947768,0.052232,0.000000:0.941178,0.058822,0.000000	0.310721,0.689279,0.000000:0.284441,0.715559,0.000000	0.900720,0.099279,0.000000:0.888891,0.111109,0.000000	0.973184,0.026816,0.000000:0.969698,0.030302,0.000000	0.000992,0.693320,0.305688:0.000420,0.333191,0.666388&lt;br /&gt;
1	14001018	.	T	C	.	PASS	.	GL:GP	0.000000,0.069163,0.930837:0.000000,0.015384,0.984616	0.826258,0.173742,0.000000:0.800002,0.199997,0.000002	0.159620,0.840380,0.000000:0.137752,0.862248,0.000000	0.950058,0.049942,0.000000:0.941178,0.058822,0.000000	0.904865,0.095135,0.000000:0.888891,0.111109,0.000000	0.826258,0.173742,0.000000:0.800003,0.199997,0.000000	0.495544,0.416810,0.087646:0.333333,0.333333,0.333333	0.993472,0.006528,0.000000:0.992249,0.007751,0.000000	0.001055,0.703204,0.295741:0.000420,0.333191,0.666388	0.703933,0.296042,0.000025:0.666580,0.333287,0.000133	0.826258,0.173742,0.000000:0.800003,0.199997,0.000000	0.703933,0.296042,0.000025:0.666580,0.333287,0.000133	0.000420,0.703651,0.295929:0.000167,0.333276,0.666557	0.000008,0.999992,0.000000:0.000006,0.999994,0.000000	0.085860,0.913986,0.000153:0.073175,0.926086,0.000739	0.904865,0.095135,0.000000:0.888891,0.111109,0.000000	0.495544,0.416810,0.087646:0.333333,0.333333,0.333333	0.993472,0.006528,0.000000:0.992249,0.007751,0.000000	0.703731,0.295957,0.000313:0.665554,0.332774,0.001672	0.000007,0.543141,0.456852:0.000002,0.199997,0.800001	0.904865,0.095135,0.000000:0.888891,0.111109,0.000000	0.904865,0.095135,0.000000:0.888891,0.111109,0.000000	0.904865,0.095135,0.000000:0.888891,0.111109,0.000000	0.000007,0.543141,0.456852:0.000002,0.199997,0.800001	0.904865,0.095135,0.000000:0.888892,0.111108,0.000000	0.000000,0.229117,0.770883:0.000000,0.058822,0.941178	0.045152,0.954211,0.000637:0.038160,0.958794,0.003045	0.000599,0.999394,0.000006:0.000504,0.999465,0.000031	0.974389,0.025611,0.000000:0.969698,0.030302,0.000000	0.000005,0.543142,0.456853:0.000002,0.199997,0.800002	0.974389,0.025611,0.000000:0.969698,0.030302,0.000000	0.703916,0.296035,0.000050:0.666492,0.333243,0.000265	0.495544,0.416810,0.087646:0.333333,0.333333,0.333333&lt;br /&gt;
1	14001867	.	A	G	.	PASS	.	GL:GP	0.000407,0.698622,0.300971:0.000167,0.333276,0.666557	0.902773,0.097227,0.000000:0.888891,0.111109,0.000000	0.000645,0.698456,0.300899:0.000265,0.333243,0.666492	0.902773,0.097227,0.000000:0.888891,0.111109,0.000000	0.902773,0.097227,0.000000:0.888891,0.111109,0.000000	0.000006,0.998276,0.001719:0.000005,0.992066,0.007929	0.986717,0.013283,0.000000:0.984616,0.015384,0.000000	0.822776,0.177224,0.000000:0.800002,0.199997,0.000001	0.000003,0.537165,0.462832:0.000001,0.199997,0.800002	0.948903,0.051097,0.000000:0.941178,0.058822,0.000000	0.948903,0.051097,0.000000:0.941178,0.058822,0.000000	0.986717,0.013283,0.000000:0.984616,0.015384,0.000000	0.000000,0.537166,0.462833:0.000000,0.199997,0.800003	0.000060,0.999940,0.000000:0.000051,0.999949,0.000000	0.000079,0.999921,0.000000:0.000068,0.999932,0.000000	0.902773,0.097227,0.000000:0.888891,0.111109,0.000000	0.902773,0.097227,0.000000:0.888891,0.111109,0.000000	0.999159,0.000841,0.000000:0.999024,0.000976,0.000000	0.004641,0.995359,0.000000:0.004001,0.995999,0.000000	0.000645,0.698456,0.300899:0.000265,0.333243,0.666492	0.948903,0.051097,0.000000:0.941178,0.058822,0.000000	0.822774,0.177223,0.000002:0.799994,0.199995,0.000011	0.996646,0.003354,0.000000:0.996109,0.003891,0.000000	0.000000,0.008985,0.991015:0.000000,0.001949,0.998051	0.822776,0.177224,0.000000:0.800002,0.199997,0.000001	0.000000,0.367208,0.632792:0.000000,0.111109,0.888891	0.000000,0.999135,0.000865:0.000000,0.995998,0.004001	0.005821,0.994177,0.000002:0.005019,0.994970,0.000011	0.993314,0.006686,0.000000:0.992249,0.007751,0.000000	0.000000,0.999998,0.000002:0.000000,0.999992,0.000008	0.948903,0.051097,0.000000:0.941178,0.058822,0.000000	0.822776,0.177224,0.000001:0.800000,0.199997,0.000003	0.007316,0.992667,0.000017:0.006309,0.993611,0.000079&lt;br /&gt;
[capped]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Notice that the sample IDs are simply ''ind'' followed by an integer. These correspond to the order of the samples in the `-b filelist`.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Allele_Frequencies&amp;diff=3180</id>
		<title>Allele Frequencies</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Allele_Frequencies&amp;diff=3180"/>
		<updated>2023-06-08T10:16:21Z</updated>

		<summary type="html">&lt;p&gt;Isin: /* Output data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div class=&amp;quot;keywords&amp;quot;&amp;gt; -domaf,-domaf,-domaf,-domaf,-domaf, domaf, domaf, domaf, domaf, domaf, domaf, dopost, SNP_pval &amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The allele frequency is the relative frequency of an allele at a site. This can be polarized according to major/minor, reference/non-reference or ancestral/derived alleles. Therefore the choice of allele frequency estimator is closely related to choosing which alleles are segregating (see [[Inferring_Major_and_Minor_alleles]]). &lt;br /&gt;
&lt;br /&gt;
We allow for frequency estimation from different input data:&lt;br /&gt;
&lt;br /&gt;
# Genotype Likelihoods&lt;br /&gt;
# Genotype posterior probabilities&lt;br /&gt;
# Counts of bases&lt;br /&gt;
&lt;br /&gt;
The allele frequency estimator from genotype likelihoods is from this [[suYeon | publication]], and the base counts method is from this [[Li2010 |publication]]. &lt;br /&gt;
&lt;br /&gt;
For the genotype likelihood based methods we allow for deviations from Hardy-Weinberg equilibrium: users can supply a file containing inbreeding coefficients for each individual.&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 ./angsd -doMaf&lt;br /&gt;
abcFreq.cpp:&lt;br /&gt;
-doMaf	0 (Calculate persite frequencies '.mafs.gz')&lt;br /&gt;
	1: Frequency (fixed major and minor)&lt;br /&gt;
	2: Frequency (fixed major unknown minor)&lt;br /&gt;
	4: Frequency from genotype probabilities&lt;br /&gt;
	8: AlleleCounts based method (known major minor)&lt;br /&gt;
	NB. Filedumping is supressed if value is negative&lt;br /&gt;
-doPost	0	(Calculate posterior prob 3xgprob)&lt;br /&gt;
	1: Using frequency as prior&lt;br /&gt;
	2: Using uniform prior&lt;br /&gt;
	3: Using SFS as prior (still in development)&lt;br /&gt;
	4: Using reference panel as prior (still in development), requires a site file with chr pos major minor af ac an&lt;br /&gt;
Filters:&lt;br /&gt;
	-minMaf  	-1.000000	(Remove sites with MAF below)&lt;br /&gt;
	-SNP_pval	0.317311	(Remove sites with a pvalue larger)&lt;br /&gt;
	-rmSNPs 	0	(Remove infered SNPs instead of keeping them (pval &amp;gt; SNP_pval)&lt;br /&gt;
	-rmTriallelic	0.000000	(Remove sites with a pvalue lower)&lt;br /&gt;
	-forceMaf	0	(Write .mafs file when running -doAsso (by default does not output .mafs file with -doAsso))&lt;br /&gt;
	-skipMissing	1	(Set post to 0.33 if missing (do not use freq as prior))&lt;br /&gt;
Extras:&lt;br /&gt;
	-ref	(null)	(Filename for fasta reference)&lt;br /&gt;
	-anc	(null)	(Filename for fasta ancestral)&lt;br /&gt;
	-eps	0.001000 [Only used for -doMaf &amp;amp;8]&lt;br /&gt;
	-beagleProb	0 (Dump beagle style postprobs)&lt;br /&gt;
	-indFname	(null) (file containing individual inbreedcoeficients)&lt;br /&gt;
	-underFlowProtect	0 (file containing individual inbreedcoeficients)&lt;br /&gt;
NB These frequency estimators requires major/minor -doMajorMinor&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Allele Frequency estimation=&lt;br /&gt;
The major and minor alleles are first inferred from the data or given by the user (see [[Inferring_Major_and_Minor_alleles]]). This can use information from both the major and minor allele, a reference genome (for the major) or an ancestral genome. &lt;br /&gt;
&lt;br /&gt;
; -doMaf [int]&lt;br /&gt;
&lt;br /&gt;
1: Known major, known minor. Here both the major and minor allele are assumed to be known (inferred or given by the user). The allele frequency is then obtained based on the genotype likelihoods. The estimator is from this [[suYeon | publication]] but uses the EM algorithm and is briefly described [[SYKmaf|here]]. &lt;br /&gt;
&lt;br /&gt;
2: Known major, unknown minor. Here the major allele is assumed to be known (inferred or given by the user), but the minor allele is not determined. Instead we sum over the 3 possible minor alleles weighted by their probabilities. The estimator is from this [[suYeon | publication]] but uses the EM algorithm and is briefly described [[SYKmaf|here]]. &lt;br /&gt;
&lt;br /&gt;
4: Frequency based on genotype posterior probabilities. If genotype probabilities are used as input to ANGSD, the allele frequency is estimated directly from these by [[postFreq|summing over the probabilities]]. &lt;br /&gt;
&lt;br /&gt;
8: Frequency based on base counts. This method does not rely on genotype likelihoods or probabilities but instead infers the allele frequency directly from the base counts. The base counts method is from this [[Li2010 |publication]]. &lt;br /&gt;
&lt;br /&gt;
Multiple estimators can be used simultaneously by summing the above numbers. Thus -doMaf 7 (1+2+4) will use the first three estimators. If the allele frequencies are estimated from the genotype likelihoods, you also need to infer the major and minor allele (-doMajorMinor)&lt;br /&gt;
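The summing behaviour can be illustrated with a small bit-mask decoder (illustrative Python, not ANGSD code; the estimator names are shorthand for the options above):&lt;br /&gt;

```python
# -doMaf values act as a bit mask; summing them enables several estimators.
ESTIMATORS = {
    1: "known major and minor (GL)",
    2: "known major, unknown minor (GL)",
    4: "genotype posterior probabilities",
    8: "base counts",
}

def decode_domaf(value):
    return [name for bit, name in sorted(ESTIMATORS.items()) if value & bit]

# decode_domaf(7) enables the 1, 2 and 4 estimators
```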
&lt;br /&gt;
;NB: using -doMaf 4 is only supported if the posteriors are supplied as external files, since the estimation of genotype posteriors itself requires an allele frequency estimate.&lt;br /&gt;
&lt;br /&gt;
=Example=&lt;br /&gt;
&lt;br /&gt;
==From genotype likelihood==&lt;br /&gt;
Example of estimating the allele frequencies both assuming a known major and minor allele and taking the uncertainty of the minor allele inference into account. The [[Inferring_Major_and_Minor_alleles|inference of the major and minor]] alleles is done directly from the genotype likelihoods&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -out out -doMajorMinor 1 -doMaf 3 -bam bam.filelist -GL 2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==From genotype probabilities==&lt;br /&gt;
Example using a genotype probability file, for example the output from beagle. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -out out -doMaf 4 -beagle beagle.file.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Estimator from base counts==&lt;br /&gt;
&lt;br /&gt;
The allele frequencies can be inferred directly from the sequencing data [[Li2010|citation]].&lt;br /&gt;
This works by using &amp;quot;counts&amp;quot; of alleles and should be invoked like&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -out out -doMajorMinor 2 -doMaf 8 -bam bam.filelist -doCounts 1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Output data=&lt;br /&gt;
==.mafs.gz==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chromo	position	major	minor	knownEM	unknownEM	nInd&lt;br /&gt;
21      9719788 T       A       0.000001        -0.000012       3&lt;br /&gt;
21      9719789 G       A       0.000000        -0.000001       3&lt;br /&gt;
21      9719790 A       C       0.000000        -0.000004       3&lt;br /&gt;
21      9719791 G       A       0.000000        -0.000001       3&lt;br /&gt;
21      9719792 G       A       0.000000        -0.000002       3&lt;br /&gt;
21      9719793 G       T       0.498277        41.932766       3&lt;br /&gt;
21      9719794 T       A       0.000000        -0.000001       3&lt;br /&gt;
21      9719795 T       A       0.000000        -0.000001       3&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
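A minimal Python sketch for filtering such a table, assuming whitespace-separated columns with a header row as in the sample above (the file name in the comment is the default angsd output name):&lt;br /&gt;

```python
def filter_mafs(lines, min_freq=0.05, col="knownEM"):
    # Keep rows whose frequency column is at least min_freq.
    header = lines[0].split()
    rows = (dict(zip(header, ln.split())) for ln in lines[1:] if ln.strip())
    return [r for r in rows if float(r[col]) >= min_freq]

# Typical usage on gzipped output:
# import gzip
# with gzip.open("angsdput.mafs.gz", "rt") as f:
#     snps = filter_mafs(f.readlines())
```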
&lt;br /&gt;
;chromo &lt;br /&gt;
chromosome name&lt;br /&gt;
;position&lt;br /&gt;
position&lt;br /&gt;
;major &lt;br /&gt;
major allele&lt;br /&gt;
;minor &lt;br /&gt;
minor allele&lt;br /&gt;
;knownEM &lt;br /&gt;
frequency using -doMaf 1&lt;br /&gt;
;unknownEM &lt;br /&gt;
frequency using -doMaf 2&lt;br /&gt;
;phat &lt;br /&gt;
frequency using -doMaf 8&lt;br /&gt;
;nInd &lt;br /&gt;
is the number of individuals with data&lt;br /&gt;
;pK-EM&lt;br /&gt;
p-value for the frequency of (known) minor allele (-doSNPStat 1 -doMaf 1)&lt;br /&gt;
;pu-EM&lt;br /&gt;
p-value for the frequency of (unknown) minor allele (-doSNPStat 1 -doMaf 2)&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Allele_Frequencies&amp;diff=3179</id>
		<title>Allele Frequencies</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Allele_Frequencies&amp;diff=3179"/>
		<updated>2023-06-08T07:55:20Z</updated>

		<summary type="html">&lt;p&gt;Isin: /* .mafs.gz */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div class=&amp;quot;keywords&amp;quot;&amp;gt; -domaf,-domaf,-domaf,-domaf,-domaf, domaf, domaf, domaf, domaf, domaf, domaf, dopost, SNP_pval &amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The allele frequency is the relative frequency of an allele at a site. It can be polarized according to major/minor, reference/non-reference or ancestral/derived alleles. Therefore the choice of allele frequency estimator is closely related to choosing which alleles are segregating (see [[Inferring_Major_and_Minor_alleles]]). &lt;br /&gt;
&lt;br /&gt;
We allow for frequency estimation from different input data:&lt;br /&gt;
&lt;br /&gt;
# Genotype Likelihoods&lt;br /&gt;
# Genotype posterior probabilities&lt;br /&gt;
# Counts of bases&lt;br /&gt;
&lt;br /&gt;
The allele frequency estimator from genotype likelihoods is from this [[suYeon | publication]], and the base counts method is from this [[Li2010 |publication]]. &lt;br /&gt;
&lt;br /&gt;
For the genotype likelihood based methods we allow for deviations from Hardy-Weinberg equilibrium by letting users supply a file containing inbreeding coefficients for each individual.&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 ./angsd -doMaf&lt;br /&gt;
abcFreq.cpp:&lt;br /&gt;
-doMaf	0 (Calculate persite frequencies '.mafs.gz')&lt;br /&gt;
	1: Frequency (fixed major and minor)&lt;br /&gt;
	2: Frequency (fixed major unknown minor)&lt;br /&gt;
	4: Frequency from genotype probabilities&lt;br /&gt;
	8: AlleleCounts based method (known major minor)&lt;br /&gt;
	NB. Filedumping is supressed if value is negative&lt;br /&gt;
-doPost	0	(Calculate posterior prob 3xgprob)&lt;br /&gt;
	1: Using frequency as prior&lt;br /&gt;
	2: Using uniform prior&lt;br /&gt;
	3: Using SFS as prior (still in development)&lt;br /&gt;
	4: Using reference panel as prior (still in development), requires a site file with chr pos major minor af ac an&lt;br /&gt;
Filters:&lt;br /&gt;
	-minMaf  	-1.000000	(Remove sites with MAF below)&lt;br /&gt;
	-SNP_pval	0.317311	(Remove sites with a pvalue larger)&lt;br /&gt;
	-rmSNPs 	0	(Remove infered SNPs instead of keeping them (pval &amp;gt; SNP_pval)&lt;br /&gt;
	-rmTriallelic	0.000000	(Remove sites with a pvalue lower)&lt;br /&gt;
	-forceMaf	0	(Write .mafs file when running -doAsso (by default does not output .mafs file with -doAsso))&lt;br /&gt;
	-skipMissing	1	(Set post to 0.33 if missing (do not use freq as prior))&lt;br /&gt;
Extras:&lt;br /&gt;
	-ref	(null)	(Filename for fasta reference)&lt;br /&gt;
	-anc	(null)	(Filename for fasta ancestral)&lt;br /&gt;
	-eps	0.001000 [Only used for -doMaf &amp;amp;8]&lt;br /&gt;
	-beagleProb	0 (Dump beagle style postprobs)&lt;br /&gt;
	-indFname	(null) (file containing individual inbreedcoeficients)&lt;br /&gt;
	-underFlowProtect	0 (file containing individual inbreedcoeficients)&lt;br /&gt;
NB These frequency estimators requires major/minor -doMajorMinor&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Allele Frequency estimation=&lt;br /&gt;
The major and minor allele is first inferred from the data or given by the user (see [[Inferring_Major_and_Minor_alleles]]). This can use information from the sequencing data itself, a reference genome (for the major allele) or an ancestral genome. &lt;br /&gt;
&lt;br /&gt;
; -doMaf [int]&lt;br /&gt;
&lt;br /&gt;
1:  Known major, known minor. Here both the major and minor allele are assumed to be known (inferred or given by the user). The allele frequency is then estimated from the genotype likelihoods. The estimator is from this [[suYeon | publication]] but uses the EM algorithm, and is briefly described [[SYKmaf|here]]. &lt;br /&gt;
&lt;br /&gt;
2:  Known major, unknown minor. Here the major allele is assumed to be known (inferred or given by the user) but the minor allele is not determined. Instead we sum over the 3 possible minor alleles, weighted by their probabilities. The estimator is from this [[suYeon | publication]] but uses the EM algorithm, and is briefly described [[SYKmaf|here]]. &lt;br /&gt;
&lt;br /&gt;
4: Frequency based on genotype posterior probabilities. If genotype probabilities are used as input to ANGSD, the allele frequency is estimated directly from these by [[postFreq|summing over the probabilities]]. &lt;br /&gt;
&lt;br /&gt;
8: Frequency based on base counts. This method does not rely on genotype likelihoods or probabilities but instead infers the allele frequency directly from the base counts. The base counts method is from this [[Li2010 |publication]]. &lt;br /&gt;
&lt;br /&gt;
Multiple estimators can be used simultaneously by summing the numbers above. Thus -doMaf 7 (1+2+4) will use the first three estimators. If the allele frequencies are estimated from the genotype likelihoods, you need to infer the major and minor allele (-doMajorMinor).&lt;br /&gt;
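The summing of -doMaf codes behaves like a bit mask. As a minimal illustration (a sketch, not ANGSD code; the codes and descriptions follow the overview above), a combined value can be decomposed like this:&lt;br /&gt;

```python
# Sketch: decompose a summed -doMaf value into its individual estimators.
# Codes and descriptions follow the overview table above.
ESTIMATORS = {
    1: "known major, known minor (EM)",
    2: "known major, unknown minor (EM)",
    4: "frequency from genotype posterior probabilities",
    8: "base-counts (phat) estimator",
}

def decode_domaf(value):
    # A code is active when its bit is set in the summed value.
    return [name for code, name in ESTIMATORS.items() if (value // code) % 2 == 1]

print(decode_domaf(7))  # the first three estimators: codes 1, 2 and 4
```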
&lt;br /&gt;
;NB using -doMaf 4 is only supported if the posteriors are supplied as external files, since the estimation of genotype posteriors itself requires a MAF estimator.&lt;br /&gt;
&lt;br /&gt;
=Example=&lt;br /&gt;
&lt;br /&gt;
==From genotype likelihood==&lt;br /&gt;
Example of estimating the allele frequencies both while assuming a known major and minor allele and while taking the uncertainty of the minor allele inference into account. The [[Inferring_Major_and_Minor_alleles|inference of the major and minor]] allele is done directly from the genotype likelihoods.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -out out -doMajorMinor 1 -doMaf 3 -bam bam.filelist -GL 2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==From genotype probabilities==&lt;br /&gt;
Example of using a genotype probability file, for example output from Beagle. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -out out -doMaf 4 -beagle beagle.file.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Estimator from base counts==&lt;br /&gt;
&lt;br /&gt;
The allele frequencies can be inferred directly from the sequencing data [[Li2010|citation]].&lt;br /&gt;
This works by using &amp;quot;counts&amp;quot; of alleles, and is invoked like this:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -out out -doMajorMinor 2 -doMaf 8 -bam bam.filelist -doCounts 1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Output data=&lt;br /&gt;
==.mafs.gz==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chromo	position	major	minor	knownEM	unknownEM	nInd&lt;br /&gt;
21      9719788 T       A       0.000001        -0.000012       3&lt;br /&gt;
21      9719789 G       A       0.000000        -0.000001       3&lt;br /&gt;
21      9719790 A       C       0.000000        -0.000004       3&lt;br /&gt;
21      9719791 G       A       0.000000        -0.000001       3&lt;br /&gt;
21      9719792 G       A       0.000000        -0.000002       3&lt;br /&gt;
21      9719793 G       T       0.498277        41.932766       3&lt;br /&gt;
21      9719794 T       A       0.000000        -0.000001       3&lt;br /&gt;
21      9719795 T       A       0.000000        -0.000001       3&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;chromo &lt;br /&gt;
chromosome name&lt;br /&gt;
;position&lt;br /&gt;
position&lt;br /&gt;
;major &lt;br /&gt;
major allele&lt;br /&gt;
;minor &lt;br /&gt;
minor allele&lt;br /&gt;
;knownEM &lt;br /&gt;
frequency using -doMaf 1&lt;br /&gt;
;unknownEM &lt;br /&gt;
frequency using -doMaf 2&lt;br /&gt;
;phat &lt;br /&gt;
frequency using -doMaf 8&lt;br /&gt;
;nInd &lt;br /&gt;
is the number of individuals with data&lt;br /&gt;
;pK-EM&lt;br /&gt;
p-value for a site being a SNP, with known minor allele (-doSNPStat 1 -doMaf 1)&lt;br /&gt;
;pu-EM&lt;br /&gt;
p-value for a site being a SNP, with unknown minor allele (-doSNPStat 1 -doMaf 2)&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Allele_Frequencies&amp;diff=3178</id>
		<title>Allele Frequencies</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Allele_Frequencies&amp;diff=3178"/>
		<updated>2023-06-07T14:03:33Z</updated>

		<summary type="html">&lt;p&gt;Isin: /* .mafs.gz */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div class=&amp;quot;keywords&amp;quot;&amp;gt; -domaf,-domaf,-domaf,-domaf,-domaf, domaf, domaf, domaf, domaf, domaf, domaf, dopost, SNP_pval &amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The allele frequency is the relative frequency of an allele at a site. It can be polarized according to major/minor, reference/non-reference or ancestral/derived alleles. Therefore the choice of allele frequency estimator is closely related to choosing which alleles are segregating (see [[Inferring_Major_and_Minor_alleles]]). &lt;br /&gt;
&lt;br /&gt;
We allow for frequency estimation from different input data:&lt;br /&gt;
&lt;br /&gt;
# Genotype Likelihoods&lt;br /&gt;
# Genotype posterior probabilities&lt;br /&gt;
# Counts of bases&lt;br /&gt;
&lt;br /&gt;
The allele frequency estimator from genotype likelihoods is from this [[suYeon | publication]], and the base counts method is from this [[Li2010 |publication]]. &lt;br /&gt;
&lt;br /&gt;
For the genotype likelihood based methods we allow for deviations from Hardy-Weinberg equilibrium by letting users supply a file containing inbreeding coefficients for each individual.&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 ./angsd -doMaf&lt;br /&gt;
abcFreq.cpp:&lt;br /&gt;
-doMaf	0 (Calculate persite frequencies '.mafs.gz')&lt;br /&gt;
	1: Frequency (fixed major and minor)&lt;br /&gt;
	2: Frequency (fixed major unknown minor)&lt;br /&gt;
	4: Frequency from genotype probabilities&lt;br /&gt;
	8: AlleleCounts based method (known major minor)&lt;br /&gt;
	NB. Filedumping is supressed if value is negative&lt;br /&gt;
-doPost	0	(Calculate posterior prob 3xgprob)&lt;br /&gt;
	1: Using frequency as prior&lt;br /&gt;
	2: Using uniform prior&lt;br /&gt;
	3: Using SFS as prior (still in development)&lt;br /&gt;
	4: Using reference panel as prior (still in development), requires a site file with chr pos major minor af ac an&lt;br /&gt;
Filters:&lt;br /&gt;
	-minMaf  	-1.000000	(Remove sites with MAF below)&lt;br /&gt;
	-SNP_pval	0.317311	(Remove sites with a pvalue larger)&lt;br /&gt;
	-rmSNPs 	0	(Remove infered SNPs instead of keeping them (pval &amp;gt; SNP_pval)&lt;br /&gt;
	-rmTriallelic	0.000000	(Remove sites with a pvalue lower)&lt;br /&gt;
	-forceMaf	0	(Write .mafs file when running -doAsso (by default does not output .mafs file with -doAsso))&lt;br /&gt;
	-skipMissing	1	(Set post to 0.33 if missing (do not use freq as prior))&lt;br /&gt;
Extras:&lt;br /&gt;
	-ref	(null)	(Filename for fasta reference)&lt;br /&gt;
	-anc	(null)	(Filename for fasta ancestral)&lt;br /&gt;
	-eps	0.001000 [Only used for -doMaf &amp;amp;8]&lt;br /&gt;
	-beagleProb	0 (Dump beagle style postprobs)&lt;br /&gt;
	-indFname	(null) (file containing individual inbreedcoeficients)&lt;br /&gt;
	-underFlowProtect	0 (file containing individual inbreedcoeficients)&lt;br /&gt;
NB These frequency estimators requires major/minor -doMajorMinor&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Allele Frequency estimation=&lt;br /&gt;
The major and minor allele is first inferred from the data or given by the user (see [[Inferring_Major_and_Minor_alleles]]). This can use information from the sequencing data itself, a reference genome (for the major allele) or an ancestral genome. &lt;br /&gt;
&lt;br /&gt;
; -doMaf [int]&lt;br /&gt;
&lt;br /&gt;
1:  Known major, known minor. Here both the major and minor allele are assumed to be known (inferred or given by the user). The allele frequency is then estimated from the genotype likelihoods. The estimator is from this [[suYeon | publication]] but uses the EM algorithm, and is briefly described [[SYKmaf|here]]. &lt;br /&gt;
&lt;br /&gt;
2:  Known major, unknown minor. Here the major allele is assumed to be known (inferred or given by the user) but the minor allele is not determined. Instead we sum over the 3 possible minor alleles, weighted by their probabilities. The estimator is from this [[suYeon | publication]] but uses the EM algorithm, and is briefly described [[SYKmaf|here]]. &lt;br /&gt;
&lt;br /&gt;
4: Frequency based on genotype posterior probabilities. If genotype probabilities are used as input to ANGSD, the allele frequency is estimated directly from these by [[postFreq|summing over the probabilities]]. &lt;br /&gt;
&lt;br /&gt;
8: Frequency based on base counts. This method does not rely on genotype likelihoods or probabilities but instead infers the allele frequency directly from the base counts. The base counts method is from this [[Li2010 |publication]]. &lt;br /&gt;
&lt;br /&gt;
Multiple estimators can be used simultaneously by summing the numbers above. Thus -doMaf 7 (1+2+4) will use the first three estimators. If the allele frequencies are estimated from the genotype likelihoods, you need to infer the major and minor allele (-doMajorMinor).&lt;br /&gt;
&lt;br /&gt;
;NB using -doMaf 4 is only supported if the posteriors are supplied as external files, since the estimation of genotype posteriors itself requires a MAF estimator.&lt;br /&gt;
&lt;br /&gt;
=Example=&lt;br /&gt;
&lt;br /&gt;
==From genotype likelihood==&lt;br /&gt;
Example of estimating the allele frequencies both while assuming a known major and minor allele and while taking the uncertainty of the minor allele inference into account. The [[Inferring_Major_and_Minor_alleles|inference of the major and minor]] allele is done directly from the genotype likelihoods.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -out out -doMajorMinor 1 -doMaf 3 -bam bam.filelist -GL 2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==From genotype probabilities==&lt;br /&gt;
Example of using a genotype probability file, for example output from Beagle. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -out out -doMaf 4 -beagle beagle.file.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Estimator from base counts==&lt;br /&gt;
&lt;br /&gt;
The allele frequencies can be inferred directly from the sequencing data [[Li2010|citation]].&lt;br /&gt;
This works by using &amp;quot;counts&amp;quot; of alleles, and is invoked like this:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -out out -doMajorMinor 2 -doMaf 8 -bam bam.filelist -doCounts 1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Output data=&lt;br /&gt;
==.mafs.gz==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chromo	position	major	minor	knownEM	unknownEM	nInd&lt;br /&gt;
21      9719788 T       A       0.000001        -0.000012       3&lt;br /&gt;
21      9719789 G       A       0.000000        -0.000001       3&lt;br /&gt;
21      9719790 A       C       0.000000        -0.000004       3&lt;br /&gt;
21      9719791 G       A       0.000000        -0.000001       3&lt;br /&gt;
21      9719792 G       A       0.000000        -0.000002       3&lt;br /&gt;
21      9719793 G       T       0.498277        41.932766       3&lt;br /&gt;
21      9719794 T       A       0.000000        -0.000001       3&lt;br /&gt;
21      9719795 T       A       0.000000        -0.000001       3&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;chromo &lt;br /&gt;
chromosome name&lt;br /&gt;
;position&lt;br /&gt;
position&lt;br /&gt;
;major &lt;br /&gt;
major allele&lt;br /&gt;
;minor &lt;br /&gt;
minor allele&lt;br /&gt;
;knownEM &lt;br /&gt;
frequency using -doMaf 1&lt;br /&gt;
;unknownEM &lt;br /&gt;
frequency using -doMaf 2&lt;br /&gt;
;phat &lt;br /&gt;
frequency using -doMaf 8&lt;br /&gt;
;nInd &lt;br /&gt;
is the number of individuals with data&lt;br /&gt;
;pK-EM&lt;br /&gt;
probability of a site being a SNP (using -doSNPStat 1 and -doMaf 1)&lt;br /&gt;
;pu-EM&lt;br /&gt;
probability of a site being a SNP (using -doSNPStat 1 and -doMaf 2)&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Association&amp;diff=3177</id>
		<title>Association</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Association&amp;diff=3177"/>
		<updated>2023-06-07T13:32:26Z</updated>

		<summary type="html">&lt;p&gt;Isin: /* Output */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Association can be performed using two approaches.&lt;br /&gt;
# Based on testing differences in allele frequencies between cases and controls, using genotype likelihoods&lt;br /&gt;
# Based on a generalized linear framework, which allows for both binary and quantitative traits and for including additional covariates, using genotype posteriors. &lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites. In the case of alignment data (BAM), we advise users to filter out low mapping quality reads and low quality-score bases.&lt;br /&gt;
&lt;br /&gt;
The filtering of the alignment data is described in [[Input]], and filtering based on frequencies/polymorphic sites are described [[Filters#Allele_frequencies| here]].&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
This can be done easily at the command line by adding the below commands&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
-minQ 20 -minMapQ 30 -SNP_pval 1e-6 #Use polymorphic sites with a p-value of 10^-6&lt;br /&gt;
-minQ 20 -minMapQ 30 -minMaf 0.05 #Use sites with a MAF &amp;gt;0.05&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doAsso&lt;br /&gt;
abcAsso.cpp:&lt;br /&gt;
        -doAsso 0&lt;br /&gt;
        1: Frequency Test (Known Major and Minor)&lt;br /&gt;
        2: Score Test&lt;br /&gt;
        4: Latent genotype model&lt;br /&gt;
        5: Score Test with latent genotype model - hybrid test&lt;br /&gt;
        6: Dosage regression&lt;br /&gt;
        7: Latent genotype model (wald test) - NOT PROPERLY TESTED YET!&lt;br /&gt;
  Frequency Test Options:&lt;br /&gt;
        -yBin           (null)  (File containing disease status)&lt;br /&gt;
&lt;br /&gt;
  Score, Latent, Hybrid and Dosage Test Options:&lt;br /&gt;
        -yBin           (null)  (File containing disease status)&lt;br /&gt;
        -yCount         (null)  (File containing count phenotypes)&lt;br /&gt;
        -yQuant         (null)  (File containing phenotypes)&lt;br /&gt;
        -cov            (null)  (File containing additional covariates)&lt;br /&gt;
        -sampleFile             (null)  (.sample File containing phenotypes and covariates)&lt;br /&gt;
        -whichPhe       (null)  Select which phenotypes to analyse, write phenos comma seperated ('phe1,phe2,...'), only works with a .sample file&lt;br /&gt;
        -whichCov       (null)  Select which covariates to include, write covs comma seperated ('cov1,cov2,...'), only works with a .sample file&lt;br /&gt;
        -model  1&lt;br /&gt;
        1: Additive/Log-Additive (Default)&lt;br /&gt;
        2: Dominant&lt;br /&gt;
        3: Recessive&lt;br /&gt;
&lt;br /&gt;
        -minHigh        10      (Require atleast minHigh number of high credible genotypes)&lt;br /&gt;
        -minCount       10      (Require this number of minor alleles, estimated from MAF)&lt;br /&gt;
        -assoThres      0.000001        Threshold for logistic regression&lt;br /&gt;
        -assoIter       100     Number of iterations for logistic regression&lt;br /&gt;
        -emThres        0.000100        Threshold for convergence of EM algorithm in doAsso 4 and 5&lt;br /&gt;
        -emIter 40      Number of max iterations for EM algorithm in doAsso 4 and 5&lt;br /&gt;
&lt;br /&gt;
        -doPriming      1       Prime EM algorithm with dosage derived coefficients (0: no, 1: yes - default)&lt;br /&gt;
&lt;br /&gt;
        -Pvalue 0       Prints a P-value instead of a likelihood ratio (0: no - default, 1: yes)&lt;br /&gt;
&lt;br /&gt;
  Hybrid Test Options:&lt;br /&gt;
        -hybridThres            0.050000        (p-value value threshold for when to perform latent genotype model)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Case control association using allele frequencies=&lt;br /&gt;
To test for differences in the allele frequencies, genotype likelihoods need to be provided or [[Genotype_likelihoods_from_alignments | estimated]]. The test is an implementation of the likelihood ratio test for differences between cases and controls, described in detail in [[Kim2011]].&lt;br /&gt;
&lt;br /&gt;
;-doAsso [int] &lt;br /&gt;
'''1''': The test is performed assuming the minor allele is known. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;-yBin [Filename]&lt;br /&gt;
A file containing the case/control status: 0 for controls, 1 for cases and -999 for missing phenotypes. The file should contain a single phenotype entry per line.&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
Example of a case/control phenotype file&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
-999&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
&lt;br /&gt;
Create a large number of individuals (500) by recycling the example files, and simulate some case/control phenotypes using R:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in `seq 1 50`;do cat bam.filelist&amp;gt;&amp;gt;large.filelist;done&lt;br /&gt;
Rscript -e &amp;quot;write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -yBin pheno.ybin -doAsso 1 -GL 1 -out out -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -bam large.filelist -r 1: -P 5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that because you are reading 500 BAM files, this takes a little while.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
gunzip -c out.lrt0.gz | head&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
Chromosome	Position	Major	Minor	Frequency	LRT&lt;br /&gt;
1	14000003	G	A	0.057070	0.016684&lt;br /&gt;
1	14000013	G	A	0.067886	0.029014&lt;br /&gt;
1	14000019	G	T	0.052904	0.569061&lt;br /&gt;
1	14000023	C	A	0.073336	0.184060&lt;br /&gt;
1	14000053	T	C	0.038903	0.604695&lt;br /&gt;
1	14000170	C	T	0.050756	0.481033&lt;br /&gt;
1	14000176	G	A	0.053157	0.424910&lt;br /&gt;
1	14000200	C	A	0.085332	0.485030&lt;br /&gt;
1	14000202	G	A	0.257132	0.025047&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The LRT is the likelihood ratio statistic, which is chi-square distributed with one degree of freedom. The P-value can be obtained instead by using -Pvalue 1. -Pvalue is accurate up to chi-square values of 70, which is equivalent to P-values of 1.1102e-16.&lt;br /&gt;
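As a cross-check, an LRT value can also be converted to a P-value outside ANGSD; a minimal sketch using only the Python standard library (one degree of freedom, as stated above):&lt;br /&gt;

```python
import math

def lrt_to_pvalue(lrt):
    # Upper tail of a chi-square distribution with 1 degree of freedom:
    # P(X above lrt) = erfc(sqrt(lrt / 2))
    return math.erfc(math.sqrt(lrt / 2.0))

# The classical 5% critical value for 1 df is about 3.841:
print(round(lrt_to_pvalue(3.841), 3))  # roughly 0.05
```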
&lt;br /&gt;
==Dependency Chain==&lt;br /&gt;
The method is based on estimating frequencies from genotype likelihoods. If alignment data has been supplied you need to specify the following.&lt;br /&gt;
&lt;br /&gt;
# [[Genotype_likelihoods_from_alignments | Genotype likelihood model (-GL)]].&lt;br /&gt;
#[[Inferring_Major_and_Minor_alleles  |Determine Major/Minor (-doMajorMinor)]].&lt;br /&gt;
#[[Allele_Frequency_estimation| Maf estimator (-doMaf)]].&lt;br /&gt;
&lt;br /&gt;
If you have supplied genotype likelihood files as input for angsd, you can skip step 1.&lt;br /&gt;
&lt;br /&gt;
=Score statistic=&lt;br /&gt;
To perform the test in a generalized linear framework, posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach is published here: [[skotte2012]].&lt;br /&gt;
;-doAsso 2&lt;br /&gt;
&lt;br /&gt;
;-yBin [Filename]&lt;br /&gt;
A file containing the case/control status: 0 for controls, 1 for cases and -999 for missing phenotypes. &lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
Example of a case/control phenotype file&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
-999&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
;-yQuant [Filename]&lt;br /&gt;
A file containing the phenotype values, with -999 indicating missing phenotypes. The file should contain a single phenotype entry per line.&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
Example of quantitative phenotype file&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
-999&lt;br /&gt;
2.06164722761138&lt;br /&gt;
-0.091935218675602&lt;br /&gt;
-0.287527686061831&lt;br /&gt;
-999&lt;br /&gt;
-999&lt;br /&gt;
-1.20996664036026&lt;br /&gt;
0.0188541092307412&lt;br /&gt;
-2.1122713873334&lt;br /&gt;
-999&lt;br /&gt;
-1.32920529536579&lt;br /&gt;
-1.10582299663753&lt;br /&gt;
-0.391773417823766&lt;br /&gt;
-0.501400984567535&lt;br /&gt;
-999&lt;br /&gt;
1.06014677976046&lt;br /&gt;
-1.10582299663753&lt;br /&gt;
-999&lt;br /&gt;
0.223156127557052&lt;br /&gt;
-0.189660869820135&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
;-yCount [Filename]&lt;br /&gt;
A file containing the count phenotype data, for doing Poisson-based regression; -999 indicates missing phenotypes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;-cov [Filename]&lt;br /&gt;
A file containing additional covariates for the analysis. Each line should contain the additional covariates for a single individual. Thus the number of lines should match the number of individuals, and the number of columns should match the number of additional covariates.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
Example of covariate file&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
1 0 0 1 &lt;br /&gt;
1 0.1 0 0 &lt;br /&gt;
2 0 1 0 &lt;br /&gt;
2 0 1 0 &lt;br /&gt;
2 0.1 0 1 &lt;br /&gt;
1 0 0 1 &lt;br /&gt;
1 0.3 0 0 &lt;br /&gt;
2 0 0 0 &lt;br /&gt;
1 0 0 0 &lt;br /&gt;
2 0.2 0 1 &lt;br /&gt;
1 0 1 0 &lt;br /&gt;
1 0 0 0 &lt;br /&gt;
1 0.1 0 0 &lt;br /&gt;
1 0 0 0 &lt;br /&gt;
2 0 0 1 &lt;br /&gt;
2 0 0 0 &lt;br /&gt;
2 0 0 0 &lt;br /&gt;
1 0 0 1 &lt;br /&gt;
1 0.5 0 0 &lt;br /&gt;
2 0 0 0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
;-minHigh [int]&lt;br /&gt;
default = 10 &amp;lt;br&amp;gt;&lt;br /&gt;
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that do not have enough confidently called genotypes: at least two of the three genotype categories (homozygous major, heterozygous and homozygous minor) need at least [int] individuals with a genotype probability above 0.9. This filter avoids the scenario where all individuals have genotypes with the same probability, e.g. all are heterozygous with a high probability or all have 0.33333333 probability for all three genotypes. &lt;br /&gt;
;-minCount [int] &lt;br /&gt;
default = 10 &amp;lt;br&amp;gt;&lt;br /&gt;
The minimum expected number of minor alleles in the sample. This is the frequency multiplied by two times the number of individuals. Performing association on extremely low minor allele frequencies does not make sense.&lt;br /&gt;
;-model [int]&lt;br /&gt;
# Additive/Log-additive for Linear/Logistic Regression (Default).&lt;br /&gt;
# Dominant.&lt;br /&gt;
# Recessive.&lt;br /&gt;
;-fai [Filename]&lt;br /&gt;
A fasta index file (.fai). For human data on either hg19 or hg38, one can use the file test/hg19.fa.fai that is in the ANGSD repository and is therefore downloaded when cloning ANGSD from GitHub. Otherwise the .fai file can be obtained by indexing the reference genome or from the header of a BAM file.&lt;br /&gt;
;-sampleFile [Filename]&lt;br /&gt;
A .sample file containing phenotypes and covariates for the analysis. It is the Oxford sample information file (.sample) format described [https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/sample_file_formats.html here].&lt;br /&gt;
;-whichPhe [phe1,phe2,...]&lt;br /&gt;
Use this option to select which phenotypes to analyse; write the phenotypes comma-separated ('phe1,phe2,...'). Only works with a .sample file.&lt;br /&gt;
; -whichCov [cov1,cov2,...]&lt;br /&gt;
Use this option to select which covariates to include; write the covariates comma-separated ('cov1,cov2,...'). Only works with a .sample file.&lt;br /&gt;
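To make the two genotype-based filters concrete, here is a hypothetical sketch of the -minHigh and -minCount logic as described above (an illustration under stated assumptions, not ANGSD's actual implementation; the function names are made up):&lt;br /&gt;

```python
# Hypothetical sketch of the -minHigh / -minCount style filters described
# above (not ANGSD's actual implementation).

def passes_min_high(post_probs, min_high=10, confident=0.9):
    # post_probs: per-individual posterior triples (hom. major, het, hom. minor).
    # At least two of the three genotype categories must each have min_high
    # individuals whose best genotype has probability above `confident`.
    counts = [0, 0, 0]
    for triple in post_probs:
        best = max(range(3), key=lambda g: triple[g])
        if triple[best] > confident:
            counts[best] += 1
    return sum(1 for c in counts if c >= min_high) >= 2

def passes_min_count(maf, n_ind, min_count=10):
    # Expected number of minor alleles: frequency times 2N chromosomes.
    return maf * 2 * n_ind >= min_count

print(passes_min_count(0.05, 500))  # True: 0.05 * 1000 = 50 expected minor alleles
```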
&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
Create a large number of individuals (500) by recycling the example files, and simulate some case/control phenotypes using R:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rm large.filelist&lt;br /&gt;
for i in `seq 1 50`;do cat bam.filelist&amp;gt;&amp;gt;large.filelist;done&lt;br /&gt;
Rscript -e &amp;quot;write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)&amp;quot;&lt;br /&gt;
Rscript -e &amp;quot;write.table(cbind(rnorm(500)),'pheno.yquant',row=F,col=F)&amp;quot;&lt;br /&gt;
Rscript -e &amp;quot;set.seed(1);write.table(cbind(rbinom(500,1,0.5),rnorm(500)),'cov.file',row=F,col=F)&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For case/control data, for polymorphic sites (p-value &amp;lt; 1e-6):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1:&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For quantitative traits (normally distributed errors), for polymorphic sites (p-value &amp;lt; 1e-6), with additional covariates:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -yQuant pheno.yquant -doAsso 2 -cov cov.file -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1  -bam large.filelist -P 5  -r 1:&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Example with imputation (using BEAGLE)==&lt;br /&gt;
&lt;br /&gt;
First, the polymorphic sites to be analysed need to be selected (-doMaf 1 -SNP_pval -doMajorMinor) and the genotype likelihoods estimated (-GL 1) for use in [http://faculty.washington.edu/browning/beagle/beagle.html the Beagle software] (-doGlf 2).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -GL 1 -out input -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1  -bam large.filelist -P 5  -r 1: -doGlf 2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Perform the imputation &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
java -Xmx15000m -jar beagle.jar like=input.beagle.gz out=beagleOut&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The reference .fai can be obtained by indexing the reference genome or from a BAM file's header: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
samtools view -H  bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | grep SN |cut -f2,3 | sed 's/SN\://g' |  sed 's/LN\://g' &amp;gt; ref.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The association can then be performed on the genotype probabilities using the score statistic:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 2 &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Dependency Chain==&lt;br /&gt;
The method is based on genotype probabilities. If alignment data has been supplied you need to specify the following.&lt;br /&gt;
&lt;br /&gt;
# [[Genotype_likelihoods_from_alignments | Genotype likelihood model (-GL)]].&lt;br /&gt;
#[[Inferring_Major_and_Minor_alleles  |Determine Major/Minor (-doMajorMinor)]].&lt;br /&gt;
#[[Allele_Frequency_estimation| Maf estimator (-doMaf)]].&lt;br /&gt;
#[[Genotype_calling| Calculate posterior genotype probability (-doPost)]]. If you use the score statistic (-doAsso 2), calculate the posterior using the allele frequency as prior (-doPost 1). &lt;br /&gt;
&lt;br /&gt;
If you have supplied genotype likelihoods to angsd, then you can skip step 1.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have supplied genotype probabilities (in Beagle output format), there are no dependencies.&lt;br /&gt;
&lt;br /&gt;
=Latent genotype model (EM algorithm)=&lt;br /&gt;
To perform the test in a generalized linear framework, posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach employs an EM algorithm in which the genotype is introduced as a latent variable; the likelihood is then maximised using weighted least squares regression, similar to the approach in asaMap.&lt;br /&gt;
;-doAsso 4&lt;br /&gt;
&lt;br /&gt;
Otherwise this works exactly like the score test; only the -doAsso flag has to be changed.&lt;br /&gt;
This method has the advantage that effect sizes are estimated and reported.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Example with genotype probabilities==&lt;br /&gt;
&lt;br /&gt;
It can be run thus, with a binary phenotype (can also be used for a quantitative phenotype with the -yQuant flag):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 4 &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Hybrid model (Score Test + EM algorithm)=&lt;br /&gt;
To perform the test in a generalized linear framework, posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach applies the score test first; if the resulting p-value is below a certain threshold, the latent genotype model is also applied, thereby obtaining the effect size. The idea behind this is that the score test is faster, as it does not require the EM algorithm, whereas the EM algorithm provides an effect size.&lt;br /&gt;
;-doAsso 5&lt;br /&gt;
;-hybridThres 0.05        (p-value threshold for when to perform latent genotype)&lt;br /&gt;
&lt;br /&gt;
Otherwise this works exactly like the score test and the latent genotype model; only the -doAsso flag has to be changed.&lt;br /&gt;
This method has the advantage that effect sizes are estimated and reported.&lt;br /&gt;
&lt;br /&gt;
==Example with genotype probabilities==&lt;br /&gt;
&lt;br /&gt;
It can be run thus, with a binary phenotype (can also be used for a quantitative phenotype with the -yQuant flag):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Dosage model=&lt;br /&gt;
To perform the test in a generalized linear framework, posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach calculates the dosage, i.e. the expected genotype, from the genotype probabilities using the following formula:&lt;br /&gt;
&lt;br /&gt;
E[G|X] = p(G=1|X) + 2*p(G=2|X)&lt;br /&gt;
&lt;br /&gt;
A standard linear/logistic regression is then run with the dosages as the tested variable. This approach is almost as fast as the score test, and effect sizes are also estimated.&lt;br /&gt;
&lt;br /&gt;
;-doAsso 6&lt;br /&gt;
&lt;br /&gt;
Otherwise this works exactly like the score test and the latent genotype model; only the -doAsso flag has to be changed.&lt;br /&gt;
This method has the advantage that effect sizes are estimated and reported.&lt;br /&gt;
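The dosage computation and the subsequent regression can be sketched in a few lines of Python (an illustration with made-up genotype probabilities and phenotypes, not ANGSD's actual implementation):&lt;br /&gt;

```python
# Sketch of the dosage model: E[G|X] = p(G=1|X) + 2*p(G=2|X),
# followed by a closed-form ordinary least-squares fit of phenotype on dosage.
# All numbers below are made up for illustration.

def dosage(p0, p1, p2):
    """Expected genotype given posterior probabilities for G = 0, 1, 2."""
    return p1 + 2.0 * p2

# One triple of genotype probabilities per individual (hypothetical values).
gprobs = [(0.9, 0.1, 0.0), (0.1, 0.8, 0.1), (0.0, 0.2, 0.8), (0.7, 0.3, 0.0)]
pheno = [0.1, 0.9, 2.1, 0.4]

x = [dosage(*p) for p in gprobs]  # dosages: 0.1, 1.0, 1.8, 0.3
n = len(x)
mx, my = sum(x) / n, sum(pheno) / n
# OLS slope = estimated effect size (beta); the intercept follows from the means.
beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, pheno)) / sum(
    (xi - mx) ** 2 for xi in x)
alpha = my - beta * mx
print(round(beta, 3))
```

The least-squares fit above corresponds to the quantitative (-yQuant) case; for -yBin a logistic regression on the dosages would be used instead.&lt;br /&gt;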
&lt;br /&gt;
=Input File Formats=&lt;br /&gt;
All -doAsso methods can now be run with genotype probabilities stored in either a BEAGLE file [https://faculty.washington.edu/browning/beagle/beagle_3.3.2_31Oct11.pdf], a BGEN file [https://www.well.ox.ac.uk/~gav/bgen_format/] or a BCF/VCF file [https://samtools.github.io/hts-specs/VCFv4.2.pdf].&lt;br /&gt;
==BEAGLE files==&lt;br /&gt;
The BEAGLE files can be run with a binary, a count or a quantitative phenotype, and also with covariates. An example of a command with a binary phenotype:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle test/assotest/test.beagle -yBin test/assotest/testBin.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And with a count phenotype (using Poisson regression):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle test/assotest/test.beagle -yCount test/assotest/testCount.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And with a quantitative phenotype (using normal linear regression):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle test/assotest/test.beagle -yQuant test/assotest/testQuant.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
==BGEN files==&lt;br /&gt;
The BGEN files can be run with a binary, a count or a quantitative phenotype, and also with covariates. Both ZLIB and ZSTD compression are supported; however, to enable ZSTD one needs to add a flag when compiling (and the ZSTD library must be installed): &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
make HTSSRC=../htslib/ WITH_ZSTD=1 &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The implementation follows v1.3 of the BGEN file format, and only the recommended layout 2 is supported. It can also be run with a .sample file [https://www.well.ox.ac.uk/~gav/qctool/documentation/sample_file_formats.html] containing both the phenotype and covariates. An example of a command with a binary phenotype:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -bgen test/assotest/test.bgen -sampleFile test/assotest/test.sample -doAsso 4 -out test.res -fai test/hg19.fa.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The column-type line in the .sample file decides whether a logistic or a linear regression model is run, depending on whether the phenotype is binary or quantitative.&lt;br /&gt;
&lt;br /&gt;
==BCF/VCF files==&lt;br /&gt;
The BCF/VCF files can be run with genotype probabilities, specified with the &amp;quot;GP&amp;quot; genotype field, as specified in the VCF manual. It can be run with a binary, a count or a quantitative phenotype and also with covariates. This file format does not need a .fai file. An example of a command with a binary phenotype:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yBin test/assotest/testBin.phe -doAsso 4 -cov test.cov -out test.res&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And with a count phenotype (using Poisson regression):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yCount test/assotest/testCount.phe -doAsso 4 -cov test.cov -out test.res&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And with a quantitative phenotype (using normal linear regression):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yQuant test/assotest/testQuant.phe -doAsso 4 -cov test.cov -out test.res&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
==Output format==&lt;br /&gt;
The output from the association analysis is a set of files whose names start with '''prefix.lrt'''. These are tab-separated plain text files; which of the columns below are present depends on the chosen -doAsso method. &lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Chromosome&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Position&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Major&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Minor&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Frequency&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| N*&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| LRT (or P)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| beta^&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| SE^&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| highHe*&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| highHo*&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| emIter~&lt;br /&gt;
|}&lt;br /&gt;
'''*''' Indicates that these columns are only used for the score test, latent genotype model, hybrid model and dosage model.&lt;br /&gt;
'''^''' Indicates that these columns are only used for the latent genotype model, hybrid model and dosage model.&lt;br /&gt;
'''~''' Indicates that these columns are only used for the latent genotype model and hybrid model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Field&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Description&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Chromosome&lt;br /&gt;
|  Chromosome.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Position&lt;br /&gt;
| Physical Position.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Major&lt;br /&gt;
| The Major allele as determined by [[MajorMinor |-doMajorMinor]]. If posterior genotype files have been supplied as input, this column is not defined.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Minor&lt;br /&gt;
| The Minor allele as determined by [[MajorMinor |-doMajorMinor]]. If posterior genotype files have been supplied as input, this column is not defined.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Frequency&lt;br /&gt;
| The Minor allele frequency as determined by [[Maf|-doMaf]].&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| N*&lt;br /&gt;
| Number of individuals. That is the number of samples that have both sequencing data and phenotypic data.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| LRT (or P)&lt;br /&gt;
| The likelihood ratio statistic. This statistic is chi-square distributed with one degree of freedom. Sites that fail one of the filters are given the value -999.000000. The P-value can be obtained instead (by using -Pvalue 1); in that case this column is named &amp;quot;P&amp;quot;.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| beta&lt;br /&gt;
| The estimated effect size. Sites that fail one of the filters are given the value nan.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| SE&lt;br /&gt;
| The estimated standard error. Sites that fail one of the filters are given the value nan.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| high_WT/HE/HO*&lt;br /&gt;
| Number of individuals with a WT/HE/HO genotype posterior probability above 0.9. WT=major/major, HE=major/minor, HO=minor/minor.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| emIter~&lt;br /&gt;
| Number of iterations of EM algorithm for maximising likelihood.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example without effect sizes (beta):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Chromosome	Position	Major	Minor	Frequency	N	LRT	high_WT/HE/HO&lt;br /&gt;
1	14000023	C	A	0.052976	330	2.863582	250/10/0&lt;br /&gt;
1	14000072	G	T	0.020555	330	1.864555	320/10/0&lt;br /&gt;
1	14000113	A	G	0.019543	330	0.074985	320/10/0&lt;br /&gt;
1	14000202	G	A	0.270106	330	0.181530	50/90/0&lt;br /&gt;
1	14000375	T	C	0.020471	330	1.845881	320/10/0&lt;br /&gt;
1	14000851	T	C	0.016849	330	0.694058	320/10/0&lt;br /&gt;
1	14000873	G	A	0.305990	330	0.684507	140/60/10&lt;br /&gt;
1	14001008	T	C	0.018434	330	0.031631	320/10/0&lt;br /&gt;
1	14001018	T	C	0.296051	330	0.761196	110/40/10&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;!--=Citations=&lt;br /&gt;
For '''-doAsso 1''' and '''-doAsso 3'''&lt;br /&gt;
{{:Skotte2012}}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example with genotype probabilities==&lt;br /&gt;
&lt;br /&gt;
It can be run thus, with a binary phenotype (can also be used for a quantitative phenotype with the -yQuant flag):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 6 &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Printing mafs files==&lt;br /&gt;
&lt;br /&gt;
By default, -doAsso does not print the mafs file. To print the mafs file, use -forceMaf 1.&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
	-forceMaf	0	(Write .mafs file when running -doAsso (by default does not output .mafs file with -doAsso))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Problems with inflation of p-values=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can evaluate the behavior of the tests by making a QQ plot of the LRT or P-values. There are several reasons why it might show signs of inflation:&lt;br /&gt;
; -doPost (when using -doAsso 2, 4, 5 or 6 without posterior input such as -beagle)&lt;br /&gt;
If you estimate the posterior genotype probability using a uniform prior (-doPost 2), small differences in depth between samples will inflate the test statistic (see [[Skotte2012]]). Use the allele frequency as a prior (-doPost 1) instead. &lt;br /&gt;
; -minCount/-minHigh&lt;br /&gt;
If you set this too low it will result in inflation of the test statistics.&lt;br /&gt;
; -yQuant (when using -doAsso 2, 4, 5 or 6 with a quantitative trait)&lt;br /&gt;
If your trait is not continuous, or its distribution is skewed or has outliers, you will get inflated p-values. The same rules apply as for a standard regression. Consider transforming your trait towards a normal distribution.&lt;br /&gt;
; Population structure&lt;br /&gt;
If you have population structure you will have to adjust for it in the regression model (-doAsso 2, 4, 5 or 6). Consider using NGSadmix or PCAngsd and use the results as covariates. Note that the model will still have some issues because it uses the allele frequency as a prior. For the adventurous: you can use PCAngsd or NGSadmix to estimate individual allele frequencies and calculate your own genotype probabilities that take structure into account. These can then be used in angsd via the -beagle input format.&lt;br /&gt;
; low N&lt;br /&gt;
Usually a GWAS is performed on thousands of samples, and we have only tested the use of the score statistic on hundreds of samples. If you have a low number of samples, try to figure out what minor allele frequency you would need in order to have some power. Also be careful with reducing -minCount/-minHigh.&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Association&amp;diff=3176</id>
		<title>Association</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Association&amp;diff=3176"/>
		<updated>2023-05-22T12:48:51Z</updated>

		<summary type="html">&lt;p&gt;Isin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Association can be performed using two approaches.&lt;br /&gt;
# Based on testing differences in allele frequencies between cases and controls, using genotype likelihoods&lt;br /&gt;
# Based on a generalized linear framework, which allows for both binary and quantitative traits and for including additional covariates, using genotype posteriors. &lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites; in the case of alignment data (BAM), we advise that users filter away low mapping quality reads and low quality-score bases.&lt;br /&gt;
&lt;br /&gt;
The filtering of the alignment data is described in [[Input]], and filtering based on frequencies/polymorphic sites are described [[Filters#Allele_frequencies| here]].&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
This can be done easily at the command line by adding the below commands&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
-minQ 20 -minMapQ 30 -SNP_pval 1e-6 #Use polymorphic sites with a p-value of 10^-6&lt;br /&gt;
-minQ 20 -minMapQ 30 -minMaf 0.05 #Use sites with a MAF &amp;gt;0.05&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doAsso&lt;br /&gt;
abcAsso.cpp:&lt;br /&gt;
        -doAsso 0&lt;br /&gt;
        1: Frequency Test (Known Major and Minor)&lt;br /&gt;
        2: Score Test&lt;br /&gt;
        4: Latent genotype model&lt;br /&gt;
        5: Score Test with latent genotype model - hybrid test&lt;br /&gt;
        6: Dosage regression&lt;br /&gt;
        7: Latent genotype model (wald test) - NOT PROPERLY TESTED YET!&lt;br /&gt;
  Frequency Test Options:&lt;br /&gt;
        -yBin           (null)  (File containing disease status)&lt;br /&gt;
&lt;br /&gt;
  Score, Latent, Hybrid and Dosage Test Options:&lt;br /&gt;
        -yBin           (null)  (File containing disease status)&lt;br /&gt;
        -yCount         (null)  (File containing count phenotypes)&lt;br /&gt;
        -yQuant         (null)  (File containing phenotypes)&lt;br /&gt;
        -cov            (null)  (File containing additional covariates)&lt;br /&gt;
        -sampleFile             (null)  (.sample File containing phenotypes and covariates)&lt;br /&gt;
        -whichPhe       (null)  Select which phenotypes to analyse, write phenos comma seperated ('phe1,phe2,...'), only works with a .sample file&lt;br /&gt;
        -whichCov       (null)  Select which covariates to include, write covs comma seperated ('cov1,cov2,...'), only works with a .sample file&lt;br /&gt;
        -model  1&lt;br /&gt;
        1: Additive/Log-Additive (Default)&lt;br /&gt;
        2: Dominant&lt;br /&gt;
        3: Recessive&lt;br /&gt;
&lt;br /&gt;
        -minHigh        10      (Require atleast minHigh number of high credible genotypes)&lt;br /&gt;
        -minCount       10      (Require this number of minor alleles, estimated from MAF)&lt;br /&gt;
        -assoThres      0.000001        Threshold for logistic regression&lt;br /&gt;
        -assoIter       100     Number of iterations for logistic regression&lt;br /&gt;
        -emThres        0.000100        Threshold for convergence of EM algorithm in doAsso 4 and 5&lt;br /&gt;
        -emIter 40      Number of max iterations for EM algorithm in doAsso 4 and 5&lt;br /&gt;
&lt;br /&gt;
        -doPriming      1       Prime EM algorithm with dosage derived coefficients (0: no, 1: yes - default)&lt;br /&gt;
&lt;br /&gt;
        -Pvalue 0       Prints a P-value instead of a likelihood ratio (0: no - default, 1: yes)&lt;br /&gt;
&lt;br /&gt;
  Hybrid Test Options:&lt;br /&gt;
        -hybridThres            0.050000        (p-value value threshold for when to perform latent genotype model)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Case control association using allele frequencies=&lt;br /&gt;
To test for differences in the allele frequencies, genotype likelihoods need to be provided or [[Genotype_likelihoods_from_alignments | estimated]]. The test is an implementation of the likelihood ratio test for differences between cases and controls described in detail in [[Kim2011]].&lt;br /&gt;
&lt;br /&gt;
;-doAsso [int] &lt;br /&gt;
'''1''': The test is performed assuming the minor allele is known. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;-yBin [Filename]&lt;br /&gt;
A file containing the case-control status: 0 for controls, 1 for cases and -999 for missing phenotypes. The file should contain a single phenotype entry per line.&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
Example of case-control phenotype file&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
-999&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
&lt;br /&gt;
Create a large number of individuals by recycling the example files (500 individuals) and simulate some phenotypes (case/control) using R:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in `seq 1 50`;do cat bam.filelist&amp;gt;&amp;gt;large.filelist;done&lt;br /&gt;
Rscript -e &amp;quot;write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -yBin pheno.ybin -doAsso 1 -GL 1 -out out -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -bam large.filelist -r 1: -P 5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that because you are reading 500 BAM files this takes a little while.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
gunzip -c out.lrt0.gz | head&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
Chromosome	Position	Major	Minor	Frequency	LRT&lt;br /&gt;
1	14000003	G	A	0.057070	0.016684&lt;br /&gt;
1	14000013	G	A	0.067886	0.029014&lt;br /&gt;
1	14000019	G	T	0.052904	0.569061&lt;br /&gt;
1	14000023	C	A	0.073336	0.184060&lt;br /&gt;
1	14000053	T	C	0.038903	0.604695&lt;br /&gt;
1	14000170	C	T	0.050756	0.481033&lt;br /&gt;
1	14000176	G	A	0.053157	0.424910&lt;br /&gt;
1	14000200	C	A	0.085332	0.485030&lt;br /&gt;
1	14000202	G	A	0.257132	0.025047&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The LRT is the likelihood ratio statistic, which is chi-square distributed with one degree of freedom. The P-value can be obtained instead (by using -Pvalue 1). -Pvalue is accurate up to chi-square values of 70, which is equivalent to P-values of 1.1102e-16.&lt;br /&gt;
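The conversion from the LRT to a P-value can be reproduced in plain Python (not part of ANGSD): for one degree of freedom the chi-square survival function is erfc(sqrt(x/2)), and the 1.1102e-16 limit is simply double-precision machine epsilon:&lt;br /&gt;

```python
import math

def lrt_to_pvalue(lrt):
    """P-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(lrt / 2.0))

p05 = lrt_to_pvalue(3.841)   # the classic 5% critical value for 1 df
tiny = lrt_to_pvalue(70.0)   # far below 1e-15

# Computing the P-value as 1 - CDF instead saturates at machine epsilon
# (2**-53 = 1.1102e-16), which is where the stated limit of 70 comes from.
eps = 1.0 - (1.0 - 2**-53)
```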
&lt;br /&gt;
==Dependency Chain==&lt;br /&gt;
The method is based on estimating frequencies from genotype likelihoods. If alignment data has been supplied you need to specify the following.&lt;br /&gt;
&lt;br /&gt;
# [[Genotype_likelihoods_from_alignments | Genotype likelihood model (-GL)]].&lt;br /&gt;
#[[Inferring_Major_and_Minor_alleles  |Determine Major/Minor (-doMajorMinor)]].&lt;br /&gt;
#[[Allele_Frequency_estimation| Maf estimator (-doMaf)]].&lt;br /&gt;
&lt;br /&gt;
If you have supplied genotype likelihood files as input for angsd you can skip 1.&lt;br /&gt;
&lt;br /&gt;
=Score statistic=&lt;br /&gt;
To perform the test in a generalized linear framework posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach is published here [[skotte2012]].&lt;br /&gt;
;-doAsso 2&lt;br /&gt;
&lt;br /&gt;
;-yBin [Filename]&lt;br /&gt;
A file containing the case-control status: 0 for controls, 1 for cases and -999 for missing phenotypes. &lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
Example of case-control phenotype file&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
-999&lt;br /&gt;
1&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
0&lt;br /&gt;
1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
;-yQuant [Filename]&lt;br /&gt;
File containing the phenotype values; -999 denotes missing phenotypes. The file should contain a single phenotype entry per line.&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
Example of quantitative phenotype file&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
-999&lt;br /&gt;
2.06164722761138&lt;br /&gt;
-0.091935218675602&lt;br /&gt;
-0.287527686061831&lt;br /&gt;
-999&lt;br /&gt;
-999&lt;br /&gt;
-1.20996664036026&lt;br /&gt;
0.0188541092307412&lt;br /&gt;
-2.1122713873334&lt;br /&gt;
-999&lt;br /&gt;
-1.32920529536579&lt;br /&gt;
-1.10582299663753&lt;br /&gt;
-0.391773417823766&lt;br /&gt;
-0.501400984567535&lt;br /&gt;
-999&lt;br /&gt;
1.06014677976046&lt;br /&gt;
-1.10582299663753&lt;br /&gt;
-999&lt;br /&gt;
0.223156127557052&lt;br /&gt;
-0.189660869820135&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
;-yCount [Filename]&lt;br /&gt;
A file containing the count phenotype data, for doing Poisson-based regression; -999 denotes missing phenotypes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
;-cov [Filename]&lt;br /&gt;
A file containing additional covariates for the analysis. Each line should contain the covariates for a single individual; thus the number of lines should match the number of individuals, and the number of columns should match the number of additional covariates.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
Example of covariate file&lt;br /&gt;
&amp;lt;pre class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
1 0 0 1 &lt;br /&gt;
1 0.1 0 0 &lt;br /&gt;
2 0 1 0 &lt;br /&gt;
2 0 1 0 &lt;br /&gt;
2 0.1 0 1 &lt;br /&gt;
1 0 0 1 &lt;br /&gt;
1 0.3 0 0 &lt;br /&gt;
2 0 0 0 &lt;br /&gt;
1 0 0 0 &lt;br /&gt;
2 0.2 0 1 &lt;br /&gt;
1 0 1 0 &lt;br /&gt;
1 0 0 0 &lt;br /&gt;
1 0.1 0 0 &lt;br /&gt;
1 0 0 0 &lt;br /&gt;
2 0 0 1 &lt;br /&gt;
2 0 0 0 &lt;br /&gt;
2 0 0 0 &lt;br /&gt;
1 0 0 1 &lt;br /&gt;
1 0.5 0 0 &lt;br /&gt;
2 0 0 0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
;-minHigh [int]&lt;br /&gt;
default = 10 &amp;lt;br&amp;gt;&lt;br /&gt;
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that do not have at least [int] homozygous major, heterozygous and homozygous minor genotypes: at least two of the three genotype categories need at least [int] individuals with a genotype probability above 0.9. This filter avoids the scenario where all individuals have genotypes with the same probability, e.g. all are heterozygous with a high probability, or all have probability 0.33333333 for each of the three genotypes. &lt;br /&gt;
;-minCount [int] &lt;br /&gt;
default = 10 &amp;lt;br&amp;gt;&lt;br /&gt;
The minimum number of minor alleles expected in the sample. This is the frequency multiplied by two times the number of individuals. Performing association on extremely low minor allele frequencies does not make sense.&lt;br /&gt;
;-model [int]&lt;br /&gt;
# Additive/Log-additive for Linear/Logistic Regression (Default).&lt;br /&gt;
# Dominant.&lt;br /&gt;
# Recessive.&lt;br /&gt;
;-fai [Filename]&lt;br /&gt;
A fasta index file (.fai). For human data on either hg19 or hg38 one can simply use the file test/hg19.fa.fai that is in the ANGSD repository (and is therefore downloaded when cloning ANGSD from GitHub). Otherwise the .fai file can be obtained by indexing the reference genome or by using the header of a BAM file.&lt;br /&gt;
;-sampleFile [Filename]&lt;br /&gt;
A .sample File containing phenotypes and covariates for doing the analysis. It is the Oxford sample information file (.sample) format described [https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/sample_file_formats.html here].&lt;br /&gt;
;-whichPhe [phe1,phe2,...]&lt;br /&gt;
Use this option to select which phenotypes to analyse; write the phenotypes comma-separated ('phe1,phe2,...'). Only works with a .sample file.&lt;br /&gt;
; -whichCov [cov1,cov2,...]&lt;br /&gt;
Use this option to select which covariates to include; write the covariates comma-separated ('cov1,cov2,...'). Only works with a .sample file.&lt;br /&gt;
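The -minHigh and -minCount filters described above can be sketched as follows (an illustration with hypothetical numbers, not ANGSD's actual code):&lt;br /&gt;

```python
# Sketch of the -minHigh / -minCount site filters (illustrative only).

def passes_min_high(gprobs, min_high=10):
    """At least two of the three genotype categories must have min_high
    individuals with a posterior genotype probability above 0.9."""
    high = [0, 0, 0]  # confident major/major, major/minor, minor/minor counts
    for p in gprobs:
        for g in range(3):
            if p[g] > 0.9:
                high[g] += 1
    return sum(c >= min_high for c in high) >= 2

def passes_min_count(maf, n_ind, min_count=10):
    """Expected number of minor alleles, 2*N*MAF, must reach min_count."""
    return 2 * n_ind * maf >= min_count

# With 330 individuals, MAF 0.02 gives 2*330*0.02 = 13.2 expected minor
# alleles and passes the default of 10; MAF 0.01 gives 6.6 and fails.
print(passes_min_count(0.02, 330), passes_min_count(0.01, 330))
```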
&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
Create a large number of individuals by recycling the example files (500 individuals) and simulate some phenotypes (case/control) using R:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rm large.filelist&lt;br /&gt;
for i in `seq 1 50`;do cat bam.filelist&amp;gt;&amp;gt;large.filelist;done&lt;br /&gt;
Rscript -e &amp;quot;write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)&amp;quot;&lt;br /&gt;
Rscript -e &amp;quot;write.table(cbind(rnorm(500)),'pheno.yquant',row=F,col=F)&amp;quot;&lt;br /&gt;
Rscript -e &amp;quot;set.seed(1);write.table(cbind(rbinom(500,1,0.5),rnorm(500)),'cov.file',row=F,col=F)&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For case-control data, restricted to polymorphic sites (p-value &amp;lt; 1e-6):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1:&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For quantitative traits (normally distributed errors), restricted to polymorphic sites (p-value &amp;lt; 1e-6), with additional covariates:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -yQuant pheno.yquant -doAsso 2 -cov cov.file -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1  -bam large.filelist -P 5  -r 1:&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Example with imputation (using BEAGLE)==&lt;br /&gt;
&lt;br /&gt;
First, the polymorphic sites to be analysed need to be selected (-doMaf 1 -SNP_pval -doMajorMinor) and the genotype likelihoods estimated (-GL 1) for use in [http://faculty.washington.edu/browning/beagle/beagle.html the Beagle software] (-doGlf 2).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -GL 1 -out input -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1  -bam large.filelist -P 5  -r 1: -doGlf 2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Perform the imputation &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
java -Xmx15000m -jar beagle.jar like=input.beagle.gz out=beagleOut&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The reference .fai file can be obtained by indexing the reference genome or by extracting the sequence names and lengths from a BAM file's header:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
samtools view -H  bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | grep SN |cut -f2,3 | sed 's/SN\://g' |  sed 's/LN\://g' &amp;gt; ref.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The association can then be performed on the genotype probabilities using the score statistic:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 2 &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Dependency Chain==&lt;br /&gt;
The method is based on genotype probabilities. If alignment data has been supplied you need to specify the following.&lt;br /&gt;
&lt;br /&gt;
# [[Genotype_likelihoods_from_alignments | Genotype likelihood model (-GL)]].&lt;br /&gt;
#[[Inferring_Major_and_Minor_alleles  |Determine Major/Minor (-doMajorMinor)]].&lt;br /&gt;
#[[Allele_Frequency_estimation| Maf estimator (-doMaf)]].&lt;br /&gt;
#[[Genotype_calling| Calculate posterior genotype probability (-doPost)]]. If you use the score statistic (-doAsso 2), calculate the posterior using the allele frequency as prior (-doPost 1). &lt;br /&gt;
&lt;br /&gt;
If you have supplied genotype likelihoods to angsd, you should skip step 1.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you have supplied genotype probabilities (in beagle output format), there are no dependencies.&lt;br /&gt;
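Combining these dependencies, a score-test association analysis run directly from alignment data could look like the following (a sketch only; the file names bam.filelist and pheno.ybin are placeholders):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -bam bam.filelist -out assoOut&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;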
&lt;br /&gt;
=Latent genotype model (EM algorithm)=&lt;br /&gt;
To perform the test in a generalized linear framework, posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach employs an EM algorithm in which the genotype is introduced as a latent variable; the likelihood is then maximised using weighted least squares regression, similar to the approach in asaMap.&lt;br /&gt;
;-doAsso 4&lt;br /&gt;
&lt;br /&gt;
Otherwise this works exactly like the score test; the only thing that has to be changed is the -doAsso flag.&lt;br /&gt;
This method has an advantage in that effect sizes are estimated and reported.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Example with genotype probabilities==&lt;br /&gt;
&lt;br /&gt;
It can be run as follows with a binary phenotype (a quantitative phenotype can be used instead with the -yQuant flag):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 4 &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Hybrid model (Score Test + EM algorithm)=&lt;br /&gt;
To perform the test in a generalized linear framework, posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach applies the score test first and then, if the p-value of the score test falls below a certain threshold, also applies the latent genotype model, thereby obtaining an effect size. The idea is that the score test is faster, as it does not require the EM algorithm, whereas the EM algorithm provides an effect size estimate.&lt;br /&gt;
;-doAsso 5&lt;br /&gt;
;-hybridThres 0.05        (p-value threshold below which the latent genotype model is also applied)&lt;br /&gt;
&lt;br /&gt;
Otherwise this works exactly like the score test and latent genotype model; the only thing that has to be changed is the -doAsso flag.&lt;br /&gt;
This method has an advantage in that effect sizes are estimated and reported.&lt;br /&gt;
&lt;br /&gt;
==Example with genotype probabilities==&lt;br /&gt;
&lt;br /&gt;
It can be run as follows with a binary phenotype (a quantitative phenotype can be used instead with the -yQuant flag):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 5&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Dosage model=&lt;br /&gt;
To perform the test in a generalized linear framework, posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach calculates the dosage, i.e. the expected genotype given the genotype probabilities, using the following formula:&lt;br /&gt;
&lt;br /&gt;
E[G|X] = p(G=1|X) + 2*p(G=2|X)&lt;br /&gt;
&lt;br /&gt;
A normal linear/logistic regression is then performed with the dosages as the tested variable. This approach is almost as fast as the score test, and effect sizes are also estimated.&lt;br /&gt;
&lt;br /&gt;
;-doAsso 6&lt;br /&gt;
&lt;br /&gt;
Otherwise this works exactly like the score test and latent genotype model; the only thing that has to be changed is the -doAsso flag.&lt;br /&gt;
This method has an advantage in that effect sizes are estimated and reported.&lt;br /&gt;
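The dosage calculation E[G|X] = p(G=1|X) + 2*p(G=2|X) can be sketched with a small awk one-liner (an illustration only, not part of angsd itself; it assumes the standard beagle gprobs layout of marker, allele1, allele2 followed by three genotype probabilities per individual, major/major first):&lt;br /&gt;

```shell
# Per-individual minor-allele dosages E[G] = P(G=1) + 2*P(G=2) from a
# beagle genotype-probability file. Columns 4,7,10,... hold P(G=0);
# the loop stops once i passes NF (i reaches NF + 1 exactly).
zcat beagleOut.impute.beagle.gz.gprobs.gz | sed 1d | awk '{
  printf "%s", $1
  for (i = 4; i != NF + 1; i += 3)
    printf "\t%.4f", $(i + 1) + 2 * $(i + 2)
  printf "\n"
}'
```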
&lt;br /&gt;
=Input File Formats=&lt;br /&gt;
All -doAsso methods can now be run with genotype probabilities stored in either a BEAGLE file [https://faculty.washington.edu/browning/beagle/beagle_3.3.2_31Oct11.pdf], a BGEN file [https://www.well.ox.ac.uk/~gav/bgen_format/] or a BCF/VCF file [https://samtools.github.io/hts-specs/VCFv4.2.pdf].&lt;br /&gt;
==BEAGLE files==&lt;br /&gt;
The BEAGLE files can be used with a binary, a count or a quantitative phenotype, and also with covariates. An example of a command with a binary phenotype:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle test/assotest/test.beagle -yBin test/assotest/testBin.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And with a count phenotype (using Poisson regression):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle test/assotest/test.beagle -yCount test/assotest/testCount.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And with a quantitative phenotype (using normal linear regression):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle test/assotest/test.beagle -yQuant test/assotest/testQuant.phe -doAsso 4 -cov test.cov -out test.res -fai test/hg19.fa.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
==BGEN files==&lt;br /&gt;
The BGEN files can be used with a binary, a count or a quantitative phenotype, and also with covariates. Both ZLIB and ZSTD compression are supported; however, to enable ZSTD support one needs to add a flag when compiling (and have the ZSTD library installed):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
make HTSSRC=../htslib/ WITH_ZSTD=1 &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The implementation follows v1.3 of the BGEN file format, and only the recommended layout 2 is supported. A .sample file [https://www.well.ox.ac.uk/~gav/qctool/documentation/sample_file_formats.html] containing both the phenotype and covariates can also be supplied. An example of a command with a binary phenotype:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -bgen test/assotest/test.bgen -sampleFile test/assotest/test.sample -doAsso 4 -out test.res -fai test/hg19.fa.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The column-type line in the .sample file decides whether a logistic (binary phenotype) or a normal linear (quantitative phenotype) regression model is run.&lt;br /&gt;
&lt;br /&gt;
==BCF/VCF files==&lt;br /&gt;
The BCF/VCF files can be used with genotype probabilities, given in the &amp;quot;GP&amp;quot; genotype field as specified in the VCF manual. They can be run with a binary, a count or a quantitative phenotype, and also with covariates. This file format does not need a .fai file. An example of a command with a binary phenotype:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yBin test/assotest/testBin.phe -doAsso 4 -cov test.cov -out test.res&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And with a count phenotype (using Poisson regression):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yCount test/assotest/testCount.phe -doAsso 4 -cov test.cov -out test.res&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And with a quantitative phenotype (using normal linear regression):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -vcf-gp test/assotest/test.vcf -yQuant test/assotest/testQuant.phe -doAsso 4 -cov test.cov -out test.res&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
==Output format==&lt;br /&gt;
The output from the association analysis is a list of files called '''prefix.lrt'''. These are tab separated plain text files; which columns are present depends on the chosen model (see the notes below the table).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Chromosome&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Position&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Major&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Minor&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Frequency&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| N*&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| LRT (or P)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| beta^&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| SE^&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| high_WT/HE/HO*&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| emIter~&lt;br /&gt;
|}&lt;br /&gt;
'''*''' Indicates that these columns are only used for the score test, latent genotype model, hybrid model and dosage model.&lt;br /&gt;
'''^''' Indicates that these columns are only used for the latent genotype model, hybrid model and dosage model.&lt;br /&gt;
'''~''' Indicates that these columns are only used for the latent genotype model and hybrid model.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Field&lt;br /&gt;
! scope=&amp;quot;col&amp;quot;| Description&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Chromosome&lt;br /&gt;
|  Chromosome.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Position&lt;br /&gt;
| Physical Position.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Major&lt;br /&gt;
| The Major allele as determined by [[MajorMinor |-doMajorMinor]]. If posterior genotype files have been supplied as input, this column is not defined.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Minor&lt;br /&gt;
| The Minor allele as determined by [[MajorMinor |-doMajorMinor]]. If posterior genotype files have been supplied as input, this column is not defined.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| Frequency&lt;br /&gt;
| The Minor allele frequency as determined by [[Maf|-doMaf]].&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| N*&lt;br /&gt;
| Number of individuals. That is the number of samples that have both sequencing data and phenotypic data.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| LRT (or P)&lt;br /&gt;
| The likelihood ratio statistic. This statistic is chi-square distributed with one degree of freedom. Sites that fail one of the filters are given the value -999.000000. The P-value can be obtained instead (using -Pvalue 1), in which case this column is named &amp;quot;P&amp;quot;.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| beta^&lt;br /&gt;
| The estimated effect size. Sites that fail one of the filters are given the value nan.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| SE^&lt;br /&gt;
| The estimated standard error. Sites that fail one of the filters are given the value nan.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| high_WT/HE/HO*&lt;br /&gt;
| Number of individuals with a WT/HE/HO genotype posterior probability above 0.9. WT=major/major, HE=major/minor, HO=minor/minor.&lt;br /&gt;
|- &lt;br /&gt;
! scope=&amp;quot;row&amp;quot;| emIter~&lt;br /&gt;
| Number of iterations of EM algorithm for maximising likelihood.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example without effect sizes (beta):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Chromosome	Position	Major	Minor	Frequency	N	LRT	high_WT/HE/HO&lt;br /&gt;
1	14000023	C	A	0.052976	330	2.863582	250/10/0&lt;br /&gt;
1	14000072	G	T	0.020555	330	1.864555	320/10/0&lt;br /&gt;
1	14000113	A	G	0.019543	330	0.074985	320/10/0&lt;br /&gt;
1	14000202	G	A	0.270106	330	0.181530	50/90/0&lt;br /&gt;
1	14000375	T	C	0.020471	330	1.845881	320/10/0&lt;br /&gt;
1	14000851	T	C	0.016849	330	0.694058	320/10/0&lt;br /&gt;
1	14000873	G	A	0.305990	330	0.684507	140/60/10&lt;br /&gt;
1	14001008	T	C	0.018434	330	0.031631	320/10/0&lt;br /&gt;
1	14001018	T	C	0.296051	330	0.761196	110/40/10&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
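Sites that fail one of the filters are reported with an LRT of -999 (see the output table above) and are typically removed before plotting. A minimal sketch, assuming the column layout of the example above with the LRT in column 7 (the file name assoOut.lrt is a placeholder); the header line and passing sites are printed to standard output:&lt;br /&gt;

```shell
# Keep the header line and all sites whose LRT (column 7) is not the
# sentinel value -999 assigned to sites that failed a filter.
awk 'NR == 1 || $7 != -999' assoOut.lrt
```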
&amp;lt;!--=Citations=&lt;br /&gt;
For '''-doAsso 1''' and '''-doAsso 3'''&lt;br /&gt;
{{:Skotte2012}}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example with genotype probabilities (dosage model)==&lt;br /&gt;
&lt;br /&gt;
The dosage model (-doAsso 6) can be run as follows with a binary phenotype (a quantitative phenotype can be used instead with the -yQuant flag):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai  -yBin pheno.ybin -doAsso 6 &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Problems with inflation of p-values=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can evaluate the behavior of the tests by making a QQ plot of the LRT or P-values. There are several reasons why it might show signs of inflation:&lt;br /&gt;
; -doPost (when using -doAsso 2, 4, 5 or 6 without posterior input such as -beagle)&lt;br /&gt;
If you estimate the posterior genotype probability using a uniform prior (-doPost 2), then small differences in depth between samples will inflate the test statistic (see [[Skotte2012]]). Use the allele frequency as a prior (-doPost 1) instead.&lt;br /&gt;
; -minCount/-minHigh&lt;br /&gt;
If you set these too low, it will result in inflation of the test statistic.&lt;br /&gt;
; -yQuant (when using -doAsso 2, 4, 5 or 6 with a quantitative trait)&lt;br /&gt;
If your trait is not continuous, or its distribution is skewed or has outliers, you will get inflated p-values. The same rules apply as for a standard regression. Consider transforming your trait to approximate a normal distribution.&lt;br /&gt;
; Population structure&lt;br /&gt;
If you have population structure, you will have to adjust for it in the regression model (-doAsso 2, 4, 5 or 6). Consider using NGSadmix or PCAngsd and use the results as covariates. Note that the model will still have some issues because it uses the allele frequency as a prior. For the adventurous: you can use PCAngsd or NGSadmix to estimate individual allele frequencies and calculate your own genotype probabilities that take structure into account; these can then be supplied to angsd using the -beagle input format.&lt;br /&gt;
; low N&lt;br /&gt;
Usually a GWAS is performed on thousands of samples, and we have only tested the score statistic on hundreds of samples. If you have a low number of samples, try to figure out what minor allele frequency you would need in order to have some power. Also be careful with reducing -minCount/-minHigh.&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=User:Isin&amp;diff=3168</id>
		<title>User:Isin</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=User:Isin&amp;diff=3168"/>
		<updated>2022-12-21T09:53:31Z</updated>

		<summary type="html">&lt;p&gt;Isin: test&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;test&lt;/div&gt;</summary>
		<author><name>Isin</name></author>
	</entry>
</feed>