ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

ANGSD

From angsd
Revision as of 10:33, 11 June 2012 by Albrecht (talk | contribs)
Jump to navigation Jump to search


About

overview

<classdiagram type="dir:LR"> [sequence data]->[genotype;likelihoods] [genotype;likelihoods]->[genotype;probabilities] [sequence files|bam files;SOAP files{bg:orange}]->[sequence data] [glf files|glfv3;soapSNP{bg:orange}]->[genotype;likelihoods] [genotype prob|beagle output{bg:orange}]->[genotype;probabilities] </classdiagram>

sequencing data

<classdiagram> // [input|bam files;SOAP files{bg:orange}]->[sequence data]

[sequence data]->[output|summary stats;phat estimates;error estimates{bg:blue}]
</classdiagram>

genotype likelihoods

<classdiagram> //[input data|glf files{bg:orange}]->[genotype;likelihoods] [genotype;likelihoods]->[output|glf files;beagle files;MAF estimates;MAF associations;SNP Calling;realSFS;error estimates;Inbreeding{bg:blue}]

</classdiagram>

Commands

gatk

angsd.g++ -outfiles samtools -GL 1 uppile -b ceu5.mapped.list soapsnp If no recalibration matrix exists these will be created first. angsd.g++ -outfiles soapsnp -GL 0 uppile -b ceu5.mapped.list gatk

angsd.g++ -outfiles gatk -GL 2 uppile -b ceu5.mapped.list


angsd.g++ -outfiles suyeon -doCounts 1 -qs 20 -doLike 1 uppile -b ceu5.mapped.list

genotype probabilities

<classdiagram> //[input data|beagle output{bg:orange}]->[genotype;probabilities] [genotype;probabilities]->[output|genotype calling;MAF estimates;associations{bg:blue}]

</classdiagram>

Download

http://popgen.dk/dirty_source

Installation

Misc folder contains:

1. simnextgen a program for simulating genotype likelihoods
2. getUnfolded, simnextgen only outputs the true folded frequency spectra. getUnfolded uses the "true" genotypes to estimate the unfolded spectra.
3. testfolded/optimSFS uses the .sfs output from -realSFS 1 to estimate the site frequency spectra polarized to the ancestral type.


samtools-0.1.17 contains:

The sourcecode for samtools http://samtools.sourceforge.net/, this used for reading mpileup from within dirty.

Compile everything like

cd misc/;make;cd ../samtools-0.1.17;make;cd ..;make

Overview

This program can now estimate MAF, do association, and in the future it will supersede realSFS, and will calculate various estimates used in population genetics.

All methods are based on genotype likelihoods

The currently accepted input formats are

1. glfv3 from SAMtools (this format is removed and deprecated in the newer version of SAMtools) inputtype=1

2. text glfv3 (same as above) inputtype=2

3. soap files (the genotype likelihoods are inferred) inputtype=0

4. flat genotype liklihood files tglf files inputtype=3

5. simulation files from simnextgen, (in misc folder)

6. bamfiles (using samtools as engine)

Methods

dumping counts

 -dumpCounts 1, print overall depth in the .pos file
 
 -dumpCounts 2, 1) plus .counts with overall sample depth, each line nInd long
 
 -dumpCounts 3, 1) plus .coutns with count of A,C,G,T, each line 4*nInd long

Genotype calling

output file .geno


-doGeno 1, print out major minor

-doGeno 2, print the called genotype under HWE (suyeon maf prior method)

-doGeno 4, print out the genotype with the best raw likelihood

-doGeno 8, print out the posteriour under the suyeon model

-doGeno 16, print all 3 posts (major,major),(major,minor),(minor,minor)

-doGeno 32, print the depth, (if availble), the number of reads used to determine the genotype llhs.

-doGeno 64, somewhat different dumps the binary posts for all samples, encoded as 3*nind double

The genotype are integers such that AA=0,AC=1,AG=2,AT=3,CC=4,CG=5,CT=6,GG=7,GT=8,TT=9

output is (-doGeno NOT 64) chr, pos, numberof samples times[ the above]


NB currently you also need to supply -doMaf to run this genotype calling

realSFS

From version 0.20 we can now estimate the joint likelihood of being in frequency j. From version 0.24 the tajima is being estimated

-realSFS 1 an sfs file will be generated.

-realSFS 2 the tajima will be printed to standard output

-realSFS 4 snpcalling (not implemented)

-realSFS 8 genotypecalling (not implemented)

Genotype likelihoods from alignments

Kim et al.

Citation Citation

Samtools

GATK

Inferring Major and Minor alleles

The inference method is chosen based on the data input.

From alignment data

The two most frequently observed bases across individuals are chosen as the major and minor allele.

Citation

From genotype likelihood data

We use a maximum likelihood approach to choose the major and minor alleles. Details can be found here.

Citation

From genotype probability data

Error estimation

Kim et al.

Citation

Using an outgroup

Allele Frequency estimation

-doMaf INT

INT=1 bfgs known minor

INT=2 EM known minor

INT=4 BFGS unknown minor

INT=8 EM unknown minor


Phat estimator

ML estimator with known minor

ML estimator with unknown minor

SNP Calling

Likelihood ratio test

supply -doSNP 1 Then the MAF estimate(s) given by -doMaf INT, will be used for a like ratio test

chromo  position        major   minor   knownEM pK-EM   nInd
21      9719770 T       A       0.000009        -0.000018       1
21      9719771 T       A       0.000001        -0.000003       1
21      9719772 C       A       0.000001        -0.000003       1
21      9719773 T       A       0.000002        -0.000010       2
21      9719774 G       A       0.000002        -0.000010       2
21      9719775 A       C       0.000004        -0.000022       2
21      9719776 G       A       0.000002        -0.000010       2
21      9719777 A       C       0.000002        -0.000013       2

Covariance matrix for PCA

Association

Association can be performed using two approaches. One based on testing differences in allele frequencies between cases and control while the other is based on a generalized linear framework which allowes for including additional covariates. Both methods takes the uncertaincy of the genotypes into account.

Case control association using allele frequencies

To test for differences in the allele frequency genotype likelihood need to be provided or estimated. If alignment files are your input then -doLike must be invoked.

-doAsso [int]

1: The test is performed assuming the minor allele is known
3: The test is performed summing over all possible minor alleles

-yBin [file]

a file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes. The file should contain a single phenotype entry per line. Example of cases control phenotype file


cite kim et al.

Score statistic

To perform the test in a generalized linear framework posterior genotype probabilities must be provided or estimated. If alignment files are your input then -doLike, -doMaf, -doPost must be invoked. If input files are genotype likelihoods then -doMaf, -doPost must be invoked. Beagle output files can be used directly.

-doAsso [int]

2: The test is based on a score statistic from a generalized linear framework

-yBin [file]

a file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes. The file should contain a single phenotype entry per line. Example of cases control phenotype file

-yQuant [file]

a file containing the phenotype values.-999 being missing phenotypes. The file should contain a single phenotype entry per line. Example of a quantitative phenotype file

-cov [file]

a files containing additional covariates in the analysis. Each lines should contain the additional covariates for a single individuals. Thus the number of lines should match the number of individuals and the number of coloums should match the number of additional covariates. Example of a covariance file

-minHigh [int]

default = 10
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that does not have at least [int] number of heterozygoes and homogoes genotype with at least 0.9 probability. This filter avoids the scenario where all individuals are heterozygoes with a high probability.

-minCount [int]

default = 10
The minimum expected minor alleles in the sample. This is the frequency multiplied by to times the number of individuals. Performing association on extremely low minor allele frequencies does not make sence.


cite skotte et al.

Heterozygosity

HWE and Inbreeding estimates

For method details see here

Input data

Output data

.mafs

chromo	position	major	minor	ref	knownEM	unknownEM	nInd
21      9719788 T       A       0.000001        -0.000012       3
21      9719789 G       A       0.000000        -0.000001       3
21      9719790 A       C       0.000000        -0.000004       3
21      9719791 G       A       0.000000        -0.000001       3
21      9719792 G       A       0.000000        -0.000002       3
21      9719793 G       T       0.498277        41.932766       3
21      9719794 T       A       0.000000        -0.000001       3
21      9719795 T       A       0.000000        -0.000001       3

This pretty explanatory, nInd is the number of individuals where we have "reliable" reads (see bugs section) Depending on -doMaf INT, and -ref FILENAME and -anc FILENAME, extra column will be input.

Association

Score statistic (prefix lrt*)

Chromosome Position Frequency N LRT
  • Chromosome

The Chromosome

  • Position

The physical Position

  • Frequency

The frequency estimate. The choice of estimation is determined by the *doMaf option.

  • N

The number of individuals with non-missing data. That is the individuals who have both some sequencing data for the given site and have phenotype data

  • LRT

The likelihood ratio statistic. This statistic is chi square distributed with one degree of freedom. Sites that fails one of the filters are given the value -999.000000

example:

 1	711153	0.012228		3200	-999.000000
 1	713682	0.047357		3200	0.133145
 1	713754	0.047357		3200	1.018738
 1	742429	0.096592		3200	0.174977
 1	743404	0.043796		3200	1.003485
 1	744055	0.097272		3200	2.334205
 1	751595	0.055826		3200	0.300824
 1	758311	0.054249		3200	1.242375
 1	765522	0.097715		3200	2.667515
 1	766409	0.345465		3200	0.162817


All options

Input sequencing data

Alignment data

The input data is given as a file with the full path to all the individual filenames.

-soap

Sorted soap alignment files.

-qs
-maxHits

Genotype likelihood data

-samglf [filename]

Samtools glf format (binary output). use the -g options in samtools to generate the files. This format is deprecated in newer versions of samtools.

-samglfclean [filename]

Samtools glf format (text output). use the -g options in samtools to generate binary files followed the use of the samtools glfview. This format is deprecated in newer versions of samtools.

-tglf [filename]

A simple format for genotype likelihoods For fetching a number of sites, use -nLines INT

-fai

Error rates

-errorFile

Association study

-doAsso
  • -adjust
  • -yBin
  • -yQuant
  • -cov
  • -assoCutoff
  • -sitePerm

Outfile name

  • -outfiles

Use subset of data

  • -start
  • -stop
  • -target
  • -strand
  • -minHigh
  • -minCount
  • -cutOff
  • -downsample
  • -minDepth
  • -minInd

The minimum number of individuals with genotype data to analyze and print results for.

  • -minMaf

The minimum estimated MAF to use in analysis and to print results for.

Frequency estimation

  • -doMaf
  • -doSNP

Run options

  • -chunkSize
  • -nLines
  • -nThreads

Estimate genotype likelihoods

  • -doGLF INT

INT=1 binary output

INT=2 text beagle output

INT>2 textoutput

Estimate the covariance matrix

  • -getCovar


  • -emIter



  • -doError


  • -eps
  • -nInd

Estimate SFS realSFS

-realSFS 1

Run Examples

Using glfv3

../dirty -samglf ceu.glf.list -outfiles test.glf -doMaf 2 -fai numSort.Fai -nLines 50000 -chunkSize 500 -nThreads 16

Using mpileup

./dirty.g++ -chunkSize 200000 -outfiles ASDF -doMaf 2 -nThreads 10 mpileup -g -r 21:1-20000000 -I ~/sample/*.chr21 >bcfoutput

First is programname. Followed by the arguments used for dirty Followed by mpileup and the arguments that will be bassed directly to SAMtools

From version 0.25, we can now get the nucleotide count for every site, for every sample. This is done by omitting the -g parameter

Using tglf files

cd into the teststuff subfolder

../dirty.g++ -nThreads 1 -tglf lct.list -posfile lct.pos -nLines 100000 -outfiles GG -doMaf 15 -doSNP 1

If we want to estimate the SFS

../dirty.g++ -nThreads 5 -tglf lct.list -posfile lct.pos -nLines 100000 -outfiles KK -realSFS 1

Using soapfiles

../dirty.g++ -soap tsk.sub10.list -doMaf 2 -outfiles NEW10 -chunkSize 1000 -nLines 10000 -nThreads 4

-soap is filelist containing the soapfiles, each soapfile must be sorted according the chromosomename (lexical ordering), and position.


Using simulated files

These are .glf.gz files generated from simnextgen in misc subfolder, NB REMEMBER TO SUPPLY -nInd argument since these can't be inferred from the binary file. ./dirty.g++ -sim1 misc/small.glf.gz -nInd 15 -outfiles results -doMaf 2

Example for getting the depths with bamfiles

./angsd.g++ -dumpCounts 1 -outfiles tspr3 mpileup -rchr11:130060569-130060569 -b /space/lucampBam/lucampHighBam.filelist

sfstools

Simple example for getting S, pi and tajima

1) no prior ./misc/sfstools.g++ -nChr 50 -sfsFile lctoutput.sfs -tajima tajimafile >moded

2) with prior ./misc/sfstools.g++ -nChr 50 -priorFile LCT.data/all.25.sfs.ml -sfsFile lctoutput.sfs -tajima tajimafile >moded


stdout is the normalized output from .sfs

Full example for simulating nextgen data and estimating SFS (somewhat deprecated should use sfstools now)

First we simulate a small dataset with very high probability of being variabe.

Number of individuals =15,number of sites=10000, pvar=0.9

./simnextgen.gcc -nind 15 -nsites 10000 -outfiles small -pvar 0.9 
->Using args: -nind 15 -errate 0.007500 -depth 5.000000 -pvar 0.900000 -nsites 10000 -F 0.000000 -model 1
->Dumping files: sequencefile: small.seq.gz	glffile: small.glf.gz	truefreq: small.frq args:small.args geno:small.geno

The .frq file contains the folded spectra, we use getUnfolded on the "true" genotypes, to get the unfolded spectra

./getUnfolded.g++ -infile small.geno -nChr 30
-> opening filehandle for: small.geno
-> opening filehandle for: small.geno.var
-> opening filehandle for: small.geno.unfolded

The second line in small.geno.unfolded now contains the unfolded spectra.

We now run dirty on the glf.gz file.

./dirty.g++ -sim1 misc/small.glf.gz -nInd 15 -outfiles results -realSFS 1
##screen output removed

And we now run the optimSFS program to get the spectra, we supply the filename, the number of chromosomes, and the number of threads we want to use

./optimSFS.gcc -binput ../results.sfs -ncat 30 -nThread 4
##screen output removed

The final output is in results.sfs.ml

We start R and plot our estimated SFS along with the true SFS

mytrue <- as.numeric(read.table("small.geno.unfolded",skip=1))
myest <-scan("../results.sfs.ml")
barplot(rbind(mytrue,myest),beside=T)

File:SfsExample.png

Version notes

  • 0.16 Is now bundled with SAMtools-0.1.17 and the mpileup (and friends) command be used for passing data to dirty.
  • 0.17 added extra options -minInd and -minMaf, for only printing and using sites above a threshold
  • 0.18 added option to pass reference and ancestral allele as fasta files.(using faidx format) (doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE)
  • 0.19 added support for tglf inputfiles, -tglf -posfile see runexamples, also added the likeratio test for snp calling
  • 0.20 Added the check for missing data, before the major/minor. included -realSFS, changed the deallocation of the -doMAF results, such that its proper cleaned up.
  • 0.21 refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that preventede -samglf and samglfclean from working)
  • 0.22 Well this update was a mixture of edits from user:albrecht and BGI so its difficult to give a concise description
  • .0.23 Program can now read simulated files (single pop only) An example can be seen in "full example ... sfs" and input types.
  • 0.24 added the tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso
  • 0.25 the depth is now being populated when using mpileup -g. The program can now get the counts from mpileup

ANGSD below

  • 0.01.a - 0.01.b The bfgs now supports threading, maybe anders implemented a heteorzygosity estimator.
  • 0.01.c A problem if we didn't observe any llh, caused the MAF estimator to 'nan'.
  • 0.02 Fixed small bug in bfgs optimization of sfs optimization. When choosing a region bigger than what was covered by the .sfs file the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
  • 0.03 added and documented genotypecaller, can dump counts,-realSFS 1 dumps positions, -realSFS 2 is deprecated,S,pi and tajima has been added to sfstools along with possibility to do prior

Possible Bugs

0.18 If we don't get any reliable genotypelikelihoods for a site, the site wont be included in the .mafs file. If this is want we want I can't say.

0.20 the major minor, doesn't work for the 3genotypes likelihood format (beagle)

0.21 When using soap files do we want to infer major/minor from generated likes or from counts?

0.24 Should we plugin the keepInd vector for the realSFS

angsd

0.02 modify the -doGeno to set -doMaf to avoid a segfault with pars->results->asso->freq

0.03 when dumping beagle we need to -doMaf 2 otherwise segfault

Wish list

1. start stop to work with soap input (Not gonna happen I think)