ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Genotype calling

From angsd
Revision as of 12:40, 2 March 2016 by Thorfinn (talk | contribs)

We do not recommend basing analyses on called genotypes; instead, incorporate the genotype uncertainty directly into the analysis you want to perform. However, we recognise that many methods still rely on called genotypes, and have therefore implemented a basic genotype caller in angsd.

The program can call genotypes either by choosing the genotype with the highest likelihood or by using the allele frequency as a prior (recommended; see Kim2011).


Brief Overview

./angsd -dogeno         -> Wed Mar  2 12:39:19 2016
-----------------
abcCallGenotypes.cpp:

-doGeno 0
        1: write major and minor
        2: write the called genotype encoded as -1,0,1,2, -1=not called
        4: write the called genotype directly: eg AA,AC etc 
        8: write the posterior probability of all possible genotypes
        16: write the posterior probability of called genotype
        32: write the posterior probabilities of the 3 gentypes as binary
        -> A combination of the above can be choosen by summing the values, EG write 0,1,2 types with majorminor as -doGeno 3
        -postCutoff=0.333333 (Only genotype to missing if below this threshold)
        -geno_minDepth=-1       (-1 indicates no cutof)
        -geno_maxDepth=-1       (-1 indicates no cutof)
        -geno_minMM=-1.000000   (minimum fraction af major-minor bases)
        -minInd=0       (only keep sites if you call genotypes from this number of individuals)

        NB When writing the posterior the -postCutoff is not used
        NB geno_minDepth requires -doCounts
        NB geno_maxDepth requires -doCounts


angsd can also use the full information of the sample allele frequencies for calling genotypes; see SFS Estimation.

Options

-doGeno [int]

1: print out major minor

2: print the called genotype as -1,0,1,2

4: print the called genotype as AA, AC, AG, ...

8: print the posterior probabilities of all 3 genotypes: (major,major), (major,minor), (minor,minor)

16: print the posterior of the called genotype

32: unlike the other options, dumps the posterior probabilities of the 3 genotypes for all samples in binary, encoded as 3*nind doubles

Use the sum of the above to get the output you want. For example, -doGeno 5 (1+4) prints the major and minor allele followed by the genotype (AA, AC, ...) for each individual.
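Because each -doGeno value is a distinct power of two, a summed value can be split back into its parts with a bitwise test. The sketch below is only an illustration of that arithmetic; the helper name decode_dogeno and the description strings are made up for this example and are not part of angsd.

```python
# Flag values as listed for -doGeno above; the descriptions are paraphrased.
DOGENO_FLAGS = {
    1: "major and minor",
    2: "called genotype as -1,0,1,2",
    4: "called genotype as AA, AC, ...",
    8: "posteriors of all 3 genotypes",
    16: "posterior of the called genotype",
    32: "binary posteriors (3*nind doubles)",
}


def decode_dogeno(value):
    """Return the outputs selected by a summed -doGeno value (hypothetical helper)."""
    return [desc for bit, desc in DOGENO_FLAGS.items() if value & bit]


# -doGeno 5 = 1 + 4
for description in decode_dogeno(5):
    print(description)
```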

-doPost [int]

1: estimate the posterior genotype probability based on the allele frequency as a prior

2: estimate the posterior genotype probability assuming a uniform prior
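To illustrate the difference between the two priors, here is a minimal sketch of how a posterior could be formed from three genotype likelihoods. It assumes the frequency-based prior follows Hardy-Weinberg proportions for a minor allele frequency f; that assumption, the function name genotype_posteriors, and the example numbers are ours and are not taken from the angsd source.

```python
def genotype_posteriors(likelihoods, minor_freq=None):
    """Posterior for (major,major), (major,minor), (minor,minor).

    likelihoods: three non-negative genotype likelihoods in that order.
    minor_freq:  None for a uniform prior, otherwise the minor allele
                 frequency used for an assumed Hardy-Weinberg prior.
    """
    if minor_freq is None:
        prior = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
    else:
        f = minor_freq
        prior = ((1.0 - f) ** 2, 2.0 * f * (1.0 - f), f ** 2)
    weighted = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("all weighted likelihoods are zero")
    return [w / total for w in weighted]


likes = (0.5, 0.4, 0.1)                              # made-up likelihoods
print(genotype_posteriors(likes))                    # uniform prior
print(genotype_posteriors(likes, minor_freq=0.1))    # frequency prior
```

With a uniform prior the posterior is just the normalised likelihoods, so the most probable genotype is the one with the highest likelihood. With a low minor allele frequency, the prior shifts weight towards the (major,major) genotype.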

-geno_minDepth [int]

set genotypes to missing if the individual depth is less than [int]

-geno_maxDepth [int]

set genotypes to missing if the individual depth is larger than [int]

-geno_minMM [float]

set genotypes to missing if less than a fraction [float] of the bases are the major or minor allele (likely a triallelic site)
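The three per-individual filters above can be pictured as a simple check on the base counts of one individual at one site. This is a sketch under assumptions: the function passes_genotype_filters, the dictionary layout of the counts, and the exact handling of boundary values are hypothetical and may differ from what angsd does internally.

```python
def passes_genotype_filters(base_counts, major, minor,
                            min_depth=-1, max_depth=-1, min_mm=-1.0):
    """Return False if the genotype would be set to missing.

    base_counts: dict such as {"A": 6, "C": 3} for one individual at one site.
    A value of -1 disables the corresponding filter, as in the defaults above.
    """
    depth = sum(base_counts.values())
    if min_depth != -1 and depth < min_depth:
        return False
    if max_depth != -1 and depth > max_depth:
        return False
    if min_mm >= 0 and depth > 0:
        # Fraction of observed bases that are the major or the minor allele.
        mm_bases = base_counts.get(major, 0) + base_counts.get(minor, 0)
        if mm_bases / depth < min_mm:
            return False
    return True


counts = {"A": 6, "C": 3, "G": 3}   # made-up counts, 9 of 12 bases are A or C
print(passes_genotype_filters(counts, "A", "C", min_mm=0.8))
```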

-postCutoff [float]

Only call a genotype if its posterior probability is above this threshold; otherwise the genotype is set to missing.

NB if the raw posterior dump is requested the -postCutoff is not used
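The effect of the cutoff can be sketched as follows: pick the genotype with the highest posterior and report it as missing (-1) if that posterior falls below the threshold. The function call_genotype is a hypothetical illustration; in particular, returning the position in the (major,major), (major,minor), (minor,minor) order as the 0,1,2 code, and the treatment of a posterior exactly equal to the cutoff, are assumptions.

```python
def call_genotype(posteriors, post_cutoff=1.0 / 3.0):
    """Return 0, 1 or 2 for the most probable genotype, or -1 if not called.

    posteriors: probabilities for (major,major), (major,minor), (minor,minor).
    """
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    if posteriors[best] < post_cutoff:
        return -1
    return best


print(call_genotype([0.97, 0.02, 0.01], post_cutoff=0.95))
print(call_genotype([0.85, 0.15, 0.00], post_cutoff=0.95))
```

A strict cutoff such as 0.95 leaves many genotypes uncalled at low depth, while the default of 0.333333 calls nearly everything, since the largest of three probabilities can never be below one third.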

Example

./angsd -bam bam.filelist -GL 1 -out outfile -doMaf 2 -doMajorMinor 1 -SNP_pval 0.000001 -doGeno 5 -doPost 1 -postCutoff 0.95

gives output like this:

1       14000202        G       A       GG      NN      NN      GA      NN      
1       14000873        G       A       GG      GG      GG      AA      GA      
1       14001018        T       C       NN      NN      NN      CC      NN      
1       14001867        A       G       NN      AA      AA      NN      NN      
1       14002342        C       T       CC      CC      CC      CC      CC      
1       14002422        A       T       AA      NN      NN      NN      NN      
1       14002474        T       C       TC      TT      TT      TT      TT      
1       14003581        C       T       CC      CC      NN      NN      CT      
1       14004623        T       C       TT      TT      TT      NN      TC      
1       14005069        A       G       AA      AA      AA      AA      AA
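Output of this kind is straightforward to post-process. The sketch below parses one line of the -doGeno 5 output shown above and computes the fraction of individuals with a called genotype. It assumes whitespace-separated columns (chromosome, position, major, minor, then one genotype per individual) and that NN marks a genotype that was not called; both are read off the example rather than taken from a format specification, and the function names are hypothetical.

```python
def parse_geno_line(line):
    """Split one -doGeno 5 output line into its parts (assumed column layout)."""
    fields = line.split()
    chrom, pos, major, minor = fields[0], int(fields[1]), fields[2], fields[3]
    genotypes = fields[4:]
    return chrom, pos, major, minor, genotypes


def call_rate(genotypes, missing="NN"):
    """Fraction of individuals whose genotype is not the missing code."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g != missing)
    return called / len(genotypes)


# First line of the example output above.
line = "1       14000202        G       A       GG      NN      NN      GA      NN"
chrom, pos, major, minor, genos = parse_geno_line(line)
print(chrom, pos, major, minor, genos, call_rate(genos))
```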