Revision as of 12:40, 6 December 2013

ANGSD currently has 2 different methods for estimating error rates.

A method that will estimate the joint typespecific errorrates for multiple samples. This method is using the minor allele frequency as prior, and therefore this method can not be used for single samples.

A method that interpret that uses and outgroup and an 'errorfree' individual.

The typespecific errorrates is an $4\times 4$ matrix, with the errorates of observing base A but in reality it was base C etc.

Error estimation from polymorphic sites

The method for estimating typespecific errors is described in Kim2011, and is based on the counts of the 4 different nucleotides. This method should be applied to the sites that are variable and the measure for variability is the simple MAF estimator that is described in Li2010.

options

-numSites [int]

default 10000. This is the number of sites we want to use for this analysis.

-cutoff [float]

default 0.005. This means we only run the error estimation on sites with a MAF>0.005. This should be modified according to the number of samples in the dataset.

-eps [float]

default 0.001.This is a guess of the errorrate in the sample, this is used for the simple MAF estimator

-errors [filename]

This file should contain a guess of the typespecific errors.

extra options

To further refine what data should be used please see alleles counts.

Example

The simplest example is:

./angsd -bam smallBam.filelist -doCounts 1 -out test  -doError 1 -doMajorMinor 2 -nThreads 2 -minSites 1000

Or a more elaborate example where we only want to estimate the typespecific errors for the "good" data:

./angsd -bam smallBam.filelist -doCounts 1 -out test2  -doError 1 -doMajorMinor 2 -nThreads 2 -minSites 1000 -minQ 20 -minMapQ 30

Output

#test
0.000000	0.005488	0.003847	0.003137\
0.006807	0.000000	0.001972	0.002396\
0.002190	0.001855	0.000000	0.008068\
0.002491	0.004268	0.005812	0.000000
#test2
0.000000	0.000071	0.003381	0.001254\
0.003989	0.000000	0.000000	0.002568\
0.002270	0.000000	0.000000	0.003650\
0.001451	0.004327	0.000974	0.000000

Notice that we for the more stringent test2 dataset have somewhat lower error rates. But we should really choose a much larger number of sites to do this analysis.

NB Currently the ordering of each line, can not be interpreted as the error rates along the genome, due to the threading. This will likely change in future versions.

Error estimation using an outgroup and an error free individual

The estimated rates can roughly be interpreted as relative error rates. That is excess of derived alleles in your sample compare to the derived alleles in the an error free indviduals The idea is the your sample and the error free individuals should have the same expected number of derived alleles and the extra observed derived alleles in you sample are due to the excess error. For each individual we sample a single base from the reads at each position. We use only positions were there are coverage for both the outgroup, the sample and the error free individual. See Error_estimation_method for more details

-doAnsError 1

Use -doAnsError to use this analysis

-anc [filename]

fasta file with the ancestral alleles

-ref [filename]

fasta file of an error free individual.

-sample [int]

1: Sample only one read (default) 0:use all reads

additional options

-minQ [int]

default 0. Minimum allowed base quality score

example

- angsd -doAnsError 1 -anc chimpHg19.fa -ref hg19perfect.fa -out results -bam bam.filelist
- Rscript R/estError.R file=results.ancError

typing Rscript R/estError.R will give you additional options for plotting the results

Output

The Rscript will output the type specific errors and the overall errors. It will also produce two figures. One will the type specific errors and one with the overall errors. If a lot of individuals are included in the analysis the figure will need to be modified.

Theory

Method 1

Method 2

The estimated rates can roughly be intrepreted as relative error rates. That is excess of errors in your sample compare to the error in the perfect indviduals. The idea is the your sample and the perfect individuals should have the same expected number of derived alleles and the extra derived alleles in you sample are due to the excess errorr. For each individual we sample a single base from the reads at each position. We use only positions were there are coverage for both the chimp, the sample and the perfect man. The overall error rate is obtained from

$O_{D}=E_{D}(1-\epsilon )+E_{A}\epsilon$

were

$\epsilon$ is the error rate
$O_{D}$ is the observed number of derived alleles in the sample
$E_{D}$ is the expected number of derived alleles which is obtained from the observed derived alleles from the perfect man
$E_{A}$ is the expected number of ancestral alleles which is obtained from the perfect man

For the type specific error rates are obtained from maximizing the likehood

$p(H=h|C=c)=p(H=h|C=c,noerror)\left(1-\sum _{h'\neq h}e_{h->h'}\right)+\left(\sum _{h'\neq h}p(H=h'|C=c,noerror)e_{h'->h}\right)$

where

$h$ is the allele of you sample
$c$ is the allele of the chimp
$e_{a->b}$ is the error rate for base a to base b
$p(H=h|C=c,noerror)$ are obtained from the perfect man assuming that the perfect man has no errors.

@@ Line 1: / Line 1: @@
 ANGSD currently has 2 different methods for estimating error rates.
-# A method that will estimate the typespecific errorrates for multiple samples. This method is using the minor allele frequency as prior, and therefore this method can not be used for single samples.
+# A method that will estimate the joint typespecific errorrates for multiple samples. This method is using the minor allele frequency as prior, and therefore this method can not be used for single samples.
 # A method that interpret that uses and outgroup and an 'errorfree' individual.
+The typespecific errorrates is an <math>4\times 4</math> matrix, with the errorates of observing base '''A''' but in reality it was base '''C''' etc.
 __TOC__
 =Error estimation from polymorphic sites=

Error estimation: Difference between revisions

Revision as of 12:40, 6 December 2013

Contents

Error estimation from polymorphic sites

options

extra options

Example

Output

Error estimation using an outgroup and an error free individual

additional options

example

Output

Theory

Method 1

Method 2

Navigation menu