ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Error estimation: Difference between revisions
Revision as of 22:09, 11 October 2012
Error estimation from polymorphic sites
The method for estimating type-specific errors is described in Kim2011 and is based on the counts of the four different nucleotides. It should be applied to variable sites, where variability is measured with the simple MAF estimator described in Li2010.
options
- -cutoff [float]
default 0.005. This means we only run the error estimation on sites with a MAF>0.005. This should be modified according to the number of samples in the dataset.
- -eps [float]
default 0.001. This is a guess of the error rate in the sample; it is used by the simple MAF estimator.
- -errors [filename]
This file should contain a guess of the type-specific errors. NB: this is not implemented in the current version.
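The advice to adjust -cutoff to the number of samples can be made concrete with a little arithmetic. The sketch below is only an illustration and is not ANGSD code: it assumes diploid samples, and the function names `singleton_frequency` and `sites_passing_cutoff` are hypothetical. With n diploid individuals, an allele present on a single chromosome copy has frequency 1/(2n), so a fixed cutoff of 0.005 corresponds to roughly one copy among 100 individuals.

```python
def singleton_frequency(n_diploid_samples: int) -> float:
    """Frequency of an allele seen on exactly one chromosome copy.

    Assumption: all samples are diploid, giving 2 * n chromosome copies.
    """
    if n_diploid_samples < 1:
        raise ValueError("need at least one sample")
    return 1.0 / (2 * n_diploid_samples)


def sites_passing_cutoff(maf_by_site, cutoff=0.005):
    """Return the indices of sites whose MAF is strictly above the cutoff.

    This mirrors the documented rule (MAF > cutoff); it is not ANGSD's
    own implementation.
    """
    return [i for i, maf in enumerate(maf_by_site) if maf > cutoff]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "samples: singleton frequency =", singleton_frequency(n))

    # Hypothetical per-site MAF estimates, for illustration only.
    example_mafs = [0.0, 0.005, 0.01, 0.2]
    print("sites kept:", sites_passing_cutoff(example_mafs))
```

For a small dataset the singleton frequency is far above 0.005, so the default cutoff keeps every site with at least one minor allele copy; for a very large dataset it may exclude genuinely variable sites.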
extra options
To further refine which data are used, please see alleles counts.
Example
The simplest example is:
./angsd -bam smallBam.filelist -doCounts 1 -out test -doError 1 -doMajorMinor 2 -nThreads 2 -minSites 1000
Or a more elaborate example, where we only want to estimate the type-specific errors for the "good" data:
./angsd -bam smallBam.filelist -doCounts 1 -out test2 -doError 1 -doMajorMinor 2 -nThreads 2 -minSites 1000 -minQ 20 -minMapQ 30
Output
#test
0.000000 0.005488 0.003847 0.003137\
0.006807 0.000000 0.001972 0.002396\
0.002190 0.001855 0.000000 0.008068\
0.002491 0.004268 0.005812 0.000000
#test2
0.000000 0.000071 0.003381 0.001254\
0.003989 0.000000 0.000000 0.002568\
0.002270 0.000000 0.000000 0.003650\
0.001451 0.004327 0.000974 0.000000
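To compare runs such as test and test2, it can help to read the 4x4 matrix into a script and summarise it. The following is a minimal sketch, not part of ANGSD: the helper names are hypothetical, and it only assumes that each block is four rows of four numbers, optionally ending in a backslash, as in the output above. The page does not say which base each row and column refers to, so the row sums are reported by row index only.

```python
def parse_error_matrix(text):
    """Parse one 4x4 block of error rates into a list of four rows.

    Lines starting with '#' (labels such as '#test') are skipped and
    trailing backslashes are dropped.
    """
    rows = []
    for line in text.strip().splitlines():
        line = line.strip().rstrip("\\").strip()
        if not line or line.startswith("#"):
            continue
        rows.append([float(value) for value in line.split()])
    if len(rows) != 4 or any(len(row) != 4 for row in rows):
        raise ValueError("expected a 4x4 matrix")
    return rows


def row_sums(matrix):
    """Sum of each row, i.e. the total off-diagonal rate per row."""
    return [sum(row) for row in matrix]


# The '#test' block copied from the output above.
EXAMPLE_OUTPUT = r"""
#test
0.000000 0.005488 0.003847 0.003137\
0.006807 0.000000 0.001972 0.002396\
0.002190 0.001855 0.000000 0.008068\
0.002491 0.004268 0.005812 0.000000
"""

if __name__ == "__main__":
    matrix = parse_error_matrix(EXAMPLE_OUTPUT)
    for index, total in enumerate(row_sums(matrix)):
        print(f"row {index}: total rate {total:.6f}")
```

Running the same summary on the test2 block gives a quick view of how much the -minQ and -minMapQ filters change the estimates.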
example
Error estimation using an outgroup and an error free individual
- -doAncError [int]
- -anc [filename]
fasta file with the ancestral alleles
- -ref [filename]
fasta file of a reference (error free) individual.
- -doAncError 1
- -doAncError 2
additional options
- -minQ [int]
default 0. Minimum allowed base quality score
- -minMapQ [int]
default 0. Minimum allowed mapping quality score