ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Filters


Revision as of 11:13, 26 February 2014

We allow for filtering at many different levels.

  1. Read level: MapQ, uniquely mapped reads, etc.
  2. Base level, qscore
  3. Sequencing depth
  4. Regions (using BAM indexing (active lookup))
  5. Single sites (passive lookup, also allows for forcing major and minor) -sites
  6. Filtering based on downstream analysis. minimum MAF, LRT for SNP calling etc.
  7. Trimming out the ends of the reads
  8. etc

It follows that some filters will select a subset of the data, and some filters will discard certain sites. If multiple filters have been chosen, the analysis is limited to the data that pass every filter in the chain.

Filters for Bam files

We allow for filtering and manipulation at the read level. These filters include minimum mapping quality and base quality, paired reads, and others. Additionally, specific regions can be analysed. All of the filters for BAM files are described in Input#BAM_files.

Selected Sites

For analysing specific regions see Input#BAM_files. If you are interested in running your analysis at individual sites that are distributed throughout the entire genome, it might be faster to simply loop over the entire data, but only analyse the data at specific positions. This can be done by supplying the -sites argument. With this approach we also allow for the forcing of major/minor alleles using external information.
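The idea of a passive lookup (stream through everything, keep only listed positions) can be illustrated on an existing output table. This is a minimal sketch and not how ANGSD implements -sites: the function name keep_listed_sites and the two-column, tab-separated "chromo, position" list are assumptions made for the illustration.

```shell
# Passive lookup sketch: read the whole table from stdin, but only print
# rows whose (chromo, position) pair appears in a list of wanted sites.
# $1 = hypothetical tab-separated file with two columns: chromo, position.
keep_listed_sites() {
    awk -F'\t' -v sites="$1" '
        BEGIN {
            # Load the wanted positions into a lookup table.
            while ((getline line < sites) > 0) {
                split(line, f, "\t")
                keep[f[1] FS f[2]] = 1
            }
            close(sites)
        }
        # Always keep the header line; keep data rows only if listed.
        NR == 1 || ($1 FS $2) in keep
    '
}

# Usage: gunzip -c TSK.mafs.gz | keep_listed_sites wanted_sites.txt
```

Every line of the input is still read, which is why this style of lookup pays off mainly when the wanted sites are scattered across the genome rather than concentrated in a few regions.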

Allele frequencies

-minMaf [float]
only work with sites with a maf above [float]

Requires -doMaf.
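To preview what a given cutoff would remove, the same threshold can be applied after the fact to an unfiltered .mafs.gz file. A minimal sketch, assuming the six-column layout shown in the examples on this page (frequency in column 5, knownEM); the function name filter_min_maf is hypothetical and not part of ANGSD.

```shell
# Keep the header plus rows whose estimated frequency (column 5) is above
# the cutoff given as $1.
# Assumes the tab-separated layout: chromo position major minor knownEM nInd
filter_min_maf() {
    awk -F'\t' -v min="$1" 'NR == 1 || $5 + 0 > min + 0'
}

# Usage: gunzip -c TSK.mafs.gz | filter_min_maf 0.01 | head
```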

Polymorphic sites

-SNP_pval [float]
only work with sites with a p-value less than [float]

Requires -doMaf.

Number of non missing individuals

-minInd [int]
only work with sites with information from at least [int] individuals
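The nInd column in the output makes it easy to check this filter on an existing table. Because the column position shifts when extra columns such as pK-EM are present, the sketch below looks the column up by its header name. The function name filter_min_ind is hypothetical, and the tab-separated layout is assumed from the examples on this page.

```shell
# Keep the header plus rows with information from at least $1 individuals.
# The nInd column is located by name, so the same function works with or
# without additional columns such as pK-EM.
filter_min_ind() {
    awk -F'\t' -v min="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "nInd") col = i
            print
            next
        }
        # If no nInd column was found, col is 0 and no data rows are printed.
        col && $col + 0 >= min + 0
    '
}

# Usage: gunzip -c TSK.mafs.gz | filter_min_ind 5 | head
```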

Extra

-setMinDepth [int]

Discard site if total sequencing depth (all individuals added together) is below [int]. Requires -doCounts.

-setMaxDepth [int]

Discard site if total sequencing depth (all individuals added together) is above [int]. Requires -doCounts.
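The two depth filters act on the sum over individuals, not on each individual separately. The sketch below shows that arithmetic on a hypothetical table with one depth column per individual; the input layout and the function name filter_total_depth are assumptions for illustration and do not correspond to a specific ANGSD output file.

```shell
# Keep sites whose depth, summed over all individuals, lies in [min, max].
# Hypothetical tab-separated input without header:
#   chromo, position, then one depth column per individual.
# $1 = minimum total depth, $2 = maximum total depth.
filter_total_depth() {
    awk -F'\t' -v min="$1" -v max="$2" '
        {
            total = 0
            for (i = 3; i <= NF; i++) total += $i
            if (total >= min + 0 && total <= max + 0) print
        }
    '
}

# Usage: filter_total_depth 10 20 < per_individual_depth.txt
```

With a minimum of 10 and a maximum of 20, a site with depths 2, 3 and 0 (total 5) is discarded, a site with 10, 20 and 30 (total 60) is discarded, and a site with 5, 5 and 5 (total 15) is kept.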


-geno_minDepth [int]

Only call genotypes for a site if the depth is at least [int] for that individual.

This requires -doCounts and -doGeno.

-trim [int]

Removes [int] bases from both ends of each read; mostly useful for ancient DNA.
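What trimming does to a single read can be shown on a plain sequence string. This is only a sketch of the idea, with a hypothetical function name trim_read; it works on text, not on BAM records, and says nothing about how ANGSD handles the trimmed bases internally.

```shell
# Drop $2 bases from each end of the sequence given as $1.
# Reads no longer than twice the trim length come out empty.
trim_read() {
    printf '%s\n' "$1" | awk -v n="$2" '{ print substr($0, n + 1, length($0) - 2 * n) }'
}

# Usage: trim_read ACGTACGTAC 2    # prints GTACGT
```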

Examples

First we do a run with no filters

./angsd  -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	knownEM	nInd
1	13999919	A	C	0.000008	1
1	13999920	G	A	0.000008	1
1	13999921	G	A	0.000008	1
1	13999922	C	A	0.000008	1
1	13999923	A	C	0.000008	1
1	13999924	G	A	0.000008	1
1	13999925	G	A	0.000008	1
1	13999926	A	C	0.000008	1
1	13999927	G	A	0.000008	1

Now we filter with a MAF cutoff of 1%

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	knownEM	nInd
1	13999950	T	G	0.495291	2
1	14000019	G	T	0.047247	9
1	14000056	C	T	0.055851	10
1	14000127	G	T	0.060760	10
1	14000170	C	T	0.052388	9
1	14000176	G	A	0.047928	10
1	14000202	G	A	0.279722	9
1	14000262	C	T	0.058555	9
1	14000322	A	G	0.040471	8

Similarly, if we only want sites with information for at least 5 samples

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minInd 5

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	knownEM	nInd
1	13999971	T	A	0.000007	6
1	13999972	G	A	0.000007	6
1	13999973	C	A	0.000005	5
1	13999974	G	A	0.000006	6
1	13999975	C	A	0.000002	5
1	13999976	C	A	0.000004	7
1	13999977	A	C	0.000005	8
1	13999978	C	A	0.000005	8
1	13999979	T	A	0.000005	8

If we are interested in all sites with a p-value of less than 10^(-6) for being variable

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -SNP_pval 1e-6

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	knownEM	pK-EM	nInd
1	14000202	G	A	0.279722	42.623150	9
1	14000873	G	A	0.212120	79.118476	10
1	14001018	T	C	0.333736	89.040311	8
1	14001867	A	G	0.200232	47.195423	10
1	14002422	A	T	0.167692	43.196259	9
1	14003581	C	T	0.207404	58.593208	9
1	14004623	T	C	0.219838	102.856433	10
1	14007493	A	G	0.453217	28.398647	9
1	14007558	C	T	0.395670	80.236777	7