ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Filters
Latest revision as of 07:49, 15 November 2019
We allow for filtering at many different levels, including:
- Read level: mapping quality (MapQ), uniquely mapped reads, etc.
- Base level: base quality score (qscore)
- Sequencing depth
- Regions (active lookup using BAM indexing)
- Single sites (passive lookup; also allows forcing the major and minor alleles), see -sites
- Downstream analysis: minimum MAF, LRT for SNP calling, etc.
- Trimming the ends of the reads
It follows that some filters select a subset of the data, while others discard certain sites. If multiple filters have been chosen, they are applied as a chain, so the analysis is limited to the data that pass every one of them.
Filters for reads in BAM files
We allow for filtering and manipulation at the read level. These filters include minimum mapping quality, minimum base quality, paired reads and others. Additionally, specific regions can be analysed. All of the filters for BAM files are described in Input#BAM_files.
Selected Sites
For analysing specific regions see Input#BAM_files. If you are interested in running your analysis at individual sites that are distributed throughout the entire genome, it might be faster to simply loop over the entire data but only analyse the data at specific positions. This can be done by supplying the -sites argument. This approach also allows the major/minor alleles to be forced using external information.
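As a minimal sketch of the -sites workflow, the snippet below writes a tab-separated sites file (chromosome, position), sorts it, and then passes it to ANGSD. The file names sites.txt and sites.sorted.txt and the three positions are illustrative. The indexing step (angsd sites index) and the sort order reflect our understanding of the Sites page, so check them against your ANGSD version. The ANGSD calls are skipped when ./angsd is not present.

```shell
# Hypothetical positions of interest: chromosome <TAB> position,
# deliberately unsorted and with one duplicate.
printf '1\t14000202\n1\t14000003\n1\t14000873\n1\t14000003\n' > sites.txt

# Sort by chromosome, then numerically by position, dropping duplicates.
sort -u -k1,1 -k2,2n sites.txt > sites.sorted.txt

# Assumed location of the binary, matching the examples on this page.
ANGSD=./angsd
if [ -x "$ANGSD" ]; then
    # Index the sites file, then restrict the analysis to those positions.
    "$ANGSD" sites index sites.sorted.txt
    "$ANGSD" -doMaf 2 -doMajorMinor 1 -GL 1 -bam bam.filelist -out TSK -sites sites.sorted.txt
else
    echo "angsd not found at $ANGSD; only the sites file was written" >&2
fi
```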
Allele frequencies
- -minMaf [float]
Only work with sites with a minor allele frequency (MAF) above [float]. Requires -doMaf.
Polymorphic sites
- -SNP_pval [float]
Only work with sites with a p-value less than [float]. Requires -doMaf.
Number of non missing individuals
- -minInd [int]
Only keep sites where at least [int] individuals have a depth of at least minIndDepth (default is 1).
Extra
- -setMinDepth [int]
Discard site if total sequencing depth (all individuals added together) is below [int]. Requires -doCounts
- -setMaxDepth [int]
Discard site if total sequencing depth (all individuals added together) is above [int]. Requires -doCounts
- -setMinDepthInd [int]
Discard individual if sequencing depth for that individual is below [int]. This filter is only applied to analyses that are based on counts of alleles, i.e. analyses that use -doCounts
- -setMaxDepthInd [int]
Discard individual if sequencing depth for that individual is above [int]. This filter is only applied to analyses that are based on counts of alleles, i.e. analyses that use -doCounts
- -geno_minDepth [int]
Only call genotypes if the depth is at least [int] for that individual.
This requires -doCounts and -doGeno
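The site-level depth filters can be combined in a single run. The sketch below derives total-depth bounds from a per-individual range and prints the resulting command. The number of individuals (10) and the per-individual bounds (2 and 30) are illustrative assumptions, not recommended values. -doCounts 1 is included because the depth filters require it.

```shell
# Illustrative assumptions: 10 individuals, 2x to 30x depth per individual.
n_ind=10
min_per_ind=2
max_per_ind=30

# -setMinDepth / -setMaxDepth act on the total depth summed over individuals.
min_depth=$(( n_ind * min_per_ind ))
max_depth=$(( n_ind * max_per_ind ))

# Assemble the command; it is only printed here, not executed.
cmd="./angsd -doMaf 2 -doMajorMinor 1 -GL 1 -bam bam.filelist -out TSK -doCounts 1 -setMinDepth $min_depth -setMaxDepth $max_depth"
echo "$cmd"
```

Suitable bounds depend on the sequencing depth of your own data, so inspect the depth distribution before fixing them.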
Examples
First we do a run with no filters
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:
gunzip -c TSK.mafs.gz | head
chromo position major minor unknownEM nInd
1 13999919 A C 0.000006 1
1 13999920 G A 0.000006 1
1 13999921 G A 0.000006 1
1 13999922 C A 0.000006 1
1 13999923 A C 0.000006 1
1 13999924 G A 0.000006 1
1 13999925 G A 0.000006 1
1 13999926 A C 0.000006 1
1 13999927 G A 0.000006 1
Now we filter with a MAF cutoff of 1%
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01
gunzip -c TSK.mafs.gz | head
chromo position major minor unknownEM nInd
1 14000003 G A 0.032285 9
1 14000013 G A 0.058291 9
1 14000019 G T 0.013709 9
1 14000023 C A 0.025033 9
1 14000170 C T 0.031133 10
1 14000176 G A 0.028189 10
1 14000200 C A 0.075946 7
1 14000202 G A 0.257007 7
1 14000774 G T 0.030039 10
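One way to sanity-check the filter is to count the sites whose estimated frequency falls below the cutoff; the count should be zero. The sketch below does this with awk on the rows from the -minMaf run, assuming the frequency is in column 5 (unknownEM), as in this output.

```shell
# Rows copied from the -minMaf 0.01 output on this page.
sample='chromo position major minor unknownEM nInd
1 14000003 G A 0.032285 9
1 14000013 G A 0.058291 9
1 14000019 G T 0.013709 9
1 14000023 C A 0.025033 9
1 14000170 C T 0.031133 10
1 14000176 G A 0.028189 10
1 14000200 C A 0.075946 7
1 14000202 G A 0.257007 7
1 14000774 G T 0.030039 10'

cutoff=0.01

# Skip the header (NR > 1) and count rows with frequency below the cutoff.
n_below=$(printf '%s\n' "$sample" |
    awk -v c="$cutoff" 'NR > 1 && $5 + 0 < c + 0 { n++ } END { print n + 0 }')

echo "sites below cutoff: $n_below"
```

On real data the same awk program can read from gunzip -c TSK.mafs.gz instead of the embedded sample.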
Similarly, if we only want sites with information for at least 5 samples
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minInd 5
gunzip -c TSK.mafs.gz | head
chromo position major minor unknownEM nInd
1 13999972 G A 0.000003 5
1 13999973 C A 0.000002 5
1 13999974 G A 0.000002 5
1 13999975 C A 0.000002 5
1 13999976 C A 0.000002 5
1 13999977 A C 0.000000 5
1 13999978 C A 0.000000 5
1 13999979 T A 0.000000 5
1 13999980 G A 0.000001 5
If we are only interested in sites that are variable with a p-value below 10^(-6)
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -SNP_pval 1e-6
gunzip -c TSK.mafs.gz | head
chromo position major minor unknownEM pu-EM nInd
1 14000873 G A 0.282476 0.000000e+00 10
1 14001018 T C 0.259890 7.494005e-14 9
1 14001867 A G 0.272099 6.361578e-14 10
1 14002422 A T 0.377890 0.000000e+00 9
1 14003581 C T 0.194393 5.551115e-16 9
1 14004623 T C 0.259172 2.424727e-13 10
1 14007493 A G 0.297176 5.114086e-07 9
1 14007558 C T 0.381770 0.000000e+00 8
1 14007649 G A 0.220547 1.054967e-11 9
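Note that this output has an extra column (pu-EM) compared with the earlier runs, so scripts that read .mafs files are safer locating columns by header name than by fixed position. The sketch below does that for the rows from the -SNP_pval run and counts the rows at or above the threshold. It assumes that pu-EM holds the p-value that -SNP_pval thresholds on, which is our reading of this output.

```shell
# Rows copied from the -SNP_pval 1e-6 output on this page.
sample='chromo position major minor unknownEM pu-EM nInd
1 14000873 G A 0.282476 0.000000e+00 10
1 14001018 T C 0.259890 7.494005e-14 9
1 14001867 A G 0.272099 6.361578e-14 10
1 14002422 A T 0.377890 0.000000e+00 9
1 14003581 C T 0.194393 5.551115e-16 9
1 14004623 T C 0.259172 2.424727e-13 10
1 14007493 A G 0.297176 5.114086e-07 9
1 14007558 C T 0.381770 0.000000e+00 8
1 14007649 G A 0.220547 1.054967e-11 9'

# Find the column named pu-EM in the header, then count rows whose value
# is not below the threshold. Prints "<column index> <violations>".
result=$(printf '%s\n' "$sample" | awk -v name="pu-EM" -v p="1e-6" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col && $col + 0 >= p + 0 { bad++ }
    END { print col + 0, bad + 0 }')

col=${result% *}      # index of the pu-EM column (0 if not found)
n_bad=${result#* }    # number of rows at or above the threshold

echo "pu-EM is column $col; rows at or above threshold: $n_bad"
```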