ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Filters

Latest revision as of 08:49, 15 November 2019

We allow for filtering at many different levels.

  1. Read level, MapQ, unique mapped reads etc
  2. Base level, qscore
  3. Sequencing depth
  4. Regions (using BAM indexing (active lookup))
  5. Single sites (passive lookup, also allows for forcing major and minor) -sites
  6. Filtering based on downstream analysis. minimum MAF, LRT for SNP calling etc.
  7. Trimming out the ends of the reads
  8. etc

It follows that some filters select a subset of the data, while others discard entire sites. If multiple filters have been chosen, they are applied as a chain and the analysis is limited to the data that passes all of them.
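The chaining idea can be pictured with a toy table. The following is only a sketch in awk, not how angsd implements its filters; the table, its column layout and the thresholds are hypothetical and chosen for illustration.

```shell
# Toy per-site table (hypothetical values): chromo, position, nInd, maf
sites='1 100 3 0.200
1 101 8 0.004
1 102 9 0.150
1 103 6 0.020'

# Filter 1: keep sites with data from at least 5 individuals.
# Filter 2: of those, keep sites with a MAF above 0.01.
# A site must pass every filter in the chain to be analysed.
chain_kept=$(printf '%s\n' "$sites" \
  | awk '$3 >= 5' \
  | awk '$4 > 0.01 { print $1 ":" $2 }')

printf '%s\n' "$chain_kept"
```

Site 1:100 is dropped by the first filter even though its MAF is high, and site 1:101 is dropped by the second even though it has enough individuals.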

Filters for reads in BAM files

We allow for filtering and manipulation at the read level. These filters include minimum mapping quality, minimum base quality, paired reads and others. Additionally, specific regions can be analysed. All of the filters for BAM files are described in Input#BAM_files.

Selected Sites

For analysing specific regions see Input#BAM_files. If you are interested in running your analysis at individual sites that are distributed throughout the entire genome, it might be faster to simply loop over the entire data, but only analyse the data at specific positions. This can be done by supplying the -sites argument. This approach also allows the major/minor alleles to be forced using external information.

Allele frequencies

-minMaf [float]
Only work with sites with a MAF above [float].

Requires -doMaf.

Polymorphic sites

-SNP_pval [float]
Only work with sites whose p-value for being variable is less than [float].

Requires -doMaf.

Number of non missing individuals

-minInd [int]
Only keep sites where at least [int] individuals each have a depth of at least minIndDepth (default is 1).
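As a rough illustration of this rule, the sketch below counts, for each site of a toy depth table, how many individuals reach the per-individual depth and keeps the site only if that count is large enough. The table and the shell variable names are hypothetical; this mimics the rule with awk and is not angsd's own code.

```shell
# Toy depths (hypothetical): chromo, position, then one depth column per individual
depths='1 200 0 0 3 1
1 201 2 1 4 1
1 202 5 0 0 0'

min_ind=3        # plays the role of -minInd
min_ind_depth=1  # plays the role of minIndDepth (default 1)

minind_kept=$(printf '%s\n' "$depths" | awk -v n="$min_ind" -v d="$min_ind_depth" '
  {
    covered = 0
    # count individuals whose depth reaches the per-individual minimum
    for (i = 3; i <= NF; i++) if ($i >= d) covered++
    if (covered >= n) print $1 ":" $2 " nInd=" covered
  }')

printf '%s\n' "$minind_kept"
```

Note that site 1:202 is dropped despite a depth of 5 in one individual, because only one individual has data.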

Extra

-setMinDepth [int]

Discard site if total sequencing depth (all individuals added together) is below [int]. Requires -doCounts

-setMaxDepth [int]

Discard site if total sequencing depth (all individuals added together) is above [int]. Requires -doCounts
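The two total-depth filters can be sketched on a toy table as follows. This is an awk imitation under assumed inputs (the depth table and thresholds are hypothetical), and whether angsd treats the thresholds as inclusive is an assumption here.

```shell
# Toy depths (hypothetical): chromo, position, then one depth column per individual
depths='1 300 1 0 1 0
1 301 4 5 3 6
1 302 40 35 50 45'

min_depth=5    # plays the role of -setMinDepth
max_depth=100  # plays the role of -setMaxDepth

depth_kept=$(printf '%s\n' "$depths" | awk -v lo="$min_depth" -v hi="$max_depth" '
  {
    total = 0
    for (i = 3; i <= NF; i++) total += $i   # depth summed over all individuals
    # assumption: a site exactly at a threshold is kept
    if (total >= lo && total <= hi) print $1 ":" $2 " depth=" total
  }')

printf '%s\n' "$depth_kept"
```

The first site is too shallow (total depth 2) and the third too deep (total depth 170), so only 1:301 survives.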

-setMinDepthInd [int]

Discard individual if sequencing depth for that individual is below [int]. This filter is only applied to analyses that are based on allele counts, i.e. analyses that use -doCounts

-setMaxDepthInd [int]

Discard individual if sequencing depth for that individual is above [int]. This filter is only applied to analyses that are based on allele counts, i.e. analyses that use -doCounts
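One way to read "discard individual" is that the individual is treated as missing at that site while the site itself is kept. The sketch below illustrates that reading on a single toy site; the depths, thresholds and the NA marker are hypothetical, and the exact behaviour inside angsd may differ.

```shell
# Toy depths for one site (hypothetical): one column per individual
site_depths='0 2 7 55 12'

min_depth_ind=3   # plays the role of -setMinDepthInd
max_depth_ind=50  # plays the role of -setMaxDepthInd

# Individuals outside [min, max] are shown as NA (treated as missing here);
# the site itself is not dropped by this step.
masked=$(printf '%s\n' "$site_depths" | awk -v lo="$min_depth_ind" -v hi="$max_depth_ind" '
  {
    out = ""
    for (i = 1; i <= NF; i++) {
      v = ($i < lo || $i > hi) ? "NA" : $i
      out = out (i > 1 ? " " : "") v
    }
    print out
  }')

printf '%s\n' "$masked"
```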


-geno_minDepth [int]

Only call genotypes if the depth is at least [int] for that individual

This requires -doCounts and -doGeno

Examples

First we do a run with no filters

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	unknownEM	nInd
1	13999919	A	C	0.000006	1
1	13999920	G	A	0.000006	1
1	13999921	G	A	0.000006	1
1	13999922	C	A	0.000006	1
1	13999923	A	C	0.000006	1
1	13999924	G	A	0.000006	1
1	13999925	G	A	0.000006	1
1	13999926	A	C	0.000006	1
1	13999927	G	A	0.000006	1

Now we apply a MAF cutoff of 1%

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	unknownEM	nInd
1	14000003	G	A	0.032285	9
1	14000013	G	A	0.058291	9
1	14000019	G	T	0.013709	9
1	14000023	C	A	0.025033	9
1	14000170	C	T	0.031133	10
1	14000176	G	A	0.028189	10
1	14000200	C	A	0.075946	7
1	14000202	G	A	0.257007	7
1	14000774	G	T	0.030039	10
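The .mafs output is a plain whitespace-separated table, so a stricter cutoff can also be applied after the run without rerunning angsd. The sketch below does this with awk on three rows copied from the output above; the 0.05 cutoff is an arbitrary illustration.

```shell
# Rows copied from the example output above (header plus three sites)
mafs='chromo position major minor unknownEM nInd
1 14000003 G A 0.032285 9
1 14000013 G A 0.058291 9
1 14000019 G T 0.013709 9'

# Keep the header (NR == 1) and any site whose frequency column exceeds the cutoff
strict=$(printf '%s\n' "$mafs" | awk -v cutoff=0.05 'NR == 1 || $5 > cutoff')

printf '%s\n' "$strict"
```

On a real run the same awk expression can be fed from `gunzip -c TSK.mafs.gz`.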

Similarly, if we only want sites with information from at least 5 samples

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minInd 5

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	unknownEM	nInd
1	13999972	G	A	0.000003	5
1	13999973	C	A	0.000002	5
1	13999974	G	A	0.000002	5
1	13999975	C	A	0.000002	5
1	13999976	C	A	0.000002	5
1	13999977	A	C	0.000000	5
1	13999978	C	A	0.000000	5
1	13999979	T	A	0.000000	5
1	13999980	G	A	0.000001	5

If we are interested in all sites with a p-value below 10^(-6) for being variable

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -SNP_pval 1e-6

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	unknownEM	pu-EM	nInd
1	14000873	G	A	0.282476	0.000000e+00	10
1	14001018	T	C	0.259890	7.494005e-14	9
1	14001867	A	G	0.272099	6.361578e-14	10
1	14002422	A	T	0.377890	0.000000e+00	9
1	14003581	C	T	0.194393	5.551115e-16	9
1	14004623	T	C	0.259172	2.424727e-13	10
1	14007493	A	G	0.297176	5.114086e-07	9
1	14007558	C	T	0.381770	0.000000e+00	8
1	14007649	G	A	0.220547	1.054967e-11	9
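The p-values in the pu-EM column are written in scientific notation, which awk can compare numerically. As a sketch, the snippet below applies a stricter, arbitrarily chosen threshold of 1e-10 to three rows copied from the output above and prints the positions that remain.

```shell
# Rows copied from the example output above (header plus three sites)
snps='chromo position major minor unknownEM pu-EM nInd
1 14000873 G A 0.282476 0.000000e+00 10
1 14001018 T C 0.259890 7.494005e-14 9
1 14007493 A G 0.297176 5.114086e-07 9'

# Skip the header; adding 0 forces awk to treat the values as numbers
stricter=$(printf '%s\n' "$snps" \
  | awk -v p=1e-10 'NR > 1 && ($6 + 0) < (p + 0) { print $2 }')

printf '%s\n' "$stricter"
```

Position 14007493 passes the original 1e-6 threshold but not the stricter one.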