Latest revision as of 08:49, 15 November 2019

We allow for filtering at many different levels.

Read level, MapQ, unique mapped reads etc
Base level, qscore
Sequencing depth
Regions (using BAM indexing (active lookup))
Single sites (passive lookup, also allows for forcing major and minor) -sites
Filtering based on downstream analysis. minimum MAF, LRT for SNP calling etc.
Trimming out the ends of the reads
etc

It follows that some filters will select a subset of data, and some of the filters will discard certain sites. If multiple filters has been chosen, the analysis will be limited to the chain of filters.

Filters for reads in Bam files

We allow for filtering and manipulation a the read level. These filters include minimum mapping and base qualtity, paired reads and others. Additionally specific regions can be analysed. All of the filters for bam files are described in Input#BAM_files.

Selected Sites

For analysing specfic regions see Input#BAM_files. If you are interested in running your analysis at individual sites that are distributed throughout the entire genome, it might be faster to simply to loop over the entire data, but only analyse the data at specific positions. This can be done by supplying the -sites argument. With this approach we also allows for the forcing of major/minor alleles using external information.

Allele frequencies

-minMaf [float]: only work with sites with a maf above [float]

Requires -doMaf.

Polymorphic sites

-SNP_pval [float]: only work with sites with a p-value less than [float]

Requires -doMaf.

Number of non missing individuals

-minInd [int]: Only keep sites with at least minIndDepth (default is 1) from at least [int] individuals

Extra

-setMinDepth [int]

Discard site if total sequencing depth (all individuals added together) is below [int]. Requires -doCounts

-setMaxDepth [int]

Discard site if total sequencing depth (all individuals added together) is above [int] -doCounts

-setMinDepthInd [int]

Discard individual if sequencing depth for an individual is below [int]. This filter is only applied to analysis which are based on counts of alleles i.e. analysis that uses -doCounts

-setMaxDepthInd [int]

Discard individual if sequencing depth for an individual is above [int] This filter is only applied to analysis which are based on counts of alleles i.e. analysis that uses -doCounts

-geno_minDepth [int]

Only call genotypes if the depth is as least [int] for that individuals

This requires -doCounts and -doGeno

Examples

First we do a run with no filters

./angsd  -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	unknownEM	nInd
1	13999919	A	C	0.000006	1
1	13999920	G	A	0.000006	1
1	13999921	G	A	0.000006	1
1	13999922	C	A	0.000006	1
1	13999923	A	C	0.000006	1
1	13999924	G	A	0.000006	1
1	13999925	G	A	0.000006	1
1	13999926	A	C	0.000006	1
1	13999927	G	A	0.000006	1

Now we do a filter with MAF cutoff of 1\%

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	unknownEM	nInd
1	14000003	G	A	0.032285	9
1	14000013	G	A	0.058291	9
1	14000019	G	T	0.013709	9
1	14000023	C	A	0.025033	9
1	14000170	C	T	0.031133	10
1	14000176	G	A	0.028189	10
1	14000200	C	A	0.075946	7
1	14000202	G	A	0.257007	7
1	14000774	G	T	0.030039	10

Similar if we only want sites with information for atleast 5 samples

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minInd 5

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	unknownEM	nInd
1	13999972	G	A	0.000003	5
1	13999973	C	A	0.000002	5
1	13999974	G	A	0.000002	5
1	13999975	C	A	0.000002	5
1	13999976	C	A	0.000002	5
1	13999977	A	C	0.000000	5
1	13999978	C	A	0.000000	5
1	13999979	T	A	0.000000	5
1	13999980	G	A	0.000001	5

If we are interested in all sites with a p-value of 10^(-6) of being variable

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -SNP_pval 1e-6

gunzip -c TSK.mafs.gz | head

chromo	position	major	minor	unknownEM	pu-EM	nInd
1	14000873	G	A	0.282476	0.000000e+00	10
1	14001018	T	C	0.259890	7.494005e-14	9
1	14001867	A	G	0.272099	6.361578e-14	10
1	14002422	A	T	0.377890	0.000000e+00	9
1	14003581	C	T	0.194393	5.551115e-16	9
1	14004623	T	C	0.259172	2.424727e-13	10
1	14007493	A	G	0.297176	5.114086e-07	9
1	14007558	C	T	0.381770	0.000000e+00	8
1	14007649	G	A	0.220547	1.054967e-11	9

Filters: Difference between revisions

Latest revision as of 08:49, 15 November 2019

Contents

Filters for reads in Bam files

Selected Sites

Allele frequencies

Polymorphic sites

Number of non missing individuals

Extra

Examples

Navigation menu

@@ Line 1: / Line 1: @@
-<pre>
-Information on this page is for version 0.569 or higher. Sorry for confusion, hopefully program and wiki will be updated before weekend.
-</pre>
 We allow for filtering at many different levels.
@@ Line 11: / Line 7: @@
 # Single sites (passive lookup, also allows for forcing major and minor) [[Sites |-sites]]
 # Filtering based on downstream analysis. minimum MAF, LRT for SNP calling etc.
+# Trimming out the ends of the reads
+# etc
+It follows that some filters will select a subset of data, and some of the filters will discard certain sites. If multiple filters has been chosen, the analysis will be limited to the chain of filters.
-This page will describe some of these filters.
+=Filters for reads in Bam files=
-=Read level filters=
-We allow for filtering and manipulation a the read level using the following arguments.
-;-only_proper_pairs [int]=0
+We allow for filtering and manipulation a the read level. These filters include minimum mapping and base qualtity, paired reads and others. Additionally specific regions can be analysed. All of the filters for bam files are described in [[Input#BAM_files]].
-Include only proper pairs (pairs of read with both mates mapped correctly).  1: include only proper (default), 0: use all reads. If your data is not paired end you have to choose 1
-;-uniqueOnly [int]=0
-remove reads that have multiple best hits.. 0 no (default), 1 remove
-;-remove_bads [int]=1
-Same as  the samtools flags -x which removes read with a flag above 255 (not primary, failure and duplicate reads)
-;-minQ [int]=0
-minimum base quality
-;-minMapQ [int]=0
-minimum mapQ quality. Internally this is handled by setting the bases with a qscore below the threshold to 'N'.
--baq [int] =0
-perform baq computation, remember to cite the baq paper for this.
-=Selected Regions=
-see [[input]]
 =Selected Sites=
-We support 2 different kinds of inputfiles for filtering. Bim files, if you want to use a specific major minor, or plain textfiles containing chromosome tab position.
+For analysing specfic regions see [[Input#BAM_files]]. If you are interested in running your analysis at individual sites that are distributed throughout the entire genome, it might be faster to simply to loop over the entire data, but only analyse the data at specific positions. This can be done by supplying the [[Sites | -sites]] argument. With this approach we also allows for the forcing of major/minor alleles using external information.
-; -filter [bimfile.bim] or -filter [afile.keep]
+=Allele frequencies=
+; -minMaf [float]: only work with sites with a maf above [float]
+Requires [[Allele Frequency estimation | -doMaf]].
+=Polymorphic sites=
+; -SNP_pval [float]: only work with sites with a p-value less than [float]
-File is determed by suffix of file.
+Requires [[Allele Frequency estimation | -doMaf]].
+=Number of non missing individuals=
-Only use sites contained in the bim (plink format) file. With -doMajorMinor 3 the major/minor alleles from the bim file is used.
-Example of a bim file
+; -minInd [int]: Only keep sites with at least minIndDepth (default is 1) from at least [int] individuals
-<pre>
-	rs11240767	0	728951	T	C
-	rs3131972	0	752721	A	G
-	rs3131969	0	754182	A	G
-	rs3131967	0	754334	T	C
-	rs1048488	0	760912	C	T
-	rs12124819	0	776546	G	A
-	rs4040617	0	779322	G	A
-	rs4970383	0	838555	A	C
-</pre>
-Columns are, chromosome name, rsnumber, position in centimorgan, position in bp and major/major.
-Only column 1,4,5,6 are used. The major and minor state can also be encoded as 0,1,2,3,4. With 0=A,1=C,2=G,3=T,4=N
-Example of a keep file
+=Extra=
-<pre>
+;-setMinDepth [int]:
-chr1  100001
+Discard site if total sequencing depth (all individuals added together) is below [int].
-chr1  2500000
+Requires [[Alleles counts | -doCounts]]
-chr1  347348
-</pre>
-The .keep are implemented in 0.503 or above.
+;-setMaxDepth [int]:
+Discard site if total sequencing depth (all individuals added together) is above [int]
+[[Alleles counts | -doCounts]]
-For the .bim file no ordering is assumed. We require that the .keep file is sorted according to chromosome.
+;-setMinDepthInd [int]:
-==Warning==
+Discard individual if sequencing depth for an individual is below [int]. This filter is only applied to analysis which are based on counts of alleles i.e. analysis that uses [[Alleles counts | -doCounts]]
-To clarify the above, we require that the .keep file is sorted/grouped together by chromosome. We do not care of the ordering of positions within each chromosome.
-The program requires that the ordering of the chromosomes from the filereading has the same order as in the .keep file. You could of cause reheader your bamfiles, and sort it. But a much simpler solution is to force the filereading to use the same ordering as your keep file and supply that using the -rf.
-An example for this is
-<pre>
-cut -f1 filter.keep |uniq >awk '{print $0":"}' >regions.txt
-./angsd [do analysis] -rf regions.txt
-</pre>
-=Allele frequencies=
+;-setMaxDepthInd [int]:
-; -minMaf [float]: only work with sites with a maf above 'float'
+Discard individual if sequencing depth for an individual is above [int] This filter is only applied to analysis which are based on counts of alleles i.e. analysis that uses [[Alleles counts | -doCounts]]
-=polymorphic sites=
-; -minLRT [float]: only work with sits with an LRT>float
-=Number of non missing individuals=
-; -minInd [int]: only work with sites with information from atleast int individiduals
+;-geno_minDepth [int]
+Only call genotypes if the depth is as least [int] for that individuals
+This requires [[Alleles counts | -doCounts]] and [[Genotype calling |-doGeno ]]
+=Examples=
+First we do a run with no filters
-First we do a run with no filters
 <pre>
 ./angsd  -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:
-...
-head TSK.mafs
-chromo	position	major	minor	knownEM	nInd
-	13999919	A	C	0.000008	1
-	13999920	G	A	0.000008	1
-	13999921	G	A	0.000008	1
-	13999922	C	A	0.000008	1
-	13999923	A	C	0.000008	1
-	13999924	G	A	0.000008	1
-	13999925	G	A	0.000008	1
-	13999926	A	C	0.000008	1
-	13999927	G	A	0.000008	1
 </pre>
+<div class="toccolours mw-collapsible mw-collapsed">
+gunzip -c TSK.mafs.gz | head
+<pre class="mw-collapsible-content">
+chromo	position	major	minor	unknownEM	nInd
+	13999919	A	C	0.000006	1
+	13999920	G	A	0.000006	1
+	13999921	G	A	0.000006	1
+	13999922	C	A	0.000006	1
+	13999923	A	C	0.000006	1
+	13999924	G	A	0.000006	1
+	13999925	G	A	0.000006	1
+	13999926	A	C	0.000006	1
+	13999927	G	A	0.000006	1
+</pre>
+</div>
 Now we do a filter with MAF cutoff of 1\%
 <pre>
-../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01
+./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01
-head TSK.mafs
+</pre>
-chromo	position	major	minor	knownEM	nInd
-	13999950	T	G	0.495291	2
+<div class="toccolours mw-collapsible mw-collapsed">
-	14000019	G	T	0.047247	9
+gunzip -c TSK.mafs.gz | head
-	14000056	C	T	0.055851	10
+<pre class="mw-collapsible-content">
-	14000127	G	T	0.060760	10
+chromo	position	major	minor	unknownEM	nInd
-	14000170	C	T	0.052388	9
+	14000003	G	A	0.032285	9
-	14000176	G	A	0.047928	10
+	14000013	G	A	0.058291	9
-	14000202	G	A	0.279722	9
+	14000019	G	T	0.013709	9
-	14000262	C	T	0.058555	9
+	14000023	C	A	0.025033	9
-	14000322	A	G	0.040471	8
+	14000170	C	T	0.031133	10
+	14000176	G	A	0.028189	10
+	14000200	C	A	0.075946	7
+	14000202	G	A	0.257007	7
+	14000774	G	T	0.030039	10
 </pre>
+</div>
 Similar if we only want sites with information for atleast 5 samples
 <pre>
-../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minKeepInd 5
+./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minInd 5
-head TSK.mafs
+</pre>
-chromo	position	major	minor	knownEM	nInd
+<div class="toccolours mw-collapsible mw-collapsed">
-	13999971	T	A	0.000007	6
+gunzip -c TSK.mafs.gz | head
-	13999972	G	A	0.000007	6
+<pre class="mw-collapsible-content">
-	13999973	C	A	0.000005	5
+chromo	position	major	minor	unknownEM	nInd
-	13999974	G	A	0.000006	6
+	13999972	G	A	0.000003	5
+	13999973	C	A	0.000002	5
+	13999974	G	A	0.000002	5
 	13999975	C	A	0.000002	5
-	13999976	C	A	0.000004	7
+	13999976	C	A	0.000002	5
-	13999977	A	C	0.000005	8
+	13999977	A	C	0.000000	5
-	13999978	C	A	0.000005	8
+	13999978	C	A	0.000000	5
-	13999979	T	A	0.000005	8
+	13999979	T	A	0.000000	5
+	13999980	G	A	0.000001	5
 </pre>
+</div>
 If we are interested in all sites with a p-value of 10^(-6) of being variable
 <pre>
-../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minLRT 24 -doSNP 1
+./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -SNP_pval 1e-6
-head TSK.mafs
+</pre>
-chromo	position	major	minor	knownEM	pK-EM	nInd
+<div class="toccolours mw-collapsible mw-collapsed">
-	14000202	G	A	0.279722	42.623150	9
+gunzip -c TSK.mafs.gz | head
-	14000873	G	A	0.212120	79.118476	10
+<pre class="mw-collapsible-content">
-	14001018	T	C	0.333736	89.040311	8
+chromo	position	major	minor	unknownEM	pu-EM	nInd
-	14001867	A	G	0.200232	47.195423	10
+	14000873	G	A	0.282476	0.000000e+00	10
-	14002422	A	T	0.167692	43.196259	9
+	14001018	T	C	0.259890	7.494005e-14	9
-	14003581	C	T	0.207404	58.593208	9
+	14001867	A	G	0.272099	6.361578e-14	10
-	14004623	T	C	0.219838	102.856433	10
+	14002422	A	T	0.377890	0.000000e+00	9
-	14007493	A	G	0.453217	28.398647	9
+	14003581	C	T	0.194393	5.551115e-16	9
-	14007558	C	T	0.395670	80.236777	7
+	14004623	T	C	0.259172	2.424727e-13	10
+	14007493	A	G	0.297176	5.114086e-07	9
+	14007558	C	T	0.381770	0.000000e+00	8
+	14007649	G	A	0.220547	1.054967e-11	9
 </pre>
+</div>
-==Deprecated options==
-These options should either be included (as is) or be discarded
-;-minDepth:
-;-maxDepth: