ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Filters: Difference between revisions
Revision as of 15:11, 11 December 2013
Information on this page is for version 0.569 or higher. Sorry for the confusion; hopefully the program and wiki will be updated before the weekend.
We allow for filtering at many different levels.
- Read level, MapQ, unique mapped reads etc
- Base level, qscore
- Sequencing depth
- Regions (using BAM indexing (active lookup))
- Single sites (passive lookup, also allows for forcing major and minor) -sites
- Filtering based on downstream analysis. minimum MAF, LRT for SNP calling etc.
Version Notice
The information on this page relates to versions before 0.542. See Filters2 for the latest approach.
Main
In most analyses you are interested in only a subset of the sites, not all of them. Currently we have the following filter options.
NB: the afile.keep option is still in beta, and some users have reported that it made the program crash on random occasions.
Selected Regions
See Input.
Selected Sites
We support two kinds of input files for filtering: bim files, if you want to use a specific major and minor allele, and plain text files containing chromosome and position separated by a tab.
- -filter [bimfile.bim] or -filter [afile.keep]
The file type is determined by the file suffix.
Only sites contained in the bim (plink format) file are used. With -doMajorMinor 3 the major/minor alleles from the bim file are used.
Example of a bim file
1 rs11240767 0 728951 T C
1 rs3131972  0 752721 A G
1 rs3131969  0 754182 A G
1 rs3131967  0 754334 T C
1 rs1048488  0 760912 C T
1 rs12124819 0 776546 G A
1 rs4040617  0 779322 G A
1 rs4970383  0 838555 A C
The columns are chromosome name, rs number, position in centimorgans, position in bp, major allele and minor allele. Only columns 1, 4, 5 and 6 are used. The major and minor states can also be encoded as 0, 1, 2, 3, 4, with 0=A, 1=C, 2=G, 3=T, 4=N.
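The column selection and the numeric allele encoding described above can be sketched with awk. This is only an illustration of the file layout, not how ANGSD itself parses the file; the function name decode_bim is hypothetical, and the second input line is the rs3131972 entry from the example with its alleles A and G rewritten as 0 and 2.

```shell
# Hypothetical helper (not part of ANGSD): print the four columns that are
# used from a bim file (1, 4, 5, 6), decoding numeric allele codes to letters.
decode_bim() {
  awk '
    BEGIN { code["0"] = "A"; code["1"] = "C"; code["2"] = "G"; code["3"] = "T"; code["4"] = "N" }
    {
      major = ($5 in code) ? code[$5] : $5   # letters pass through unchanged
      minor = ($6 in code) ? code[$6] : $6
      print $1 "\t" $4 "\t" major "\t" minor
    }
  '
}

# Usage: one letter-coded and one number-coded line.
printf '1 rs11240767 0 728951 T C\n1 rs3131972 0 752721 0 2\n' | decode_bim
```

Both lines come out as chromosome, bp position, major and minor, with the alleles as letters.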
Example of a keep file
chr1	100001
chr1	2500000
chr1	347348
The .keep files are supported in version 0.503 and above.
For the .bim file no ordering is assumed. We require that the .keep file is sorted according to chromosome.
Warning
To clarify the above: we require that the .keep file is grouped by chromosome. The ordering of positions within each chromosome does not matter. The program requires that the chromosomes are read from the input files in the same order as they appear in the .keep file. You could of course reheader your BAM files and re-sort them, but a much simpler solution is to force the file reading to use the same chromosome ordering as your keep file by supplying that ordering with -rf. An example of this is
cut -f1 filter.keep | uniq | awk '{print $0":"}' > regions.txt
./angsd [do analysis] -rf regions.txt
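Because uniq only collapses adjacent duplicates, a keep file that is not grouped by chromosome would produce repeated entries in regions.txt. A minimal sketch of a pre-check follows; the function name check_keep_grouped is hypothetical and the check is not something ANGSD provides.

```shell
# Hypothetical helper (not part of ANGSD): succeed only if every chromosome in
# the keep file forms one contiguous run of lines.
check_keep_grouped() {
  awk '
    $1 != prev {
      if ($1 in seen) {
        print "chromosome " $1 " appears again at line " NR
        bad = 1
        exit
      }
      seen[$1] = 1
      prev = $1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Usage: check_keep_grouped filter.keep && echo "filter.keep is grouped by chromosome"
```

The check ignores positions entirely, matching the requirement that only the chromosome grouping matters.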
Allele frequencies
- -minMaf [float]
- only work with sites with a MAF above 'float'
Polymorphic sites
- -minLRT [float]
- only work with sites with an LRT > float
Number of non missing individuals
- -minInd [int]
- only work with sites with information from at least int individuals
First we do a run with no filters
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:
...
head TSK.mafs
chromo position major minor knownEM  nInd
1      13999919 A     C     0.000008 1
1      13999920 G     A     0.000008 1
1      13999921 G     A     0.000008 1
1      13999922 C     A     0.000008 1
1      13999923 A     C     0.000008 1
1      13999924 G     A     0.000008 1
1      13999925 G     A     0.000008 1
1      13999926 A     C     0.000008 1
1      13999927 G     A     0.000008 1
Now we filter with a MAF cutoff of 1%
../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01
head TSK.mafs
chromo position major minor knownEM  nInd
1      13999950 T     G     0.495291 2
1      14000019 G     T     0.047247 9
1      14000056 C     T     0.055851 10
1      14000127 G     T     0.060760 10
1      14000170 C     T     0.052388 9
1      14000176 G     A     0.047928 10
1      14000202 G     A     0.279722 9
1      14000262 C     T     0.058555 9
1      14000322 A     G     0.040471 8
Similarly, if we only want sites with information for at least 5 samples
../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minKeepInd 5
head TSK.mafs
chromo position major minor knownEM  nInd
1      13999971 T     A     0.000007 6
1      13999972 G     A     0.000007 6
1      13999973 C     A     0.000005 5
1      13999974 G     A     0.000006 6
1      13999975 C     A     0.000002 5
1      13999976 C     A     0.000004 7
1      13999977 A     C     0.000005 8
1      13999978 C     A     0.000005 8
1      13999979 T     A     0.000005 8
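The effect of the frequency and sample-count cutoffs can also be approximated after the fact on an unfiltered .mafs table. The sketch below is a post-hoc illustration with awk, not ANGSD's own filtering. The function name filter_mafs is hypothetical, the table is assumed to be whitespace-separated with a header row containing knownEM and nInd, and the comparison uses >=, which may differ from the program's behaviour exactly at the cutoff.

```shell
# Hypothetical helper (not part of ANGSD): keep the header plus the rows of a
# .mafs table whose knownEM is >= $1 and whose nInd is >= $2.
filter_mafs() {
  awk -v minmaf="$1" -v minind="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
    $(col["knownEM"]) + 0 >= minmaf + 0 && $(col["nInd"]) + 0 >= minind + 0
  '
}

# Usage on three rows copied from the outputs above.
filter_mafs 0.01 5 <<'EOF'
chromo position major minor knownEM nInd
1 13999950 T G 0.495291 2
1 14000019 G T 0.047247 9
1 13999973 C A 0.000005 5
EOF
```

Looking the columns up by header name rather than by position keeps the sketch usable on tables that carry extra columns such as pK-EM.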
If we are interested in all sites with a p-value of 10^(-6) of being variable
../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minLRT 24 -doSNP 1
head TSK.mafs
chromo position major minor knownEM  pK-EM      nInd
1      14000202 G     A     0.279722 42.623150  9
1      14000873 G     A     0.212120 79.118476  10
1      14001018 T     C     0.333736 89.040311  8
1      14001867 A     G     0.200232 47.195423  10
1      14002422 A     T     0.167692 43.196259  9
1      14003581 C     T     0.207404 58.593208  9
1      14004623 T     C     0.219838 102.856433 10
1      14007493 A     G     0.453217 28.398647  9
1      14007558 C     T     0.395670 80.236777  7
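The link between the cutoff of 24 and a p-value of about 10^(-6) can be checked by hand. The sketch below assumes that the LRT statistic is compared against a chi-square distribution with one degree of freedom; this page does not state that, so treat it as an assumption.

```latex
% Assumption: the LRT statistic T is compared against a chi-square
% distribution with one degree of freedom.
\[
p \;=\; \Pr\left(\chi^2_1 > T\right) \;=\; \operatorname{erfc}\!\left(\sqrt{T/2}\right),
\qquad
T = 24 \;\Rightarrow\; p = \operatorname{erfc}\!\left(\sqrt{12}\right) \approx 9.6 \times 10^{-7} \approx 10^{-6}.
\]
```

Under the same assumption, all nine pK-EM values in the output above exceed 24 and so correspond to p-values below this threshold.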
Deprecated options
These options should either be included (as is) or be discarded
- -minDepth
- -maxDepth