ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Filters: Difference between revisions

We allow for filtering and manipulation at the read level using the following arguments.

;-r [region]
Specify a region within a chromosome using the syntax [chr]:[start-stop]. Examples:
chr1:1-10000            // first 10000 bases of chr1
chr2:50000-             // chr2 from position 50000 onward
chr11:1-                // all of chr11
chr7:123456             // position 123456 of chr7
;-only_proper_pairs [int]=0
Include only proper pairs (pairs of reads with both mates mapped correctly). 1: include only proper (default), 0: use all reads. If your data is not paired end you have to choose 1
;-rf [region file]
Specify multiple regions in a file.
;-nLines [int]=50
Number of lines to read per file at a time. Reducing this number will decrease the RAM usage at a small cost in speed.
;-uniqueOnly [int]=0
Remove reads that have multiple best hits. 0: no (default), 1: remove

Revision as of 16:17, 11 December 2013

Information on this page is for version 0.569 or higher. Sorry for the confusion; hopefully the program and wiki will be updated before the weekend.

We allow for filtering at many different levels.

  1. Read level: MapQ, uniquely mapped reads, etc.
  2. Base level: qscore
  3. Sequencing depth
  4. Regions (using BAM indexing, active lookup)
  5. Single sites (passive lookup; also allows forcing the major and minor alleles): -sites
  6. Filtering based on downstream analysis: minimum MAF, LRT for SNP calling, etc.

Read level filters

We allow for filtering and manipulation at the read level using the following arguments.

-only_proper_pairs [int]=0

Include only proper pairs (pairs of reads with both mates mapped correctly). 1: include only proper (default), 0: use all reads. If your data is not paired end you have to choose 1

-uniqueOnly [int]=0

Remove reads that have multiple best hits. 0: no (default), 1: remove

-remove_bads [int]=1

Same as the samtools flag -x, which removes reads with a flag above 255 (not primary, failure and duplicate reads)

-minQ [int]=0

minimum base quality

-minMapQ [int]=0

minimum mapping quality (mapQ)

-baq [int]=0

perform BAQ computation; remember to cite the BAQ paper if you use this.
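
To make the -remove_bads rule concrete: in the SAM format, the flag bits 256, 512 and 1024 mark alignments that are not primary, reads that failed quality checks, and duplicates, so any of them pushes the flag above 255. The sketch below only illustrates that arithmetic; `flag_reasons` is a hypothetical helper and not part of ANGSD.

```shell
# Hypothetical helper (not part of ANGSD): list which of the three
# "bad" SAM flag bits named above are set in a numeric flag value.
flag_reasons() {
  flag=$1
  reasons=""
  if [ $((flag & 256)) -ne 0 ]; then reasons="$reasons not_primary"; fi
  if [ $((flag & 512)) -ne 0 ]; then reasons="$reasons qc_fail"; fi
  if [ $((flag & 1024)) -ne 0 ]; then reasons="$reasons duplicate"; fi
  echo "${reasons# }"   # strip the leading space
}

flag_reasons 99     # 1+2+32+64, nothing above 255: prints an empty line
flag_reasons 1123   # 1024+99: prints "duplicate"
```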

Selected Regions

see input

Selected Sites

We support 2 different kinds of input files for filtering: bim files, if you want to use a specific major and minor allele, or plain text files containing chromosome, tab, position.

-filter [bimfile.bim] or -filter [afile.keep]

The file type is determined by the file suffix.

Only use sites contained in the bim (plink format) file. With -doMajorMinor 3 the major/minor alleles from the bim file are used.

Example of a bim file

1	rs11240767	0	728951	T	C
1	rs3131972	0	752721	A	G
1	rs3131969	0	754182	A	G
1	rs3131967	0	754334	T	C
1	rs1048488	0	760912	C	T
1	rs12124819	0	776546	G	A
1	rs4040617	0	779322	G	A
1	rs4970383	0	838555	A	C

Columns are: chromosome name, rsnumber, position in centimorgan, position in bp, and major/minor. Only columns 1, 4, 5 and 6 are used. The major and minor state can also be encoded as 0,1,2,3,4, with 0=A, 1=C, 2=G, 3=T, 4=N.
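
As an illustration of the column layout and the numeric allele encoding, the following sketch pulls out columns 1, 4, 5 and 6 of a bim file and translates the codes 0-4 into letters. `bim_sites` is a hypothetical helper written for this page; it is not how ANGSD itself reads the file.

```shell
# Hypothetical helper (not part of ANGSD): print chromosome, position,
# major and minor from a .bim file, decoding 0..4 as A, C, G, T, N.
bim_sites() {
  awk '
    BEGIN { code["0"]="A"; code["1"]="C"; code["2"]="G"; code["3"]="T"; code["4"]="N" }
    {
      major = ($5 in code) ? code[$5] : $5   # letters pass through unchanged
      minor = ($6 in code) ? code[$6] : $6
      print $1 "\t" $4 "\t" major "\t" minor
    }
  ' "$@"
}

# Two lines: one with letters, one with the numeric encoding.
printf '1\trs11240767\t0\t728951\tT\tC\n1\trs3131972\t0\t752721\t0\t2\n' | bim_sites
```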

Example of a keep file

chr1  100001
chr1  2500000
chr1  347348

The .keep files are implemented in version 0.503 or above.

For the .bim file no ordering is assumed. We require that the .keep file is sorted by chromosome.

Warning

To clarify the above: we require that the .keep file is grouped by chromosome. The ordering of positions within each chromosome does not matter. The program requires that the chromosomes are encountered in the same order when reading the data as in the .keep file. You could of course reheader your BAM files and re-sort them, but a much simpler solution is to force the file reading to use the same chromosome ordering as your .keep file by supplying it with -rf. An example of this is

cut -f1 filter.keep | uniq | awk '{print $0":"}' > regions.txt
./angsd [do analysis] -rf regions.txt
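
The grouping requirement on the .keep file can be checked before starting a run. The sketch below is one way to do it with awk; `keep_is_grouped` is a hypothetical helper, not an ANGSD feature, and it only tests that each chromosome name forms a single contiguous block.

```shell
# Hypothetical helper (not part of ANGSD): exit status 0 if every
# chromosome in column 1 appears as one contiguous block, 1 otherwise.
keep_is_grouped() {
  awk '
    $1 != prev {
      if ($1 in seen) { bad = 1; exit }   # chromosome reappears after a gap
      seen[$1] = 1
      prev = $1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example use (filter.keep is the file name used above):
# keep_is_grouped filter.keep || echo "filter.keep is not grouped by chromosome" >&2
```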

Allele frequencies

-minMaf [float]
only work with sites with a MAF above [float]

Polymorphic sites

-minLRT [float]
only work with sites with an LRT > [float]

Number of non missing individuals

-minInd [int]
only work with sites with information from at least [int] individuals



First we do a run with no filters

./angsd  -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:
...
head TSK.mafs 
chromo	position	major	minor	knownEM	nInd
1	13999919	A	C	0.000008	1
1	13999920	G	A	0.000008	1
1	13999921	G	A	0.000008	1
1	13999922	C	A	0.000008	1
1	13999923	A	C	0.000008	1
1	13999924	G	A	0.000008	1
1	13999925	G	A	0.000008	1
1	13999926	A	C	0.000008	1
1	13999927	G	A	0.000008	1

Now we apply a filter with a MAF cutoff of 1%

../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01
head TSK.mafs 
chromo	position	major	minor	knownEM	nInd
1	13999950	T	G	0.495291	2
1	14000019	G	T	0.047247	9
1	14000056	C	T	0.055851	10
1	14000127	G	T	0.060760	10
1	14000170	C	T	0.052388	9
1	14000176	G	A	0.047928	10
1	14000202	G	A	0.279722	9
1	14000262	C	T	0.058555	9
1	14000322	A	G	0.040471	8
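
For comparison, a .mafs table that has already been written can be thinned after the fact with the same two criteria (MAF above a cutoff, at least a given number of individuals). This is only a post-hoc sketch on the output format shown here: `filter_mafs` is a hypothetical helper, it looks up the knownEM and nInd columns by their header names, and it is not claimed to reproduce what ANGSD does internally.

```shell
# Hypothetical helper (not part of ANGSD): keep the header plus rows with
# knownEM > MIN_MAF and nInd >= MIN_IND.
# Usage: filter_mafs MIN_MAF MIN_IND [file]   (reads stdin if no file is given)
filter_mafs() {
  min_maf=$1
  min_ind=$2
  shift 2
  awk -v maf="$min_maf" -v ind="$min_ind" '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # locate columns by header name
      print
      next
    }
    $(col["knownEM"]) > maf && $(col["nInd"]) >= ind
  ' "$@"
}

# Example use on the output file from the runs above:
# filter_mafs 0.01 5 TSK.mafs | head
```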

Similarly, if we only want sites with information for at least 5 samples

../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minKeepInd 5
head TSK.mafs 
chromo	position	major	minor	knownEM	nInd
1	13999971	T	A	0.000007	6
1	13999972	G	A	0.000007	6
1	13999973	C	A	0.000005	5
1	13999974	G	A	0.000006	6
1	13999975	C	A	0.000002	5
1	13999976	C	A	0.000004	7
1	13999977	A	C	0.000005	8
1	13999978	C	A	0.000005	8
1	13999979	T	A	0.000005	8

If we are only interested in sites that are variable with a p-value below 10^(-6)

../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minLRT 24 -doSNP 1 
head TSK.mafs 
chromo	position	major	minor	knownEM	pK-EM	nInd
1	14000202	G	A	0.279722	42.623150	9
1	14000873	G	A	0.212120	79.118476	10
1	14001018	T	C	0.333736	89.040311	8
1	14001867	A	G	0.200232	47.195423	10
1	14002422	A	T	0.167692	43.196259	9
1	14003581	C	T	0.207404	58.593208	9
1	14004623	T	C	0.219838	102.856433	10
1	14007493	A	G	0.453217	28.398647	9
1	14007558	C	T	0.395670	80.236777	7
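
The cutoff -minLRT 24 can be related to the p-value of 10^(-6) as follows, assuming (this is not stated on this page) that the likelihood ratio statistic is compared against a chi-square distribution with one degree of freedom:

```latex
P\left(\chi^2_1 > 24\right)
  = \operatorname{erfc}\!\left(\sqrt{24/2}\right)
  = \operatorname{erfc}\!\left(\sqrt{12}\right)
  \approx 9.6 \times 10^{-7}
  \approx 10^{-6}
```

Under that assumption, every site in the table above, whose pK-EM values are all larger than 24, passes the threshold.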


Deprecated options

These options should either be included (as is) or be discarded

-minDepth
-maxDepth