ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Sites


Version notice

The information on this page relates to version 0.542 or above.

Main

In most analyses you are interested in only a subset of sites rather than all sites. The following filter options are currently available.

Selected Sites

We support two kinds of input files for filtering:

  1. A file containing chromosome and position
  2. A file containing chromosome, position, major and minor

Only sites contained in the filter file will be output. If you supply an augmented filter file (the second kind) in order to force a major and minor state, remember to also supply '-doMajorMinor 3'.

A filter file is supplied to ANGSD with the option

-filter filename

Example of a filter file. The file must be tab separated.

chr1	100001
chr1	2500000
chr1	347348


Example of a file containing major and minor information. The file must be tab separated.

1	728951	T	C
1	752721	A	G
1	754182	A	G
1	754334	T	C
1	760912	C	T
1	776546	G	A
1	779322	G	A
1	838555	A	C

The major and minor state can also be encoded as the integers 0,1,2,3,4, with 0=A, 1=C, 2=G, 3=T, 4=N.
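
As a minimal sketch of the integer encoding described above, the following Python snippet rewrites one line of a major/minor filter file from letters to codes. The function name `encode_filter_line` is hypothetical and not part of ANGSD; only the mapping itself is taken from this page.

```python
# Mapping stated on this page: 0=A, 1=C, 2=G, 3=T, 4=N.
ALLELE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def encode_filter_line(line):
    """Rewrite 'chromosome<TAB>position<TAB>major<TAB>minor' using integer allele codes.

    Hypothetical helper for illustration; ANGSD itself accepts either encoding.
    """
    chrom, pos, major, minor = line.rstrip("\n").split("\t")
    major_code = str(ALLELE_TO_CODE[major.upper()])
    minor_code = str(ALLELE_TO_CODE[minor.upper()])
    return "\t".join([chrom, pos, major_code, minor_code])


# First line of the example file above: T -> 3, C -> 1.
print(encode_filter_line("1\t728951\tT\tC"))
```

A letter outside A, C, G, T, N raises a `KeyError`, which is probably what you want for a file that must be well formed.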

We do not require the positions to be sorted, but we do require that the file is grouped by chromosome name.
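
The grouping requirement can be checked before a run. The sketch below is an illustration only: `ungrouped_chromosomes` is a hypothetical helper, not an ANGSD function, and it assumes the chromosome name is the first tab-separated field on each line.

```python
def ungrouped_chromosomes(lines):
    """Return chromosome names that reappear after a different chromosome has started.

    Positions may be in any order; only the contiguity of chromosome blocks is checked.
    """
    seen = set()      # chromosomes whose block has already been entered
    current = None    # chromosome of the block currently being read
    offenders = []
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines
        chrom = line.split("\t")[0]
        if chrom != current:
            if chrom in seen and chrom not in offenders:
                offenders.append(chrom)
            seen.add(chrom)
            current = chrom
    return offenders


# Unsorted positions but one block per chromosome: accepted.
print(ungrouped_chromosomes(["chr1\t100001", "chr1\t2500000", "chr1\t347348", "chr2\t5"]))
# chr1 appears again after chr2: reported.
print(ungrouped_chromosomes(["chr1\t1", "chr2\t5", "chr1\t9"]))
```

An empty result means every chromosome occupies a single contiguous block, which is the condition stated above.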

Details

If a filter file has been supplied as '-filter filter.txt', ANGSD will parse the entire filter.txt file, generate binary representations of it, and write these to output files called

  1. filter.txt.bin
  2. filter.txt.idx

Therefore remember to purge old versions of these files, if you have updated the filter.txt file.
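
One way to avoid forgetting this step is to compare modification times before each run. The following Python sketch is an assumption-laden illustration: the helper names are hypothetical, and it treats a `.bin` or `.idx` file as stale whenever it is older than the filter file, which is a heuristic rather than anything ANGSD checks itself.

```python
import os


def stale_filter_indexes(filter_path):
    """List the .bin/.idx files next to filter_path that are older than the filter file."""
    filter_mtime = os.path.getmtime(filter_path)
    stale = []
    for suffix in (".bin", ".idx"):
        derived = filter_path + suffix
        if os.path.exists(derived) and os.path.getmtime(derived) < filter_mtime:
            stale.append(derived)
    return stale


def purge_stale_filter_indexes(filter_path):
    """Delete the stale derived files and return the paths that were removed."""
    removed = stale_filter_indexes(filter_path)
    for path in removed:
        os.remove(path)
    return removed
```

Calling `purge_stale_filter_indexes("filter.txt")` before invoking ANGSD would then remove only the derived files that predate the last edit of filter.txt.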

Allele frequencies

-minMaf [float]
Only use sites with a MAF above 'float'.

Polymorphic sites

-minLRT [float]
Only use sites with an LRT > float.

Number of non missing individuals

-minInd [int]
Only use sites with information from at least int individuals; requires -doCounts 1.



First we do a run with no filters

./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:
...
head TSK.mafs 
chromo	position	major	minor	knownEM	nInd
1	13999919	A	C	0.000008	1
1	13999920	G	A	0.000008	1
1	13999921	G	A	0.000008	1
1	13999922	C	A	0.000008	1
1	13999923	A	C	0.000008	1
1	13999924	G	A	0.000008	1
1	13999925	G	A	0.000008	1
1	13999926	A	C	0.000008	1
1	13999927	G	A	0.000008	1

Now we filter with a MAF cutoff of 1%

../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01
head TSK.mafs 
chromo	position	major	minor	knownEM	nInd
1	13999950	T	G	0.495291	2
1	14000019	G	T	0.047247	9
1	14000056	C	T	0.055851	10
1	14000127	G	T	0.060760	10
1	14000170	C	T	0.052388	9
1	14000176	G	A	0.047928	10
1	14000202	G	A	0.279722	9
1	14000262	C	T	0.058555	9
1	14000322	A	G	0.040471	8

Similarly, if we only want sites with information from at least 5 samples

../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minInd 5 -doCounts 1
head TSK.mafs 
chromo	position	major	minor	knownEM	nInd
1	13999971	T	A	0.000007	6
1	13999972	G	A	0.000007	6
1	13999973	C	A	0.000005	5
1	13999974	G	A	0.000006	6
1	13999975	C	A	0.000002	5
1	13999976	C	A	0.000004	7
1	13999977	A	C	0.000005	8
1	13999978	C	A	0.000005	8
1	13999979	T	A	0.000005	8
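
The same two cutoffs can also be applied after the fact to an existing .mafs table. The sketch below is only an approximation of the filters: `filter_mafs` is a hypothetical helper, the column names `knownEM` and `nInd` are taken from the output shown on this page, and the choice of a strict comparison for the MAF and a non-strict one for the number of individuals follows the wording "above" and "at least" rather than the ANGSD source code.

```python
import csv
import io


def filter_mafs(handle, min_maf=0.0, min_ind=0):
    """Yield rows of a tab-separated .mafs table that pass both thresholds."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # knownEM is the estimated allele frequency, nInd the number of individuals with data.
        if float(row["knownEM"]) > min_maf and int(row["nInd"]) >= min_ind:
            yield row


# Three rows copied from the outputs above.
example = io.StringIO(
    "chromo\tposition\tmajor\tminor\tknownEM\tnInd\n"
    "1\t13999919\tA\tC\t0.000008\t1\n"
    "1\t13999950\tT\tG\t0.495291\t2\n"
    "1\t14000019\tG\tT\t0.047247\t9\n"
)
kept = [row["position"] for row in filter_mafs(example, min_maf=0.01, min_ind=5)]
print(kept)
```

Only the third row passes both cutoffs: the first fails the MAF threshold and the second has data from too few individuals.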

If we are only interested in sites that are variable with a p-value below 10^(-6), which corresponds to an LRT above 24

../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minLRT 24 -doSNP 1
head TSK.mafs 
chromo	position	major	minor	knownEM	pK-EM	nInd
1	14000202	G	A	0.279722	42.623150	9
1	14000873	G	A	0.212120	79.118476	10
1	14001018	T	C	0.333736	89.040311	8
1	14001867	A	G	0.200232	47.195423	10
1	14002422	A	T	0.167692	43.196259	9
1	14003581	C	T	0.207404	58.593208	9
1	14004623	T	C	0.219838	102.856433	10
1	14007493	A	G	0.453217	28.398647	9
1	14007558	C	T	0.395670	80.236777	7
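
The correspondence between an LRT cutoff of 24 and a p-value of about 10^(-6) can be checked numerically. This sketch assumes that the likelihood ratio statistic follows a chi-square distribution with one degree of freedom, which is the usual asymptotic result for a test of one parameter and is not stated on this page. The function name `lrt_to_pvalue` is hypothetical.

```python
import math


def lrt_to_pvalue(lrt):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(lrt / 2.0))


# p-value for the cutoff used above; it comes out just below 1e-6.
print(f"{lrt_to_pvalue(24.0):.2e}")
```

Under the same assumption, the pK-EM values in the output above (all greater than 24) correspond to p-values smaller than this cutoff.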


Deprecated options

These options should either be included (as is) or be discarded

-minDepth
-maxDepth