ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Sites: Difference between revisions

From angsd
Jump to navigation Jump to search
(Created page with "In most analysis you are only interested in a subset of sites and not all sites. Currently we have the following filter options. NB the afile.keep is still beta and some use...")
 
 
(43 intermediate revisions by the same user not shown)
Line 1: Line 1:
In most analysis you are only interested in a subset of sites and not all sites. Currently we have the following filter options.
This page describes the '''-sites''' filtering that angsd allows. This functionality allows the user to supply a list of sites for which the analysis will be limited to. If you are interested in regions you could consider to use the '''-r/-rf''' options, as described in [[Filters]]. The '''-sites''' will read all input data (but limit the analyses to those defined by -sites), where as the '''-r/-rf''', will use the indexing of BAM files. The '''-sites''' and '''-r/-rf''' can be used in combination.


You should therefore limit your analyses to the chromosomes/scaffolds for which you have sites you want to analyse (beeing the chromsomes in your -sites file).
This can be done with (assuming sites.txt is your -sites sites.txt file).


NB the afile.keep is still beta and some users have reported that this made the program crash on random occasions.
;NB the positions are one indexed. Which means there should never be a position of zero.
=Selected Sites=
<pre>
We support 2 different kinds of inputfiles for filtering. Bim files, if you want to use a specific major minor, or plain textfiles containing chromosome tab position.
cut -f1 sites.txt |sort|uniq >chrs.txt
</pre>
And append '''-rf chrs.txt''' to your argument list.


; -filter [bimfile.bim] or -filter [afile.keep]
=Brief overview=
<pre>
--------------
abcFilter.cpp:
-sites (null) (File containing sites to keep (chr pos))
-sites (null) (File containing sites to keep (chr regStart regStop))
-sites (null) (File containing sites to keep (chr pos major minor))
-minInd 0 Only use site if atleast minInd of samples has data
1) You can force major/minor by -doMajorMinor 3
And make sure file contains 4 columns (chr tab pos tab major tab minor)
</pre>
The -sites file is a nice ascii text file. ANGSD requires that a binary index version is generated. This is done by
<pre>
angsd sites index your.file
</pre>


=Details=
;-sites filename
File containing the sites to include in analysis. If the site does not exist in the sequencing data, then the sites will of course not be included in the output.
;-minInd [int]
Only keep those sites where we have data for at least this number of individuals.




We support 3 different kinds of inputfiles for filtering.


File is determed by suffix of file.
# Either the user supply a file containing chromsome and positions
# Or supply regions like chr tab regStart tab regSTop
# Or supply a file containing chromosome,position, major and minor


Only sites contained in the filter file will be outputted. If you supply an augmented filter for the purpose of forcing a major and minor state then remember to supply '-doMajorMinor 3'


Only use sites contained in the bim (plink format) file. With -doMajorMinor 3 the major/minor alleles from the bim file is used.
A filter file is supplied to ANGSD with the command
<pre>
-sites filename
</pre>
And this file should be indexed before hand


Example of a bim file
<pre>
<pre>
1 rs11240767 0 728951 T C
angsd sites index filename
1 rs3131972 0 752721 A G
1 rs3131969 0 754182 A G
1 rs3131967 0 754334 T C
1 rs1048488 0 760912 C T
1 rs12124819 0 776546 G A
1 rs4040617 0 779322 G A
1 rs4970383 0 838555 A C
</pre>
</pre>
Columns are, chromosome name, rsnumber, position in centimorgan, position in bp and major/major.
==Example of filter file for single sites==
Only column 1,4,5,6 are used. The major and minor state can also be encoded as 0,1,2,3,4. With 0=A,1=C,2=G,3=T,4=N
Example of a filter file.
 
Example of a keep file
<pre>
<pre>
chr1  100001
chr1  100001
chr1  2500000
chr1  2500000
chr1  347348
chr1  347348
</pre>
==Example of filter file for regions==
Example of a filter file.
<pre>
chr1  100 220
chr1  1234 2234
chr1  1000000 2000000
</pre>
</pre>


The .keep are implemented in 0.503 or above.
==Example of augmented filterfile==
 
Example of a file containing information of major and minor. File must be tab seperated.
For the .bim file no ordering is assumed. We require that the .keep file is sorted according to chromosome.
==Warning==
To clarify the above, we require that the .keep file is sorted/grouped together by chromosome. We do not care of the ordering of positions within each chromosome.
The program requires that the ordering of the chromosomes from the filereading has the same order as in the .keep file. You could of cause reheader your bamfiles, and sort it. But a much simpler solution is to force the filereading to use the same ordering as your keep file and supply that using the -rf.
An example for this is
<pre>
<pre>
cut -f1 filter.keep |uniq >awk '{print $0":"}' >regions.txt
1 728951 T C
./angsd [do analysis] -rf regions.txt
1 752721 A G
1 754182 A G
1 754334 T C
1 760912 C T
1 776546 G A
1 779322 G A
1 838555 A C
</pre>
</pre>
The major and minor state can also be encoded as 0,1,2,3,4. With 0=A,1=C,2=G,3=T,4=N


=Allele frequencies=
We do not require the positions to be sorted, but we require that the file is grouped by chromosome name.
; -minMaf [float]: only work with sites with a maf above 'float'
 
=polymorphic sites=
 
; -minLRT [float]: only work with sits with an LRT>float
 
=Number of non missing individuals=
 
 
; -minKeepIndC [int]: only work with sites with information from atleast int individiduals, requires -doCounts 1


==Internal representation==
If a filter file has been supplied as '-sites filter.txt', then ANGSD will parse the entire filter.txt file and generate binary representations and dump these in the outputfiles called
# filter.txt.bin
# filter.txt.idx


 
These can be printed using the command
 
 
First we do a run with no filters
<pre>
<pre>
./angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1:
angsd sites print filter.txt
...
head TSK.mafs
chromo position major minor knownEM nInd
1 13999919 A C 0.000008 1
1 13999920 G A 0.000008 1
1 13999921 G A 0.000008 1
1 13999922 C A 0.000008 1
1 13999923 A C 0.000008 1
1 13999924 G A 0.000008 1
1 13999925 G A 0.000008 1
1 13999926 A C 0.000008 1
1 13999927 G A 0.000008 1
</pre>
</pre>


Now we do a filter with MAF cutoff of 1\%
=Requirements=
 
The file should be sorted according to column1.
This can be achieved by:
<pre>
<pre>
../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minMaf 0.01
sort -k1 unsorted.txt >sorted.txt
head TSK.mafs
chromo position major minor knownEM nInd
1 13999950 T G 0.495291 2
1 14000019 G T 0.047247 9
1 14000056 C T 0.055851 10
1 14000127 G T 0.060760 10
1 14000170 C T 0.052388 9
1 14000176 G A 0.047928 10
1 14000202 G A 0.279722 9
1 14000262 C T 0.058555 9
1 14000322 A G 0.040471 8
</pre>
</pre>
=Bedfiles?=
If you want to use bed files, you need to convert to the native angsd format.
Bedfiles are chr pos1 pos2 value, and are zero indexed. Furthermore pos1 is included, but pos2 is not included.
You can convert to angsd format with


Similar if we only want sites with information for atleast 5 samples
<pre>
../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minKeepInd 5
head TSK.mafs
chromo position major minor knownEM nInd
1 13999971 T A 0.000007 6
1 13999972 G A 0.000007 6
1 13999973 C A 0.000005 5
1 13999974 G A 0.000006 6
1 13999975 C A 0.000002 5
1 13999976 C A 0.000004 7
1 13999977 A C 0.000005 8
1 13999978 C A 0.000005 8
1 13999979 T A 0.000005 8
</pre>
If we are interested in all sites with a p-value of 10^(-6) of being variable
<pre>
<pre>
../angsd0.3/angsd -doMaf 2 -doMajorMinor 1 -out TSK -bam bam.filelist -GL 1 -r 1: -minLRT 24 -doSNP 1
awk '{print $1"\t"$2+1"\t"$3}' input.bed >angsd.file
head TSK.mafs
chromo position major minor knownEM pK-EM nInd
1 14000202 G A 0.279722 42.623150 9
1 14000873 G A 0.212120 79.118476 10
1 14001018 T C 0.333736 89.040311 8
1 14001867 A G 0.200232 47.195423 10
1 14002422 A T 0.167692 43.196259 9
1 14003581 C T 0.207404 58.593208 9
1 14004623 T C 0.219838 102.856433 10
1 14007493 A G 0.453217 28.398647 9
1 14007558 C T 0.395670 80.236777 7
 
</pre>
</pre>
==Deprecated options==
These options should either be included (as is) or be discarded
;-minDepth:
;-maxDepth:

Latest revision as of 15:00, 16 November 2015

This page describes the -sites filtering that angsd allows. This functionality allows the user to supply a list of sites for which the analysis will be limited to. If you are interested in regions you could consider to use the -r/-rf options, as described in Filters. The -sites will read all input data (but limit the analyses to those defined by -sites), where as the -r/-rf, will use the indexing of BAM files. The -sites and -r/-rf can be used in combination.

You should therefore limit your analyses to the chromosomes/scaffolds for which you have sites you want to analyse (beeing the chromsomes in your -sites file). This can be done with (assuming sites.txt is your -sites sites.txt file).

NB the positions are one indexed. Which means there should never be a position of zero.
cut -f1 sites.txt |sort|uniq >chrs.txt

And append -rf chrs.txt to your argument list.

Brief overview

--------------
abcFilter.cpp:
	-sites		(null)	(File containing sites to keep (chr pos))
	-sites		(null)	(File containing sites to keep (chr regStart regStop))
	-sites		(null)	(File containing sites to keep (chr pos major minor))
	-minInd		0	Only use site if atleast minInd of samples has data
	1) You can force major/minor by -doMajorMinor 3
	And make sure file contains 4 columns (chr tab pos tab major tab minor)

The -sites file is a nice ascii text file. ANGSD requires that a binary index version is generated. This is done by

angsd sites index your.file

Details

-sites filename

File containing the sites to include in analysis. If the site does not exist in the sequencing data, then the sites will of course not be included in the output.

-minInd [int]

Only keep those sites where we have data for at least this number of individuals.


We support 3 different kinds of inputfiles for filtering.

  1. Either the user supply a file containing chromsome and positions
  2. Or supply regions like chr tab regStart tab regSTop
  3. Or supply a file containing chromosome,position, major and minor

Only sites contained in the filter file will be outputted. If you supply an augmented filter for the purpose of forcing a major and minor state then remember to supply '-doMajorMinor 3'

A filter file is supplied to ANGSD with the command

-sites filename

And this file should be indexed before hand

angsd sites index filename

Example of filter file for single sites

Example of a filter file.

chr1  100001
chr1  2500000
chr1  347348

Example of filter file for regions

Example of a filter file.

chr1  100 220
chr1  1234 2234
chr1  1000000 2000000

Example of augmented filterfile

Example of a file containing information of major and minor. File must be tab seperated.

1	728951	T	C
1	752721	A	G
1	754182	A	G
1	754334	T	C
1	760912	C	T
1	776546	G	A
1	779322	G	A
1	838555	A	C

The major and minor state can also be encoded as 0,1,2,3,4. With 0=A,1=C,2=G,3=T,4=N

We do not require the positions to be sorted, but we require that the file is grouped by chromosome name.

Internal representation

If a filter file has been supplied as '-sites filter.txt', then ANGSD will parse the entire filter.txt file and generate binary representations and dump these in the outputfiles called

  1. filter.txt.bin
  2. filter.txt.idx

These can be printed using the command

angsd sites print filter.txt

Requirements

The file should be sorted according to column1. This can be achieved by:

sort -k1 unsorted.txt >sorted.txt

Bedfiles?

If you want to use bed files, you need to convert to the native angsd format.

Bedfiles are chr pos1 pos2 value, and are zero indexed. Furthermore pos1 is included, but pos2 is not included. You can convert to angsd format with

awk '{print $1"\t"$2+1"\t"$3}' input.bed >angsd.file