ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Sites: Difference between revisions
Latest revision as of 16:00, 16 November 2015
This page describes the -sites filtering that ANGSD provides. It lets the user supply a list of sites to which the analysis will be limited. If you are interested in regions, consider using the -r/-rf options instead, as described in Filters. With -sites, ANGSD reads all input data and only limits the analyses to the sites defined in the file, whereas -r/-rf use the index of the BAM files. -sites and -r/-rf can be used in combination.
You should therefore limit your analyses to the chromosomes/scaffolds for which you have sites you want to analyse, that is, the chromosomes listed in your -sites file. Assuming sites.txt is the file you pass to -sites, the list of chromosomes can be generated with:
cut -f1 sites.txt |sort|uniq >chrs.txt
Then append -rf chrs.txt to your argument list.
- NB: the positions are one-indexed, which means there should never be a position of zero.
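The one-indexing rule can be checked before the file is indexed. The following is a minimal sketch, not part of ANGSD: the file name sites_example.txt and its contents are made up for illustration, and the check only looks at the second column (the position).

```shell
# Hypothetical sample file: column 1 is the chromosome, column 2 the 1-based position.
# The second line deliberately contains an invalid position of zero.
printf 'chr1\t100001\nchr1\t0\nchr2\t347348\n' > sites_example.txt

# Count lines whose second column is not a positive integer.
bad=$(awk '$2 !~ /^[0-9]+$/ || $2 < 1 { n++ } END { print n + 0 }' sites_example.txt)
echo "lines with invalid positions: $bad"
```

A count above zero suggests the file was produced by a zero-indexed tool and needs converting before use.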
Brief overview
--------------
abcFilter.cpp:
	-sites		(null)	(File containing sites to keep (chr pos))
	-sites		(null)	(File containing sites to keep (chr regStart regStop))
	-sites		(null)	(File containing sites to keep (chr pos major minor))
	-minInd		0	Only use site if atleast minInd of samples has data
	1) You can force major/minor by -doMajorMinor 3
	And make sure file contains 4 columns (chr tab pos tab major tab minor)
The -sites file is a plain ASCII text file. ANGSD requires a binary index of this file, which is generated with:
angsd sites index your.file
Details
- -sites filename
File containing the sites to include in the analysis. Sites that do not exist in the sequencing data will not be included in the output.
- -minInd [int]
Only keep those sites where we have data for at least this number of individuals.
We support 3 different kinds of input files for filtering. The user can supply:
- a file containing chromosome and position,
- a file containing regions, as chr tab regStart tab regStop,
- a file containing chromosome, position, major and minor.
Only sites contained in the filter file will be output. If you supply an augmented filter file to force the major and minor states, remember to also supply '-doMajorMinor 3'.
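The three layouts differ only in their number of columns, so a file can be classified before it is passed to -sites. This is a rough sketch under assumptions: the file name regions_example.txt is invented, columns are taken to be whitespace-separated, and the check is a convenience that ANGSD itself does not provide.

```shell
# Hypothetical region-style filter file (chr, regStart, regStop).
printf 'chr1\t100\t220\nchr1\t1234\t2234\n' > regions_example.txt

# Classify the file by its column count and flag files that mix layouts.
layout=$(awk '
  NR == 1 { n = NF }
  NF != n { mixed = 1 }
  END {
    if (mixed)       print "mixed column counts"
    else if (n == 2) print "single sites (chr pos)"
    else if (n == 3) print "regions (chr regStart regStop)"
    else if (n == 4) print "sites with major/minor (chr pos major minor)"
    else             print "unrecognised layout"
  }' regions_example.txt)
echo "$layout"
```

A four-column result is a reminder that '-doMajorMinor 3' is needed for the major and minor columns to be used.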
A filter file is supplied to ANGSD with the option
-sites filename
The file must be indexed beforehand:
angsd sites index filename
Example of filter file for single sites
A filter file listing single sites:
chr1	100001
chr1	2500000
chr1	347348
Example of filter file for regions
A filter file listing regions:
chr1	100	220
chr1	1234	2234
chr1	1000000	2000000
Example of augmented filterfile
Example of a file containing major and minor information. The file must be tab separated.
1	728951	T	C
1	752721	A	G
1	754182	A	G
1	754334	T	C
1	760912	C	T
1	776546	G	A
1	779322	G	A
1	838555	A	C
The major and minor states can also be encoded as 0, 1, 2, 3, 4, with 0=A, 1=C, 2=G, 3=T, 4=N.
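To illustrate the numeric encoding, the sketch below rewrites a letter-coded file into the 0 to 4 form. The file names are hypothetical, and the script is only an illustration of the mapping stated above, not a required preprocessing step.

```shell
# Hypothetical augmented filter file with letter-coded major and minor alleles.
printf '1\t728951\tT\tC\n1\t752721\tA\tG\n' > major_minor_example.txt

# Map A, C, G, T, N to 0, 1, 2, 3, 4 in columns 3 and 4; output stays tab separated.
awk 'BEGIN { OFS = "\t"; code["A"] = 0; code["C"] = 1; code["G"] = 2; code["T"] = 3; code["N"] = 4 }
     { print $1, $2, code[toupper($3)], code[toupper($4)] }' \
    major_minor_example.txt > major_minor_numeric.txt
cat major_minor_numeric.txt
```

Letters outside A, C, G, T, N are left as empty fields by this sketch, so the output is worth inspecting before use.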
We do not require the positions to be sorted, but we require that the file is grouped by chromosome name.
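The grouping requirement can be verified with a short check. This is a sketch with an invented file name (grouped_example.txt); it reports any chromosome that reappears after another chromosome has started, which is the case the requirement rules out.

```shell
# Hypothetical file in which chr1 appears in two separate blocks.
printf 'chr1\t500\nchr1\t100\nchr2\t42\nchr1\t900\n' > grouped_example.txt

# Print every chromosome name that starts a new block after having been seen before.
ungrouped=$(awk '$1 != prev { if ($1 in seen) print $1; seen[$1] = 1; prev = $1 }' grouped_example.txt)
echo "chromosomes split across blocks: ${ungrouped:-none}"
```

Positions within a block may appear in any order, as in the first two lines of the sample; only the chromosome column has to be contiguous.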
Internal representation
If a filter file has been supplied as '-sites filter.txt', ANGSD parses the entire filter.txt file, generates binary representations of it, and writes these to the output files
- filter.txt.bin
- filter.txt.idx
These can be printed using the command
angsd sites print filter.txt
Requirements
The file should be sorted according to column 1. This can be achieved by:
sort -k1,1 unsorted.txt >sorted.txt
Bedfiles?
If you want to use BED files, you need to convert them to the native ANGSD format.
BED files are chr pos1 pos2 value and are zero-indexed. Furthermore, pos1 is included but pos2 is not. You can convert to ANGSD format with
awk '{print $1"\t"$2+1"\t"$3}' input.bed >angsd.file
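As a worked example of the conversion, the sketch below runs the same arithmetic on two made-up BED intervals. The file names are hypothetical; the parentheses around $2 + 1 only make the precedence explicit and do not change the result.

```shell
# Hypothetical BED input: zero-indexed start (included), end (excluded), value.
printf 'chr1\t0\t100\t0.5\nchr1\t999\t1000\t1.2\n' > input_example.bed

# Shift the start by one; the exclusive BED end equals the inclusive one-indexed end.
awk '{ print $1 "\t" ($2 + 1) "\t" $3 }' input_example.bed > angsd_example.file
cat angsd_example.file
# chr1	1	100
# chr1	1000	1000
```

The BED interval 0 to 100 covers the first hundred bases, which are positions 1 to 100 in one-indexed coordinates, and the single-base interval 999 to 1000 becomes the region 1000 to 1000.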