ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Sites

Revision as of 12:02, 4 November 2014 by Thorfinn

This page describes the -sites filtering in angsd. It lets the user supply a list of sites to which the analysis will be limited. If you are interested in regions, consider using the -r/-rf options, as described in Filters. With -sites, angsd reads all input data (but limits the analyses to the sites defined by -sites), whereas -r/-rf use the indexing of the BAM files. -sites and -r/-rf can be used in combination.

You should therefore limit your analyses to the chromosomes/scaffolds for which you have sites you want to analyse (that is, the chromosomes in your -sites file). Assuming sites.txt is the file you pass to -sites, this can be done with

cut -f1 sites.txt |sort|uniq >chrs.txt

And append -rf chrs.txt to your argument list.

NB the positions are one-indexed, which means there should never be a position of zero.
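The preparation step and the one-indexing rule can be sketched as follows. The file names example_sites.txt and example_chrs.txt and their contents are made up for illustration; only the cut/sort/uniq pipeline comes from the text above.

```shell
# Small illustrative sites file (tab separated: chromosome, position).
printf 'chr1\t100001\nchr1\t2500000\nchr2\t347348\n' > example_sites.txt

# Positions are one-indexed: count lines whose position is below 1.
bad=$(awk -F'\t' '$2 < 1 { n++ } END { print n + 0 }' example_sites.txt)
echo "sites with position < 1: $bad"

# Unique chromosome names, one per line, for use with -rf.
cut -f1 example_sites.txt | sort | uniq > example_chrs.txt
cat example_chrs.txt
```

The resulting example_chrs.txt is the kind of file that would be passed to -rf.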

Brief overview

--------------
abcFilter.cpp:
	-sites		(null)	(File containing sites to keep (chr pos))
	-sites		(null)	(File containing sites to keep (chr regStart regStop))
	-sites		(null)	(File containing sites to keep (chr pos major minor))
	-minInd		0	Only use site if atleast minInd of samples has data
	1) You can force major/minor by -doMajorMinor 3
	And make sure file contains 4 columns (chr tab pos tab major tab minor)

The -sites file is a plain ASCII text file. ANGSD requires that a binary index version is generated from it. This is done by

angsd sites index your.file

Details

-sites filename

File containing the sites to include in the analysis. Sites that do not exist in the sequencing data will not be included in the output.

-minInd [int]

Only keep those sites where we have data for at least this number of individuals.


We support 3 different kinds of input files for filtering.

  1. A file containing chromosome and position
  2. A file containing regions, as chr tab regStart tab regStop
  3. A file containing chromosome, position, major and minor

Only sites contained in the filter file will be output. If you supply an augmented filter file in order to force a major and minor state, remember to also supply '-doMajorMinor 3'.

A filter file is supplied to ANGSD with the command

-sites filename

And this file should be indexed beforehand

angsd sites index filename

Example of filterfile

Example of a filter file. The file must be tab separated.

chr1	100001
chr1	2500000
chr1	347348


Example of augmented filterfile

Example of a file containing major and minor information. The file must be tab separated.

1	728951	T	C
1	752721	A	G
1	754182	A	G
1	754334	T	C
1	760912	C	T
1	776546	G	A
1	779322	G	A
1	838555	A	C

The major and minor state can also be encoded as 0,1,2,3,4, with 0=A, 1=C, 2=G, 3=T, 4=N.
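To illustrate the numeric encoding, here is a minimal sketch that translates a numerically encoded file back to letters with awk. The file names numeric_sites.txt and letter_sites.txt are hypothetical; the mapping 0=A, 1=C, 2=G, 3=T, 4=N is the one stated above.

```shell
# Two sites with major/minor given as numbers (tab separated).
printf '1\t728951\t3\t1\n1\t752721\t0\t2\n' > numeric_sites.txt

# Map 0,1,2,3,4 to A,C,G,T,N in columns 3 and 4.
awk -F'\t' -v OFS='\t' '
    BEGIN { split("A C G T N", base, " ") }   # base[1]="A" ... base[5]="N"
    { $3 = base[$3 + 1]; $4 = base[$4 + 1]; print }
' numeric_sites.txt > letter_sites.txt

cat letter_sites.txt
```

The output corresponds to the first two lines of the augmented example file shown earlier.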

We do not require the positions to be sorted, but we require that the file is grouped by chromosome name.
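The grouping requirement can be checked before indexing. The following is a sketch only: the helper is_grouped and the two sample files are hypothetical and not part of angsd. It reports whether every chromosome name forms a single contiguous block.

```shell
# Positions unsorted but chromosomes grouped: acceptable.
printf '1\t754182\tA\tG\n1\t728951\tT\tC\n2\t1000\tC\tT\n' > grouped.txt
# Chromosome 1 appears in two separate blocks: not grouped.
printf '1\t728951\tT\tC\n2\t1000\tC\tT\n1\t754182\tA\tG\n' > ungrouped.txt

# Exit status 0 if each chromosome name forms one contiguous block.
is_grouped() {
    awk -F'\t' '
        ($1 "") != (prev "") {          # compare names as strings
            if ($1 in seen) { bad = 1; exit }
            seen[$1] = 1
            prev = $1
        }
        END { exit bad + 0 }
    ' "$1"
}

if is_grouped grouped.txt; then echo "grouped.txt: ok"; fi
if ! is_grouped ungrouped.txt; then echo "ungrouped.txt: not grouped"; fi
```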

Internal representation

If a filter file has been supplied as '-sites filter.txt', then ANGSD will parse the entire filter.txt file, generate binary representations, and write these to the output files called

  1. filter.txt.bin
  2. filter.txt.idx

Requirements

The file should be sorted according to column 1 (the chromosome name). This can be achieved by:

sort -k1 unsorted.txt >sorted.txt


The behaviour described above is deprecated in the newer versions on GitHub. Previously, the binary representation of the sites to keep was generated on the fly. This could cause problems when scripting, since one angsd instance needed to generate the binary representation.

So this has now been changed into a separate pre-run step used only for indexing, like

./angsd sites index your.file (this will generate your.file.idx and your.file.bin)

You can then print out the text version of the binary with

./angsd sites print your.file

And you can then supply the file with -sites your.file
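In a script, the indexing pre-run can be done once, up front, before any analysis is started. The sketch below assumes angsd is on your PATH and uses a hypothetical helper needs_index; the file names your.file.idx and your.file.bin are the ones mentioned above.

```shell
# Return success (0) when either index file for the given sites file is missing.
needs_index() {
    [ ! -f "$1.idx" ] || [ ! -f "$1.bin" ]
}

sites=your.file   # placeholder name taken from the text above

# Index only when needed, and only if angsd is actually available.
if needs_index "$sites" && command -v angsd >/dev/null 2>&1; then
    angsd sites index "$sites"
fi
```

Running the indexing step once before launching several angsd instances avoids the situation described above, where one instance had to generate the binary representation on the fly.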

The input format has also been made much simpler.

2 columns = chr pos
3 columns = chr regStart regStop (both inclusive)
4 columns = chr pos major minor
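Since the format is determined by the number of columns, a quick check of a sites file might look like the sketch below. The helper sites_format and the sample files are hypothetical, and the helper only inspects the first line of the file.

```shell
printf 'chr1\t100001\n' > two.txt
printf 'chr1\t100\t200\n' > three.txt
printf '1\t728951\tT\tC\n' > four.txt

# Print the layout implied by the number of tab-separated columns on line 1.
sites_format() {
    ncol=$(awk -F'\t' 'NR == 1 { print NF; exit }' "$1")
    case "$ncol" in
        2) echo "chr pos" ;;
        3) echo "chr regStart regStop" ;;
        4) echo "chr pos major minor" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

for f in two.txt three.txt four.txt; do
    echo "$f: $(sites_format "$f")"
done
```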