ANGSD: Analysis of next generation Sequencing Data
The latest tar.gz version is 0.938 (0.939 on github). See Change_log for changes, and download it here.
Input
ANGSD currently supports various mapped data, genotype likelihood formats and imputed genotype probability files.
BAM files
ANGSD accepts bam files for mapped sequences. For information on the file specification and file creation see the samtools website [1]. The files are required to be sorted according to reference and position.
Brief Overview
./angsd -bam
DEV VERSION might not work thorfinn feb 1 2013 sizeof(funkypars):128
Command:
./angsd -bam
    -> angsd version: 0.569 build(Dec 11 2013 14:47:21)
    -> Analysis helpbox/synopsis information:
---------------
parseArgs_bambi.cpp: bam reader:
    -bam                (null)  Supply a file list of BAMfiles
    -i                  (null)  Supply a single BAMfile
    -r                  (null)  Supply a single region in commandline (see examples below)
    -rf                 (null)  Supply multiple regions in a file (see examples below)
    -remove_bads        1       Discard 'bad' reads, (flag >=255)
    -nInd               0       Only use first nInd from the filelist from the -bam argument
    -nLines             50      Read nLines from files at a time
    -uniqueOnly         0       Discards reads that doesn't map uniquely
    -show               0       Mimic 'samtools mpileup' also supply -ref fasta for printing reference column
    -minMapQ            0       Discard reads with mapping quality below
    -minQ               13      Discard reads with mapping quality below
    -only_proper_pairs  1       Only use reads where the mate could be mapped
    -C                  0       adjust mapQ for excessive mismatches (as SAMtools), supply -ref
    -baq                0       adjust qscores around indels (as SAMtools), supply -ref
Examples for region specification:
    chr:            Use entire chromosome: chr
    chr:start-      Use region from start to end of chr
    chr:-stop       Use region from beginning of chromosome: chr to stop
    chr:start-stop  Use region from start to stop from chromosome: chr
    chr:site        Use single site on chromosome: chr
Arguments
- -bam [filelist]
The filelist is a file containing the full path for each bam file with one filename per row. Example of a filelist with 6 individuals:
/home/software/angsd/test/smallBam/smallNA12763.bam
/home/software/angsd/test/smallBam/smallNA11830.bam
/home/software/angsd/test/smallBam/smallNA12004.bam
/home/software/angsd/test/smallBam/smallNA06985.bam
/home/software/angsd/test/smallBam/smallNA11993.bam
/home/software/angsd/test/smallBam/smallNA12761.bam
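A filelist of this kind can be produced by hand or with a small script. The following is a minimal sketch, not part of ANGSD: the function name write_bam_filelist and its parameters are hypothetical, and it assumes all bam files sit in one directory with a .bam extension.

```python
from pathlib import Path


def write_bam_filelist(bam_dir, out_path, n_ind=None):
    """Write the full path of every *.bam file in bam_dir, one per row.

    Hypothetical helper (not an ANGSD function). If n_ind is given, only
    the first n_ind files, in sorted order, are written.
    """
    # resolve() turns the directory into an absolute path, so every entry
    # in the filelist is a full path as the -bam argument expects.
    bams = sorted(Path(bam_dir).resolve().glob("*.bam"))
    if n_ind is not None:
        bams = bams[:n_ind]
    Path(out_path).write_text("".join(f"{p}\n" for p in bams))
    return [str(p) for p in bams]
```

The resulting file can then be passed to -bam. Sorting the paths keeps the order of individuals reproducible between runs.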
Example of estimating allele frequencies from bam files
./angsd -out out -doMaf 2 -bam bam.filelist -doMajorMinor 1
Optional arguments
- -r [region]
Specify a region within a chromosome using the syntax [chr]:[start-stop]. Examples:
chr1:1-10000 // first 10000 bases of chr1
chr2:50000- // chr2 but exclude the first 50000 bases
chr11:1- // all of chr11
chr7:123456 // position 123456 of chr7
- -only_proper_pairs [int]=1
Include only proper pairs (pairs of reads with both mates mapped correctly). 1: include only proper pairs (default), 0: use all reads. If your data is not paired end you have to choose 1
- -rf [region file]
Specify multiple regions in a file.
- -nLines [int]=50
Number of lines to read per file at a time. Reducing this number decreases RAM usage at a small cost in speed.
- -uniqueOnly [int]=0
Remove reads that have multiple best hits. 0: keep them (default), 1: remove them.
- -remove_bads [int]=1
Same as the samtools flag -x, which removes reads with a flag above 255 (not primary, failure and duplicate reads).
- -minQ [int]=0
Minimum base quality.
- -minMapQ [int]=0
Minimum mapping quality (mapQ).
- -baq [int]=0
Perform baq computation. Remember to cite the baq paper for this.
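The region syntax used by -r (chr:, chr:start-, chr:-stop, chr:start-stop and chr:site) can be checked before launching a long run. The sketch below is an illustration only and does not reproduce ANGSD's own parser: parse_region is a hypothetical name, and it assumes that chromosome names contain no colon and that a name given without a colon means the entire chromosome.

```python
def parse_region(region):
    """Split a region string into (chrom, start, stop).

    Hypothetical helper, not ANGSD code. None marks an open end,
    and a single site is returned with start == stop.
    """
    # Assumption: the chromosome name itself contains no ':'.
    chrom, _, span = region.partition(":")
    if not chrom:
        raise ValueError(f"missing chromosome name in {region!r}")
    if span == "":
        # 'chr:' means the entire chromosome.
        return chrom, None, None
    if "-" not in span:
        # 'chr:site' means a single position.
        site = int(span)
        return chrom, site, site
    start, _, stop = span.partition("-")
    return chrom, int(start) if start else None, int(stop) if stop else None
```

For instance, parse_region("chr2:50000-") gives ("chr2", 50000, None), and parse_region("chr7:123456") gives ("chr7", 123456, 123456). A malformed number raises ValueError, which makes typos in a region list easy to catch early.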
genotype likelihood files
For historical reasons the program can use binary glfv3 files and their text representations. These were generated by old versions of SAMtools and are deprecated in newer versions of SAMtools. Furthermore, for internal use, the program can read the 'inhouse' tglf format files.
These formats are likely to be deprecated in future versions.
glfv3
- -samglf [filename]
Samtools glf format (binary output). Use the pileup -g option in samtools to generate the files. This format is deprecated in newer versions of samtools.
- -samglfclean [filename]
Samtools glf format (text output). Use the pileup -g option in samtools to generate binary files, followed by samtools glfview. This format is deprecated in newer versions of samtools.
tglf
A simple format for genotype likelihoods: every sample is in a separate file, and every genotype likelihood is saved as a binary double, log10 scaled, in the following order: AA, CC, GG, TT, etc.
Since the files contain no information on position, reference or ancestral state, -posi must be supplied in tandem with -tglf.
- -tglf [filename]
- -posi [filename]
An example of a -posi file is:
chr1 58924 0 4
chr1 58925 0 4
chr1 58926 0 4
chr1 58927 0 4
chr1 58928 0 4
genotype probability files
beagle format
Genotype probabilities in gzipped beagle format can be used as input. The format used is the haplotype imputation format output by beagle [2].
options
To include a beagle file use the options
- -beagle [fileName]
beagle file name. The file must be gzipped.
- -intName [int]
Default 1. If the SNP names are written as chr_position, this information will be parsed. If the SNP names are in another format, use -intName 0.
example
The file format is a single line per site. The first 3 columns are
- markerName
- alleleA
- alleleB
For each individual 3 columns are added. These three columns should sum to one.
Example of a file with two individuals
marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986
chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627
chr9_95759152 C A 1 0 0 0 1 0
chr9_95762332 G A 0.925 0.0734 0.0015 0.894 0.1031 0.0029
chr9_95762333 A T 0.8903 0.1067 0.003 0.811 0.1797 0.0093
chr9_95762343 G T 0.9149 0.0835 0.0017 0.8396 0.1541 0.0064
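Before passing a beagle file to ANGSD it can be useful to confirm that it matches the layout described above. The following sketch is not part of ANGSD: the names split_marker and check_beagle and the tolerance tol are assumptions made for illustration. It assumes a gzipped, whitespace-separated file with one header line, and it allows a small rounding error because the probabilities are printed with four decimals.

```python
import gzip


def split_marker(marker):
    """Split a chr_position marker name into (chrom, position).

    Splits on the last underscore, so chromosome names that themselves
    contain underscores are kept intact.
    """
    chrom, _, pos = marker.rpartition("_")
    return chrom, int(pos)


def check_beagle(path, tol=1e-3):
    """Return (n_ind, bad) for a gzipped beagle genotype probability file.

    bad lists (marker, individual index, sum) for every individual whose
    three probabilities differ from one by more than tol (an assumed
    rounding tolerance, not an ANGSD setting).
    """
    bad = []
    with gzip.open(path, "rt") as fh:
        header = fh.readline().split()
        # Three leading columns (marker, alleleA, alleleB), then 3 per individual.
        n_ind = (len(header) - 3) // 3
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            probs = [float(x) for x in fields[3:]]
            for i in range(n_ind):
                total = sum(probs[3 * i:3 * i + 3])
                if abs(total - 1.0) > tol:
                    bad.append((fields[0], i, total))
    return n_ind, bad
```

An empty bad list means every triplet sums to one within the tolerance. split_marker mirrors the chr_position naming that -intName 1 relies on and raises ValueError for names in another format.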
Example of estimating allele frequencies from beagle files
./angsd -out out -doMaf 16 -beagle file.beagle.gprobs.gz