ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Input


ANGSD currently supports various input formats:


<classdiagram type="dir:LR"> [sequence data|BAM;CRAM;mpileup{bg:orange}]-[genotype;likelihoods|VCF;GLF;beagle{bg:orange}] [genotype;likelihoods|VCF;GLF;beagle{bg:orange}]-[genotype;probability|beagle{bg:orange}] </classdiagram>

Below is a short description of those we believe are of most use. Note that CRAM files can be used interchangeably with BAM files, so use -bam to supply a CRAM list, a BAM list, or a list mixing both.


Sequence data (BAM/CRAM/mpileup)

BAM/CRAM

ANGSD accepts BAM/CRAM files for mapped sequences, and both are handled using the same -bam option. For information on the file specification and file creation see the samtools website [1]. The files are required to be sorted according to the reference. To see the options for BAM/CRAM use the command:

./angsd -bam

Example

Example of estimating allele frequencies from bam files

./angsd -out out -doMaf 2 -bam bam.filelist -doMajorMinor 1 -GL 1 -P 5

Arguments

-bam [filelist]
-b [filelist]

The filelist is a file containing the full path for each bam file with one filename per row.


filelist with 6 individuals
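
A minimal sketch of how such a filelist could be generated, assuming all BAM/CRAM files sit in a single directory. The function name write_bam_filelist and the default output name are hypothetical and not part of ANGSD.

```python
from pathlib import Path


def write_bam_filelist(bam_dir, out_path="bam.filelist"):
    """Write the full path of every BAM/CRAM file in bam_dir, one per row."""
    bam_dir = Path(bam_dir).resolve()
    # Sort so that the order of individuals is reproducible between runs.
    paths = sorted(p for p in bam_dir.iterdir() if p.suffix in {".bam", ".cram"})
    Path(out_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The resulting file can then be passed with -bam bam.filelist. Index files such as .bai are skipped because only the final suffix is checked.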

-r [region]

Specify a region within a chromosome using the syntax [chr]:[start-stop]. Examples:

chr1:1-10000             \\ first 10000 bases of chr1
chr2:50000-              \\ chr2 from position 50000 onwards
chr11:1-                 \\ all of chr11
chr11:                   \\ all of chr11
chr7:123456              \\ position 123456 of chr7

-rf [region file]

Specify multiple regions in a file using the same syntax as -r
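
The region syntax can be illustrated with a small parser. This is only a sketch of the [chr]:[start-stop] forms used with -r, not ANGSD's own parsing code. The names parse_region and parse_region_lines are hypothetical, chromosome names are assumed not to contain a colon, and the region file is assumed to hold one region per line.

```python
def parse_region(region):
    """Split '[chr]:[start-stop]' into (chrom, start, stop); None means open-ended."""
    chrom, sep, span = region.strip().partition(":")
    if not chrom or not sep:
        raise ValueError(f"expected [chr]:[start-stop], got {region!r}")
    if span == "":
        return chrom, None, None      # 'chr11:' selects the whole chromosome
    if "-" not in span:
        pos = int(span)
        return chrom, pos, pos        # 'chr7:123456' selects a single position
    start, _, stop = span.partition("-")
    return chrom, int(start) if start else None, int(stop) if stop else None


def parse_region_lines(lines):
    """Parse the lines of a region file (assumption: one region per line)."""
    return [parse_region(line) for line in lines if line.strip()]
```

For example, parse_region("chr2:50000-") gives ("chr2", 50000, None), where None stands for "until the end of the chromosome".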

-remove_bads [int]=1

Same as the samtools flag -x, which removes reads with a flag above 255 (not primary, failure and duplicate reads). 0: keep, 1: remove (default).

-uniqueOnly [int]=0

Remove reads that have multiple best hits. 0 no (default), 1 remove.

-minMapQ [int]=0

Minimum mapQ quality.

-trim [int]=0

Number of bases to remove from both ends of the read.

-only_proper_pairs [int]=1

Include only proper pairs (pairs of reads with both mates mapped correctly). 1: include only proper pairs (default), 0: use all reads. If your data is not paired-end you have to choose 1.

-C [int]=0

Adjust mapQ for excessive mismatches (as in SAMtools); requires -ref.

-baq [int]=0

Perform BAQ computation; remember to cite the BAQ paper for this.

0: no BAQ calculation (default).

1: standard BAQ (will downgrade scores).

2: extended BAQ (might also upgrade scores).

You will need to supply your reference (-ref) for BAQ options.

-checkBamHeaders [int]=1

Exits if the headers are not compatible for all files. 0: no check, 1: check (default). Skipping this check is not advisable.

-downSample [float]=0

Randomly remove reads to downsample your data. 0.25 will on average keep 25% of the reads.

-minChunkSize [int]=250

Minimum number of sites to read in before starting to analyze. A larger number will use more RAM.

Pileup files

Pileup files are the output files that are generated by SAMtools mpileup.

../angsd/angsd -pileup

Example

./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai

Arguments

-pileup [filename]

name of the pileup file.

A pileup file


-nInd [int]

Number of individuals must be specified.
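
Since -nInd has to match the file, it can help to count the individuals in the pileup first. The sketch below assumes default samtools mpileup output, where each line holds chromosome, position and reference base followed by three tab-separated columns (depth, bases, qualities) per individual. The function name count_pileup_individuals is hypothetical.

```python
def count_pileup_individuals(line):
    """Infer the number of individuals from one line of default mpileup output."""
    fields = line.rstrip("\n").split("\t")
    # Skip the three leading columns: chromosome, position, reference base.
    n_sample_cols = len(fields) - 3
    if n_sample_cols <= 0 or n_sample_cols % 3:
        raise ValueError("unexpected column count for default mpileup output")
    return n_sample_cols // 3
```

If mpileup was run with options that add extra per-sample columns, the count of three columns per individual no longer holds and the function raises or miscounts.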

-fai [filename]

The index to the reference genome.

-bpl [int]=33554432

maximum bytes per line. Increase if the pileup has many individuals.

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use.

-minQ [int]=0

Minimum base quality score.

Tutorial

Various programs can generate the pileup format, but the most widely used is samtools:

samtools mpileup -b bam.filelist > sam.mpileup

You can then use it as input to angsd:

./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1

Genotype Likelihood Files

-glf

A simple format for genotype likelihoods. Every genotype likelihood is saved as a log-scaled binary double, in the following order for each individual: AA,AC,AG,AT,...

-glf [filename]

NB: remember to supply a -fai file and the number of individuals with -nInd.

This is the format used by the supersim subprogram and the -doglf 1 option in angsd.
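
A rough sketch of how such a file could be decoded, under several assumptions that are not stated above: the bytes are already uncompressed, the doubles are 8-byte little-endian, and the genotype order continues alphabetically after AT to give ten genotypes per individual. The names GENOTYPES and read_glf_site are hypothetical.

```python
import struct

# Assumption: the elided order continues alphabetically over the ten
# unordered genotypes.
GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]


def read_glf_site(buf, n_ind, site=0):
    """Return one {genotype: log-likelihood} dict per individual for one site."""
    per_site = n_ind * len(GENOTYPES)
    offset = site * per_site * 8          # 8 bytes per double
    values = struct.unpack_from(f"<{per_site}d", buf, offset)
    return [
        dict(zip(GENOTYPES, values[i * 10:(i + 1) * 10]))
        for i in range(n_ind)
    ]
```

The number of individuals cannot be recovered from the bytes alone, which is consistent with -nInd being required for this input.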

VCF files

VCF files are now supported as input, but with some limitations. Only the chr, pos, ref, alt fields and the GP/GL tags are used, and indels and non-diallelic sites are discarded. Furthermore, you are required to supply a fai file and the number of individuals.

#for using GL tags
./angsd -vcf-gl ../1000g/ref.r1274.vcf -fai fai.fai -nind 181 -domaf 1 -out two
#for using GP tags
./angsd -vcf-gp ../1000g/ref.r1274.vcf -fai fai.fai -nind 181 -domaf 1 -out two


#for using GL tags
./angsd -vcf-gl

NB: Version 4.2 of the VCF specification clarifies that GP should be phred-scaled posterior probabilities of the genotypes, but most software appears to write GP on a non-phred scale, so ANGSD uses the raw GP value. The GL tag is interpreted as log10.
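
The difference between the two tags can be sketched as follows. This only illustrates the scales described in the note; it is not ANGSD's internal code. The function names are hypothetical, missing values such as "." are not handled, and rescaling the likelihoods to sum to one is a choice made for the example.

```python
def gl_to_likelihoods(gl_field):
    """Turn a comma-separated log10 GL string into likelihoods that sum to one."""
    log10_gl = [float(x) for x in gl_field.split(",")]
    best = max(log10_gl)
    # Subtract the maximum before exponentiating to avoid underflow.
    raw = [10 ** (x - best) for x in log10_gl]
    total = sum(raw)
    return [r / total for r in raw]


def gp_to_probs(gp_field):
    """Read a GP string as raw (non-phred) probabilities, with no rescaling."""
    return [float(x) for x in gp_field.split(",")]
```

For a GL value of "0,-1,-2" the unscaled likelihoods are 1, 0.1 and 0.01, so the first genotype carries 1/1.11 of the total after rescaling.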


Example

./angsd -vcf-gl file.vcf -fai hg19.fa.gz.fai -nind 10 -domaf 1

Arguments

-vcf-gl [filename]

name of the vcf file.

A vcf file


-nInd [int]

Number of individuals must be specified.

-fai [filename]

The index to the reference genome.

-bpl [int]=33554432

Maximum bytes per line. Increase if the input file has many individuals.

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use.

-minQ [int]=0

Minimum base quality score.


Genotype Probability Files

Genotype probabilities in gzipped beagle format can be used as input. The format used is the haplotype imputation format output by beagle [2]. Newer versions of beagle use VCF files.

./angsd -beagle

Example

Example of estimating allele frequencies from beagle files

./angsd -out out -doMaf 4 -beagle file.beagle.gprobs.gz -fai ref.fai


Arguments

-beagle [fileName]

Beagle file name. The file must be gzipped. The file format is a single line per site. The first 3 columns are

  • markerName
  • alleleA
  • alleleB

For each individual 3 columns are added. These three columns should sum to one.
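
The layout described above can be checked with a short script before running ANGSD. This is a sketch under assumptions: fields are whitespace-separated, the file starts with one header line (disable with has_header=False), and a tolerance of 1e-3 is acceptable for the sum. The names check_beagle_line and check_beagle_file are hypothetical.

```python
import gzip


def check_beagle_line(line, tol=1e-3):
    """Check one site; return the number of individuals found on the line."""
    fields = line.split()
    marker = fields[0]                    # fields[1:3] are alleleA and alleleB
    probs = [float(x) for x in fields[3:]]
    if not probs or len(probs) % 3:
        raise ValueError(f"{marker}: expected 3 columns per individual")
    for i in range(0, len(probs), 3):
        total = sum(probs[i:i + 3])
        if abs(total - 1.0) > tol:
            raise ValueError(f"{marker}: individual {i // 3} sums to {total}")
    return len(probs) // 3


def check_beagle_file(path, has_header=True, tol=1e-3):
    """Check every site of a gzipped file; return (sites, individuals)."""
    n_sites, n_ind = 0, None
    with gzip.open(path, "rt") as handle:
        if has_header:
            next(handle, None)            # assumed header line, skipped unparsed
        for line in handle:
            if not line.strip():
                continue
            found = check_beagle_line(line, tol)
            if n_ind is not None and found != n_ind:
                raise ValueError("sites differ in number of individuals")
            n_ind = found
            n_sites += 1
    return n_sites, n_ind
```

Calling check_beagle_file("file.beagle.gprobs.gz") returns the number of sites and individuals, or raises on the first site that breaks the layout.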

file with two individuals

-intName [int]=1

Default 1. If the SNP names are written as chr_position, this information will be parsed. If the SNP names are in another format, use -intName 0.
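
The chr_position naming convention can be illustrated with a small helper. This is a sketch, not ANGSD's parser: the name parse_marker_name is hypothetical, and splitting on the last underscore (so that chromosome names containing underscores still work) is a choice made for this example.

```python
def parse_marker_name(marker):
    """Split a 'chr_position' marker name into (chromosome, position)."""
    # Split on the last underscore so the chromosome part may contain one.
    chrom, sep, pos = marker.rpartition("_")
    if not sep or not chrom or not pos.isdigit():
        raise ValueError(f"{marker!r} is not in chr_position format")
    return chrom, int(pos)
```

A marker such as rs12345 raises an error here, which corresponds to the case where -intName 0 would be the appropriate setting.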

-fai [filename]

The index to the reference genome. It can also be obtained from the bam header.

-bpl [int]=33554432

Maximum bytes per line. Increase if the input file has many individuals.

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use.