ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Input: Difference between revisions
Line 117: | Line 117: | ||
=Pileup files= | =Pileup files= | ||
<div class="toccolours mw-collapsible mw-collapsed"> | |||
../angsd/angsd -pileup | |||
<pre class="mw-collapsible-content"> | |||
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec 4 2015 12:17:14) | |||
-> Analysis helpbox/synopsis information: | |||
-> Command: | |||
../angsd/angsd -pileup -> Fri Dec 4 12:17:53 2015 | |||
---------------- | |||
multiReader.cpp: | |||
-nLines 50 (Number of lines to read) | |||
-bpl 33554432 (bytesPerLine) | |||
-beagle (null) (Beagle Filename (can be .gz)) | |||
-vcf-GL (null) (vcf Filename (can be .gz)) | |||
-vcf-GP (null) (vcf Filename (can be .gz)) | |||
-glf (null) (glf Filename (can be .gz)) | |||
-pileup (null) (pileup Filename (can be .gz)) | |||
-intName 1 (Assume First column is chr_position) | |||
-isSim 0 (Simulated data assumes ancestral is A) | |||
-nInd 0 (Number of individuals) | |||
-minQ 13 (minimum base quality; only used in pileupreader) | |||
---------------- | |||
multiReader.cpp: | |||
</pre> | |||
</div> | |||
===Example=== | |||
<pre> | |||
./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai | |||
</pre> | |||
===Arguments=== | |||
;-pileup [filename] | |||
name of the pileup file. | |||
<div class="toccolours mw-collapsible mw-collapsed"> | |||
A pileup file | |||
<pre class="mw-collapsible-content"> | |||
1 13999999 N 3 ggg I<B 2 Gg FF 2 Gg F7 6 ggGgGg DBA@=2 | |||
1 14000000 N 3 ggg 8EG 2 Gg BF 1 G B 7 ggGgGgg C>B=?:< | |||
1 14000001 N 2 gg <@ 2 Gg AC 2 Gg :< 7 ggGgGgg DBB?832 | |||
1 14000002 N 0 2 Cc C1 1 C B 7 ccCcCcc =;A7485 | |||
1 14000003 N 2 gg </ 2 Gg <I 2 Gg </ 7 ggGgGgg C<;A84. | |||
1 14000004 N 3 aaa 6C= 2 Aa A9 2 Aa BB 7 aaAaAaa CBA7951 | |||
1 14000005 N 2 cc 4; 2 Cc CC 2 Cc @@ 7 ccCcCcc CBAB930 | |||
1 14000006 N 3 aaa A9> 2 Aa E< 2 Aa ;C 7 aa$AaAaa D>BC6;: | |||
1 14000007 N 3 ggg 43> 2 Gg BI 2 Gg D@ 6 gGgGgg BB?A.7 | |||
1 14000008 N 3 aaa 776 3 Aa^/A :<? 2 Aa BC 6 aAaAaa D>C;:5 | |||
1 14000009 N 2 gg 96 3 GgG BFD 2 Gg A< 6 gGgGgg CCA882 | |||
1 14000010 N 2 cc 54 3 CcC >;A 2 Cc A: 4 cCcC =A69 | |||
1 14000011 N 2 gg :0 3 GgG 9I< 2 Gg <A 6 gGgGgg C6A864 | |||
1 14000012 N 3 aaa >F? 3 AaA ?<? 2 Aa BC 5 aAaAa D>B99 | |||
1 14000013 N 3 ggg 2== 3 GgG AHD 2 Gg EA 6 gGgGgg C;A@63 | |||
1 14000014 N 3 aaa 8.6 3 AaA ?8A 2 Aa 2C 6 aAaAaa C3A88< | |||
1 14000015 N 2 cc CD 3 CcC CEB 2 Cc ?= 6 cCcCcc D4<:=< | |||
1 14000016 N 1 t 5 3 TtT BGC 2 Tt C@ 6 tT$tTtt C38A9> | |||
1 14000017 N 3 ccc 17J 3 CcC BB3 2 Cc B7 5 ccCcc D::B? | |||
1 14000018 N 3 ccc .:. 3 CcC B:B 2 Cc 2; 5 ccCcc <9956 | |||
</pre> | |||
</div> | |||
;-nInd [int] | |||
Number of individuals must be specified. | |||
;-bpl [int]=33554432 | |||
maximum bytes per line. Increase if the pileup has many individuals | |||
;-nLines [int]=50 | |||
Number of lines to read at a time. Increasing this number will affect the RAM use | |||
;-minQ [int]=0 | |||
Minimum base quality score | |||
===Tutorial=== | |||
Various softwares can generate pileup format but the most used one is samtools | |||
samtools mpileup -b bam.filelist > sam.mpileup | |||
../angsd/angsd -pileup sam.mpileup -nInd 10 -fai chimpHg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1 | |||
=Genotype Likelihood Files= | =Genotype Likelihood Files= |
Revision as of 12:31, 4 December 2015
ANGSD currently supports various input formats
<classdiagram type="dir:LR"> [sequence data|BAM;CRAM;mpileup{bg:orange}]-[genotype;likelihoods|VCF;GLFv3{bg:orange}] [genotype;likelihoods|VCF;GLFv3{bg:orange}]-[genotype;probability|beagle imputation{bg:orange}] </classdiagram>
Below is a short description of those we believe is of most use. Note that CRAM files are used interchangeably as BAM files. So use -bam for supplying both a CRAM list or BAM list or both.
Sequence data (BAM/CRAM/mpileup)
BAM/CRAM
ANGSD accepts BAM/CRAM files for mapped sequences and both are handled using the same -bam option. For information on the file specification and file creation see the samtools website [1]. These are required do be sorted according to reference. To see the options for BAM/CRAM use the command:
./angsd -bam
-> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec 4 2015 10:40:24) -> Analysis helpbox/synopsis information: -> Command: ./angsd -bam -> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec 4 2015 10:40:28) -> Fri Dec 4 10:43:27 2015 --------------- parseArgs_bambi.cpp: bam reader: -r (null) Supply a single region in commandline (see examples below) -rf (null) Supply multiple regions in a file (see examples below) -remove_bads 1 Discard 'bad' reads, (flag >=256) -uniqueOnly 0 Discards reads that doesn't map uniquely -show 0 Mimic 'samtools mpileup' also supply -ref fasta for printing reference column -minMapQ 0 Discard reads with mapping quality below -minQ 13 Discard bases with base quality below -trim 0 Number of based to discard at both ends of the reads -only_proper_pairs 1 Only use reads where the mate could be mapped -C 0 adjust mapQ for excessive mismatches (as SAMtools), supply -ref -baq 0 adjust qscores around indels (as SAMtools), supply -ref -if 2 include flags for each read -df 4 discard flags for each read -checkBamHeaders 1 Exit if difference in BAM headers -doCheck 1 Keep going even if datafile is not suffixed with .bam/.cram -downSample 0.000000 Downsample to the fraction of original data -minChunkSize 250 Minimum size of chunk sent to analyses Examples for region specification: chr: Use entire chromosome: chr chr:start- Use region from start to end of chr chr:-stop Use region from beginning of chromosome: chr to stop chr:start-stop Use region from start to stop from chromosome: chr chr:site Use single site on chromosome: chr Will include read if: includeflag:[2] (beta)each segment properly aligned according to the aligner, Will discard read if: discardflag:[4] (beta)segment unmapped,
Example
Example of estimating allele frequencies from bam files
./angsd -out out -doMaf 2 -bam bam.filelist -doMajorMinor 1 -GL 1 -P 5
Arguments
- -bam [filelist]
- -b [filelist]
The filelist is a file containing the full path for each bam file with one filename per row.
filelist with 6 individuals
/home/software/angsd/test/smallBam/smallNA12763.bam /home/software/angsd/test/smallBam/smallNA11830.bam /home/software/angsd/test/smallBam/smallNA12004.bam /home/software/angsd/test/smallBam/smallNA06985.bam /home/software/angsd/test/smallBam/smallNA11993.bam /home/software/angsd/test/smallBam/smallNA12761.bam
- -r [region]
Specify a region with in a chromosome using the syntax [chr]:[start-stop]. examples
chr1:1-10000 \\ first 10000 based for chr1 chr2:50000- \\chr2 but exclude the first 50000 bases chr11:1- \\all of chr11 chr11: \\all of chr11 chr7:123456 \\position 123456 of chr7
- -rf [region file]
specify multiple regions in a file using the same syntax as -r
- -remove_bads [int]=1
Same as the samtools flags -x which removes read with a flag above 255 (not primary, failure and duplicate reads). 0 no , 1 remove (default)
- -uniqueOnly [int]=0
remove reads that have multiple best hits. 0 no (default), 1 remove
- -minMapQ [int]=0
minimum mapQ quality
- -trim [int]=0
Number of bases to remove from both ends of the read.
- -only_proper_pairs [int]=1
Include only proper pairs (pairs of read with both mates mapped correctly). 1: include only proper (default), 0: use all reads. If your data is not paired end you have to choose 1
- -C [int] =0
adjust mapQ for excessive mismatches (as SAMtools), supply -ref
- -baq [int]=0
perform baq computation, remember to cite the baq paper for this.
- -checkBamHeaders [int]=1
Exits if the headers are not compatible for all files. 0 no , 1 remove (default). Not performing this check is not advisable
- -downSample [float]=0
Randomly remove reads to downsample your data. 0.25 will on average keep 25% of the reads
- -minChunkSize [int]=250
Minimum number of sites to read in before starting to analyze - larger number will use more RAM
Pileup files
../angsd/angsd -pileup
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec 4 2015 12:17:14) -> Analysis helpbox/synopsis information: -> Command: ../angsd/angsd -pileup -> Fri Dec 4 12:17:53 2015 ---------------- multiReader.cpp: -nLines 50 (Number of lines to read) -bpl 33554432 (bytesPerLine) -beagle (null) (Beagle Filename (can be .gz)) -vcf-GL (null) (vcf Filename (can be .gz)) -vcf-GP (null) (vcf Filename (can be .gz)) -glf (null) (glf Filename (can be .gz)) -pileup (null) (pileup Filename (can be .gz)) -intName 1 (Assume First column is chr_position) -isSim 0 (Simulated data assumes ancestral is A) -nInd 0 (Number of individuals) -minQ 13 (minimum base quality; only used in pileupreader) ---------------- multiReader.cpp:
Example
./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai
Arguments
- -pileup [filename]
name of the pileup file.
A pileup file
1 13999999 N 3 ggg I<B 2 Gg FF 2 Gg F7 6 ggGgGg DBA@=2 1 14000000 N 3 ggg 8EG 2 Gg BF 1 G B 7 ggGgGgg C>B=?:< 1 14000001 N 2 gg <@ 2 Gg AC 2 Gg :< 7 ggGgGgg DBB?832 1 14000002 N 0 2 Cc C1 1 C B 7 ccCcCcc =;A7485 1 14000003 N 2 gg </ 2 Gg <I 2 Gg </ 7 ggGgGgg C<;A84. 1 14000004 N 3 aaa 6C= 2 Aa A9 2 Aa BB 7 aaAaAaa CBA7951 1 14000005 N 2 cc 4; 2 Cc CC 2 Cc @@ 7 ccCcCcc CBAB930 1 14000006 N 3 aaa A9> 2 Aa E< 2 Aa ;C 7 aa$AaAaa D>BC6;: 1 14000007 N 3 ggg 43> 2 Gg BI 2 Gg D@ 6 gGgGgg BB?A.7 1 14000008 N 3 aaa 776 3 Aa^/A :<? 2 Aa BC 6 aAaAaa D>C;:5 1 14000009 N 2 gg 96 3 GgG BFD 2 Gg A< 6 gGgGgg CCA882 1 14000010 N 2 cc 54 3 CcC >;A 2 Cc A: 4 cCcC =A69 1 14000011 N 2 gg :0 3 GgG 9I< 2 Gg <A 6 gGgGgg C6A864 1 14000012 N 3 aaa >F? 3 AaA ?<? 2 Aa BC 5 aAaAa D>B99 1 14000013 N 3 ggg 2== 3 GgG AHD 2 Gg EA 6 gGgGgg C;A@63 1 14000014 N 3 aaa 8.6 3 AaA ?8A 2 Aa 2C 6 aAaAaa C3A88< 1 14000015 N 2 cc CD 3 CcC CEB 2 Cc ?= 6 cCcCcc D4<:=< 1 14000016 N 1 t 5 3 TtT BGC 2 Tt C@ 6 tT$tTtt C38A9> 1 14000017 N 3 ccc 17J 3 CcC BB3 2 Cc B7 5 ccCcc D::B? 1 14000018 N 3 ccc .:. 3 CcC B:B 2 Cc 2; 5 ccCcc <9956
- -nInd [int]
Number of individuals must be specified.
- -bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals
- -nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use
- -minQ [int]=0
Minimum base quality score
Tutorial
Various softwares can generate pileup format but the most used one is samtools samtools mpileup -b bam.filelist > sam.mpileup ../angsd/angsd -pileup sam.mpileup -nInd 10 -fai chimpHg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1
Genotype Likelihood Files
-glf
A simple format for genotype likelihoods: Every genotype likelihood is saved as binary double log scaled. In the following order. AA,AC,AG,AT,... for each individual
- -glf [filename]
- NB and remember to supply a -fai file and number of individuals with -nInd
This is the format used by supersim subprogram and the -doglf 1 option in angsd
VCF files
VCF file as input is now included but with some limitiations. Only chr,pos,ref,alt and gp/gl tags are being used, and we discard indels and non diallelic sites. Futhermore you are required to include a fai file and the number of individuals.
#for using GL tags ./angsd -vcf-gl ../1000g/ref.r1274.vcf -fai fai.fai -nind 181 -domaf 1 -out two #for using GP tags ./angsd -vcf-gp ../1000g/ref.r1274.vcf -fai fai.fai -nind 181 -domaf 1 -out two
- NB The 4.2 version of the vcf specifiation clarifies that GP should be phred scaled post probs of the genotypes. But it seems that most software is using non-phred scale. So ANGSD uses the raw GP value. The GL tag is interpreted as log10.
- NB you are required to supply a fai file. Otherwise the program will give a warning (this will be changed in future versions).
Genotype Probability Files
Genotype probabilities in gz beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [2].
- NB you are required to supply a fai file. Otherwise the program will give the following warning (this will be changed in future versions).
Brief Overview
./angsd -beagle Command: ./angsd -beagle -> angsd version: 0.569 build(Dec 11 2013 17:27:32) ---------------- beagleReader.cpp: -beagle (null) (Beagle Filename (can be .gz)) -intName=0 (Assume First column is chr_position) Use -chunkSize for defining how many sites to use at a time Use -fai to supply a fai file
options
To include a beagle file use the options
- -beagle [fileName]
beagle file name. The file must be gzipped.
- -intName [int]
default 1. If the SNP name are written as chr_position this information will be parsed. If the SNP name is in another format then use -intName 0.
Example
The file format is a single linje per site. The first 3 coloums are
- markerName
- alleleA
- alleleB
For each individual 3 coloums are added. These three colums should sum to one.
Example of a file with two individuals
marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986 chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627 chr9_95759152 C A 1 0 0 0 1 0 chr9_95762332 G A 0.925 0.0734 0.0015 0.894 0.1031 0.0029 chr9_95762333 A T 0.8903 0.1067 0.003 0.811 0.1797 0.0093 chr9_95762343 G T 0.9149 0.0835 0.0017 0.8396 0.1541 0.0064
Example of estimating allele frequencies from beagle files
./angsd -out out -doMaf 4 -beagle file.beagle.gprobs.gz -fai ref.fai
the reference fai can be optained by indexing the reference genome or by using a bam files header
samtools view -H bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | grep SN |cut -f2,3 | sed 's/SN\://g' | sed 's/LN\://g' > ref.fai