ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Input: Difference between revisions
Line 476: | Line 476: | ||
;-nLines [int]=50 | ;-nLines [int]=50 | ||
Number of lines to read at a time. Increasing this number will affect the RAM use | Number of lines to read at a time. Increasing this number will affect the RAM use | ||
Revision as of 17:23, 22 February 2019
ANGSD currently supports various input formats
<classdiagram type="dir:LR"> [sequence data|BAM;CRAM;mpileup{bg:orange}]-[genotype;likelihoods|VCF;GLF;beagle{bg:orange}] [genotype;likelihoods|VCF;GLF;beagle{bg:orange}]-[genotype;probability|beagle{bg:orange}] </classdiagram>
Below is a short description of those we believe is of most use. Note that CRAM files are used interchangeably as BAM files. So use -bam for supplying both a CRAM list or BAM list or both.
Sequence data (BAM/CRAM/mpileup)
BAM/CRAM
ANGSD accepts BAM/CRAM files for mapped sequences and both are handled using the same -bam option. For information on the file specification and file creation see the samtools website [1]. These are required do be sorted according to reference. To see the options for BAM/CRAM use the command:
./angsd -bam
-> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec 4 2015 10:40:24) -> Analysis helpbox/synopsis information: -> Command: ./angsd -bam -> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec 4 2015 10:40:28) -> Fri Dec 4 10:43:27 2015 --------------- parseArgs_bambi.cpp: bam reader: -r (null) Supply a single region in commandline (see examples below) -rf (null) Supply multiple regions in a file (see examples below) -remove_bads 1 Discard 'bad' reads, (flag >=256) -uniqueOnly 0 Discards reads that doesn't map uniquely -show 0 Mimic 'samtools mpileup' also supply -ref fasta for printing reference column -minMapQ 0 Discard reads with mapping quality below -minQ 13 Discard bases with base quality below -trim 0 Number of based to discard at both ends of the reads -only_proper_pairs 1 Only use reads where the mate could be mapped -C 0 adjust mapQ for excessive mismatches (as SAMtools), supply -ref -baq 0 adjust qscores around indels (as SAMtools), supply -ref -if 2 include flags for each read -df 4 discard flags for each read -checkBamHeaders 1 Exit if difference in BAM headers -doCheck 1 Keep going even if datafile is not suffixed with .bam/.cram -downSample 0.000000 Downsample to the fraction of original data -minChunkSize 250 Minimum size of chunk sent to analyses Examples for region specification: chr: Use entire chromosome: chr chr:start- Use region from start to end of chr chr:-stop Use region from beginning of chromosome: chr to stop chr:start-stop Use region from start to stop from chromosome: chr chr:site Use single site on chromosome: chr Will include read if: includeflag:[2] (beta)each segment properly aligned according to the aligner, Will discard read if: discardflag:[4] (beta)segment unmapped,
Example
Example of estimating allele frequencies from bam files
./angsd -out out -doMaf 2 -bam bam.filelist -doMajorMinor 1 -GL 1 -P 5
Arguments
- -bam [filelist]
- -b [filelist]
The filelist is a file containing the full path for each bam file with one filename per row.
filelist with 6 individuals
/home/software/angsd/test/smallBam/smallNA12763.bam /home/software/angsd/test/smallBam/smallNA11830.bam /home/software/angsd/test/smallBam/smallNA12004.bam /home/software/angsd/test/smallBam/smallNA06985.bam /home/software/angsd/test/smallBam/smallNA11993.bam /home/software/angsd/test/smallBam/smallNA12761.bam
- -r [region]
Specify a region with in a chromosome using the syntax [chr]:[start-stop]. examples
chr1:1-10000 \\ first 10000 based for chr1 chr2:50000- \\chr2 but exclude the first 50000 bases chr11:1- \\all of chr11 chr11: \\all of chr11 chr7:123456 \\position 123456 of chr7
- -rf [region file]
Specify multiple regions in a file using the same syntax as -r
- -remove_bads [int]=1
Same as the samtools flags -x which removes read with a flag above 255 (not primary, failure and duplicate reads). 0 no , 1 remove (default).
- -uniqueOnly [int]=0
Remove reads that have multiple best hits. 0 no (default), 1 remove.
- -minMapQ [int]=0
Minimum mapQ quality.
- -trim [int]=0
Number of bases to remove from both ends of the read.
- -only_proper_pairs [int]=1
Include only proper pairs (pairs of read with both mates mapped correctly). 1: include only proper (default), 0: use all reads. Only relevant for paired end data.
- -C [int] =0
Adjust mapQ for excessive mismatches (as SAMtools), supply -ref.
- -baq [int]=0
Perform BAQ computation, remember to cite the| BAQ paper for this. 0: No BAQ calcualtion
1:normal BAQ (same as default in SAMtools). 2:extended BAQ (same as default in SAMtools).
- -redo-baq=0
if zero then it will use the existing record
You will need to supply your reference (-ref) for BAQ options.
- -checkBamHeaders [int]=1
Exits if the headers are not compatible for all files. 0 no , 1 remove (default). Not performing this check is not advisable
- -downSample [float]=0
Randomly remove reads to downsample your data. 0.25 will on average keep 25% of the reads
- -setMinChunkSize [int]=250
Minimum number of sites to read in before starting to analyze - larger number will use more RAM
Pileup files
Pileup files are the output files that are generated by SAMtools mpileup.
../angsd/angsd -pileup
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec 4 2015 12:17:14) -> Analysis helpbox/synopsis information: -> Command: ../angsd/angsd -pileup -> Fri Dec 4 12:17:53 2015 ---------------- multiReader.cpp: -nLines 50 (Number of lines to read) -bpl 33554432 (bytesPerLine) -beagle (null) (Beagle Filename (can be .gz)) -vcf-GL (null) (vcf Filename (can be .gz)) -vcf-GP (null) (vcf Filename (can be .gz)) -glf (null) (glf Filename (can be .gz)) -pileup (null) (pileup Filename (can be .gz)) -intName 1 (Assume First column is chr_position) -isSim 0 (Simulated data assumes ancestral is A) -nInd 0 (Number of individuals) -minQ 13 (minimum base quality; only used in pileupreader) ---------------- multiReader.cpp:
Example
./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai
Arguments
- -pileup [filename]
name of the pileup file.
A pileup file
1 13999999 N 3 ggg I<B 2 Gg FF 2 Gg F7 6 ggGgGg DBA@=2 1 14000000 N 3 ggg 8EG 2 Gg BF 1 G B 7 ggGgGgg C>B=?:< 1 14000001 N 2 gg <@ 2 Gg AC 2 Gg :< 7 ggGgGgg DBB?832 1 14000002 N 0 2 Cc C1 1 C B 7 ccCcCcc =;A7485 1 14000003 N 2 gg </ 2 Gg <I 2 Gg </ 7 ggGgGgg C<;A84. 1 14000004 N 3 aaa 6C= 2 Aa A9 2 Aa BB 7 aaAaAaa CBA7951 1 14000005 N 2 cc 4; 2 Cc CC 2 Cc @@ 7 ccCcCcc CBAB930 1 14000006 N 3 aaa A9> 2 Aa E< 2 Aa ;C 7 aa$AaAaa D>BC6;: 1 14000007 N 3 ggg 43> 2 Gg BI 2 Gg D@ 6 gGgGgg BB?A.7 1 14000008 N 3 aaa 776 3 Aa^/A :<? 2 Aa BC 6 aAaAaa D>C;:5 1 14000009 N 2 gg 96 3 GgG BFD 2 Gg A< 6 gGgGgg CCA882 1 14000010 N 2 cc 54 3 CcC >;A 2 Cc A: 4 cCcC =A69 1 14000011 N 2 gg :0 3 GgG 9I< 2 Gg <A 6 gGgGgg C6A864 1 14000012 N 3 aaa >F? 3 AaA ?<? 2 Aa BC 5 aAaAa D>B99 1 14000013 N 3 ggg 2== 3 GgG AHD 2 Gg EA 6 gGgGgg C;A@63 1 14000014 N 3 aaa 8.6 3 AaA ?8A 2 Aa 2C 6 aAaAaa C3A88< 1 14000015 N 2 cc CD 3 CcC CEB 2 Cc ?= 6 cCcCcc D4<:=< 1 14000016 N 1 t 5 3 TtT BGC 2 Tt C@ 6 tT$tTtt C38A9> 1 14000017 N 3 ccc 17J 3 CcC BB3 2 Cc B7 5 ccCcc D::B? 1 14000018 N 3 ccc .:. 3 CcC B:B 2 Cc 2; 5 ccCcc <9956
- -nInd [int]
Number of individuals must be specified.
- -fai [filename]
The index to the reference genome.
- -bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals.
- -nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use.
- -minQ [int]=0
Minimum base quality score.
Tutorial
Various softwares can generate pileup format but the most used one is samtools
samtools mpileup -b bam.filelist > sam.mpileup
if you can then use it as input to angsd
./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1
Genotype Likelihood Files
-glf
A simple format for genotype likelihoods: This is the format used by supersim subprogram and the -doglf 1 option in angsd. This format is binary, 10doubles per individual. -nInd therefore needs to be supplied
../angsd/angsd -glf
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec 4 2015 12:17:14) -> Analysis helpbox/synopsis information: -> Command: ../angsd/angsd -pileup -> Fri Dec 4 12:17:53 2015 ---------------- multiReader.cpp: -nLines 50 (Number of lines to read) -bpl 33554432 (bytesPerLine) -beagle (null) (Beagle Filename (can be .gz)) -vcf-GL (null) (vcf Filename (can be .gz)) -vcf-GP (null) (vcf Filename (can be .gz)) -glf (null) (glf Filename (can be .gz)) -pileup (null) (pileup Filename (can be .gz)) -intName 1 (Assume First column is chr_position) -isSim 0 (Simulated data assumes ancestral is A) -nInd 0 (Number of individuals) -minQ 13 (minimum base quality; only used in pileupreader) ---------------- multiReader.cpp:
Example
./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai
Arguments
- -glf [filename]
name of the glf file (gunzipped). Every genotype likelihood is saved as binary double log scaled. In the following order. AA,AC,AG,AT,... for each individual
- -nInd [int]
Number of individuals must be specified.
- -fai [filename]
The index to the reference genome.
- -bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals.
- -nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use.
- -minQ [int]=0
Minimum base quality score.
Tutorial
Simulate genotype likelihoods
supersim -outfiles data -nind 10 -nsites 100000 -errate 0.01 -depth 4
then use it as input to angsd
./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1
make GLF file from the chromosome 1
./angsd -GL 1 -out genolike -doGlf 1 -doMajorMinor 1 -doMaf 2 -SNP_pval 2e-6 -bam bam.filelist -r 1:
recalculate the allele frequencies
./angsd -glf genolike.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 2 -domajorminor 1
-glf10_text
-glf10_text was added in commit: https://github.com/ANGSD/angsd/commit/46fc3edc181e80c4ad5e6bd644a64d23a5012e0e nov2 2017. This allows for reading files in the output format as -doglf 4. This is a simple text file with column 1 and column 2 being chromosome/scaffold and position. Then for each individual there are 10 logscaled genotype likelihoods in the order: AA,AC,AG,AT,CC,CG,CT,GG,GT,TT. Example runs are: First generate an example of this format:
./angsd -gl 1 -doglf 4 -bam list -out first -domajorminor 1 -domaf 1
This generates the file first.glf.gz. Which we can then use as input. Example here:
./angsd -glf10_text first.glf.gz -nind 33 -domaf 1 -domajorminor 1 -fai fai.fai
Notice that -nInd and -fai needs to be supplied.
VCF files
VCF file as input is now included but with some limitations. Only chr,pos,ref,alt and GP/GL tags are being used, and we discard indels and non diallelic sites. Furthermore you are required to include a fai file and the number of individuals.
./angsd -vcf-gl
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec 4 2015 12:17:14) -> Analysis helpbox/synopsis information: -> Command: ./angsd -vcf-gl -> Fri Dec 4 14:35:51 2015 ---------------- multiReader.cpp: -nLines 50 (Number of lines to read) -bpl 33554432 (bytesPerLine) -beagle (null) (Beagle Filename (can be .gz)) -vcf-GL (null) (vcf Filename (can be .gz)) -vcf-GP (null) (vcf Filename (can be .gz)) -glf (null) (glf Filename (can be .gz)) -pileup (null) (pileup Filename (can be .gz)) -intName 1 (Assume First column is chr_position) -isSim 0 (Simulated data assumes ancestral is A) -nInd 0 (Number of individuals) -minQ 13 (minimum base quality; only used in pileupreader) ---------------- multiReader.cpp:
Example
/angsd -vcf-gl file.vcf -fai hg19.fa.gz.fai -nind 10 -domaf 4
Arguments
- -vcf-gl [filename]
name of the vcf file.
A vcf file
##fileformat=VCFv4.2(angsd version) ##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype"> ##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities"> ##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods"> ##FORMAT=<ID=GL,Number=G,Type=Float,Description="scaled Genotype Likelihoods (loglikeratios to the most likely (in log10))"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind0 ind1 1 14000873 . G A . PASS . GP:GL 0.000000,0.137003,0.862997:-15.128970,-1.505169,0.000000 0.716266,0.281975,0.001759:0.000000,-0.301034,-1.800000 1 14001018 . T C . PASS . GP:GL 0.000000,0.081718,0.918282:-13.701492,-1.806203,0.000000 0.850652,0.149348,0.000000:0.000000,-0.602068,-5.699627 1 14001867 . A G . PASS . GP:GL 0.000489,0.727550,0.271961:-3.600000,-0.301034,0.000000 0.914538,0.085462,0.000000:0.000000,-0.903101,-8.859124 1 14002422 . A T . PASS . GP:GL 0.000000,0.291570,0.708430:-9.777061,-0.903101,0.000000 0.767047,0.232952,0.000001:0.000000,-0.602068,-5.499530 1 14002474 . T C . PASS . GP:GL 0.995488,0.004512,0.000000:0.000000,-1.505169,-15.068561 0.965008,0.034992,0.000000:0.000000,-0.602068,-5.899399 1 14003581 . C T . PASS . GP:GL 0.000000,0.674489,0.325510:-7.200000,-0.602068,0.000000 0.992516,0.007484,0.000000:0.000000,-1.806203,-13.447742 1 14004623 . T C . PASS . GP:GL 0.000000,0.588345,0.411654:-6.999968,-0.602068,0.000000 0.989186,0.010814,0.000000:0.000000,-1.806203,-12.574310 1 14007493 . A G . PASS . GP:GL 0.000013,0.541811,0.458176:-5.286503,-0.602068,0.000000 0.398233,0.422941,0.178826:-0.400000,-0.301034,0.000000 1 14007558 . C T . PASS . GP:GL 0.000000,0.091908,0.908092:-15.524007,-1.505169,0.000000 0.284993,0.442044,0.272964:-0.400000,-0.301034,0.000000 1 14007649 . G A . PASS . GP:GL 0.000000,0.638610,0.361390:-7.340205,-0.602068,0.000000 0.779442,0.220538,0.000020:0.000000,-0.301034,-3.500000 1 14008734 . T A . PASS . GP:GL 0.000000,0.280425,0.719575:-13.909454,-1.204135,0.000000 0.757059,0.242817,0.000123:0.000000,-0.301034,-2.800000 1 14009723 . G C . PASS . GP:GL 0.000345,0.744684,0.254971:-3.800000,-0.301034,0.000000 0.744903,0.255042,0.000055:0.000000,-0.301034,-3.200000 1 14010597 . G A . PASS . GP:GL 0.000000,0.063511,0.936489:-17.446187,-1.806203,0.000000 0.684326,0.315309,0.000365:0.000000,-0.301034,-2.600000 1 14010654 . T C . PASS . GP:GL 0.600538,0.348812,0.050650:0.000000,0.000000,0.000000 0.600538,0.348812,0.050650:0.000000,0.000000,0.000000
- -nInd [int]
Number of individuals must be specified.
- -fai [filename]
The index to the reference genome.
- -bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals.
- -nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use.
Tutorial
Create a VCF file using your favorate software or using angsd
./angsd -b bam.filelist -dovcf 1 -gl 1 -dopost 1 -domajorminor 1 -domaf 1 -snp_pval 1e-6
you can then use it as input to angsd if you have the GL info
./angsd -vcf-gl angsdput.vcf.gz -nind 10 -fai hg19.fa.gz.fai -domaf 1
Genotype Probability Files
Beagle format
Genotype probabilities in gz beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [2]. A newer version of beagle uses VCF files.
./angsd -beagle
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec 4 2015 12:17:14) -> Analysis helpbox/synopsis information: -> Command: ./angsd -beagle -> Fri Dec 4 14:03:22 2015 ---------------- multiReader.cpp: -nLines 50 (Number of lines to read) -bpl 33554432 (bytesPerLine) -beagle (null) (Beagle Filename (can be .gz)) -vcf-GL (null) (vcf Filename (can be .gz)) -vcf-GP (null) (vcf Filename (can be .gz)) -glf (null) (glf Filename (can be .gz)) -pileup (null) (pileup Filename (can be .gz)) -intName 1 (Assume First column is chr_position) -isSim 0 (Simulated data assumes ancestral is A) -nInd 0 (Number of individuals) -minQ 13 (minimum base quality; only used in pileupreader) ---------------- multiReader.cpp:
Example
Example of estimating allele frequencies from beagle files
./angsd -out out -doMaf 4 -beagle file.beagle.gprobs.gz -fai ref.fai
Arguments
- -beagle [fileName]
beagle file name. The file must be gzipped. The file format is a single line per site. The first 3 coloums are
- markerName
- alleleA
- alleleB
For each individual 3 columns are added. These three columns should sum to one.
file with two individuals
marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986 chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627 chr9_95759152 C A 1 0 0 0 1 0 chr9_95762332 G A 0.925 0.0734 0.0015 0.894 0.1031 0.0029 chr9_95762333 A T 0.8903 0.1067 0.003 0.811 0.1797 0.0093 chr9_95762343 G T 0.9149 0.0835 0.0017 0.8396 0.1541 0.0064
- -intName [int]=1
default 1. If the SNP name are written as chr_position this information will be parsed. If the SNP name is in another format then use -intName 0.
- -fai [filename]
The index to the reference genome
can also be obtained from the bam header
samtools view -H file.bam | grep SN |cut -f2,3 | sed 's/SN\://g' | sed 's/LN\://g' > ref.fai
- -bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals
- -nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use