ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Input: Difference between revisions

From angsd
Jump to navigation Jump to search
No edit summary
 
(102 intermediate revisions by 2 users not shown)
Line 1: Line 1:
ANSGD currently supports various mapped data, genotype likelihood formats and imputed genotype probability files.
ANGSD currently supports various input formats


=Mapped sequence file=


==bam files==
ANGSD accepts bam files for mapped sequences. For imformation on the file specification and file creation see the samtools website [http://samtools.sourceforge.net/]


===arguments===
<classdiagram type="dir:LR">
[sequence data|BAM;CRAM;mpileup{bg:orange}]-[genotype;likelihoods|VCF;GLF;beagle{bg:orange}]
[genotype;likelihoods|VCF;GLF;beagle{bg:orange}]-[genotype;probability|beagle{bg:orange}]
</classdiagram>
 
Below is a short description of those we believe is of most use. Note that CRAM files are used interchangeably as BAM files. So use -bam for supplying both a CRAM list or BAM list or both.
 
 
=Sequence data (BAM/CRAM/mpileup)=
==BAM/CRAM==
 
ANGSD accepts BAM/CRAM files for mapped sequences and both are handled using the same -bam option. For information on the file specification and file creation see the samtools website [http://samtools.sourceforge.net/]. These are required do be sorted according to reference. To see the options for BAM/CRAM use the command:
 
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -bam
<pre class="mw-collapsible-content">
-> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec  4 2015 10:40:24)
-> Analysis helpbox/synopsis information:
-> Command:
./angsd -bam
-> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec  4 2015 10:40:28)
-> Fri Dec  4 10:43:27 2015
---------------
parseArgs_bambi.cpp: bam reader:
-r (null) Supply a single region in commandline (see examples below)
-rf (null) Supply multiple regions in a file (see examples below)
-remove_bads 1 Discard 'bad' reads, (flag >=256)
-uniqueOnly 0 Discards reads that doesn't map uniquely
-show 0 Mimic 'samtools mpileup' also supply -ref fasta for printing reference column
-minMapQ 0 Discard reads with mapping quality below
-minQ 13 Discard bases with base quality below
-trim 0 Number of based to discard at both ends of the reads
-only_proper_pairs 1 Only use reads where the mate could be mapped
-C 0 adjust mapQ for excessive mismatches (as SAMtools), supply -ref
-baq 0 adjust qscores around indels (as SAMtools), supply -ref
-if 2 include flags for each read
-df 4 discard flags for each read
-checkBamHeaders 1 Exit if difference in BAM headers
-doCheck 1 Keep going even if datafile is not suffixed with .bam/.cram
-downSample 0.000000 Downsample to the fraction of original data
-minChunkSize 250 Minimum size of chunk sent to analyses
 
Examples for region specification:
chr: Use entire chromosome: chr
chr:start- Use region from start to end of chr
chr:-stop Use region from beginning of chromosome: chr to stop
chr:start-stop Use region from start to stop from chromosome: chr
chr:site Use single site on chromosome: chr
Will include read if:
includeflag:[2] (beta)each segment properly aligned according to the aligner,
Will discard read if:
discardflag:[4] (beta)segment unmapped,
</pre>
</div>
 
===Example===
Example of estimating allele frequencies from bam files
 
<pre>
./angsd -out out -doMaf 2 -bam bam.filelist -doMajorMinor 1 -GL 1 -P 5
</pre>
 
===Arguments===




;-bam [filelist]
;-bam [filelist]
;-b [filelist]


The filelist is a file containing the full path for each bam file with one filename per row.  
The filelist is a file containing the full path for each bam file with one filename per row.  


Example of a filelist with 6 individuals


<pre>
<div class="toccolours mw-collapsible mw-collapsed">
filelist with 6 individuals
<pre class="mw-collapsible-content">
/home/software/angsd/test/smallBam/smallNA12763.bam
/home/software/angsd/test/smallBam/smallNA12763.bam
/home/software/angsd/test/smallBam/smallNA11830.bam
/home/software/angsd/test/smallBam/smallNA11830.bam
Line 22: Line 83:
/home/software/angsd/test/smallBam/smallNA11993.bam
/home/software/angsd/test/smallBam/smallNA11993.bam
/home/software/angsd/test/smallBam/smallNA12761.bam
/home/software/angsd/test/smallBam/smallNA12761.bam
</pre>
</div>
;-r [region]
Specify a region with in a chromosome using the syntax [chr]:[start-stop]. examples
chr1:1-10000            \\ first 10000 based for chr1
chr2:50000-              \\chr2 but exclude the first 50000 bases
chr11:1-                \\all of chr11
chr11:                  \\all of chr11
chr7:123456              \\position 123456 of chr7
;-rf [region file]
Specify multiple regions in a file using the same syntax as -r
;-remove_bads [int]=1
Same as  the samtools flags -x which removes read with a flag above 255 (not primary, failure and duplicate reads).  0 no , 1 remove (default).
;-uniqueOnly [int]=0
Remove reads that have multiple best hits. 0 no (default), 1 remove.
;-minMapQ [int]=0
Minimum mapQ quality.
;-trim [int]=0
Number of bases to remove from both ends of the read.
;-only_proper_pairs [int]=1
Include only proper pairs (pairs of read with both mates mapped correctly).  1: include only proper (default), 0: use all reads. Only relevant for paired end data.
;-C [int] =0
Adjust mapQ for excessive mismatches (as SAMtools), supply -ref.
;-baq [int]=0
Perform BAQ computation, remember to cite the[http://bioinformatics.oxfordjournals.org/content/early/2011/02/13/bioinformatics.btr076 | BAQ paper] for this.
0: No BAQ calcualtion
1:normal BAQ (same as default in SAMtools).
2:extended BAQ (same as default in SAMtools).
;-redo-baq=0
if zero then it will use the existing record
You will need to supply your reference (-ref) for BAQ options.
;-checkBamHeaders [int]=1
Exits if the headers are not compatible for all files. 0 no , 1 remove (default). Not performing this check is not advisable
;-downSample [float]=0
Randomly remove reads to downsample your data. 0.25 will on average keep 25% of the reads
;-setMinChunkSize [int]=250
Minimum number of sites to read in before starting to analyze - larger number will use more RAM
==Pileup files==
Pileup files are the output files that are generated by SAMtools mpileup.
<div class="toccolours mw-collapsible mw-collapsed">
../angsd/angsd -pileup
<pre class="mw-collapsible-content">
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
-> Analysis helpbox/synopsis information:
-> Command:
../angsd/angsd -pileup -> Fri Dec  4 12:17:53 2015
----------------
multiReader.cpp:
-nLines 50 (Number of lines to read)
-bpl 33554432 (bytesPerLine)
-beagle (null) (Beagle Filename (can be .gz))
-vcf-GL (null) (vcf Filename (can be .gz))
-vcf-GP (null) (vcf Filename (can be .gz))
-glf (null) (glf Filename (can be .gz))
-pileup (null) (pileup Filename (can be .gz))
-intName 1 (Assume First column is chr_position)
-isSim 0 (Simulated data assumes ancestral is A)
-nInd 0 (Number of individuals)
-minQ 13 (minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:
</pre>
</div>
===Example===
<pre>
./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai
</pre>
===Arguments===
;-pileup [filename]
name of the pileup file.
<div class="toccolours mw-collapsible mw-collapsed">
A pileup file
<pre class="mw-collapsible-content">
1 13999999 N 3 ggg I<B 2 Gg FF 2 Gg F7 6 ggGgGg DBA@=2
1 14000000 N 3 ggg 8EG 2 Gg BF 1 G B 7 ggGgGgg C>B=?:<
1 14000001 N 2 gg <@ 2 Gg AC 2 Gg :< 7 ggGgGgg DBB?832
1 14000002 N 0 2 Cc C1 1 C B 7 ccCcCcc =;A7485
1 14000003 N 2 gg </ 2 Gg <I 2 Gg </ 7 ggGgGgg C<;A84.
1 14000004 N 3 aaa 6C= 2 Aa A9 2 Aa BB 7 aaAaAaa CBA7951
1 14000005 N 2 cc 4; 2 Cc CC 2 Cc @@ 7 ccCcCcc CBAB930
1 14000006 N 3 aaa A9> 2 Aa E< 2 Aa ;C 7 aa$AaAaa D>BC6;:
1 14000007 N 3 ggg 43> 2 Gg BI 2 Gg D@ 6 gGgGgg BB?A.7
1 14000008 N 3 aaa 776 3 Aa^/A :<? 2 Aa BC 6 aAaAaa D>C;:5
1 14000009 N 2 gg 96 3 GgG BFD 2 Gg A< 6 gGgGgg CCA882
1 14000010 N 2 cc 54 3 CcC >;A 2 Cc A: 4 cCcC =A69
1 14000011 N 2 gg :0 3 GgG 9I< 2 Gg <A 6 gGgGgg C6A864
1 14000012 N 3 aaa >F? 3 AaA ?<? 2 Aa BC 5 aAaAa D>B99
1 14000013 N 3 ggg 2== 3 GgG AHD 2 Gg EA 6 gGgGgg C;A@63
1 14000014 N 3 aaa 8.6 3 AaA ?8A 2 Aa 2C 6 aAaAaa C3A88<
1 14000015 N 2 cc CD 3 CcC CEB 2 Cc ?= 6 cCcCcc D4<:=<
1 14000016 N 1 t 5 3 TtT BGC 2 Tt C@ 6 tT$tTtt C38A9>
1 14000017 N 3 ccc 17J 3 CcC BB3 2 Cc B7 5 ccCcc D::B?
1 14000018 N 3 ccc .:. 3 CcC B:B 2 Cc 2; 5 ccCcc <9956
</pre>
</div>
;-nInd [int]
Number of individuals must be specified.
;-fai [filename]
The index to the reference genome.
;-bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals.
;-nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use.
;-minQ [int]=0
Minimum base quality score.
===Tutorial===
Various softwares can generate pileup format but the most used one is samtools
<pre>
samtools mpileup -b bam.filelist > sam.mpileup
</pre>
if you can then use it as input to angsd
<pre>
./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1
</pre>
=BCF/VCF files=
BCF/VCF file as input is now included but with some limitations. Only chr,pos and PL tags are being used, and we discard indels.
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -vcf-gl
<pre class="mw-collapsible-content">
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
-> Analysis helpbox/synopsis information:
-> Command:
./angsd -vcf-gl -> Fri Dec  4 14:35:51 2015
----------------
multiReader.cpp:
-nLines 50 (Number of lines to read)
-bpl 33554432 (bytesPerLine)
-beagle (null) (Beagle Filename (can be .gz))
-vcf-GL (null) (vcf Filename (can be .gz))
-vcf-GP (null) (vcf Filename (can be .gz))
-glf (null) (glf Filename (can be .gz))
-pileup (null) (pileup Filename (can be .gz))
-intName 1 (Assume First column is chr_position)
-isSim 0 (Simulated data assumes ancestral is A)
-nInd 0 (Number of individuals)
-minQ 13 (minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:
</pre>
</div>
===Example===
<pre>
angsd -vcf-gl ../smallBam/small2.bcf -domajorminor 1 -domaf 1
</pre>
===Arguments===
;-vcf-gl [filename]
name of the vcf file.
<div class="toccolours mw-collapsible mw-collapsed">
A vcf file
<pre class="mw-collapsible-content">
##fileformat=VCFv4.2(angsd version)
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="scaled Genotype Likelihoods (loglikeratios to the most likely (in log10))">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind0 ind1
1 14000873 . G A . PASS . GP:GL 0.000000,0.137003,0.862997:-15.128970,-1.505169,0.000000 0.716266,0.281975,0.001759:0.000000,-0.301034,-1.800000
1 14001018 . T C . PASS . GP:GL 0.000000,0.081718,0.918282:-13.701492,-1.806203,0.000000 0.850652,0.149348,0.000000:0.000000,-0.602068,-5.699627
1 14001867 . A G . PASS . GP:GL 0.000489,0.727550,0.271961:-3.600000,-0.301034,0.000000 0.914538,0.085462,0.000000:0.000000,-0.903101,-8.859124
1 14002422 . A T . PASS . GP:GL 0.000000,0.291570,0.708430:-9.777061,-0.903101,0.000000 0.767047,0.232952,0.000001:0.000000,-0.602068,-5.499530
1 14002474 . T C . PASS . GP:GL 0.995488,0.004512,0.000000:0.000000,-1.505169,-15.068561 0.965008,0.034992,0.000000:0.000000,-0.602068,-5.899399
1 14003581 . C T . PASS . GP:GL 0.000000,0.674489,0.325510:-7.200000,-0.602068,0.000000 0.992516,0.007484,0.000000:0.000000,-1.806203,-13.447742
1 14004623 . T C . PASS . GP:GL 0.000000,0.588345,0.411654:-6.999968,-0.602068,0.000000 0.989186,0.010814,0.000000:0.000000,-1.806203,-12.574310
1 14007493 . A G . PASS . GP:GL 0.000013,0.541811,0.458176:-5.286503,-0.602068,0.000000 0.398233,0.422941,0.178826:-0.400000,-0.301034,0.000000
1 14007558 . C T . PASS . GP:GL 0.000000,0.091908,0.908092:-15.524007,-1.505169,0.000000 0.284993,0.442044,0.272964:-0.400000,-0.301034,0.000000
1 14007649 . G A . PASS . GP:GL 0.000000,0.638610,0.361390:-7.340205,-0.602068,0.000000 0.779442,0.220538,0.000020:0.000000,-0.301034,-3.500000
1 14008734 . T A . PASS . GP:GL 0.000000,0.280425,0.719575:-13.909454,-1.204135,0.000000 0.757059,0.242817,0.000123:0.000000,-0.301034,-2.800000
1 14009723 . G C . PASS . GP:GL 0.000345,0.744684,0.254971:-3.800000,-0.301034,0.000000 0.744903,0.255042,0.000055:0.000000,-0.301034,-3.200000
1 14010597 . G A . PASS . GP:GL 0.000000,0.063511,0.936489:-17.446187,-1.806203,0.000000 0.684326,0.315309,0.000365:0.000000,-0.301034,-2.600000
1 14010654 . T C . PASS . GP:GL 0.600538,0.348812,0.050650:0.000000,0.000000,0.000000 0.600538,0.348812,0.050650:0.000000,0.000000,0.000000
</pre>
</div>
;-nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use.
===Tutorial===
Create a VCF file using your favorate software or using angsd
<pre>
./angsd -b bam.filelist -dovcf 1 -gl 1 -dopost 1 -domajorminor 1 -domaf 1 -snp_pval 1e-6
</pre>
you can then use it as input to angsd if you have the GL info
<pre>
./angsd -vcf-gl angsdput.vcf.gz -nind 10 -fai hg19.fa.gz.fai -domaf 1
</pre>
</pre>


Example of estimating allele frequencies from bam files
=Genotype Likelihood Files=
 
==-glf==
A simple format for genotype likelihoods: This is the format used by ''supersim'' subprogram and the ''-doglf 1'' option in angsd.
This format is binary, 10doubles per individual. -nInd therefore needs to be supplied
 
<div class="toccolours mw-collapsible mw-collapsed">
../angsd/angsd -glf
<pre class="mw-collapsible-content">
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
-> Analysis helpbox/synopsis information:
-> Command:
../angsd/angsd -pileup -> Fri Dec  4 12:17:53 2015
----------------
multiReader.cpp:
-nLines 50 (Number of lines to read)
-bpl 33554432 (bytesPerLine)
-beagle (null) (Beagle Filename (can be .gz))
-vcf-GL (null) (vcf Filename (can be .gz))
-vcf-GP (null) (vcf Filename (can be .gz))
-glf (null) (glf Filename (can be .gz))
-pileup (null) (pileup Filename (can be .gz))
-intName 1 (Assume First column is chr_position)
-isSim 0 (Simulated data assumes ancestral is A)
-nInd 0 (Number of individuals)
-minQ 13 (minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:
</pre>
</div>
 
===Example===
 


<pre>
<pre>
./angsd -out out -doMaf 2 -bam bam.filelist
./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai
</pre>
</pre>


===optional arguments===
===Arguments===
;-r [region]
 
Specify a region with in a chromosome using the syntax [chr]:[start-stop]. examples
;-glf [filename]:
chr1:1-10000            \\ first 10000 based for chr1
name of the glf file (gunzipped).  
chr2:50000-               \\chr2 but exclude the first 50000 bases
Every genotype likelihood is saved as binary double log scaled. In the following order. AA,AC,AG,AT,... for each individual
chr11:1-                  \\all of chr11
;-nInd [int]
;-only_proper_pairs [int]=0
Number of individuals must be specified.
Include only proper pairs (pairs of read with both mates mapped correctly). 0: include only proper (default), 1: use all reads. If your data is not paired end you have to choose 1
;-fai [filename]
;-rf [region file]
The index to the reference genome.
specify multiple regions in a file.  
;-bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals.
;-nLines [int]=50
;-nLines [int]=50
Number of lines to read per file at a time. Reducing this number will decrease the RAM usage with a small cost to the speed.
Number of lines to read at a time. Increasing this number will affect the RAM use.
;-uniqueOnly [int]=0
;-minQ [int]=0
remove reads that have multiple best hits.. 0 no (default), 1 remove
Minimum base quality score.
;-remove_bads [int]=1
 
Same as  the samtools flags -x which removes read with a flag above 255 (not primary, failure and duplicate reads)
===Tutorial===
 
Simulate genotype likelihoods
 
<pre>
supersim -outfiles data -nind 10 -nsites 100000 -errate 0.01 -depth 4
</pre>
 
then use it as input to angsd
 
<pre>
./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1
</pre>


=genotype likelihood files=
For historical reasons the program can use binary glfv3 files, and the text representations. These were generated from old versions of SAMtools, and is deprecated in newer versions of SAMtools.
Futhermore for internal use we can read the 'inhouse' tglf format files.


These formats are likely to be deprecated in future versions.
make GLF file from the chromosome 1
==glfv3==
;-samglf [filename]:
Samtools glf format (binary output). use the pileup -g options in samtools to generate the files. This format is deprecated in newer versions of samtools.
Samtools glf format (text output). use the pileup -g options in samtools to generate binary files followed the use of the samtools glfview. This format is deprecated in newer versions of samtools.
;-samglfclean [filename]:
==tglf==
A simple format for genotype likelihoods:
Every sample is in seperate files, Every genotype is saved as binary double log10 scaled. in the following order. AA,CC,GG,TT, etc


Since there are no information on position,reference,ancestral the -pos must be supplied in tandem with -tglf
<pre>
;-tglf [filename]:
./angsd -GL 1 -out genolike -doGlf 1 -doMajorMinor 1  -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist -r 1:
;-posi [filename]:
</pre>


An example of a -posi file is:
recalculate the allele frequencies


<code>
<pre>
chr1 58924 0 4
./angsd -glf genolike.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 2 -domajorminor 1
chr1 58925 0 4
</pre>
chr1 58926 0 4
chr1 58927 0 4
chr1 58928 0 4
</code>


=genotype probability files=
==-glf10_text==
-glf10_text was added in commit: https://github.com/ANGSD/angsd/commit/46fc3edc181e80c4ad5e6bd644a64d23a5012e0e nov2 2017.
This allows for reading files in the output format as -doglf 4.
This is a simple text file with column 1 and column 2 being chromosome/scaffold and position. Then for each individual there are 10 logscaled genotype likelihoods in the order: AA,AC,AG,AT,CC,CG,CT,GG,GT,TT.
Example runs are:
First generate an example of this format:
<pre>
./angsd -gl 1 -doglf 4 -bam list -out first -domajorminor 1 -domaf 1
</pre>


==beagle format==
This generates the file first.glf.gz. Which we can then use as input.
Genotype probabilities in beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [http://faculty.washington.edu/browning/beagle/beagle.html].  
Example here:
<pre>
./angsd -glf10_text first.glf.gz -nind 33 -domaf 1 -domajorminor 1 -fai fai.fai
</pre>
Notice that -nInd and -fai needs to be supplied.


===options===
=Genotype Probability Files=
To include a beagle file us the option
==Beagle format==
Genotype probabilities in gz beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [http://faculty.washington.edu/browning/beagle/beagle.html]. A newer version of beagle uses VCF files. 


;-beagle [file]
<div class="toccolours mw-collapsible mw-collapsed">
./angsd -beagle
<pre class="mw-collapsible-content">
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
-> Analysis helpbox/synopsis information:
-> Command:
./angsd -beagle -> Fri Dec  4 14:03:22 2015
----------------
multiReader.cpp:
-nLines 50 (Number of lines to read)
-bpl 33554432 (bytesPerLine)
-beagle (null) (Beagle Filename (can be .gz))
-vcf-GL (null) (vcf Filename (can be .gz))
-vcf-GP (null) (vcf Filename (can be .gz))
-glf (null) (glf Filename (can be .gz))
-pileup (null) (pileup Filename (can be .gz))
-intName 1 (Assume First column is chr_position)
-isSim 0 (Simulated data assumes ancestral is A)
-nInd 0 (Number of individuals)
-minQ 13 (minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:
</pre>
</div>
===Example===


===example===
Example of estimating allele frequencies from beagle files


The file format is a single linje per site. The first 3 coloums are
<pre>
./angsd -out out -doMaf 4 -beagle file.beagle.gprobs.gz -fai ref.fai
</pre>
 
 
 
===Arguments===
 
; -beagle [fileName]
beagle file name. The file must be gzipped.
The file format is a single line per site. The first 3 coloums are
* markerName
* markerName
* alleleA
* alleleA
* alleleB
* alleleB


For each individual 3 coloums are added. These three colums should sum to one.  
For each individual 3 columns are added. These three columns should sum to one.  


Example of a file with two individuals
<div class="toccolours mw-collapsible mw-collapsed">
 
file with two individuals
<pre>
<pre class="mw-collapsible-content">
marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986
marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986
chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627
chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627
Line 104: Line 451:
chr9_95762343 G T 0.9149 0.0835 0.0017 0.8396 0.1541 0.0064
chr9_95762343 G T 0.9149 0.0835 0.0017 0.8396 0.1541 0.0064
</pre>
</pre>
 
</div>
Example of estimating allele frequencies from beagle files
; -intName [int]=1
 
default 1. If the SNP name are written as chr_position this information will be parsed. If the SNP name is in another format then use -intName 0.
<pre>
;-fai [filename]
./angsd -out out -doMaf 16 -beagle file.beagle
The index to the reference genome
<div class="toccolours mw-collapsible mw-collapsed">
can also be obtained from the bam header
<pre class="mw-collapsible-content">
samtools view -file.bam | grep SN |cut -f2,3 | sed 's/SN\://g' |  sed 's/LN\://g' > ref.fai
</pre>
</pre>
</div>
;-bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals
;-nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use

Latest revision as of 09:57, 28 September 2021

ANGSD currently supports various input formats


<classdiagram type="dir:LR"> [sequence data|BAM;CRAM;mpileup{bg:orange}]-[genotype;likelihoods|VCF;GLF;beagle{bg:orange}] [genotype;likelihoods|VCF;GLF;beagle{bg:orange}]-[genotype;probability|beagle{bg:orange}] </classdiagram>

Below is a short description of those we believe is of most use. Note that CRAM files are used interchangeably as BAM files. So use -bam for supplying both a CRAM list or BAM list or both.


Sequence data (BAM/CRAM/mpileup)

BAM/CRAM

ANGSD accepts BAM/CRAM files for mapped sequences and both are handled using the same -bam option. For information on the file specification and file creation see the samtools website [1]. These are required do be sorted according to reference. To see the options for BAM/CRAM use the command:

./angsd -bam

	-> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec  4 2015 10:40:24)
	-> Analysis helpbox/synopsis information:
	-> Command: 
./angsd -bam 
	-> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec  4 2015 10:40:28)
	-> Fri Dec  4 10:43:27 2015
---------------
parseArgs_bambi.cpp: bam reader:
	-r		(null)	Supply a single region in commandline (see examples below)
	-rf		(null)	Supply multiple regions in a file (see examples below)
	-remove_bads	1	Discard 'bad' reads, (flag >=256) 
	-uniqueOnly	0	Discards reads that doesn't map uniquely
	-show		0	Mimic 'samtools mpileup' also supply -ref fasta for printing reference column
	-minMapQ	0	Discard reads with mapping quality below
	-minQ		13	Discard bases with base quality below
	-trim		0	Number of based to discard at both ends of the reads
	-only_proper_pairs	1	Only use reads where the mate could be mapped
	-C		0	adjust mapQ for excessive mismatches (as SAMtools), supply -ref
	-baq		0	adjust qscores around indels (as SAMtools), supply -ref
	-if		2	include flags for each read
	-df		4	discard flags for each read
	-checkBamHeaders	1	Exit if difference in BAM headers
	-doCheck	1	Keep going even if datafile is not suffixed with .bam/.cram
	-downSample	0.000000	Downsample to the fraction of original data
	-minChunkSize	250	Minimum size of chunk sent to analyses

Examples for region specification:
		chr:		Use entire chromosome: chr
		chr:start-	Use region from start to end of chr
		chr:-stop	Use region from beginning of chromosome: chr to stop
		chr:start-stop	Use region from start to stop from chromosome: chr
		chr:site	Use single site on chromosome: chr
Will include read if:
	includeflag:[2] (beta)each segment properly aligned according to the aligner, 
Will discard read if:
	discardflag:[4] (beta)segment unmapped, 

Example

Example of estimating allele frequencies from bam files

./angsd -out out -doMaf 2 -bam bam.filelist -doMajorMinor 1 -GL 1 -P 5

Arguments

-bam [filelist]
-b [filelist]

The filelist is a file containing the full path for each bam file with one filename per row.


filelist with 6 individuals

/home/software/angsd/test/smallBam/smallNA12763.bam
/home/software/angsd/test/smallBam/smallNA11830.bam
/home/software/angsd/test/smallBam/smallNA12004.bam
/home/software/angsd/test/smallBam/smallNA06985.bam
/home/software/angsd/test/smallBam/smallNA11993.bam
/home/software/angsd/test/smallBam/smallNA12761.bam
-r [region]

Specify a region with in a chromosome using the syntax [chr]:[start-stop]. examples

chr1:1-10000             \\ first 10000 based for chr1
chr2:50000-              \\chr2 but exclude the first 50000 bases
chr11:1-                 \\all of chr11
chr11:                   \\all of chr11
chr7:123456              \\position 123456 of chr7
-rf [region file]

Specify multiple regions in a file using the same syntax as -r

-remove_bads [int]=1

Same as the samtools flags -x which removes read with a flag above 255 (not primary, failure and duplicate reads). 0 no , 1 remove (default).

-uniqueOnly [int]=0

Remove reads that have multiple best hits. 0 no (default), 1 remove.

-minMapQ [int]=0

Minimum mapQ quality.

-trim [int]=0

Number of bases to remove from both ends of the read.

-only_proper_pairs [int]=1

Include only proper pairs (pairs of read with both mates mapped correctly). 1: include only proper (default), 0: use all reads. Only relevant for paired end data.

-C [int] =0

Adjust mapQ for excessive mismatches (as SAMtools), supply -ref.

-baq [int]=0

Perform BAQ computation, remember to cite the| BAQ paper for this. 0: No BAQ calcualtion

1:normal BAQ (same as default in SAMtools). 2:extended BAQ (same as default in SAMtools).

-redo-baq=0

if zero then it will use the existing record

You will need to supply your reference (-ref) for BAQ options.

-checkBamHeaders [int]=1

Exits if the headers are not compatible for all files. 0 no , 1 remove (default). Not performing this check is not advisable

-downSample [float]=0

Randomly remove reads to downsample your data. 0.25 will on average keep 25% of the reads

-setMinChunkSize [int]=250

Minimum number of sites to read in before starting to analyze - larger number will use more RAM

Pileup files

Pileup files are the output files that are generated by SAMtools mpileup.

../angsd/angsd -pileup

	-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
	-> Analysis helpbox/synopsis information:
	-> Command: 
../angsd/angsd -pileup 	-> Fri Dec  4 12:17:53 2015
----------------
multiReader.cpp:
	-nLines	50	(Number of lines to read)
	-bpl	33554432 (bytesPerLine)
	-beagle	(null)	(Beagle Filename (can be .gz))
	-vcf-GL	(null)	(vcf Filename (can be .gz))
	-vcf-GP	(null)	(vcf Filename (can be .gz))
	-glf	(null)	(glf Filename (can be .gz))
	-pileup	(null)	(pileup Filename (can be .gz))
	-intName 1	(Assume First column is chr_position)
	-isSim	0	(Simulated data assumes ancestral is A)
	-nInd	0		(Number of individuals)
	-minQ	13	(minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:

Example

./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai

Arguments

-pileup [filename]

name of the pileup file.

A pileup file

1	13999999	N	3	ggg	I<B	2	Gg	FF	2	Gg	F7	6	ggGgGg	DBA@=2
1	14000000	N	3	ggg	8EG	2	Gg	BF	1	G	B	7	ggGgGgg	C>B=?:<
1	14000001	N	2	gg	<@	2	Gg	AC	2	Gg	:<	7	ggGgGgg	DBB?832
1	14000002	N	0			2	Cc	C1	1	C	B	7	ccCcCcc	=;A7485
1	14000003	N	2	gg	</	2	Gg	<I	2	Gg	</	7	ggGgGgg	C<;A84.
1	14000004	N	3	aaa	6C=	2	Aa	A9	2	Aa	BB	7	aaAaAaa	CBA7951
1	14000005	N	2	cc	4;	2	Cc	CC	2	Cc	@@	7	ccCcCcc	CBAB930
1	14000006	N	3	aaa	A9>	2	Aa	E<	2	Aa	;C	7	aa$AaAaa	D>BC6;:
1	14000007	N	3	ggg	43>	2	Gg	BI	2	Gg	D@	6	gGgGgg	BB?A.7
1	14000008	N	3	aaa	776	3	Aa^/A	:<?	2	Aa	BC	6	aAaAaa	D>C;:5
1	14000009	N	2	gg	96	3	GgG	BFD	2	Gg	A<	6	gGgGgg	CCA882
1	14000010	N	2	cc	54	3	CcC	>;A	2	Cc	A:	4	cCcC	=A69
1	14000011	N	2	gg	:0	3	GgG	9I<	2	Gg	<A	6	gGgGgg	C6A864
1	14000012	N	3	aaa	>F?	3	AaA	?<?	2	Aa	BC	5	aAaAa	D>B99
1	14000013	N	3	ggg	2==	3	GgG	AHD	2	Gg	EA	6	gGgGgg	C;A@63
1	14000014	N	3	aaa	8.6	3	AaA	?8A	2	Aa	2C	6	aAaAaa	C3A88<
1	14000015	N	2	cc	CD	3	CcC	CEB	2	Cc	?=	6	cCcCcc	D4<:=<
1	14000016	N	1	t	5	3	TtT	BGC	2	Tt	C@	6	tT$tTtt	C38A9>
1	14000017	N	3	ccc	17J	3	CcC	BB3	2	Cc	B7	5	ccCcc	D::B?
1	14000018	N	3	ccc	.:.	3	CcC	B:B	2	Cc	2;	5	ccCcc	<9956


-nInd [int]

Number of individuals must be specified.

-fai [filename]

The index to the reference genome.

-bpl [int]=33554432

maximum bytes per line. Increase if the pileup has many individuals.

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use.

-minQ [int]=0

Minimum base quality score.

Tutorial

Various softwares can generate pileup format but the most used one is samtools

samtools mpileup -b bam.filelist > sam.mpileup

if you can then use it as input to angsd

./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1

BCF/VCF files

BCF/VCF file as input is now included but with some limitations. Only chr,pos and PL tags are being used, and we discard indels.

./angsd -vcf-gl

	-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
	-> Analysis helpbox/synopsis information:
	-> Command: 
./angsd -vcf-gl 	-> Fri Dec  4 14:35:51 2015
----------------
multiReader.cpp:
	-nLines	50	(Number of lines to read)
	-bpl	33554432 (bytesPerLine)
	-beagle	(null)	(Beagle Filename (can be .gz))
	-vcf-GL	(null)	(vcf Filename (can be .gz))
	-vcf-GP	(null)	(vcf Filename (can be .gz))
	-glf	(null)	(glf Filename (can be .gz))
	-pileup	(null)	(pileup Filename (can be .gz))
	-intName 1	(Assume First column is chr_position)
	-isSim	0	(Simulated data assumes ancestral is A)
	-nInd	0		(Number of individuals)
	-minQ	13	(minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:


Example

angsd -vcf-gl ../smallBam/small2.bcf -domajorminor 1 -domaf 1

Arguments

-vcf-gl [filename]

name of the vcf file.

A vcf file

##fileformat=VCFv4.2(angsd version)
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="scaled Genotype Likelihoods (loglikeratios to the most likely (in log10))">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	ind0	ind1
1	14000873	.	G	A	.	PASS	.	GP:GL	0.000000,0.137003,0.862997:-15.128970,-1.505169,0.000000	0.716266,0.281975,0.001759:0.000000,-0.301034,-1.800000
1	14001018	.	T	C	.	PASS	.	GP:GL	0.000000,0.081718,0.918282:-13.701492,-1.806203,0.000000	0.850652,0.149348,0.000000:0.000000,-0.602068,-5.699627
1	14001867	.	A	G	.	PASS	.	GP:GL	0.000489,0.727550,0.271961:-3.600000,-0.301034,0.000000	0.914538,0.085462,0.000000:0.000000,-0.903101,-8.859124
1	14002422	.	A	T	.	PASS	.	GP:GL	0.000000,0.291570,0.708430:-9.777061,-0.903101,0.000000	0.767047,0.232952,0.000001:0.000000,-0.602068,-5.499530
1	14002474	.	T	C	.	PASS	.	GP:GL	0.995488,0.004512,0.000000:0.000000,-1.505169,-15.068561	0.965008,0.034992,0.000000:0.000000,-0.602068,-5.899399
1	14003581	.	C	T	.	PASS	.	GP:GL	0.000000,0.674489,0.325510:-7.200000,-0.602068,0.000000	0.992516,0.007484,0.000000:0.000000,-1.806203,-13.447742
1	14004623	.	T	C	.	PASS	.	GP:GL	0.000000,0.588345,0.411654:-6.999968,-0.602068,0.000000	0.989186,0.010814,0.000000:0.000000,-1.806203,-12.574310
1	14007493	.	A	G	.	PASS	.	GP:GL	0.000013,0.541811,0.458176:-5.286503,-0.602068,0.000000	0.398233,0.422941,0.178826:-0.400000,-0.301034,0.000000
1	14007558	.	C	T	.	PASS	.	GP:GL	0.000000,0.091908,0.908092:-15.524007,-1.505169,0.000000	0.284993,0.442044,0.272964:-0.400000,-0.301034,0.000000
1	14007649	.	G	A	.	PASS	.	GP:GL	0.000000,0.638610,0.361390:-7.340205,-0.602068,0.000000	0.779442,0.220538,0.000020:0.000000,-0.301034,-3.500000
1	14008734	.	T	A	.	PASS	.	GP:GL	0.000000,0.280425,0.719575:-13.909454,-1.204135,0.000000	0.757059,0.242817,0.000123:0.000000,-0.301034,-2.800000
1	14009723	.	G	C	.	PASS	.	GP:GL	0.000345,0.744684,0.254971:-3.800000,-0.301034,0.000000	0.744903,0.255042,0.000055:0.000000,-0.301034,-3.200000
1	14010597	.	G	A	.	PASS	.	GP:GL	0.000000,0.063511,0.936489:-17.446187,-1.806203,0.000000	0.684326,0.315309,0.000365:0.000000,-0.301034,-2.600000
1	14010654	.	T	C	.	PASS	.	GP:GL	0.600538,0.348812,0.050650:0.000000,0.000000,0.000000	0.600538,0.348812,0.050650:0.000000,0.000000,0.000000
-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use.

Tutorial

Create a VCF file using your favorate software or using angsd

./angsd -b bam.filelist -dovcf 1 -gl 1 -dopost 1 -domajorminor 1 -domaf 1 -snp_pval 1e-6

you can then use it as input to angsd if you have the GL info

./angsd -vcf-gl angsdput.vcf.gz -nind 10 -fai hg19.fa.gz.fai -domaf 1

Genotype Likelihood Files

-glf

A simple format for genotype likelihoods: This is the format used by supersim subprogram and the -doglf 1 option in angsd. This format is binary, 10doubles per individual. -nInd therefore needs to be supplied

../angsd/angsd -glf

	-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
	-> Analysis helpbox/synopsis information:
	-> Command: 
../angsd/angsd -pileup 	-> Fri Dec  4 12:17:53 2015
----------------
multiReader.cpp:
	-nLines	50	(Number of lines to read)
	-bpl	33554432 (bytesPerLine)
	-beagle	(null)	(Beagle Filename (can be .gz))
	-vcf-GL	(null)	(vcf Filename (can be .gz))
	-vcf-GP	(null)	(vcf Filename (can be .gz))
	-glf	(null)	(glf Filename (can be .gz))
	-pileup	(null)	(pileup Filename (can be .gz))
	-intName 1	(Assume First column is chr_position)
	-isSim	0	(Simulated data assumes ancestral is A)
	-nInd	0		(Number of individuals)
	-minQ	13	(minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:

Example

./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai

Arguments

-glf [filename]

name of the glf file (gunzipped). Every genotype likelihood is saved as binary double log scaled. In the following order. AA,AC,AG,AT,... for each individual

-nInd [int]

Number of individuals must be specified.

-fai [filename]

The index to the reference genome.

-bpl [int]=33554432

maximum bytes per line. Increase if the pileup has many individuals.

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use.

-minQ [int]=0

Minimum base quality score.

Tutorial

Simulate genotype likelihoods

supersim -outfiles data -nind 10 -nsites 100000 -errate 0.01 -depth 4

then use it as input to angsd

./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1


make GLF file from the chromosome 1

./angsd -GL 1 -out genolike -doGlf 1 -doMajorMinor 1  -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist -r 1:

recalculate the allele frequencies

./angsd -glf genolike.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 2 -domajorminor 1

-glf10_text

-glf10_text was added in commit: https://github.com/ANGSD/angsd/commit/46fc3edc181e80c4ad5e6bd644a64d23a5012e0e nov2 2017. This allows for reading files in the output format as -doglf 4. This is a simple text file with column 1 and column 2 being chromosome/scaffold and position. Then for each individual there are 10 logscaled genotype likelihoods in the order: AA,AC,AG,AT,CC,CG,CT,GG,GT,TT. Example runs are: First generate an example of this format:

./angsd -gl 1 -doglf 4 -bam list -out first -domajorminor 1 -domaf 1

This generates the file first.glf.gz. Which we can then use as input. Example here:

 ./angsd -glf10_text first.glf.gz -nind 33 -domaf 1 -domajorminor 1 -fai fai.fai 

Notice that -nInd and -fai needs to be supplied.

Genotype Probability Files

Beagle format

Genotype probabilities in gz beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [2]. A newer version of beagle uses VCF files.

./angsd -beagle

	-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
	-> Analysis helpbox/synopsis information:
	-> Command: 
./angsd -beagle 	-> Fri Dec  4 14:03:22 2015
----------------
multiReader.cpp:
	-nLines	50	(Number of lines to read)
	-bpl	33554432 (bytesPerLine)
	-beagle	(null)	(Beagle Filename (can be .gz))
	-vcf-GL	(null)	(vcf Filename (can be .gz))
	-vcf-GP	(null)	(vcf Filename (can be .gz))
	-glf	(null)	(glf Filename (can be .gz))
	-pileup	(null)	(pileup Filename (can be .gz))
	-intName 1	(Assume First column is chr_position)
	-isSim	0	(Simulated data assumes ancestral is A)
	-nInd	0		(Number of individuals)
	-minQ	13	(minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:

Example

Example of estimating allele frequencies from beagle files

./angsd -out out -doMaf 4 -beagle file.beagle.gprobs.gz -fai ref.fai


Arguments

-beagle [fileName]

beagle file name. The file must be gzipped. The file format is a single line per site. The first 3 coloums are

  • markerName
  • alleleA
  • alleleB

For each individual 3 columns are added. These three columns should sum to one.

file with two individuals
marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986
chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627
chr9_95759152 C A 1 0 0 0 1 0
chr9_95762332 G A 0.925 0.0734 0.0015 0.894 0.1031 0.0029
chr9_95762333 A T 0.8903 0.1067 0.003 0.811 0.1797 0.0093
chr9_95762343 G T 0.9149 0.0835 0.0017 0.8396 0.1541 0.0064
-intName [int]=1

default 1. If the SNP name are written as chr_position this information will be parsed. If the SNP name is in another format then use -intName 0.

-fai [filename]

The index to the reference genome

can also be obtained from the bam header

samtools view -H  file.bam | grep SN |cut -f2,3 | sed 's/SN\://g' |  sed 's/LN\://g' > ref.fai
-bpl [int]=33554432

maximum bytes per line. Increase if the pileup has many individuals

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use