ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Input: Difference between revisions
No edit summary |
|||
(27 intermediate revisions by 2 users not shown) | |||
Line 104: | Line 104: | ||
Number of bases to remove from both ends of the read. | Number of bases to remove from both ends of the read. | ||
;-only_proper_pairs [int]=1 | ;-only_proper_pairs [int]=1 | ||
Include only proper pairs (pairs of read with both mates mapped correctly). 1: include only proper (default), 0: use all reads. | Include only proper pairs (pairs of read with both mates mapped correctly). 1: include only proper (default), 0: use all reads. Only relevant for paired end data. | ||
;-C [int] =0 | ;-C [int] =0 | ||
Adjust mapQ for excessive mismatches (as SAMtools), supply -ref. | Adjust mapQ for excessive mismatches (as SAMtools), supply -ref. | ||
Line 110: | Line 110: | ||
Perform BAQ computation, remember to cite the[http://bioinformatics.oxfordjournals.org/content/early/2011/02/13/bioinformatics.btr076 | BAQ paper] for this. | Perform BAQ computation, remember to cite the[http://bioinformatics.oxfordjournals.org/content/early/2011/02/13/bioinformatics.btr076 | BAQ paper] for this. | ||
0: No BAQ calcualtion | 0: No BAQ calcualtion | ||
1: | |||
2:extended BAQ ( | 1:normal BAQ (same as default in SAMtools). | ||
2:extended BAQ (same as default in SAMtools). | |||
;-redo-baq=0 | |||
if zero then it will use the existing record | |||
You will need to supply your reference (-ref) for BAQ options. | You will need to supply your reference (-ref) for BAQ options. | ||
;-checkBamHeaders [int]=1 | ;-checkBamHeaders [int]=1 | ||
Line 117: | Line 122: | ||
;-downSample [float]=0 | ;-downSample [float]=0 | ||
Randomly remove reads to downsample your data. 0.25 will on average keep 25% of the reads | Randomly remove reads to downsample your data. 0.25 will on average keep 25% of the reads | ||
;- | ;-setMinChunkSize [int]=250 | ||
Minimum number of sites to read in before starting to analyze - larger number will use more RAM | Minimum number of sites to read in before starting to analyze - larger number will use more RAM | ||
Line 207: | Line 212: | ||
<pre> | <pre> | ||
./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1 | ./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1 | ||
</pre> | |||
=BCF/VCF files= | |||
BCF/VCF file as input is now included but with some limitations. Only chr,pos and PL tags are being used, and we discard indels. | |||
<div class="toccolours mw-collapsible mw-collapsed"> | |||
./angsd -vcf-gl | |||
<pre class="mw-collapsible-content"> | |||
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec 4 2015 12:17:14) | |||
-> Analysis helpbox/synopsis information: | |||
-> Command: | |||
./angsd -vcf-gl -> Fri Dec 4 14:35:51 2015 | |||
---------------- | |||
multiReader.cpp: | |||
-nLines 50 (Number of lines to read) | |||
-bpl 33554432 (bytesPerLine) | |||
-beagle (null) (Beagle Filename (can be .gz)) | |||
-vcf-GL (null) (vcf Filename (can be .gz)) | |||
-vcf-GP (null) (vcf Filename (can be .gz)) | |||
-glf (null) (glf Filename (can be .gz)) | |||
-pileup (null) (pileup Filename (can be .gz)) | |||
-intName 1 (Assume First column is chr_position) | |||
-isSim 0 (Simulated data assumes ancestral is A) | |||
-nInd 0 (Number of individuals) | |||
-minQ 13 (minimum base quality; only used in pileupreader) | |||
---------------- | |||
multiReader.cpp: | |||
</pre> | |||
</div> | |||
===Example=== | |||
<pre> | |||
angsd -vcf-gl ../smallBam/small2.bcf -domajorminor 1 -domaf 1 | |||
</pre> | |||
===Arguments=== | |||
;-vcf-gl [filename] | |||
name of the vcf file. | |||
<div class="toccolours mw-collapsible mw-collapsed"> | |||
A vcf file | |||
<pre class="mw-collapsible-content"> | |||
##fileformat=VCFv4.2(angsd version) | |||
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype"> | |||
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities"> | |||
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods"> | |||
##FORMAT=<ID=GL,Number=G,Type=Float,Description="scaled Genotype Likelihoods (loglikeratios to the most likely (in log10))"> | |||
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind0 ind1 | |||
1 14000873 . G A . PASS . GP:GL 0.000000,0.137003,0.862997:-15.128970,-1.505169,0.000000 0.716266,0.281975,0.001759:0.000000,-0.301034,-1.800000 | |||
1 14001018 . T C . PASS . GP:GL 0.000000,0.081718,0.918282:-13.701492,-1.806203,0.000000 0.850652,0.149348,0.000000:0.000000,-0.602068,-5.699627 | |||
1 14001867 . A G . PASS . GP:GL 0.000489,0.727550,0.271961:-3.600000,-0.301034,0.000000 0.914538,0.085462,0.000000:0.000000,-0.903101,-8.859124 | |||
1 14002422 . A T . PASS . GP:GL 0.000000,0.291570,0.708430:-9.777061,-0.903101,0.000000 0.767047,0.232952,0.000001:0.000000,-0.602068,-5.499530 | |||
1 14002474 . T C . PASS . GP:GL 0.995488,0.004512,0.000000:0.000000,-1.505169,-15.068561 0.965008,0.034992,0.000000:0.000000,-0.602068,-5.899399 | |||
1 14003581 . C T . PASS . GP:GL 0.000000,0.674489,0.325510:-7.200000,-0.602068,0.000000 0.992516,0.007484,0.000000:0.000000,-1.806203,-13.447742 | |||
1 14004623 . T C . PASS . GP:GL 0.000000,0.588345,0.411654:-6.999968,-0.602068,0.000000 0.989186,0.010814,0.000000:0.000000,-1.806203,-12.574310 | |||
1 14007493 . A G . PASS . GP:GL 0.000013,0.541811,0.458176:-5.286503,-0.602068,0.000000 0.398233,0.422941,0.178826:-0.400000,-0.301034,0.000000 | |||
1 14007558 . C T . PASS . GP:GL 0.000000,0.091908,0.908092:-15.524007,-1.505169,0.000000 0.284993,0.442044,0.272964:-0.400000,-0.301034,0.000000 | |||
1 14007649 . G A . PASS . GP:GL 0.000000,0.638610,0.361390:-7.340205,-0.602068,0.000000 0.779442,0.220538,0.000020:0.000000,-0.301034,-3.500000 | |||
1 14008734 . T A . PASS . GP:GL 0.000000,0.280425,0.719575:-13.909454,-1.204135,0.000000 0.757059,0.242817,0.000123:0.000000,-0.301034,-2.800000 | |||
1 14009723 . G C . PASS . GP:GL 0.000345,0.744684,0.254971:-3.800000,-0.301034,0.000000 0.744903,0.255042,0.000055:0.000000,-0.301034,-3.200000 | |||
1 14010597 . G A . PASS . GP:GL 0.000000,0.063511,0.936489:-17.446187,-1.806203,0.000000 0.684326,0.315309,0.000365:0.000000,-0.301034,-2.600000 | |||
1 14010654 . T C . PASS . GP:GL 0.600538,0.348812,0.050650:0.000000,0.000000,0.000000 0.600538,0.348812,0.050650:0.000000,0.000000,0.000000 | |||
</pre> | |||
</div> | |||
;-nLines [int]=50 | |||
Number of lines to read at a time. Increasing this number will affect the RAM use. | |||
===Tutorial=== | |||
Create a VCF file using your favorate software or using angsd | |||
<pre> | |||
./angsd -b bam.filelist -dovcf 1 -gl 1 -dopost 1 -domajorminor 1 -domaf 1 -snp_pval 1e-6 | |||
</pre> | |||
you can then use it as input to angsd if you have the GL info | |||
<pre> | |||
./angsd -vcf-gl angsdput.vcf.gz -nind 10 -fai hg19.fa.gz.fai -domaf 1 | |||
</pre> | </pre> | ||
Line 212: | Line 297: | ||
==-glf== | ==-glf== | ||
A simple format for genotype likelihoods: | A simple format for genotype likelihoods: This is the format used by ''supersim'' subprogram and the ''-doglf 1'' option in angsd. | ||
This format is binary, 10doubles per individual. -nInd therefore needs to be supplied | |||
<div class="toccolours mw-collapsible mw-collapsed"> | |||
../angsd/angsd -glf | |||
<pre class="mw-collapsible-content"> | |||
-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec 4 2015 12:17:14) | |||
-> Analysis helpbox/synopsis information: | |||
-> Command: | |||
../angsd/angsd -pileup -> Fri Dec 4 12:17:53 2015 | |||
---------------- | |||
multiReader.cpp: | |||
-nLines 50 (Number of lines to read) | |||
-bpl 33554432 (bytesPerLine) | |||
-beagle (null) (Beagle Filename (can be .gz)) | |||
-vcf-GL (null) (vcf Filename (can be .gz)) | |||
-vcf-GP (null) (vcf Filename (can be .gz)) | |||
-glf (null) (glf Filename (can be .gz)) | |||
-pileup (null) (pileup Filename (can be .gz)) | |||
-intName 1 (Assume First column is chr_position) | |||
-isSim 0 (Simulated data assumes ancestral is A) | |||
-nInd 0 (Number of individuals) | |||
-minQ 13 (minimum base quality; only used in pileupreader) | |||
---------------- | |||
multiReader.cpp: | |||
</pre> | |||
</div> | |||
===Example=== | |||
<pre> | |||
./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai | |||
</pre> | |||
===Arguments=== | |||
;-glf [filename]: | ;-glf [filename]: | ||
name of the glf file (gunzipped). | |||
Every genotype likelihood is saved as binary double log scaled. In the following order. AA,AC,AG,AT,... for each individual | |||
;-nInd [int] | |||
Number of individuals must be specified. | |||
;-fai [filename] | |||
The index to the reference genome. | |||
;-bpl [int]=33554432 | |||
maximum bytes per line. Increase if the pileup has many individuals. | |||
;-nLines [int]=50 | |||
Number of lines to read at a time. Increasing this number will affect the RAM use. | |||
;-minQ [int]=0 | |||
Minimum base quality score. | |||
===Tutorial=== | |||
Simulate genotype likelihoods | |||
<pre> | |||
supersim -outfiles data -nind 10 -nsites 100000 -errate 0.01 -depth 4 | |||
</pre> | |||
then use it as input to angsd | |||
<pre> | <pre> | ||
./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1 | |||
./angsd - | |||
</pre> | </pre> | ||
make GLF file from the chromosome 1 | |||
== | <pre> | ||
./angsd -GL 1 -out genolike -doGlf 1 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist -r 1: | |||
</pre> | |||
recalculate the allele frequencies | |||
<pre> | |||
./angsd -glf genolike.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 2 -domajorminor 1 | |||
</pre> | |||
==-glf10_text== | |||
-glf10_text was added in commit: https://github.com/ANGSD/angsd/commit/46fc3edc181e80c4ad5e6bd644a64d23a5012e0e nov2 2017. | |||
This allows for reading files in the output format as -doglf 4. | |||
This is a simple text file with column 1 and column 2 being chromosome/scaffold and position. Then for each individual there are 10 logscaled genotype likelihoods in the order: AA,AC,AG,AT,CC,CG,CT,GG,GT,TT. | |||
Example runs are: | |||
First generate an example of this format: | |||
<pre> | |||
./angsd -gl 1 -doglf 4 -bam list -out first -domajorminor 1 -domaf 1 | |||
</pre> | |||
This generates the file first.glf.gz. Which we can then use as input. | |||
Example here: | |||
<pre> | |||
./angsd -glf10_text first.glf.gz -nind 33 -domaf 1 -domajorminor 1 -fai fai.fai | |||
</pre> | |||
Notice that -nInd and -fai needs to be supplied. | |||
=Genotype Probability Files= | =Genotype Probability Files= | ||
==Beagle format== | |||
Genotype probabilities in gz beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [http://faculty.washington.edu/browning/beagle/beagle.html]. A newer version of beagle uses VCF files. | Genotype probabilities in gz beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [http://faculty.washington.edu/browning/beagle/beagle.html]. A newer version of beagle uses VCF files. | ||
Latest revision as of 10:57, 28 September 2021
ANGSD currently supports various input formats
<classdiagram type="dir:LR"> [sequence data|BAM;CRAM;mpileup{bg:orange}]-[genotype;likelihoods|VCF;GLF;beagle{bg:orange}] [genotype;likelihoods|VCF;GLF;beagle{bg:orange}]-[genotype;probability|beagle{bg:orange}] </classdiagram>
Below is a short description of those we believe is of most use. Note that CRAM files are used interchangeably as BAM files. So use -bam for supplying both a CRAM list or BAM list or both.
Sequence data (BAM/CRAM/mpileup)
BAM/CRAM
ANGSD accepts BAM/CRAM files for mapped sequences and both are handled using the same -bam option. For information on the file specification and file creation see the samtools website [1]. These are required do be sorted according to reference. To see the options for BAM/CRAM use the command:
./angsd -bam
Example
Example of estimating allele frequencies from bam files
./angsd -out out -doMaf 2 -bam bam.filelist -doMajorMinor 1 -GL 1 -P 5
Arguments
- -bam [filelist]
- -b [filelist]
The filelist is a file containing the full path for each bam file with one filename per row.
filelist with 6 individuals
- -r [region]
Specify a region with in a chromosome using the syntax [chr]:[start-stop]. examples
chr1:1-10000 \\ first 10000 based for chr1 chr2:50000- \\chr2 but exclude the first 50000 bases chr11:1- \\all of chr11 chr11: \\all of chr11 chr7:123456 \\position 123456 of chr7
- -rf [region file]
Specify multiple regions in a file using the same syntax as -r
- -remove_bads [int]=1
Same as the samtools flags -x which removes read with a flag above 255 (not primary, failure and duplicate reads). 0 no , 1 remove (default).
- -uniqueOnly [int]=0
Remove reads that have multiple best hits. 0 no (default), 1 remove.
- -minMapQ [int]=0
Minimum mapQ quality.
- -trim [int]=0
Number of bases to remove from both ends of the read.
- -only_proper_pairs [int]=1
Include only proper pairs (pairs of read with both mates mapped correctly). 1: include only proper (default), 0: use all reads. Only relevant for paired end data.
- -C [int] =0
Adjust mapQ for excessive mismatches (as SAMtools), supply -ref.
- -baq [int]=0
Perform BAQ computation, remember to cite the| BAQ paper for this. 0: No BAQ calcualtion
1:normal BAQ (same as default in SAMtools). 2:extended BAQ (same as default in SAMtools).
- -redo-baq=0
if zero then it will use the existing record
You will need to supply your reference (-ref) for BAQ options.
- -checkBamHeaders [int]=1
Exits if the headers are not compatible for all files. 0 no , 1 remove (default). Not performing this check is not advisable
- -downSample [float]=0
Randomly remove reads to downsample your data. 0.25 will on average keep 25% of the reads
- -setMinChunkSize [int]=250
Minimum number of sites to read in before starting to analyze - larger number will use more RAM
Pileup files
Pileup files are the output files that are generated by SAMtools mpileup.
../angsd/angsd -pileup
Example
./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai
Arguments
- -pileup [filename]
name of the pileup file.
A pileup file
- -nInd [int]
Number of individuals must be specified.
- -fai [filename]
The index to the reference genome.
- -bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals.
- -nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use.
- -minQ [int]=0
Minimum base quality score.
Tutorial
Various softwares can generate pileup format but the most used one is samtools
samtools mpileup -b bam.filelist > sam.mpileup
if you can then use it as input to angsd
./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1
BCF/VCF files
BCF/VCF file as input is now included but with some limitations. Only chr,pos and PL tags are being used, and we discard indels.
./angsd -vcf-gl
Example
angsd -vcf-gl ../smallBam/small2.bcf -domajorminor 1 -domaf 1
Arguments
- -vcf-gl [filename]
name of the vcf file.
A vcf file
- -nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use.
Tutorial
Create a VCF file using your favorate software or using angsd
./angsd -b bam.filelist -dovcf 1 -gl 1 -dopost 1 -domajorminor 1 -domaf 1 -snp_pval 1e-6
you can then use it as input to angsd if you have the GL info
./angsd -vcf-gl angsdput.vcf.gz -nind 10 -fai hg19.fa.gz.fai -domaf 1
Genotype Likelihood Files
-glf
A simple format for genotype likelihoods: This is the format used by supersim subprogram and the -doglf 1 option in angsd. This format is binary, 10doubles per individual. -nInd therefore needs to be supplied
../angsd/angsd -glf
Example
./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai
Arguments
- -glf [filename]
name of the glf file (gunzipped). Every genotype likelihood is saved as binary double log scaled. In the following order. AA,AC,AG,AT,... for each individual
- -nInd [int]
Number of individuals must be specified.
- -fai [filename]
The index to the reference genome.
- -bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals.
- -nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use.
- -minQ [int]=0
Minimum base quality score.
Tutorial
Simulate genotype likelihoods
supersim -outfiles data -nind 10 -nsites 100000 -errate 0.01 -depth 4
then use it as input to angsd
./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1
make GLF file from the chromosome 1
./angsd -GL 1 -out genolike -doGlf 1 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist -r 1:
recalculate the allele frequencies
./angsd -glf genolike.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 2 -domajorminor 1
-glf10_text
-glf10_text was added in commit: https://github.com/ANGSD/angsd/commit/46fc3edc181e80c4ad5e6bd644a64d23a5012e0e nov2 2017. This allows for reading files in the output format as -doglf 4. This is a simple text file with column 1 and column 2 being chromosome/scaffold and position. Then for each individual there are 10 logscaled genotype likelihoods in the order: AA,AC,AG,AT,CC,CG,CT,GG,GT,TT. Example runs are: First generate an example of this format:
./angsd -gl 1 -doglf 4 -bam list -out first -domajorminor 1 -domaf 1
This generates the file first.glf.gz. Which we can then use as input. Example here:
./angsd -glf10_text first.glf.gz -nind 33 -domaf 1 -domajorminor 1 -fai fai.fai
Notice that -nInd and -fai needs to be supplied.
Genotype Probability Files
Beagle format
Genotype probabilities in gz beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [2]. A newer version of beagle uses VCF files.
./angsd -beagle
Example
Example of estimating allele frequencies from beagle files
./angsd -out out -doMaf 4 -beagle file.beagle.gprobs.gz -fai ref.fai
Arguments
- -beagle [fileName]
beagle file name. The file must be gzipped. The file format is a single line per site. The first 3 coloums are
- markerName
- alleleA
- alleleB
For each individual 3 columns are added. These three columns should sum to one.
file with two individuals
- -intName [int]=1
default 1. If the SNP name are written as chr_position this information will be parsed. If the SNP name is in another format then use -intName 0.
- -fai [filename]
The index to the reference genome
can also be obtained from the bam header
- -bpl [int]=33554432
maximum bytes per line. Increase if the pileup has many individuals
- -nLines [int]=50
Number of lines to read at a time. Increasing this number will affect the RAM use