ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Input

From angsd
Revision as of 17:23, 22 February 2019 by Thorfinn (talk | contribs) (→‎VCF files)
Jump to navigation Jump to search

ANGSD currently supports various input formats


<classdiagram type="dir:LR"> [sequence data|BAM;CRAM;mpileup{bg:orange}]-[genotype;likelihoods|VCF;GLF;beagle{bg:orange}] [genotype;likelihoods|VCF;GLF;beagle{bg:orange}]-[genotype;probability|beagle{bg:orange}] </classdiagram>

Below is a short description of those we believe is of most use. Note that CRAM files are used interchangeably as BAM files. So use -bam for supplying both a CRAM list or BAM list or both.


Sequence data (BAM/CRAM/mpileup)

BAM/CRAM

ANGSD accepts BAM/CRAM files for mapped sequences and both are handled using the same -bam option. For information on the file specification and file creation see the samtools website [1]. These are required do be sorted according to reference. To see the options for BAM/CRAM use the command:

./angsd -bam

	-> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec  4 2015 10:40:24)
	-> Analysis helpbox/synopsis information:
	-> Command: 
./angsd -bam 
	-> angsd version: 0.910-14-g5e2711f (htslib: 1.2.1-252-ga2656aa) build(Dec  4 2015 10:40:28)
	-> Fri Dec  4 10:43:27 2015
---------------
parseArgs_bambi.cpp: bam reader:
	-r		(null)	Supply a single region in commandline (see examples below)
	-rf		(null)	Supply multiple regions in a file (see examples below)
	-remove_bads	1	Discard 'bad' reads, (flag >=256) 
	-uniqueOnly	0	Discards reads that doesn't map uniquely
	-show		0	Mimic 'samtools mpileup' also supply -ref fasta for printing reference column
	-minMapQ	0	Discard reads with mapping quality below
	-minQ		13	Discard bases with base quality below
	-trim		0	Number of based to discard at both ends of the reads
	-only_proper_pairs	1	Only use reads where the mate could be mapped
	-C		0	adjust mapQ for excessive mismatches (as SAMtools), supply -ref
	-baq		0	adjust qscores around indels (as SAMtools), supply -ref
	-if		2	include flags for each read
	-df		4	discard flags for each read
	-checkBamHeaders	1	Exit if difference in BAM headers
	-doCheck	1	Keep going even if datafile is not suffixed with .bam/.cram
	-downSample	0.000000	Downsample to the fraction of original data
	-minChunkSize	250	Minimum size of chunk sent to analyses

Examples for region specification:
		chr:		Use entire chromosome: chr
		chr:start-	Use region from start to end of chr
		chr:-stop	Use region from beginning of chromosome: chr to stop
		chr:start-stop	Use region from start to stop from chromosome: chr
		chr:site	Use single site on chromosome: chr
Will include read if:
	includeflag:[2] (beta)each segment properly aligned according to the aligner, 
Will discard read if:
	discardflag:[4] (beta)segment unmapped, 

Example

Example of estimating allele frequencies from bam files

./angsd -out out -doMaf 2 -bam bam.filelist -doMajorMinor 1 -GL 1 -P 5

Arguments

-bam [filelist]
-b [filelist]

The filelist is a file containing the full path for each bam file with one filename per row.


filelist with 6 individuals

/home/software/angsd/test/smallBam/smallNA12763.bam
/home/software/angsd/test/smallBam/smallNA11830.bam
/home/software/angsd/test/smallBam/smallNA12004.bam
/home/software/angsd/test/smallBam/smallNA06985.bam
/home/software/angsd/test/smallBam/smallNA11993.bam
/home/software/angsd/test/smallBam/smallNA12761.bam
-r [region]

Specify a region with in a chromosome using the syntax [chr]:[start-stop]. examples

chr1:1-10000             \\ first 10000 based for chr1
chr2:50000-              \\chr2 but exclude the first 50000 bases
chr11:1-                 \\all of chr11
chr11:                   \\all of chr11
chr7:123456              \\position 123456 of chr7
-rf [region file]

Specify multiple regions in a file using the same syntax as -r

-remove_bads [int]=1

Same as the samtools flags -x which removes read with a flag above 255 (not primary, failure and duplicate reads). 0 no , 1 remove (default).

-uniqueOnly [int]=0

Remove reads that have multiple best hits. 0 no (default), 1 remove.

-minMapQ [int]=0

Minimum mapQ quality.

-trim [int]=0

Number of bases to remove from both ends of the read.

-only_proper_pairs [int]=1

Include only proper pairs (pairs of read with both mates mapped correctly). 1: include only proper (default), 0: use all reads. Only relevant for paired end data.

-C [int] =0

Adjust mapQ for excessive mismatches (as SAMtools), supply -ref.

-baq [int]=0

Perform BAQ computation, remember to cite the| BAQ paper for this. 0: No BAQ calcualtion

1:normal BAQ (same as default in SAMtools). 2:extended BAQ (same as default in SAMtools).

-redo-baq=0

if zero then it will use the existing record

You will need to supply your reference (-ref) for BAQ options.

-checkBamHeaders [int]=1

Exits if the headers are not compatible for all files. 0 no , 1 remove (default). Not performing this check is not advisable

-downSample [float]=0

Randomly remove reads to downsample your data. 0.25 will on average keep 25% of the reads

-setMinChunkSize [int]=250

Minimum number of sites to read in before starting to analyze - larger number will use more RAM

Pileup files

Pileup files are the output files that are generated by SAMtools mpileup.

../angsd/angsd -pileup

	-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
	-> Analysis helpbox/synopsis information:
	-> Command: 
../angsd/angsd -pileup 	-> Fri Dec  4 12:17:53 2015
----------------
multiReader.cpp:
	-nLines	50	(Number of lines to read)
	-bpl	33554432 (bytesPerLine)
	-beagle	(null)	(Beagle Filename (can be .gz))
	-vcf-GL	(null)	(vcf Filename (can be .gz))
	-vcf-GP	(null)	(vcf Filename (can be .gz))
	-glf	(null)	(glf Filename (can be .gz))
	-pileup	(null)	(pileup Filename (can be .gz))
	-intName 1	(Assume First column is chr_position)
	-isSim	0	(Simulated data assumes ancestral is A)
	-nInd	0		(Number of individuals)
	-minQ	13	(minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:

Example

./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai

Arguments

-pileup [filename]

name of the pileup file.

A pileup file

1	13999999	N	3	ggg	I<B	2	Gg	FF	2	Gg	F7	6	ggGgGg	DBA@=2
1	14000000	N	3	ggg	8EG	2	Gg	BF	1	G	B	7	ggGgGgg	C>B=?:<
1	14000001	N	2	gg	<@	2	Gg	AC	2	Gg	:<	7	ggGgGgg	DBB?832
1	14000002	N	0			2	Cc	C1	1	C	B	7	ccCcCcc	=;A7485
1	14000003	N	2	gg	</	2	Gg	<I	2	Gg	</	7	ggGgGgg	C<;A84.
1	14000004	N	3	aaa	6C=	2	Aa	A9	2	Aa	BB	7	aaAaAaa	CBA7951
1	14000005	N	2	cc	4;	2	Cc	CC	2	Cc	@@	7	ccCcCcc	CBAB930
1	14000006	N	3	aaa	A9>	2	Aa	E<	2	Aa	;C	7	aa$AaAaa	D>BC6;:
1	14000007	N	3	ggg	43>	2	Gg	BI	2	Gg	D@	6	gGgGgg	BB?A.7
1	14000008	N	3	aaa	776	3	Aa^/A	:<?	2	Aa	BC	6	aAaAaa	D>C;:5
1	14000009	N	2	gg	96	3	GgG	BFD	2	Gg	A<	6	gGgGgg	CCA882
1	14000010	N	2	cc	54	3	CcC	>;A	2	Cc	A:	4	cCcC	=A69
1	14000011	N	2	gg	:0	3	GgG	9I<	2	Gg	<A	6	gGgGgg	C6A864
1	14000012	N	3	aaa	>F?	3	AaA	?<?	2	Aa	BC	5	aAaAa	D>B99
1	14000013	N	3	ggg	2==	3	GgG	AHD	2	Gg	EA	6	gGgGgg	C;A@63
1	14000014	N	3	aaa	8.6	3	AaA	?8A	2	Aa	2C	6	aAaAaa	C3A88<
1	14000015	N	2	cc	CD	3	CcC	CEB	2	Cc	?=	6	cCcCcc	D4<:=<
1	14000016	N	1	t	5	3	TtT	BGC	2	Tt	C@	6	tT$tTtt	C38A9>
1	14000017	N	3	ccc	17J	3	CcC	BB3	2	Cc	B7	5	ccCcc	D::B?
1	14000018	N	3	ccc	.:.	3	CcC	B:B	2	Cc	2;	5	ccCcc	<9956


-nInd [int]

Number of individuals must be specified.

-fai [filename]

The index to the reference genome.

-bpl [int]=33554432

maximum bytes per line. Increase if the pileup has many individuals.

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use.

-minQ [int]=0

Minimum base quality score.

Tutorial

Various softwares can generate pileup format but the most used one is samtools

samtools mpileup -b bam.filelist > sam.mpileup

if you can then use it as input to angsd

./angsd -pileup sam.mpileup -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1 -gl 1

Genotype Likelihood Files

-glf

A simple format for genotype likelihoods: This is the format used by supersim subprogram and the -doglf 1 option in angsd. This format is binary, 10doubles per individual. -nInd therefore needs to be supplied

../angsd/angsd -glf

	-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
	-> Analysis helpbox/synopsis information:
	-> Command: 
../angsd/angsd -pileup 	-> Fri Dec  4 12:17:53 2015
----------------
multiReader.cpp:
	-nLines	50	(Number of lines to read)
	-bpl	33554432 (bytesPerLine)
	-beagle	(null)	(Beagle Filename (can be .gz))
	-vcf-GL	(null)	(vcf Filename (can be .gz))
	-vcf-GP	(null)	(vcf Filename (can be .gz))
	-glf	(null)	(glf Filename (can be .gz))
	-pileup	(null)	(pileup Filename (can be .gz))
	-intName 1	(Assume First column is chr_position)
	-isSim	0	(Simulated data assumes ancestral is A)
	-nInd	0		(Number of individuals)
	-minQ	13	(minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:

Example

./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai

Arguments

-glf [filename]

name of the glf file (gunzipped). Every genotype likelihood is saved as binary double log scaled. In the following order. AA,AC,AG,AT,... for each individual

-nInd [int]

Number of individuals must be specified.

-fai [filename]

The index to the reference genome.

-bpl [int]=33554432

maximum bytes per line. Increase if the pileup has many individuals.

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use.

-minQ [int]=0

Minimum base quality score.

Tutorial

Simulate genotype likelihoods

supersim -outfiles data -nind 10 -nsites 100000 -errate 0.01 -depth 4

then use it as input to angsd

./angsd -glf data.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 1 -domajorminor 1


make GLF file from the chromosome 1

./angsd -GL 1 -out genolike -doGlf 1 -doMajorMinor 1  -doMaf 2 -SNP_pval 2e-6 -bam bam.filelist -r 1:

recalculate the allele frequencies

./angsd -glf genolike.glf.gz -nInd 10 -fai hg19.fa.gz.fai -domaf 2 -domajorminor 1

-glf10_text

-glf10_text was added in commit: https://github.com/ANGSD/angsd/commit/46fc3edc181e80c4ad5e6bd644a64d23a5012e0e nov2 2017. This allows for reading files in the output format as -doglf 4. This is a simple text file with column 1 and column 2 being chromosome/scaffold and position. Then for each individual there are 10 logscaled genotype likelihoods in the order: AA,AC,AG,AT,CC,CG,CT,GG,GT,TT. Example runs are: First generate an example of this format:

./angsd -gl 1 -doglf 4 -bam list -out first -domajorminor 1 -domaf 1

This generates the file first.glf.gz. Which we can then use as input. Example here:

 ./angsd -glf10_text first.glf.gz -nind 33 -domaf 1 -domajorminor 1 -fai fai.fai 

Notice that -nInd and -fai needs to be supplied.

VCF files

VCF file as input is now included but with some limitations. Only chr,pos,ref,alt and GP/GL tags are being used, and we discard indels and non diallelic sites. Furthermore you are required to include a fai file and the number of individuals.


./angsd -vcf-gl

	-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
	-> Analysis helpbox/synopsis information:
	-> Command: 
./angsd -vcf-gl 	-> Fri Dec  4 14:35:51 2015
----------------
multiReader.cpp:
	-nLines	50	(Number of lines to read)
	-bpl	33554432 (bytesPerLine)
	-beagle	(null)	(Beagle Filename (can be .gz))
	-vcf-GL	(null)	(vcf Filename (can be .gz))
	-vcf-GP	(null)	(vcf Filename (can be .gz))
	-glf	(null)	(glf Filename (can be .gz))
	-pileup	(null)	(pileup Filename (can be .gz))
	-intName 1	(Assume First column is chr_position)
	-isSim	0	(Simulated data assumes ancestral is A)
	-nInd	0		(Number of individuals)
	-minQ	13	(minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:


Example

/angsd -vcf-gl file.vcf -fai hg19.fa.gz.fai -nind 10 -domaf 4

Arguments

-vcf-gl [filename]

name of the vcf file.

A vcf file

##fileformat=VCFv4.2(angsd version)
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="scaled Genotype Likelihoods (loglikeratios to the most likely (in log10))">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	ind0	ind1
1	14000873	.	G	A	.	PASS	.	GP:GL	0.000000,0.137003,0.862997:-15.128970,-1.505169,0.000000	0.716266,0.281975,0.001759:0.000000,-0.301034,-1.800000
1	14001018	.	T	C	.	PASS	.	GP:GL	0.000000,0.081718,0.918282:-13.701492,-1.806203,0.000000	0.850652,0.149348,0.000000:0.000000,-0.602068,-5.699627
1	14001867	.	A	G	.	PASS	.	GP:GL	0.000489,0.727550,0.271961:-3.600000,-0.301034,0.000000	0.914538,0.085462,0.000000:0.000000,-0.903101,-8.859124
1	14002422	.	A	T	.	PASS	.	GP:GL	0.000000,0.291570,0.708430:-9.777061,-0.903101,0.000000	0.767047,0.232952,0.000001:0.000000,-0.602068,-5.499530
1	14002474	.	T	C	.	PASS	.	GP:GL	0.995488,0.004512,0.000000:0.000000,-1.505169,-15.068561	0.965008,0.034992,0.000000:0.000000,-0.602068,-5.899399
1	14003581	.	C	T	.	PASS	.	GP:GL	0.000000,0.674489,0.325510:-7.200000,-0.602068,0.000000	0.992516,0.007484,0.000000:0.000000,-1.806203,-13.447742
1	14004623	.	T	C	.	PASS	.	GP:GL	0.000000,0.588345,0.411654:-6.999968,-0.602068,0.000000	0.989186,0.010814,0.000000:0.000000,-1.806203,-12.574310
1	14007493	.	A	G	.	PASS	.	GP:GL	0.000013,0.541811,0.458176:-5.286503,-0.602068,0.000000	0.398233,0.422941,0.178826:-0.400000,-0.301034,0.000000
1	14007558	.	C	T	.	PASS	.	GP:GL	0.000000,0.091908,0.908092:-15.524007,-1.505169,0.000000	0.284993,0.442044,0.272964:-0.400000,-0.301034,0.000000
1	14007649	.	G	A	.	PASS	.	GP:GL	0.000000,0.638610,0.361390:-7.340205,-0.602068,0.000000	0.779442,0.220538,0.000020:0.000000,-0.301034,-3.500000
1	14008734	.	T	A	.	PASS	.	GP:GL	0.000000,0.280425,0.719575:-13.909454,-1.204135,0.000000	0.757059,0.242817,0.000123:0.000000,-0.301034,-2.800000
1	14009723	.	G	C	.	PASS	.	GP:GL	0.000345,0.744684,0.254971:-3.800000,-0.301034,0.000000	0.744903,0.255042,0.000055:0.000000,-0.301034,-3.200000
1	14010597	.	G	A	.	PASS	.	GP:GL	0.000000,0.063511,0.936489:-17.446187,-1.806203,0.000000	0.684326,0.315309,0.000365:0.000000,-0.301034,-2.600000
1	14010654	.	T	C	.	PASS	.	GP:GL	0.600538,0.348812,0.050650:0.000000,0.000000,0.000000	0.600538,0.348812,0.050650:0.000000,0.000000,0.000000


-nInd [int]

Number of individuals must be specified.

-fai [filename]

The index to the reference genome.

-bpl [int]=33554432

maximum bytes per line. Increase if the pileup has many individuals.

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use.


Tutorial

Create a VCF file using your favorate software or using angsd

./angsd -b bam.filelist -dovcf 1 -gl 1 -dopost 1 -domajorminor 1 -domaf 1 -snp_pval 1e-6

you can then use it as input to angsd if you have the GL info

./angsd -vcf-gl angsdput.vcf.gz -nind 10 -fai hg19.fa.gz.fai -domaf 1

Genotype Probability Files

Beagle format

Genotype probabilities in gz beagle format can be used as input. The format used is the haplotype imputation format outputted from beagle [2]. A newer version of beagle uses VCF files.

./angsd -beagle

	-> angsd version: 0.910-20-g553b991 (htslib: 1.2.1-192-ge7e2b3d) build(Dec  4 2015 12:17:14)
	-> Analysis helpbox/synopsis information:
	-> Command: 
./angsd -beagle 	-> Fri Dec  4 14:03:22 2015
----------------
multiReader.cpp:
	-nLines	50	(Number of lines to read)
	-bpl	33554432 (bytesPerLine)
	-beagle	(null)	(Beagle Filename (can be .gz))
	-vcf-GL	(null)	(vcf Filename (can be .gz))
	-vcf-GP	(null)	(vcf Filename (can be .gz))
	-glf	(null)	(glf Filename (can be .gz))
	-pileup	(null)	(pileup Filename (can be .gz))
	-intName 1	(Assume First column is chr_position)
	-isSim	0	(Simulated data assumes ancestral is A)
	-nInd	0		(Number of individuals)
	-minQ	13	(minimum base quality; only used in pileupreader)
----------------
multiReader.cpp:

Example

Example of estimating allele frequencies from beagle files

./angsd -out out -doMaf 4 -beagle file.beagle.gprobs.gz -fai ref.fai


Arguments

-beagle [fileName]

beagle file name. The file must be gzipped. The file format is a single line per site. The first 3 coloums are

  • markerName
  • alleleA
  • alleleB

For each individual 3 columns are added. These three columns should sum to one.

file with two individuals
marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986
chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627
chr9_95759152 C A 1 0 0 0 1 0
chr9_95762332 G A 0.925 0.0734 0.0015 0.894 0.1031 0.0029
chr9_95762333 A T 0.8903 0.1067 0.003 0.811 0.1797 0.0093
chr9_95762343 G T 0.9149 0.0835 0.0017 0.8396 0.1541 0.0064
-intName [int]=1

default 1. If the SNP name are written as chr_position this information will be parsed. If the SNP name is in another format then use -intName 0.

-fai [filename]

The index to the reference genome

can also be obtained from the bam header

samtools view -H  file.bam | grep SN |cut -f2,3 | sed 's/SN\://g' |  sed 's/LN\://g' > ref.fai
-bpl [int]=33554432

maximum bytes per line. Increase if the pileup has many individuals

-nLines [int]=50

Number of lines to read at a time. Increasing this number will affect the RAM use