ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Haploid calling


Simple haploid output based on sampling or consensus. The latest GitHub version of angsd includes a small utility program in the misc folder that converts this output to PLINK format (tfam/tped).



<classdiagram type="dir:LR">

[BAM files{bg:orange}]->[Sequence data|Random base;Consensus base]

[sequence data]->[*.haplo.gz|single base file{bg:blue}] </classdiagram>

Brief Overview

> ./angsd -doHaploCall
	-> angsd version: 0.910-45-g2b2b4f0-dirty (htslib: 1.2.1-192-ge7e2b3d) build(Jan  3 2016 14:45:41)
	-> Analysis helpbox/synopsis information:
	-> Command: 
./angsd -doHaploCall 	-> Sun Jan  3 15:18:15 2016
--------------
abcHaploCall.cpp:
	-doHaploCall	0
	(Sampling strategies)
	 0:	 no haploid calling 
	 1:	 (Sample single base)
	 2:	 (Concensus base)
	-doCounts	0	Must choose -doCount 1
Optional
	-minMinor	0	Minimum observed minor alleles
	-maxMis	-1	Maximum missing bases (per site)


This function outputs one base per individual at each site.

Options

-doHaploCall [int]

1: sample a random base. 2: use the most frequent base (consensus); ties are resolved by picking a random base.

-doCounts 1

Required. Use -doCounts 1 to count the bases at each site after filters have been applied.

-minMinor [int]

Minimum number of observed minor alleles; only sites with at least minMinor sampled minor alleles (across individuals) are printed.

-maxMis [int]

Maximum number of missing alleles allowed (across individuals). -maxMis 0 means that only sites without missing alleles are printed.
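The two -doHaploCall strategies can be illustrated with a minimal Python sketch. This is not ANGSD's implementation: the function name `call_base`, the per-base count dictionary, the assumption that strategy 1 draws every read with equal probability, and the use of `N` for an individual without data (as seen in the example output) are all assumptions made for illustration.

```python
import random

BASES = ["A", "C", "G", "T"]


def call_base(counts, mode, rng=random):
    """Return a single base for one individual at one site.

    counts: dict mapping base -> number of reads (hypothetical input,
            standing in for the per-site counts kept with -doCounts 1)
    mode:   1 = sample a random base, 2 = consensus base
    rng:    random module or random.Random instance
    """
    weights = [counts.get(b, 0) for b in BASES]
    if sum(weights) == 0:
        # Assumption: no data for this individual is reported as N.
        return "N"
    if mode == 1:
        # Assumption: each read is equally likely to be drawn, so a base
        # is chosen with probability proportional to its count.
        return rng.choices(BASES, weights=weights, k=1)[0]
    if mode == 2:
        # Most frequent base; a tie is broken by a random draw
        # among the tied bases.
        top = max(weights)
        tied = [b for b, w in zip(BASES, weights) if w == top]
        return rng.choice(tied)
    raise ValueError("mode must be 1 (random base) or 2 (consensus base)")
```

With counts of 3 A reads and 1 C read, strategy 2 always returns A, while strategy 1 returns C in roughly a quarter of the draws under the equal-probability assumption.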


Output

  • .haplo.gz

Each line represents one site: chromosome name (column 1), position (column 2), major allele (column 3), followed by one column per individual with the sampled allele.
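As a rough sketch of how this file could be read downstream, the Python below parses the tab-separated lines into records. The names `HaploSite`, `parse_haplo_line` and `read_haplo` are hypothetical, and the code assumes that the first line of the file is a header, as in the example output on this page.

```python
import gzip
from typing import Iterator, List, NamedTuple


class HaploSite(NamedTuple):
    chrom: str        # column 1: chromosome name
    pos: int          # column 2: position
    major: str        # column 3: major allele
    calls: List[str]  # columns 4 onward: one base per individual


def parse_haplo_line(line: str) -> HaploSite:
    """Split one tab-separated data line into its columns."""
    fields = line.rstrip("\n").split("\t")
    return HaploSite(fields[0], int(fields[1]), fields[2], fields[3:])


def read_haplo(path: str) -> Iterator[HaploSite]:
    """Yield one HaploSite per data line of a gzipped haplo file."""
    with gzip.open(path, "rt") as handle:
        handle.readline()  # assumption: skip the header line
        for line in handle:
            if line.strip():
                yield parse_haplo_line(line)
```

A call such as `list(read_haplo("out.haplo.gz"))`, where the file name is only a placeholder, would then return every site in the file.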

Example

Sample a random base per individual at each site on chromosome 1, using -minMinor 1 to filter on the number of observed minor alleles.

./angsd -bam bam.filelist -doHaploCall 1 -doCounts 1 -r 1: -minMinor 1

Output

chr	pos	major	ind0	ind1	ind2	ind3	ind4	ind5	ind6
1	14000170	C	T	T	C	N	C	C	C
1	14000202	A	A	N	G	A	N	N	G
1	14000457	G	G	G	G	G	G	N	A
1	14000459	G	G	G	G	G	A	N	N
1	14000774	G	T	G	G	G	G	G	T
1	14002083	C	G	N	C	C	C	C	C
1	14002351	A	A	C	C	A	C	N	A
1	14002950	A	T	A	A	A	T	N	T
1	14004832	G	G	G	A	G	G	A	G
1	14006543	G	T	G	G	G	G	G	G
1	14006631	A	C	N	A	N	A	N	A
1	14007068	G	T	T	T	G	G	G	N
1	14009284	A	A	C	C	C	N	A	N
1	14009775	G	G	G	G	G	C	G	C
1	14009787	T	T	T	G	T	G	T	T
1	14009791	A	G	G	A	G	A	G	A
1	14009794	A	A	A	A	N	N	A	A
1	14009800	A	G	A	A	G	N	G	A
1	14010748	A	G	N	A	G	A	A	A
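The rows above can be checked against the -minMinor and -maxMis filters with a short Python sketch. It is only an illustration, not the program's code: `site_stats` and `keep_site` are hypothetical names, `N` is assumed to mark a missing call, minor alleles are assumed to be all non-missing calls that differ from the major allele (the program may count them differently), and a negative `max_mis` is assumed to disable the missingness filter, in line with the default of -1.

```python
def site_stats(major, calls, missing="N"):
    """Return (minor allele count, missing call count) for one site."""
    n_missing = sum(1 for c in calls if c == missing)
    # Assumption: every non-missing call that differs from the major
    # allele counts as a minor allele.
    n_minor = sum(1 for c in calls if c != missing and c != major)
    return n_minor, n_missing


def keep_site(major, calls, min_minor=0, max_mis=-1):
    """Apply minMinor / maxMis style thresholds to one site."""
    n_minor, n_missing = site_stats(major, calls)
    if n_minor < min_minor:
        return False
    # Assumption: a negative max_mis means the filter is switched off.
    if max_mis >= 0 and n_missing > max_mis:
        return False
    return True
```

For the first row (major allele C, calls T T C N C C C), `site_stats` gives 2 minor alleles and 1 missing call, so under these assumptions the site passes `min_minor=1` but would be dropped with `max_mis=0`.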

The columns of the output are:

chr

chromosome

pos

position

major

major allele (most common of the sampled alleles)

ind0

first individual; the individual columns follow the same order as the input files
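To give an idea of what a conversion of these columns to PLINK tped/tfam could look like, here is a minimal Python sketch. It does not reproduce the utility in the misc folder. The variant identifier of the form `chr_pos`, the family and individual identifiers `ind0`, `ind1`, ..., the duplication of each haploid call into a homozygous genotype, and the mapping of `N` to the PLINK missing code `0` are all assumptions made for illustration.

```python
def haplo_to_tped(chrom, pos, calls, missing="N"):
    """Build one tped line from one site of haploid calls."""
    # tped columns: chromosome, variant ID, genetic distance, bp position
    fields = [str(chrom), f"{chrom}_{pos}", "0", str(pos)]
    for call in calls:
        allele = "0" if call == missing else call  # 0 = missing in PLINK
        # Assumption: a haploid call is written as a homozygous genotype.
        fields.extend([allele, allele])
    return " ".join(fields)


def tfam_lines(n_individuals):
    """Build tfam lines: FID, IID, father, mother, sex, phenotype."""
    # Parents and sex are set to 0 (unknown), phenotype to -9 (missing).
    return [f"ind{i} ind{i} 0 0 0 -9" for i in range(n_individuals)]
```

For example, `haplo_to_tped("1", 14000170, ["T", "N"])` yields the line `1 1_14000170 0 14000170 T T 0 0`.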