ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Allele Counts

Latest revision as of 05:34, 2 May 2020

Sometimes we want or need the frequency of the different bases. This is what -doCounts does.

You can refine which bases are included using the filter parameters -minMapQ/-minQ/-trim. Based on the total depth at each site, you can discard sites from further analysis if the depth is above or below a threshold (-setMaxDepth/-setMinDepth), and you can discard a site if the effective sample size is below a threshold (-minInd).

You can output summary statistics such as the Q score distribution (-doQsDist), the depth distribution (-doDepth), or various per-site counts (-dumpCounts). All output files have a header line, which should make the interpretation straightforward.

Brief Overview

./angsd -doCounts 
	-> angsd version: 0.560	 build(Dec  4 2013 13:27:02)
	-> Analysis helpbox/synopsis information:
---------------
analysisCount.cpp:
	-doCounts	0	(Count the number A,C,G,T. All sites, All samples)
	-minQ		13	(remove bases with qscore<minQ)
	-minQfile	(null)	 file with individuals quality score threshold)
	-setMaxDepth	-1	(If total depth is larger then site is removed from analysis.
				 -1 indicates no filtering)
	-setMinDepth	-1	(If total depth is smaller then site is removed from analysis.
				 -1 indicates no filtering)
	-trim		0	(trim ends of reads)
	-minInd		0	(Discard site if effective sample size below value.
				 0 indicates no filtering)
Filedumping:
	-doDepth	0	(dump distribution of seqdepth)	.depthSample,.depthGlobal
	  -maxDepth	100	(bin together high depths)
	-doQsDist	0	(dump distribution of qscores)	.qs
	-dumpCounts	0
	  1: total seqdepth for site	.pos.gz
	  2: seqdepth persample		.pos.gz,.counts.gz
	  3: A,C,G,T sum all samples	.pos.gz,.counts.gz
	  4: A,C,G,T sum every sample	.pos.gz,.counts.gz

Options

Filtering

-minQ [int]

Default 13. Discard bases with a qscore below this threshold.

-trim [int]

Default 0. Trim [int] bases at both ends of the reads. Useful for ancient DNA.

-setMinDepth [int]

Default -1 (no filtering). If the total depth is below this value, the site is discarded.

-setMaxDepth [int]

Default -1 (no filtering). If the total depth is above this value, the site is discarded.

-minQfile [fileName]

Default NULL. File with per-individual base quality score thresholds. The file should have one row per individual and either 1 or 4 columns. If four columns are given, a separate quality threshold is used for each base (A C G T). Both spaces and tabs are accepted as delimiters.
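
The layout of a -minQfile can be checked before running angsd. The following is a minimal sketch, assuming only the rules stated above (one row per individual, 1 or 4 whitespace-separated integer columns). The function name parse_min_q_file is hypothetical and not part of angsd.

```python
def parse_min_q_file(text, n_individuals):
    """Return one [A, C, G, T] threshold list per individual.

    Hypothetical helper: it only mirrors the layout rules described in the
    text and is not how angsd itself reads the file.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) != n_individuals:
        raise ValueError(f"expected {n_individuals} rows, found {len(rows)}")
    widths = {len(fields) for fields in rows}
    if widths not in ({1}, {4}):
        raise ValueError("every row must have 1 column, or every row 4 columns")
    thresholds = []
    for fields in rows:
        values = [int(f) for f in fields]
        if len(values) == 1:
            values = values * 4  # one threshold shared by A, C, G and T
        thresholds.append(values)
    return thresholds


# Spaces and tabs are both accepted because str.split() splits on any whitespace.
four_columns = "20 23 23 20\n30\t34\t34\t30\n"
one_column = "20\n30\n"
print(parse_min_q_file(four_columns, 2))
print(parse_min_q_file(one_column, 2))
```

Expanding the one-column form to four values keeps later lookups identical for both layouts.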

output summary

-dumpCounts [int]

Default 0. See examples below. Output files are called .pos.gz and .counts.gz.

-doQsDist [int]

Default 0. Output the distribution of base quality scores. The output file is called .qs.

-doDepth [int]

Default 0. Output the distribution of sequencing depths. Sites with a depth above -maxDepth are binned together. Output files are called .depthSample and .depthGlobal.

-maxDepth [int]

Default 100. See -doDepth parameter.
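
To illustrate what binning high depths means, here is a minimal sketch of a depth histogram with an overflow bin. It assumes that bins run from depth 0 to -maxDepth and that every site deeper than -maxDepth is counted in the last bin. That boundary behaviour is an assumption for illustration and has not been checked against the angsd source. The name depth_histogram is hypothetical.

```python
def depth_histogram(depths, max_depth=100):
    """Count sites per depth; depths above max_depth share the last bin.

    Assumption: bins cover 0..max_depth inclusive (max_depth + 1 bins).
    """
    counts = [0] * (max_depth + 1)
    for depth in depths:
        counts[min(depth, max_depth)] += 1
    return counts


# Six sites; the two deepest ones (7 and 12) both land in the last bin.
print(depth_histogram([0, 1, 1, 3, 7, 12], max_depth=5))
# prints [1, 2, 0, 1, 0, 2]
```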

Output formats

Printing Counts per site

-dumpCounts [int]

1: Print the overall depth in the .pos.gz file. This depth is the sum of the reads covering a site across all individuals. The first column is the chromosome, the second is the position, and the third is the total depth.

chr	pos	totDepth
1	13999902	1
1	13999903	1
1	13999904	1
1	13999905	2
1	13999906	2
1	13999907	2
1	13999908	2
1	13999909	2
1	13999910	2

2: Print the depth of each individual. The example shows the depth of 5 individuals. Each line corresponds to the same line in the position file.

ind0TotDepth	ind1TotDepth	ind2TotDepth	ind3TotDepth	ind4TotDepth
0	0	0	7	0
0	3	0	0	0
0	0	4	4	0
0	0	0	0	1
5	0	0	0	0
0	0	10	0	0
0	0	0	0	1
0	4	0	10	0
0	0	0	2	0

3: Print the depth of each of the four bases summed across all individuals. Each line corresponds to the same line in the position file.

totA    totC    totG    totT
1       0       0       0
0       0       1       0
0       1       0       0
0       0       0       2
2       0       0       0
0       2       0       0
0       0       0       2
0       2       0       0
0       0       2       0
0       0       2       0
0       0       2       0
2       0       0       0
0       0       2       0


4: Print the depth of each of the four bases for each individual at each site. The first four columns hold the counts of A, C, G and T for the first individual, the next four columns those of the second individual, and so on. Three individuals are shown. Each line corresponds to the same line in the position file.

ind0_A ind0_C ind0_G ind0_T ind1_A ind1_C ind1_G ind1_T ind2_A ind2_C ind2_G ind2_T 
0 1 0 0 0 0 0 0 0 0 0 0 
0 0 1 0 0 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 0 0 0 0 
0 0 1 0 0 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 1 0 0 0 0 0 0 0 0 
0 0 0 1 0 0 0 0 0 0 0 1 
0 0 1 0 0 0 0 0 0 0 1 0 
0 1 0 0 0 0 0 0 0 0 0 0 
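
The per-individual layout of -dumpCounts 4 can be folded back into per-site totals (the -dumpCounts 3 layout) and into base frequencies. The following is a minimal sketch that assumes the column order shown above (A, C, G, T repeated per individual). The function names are hypothetical.

```python
BASES = "ACGT"


def per_individual_counts(fields):
    """Split one -dumpCounts 4 data row into one {base: count} dict per individual."""
    values = [int(f) for f in fields]
    if len(values) % 4 != 0:
        raise ValueError("expected a multiple of 4 columns (A C G T per individual)")
    return [dict(zip(BASES, values[i:i + 4])) for i in range(0, len(values), 4)]


def site_totals(individuals):
    """Sum each base over all individuals, as in the -dumpCounts 3 layout."""
    return {base: sum(ind[base] for ind in individuals) for base in BASES}


def base_frequencies(totals):
    """Convert counts to frequencies; returns None for a site without reads."""
    depth = sum(totals.values())
    if depth == 0:
        return None
    return {base: totals[base] / depth for base in BASES}


# Seventh data row of the example above: one T in ind0 and one T in ind2.
row = "0 0 0 1 0 0 0 0 0 0 0 1"
totals = site_totals(per_individual_counts(row.split()))
print(totals)
# prints {'A': 0, 'C': 0, 'G': 0, 'T': 2}
print(base_frequencies(totals))
# prints {'A': 0.0, 'C': 0.0, 'G': 0.0, 'T': 1.0}
```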

Example

Print the depth of each individual from BAM files:

./angsd -out out -doCounts 1 -dumpCounts 2 -bam bam.filelist
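
Since each line of the counts file corresponds to the same line of the position file, the two outputs can be read side by side. The following is a minimal sketch, assuming the command above wrote out.pos.gz and out.counts.gz (the prefix from -out plus the suffixes listed in the help text) and that both files start with a header line. The function name read_depths is hypothetical.

```python
import gzip
import os


def read_depths(pos_path, counts_path):
    """Yield (chromosome, position, per-individual depths) for every site."""
    with gzip.open(pos_path, "rt") as pos, gzip.open(counts_path, "rt") as counts:
        next(pos, None)     # skip the header line of the position file
        next(counts, None)  # skip the header line of the counts file
        for pos_line, counts_line in zip(pos, counts):
            chrom, position = pos_line.split()[:2]
            depths = [int(x) for x in counts_line.split()]
            yield chrom, int(position), depths


# Only runs when the angsd output files are present in the working directory.
if os.path.exists("out.pos.gz") and os.path.exists("out.counts.gz"):
    for chrom, position, depths in read_depths("out.pos.gz", "out.counts.gz"):
        print(chrom, position, sum(depths), depths)
```

Chromosome names are kept as strings because they are not always numeric.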

Print the depth of each individual from BAM files, but filter away low-quality bases:

./angsd -out out -doCounts 1 -dumpCounts 2 -bam bam.filelist -minQ 20

Print the depth of each individual from BAM files, but filter away low-quality bases using a different threshold per individual and base type:

./angsd -out out -doCounts 1 -dumpCounts 2 -bam bam.filelist -minQ 20 -minQfile qThres.txt

qThres.txt:

20 23 23 20
30 34 34 30
30 34 34 30
30 34 34 30
30 34 34 30
20 23 23 20
30 34 34 30
30 34 34 30
20 30 30 20
20 23 23 20

The above analysis removes A and T bases with a Q score less than 20, and C and G bases with a Q score less than 23, for individual 1. The other individuals use the thresholds given on their own rows.
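
The effect of such a threshold table can be sketched as a lookup by individual and base. This sketch only models the file thresholds. How angsd combines them with the -minQ 20 given on the same command line is not covered here. The name keep_base is hypothetical, and individuals are indexed from 0, so individual 1 is index 0.

```python
BASES = "ACGT"

# First two rows of qThres.txt above: columns are thresholds for A, C, G, T.
Q_THRES = [
    [20, 23, 23, 20],  # individual 1
    [30, 34, 34, 30],  # individual 2
]


def keep_base(individual, base, qscore, thresholds=Q_THRES):
    """Return True if the base passes that individual's threshold for that base type.

    A base is removed when its qscore is below the threshold, matching the
    "remove bases with qscore<minQ" wording of the help text.
    """
    return qscore >= thresholds[individual][BASES.index(base)]


print(keep_base(0, "A", 20))  # True: 20 is not below the A threshold of 20
print(keep_base(0, "C", 20))  # False: 20 is below the C threshold of 23
print(keep_base(1, "T", 29))  # False: 29 is below the T threshold of 30
```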

qscore Distribution

Column 1 is the qscore value, and column 2 is the corresponding count.

qscore	counts
13	87501
14	102888
15	113625
16	130494
17	145577
18	163049
19	180678
20	209447
21	247044
22	279325
23	332391
24	401459
25	484744
26	554127
27	609758
28	772123
29	1041218
30	1204349
31	1516248
32	1934112
33	2210498
34	2269812
35	2083536
36	1901735
37	1151146
38	441422
39	78625
40	21617
41	5870
42	1577
43	551
44	183
45	55
46	23
47	13
48	2
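
A .qs table like the one above can be summarised, for example as the fraction of bases at or above a given quality. The following is a minimal sketch that assumes the two-column layout with a header line shown above. The function names are hypothetical, and only the last four rows of the example are used as input.

```python
def parse_qs(text):
    """Parse 'qscore counts' rows into (qscore, count) tuples, skipping the header."""
    rows = []
    for line in text.splitlines()[1:]:
        if line.strip():
            qscore, count = line.split()
            rows.append((int(qscore), int(count)))
    return rows


def fraction_at_or_above(rows, min_q):
    """Fraction of all counted bases whose qscore is at least min_q."""
    total = sum(count for _, count in rows)
    kept = sum(count for qscore, count in rows if qscore >= min_q)
    return kept / total if total else 0.0


# Last four rows of the example table.
excerpt = "qscore\tcounts\n45\t55\n46\t23\n47\t13\n48\t2\n"
rows = parse_qs(excerpt)
print(fraction_at_or_above(rows, 47))  # 15 of the 93 bases in this excerpt
```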

Depth Distribution

Column 1 in the .depthSample and .depthGlobal files contains the number of sites with a sequencing depth of 0, column 2 the number of sites with a sequencing depth of 1, etc.

The .depthSample file contains the depth distribution per sample. Line 1 corresponds to individual 1, line 2 to individual 2, etc.

29403	87426	162912	229726	267115	259774	222153	170894	114295	71777	41654	22149	11030	5305	2425	1037	419	257	84	60	31	18	19	16	25	16	10
26318	88728	171544	244276	275342	263071	217952	162616	107571	65839	37466	20070	10150	4828	2237	1110	531	253	111	31	3	0	0	0	0	0	00
211936	393333	422459	322225	191564	95488	39672	15427	5220	1658	460	157	90	71	53	38	24	60	32	7	1	2	2	4	2

The .depthGlobal file contains the depth distribution across all individuals.

395	4299	7207	13203	23358	37489	56976	80588	107748	131669	150595	160482	161650	153690	138321	118217	96207	75735	57501	41561	29112	19549	12818	8200	5114	3247	1936	1123	646	378	238	165	105	75	71	43	43	33	27	19	15	17	17	21	24	11	7	7	14	5	1	3	3	3	1	1	3	2	3	1	1	5	4	5	6	11	4	2	1	2	0
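
Because column 1 holds depth 0, column 2 depth 1, and so on, a line of either file can be turned into a mean depth. The following is a minimal sketch with hypothetical function names. Note that the last column lumps together all sites above -maxDepth, so the result underestimates the true mean whenever that bin is not empty.

```python
def parse_depth_line(line):
    """Turn one line of .depthSample or .depthGlobal into a list of site counts."""
    return [int(x) for x in line.split()]


def mean_depth(histogram):
    """Mean depth, where the list index is the depth and the value is the number of sites."""
    sites = sum(histogram)
    if sites == 0:
        return 0.0
    return sum(depth * count for depth, count in enumerate(histogram)) / sites


# Toy histogram: 2 sites at depth 0, 3 sites at depth 1, 5 sites at depth 2.
toy = parse_depth_line("2\t3\t5")
print(mean_depth(toy))
# prints 1.3
```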