ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Abbababa2: Difference between revisions
No edit summary |
No edit summary |
||
Line 7: | Line 7: | ||
<classdiagram type="dir:LR"> | <classdiagram type="dir:LR"> | ||
[*.bam and/or *.cram| NGS genome datasets{bg:orange}]->[Sequence data|All bases or Random bases] | [*.bam and/or *.cram| NGS genome datasets{bg:orange}]->[Sequence data|All bases or Random bases] | ||
[ | [Sequence data]->[Elaborate multiple genomes per population] | ||
[ | [Sequence data]->[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}] | ||
[Elaborate multiple genomes per population]->[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}] | [Elaborate multiple genomes per population]->[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}] | ||
</classdiagram> | </classdiagram> |
Revision as of 12:42, 26 August 2016
The ABBABABA (multipop) compute the abbababa test (aka D-statistic), that means testing for an ancient admixture (or wrong tree topology). Differently from ABBABABA (D_stat) multiple individuals for each one of the groups are allowed. As all methods in ANGSD we require that the header of the BAM files are the same.
- some of the options only works for the developmental version availeble from github
- if you use -rf to specify regions. These MUST appear in the same ordering as your fai file.
<classdiagram type="dir:LR">
[*.bam and/or *.cram| NGS genome datasets{bg:orange}]->[Sequence data|All bases or Random bases]
[Sequence data]->[Elaborate multiple genomes per population] [Sequence data]->[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}] [Elaborate multiple genomes per population]->[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}] </classdiagram>
<classdiagram type="dir:LR"> [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->->estAvgError.R[*.Observed.txt|D stat and Z scores + Error est and Plots{bg:blue}] </classdiagram>
Method
Brief Overview
> ./angsd -doAbbababa2 -------------- abcDstat2.cpp: -doAbbababa2 0 run the abbababa analysis -rmTrans 0 remove transitions -blockSize 5000000 size of each block in bases -anc (null) fasta file with outgroup -sample 0 sample a single base in each individual -maxDepth 100 max depth of each site allowed -sizeH1 1 num of individuals in group H1 -sizeH2 1 num of individuals in group H2 -sizeH3 1 num of individuals in group H3 -sizeH4 1 num of individuals in group H4 -enhance 0 only analyze sites where outgroup H4 is non poly -Aanc 0 set H4 outgroup allele as A in each site -combFile 0 create an optional *.abbababa2counts file where are printed the numbers of alleles combinations without having weighted the individuals
This function will counts the number of ABBA and BABA sites.
Options
- -doAbbababa2 1
take all bases at each position.
- -rmTrans [int]
0; use all reads (default), 1 Remove transitions (important for ancient DNA)
- -blockSize [INT]
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used.
- -anc [fileName.fa]
Include an outgroup in fasta format.
- -doCounts 1
use -doCounts 1 in order to count the bases at each sites after filters.
- -enhance [int]
1: use only sites where the reads for the outgroup has the same base for all reads.
- -sample [int]
1: sample only one base at each position for every individual 0: all bases at each position are used for the ABBABABA test
- -maxDepth [int]
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts
- -sizeH* [int]
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value
- -Aanc [int]
1: H4 allele is A in each site.
- -combFile [int]
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).
In order to do fancy filtering of bases based on quality scores see the Allele counts options.
Output
- .abbbababa2
Output: Each line represents a block with a chromsome name (Column 1), a start position (Column 2), an end postion (Column 3). The new columns are the counts of all 256 counted combination of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is used as input for the R script estAvgError.R. Type "Rscript R/estAvgError.R" to see additional options.
- .abbbababa2counts (optional file)
Output: As above each line represents a block with a chromsome name (Column 1), a start position (Column 2), an end postion (Column 3). The new columns are the counts of all 256 counted combination of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is not used as input for the ABBABABA test.
Example
Run the ABBABABA (multipop), without sampling a single allele for each site. We use 2 individuals for each group, thus the smallBam.filelist file contains the 8 file names ordered from population H1 to H4.
# select 8 individuals (suppose they are already in the desired order H1,..,H4) head -n8 bam.filelist > smallBam.filelist #run angsd ./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 3 -sizeH4 2 -combFile 0 #estimate Z score Rscript R/estAvgError.R file=out outfile=outDstat
This results in multiple output files with all the results.
1) a file outDstatStd.txt, that contains the results of the ABBABABA test without applying neither error correction or ancient transition removal
2) a file outDstatNoErrorNoTrans.txt, that contains the results of the ABBABABA test without applying error correction but removing ancient transitions
Output
1)outDstatStd.txt
mean(D) JK-D V(JK-D) Z pvalue nABBA nBABA nBBAA -0.042874 -0.042945 0.000643 -1.690314 0.090968 1697.447933 1704.675933 6374.948767
2)outDstatNoErrorNoTrans.txt
mean(D) JK-D V(JK-D) Z pvalue nABBA nBABA nBBAA -0.042874 -0.042945 0.000643 -1.690314 0.090968 1697.447933 1704.675933 6374.948767
mean(D) The average of test statistics: (nABBA-nBABA)/(nABBA+nBABA), each one calculated for a block of data.
JK-D The estimated test statistic: (nABBA-nBABA)/(nABBA+nBABA) after being bias corrected. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is. A positive value means that H2 is closer to H3 than H1 is.
V(JK-D) estimated m-delete blocked Jackknife variance of the estimator of column 2. It's used to compute the Z-value.
Z Z value that can be used to determine the significance of the test. As in Reich et al. an absolute value of the Z score above 3 is often used as a critical value.
pvalue p-value corresponding to Z for a double-sided standard test. the critical value 0.001 correspond to the value of Z=3.
nABBA the total counts of ABBA patterns
nBABA the total counts of BABA patterns
nBBAA the total counts of BBAA patterns