ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Abbababa2: Difference between revisions
| No edit summary | No edit summary | ||
| Line 16: | Line 16: | ||
| [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}] | [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}] | ||
| [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}] | [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}] | ||
| [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*. | [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*.RemTrans.txt|D stat and Z scores with Transition Removal{bg:blue}] | ||
| </classdiagram> | </classdiagram> | ||
Revision as of 13:52, 26 August 2016
The ABBABABA (multipop) compute the abbababa test (aka D-statistic), that means testing for an ancient admixture (or wrong tree topology). Differently from ABBABABA (D_stat) multiple individuals for each one of the groups are allowed. As all methods in ANGSD we require that the header of the BAM files are the same.
- some of the options only works for the developmental version availeble from github
- if you use -rf to specify regions. These MUST appear in the same ordering as your fai file.
<classdiagram type="dir:LR">
[*.bam and/or *.cram| NGS genome datasets{bg:orange}]->[Sequence data|All bases or Random bases]
[Sequence data]->[Elaborate multiple genomes per population] [Sequence data]->[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}] [Elaborate multiple genomes per population]->[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}] </classdiagram>
<classdiagram type="dir:LR"> [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}] [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}] [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}] [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[*.RemTrans.txt|D stat and Z scores with Transition Removal{bg:blue}] </classdiagram>
Method
Brief Overview
> ./angsd -doAbbababa2
--------------
abcDstat2.cpp:
	-doAbbababa2	                0	run the abbababa analysis
	-rmTrans		        0       remove transitions
	-blockSize		       5000000	size of each block in bases
	-anc			       (null)	fasta file with outgroup
	-sample			        0	sample a single base in each individual
	-maxDepth		        100	max depth of each site allowed
	-sizeH1			        1	num of individuals in group H1
	-sizeH2			        1	num of individuals in group H2
	-sizeH3			        1	num of individuals in group H3
	-sizeH4			        1	num of individuals in group H4
	-enhance			0	only analyze sites where outgroup H4 is non poly
	-Aanc			        0	set H4 outgroup allele as A in each site
	-combFile		        0	create an optional *.abbababa2counts file where are printed the 
                                     numbers of alleles combinations without having weighted the individuals
This function will counts the number of ABBA and BABA sites.
Options
- -doAbbababa2 1
take all bases at each position.
- -rmTrans [int]
0; use all reads (default), 1 Remove transitions (important for ancient DNA)
- -blockSize [INT]
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used.
- -anc [fileName.fa]
Include an outgroup in fasta format.
- -doCounts 1
use -doCounts 1 in order to count the bases at each sites after filters.
- -enhance [int]
1: use only sites where the reads for the outgroup has the same base for all reads (If you want to use it, it must be only when you have one genome in the outgroup, it won't work otherwise).
- -sample [int]
1: sample only one base at each position for every individual 0: all bases at each position are used for the ABBABABA test
- -maxDepth [int]
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts
- -sizeH* [int]
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value
- -Aanc [int]
1: H4 allele is A in each site.
- -combFile [int]
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).
In order to do fancy filtering of bases based on quality scores see the Allele counts options.
Tutorial of the ABBABABA (Multipop) test
- Some preparation steps before using ANGSD
Download the latest version of angsd in your working folder from the github repository
https://github.com/ANGSD/angsd.git
Create symbolic links to angsd and the necessary R script
ln -s ./angsd/angsd ANGSD ln -s ./angsd/R/estAvgError.R RSCRIPT
Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets
wget http://popgen.dk/software/download/angsd/bams.tar.gz tar xf bams.tar.gz for i in bams/*.bam;do samtools index $i;done #index bam files ls bams/*.bam > bam.filelist rm bams.tar.gz #remove zipped file
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the four-population test. One can use a bam file as well.
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz mv hg19ancNoChr.fa.gz chimpHg19.fa.gz gunzip chimpHg19.fa.gz samtools faidx chimpHg19.fa #indexing the fasta file rm chimpHg19.fa.gz
Now, generate a fasta file for one of our 10 bam file. We assume such a genome has very high quality and we can use it as a reference for estimating error rates in others of our datasets.
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU gunzip perfectSampleCEU.fa.gz samtools faidx perfectSampleCEU.fa
- Prepare files for the estimation of type specific error rates
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.
sed -n 1,2p bam.filelist > bamH1.filelist sed -n 3,7p bam.filelist > bamH2.filelist
and then we use "doAncError" to generate the intermediate files that we will use later to estimate the error rates for the two groups H1 and H2. "doAncError" apply the so called "perfect individual assumption", based on which error rates are estimate using a high quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in our preparation phase.
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist ./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist
- ABBABABA test
Now, we want to run the four population test on: H1: first 2 bam files H2: bam files from 3 to 7 H3: bam files from 8 to 10 H4: chimpHg19.fa file After running ANGSD we will call the R script who apply error correction to the ABBA and BABA allele combinations and produce the final output files.
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30
The output file is
- bam.AllelePatterns.abbbababa2 (used for the 4-population test)
Each line represents a block with a chromsome name (Column 1), a start position (Column 2), an end postion (Column 3). The new columns are the counts of all 256 counted combination of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is used as input for the R script estAvgError.R. Optionally, one can also produce
- bam.AllelePatterns.abbbababa2counts (optional file)
As above each line represents a block with a chromsome name (Column 1), a start position (Column 2), an end postion (Column 3). The new columns are the counts of all 256 counted combination of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is not used as input for the ABBABABA test.
We run the R script specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction if we add individually to each population an error rate between 0 and 0.005 with step 0.001 and involving transitions A->T and C-->T. It is also possible to specify the names of H1,H2,H3 to be seen on the plot. In this case we use the generic names CEU1,CEU2,CEU3. When at least an error file is given as input, the script will apply error correction.
Rscript RSCRIPT angsdFile="bam.AllelePatterns" out="result" file1="bamH1.ancError" file2="bamH2.ancError" addErr="0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3"
The script will show the calculated D statistic along with Z-score, Pvalues, Standard deviation and other quantities.
--- Table of Results --- --------------------------------------------------------------------------------- Mode |Dstat |sd(Dstat) |Djack |Zscore |Pvalue --------------------------------------------------------------------------------- Observed |-6.323e-02 |6.985e-02 |-6.323e-02 |-0.905 |3.7e-01 --------------------------------------------------------------------------------- Err Corr |-6.430e-02 |7.226e-02 |-6.431e-02 |-0.890 |3.7e-01 --------------------------------------------------------------------------------- No Trans |-1.141e-02 |6.311e-02 |-1.141e-02 |-0.181 |8.6e-01 --------------------------------------------------------------------------------- Err Corr | | | | | and |-1.494e-02 |6.615e-02 |-1.496e-02 |-0.226 |8.2e-01 No Trans | | | | | --------------------------------------------------------------------------------- plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder
Those results are also contained in four distinct files
- 1) result.Observed.txt
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal
mean(D) JK-D V(JK-D) Z pvalue nABBA nBABA nBBAA -0.063233 -0.063233 0.004878 -0.905320 0.365296 246.033565 279.248560 292.834879
- 2) result.ErrorCorr.txt
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal
mean(D) JK-D V(JK-D) Z pvalue nABBA nBABA nBBAA -0.064295 -0.064309 0.005221 -0.889833 0.373555 238.242964 270.983960 293.326044
- 3) result.TransRemErrorCorr.txt
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal
mean(D) JK-D V(JK-D) Z pvalue nABBA nBABA nBBAA -0.014939 -0.014959 0.004376 -0.225829 0.821335 81.636843 84.112983 293.326044
- 4) result.RemTrans.txt
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal
mean(D) JK-D V(JK-D) Z pvalue nABBA nBABA nBBAA -0.011406 -0.011406 0.003983 -0.180730 0.856580 85.730478 87.708709 292.834879
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=pvalue from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the number of patterns might not be integer because of how ANGSD treats multiple genomes per populations.
In case of error correction, the R script also creates the folder result.errorDataFolder containing: -the file barPlotErrors.pdf showing a barplot of the error rates -the file plotAddErr.A2T.pdf showing the effect of error correction on transition A-->T -the file plotAddErr.C2T.pdf showing the effect of error correction on transition C-->T -the file errorRates.txt showing in each line transition errors for each population, respectively -all the files related to the addition of error correction to H1,H2,H3, necessary to plot the files plotAddErr.A2T.pdf and plotAddErr.C2T.pdf.
