EvalAdmix: Difference between revisions
Revision as of 09:41, 27 April 2020
evalAdmix evaluates the results of an admixture analysis (i.e. the output of ADMIXTURE, STRUCTURE, NGSadmix and similar programs). It only needs the genotype data used as input for that admixture analysis and the output of the analysis (admixture proportions and ancestral population frequencies). The genotype input can be either called genotypes in binary plink format or genotype likelihoods in beagle format.
The output is a matrix of pairwise correlations of residuals between individuals. If the data fit the admixture model well, the correlations will be close to 0. When individuals do not fit the model, individuals with similar demographic histories (usually individuals from the same population) will be positively correlated, and individuals with different histories that are modelled as sharing one or more ancestral populations as admixture sources will be negatively correlated. A positive correlation between a pair of individuals might also be due to relatedness.
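To make this interpretation concrete, the following sketch averages the correlations within and between populations once an output matrix exists. It is only an illustration: the function name pop_summary and the labels file (one population label per line, in the same order as the individuals in the matrix) are assumptions, not part of evalAdmix.

```shell
# pop_summary CORRES LABELS
#   CORRES: N x N correlation of residuals matrix with NA on the diagonal
#   LABELS: hypothetical helper file, one population label per line,
#           in the same order as the rows of CORRES
pop_summary() {
  awk '
    FILENAME == ARGV[1] { pop[FNR] = $1; next }   # first file: labels
    {
      for (j = FNR + 1; j <= NF; j++) {           # upper triangle only
        if ($j == "NA") continue
        if (pop[FNR] == pop[j]) { ws += $j; wn++ }  # same population
        else                    { bs += $j; bn++ }  # different populations
      }
    }
    END {
      if (wn) printf "within\t%.6f\n", ws / wn
      if (bn) printf "between\t%.6f\n", bs / bn
    }
  ' "$2" "$1"
}
```

A call would look like pop_summary output.corres.txt labels.txt. A within-population mean clearly above 0, or a between-population mean clearly below 0, points to the kind of misfit described above; the averages are a coarse summary and do not replace inspecting the full matrix.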
Download and Installation
evalAdmix can be installed from GitHub:
git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
Quick start
./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 10
./evalAdmix -plink inputPlinkPrefix -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 10
- -beagle beagle file of genotype likelihoods
- -plink binary plink file prefix with genotype data
- -fname file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
- -qname file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
- -o prefix of output file names
- -P Number of threads used
Parameters
./evalAdmix
Arguments:
  Required:
    -plink path to binary plink file (excluding the .bed)
    or
    -beagle path to beagle file containing genotype likelihoods (alternative to -plink)
    -fname path to ancestral population frequencies file
    -qname path to admixture proportions file
  Optional:
    -o name of the output file
  Setup (optional):
    -P 1 number of threads
    -autosomeMax 23 autosome ends with this chromsome
    -nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
    -useSites 1.0 proportion of sites to use to calculate correlation of residuals
    -useInds filename path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
    -misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
    -minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
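The file expected by -useInds can be written by hand, but for plink input it may be easier to derive it from the .fam file. The sketch below is one possible way to do that. The function name make_useinds and the exclusion list are hypothetical, and using the second .fam column (the individual ID) as the identifier is an assumption, since the help text above does not say which ID evalAdmix expects.

```shell
# make_useinds FAM EXCLUDE
#   FAM:     PLINK .fam file (column 2 holds the individual ID)
#   EXCLUDE: hypothetical file with one individual ID per line to leave out
# Prints "ID<TAB>1" for individuals to keep and "ID<TAB>0" for excluded ones.
make_useinds() {
  awk '
    FILENAME == ARGV[1] { drop[$1] = 1; next }   # first file: IDs to exclude
    {
      keep = ($2 in drop) ? 0 : 1                # assumption: match on .fam column 2
      printf "%s\t%d\n", $2, keep
    }
  ' "$2" "$1"
}
```

For example, make_useinds admixTjeck2.fam exclude.txt > useInds.txt writes a file that could then be passed with -useInds useInds.txt.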
Input Files
Plink
Genotype data files in binary PLINK format (.bed .fam .bim).
Beagle genotype likelihoods
The input file contains genotype likelihoods in .beagle file format [1] and can be compressed with gzip.
BAM files
If you have BAM files you can use ANGSD to produce genotype likelihoods in .beagle format. Please see Creation of Beagle files with ANGSD
VCF files
If you already have a VCF file that contains genotype likelihood information, it should be possible to convert it to a .beagle file with vcftools [2]:
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
Chromosome has to be specified.
You can also use the bcftools [3] query command to generate a .beagle file from a .vcf file.
Output File
The analysis performed by evalAdmix produces one file containing a tab-delimited N × N symmetric correlation matrix, where column i of line j contains the correlation of residuals between individuals i and j, and the diagonal values (self-correlation) are set to NA:
NA 0.008609 -0.006919 0.002731 0.020224
0.008609 NA 0.000033 0.004968 -0.008470
-0.006919 0.000033 NA 0.006982 0.005664
0.002731 0.004968 0.006982 NA 0.000521
0.020224 -0.008470 0.005664 0.000521 NA
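Because the output is a plain text matrix, it can be summarised without R. The following is a minimal sketch, assuming the file has no header line, as in the example above. The function names mean_offdiag and flag_pairs are illustrative and not part of evalAdmix, and any threshold passed to flag_pairs is an arbitrary choice.

```shell
# mean_offdiag FILE
#   Mean of all off-diagonal correlations, skipping NA entries.
mean_offdiag() {
  awk '
    { for (i = 1; i <= NF; i++) if (i != NR && $i != "NA") { s += $i; n++ } }
    END { if (n) printf "%.6f\n", s / n }
  ' "$1"
}

# flag_pairs FILE THRESHOLD
#   Prints "i j r" for every pair i < j whose absolute correlation
#   exceeds THRESHOLD (indices are 1-based row and column numbers).
flag_pairs() {
  awk -v t="$2" '
    {
      for (i = NR + 1; i <= NF; i++) {
        if ($i == "NA") continue
        r = $i + 0
        if (r > t || r < -t) print NR, i, $i
      }
    }
  ' "$1"
}
```

For example, flag_pairs output.corres.txt 0.01 lists the pairs whose correlation is further than 0.01 from zero, which can help locate the individuals behind a poor fit before plotting.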
Run command example
Genotype data
Download the input file containing genotypes in binary plink format
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
Run ADMIXTURE [4] to obtain admixture proportions
admixture admixTjeck2.bed 3
Run evalAdmix
./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20
- Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
- Ancestral Populations frequency file (space delimited matrix where rows are sites and columns ancestral populations) (-fname admixTjeck2.3.P).
- Admixture proportions file (space delimited matrix where rows are individuals and columns ancestral populations) (-qname admixTjeck2.3.Q).
- Computer cores = 20 (-P 20).
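Since the rows of the -qname file are individuals and the rows of the -fname file are sites, a quick consistency check before running evalAdmix on plink input is to compare the row counts with the .fam and .bim files. The sketch below assumes ADMIXTURE was run on the same unfiltered plink files; check_dims is a hypothetical helper, not part of evalAdmix.

```shell
# check_dims PLINK_PREFIX QFILE PFILE
#   Returns 0 if QFILE has one row per individual in PLINK_PREFIX.fam
#   and PFILE has one row per site in PLINK_PREFIX.bim, 1 otherwise.
check_dims() {
  n_ind=$(awk 'END { print NR }' "$1.fam")
  n_site=$(awk 'END { print NR }' "$1.bim")
  q_rows=$(awk 'END { print NR }' "$2")
  p_rows=$(awk 'END { print NR }' "$3")
  status=0
  if [ "$n_ind" -ne "$q_rows" ]; then
    echo "individuals: $n_ind in $1.fam but $q_rows rows in $2" >&2
    status=1
  fi
  if [ "$n_site" -ne "$p_rows" ]; then
    echo "sites: $n_site in $1.bim but $p_rows rows in $3" >&2
    status=1
  fi
  return "$status"
}
```

For the example above this would be called as check_dims admixTjeck2 admixTjeck2.3.Q admixTjeck2.3.P. A mismatch usually means the admixture results were produced from a different or filtered version of the genotype data.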
Plot results in R
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q")

# order according to population and plot the ADMIXTURE results
ord<-orderInds(pop = pop, q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(tapply(1:nrow(pop),pop[ord,2],mean),-0.05,unique(pop[ord,2]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,2]),function(x){sum(pop[ord,2]==x)})),col=1,lwd=1.2)

# read the correlation of residuals matrix produced by evalAdmix
r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)
Low depth sequencing data
Download the input file containing genotype likelihoods in beagle format
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
Execute NGSadmix to obtain admixture proportions
./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05
- Input file = Demo2input.gz
- Ancestral Populations K=3
- Computer cores = 20 (-P 20).
- Output prefix = myoutfiles (-o myoutfiles)
- SNPs with MAF > 5% (-minMaf 0.05)
Run evalAdmix
./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20
- Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
- Ancestral Populations frequency file (space delimited matrix where rows are sites and columns ancestral populations) (-fname myoutfiles.fopt.gz).
- Admixture proportions file (space delimited matrix where rows are individuals and columns ancestral populations) (-qname myoutfiles.qopt).
- Computer cores = 20 (-P 20).
Plot results in R
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix results
ord<-orderInds(pop = pop, q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

# read the correlation of residuals matrix produced by evalAdmix
r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)
Citation
evalAdmix has a preprint