EvalAdmix: Difference between revisions

From software
Jump to navigation Jump to search
(Created page with "evalAdmix allows to evaluate the results of an admixture analysis. It takes as input the genotype data (either called genotypes in plink files or genotype likelihoods beagle f...")
 
No edit summary
 
(50 intermediate revisions by 2 users not shown)
Line 1: Line 1:
evalAdmix allows to evaluate the results of an admixture analysis. It takes as input the genotype data (either called genotypes in plink files or genotype likelihoods beagle files) used in the admixture analysis and the frequency
'''IMPORTANT''': version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data, which caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness this would result in the last samples from the analyses having a correlation of nan with all other samples; but might have some more subtle effects whenever there is some level of missingness. If you have analyses from previous versions based on genotype data with any missingness might be a good idea to re-run them after updating. The bug did not affect the genotype likelihoods implementation so if you based the analyses on genotype likelihoods you do not need to worry. If you applied it to gentoype data without any missingness you also do not need to worry.
and admixture propotions (P and Q files) generated.


The output is a pairwise correlation of residuals matrix between individuals The correlation will be 0 in case of a good fit of the data to the admixture model. When something is wrong, individuals from the same population will be positively correlated; and individuals from different populationts but that share one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.
evalAdmix allows to evaluate the results of an admixture analysis (i.e. the result of applying [https://genome.cshlp.org/content/19/9/1655.long ADMIXTURE], [https://web.stanford.edu/group/pritchardlab/structure.html STRUCTURE], [http://www.popgen.dk/software/index.php/NgsAdmix NGSadmix] and similar). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can either be called genotypes in [https://www.cog-genomics.org/plink/1.9/formats#bed binary plink format] or genotype likelihoods in [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format beagle format].


The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.




[[File:evalAdmix.png|thumb]]
 
[[File:evalAdmix.jpg|thumb|550px]]




Line 14: Line 15:
evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]  
evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]  


<code>  
<pre>git clone https://github.com/GenisGE/evalAdmix.git
git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
cd evalAdmix
make
make
</code>
</pre>


==Quick start==
==Quick start==
:<code> ./NGSadmix -likes inputBeagleFile.gz -K 3 -o outFileName -P 10  </code>
:<code> ./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10  </code>
:<code> ./evalAdmix -plink inputPlinkPrefix  -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10  </code>


* '''-likes''' beagle file of genotype likelihoods
* '''-beagle''' beagle file of genotype likelihoods
* '''-K''' number of clusters
* '''-plink''' binary plink file prefix with genotype data
* '''-fname''' file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
* '''-qname''' file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
* '''-o''' prefix of output file names
* '''-o''' prefix of output file names
* '''-P''' Number of threads used
* '''-P''' Number of threads used
Line 30: Line 33:
==Parameters==
==Parameters==


All parameters are set using '''-par value'''.
<pre>./evalAdmix  </pre>
For example, to get additional information, you would write '''-printInfo 1'''.


<pre>./NGSadmix </pre>
<pre>
Arguments:
Required:
-plink path to binary plink file (excluding the .bed)
or
      -beagle path to beagle file containing genotype likelihoods (alternative to -plink)
-fname path to ancestral population frequencies file
      -qname path to admixture proportions file
Optional:     
      -o name of the output file
     
Setup (optional):
      -P 1 number of threads
      -autosomeMax 23 autosome ends with this chromsome
      -nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
      -useSites 1.0 proportion of sites to use to calculate correlation of residuals
      -useInds filename    path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
      -misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
      -minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
</pre>


Arguments:
==Input Files==


::'''-likes''' .beagle format filename with genotype likelihoods
===Plink===
Genotype data files in binary PLINK format (.bed .fam .bim).
===Beagle genotype likelhoods===
The input file contains genotype likelihoods in a .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html].
and can be compressed with gzip.
==== BAM files  ====
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]


::'''-K''' Number of ancestral populations
==== VCF files ====
If you already have made a VCF file that contains genotype likehood information then  it should be possible to convert .vcf files with genotype likelihoods to .beagle file via vcftools [https://vcftools.github.io/man_latest.html]


Optional:
<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
</pre>
Chromosome has to be specified.


::'''-fname''' Ancestral population frequencies
You can also use bcftools' [https://samtools.github.io/bcftools/bcftools.html] 'query' option for generating a .beagle file from a .vcf file.


::'''-qname''' Admixture proportions
==Output File==
The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:


::'''-outfiles''' Prefix for output files
NA      0.008609        -0.006919      0.002731        0.020224<br />
0.008609        NA      0.000033        0.004968        -0.008470<br />
-0.006919      0.000033        NA      0.006982        0.005664<br />
0.002731        0.004968        0.006982        NA      0.000521<br />
0.020224        -0.008470      0.005664        0.000521        NA


::'''-printInfo''' print ID and mean maximum allele frequency (maf) for the SNPs that were analysed
==Run command example==


Setup:
=== Genotype data ===
Download the input file containing genotypes in binary plink format
<pre>
wget https://popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget https://popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget https://popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
</pre>


::'''-seed''' Seed for initial guess in EM algorithm (a number lower than 1M is preferred).
Run ADMIXTURE [http://software.genetics.ucla.edu/admixture/] to obtain admixture proprotions
::The same seed can be used to reproduce the analysis, and 3 different seeds can be used to test convergence.


::'''-P''' Number of threads
<pre>admixture admixTjeck2.bed 3 -j20</pre>


::'''-method''' 0 indicates no acceleration of EM algorithm. Please refer to the paper for more information.
::Genotypes file prefix admixTjeck2.bed
::Number of Ancestral Populations K = 3
::Number of CPU threads = 20


::'''-misTol''' Tolerance for considering a site as missing. Default = 0.05.
::To include high quality genotypes only, increase this value (for example, 0.9)


Stop criteria:
Run evalAdmix
<pre>./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 </pre>


::'''-tolLike50''' Loglikelihood difference in 50 iterations. Default= 0.1
::Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
::Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
::Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
::Computer cores = 20 (-P 20).  


::'''-tol''' Tolerance for convergence. Default = 1x10<sup>-5</sup>. Use maller values for higher accuracy.
::It's the maximum squared difference of F and Q (please refer to the paper for formula).


::'''-dymBound''' Use dymamic boundaries (1: yes (default) 0: no).
Plot results in R
[[File:evalAdmixK3.Q.png|thumb|frame|admixture proportions]]
[[File:evalAdmixK3.cor.png|thumb|frame|evalAdmix correlations]]
<pre>
source("visFuns.R")


# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q",stringsAsFactors=T)


::'''-maxiter''' Maximum number of EM iterations. Default = 2000 (high value).
palette(c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928","#1B9E77","#999999"))
::In case it doesn't converge, this value needs to be higher.


Filtering:
# order according to population and plot the ADMIXTURE reults
ord<-orderInds(pop = as.vector(pop[,2]), q = q)


::'''-minMaf''' Minimum minor allele frequency. Default = 5%
#make barplot
plotAdmix(q,ord=ord,pop=pop[,2])


::'''-minLrt''' Minimum likelihood ratio value for maf>0. Default = 0


::'''-minInd''' Minumum number of informative individuals. Default = 0
::It only keeps sites where there is at least x # of individuals with NGS data.


==Input File==
r<-as.matrix(read.table("output.corres.txt"))


The input file contains genotype likelihoods in a .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html].
# Plot correlation of residuals
and can be compressed with gzip.
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)
=== BAM files  ===
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]


=== VCF files ===
</pre>
If you already have made a VCF file that contains genotype likehood information then  it should be possible to convert .vcf files with genotype likelihoods to .beagle file via vcftools [https://vcftools.github.io/man_latest.html]


=== Low depth sequencing data ===
Download the input file containing genotype likelihoods in beagle format
<pre>
<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
</pre>
</pre>
Chromosome has to be specified.


You can also use bcftools' [https://samtools.github.io/bcftools/bcftools.html] 'query' option for generating a .beagle file from a .vcf file.
Execute [[NgsAdmix |NGSadmix]] to obtain admixture proportions
<pre>./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05</pre>


==Output Files==
::Input file = Demo2input.gz
The analysis performed by NGSadmix produces 4 files:
::Ancestral Populations K=3
::Computer cores = 20 (-P 20).
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5%  (-minMaf 0.05)


* Log likelihood of the estimates: a .log file that summarizes the run. The Command line used for running the program, what the likelihood is every 50 iterations, and finally how long it took to do the run.
Run evalAdmix
<pre>./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 </pre>


* Estimated allele frequency: a zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations. There is a line for each locus.
::Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
::Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
::Admixture proportions file myoutfiles.qopt(-qname myoutfiles.qopt).
::Computer cores = 20 (-P 20).  


* Estimated admixture proportions: a .qopt file, that contains an estimate of the individual's ancestry proportion (admixture) from each of the three assumed ancestral populations for all individuals. There is a line for each individual.
Plot results in R
<pre>
source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")


==Run command example==
# order according to population and plot the NGSadmix reults
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)


Download the input file
r<-read.table("output.corres.txt")
::<code>wget popgen.dk/software/download/NGSadmix/data/input.gz</code>


Execute NGSadmix
# Plot correlation of residuals
::<code>./NGSadmix -likes input.gz -K 3 -P 4 -o myoutfiles -minMaf 0.05</code>


::Input file = input.gz
plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)
::Ancestral Populations K=3
::Computer cores = 4 (-P 4).  
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5%  (-minMaf 0.05)


</pre>


==Citation==
==Citation==
[https://doi.org/10.1111/1755-0998.13171 Evaluation of Model Fit of Inferred Admixture Proportions]

Latest revision as of 18:57, 4 March 2024

IMPORTANT: version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data, which caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness this would result in the last samples from the analyses having a correlation of nan with all other samples; but might have some more subtle effects whenever there is some level of missingness. If you have analyses from previous versions based on genotype data with any missingness might be a good idea to re-run them after updating. The bug did not affect the genotype likelihoods implementation so if you based the analyses on genotype likelihoods you do not need to worry. If you applied it to gentoype data without any missingness you also do not need to worry.

evalAdmix allows to evaluate the results of an admixture analysis (i.e. the result of applying ADMIXTURE, STRUCTURE, NGSadmix and similar). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can either be called genotypes in binary plink format or genotype likelihoods in beagle format.

The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.



Download and Installation

evalAdmix can be installed from github

git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make

Quick start

./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10
./evalAdmix -plink inputPlinkPrefix -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10
  • -beagle beagle file of genotype likelihoods
  • -plink binary plink file prefix with genotype data
  • -fname file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
  • -qname file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
  • -o prefix of output file names
  • -P Number of threads used

Parameters

./evalAdmix  
Arguments:
	Required:
		-plink path to binary plink file (excluding the .bed)
		or
	      	-beagle path to beagle file containing genotype likelihoods (alternative to -plink)
		
		-fname path to ancestral population frequencies file
	       	-qname path to admixture proportions file
		
	Optional:       
	
	       -o name of the output file
	       
	 Setup (optional):
	 
	       -P 1 number of threads
	       -autosomeMax 23	 autosome ends with this chromsome
	       -nIts 5	 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
	       -useSites 1.0	 proportion of sites to use to calculate correlation of residuals
	       -useInds filename     path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
	       -misTol 0.05 	 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
	       -minMaf 0.05 	 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)

Input Files

Plink

Genotype data files in binary PLINK format (.bed .fam .bim).

Beagle genotype likelhoods

The input file contains genotype likelihoods in a .beagle file format [1]. and can be compressed with gzip.

BAM files

If you have BAM files you can use ANGSD to produce genotype likelihoods in .beagle format. Please see Creation of Beagle files with ANGSD

VCF files

If you already have made a VCF file that contains genotype likehood information then it should be possible to convert .vcf files with genotype likelihoods to .beagle file via vcftools [2]

vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2

Chromosome has to be specified.

You can also use bcftools' [3] 'query' option for generating a .beagle file from a .vcf file.

Output File

The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:

NA 0.008609 -0.006919 0.002731 0.020224
0.008609 NA 0.000033 0.004968 -0.008470
-0.006919 0.000033 NA 0.006982 0.005664
0.002731 0.004968 0.006982 NA 0.000521
0.020224 -0.008470 0.005664 0.000521 NA

Run command example

Genotype data

Download the input file containing genotypes in binary plink format

wget https://popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget https://popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget https://popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam

Run ADMIXTURE [4] to obtain admixture proprotions

admixture admixTjeck2.bed 3 -j20
Genotypes file prefix admixTjeck2.bed
Number of Ancestral Populations K = 3
Number of CPU threads = 20


Run evalAdmix

./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 
Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
Computer cores = 20 (-P 20).


Plot results in R

Error creating thumbnail: File missing
admixture proportions
Error creating thumbnail: File missing
evalAdmix correlations
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q",stringsAsFactors=T)

palette(c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928","#1B9E77","#999999"))

# order according to population and plot the ADMIXTURE reults
ord<-orderInds(pop = as.vector(pop[,2]), q = q)

#make barplot
plotAdmix(q,ord=ord,pop=pop[,2])



r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

Low depth sequencing data

Download the input file containing genotype likelihoods in beagle format

wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info

Execute NGSadmix to obtain admixture proportions

./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05
Input file = Demo2input.gz
Ancestral Populations K=3
Computer cores = 20 (-P 20).
Output prefix = myoutfiles (-o myoutfiles)
SNPs with MAF > 5% (-minMaf 0.05)

Run evalAdmix

./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 
Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
Admixture proportions file myoutfiles.qopt(-qname myoutfiles.qopt).
Computer cores = 20 (-P 20).

Plot results in R

source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix reults
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r<-read.table("output.corres.txt")

# Plot correlation of residuals

plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

Citation

Evaluation of Model Fit of Inferred Admixture Proportions