EvalAdmix
evalAdmix evaluates the results of an admixture analysis. It takes as input the genotype data used in the admixture analysis (either called genotypes in PLINK files or genotype likelihoods in beagle files) together with the frequencies and admixture proportions (P and Q files) that the analysis generated.
The output is a matrix of pairwise correlations of residuals between individuals. The correlation is expected to be 0 when the data fit the admixture model well. When the fit is poor, individuals from the same population will be positively correlated, and individuals from different populations that share one or more ancestral populations as admixture sources will be negatively correlated. A positive correlation between a pair of individuals might also be due to relatedness.
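To make the idea of a residual correlation concrete, here is a simplified sketch on made-up data. It is not evalAdmix's actual estimator, which may apply further corrections. The file toy_sites.txt, the admixture proportions (0.5, 0.5) for both individuals, and the use of a plain Pearson correlation are all assumptions for illustration. The residual is the observed genotype minus the genotype predicted by the model, which is twice the individual allele frequency (the sum of q times f over ancestral populations).

```shell
# Toy data (made up): one line per site with columns
#   f1 f2 gA gB
# f1, f2 = allele frequency in ancestral populations 1 and 2
# gA, gB = genotypes (0, 1 or 2) of hypothetical individuals A and B
cat > toy_sites.txt <<'EOF'
0.1 0.9 1 1
0.2 0.6 0 1
0.5 0.5 2 1
0.8 0.4 1 2
EOF

# Admixture proportions of A and B are passed in as qa1, qa2, qb1, qb2.
corr=$(awk -v qa1=0.5 -v qa2=0.5 -v qb1=0.5 -v qb2=0.5 '
{
  # residual = observed genotype - 2 * (q1*f1 + q2*f2)
  ra = $3 - 2 * (qa1 * $1 + qa2 * $2)
  rb = $4 - 2 * (qb1 * $1 + qb2 * $2)
  n++; sa += ra; sb += rb
  saa += ra * ra; sbb += rb * rb; sab += ra * rb
}
END {
  # Pearson correlation of the two residual vectors across sites
  cov = sab / n - (sa / n) * (sb / n)
  va  = saa / n - (sa / n) ^ 2
  vb  = sbb / n - (sb / n) ^ 2
  printf "%.3f\n", cov / sqrt(va * vb)
}' toy_sites.txt)

echo "residual correlation between A and B: $corr"
```

With only four sites the value is very noisy; it is meant to show the calculation, not a realistic result.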
Download and Installation
evalAdmix can be installed from GitHub:
git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
Quick start
./NGSadmix -likes inputBeagleFile.gz -K 3 -o outFileName -P 10
- -likes beagle file of genotype likelihoods
- -K number of clusters
- -o prefix of output file names
- -P Number of threads used
Parameters
All parameters are set as -parameter value pairs. For example, to print additional information, use -printInfo 1.
./NGSadmix
Arguments:
- -likes .beagle format filename with genotype likelihoods
- -K Number of ancestral populations
Optional:
- -fname Ancestral population frequencies
- -qname Admixture proportions
- -outfiles Prefix for output files
- -printInfo Print the ID and mean minor allele frequency (maf) of the SNPs that were analysed
Setup:
- -seed Seed for initial guess in EM algorithm (a number lower than 1M is preferred).
- The same seed can be used to reproduce the analysis, and 3 different seeds can be used to test convergence.
- -P Number of threads
- -method 0 indicates no acceleration of EM algorithm. Please refer to the paper for more information.
- -misTol Tolerance for considering a site as missing. Default = 0.05.
- To include high quality genotypes only, increase this value (for example, 0.9)
Stop criteria:
- -tolLike50 Loglikelihood difference in 50 iterations. Default= 0.1
- -tol Tolerance for convergence. Default = 1e-5. Use smaller values for higher accuracy.
- It's the maximum squared difference of F and Q (please refer to the paper for formula).
- -dymBound Use dynamic boundaries (1: yes (default), 0: no).
- -maxiter Maximum number of EM iterations. Default = 2000 (high value).
- In case it doesn't converge, this value needs to be higher.
Filtering:
- -minMaf Minimum minor allele frequency. Default = 5%
- -minLrt Minimum likelihood ratio value for maf>0. Default = 0
- -minInd Minimum number of informative individuals. Default = 0
- Only sites where at least this many individuals have NGS data are kept.
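The note on -seed above suggests running the analysis with several seeds to test convergence. A minimal sketch of that loop follows. The function name run_seeds, the seed values 1 to 3, and the output prefix pattern are illustrative choices; only the flags -likes, -K, -seed and -o come from the parameter list.

```shell
# run_seeds BINARY BEAGLE_FILE K
# Runs the analysis once per seed, writing each run to its own output prefix.
# The seeds 1, 2, 3 and the prefix run_K<K>_seed<seed> are arbitrary choices.
run_seeds() {
  bin=$1
  likes=$2
  k=$3
  for seed in 1 2 3; do
    "$bin" -likes "$likes" -K "$k" -seed "$seed" -o "run_K${k}_seed${seed}"
  done
}
```

Called as run_seeds ./NGSadmix input.gz 3, this would produce three sets of output files. Similar final likelihoods in the three .log files suggest that the runs converged to the same solution.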
Input File
The input file contains genotype likelihoods in .beagle file format [1] and can be compressed with gzip.
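As a quick sanity check before running an analysis, the number of individuals and sites in a beagle file can be read from its layout: a header line, then one line per site, with three leading columns (marker and two alleles) followed by three likelihood columns per individual. The sketch below builds a toy file with made-up values (toy.beagle.gz is a hypothetical name) and counts both.

```shell
# Build a toy genotype likelihood file: 2 sites, 2 individuals (made-up values)
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n' > toy.beagle
printf 'chr1_100\t0\t2\t0.90\t0.09\t0.01\t0.333\t0.333\t0.333\n' >> toy.beagle
printf 'chr1_200\t1\t3\t0.05\t0.90\t0.05\t0.80\t0.15\t0.05\n' >> toy.beagle
gzip -f toy.beagle

# Individuals: (columns - 3) / 3, read from the header line
n_ind=$(gzip -dc toy.beagle.gz | head -n 1 | awk '{ print (NF - 3) / 3 }')
# Sites: every line after the header
n_sites=$(gzip -dc toy.beagle.gz | tail -n +2 | wc -l | tr -d ' ')

echo "individuals: $n_ind, sites: $n_sites"
```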
BAM files
If you have BAM files, you can use ANGSD to produce genotype likelihoods in .beagle format. Please see "Creation of Beagle files with ANGSD".
VCF files
If you already have a VCF file that contains genotype likelihood information, you can convert it to a .beagle file with vcftools [2]:
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
The chromosome has to be specified.
You can also use the 'query' command of bcftools [3] to generate a .beagle file from a .vcf file.
Output Files
The analysis performed by NGSadmix produces the following files:
- Log likelihood of the estimates: a .log file that summarizes the run. It records the command line used to run the program, the likelihood every 50 iterations, and the total run time.
- Estimated allele frequencies: a gzipped .fopt file that contains an estimate of the allele frequency in each of the assumed ancestral populations (3 in the example below). There is one line per locus.
- Estimated admixture proportions: a .qopt file that contains an estimate of each individual's ancestry proportion (admixture) from each of the assumed ancestral populations (3 in the example below). There is one line per individual.
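Because the .qopt file has one line per individual and one column per ancestral population, and admixture proportions sum to 1, a short check can catch truncated or mismatched files before they are passed on to later steps. The sketch below uses a made-up file toy.qopt with K = 3; the tolerance of 0.01 on the row sum is an arbitrary assumption.

```shell
# Toy admixture proportions (made up): 3 individuals, K = 3
cat > toy.qopt <<'EOF'
0.98 0.01 0.01
0.40 0.35 0.25
0.00 0.10 0.90
EOF

K=3
# Count rows that do not have K columns or whose proportions do not sum to ~1
bad=$(awk -v k="$K" '
  NF != k { bad++; next }
  {
    s = 0
    for (i = 1; i <= NF; i++) s += $i
    if (s < 0.99 || s > 1.01) bad++
  }
  END { print bad + 0 }' toy.qopt)

# One line per individual
n_ind=$(wc -l < toy.qopt | tr -d ' ')

echo "$n_ind individuals, $bad malformed rows"
```

The number of individuals reported here should match the number of individuals in the beagle file used as input.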
Run command example
Download the input file
wget popgen.dk/software/download/NGSadmix/data/input.gz
Execute NGSadmix
./NGSadmix -likes input.gz -K 3 -P 4 -o myoutfiles -minMaf 0.05
- Input file = input.gz (-likes input.gz)
- Ancestral populations K = 3 (-K 3)
- Computer cores = 4 (-P 4).
- Output prefix = myoutfiles (-o myoutfiles)
- SNPs with MAF > 5% (-minMaf 0.05)