PCAngsd

Latest revision as of 12:26, 24 October 2023


PCAngsd is a program that estimates the covariance matrix and individual allele frequencies from low-depth next-generation sequencing (NGS) data in structured/heterogeneous populations. It works on genotype likelihoods and uses principal component analysis (PCA) as the basis for multiple population genetic analyses. Since version 0.98, PCAngsd has used Cython for its computational bottlenecks and parallelization.

The main method was published in 2018 and can be found here: https://www.genetics.org/content/210/2/719

The HWE test was published in 2019 and can be found here: https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019

[Figure: Simulated low-depth NGS data of 3 populations]


Overview

PCAngsd is a framework for analyzing low-depth next-generation sequencing (NGS) data in heterogeneous/structured populations using principal component analysis (PCA). Population structure is inferred by estimating individual allele frequencies in an iterative approach using a truncated SVD model. The covariance matrix is then estimated using the individual allele frequencies as prior information for the unobserved genotypes in low-depth NGS data.

The estimated individual allele frequencies can further be used to account for population structure in other probabilistic methods. PCAngsd can perform the following analyses:

  • Covariance matrix
  • Admixture estimations
  • Inbreeding coefficients (both per-individual and per-site)
  • HWE test
  • Genome-wide selection scan
  • Genotype calling
  • Estimate NJ tree of samples

Older versions of PCAngsd can be found here: https://github.com/Rosemeis/pcangsd/releases/

Download and Installation

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended. Installation has only been tested on Linux systems.

Get PCAngsd and build:

git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd/
python setup.py build_ext --inplace

Install dependencies:

The required Python packages can be installed with pip using the 'requirements.txt' file included in the 'pcangsd' folder.

pip install --user -r requirements.txt

Quick start

PCAngsd is used by running the main caller file pcangsd.py. To see all available options use the following command:

python pcangsd.py -h

# Genotype likelihoods using 64 threads
python pcangsd.py -beagle input.beagle.gz -out output -threads 64

# PLINK files (using file-prefix, *.bed, *.bim, *.fam)
python pcangsd.py -plink input.plink -out output -threads 64

PCAngsd accepts either genotype likelihoods in Beagle format or PLINK genotype files. Beagle files can be generated from BAM files using ANGSD (http://popgen.dk/angsd). For inference of population structure in genotype data with non-random missingness, we recommend our EMU software (http://www.popgen.dk/software/index.php/EMU), which performs accelerated EM-PCA but has fewer functionalities than PCAngsd (#soon).

PCAngsd writes most of its output in binary Numpy format (.npy), with a few exceptions such as the covariance matrix, which is plain text. To read the files in Python:

import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix (text)
D = np.load("output.selection.npy") # Reads PC based selection statistics

R can also read Numpy matrices using the "RcppCNPy" R library:

library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
D <- npyLoad("output.selection.npy") # Reads PC based selection statistics
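
Once the covariance matrix has been read, the principal components are obtained from its eigendecomposition. The following is a minimal sketch using NumPy. The function name `principal_components` and the toy matrix are illustrative assumptions and are not part of PCAngsd; with real data the toy matrix would be replaced by `np.genfromtxt("output.cov")`.

```python
import numpy as np


def principal_components(cov, n_components=2):
    """Return the top eigenvectors of a symmetric covariance matrix.

    Hypothetical helper for illustration; it is not part of PCAngsd.
    """
    cov = np.asarray(cov, dtype=float)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so the largest eigenvalues come first, then keep the top ones
    order = np.argsort(eigvals)[::-1][:n_components]
    # Share of the total eigenvalue sum carried by each kept component
    explained = eigvals[order] / eigvals.sum()
    return eigvecs[:, order], explained


# Toy 3x3 covariance matrix standing in for np.genfromtxt("output.cov")
toy_cov = np.array([[2.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.5]])
pcs, explained = principal_components(toy_cov, n_components=2)
print(pcs.shape)   # one row per individual, one column per component
print(explained)
```

Plotting the first column of `pcs` against the second gives the usual PCA scatter plot of the samples.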

An example of generating genotype likelihoods with ANGSD and writing them in the required Beagle text format:

./angsd -GL 2 -out input -nThreads 4 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
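
For orientation, the Beagle genotype likelihood file written by ANGSD with `-doGlf 2` is a text file with a header line followed by one line per site: a marker ID, the two alleles, and three likelihoods per individual (major/major, major/minor, minor/minor). The sketch below parses one such line. `parse_beagle_line` and the example line are illustrative assumptions, not PCAngsd code.

```python
def parse_beagle_line(line):
    """Split one data line of a Beagle genotype likelihood file.

    Assumes three leading columns (marker, allele1, allele2) followed by
    three likelihoods per individual. Hypothetical helper for illustration.
    """
    fields = line.split()
    marker, allele1, allele2 = fields[:3]
    values = [float(x) for x in fields[3:]]
    if len(values) % 3 != 0:
        raise ValueError("expected three likelihoods per individual")
    # Group the flat list into one (GL0, GL1, GL2) triplet per individual
    triplets = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return marker, allele1, allele2, triplets


# Made-up line with two individuals
example = "chr1_1000\t0\t2\t0.9\t0.1\t0.0\t0.333\t0.333\t0.334"
marker, a1, a2, gl = parse_beagle_line(example)
print(marker, len(gl), gl[0])
```

A real file can be read line by line with `gzip.open("input.beagle.gz", "rt")`, skipping the header line before calling the function.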

Tutorial

Please refer to the tutorial page: http://www.popgen.dk/software/index.php/PCAngsdTutorial

Options

# See all options in PCAngsd
python pcangsd.py -h

General usage

-beagle [Beagle file]

Input file of genotype likelihoods in Beagle format (.beagle.gz).

-filter [Text file]

Input file of 1's and 0's indicating whether to keep each individual or not.

-plink [Prefix for binary PLINK files]

Path to PLINK files using ONLY their prefix (.bed, .bim, .fam).

-plink_error [float]

Incorporate errors into genotypes by specifying the error rate as argument.

-minMaf [float]

Minimum minor allele frequency threshold. (Default: 0.05)

-maf_iter [int]

Maximum number of EM iterations for computing the population allele frequencies (Default: 200).

-maf_tole [float]

Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).

-iter [int]

Maximum number of iterations for estimation of individual allele frequencies (Default: 100).

-tole [float]

Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).

-hwe [.lrt.npy file]

Binary LRT file from a previous PCAngsd run, used to filter sites based on HWE.

-hwe_tole [float]

Threshold for HWE filtering of sites.

-e [int]

Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).

-pi [.pi.npy file]

Load previous estimation of individual allele frequencies to skip covariance estimation.

-maf_save

Choose to save estimated population allele frequencies (Binary). Numpy format (.npy).

-pi_save

Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy). Can be used with the '-pi' command.

-dosage_save

Choose to save estimated genotype dosages (Binary). Numpy format (.npy).

-post_save

Choose to save the posterior genotype probabilities. Beagle format (.beagle).

-sites_save

Choose to save which sites were kept after filtering, which is useful for downstream analysis. Outputs a file of 1's and 0's for keeping a site or not, respectively.

-threads [int]

Specify the number of thread(s) to use (Default: 1).

-out [output prefix]

File prefix for all output files created by PCAngsd (Default: "pcangsd").
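
The file of 1's and 0's written by '-sites_save' can be used to recover which markers survived filtering. A minimal sketch, assuming one value per site in the same order as the sites in the input. The function name and the toy data are illustrative, and the actual output file name should be taken from the PCAngsd run.

```python
def kept_markers(marker_ids, keep_flags):
    """Return the marker IDs whose flag is 1.

    marker_ids: marker IDs in input order (e.g. first column of the Beagle file).
    keep_flags: iterable of "1"/"0" strings or ints, one per site (assumed layout).
    """
    flags = [int(f) for f in keep_flags]
    if len(flags) != len(marker_ids):
        raise ValueError("number of flags must match number of sites")
    return [m for m, keep in zip(marker_ids, flags) if keep == 1]


# Toy data standing in for the input markers and the saved sites file
markers = ["chr1_100", "chr1_250", "chr1_900", "chr2_40"]
flags = ["1", "0", "1", "1"]
kept = kept_markers(markers, flags)
print(kept)
```

This kind of bookkeeping is what is needed to match per-site results back to marker IDs.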

Selection

Perform PC-based genome-wide selection scans using posterior expectations of the genotypes (genotype dosages):

-selection

Uses an extended model of FastPCA (http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3). Performs a genome-wide selection scan along all significant PCs. Outputs the selection statistics, which must be converted to p-values by the user. Each column holds the selection statistics along one tested PC, and they are χ²-distributed with 1 degree of freedom.

-pcadapt

Uses an extended model of pcadapt (https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12592). Performs a genome-wide selection scan across all significant PCs. Outputs z-scores, which must be converted to test statistics with the provided script 'pcangsd/scripts/pcadapt.R'. The test statistics are χ²-distributed with K degrees of freedom.

-snp_weights

Output the SNP weights of the significant K eigenvectors.
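
Since the '-selection' statistics are χ²-distributed with 1 degree of freedom, the conversion to p-values is the upper tail of that distribution. A minimal sketch using only the Python standard library: for 1 degree of freedom the upper tail equals erfc(sqrt(x/2)). The function name and the toy statistics are illustrative assumptions.

```python
import math


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    if stat < 0:
        raise ValueError("chi-square statistics cannot be negative")
    # For 1 df: P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(stat / 2.0))


# Toy statistics: one row per site, one column per tested PC
stats = [[0.0, 3.841458820694124],
         [10.5, 1.2]]
pvalues = [[chi2_1df_pvalue(s) for s in row] for row in stats]
print(pvalues)
```

With NumPy and SciPy available, `scipy.stats.chi2.sf(D, 1)` applied to the array loaded from "output.selection.npy" performs the same conversion in one call.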

Inbreeding

-inbreedSites

Estimate per-site inbreeding coefficients accounting for population structure and perform a likelihood ratio test for detecting sites deviating from HWE (https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019).

-inbreedSamples

Estimate per-individual inbreeding coefficients accounting for population structure, based on an extension of ngsF (http://genome.cshlp.org/content/23/11/1852.full) for structured populations.

-inbreed_iter [int]

Maximum number of iterations for inbreeding EM algorithm. (Default: 200)

-inbreed_tole [float]

Tolerance value for the inbreeding EM algorithm. (Default: 1e-4)

Call genotypes

Genotypes can be called from posterior genotype probabilities by incorporating the individual allele frequencies as prior information.

-geno [float]

Call genotypes with defined threshold.

-genoInbreed [float]

Call genotypes with defined threshold also taking inbreeding into account. '-inbreedSamples' must also be specified to use this option.
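
The idea of calling genotypes from posterior probabilities can be sketched as follows. This is an illustrative reading of the description above, not PCAngsd's implementation. It assumes that the prior for a genotype with g copies of an allele is binomial in the individual allele frequency, that the posterior is the normalized product of prior and genotype likelihood, and that a call is made only when the largest posterior reaches the threshold. The function name and the use of -1 for a missing call are assumptions.

```python
def call_genotype(likelihoods, allele_freq, threshold):
    """Call a genotype (0, 1 or 2), or -1 if no posterior reaches the threshold.

    likelihoods: genotype likelihoods for 0, 1 and 2 copies of the allele.
    allele_freq: individual allele frequency of that allele at this site.
    """
    p = allele_freq
    # Binomial prior for 0, 1 and 2 copies given the individual allele frequency
    prior = [(1.0 - p) ** 2, 2.0 * p * (1.0 - p), p ** 2]
    joint = [lik * pr for lik, pr in zip(likelihoods, prior)]
    total = sum(joint)
    if total == 0.0:
        return -1, [0.0, 0.0, 0.0]
    posterior = [j / total for j in joint]
    # Most probable genotype; only reported if it is confident enough
    best = max(range(3), key=lambda g: posterior[g])
    called = best if posterior[best] >= threshold else -1
    return called, posterior


# Toy example: likelihoods favour the heterozygote, allele frequency 0.5
call_strict, post = call_genotype([0.1, 0.8, 0.1], 0.5, threshold=0.9)
call_loose, _ = call_genotype([0.1, 0.8, 0.1], 0.5, threshold=0.8)
print(call_strict, call_loose, post)
```

In the toy example the heterozygote posterior is about 0.89, so the genotype is called at a threshold of 0.8 but left missing at 0.9.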

Admixture

Individual admixture proportions and ancestral allele frequencies can be estimated assuming K ancestral populations using an accelerated mini-batch NMF method.

-admix

Toggles admixture estimations. Estimates admixture proportions and ancestral allele frequencies.

-admix_K [int]

Not recommended. Override the number of ancestry components (K) to use, instead of using K=e-1.

-admix_iter [int]

Maximum number of iterations for admixture estimations using NMF. (Default: 200)

-admix_tole [float]

Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)

-admix_alpha [float]

Specify alpha (sparseness regularization parameter). (Default: 0)

-admix_auto [float]

Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.

-admix_seed [int]

Specify seed for random initializations of factor matrices in admixture estimations.

Tree

-tree

Construct a neighbour-joining tree of samples from the covariance matrix estimated based on individual allele frequencies.

-tree_samples

Provide a list of sample names of all individuals to construct a beautiful tree.

Citation

Our methods for inferring population structure have been published in GENETICS:

Inferring Population Structure and Admixture Proportions in Low Depth NGS Data


Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:

Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data