PCAngsd: Difference between revisions
Revision as of 14:20, 12 January 2018
This page contains information about the program PCAngsd, which estimates the covariance matrix for low depth NGS data in an iterative procedure based on genotype likelihoods. Based on population structure inference, PCAngsd is able to estimate individual allele frequencies. These individual allele frequencies can be used in various population genetic methods for heterogeneous populations, such that PCAngsd can perform PCA (estimate covariance matrix), call genotypes, estimate individual admixture proportions, estimate inbreeding coefficients (per-individual and per-site) and perform a genome selection scan using principal components. The entire program is written in Python 2.7 and is multithreaded to take advantage of several CPUs.
Download
The program can be downloaded from Github: https://github.com/Rosemeis/pcangsd
Latest release of PCAngsd: 0.8
git clone https://github.com/Rosemeis/pcangsd.git; cd pcangsd/
The following Python packages are needed to run PCAngsd: numpy, scipy, pandas, sklearn and numba.
The packages and their dependencies can easily be installed using the following command inside the pcangsd folder:
pip install --user -r python_packages.txt
PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.
Quick start
# See all options in PCAngsd
python pcangsd.py -h

# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle test.beagle.gz -n 100 -o test -threads 10

# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle test.beagle.gz -n 100 -admix -o test -threads 10

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle test.beagle.gz -n 100 -inbreed 1 -o test -threads 10

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle test.beagle.gz -n 100 -selection 1 -o test -threads 10
Input
The only input PCAngsd needs and accepts is genotype likelihoods in Beagle format. ANGSD can easily be used to compute genotype likelihoods and output them in the required Beagle format.
./angsd -GL 1 -out genoLikes -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
See ANGSD for more information on how to compute the genotype likelihoods and call SNPs.
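To make the input format concrete, the following is a minimal Python sketch (not part of PCAngsd) that reads a gzipped Beagle genotype likelihood file. It assumes the usual layout written by ANGSD with -doGlf 2: one header line, then one line per site holding the marker ID, the two alleles and three likelihoods per individual. The function name read_beagle is hypothetical. The number of individuals it finds is the value that would be passed to -n.

```python
import gzip


def read_beagle(path):
    """Read a gzipped Beagle genotype likelihood file.

    Returns (markers, likes), where likes[s][i] is the likelihood triple
    for individual i at site s. Assumes three columns per individual.
    """
    markers, likes = [], []
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            # Columns 0-2 are marker, allele1 and allele2.
            values = [float(x) for x in fields[3:]]
            if len(values) % 3 != 0:
                raise ValueError("expected three likelihoods per individual")
            markers.append(fields[0])
            # Group the flat list into one triple per individual.
            likes.append([tuple(values[i:i + 3])
                          for i in range(0, len(values), 3)])
    return markers, likes
```

With the result in hand, len(likes[0]) gives the number of individuals and len(markers) the number of sites.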
Using PCAngsd
All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.
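Since the covariance matrix is always computed, principal components can be obtained from it by an eigendecomposition. The sketch below uses NumPy and assumes the estimated matrix has already been loaded into a symmetric array; the output file name and format are not covered here, and the function name principal_components is hypothetical.

```python
import numpy as np


def principal_components(cov, n_components=2):
    """Return the top eigenvectors and eigenvalues of a symmetric matrix."""
    cov = np.asarray(cov, dtype=float)
    # eigh is meant for symmetric matrices and returns ascending eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Reorder so that the largest eigenvalues come first.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return eigenvectors[:, order], eigenvalues[order]
```

Each column of the returned matrix holds one principal component, with one entry per individual, so plotting the first column against the second gives the usual PCA plot.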
Estimation of individual allele frequencies
- -beagle [Beagle filename] Required
Path to file of the genotype likelihoods in Beagle format.
- -n [int] Required
Specify the number of individuals in dataset.
- -threads [int]
Specify the number of thread(s) to use. (Default: 1)
- -iter [int]
Maximum number of iterations for estimation of individual allele frequencies. (Default: 100)
- -tole [float]
Tolerance value for update in estimation of individual allele frequencies. (Default: 5e-5)
- -maf [int]
Maximum number of EM iterations for computing the population allele frequencies. (Default: 200)
- -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation. (Default: 5e-5)
- -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies. (Default: Automatically tested using MAP test)
- -o [prefix]
Set the prefix for all output files created by PCAngsd (Default: "pcangsd").
- -freq_save
Choose to save estimated allele frequencies (both individual and population).
- -sites_save
Choose to save the marker IDs that remain after filtering on population allele frequencies. Especially useful for selection scans and per-site inbreeding coefficients.
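The -maf and -maf_tole options control an EM algorithm for the population allele frequencies. As an illustration, here is a minimal NumPy sketch of the standard EM estimator under Hardy-Weinberg equilibrium. It is not PCAngsd's own code: the function name, the starting value of 0.25 and the convergence check (root mean squared change between updates) are assumptions, and the defaults simply mirror the documented values of 200 and 5e-5.

```python
import numpy as np


def em_allele_frequencies(likes, max_iter=200, tole=5e-5):
    """Estimate per-site allele frequencies from genotype likelihoods.

    likes has shape (sites, individuals, 3), ordered by the number of
    copies (0, 1, 2) of the allele whose frequency is estimated.
    """
    likes = np.asarray(likes, dtype=float)
    n_sites, n_ind, _ = likes.shape
    freq = np.full(n_sites, 0.25)  # assumed starting value
    copies = np.array([0.0, 1.0, 2.0])
    for _ in range(max_iter):
        # E-step: posterior genotype probabilities with an HWE prior.
        prior = np.stack([(1 - freq) ** 2,
                          2 * freq * (1 - freq),
                          freq ** 2], axis=1)
        post = likes * prior[:, None, :]
        post /= post.sum(axis=2, keepdims=True)
        # M-step: expected allele count divided by the 2N chromosomes.
        new_freq = (post @ copies).sum(axis=1) / (2 * n_ind)
        change = np.sqrt(np.mean((new_freq - freq) ** 2))
        freq = new_freq
        if change < tole:
            break
    return freq
```

When the likelihoods are uninformative the estimate stays near its starting value, which is one reason low-depth data benefit from many sites and individuals.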
Call genotypes
Genotypes can be called from posterior genotype probabilities incorporating the individual allele frequencies as prior information.
- -geno [float]
Call genotypes with defined threshold.
- -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. -inbreed [int] is required, since individual inbreeding coefficients must have been estimated prior to calling genotypes using that information.
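The idea behind -geno can be sketched as follows: the genotype likelihoods are combined with a Hardy-Weinberg prior built from the individual allele frequencies, and the most probable genotype is kept only if its posterior probability reaches the threshold. This is an illustrative NumPy sketch rather than PCAngsd's implementation; the function name and the missing-value code -9 are assumptions.

```python
import numpy as np


def call_genotypes(likes, indiv_freq, threshold, missing=-9):
    """Call genotypes from posterior genotype probabilities.

    likes:      (sites, individuals, 3) genotype likelihoods
    indiv_freq: (sites, individuals) individual allele frequencies
    Returns an integer array of calls (0, 1, 2) or `missing`.
    """
    likes = np.asarray(likes, dtype=float)
    pi = np.asarray(indiv_freq, dtype=float)
    # HWE prior per individual and site from its own allele frequency.
    prior = np.stack([(1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2], axis=2)
    post = likes * prior
    post /= post.sum(axis=2, keepdims=True)
    calls = post.argmax(axis=2)
    # Mask calls whose best posterior falls below the threshold.
    calls[post.max(axis=2) < threshold] = missing
    return calls
```

A higher threshold yields fewer but more reliable calls, so the value is a trade-off between missingness and accuracy.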
Admixture
Individual admixture proportions and population-specific allele frequencies can be estimated, assuming K ancestral populations, using an accelerated mini-batch NMF method.
- -admix
Toggles admixture estimations.
- -admix_alpha [int-list]
Specify alpha (sparseness regularization parameter). Can be specified as a sequence to try several alphas in a single run. Fully compatible with -admix_seed and -admix_K. (Default: 0)
- -admix_seed [int-list]
Specify seed for random initializations of factor matrices in admixture estimations. Can be specified as a sequence to try several different seeds in a single run. Fully compatible with -admix_alpha and -admix_K.
- -admix_K [int-list]
Not recommended. Specify number of ancestral populations to use in admixture estimations. Can be specified as a sequence to try several K's in a single run. Fully compatible with -admix_alpha and -admix_seed.
- -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 100)
- -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
- -admix_batch [int]
Specify the number of mini-batches to use in NMF method. (Default: 20)
- -admix_save
Choose to save the population-specific allele frequencies.
Inbreeding
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods. However, -inbreed 2 is recommended for low depth cases.
- -inbreed 1
A maximum likelihood estimator computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [1].
- -inbreed 2
Simple estimator also computed by an EM algorithm. Based on ngsF.
- -inbreed 3
(Not recommended for low depth NGS data!) Estimator using the kinship matrix. Based on PC-Relate.
- -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)
- -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 5e-5)
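To illustrate how a per-individual inbreeding coefficient can account for population structure, the sketch below iterates a simple heterozygosity-based estimator: F is one minus the ratio of the posterior expected number of heterozygous sites to the number expected from the individual allele frequencies. This is one common formulation written for illustration; whether it matches any of the -inbreed methods exactly is not confirmed here. The function name, the clipping bounds and the convergence check are assumptions, and the sites are assumed to be polymorphic.

```python
import numpy as np


def simple_inbreeding(likes, indiv_freq, max_iter=200, tole=5e-5):
    """Iteratively estimate one inbreeding coefficient per individual.

    likes:      (sites, individuals, 3) genotype likelihoods
    indiv_freq: (sites, individuals) individual allele frequencies
    """
    likes = np.asarray(likes, dtype=float)
    pi = np.asarray(indiv_freq, dtype=float)
    het_exp = 2 * pi * (1 - pi)  # expected heterozygosity without inbreeding
    F = np.zeros(likes.shape[1])
    for _ in range(max_iter):
        # Genotype prior where inbreeding shifts mass from het to hom.
        prior = np.stack([(1 - pi) ** 2 + pi * (1 - pi) * F,
                          het_exp * (1 - F),
                          pi ** 2 + pi * (1 - pi) * F], axis=2)
        prior = np.clip(prior, 1e-12, None)  # keep the prior positive
        post = likes * prior
        post /= post.sum(axis=2, keepdims=True)
        # Posterior expected heterozygotes versus expected heterozygotes.
        new_F = 1 - post[:, :, 1].sum(axis=0) / het_exp.sum(axis=0)
        new_F = np.clip(new_F, -1.0, 1.0)
        change = np.sqrt(np.mean((new_F - F) ** 2))
        F = new_F
        if change < tole:
            break
    return F
```

Because the expected heterozygosity comes from individual rather than population allele frequencies, a deficit of heterozygotes caused by population structure alone is not counted as inbreeding.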
Per-site inbreeding coefficients incorporating population structure, alongside likelihood ratio tests for HWE, can be computed as follows:
- -inbreedSites
Selection
A genome selection scan can be computed using two different methods based on posterior expectations of the genotypes (genotype dosages):
- -selection 1
Using an extended model of FastPCA. Performs a genome selection scan along all significant PCs.
- -selection 2
Using an extended model of PCAdapt.
Relatedness
Work in progress...
Estimate the kinship matrix using a method based on PC-Relate:
- -kinship
Automatically estimated if -inbreed 3 has been selected.