PCAngsd: Difference between revisions
Revision as of 10:24, 10 August 2017
This page contains information about the program PCAngsd, which estimates the covariance matrix for low-depth NGS data in an iterative procedure based on genotype likelihoods. Based on the inferred population structure, PCAngsd can estimate individual allele frequencies. By incorporating these allele frequencies in Empirical Bayes approaches, PCAngsd can perform PCA (estimate the covariance matrix), call genotypes, estimate inbreeding coefficients (per-individual and per-site) and perform a genome-wide selection scan using principal components in structured populations. The entire program is written in Python 2.7.
Download
The program can be downloaded from GitHub: https://github.com/Rosemeis/pcangsd
git clone https://github.com/Rosemeis/pcangsd.git; cd pcangsd/
The following Python packages are needed to run PCAngsd (found in all popular distributions): numpy and pandas.
PCAngsd should work on all platforms meeting the requirements, but running it on a server is recommended.
Quick start
# See all options in PCAngsd
python pcangsd.py -h
# Estimate covariance matrix
python pcangsd.py -beagle test.beagle.gz -o test
# Estimate inbreeding coefficients
python pcangsd.py -beagle test.beagle.gz -inbreed 2 -o test
# Perform selection scan
python pcangsd.py -beagle test.beagle.gz -selection 1 -o test
Input
The only input PCAngsd needs and accepts is genotype likelihoods in BEAGLE format. ANGSD can easily be used to compute the genotype likelihoods and output them in the required BEAGLE format.
./angsd -GL 1 -out genoLikes -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
See ANGSD for more info on how to compute the genotype likelihoods and call SNPs.
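To make the input format concrete, here is a minimal sketch of reading BEAGLE genotype likelihoods into a NumPy array. It assumes the layout ANGSD writes with -doGlf 2: a header line, then one line per site with a marker ID, two allele columns, and three likelihood columns per individual. The function name read_beagle and the two example sites are hypothetical and are not part of PCAngsd.

```python
import io
import numpy as np

def read_beagle(handle):
    """Parse BEAGLE genotype likelihoods from an open text handle.

    Returns the marker IDs and an array of shape (sites, individuals, 3).
    """
    header = handle.readline().split()
    n_ind = (len(header) - 3) // 3  # three likelihood columns per individual
    markers, rows = [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        markers.append(fields[0])
        rows.append([float(x) for x in fields[3:]])
    likes = np.array(rows, dtype=float).reshape(len(rows), n_ind, 3)
    return markers, likes

# Hypothetical two-site, two-individual example (made-up values).
example = (
    "marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n"
    "chr1_100\t0\t2\t0.90\t0.09\t0.01\t0.33\t0.33\t0.34\n"
    "chr1_250\t1\t3\t0.05\t0.25\t0.70\t0.98\t0.02\t0.00\n"
)
markers, likes = read_beagle(io.StringIO(example))
```

For the file produced by the ANGSD command above, the handle would typically come from gzip.open("genoLikes.beagle.gz", "rt") under Python 3.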
Using PCAngsd
All the different options in PCAngsd are listed here.
Covariance matrix
PCAngsd will compute the covariance matrix in all available analyses. It uses the principal components to model the individual allele frequencies such that they can be used to estimate another more accurate covariance matrix. This procedure is iterated until convergence for the individual allele frequencies.
- -beagle [BEAGLE file path]
Path to the genotype likelihoods in BEAGLE format.
- -M [int]
Maximum number of iterations for covariance estimation. Only needed in rare cases. (Default: 100)
- -M_tole [float]
Tolerance value for the iterative covariance matrix estimation. (Default: 1e-4)
- -EM [int]
Maximum number of EM iterations for computing the population allele frequencies. (Default: 200)
- -EM_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation. (Default: 1e-4)
- -e [int]
Manually select the number of eigenvalues to use in modelling of individual allele frequencies. (Default: Automatically selected)
- -reg
Toggle to use Tikhonov regularization in modelling of individual allele frequencies to penalize less important PCs. May also help convergence.
- -o [filename]
Set the prefix for all output files created by PCAngsd.
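The iterative procedure described in this section can be illustrated with a simplified sketch. It is not PCAngsd's implementation: it assumes a Hardy-Weinberg genotype prior, a plain truncated SVD for the individual allele frequencies, and no regularization or other corrections. All function names and the simulated data are hypothetical; the default iteration limits and tolerances mirror the -EM, -EM_tole, -M and -M_tole options above.

```python
import numpy as np

def genotype_posterior(likes, f):
    """Posterior genotype probabilities under a Hardy-Weinberg prior.

    likes: (sites, individuals, 3); f: (sites, 1) or (sites, individuals).
    """
    prior = np.stack([(1 - f) ** 2, 2 * f * (1 - f), f ** 2], axis=2)
    post = likes * prior
    return post / post.sum(axis=2, keepdims=True)

def em_allele_freq(likes, max_iter=200, tole=1e-4):
    """EM estimate of population allele frequencies, shape (sites, 1)."""
    f = np.full((likes.shape[0], 1), 0.25)
    for _ in range(max_iter):
        post = genotype_posterior(likes, f)
        f_new = (post[:, :, 1] + 2 * post[:, :, 2]).mean(axis=1, keepdims=True) / 2
        f_new = np.clip(f_new, 1e-4, 1 - 1e-4)
        done = np.sqrt(np.mean((f_new - f) ** 2)) < tole
        f = f_new
        if done:
            break
    return f

def iterative_covariance(likes, n_pcs, max_iter=100, tole=1e-4):
    """Alternate between expected genotypes and low-rank allele frequencies."""
    m, n, _ = likes.shape
    f = em_allele_freq(likes)
    pi = np.repeat(f, n, axis=1)  # start from population frequencies
    for _ in range(max_iter):
        post = genotype_posterior(likes, pi)
        centered = post[:, :, 1] + 2 * post[:, :, 2] - 2 * f
        u, s, vt = np.linalg.svd(centered, full_matrices=False)
        low_rank = np.dot(u[:, :n_pcs] * s[:n_pcs], vt[:n_pcs])
        pi_new = np.clip((low_rank + 2 * f) / 2, 1e-4, 1 - 1e-4)
        done = np.sqrt(np.mean((pi_new - pi) ** 2)) < tole
        pi = pi_new
        if done:
            break
    post = genotype_posterior(likes, pi)
    x = (post[:, :, 1] + 2 * post[:, :, 2] - 2 * f) / np.sqrt(2 * f * (1 - f))
    return np.dot(x.T, x) / m, pi

# Simulated data: 1000 sites, two populations of 5 individuals each.
rng = np.random.RandomState(0)
m, n = 1000, 10
fa, fb = rng.uniform(0.1, 0.9, m), rng.uniform(0.1, 0.9, m)
p = np.concatenate([np.tile(fa[:, None], (1, 5)), np.tile(fb[:, None], (1, 5))], axis=1)
geno = rng.binomial(2, p)
likes = np.full((m, n, 3), 0.1)  # 0.8 on the true genotype, 0.1 elsewhere
likes[np.arange(m)[:, None], np.arange(n)[None, :], geno] = 0.8
cov, pi = iterative_covariance(likes, n_pcs=1)
```

The eigenvectors of cov (for example from np.linalg.eigh) would then serve as the principal components for plotting.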
Call genotypes
Genotypes can be called using the individual allele frequencies as a prior.
- -callGeno
Toggle to call genotypes.
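As an illustration of calling genotypes with an allele-frequency prior, the sketch below multiplies each genotype likelihood by a Hardy-Weinberg prior built from the individual allele frequency and picks the genotype with the highest posterior. The function name call_genotypes and the example values are hypothetical, and the exact prior PCAngsd uses may differ.

```python
import numpy as np

def call_genotypes(likes, pi):
    """Call the most probable genotype per site and individual.

    likes: (sites, individuals, 3) genotype likelihoods.
    pi:    (sites, individuals) individual allele frequencies.
    Returns the called genotypes (0, 1 or 2) and their posterior probabilities.
    """
    prior = np.stack([(1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2], axis=2)
    post = likes * prior
    post /= post.sum(axis=2, keepdims=True)
    return post.argmax(axis=2), post.max(axis=2)

# One site, two individuals (made-up values).
likes = np.array([[[0.90, 0.09, 0.01], [0.33, 0.33, 0.34]]])
pi = np.array([[0.1, 0.9]])
geno, prob = call_genotypes(likes, pi)
```

The second individual has nearly flat likelihoods, so its call is driven almost entirely by the prior, which is the situation low-depth data often produces.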
Inbreeding
Per-individual inbreeding coefficients can be computed using three different methods:
- -inbreed 1
A maximum likelihood estimator computed by an EM algorithm. Only allows F-values between 0 and 1.
- -inbreed 2
Simple estimator also computed by an EM algorithm described in [1].
- -inbreed 3
Moment estimator for the allele frequencies based on the model in PC-Relate. Sensitive to low-depth data.
- -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)
- -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)
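To show how an EM algorithm can estimate per-individual inbreeding coefficients from genotype likelihoods, here is a generic sketch. It assumes a standard mixture model in which a site is identical by descent with probability F (so only homozygous genotypes are possible) and follows Hardy-Weinberg proportions otherwise. This is not necessarily the estimator behind -inbreed 1 or -inbreed 2; the function name inbreeding_em and the simulated data are hypothetical.

```python
import numpy as np

def inbreeding_em(likes, pi, max_iter=200, tole=1e-4):
    """EM estimate of one inbreeding coefficient F per individual.

    likes: (sites, individuals, 3) genotype likelihoods.
    pi:    (sites, individuals) individual allele frequencies.
    """
    # Likelihood of the data at each site under the two mixture components.
    hwe = (likes[:, :, 0] * (1 - pi) ** 2
           + likes[:, :, 1] * 2 * pi * (1 - pi)
           + likes[:, :, 2] * pi ** 2)
    ibd = likes[:, :, 0] * (1 - pi) + likes[:, :, 2] * pi
    F = np.full(likes.shape[1], 0.25)
    for _ in range(max_iter):
        # E-step: posterior probability that each site is identical by descent.
        w = F * ibd / (F * ibd + (1 - F) * hwe)
        # M-step: F is the average of those probabilities, so it stays in [0, 1].
        F_new = w.mean(axis=0)
        done = np.max(np.abs(F_new - F)) < tole
        F = F_new
        if done:
            break
    return F

# Simulated data: individual 0 is outbred, individual 1 is fully inbred.
rng = np.random.RandomState(0)
m = 500
f = rng.uniform(0.2, 0.8, m)
geno = np.stack([rng.binomial(2, f), 2 * rng.binomial(1, f)], axis=1)
likes = np.full((m, 2, 3), 0.01)  # 0.98 on the true genotype, 0.01 elsewhere
likes[np.arange(m)[:, None], np.arange(2)[None, :], geno] = 0.98
pi = np.repeat(f[:, None], 2, axis=1)
F = inbreeding_em(likes, pi)
```

Because each update averages probabilities, the estimate cannot leave the interval from 0 to 1, which matches the restriction noted for the maximum likelihood method.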
Selection
A genome-wide selection scan can be computed using two different methods:
- -selection 1
Using the model described in FastPCA. Produces a genome-wide selection scan for all significant PCs.
- -selection 2
Using the model described in PCAdapt.
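A PC-based selection scan can be sketched as follows. It assumes that the per-site statistic is the squared projection of the standardized expected genotypes onto each unit-length principal component, in the spirit of the FastPCA approach, and that it would be compared against a chi-square distribution with one degree of freedom. The function name selection_scan and the random input are hypothetical, and PCAngsd's own computation may differ.

```python
import numpy as np

def selection_scan(e, f, n_pcs):
    """Per-site statistics for the top principal components.

    e: (sites, individuals) expected genotypes.
    f: (sites,) population allele frequencies.
    Returns an array of shape (sites, n_pcs).
    """
    x = (e - 2 * f[:, None]) / np.sqrt(2 * f * (1 - f))[:, None]
    # Rows of vt are unit-length vectors over individuals (the PCs).
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return np.dot(x, vt[:n_pcs].T) ** 2

# Random example input: 300 sites, 8 individuals, no real structure.
rng = np.random.RandomState(1)
f = rng.uniform(0.1, 0.9, 300)
e = rng.binomial(2, np.tile(f[:, None], (1, 8))).astype(float)
stats = selection_scan(e, f, n_pcs=2)
```

Sites with unusually large values along a given PC would be the candidates for selection along that axis of population structure.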
LD can also be taken into account when performing selection scans. LD regression has been implemented in PCAngsd but the functionality is not fully tested.
- -LD [int]
Select the window (in bases) of preceding sites to use in regression.
Relatedness
Relatedness will also be touched upon in future updates.