FastNgsAdmixOld: Difference between revisions
Line 72: | Line 72: | ||
for instance from 1000 G or HGDP, or another custom made reference panel. The program also needs a file telling the size of each reference panel population. | for instance from 1000 G or HGDP, or another custom made reference panel. The program also needs a file telling the size of each reference panel population. | ||
There is an R script called plinkToRefV2.R, that can convert a plink file to a reference panel and size of reference panel populations. | There is an R script called plinkToRefV2.R, that can convert a plink file to a reference panel and size of reference panel populations. | ||
Example: | |||
<pre> | |||
Rscript plinkToRefV2.R plinkFile | |||
</pre> | |||
This generates 3 files, a reference panel named refPanel_plinkFile.txt, a number of individuals file called nInd_plinkFile.txt and a plinkFile.sites file with chr pos and minor major alleles. | |||
fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected. | fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected. | ||
Line 86: | Line 93: | ||
./fastNGSadmix -likes indi_genotypelikelihood.beagle -fname refPanel.txt -Nname nInd.txt -outfiles indi_genotypelikelihood.beagle -K 3 | ./fastNGSadmix -likes indi_genotypelikelihood.beagle -fname refPanel.txt -Nname nInd.txt -outfiles indi_genotypelikelihood.beagle -K 3 | ||
</pre> | </pre> | ||
If the reference panel has more than K populations, the program will just use the K first populations. | |||
You can pick which populations should be analyzed via the "-whichPops" option, where you write the names of the population comma seperated, K has to be same number as selected populations. | |||
It then produces two files indi_genotypelikelihood.beagle.qopt with the admixture proportions and indi_genotypelikelihood.beagle.log. | It then produces two files indi_genotypelikelihood.beagle.qopt with the admixture proportions and indi_genotypelikelihood.beagle.log. | ||
Line 98: | Line 108: | ||
<pre> | <pre> | ||
./fastNGSadmix | ./fastNGSadmix | ||
</pre> | </pre> | ||
The option "-conv" specifies the number of convergence runs, with a new random starting point for each run. This is useful to test for convergence. The program will execute a maximum of 10 times. | The option "-conv" specifies the number of convergence runs, with a new random starting point for each run. This is useful to test for convergence. The program will execute a maximum of 10 times. | ||
Another option is "-boot" which specifies the number of bootstrap runs, where random sites are sampled for each run. This is useful for generating a confidence interval of your estimate. The program will execute a maximum of 100 times. | Another option is "-boot" which specifies the number of bootstrap runs, where random sites are sampled for each run. This is useful for generating a confidence interval of your estimate. The program will execute a maximum of 100 times. | ||
For faster inference of admixture proportions the "-Qconv" option can be set to 1, this bases the converge criteria on change in the admixture proportions values. The threshold of this can be set with "-Qtol". | |||
By default the method adjusting the frequencies is used, to use the unadjusted approach set "-doAdjust 0". The accelerated EM algorithm is used by default, to use regular EM algorithm set "-method 0". | |||
OPTIONS OPTIONS OPTIONS!?? | OPTIONS OPTIONS OPTIONS!?? |
Revision as of 15:39, 5 January 2017
This page contains information about the program called FastNGSadmixPCA, which is a very fast tool for finding admixture proportions from NGS data of a single individual to incorporate into PCA of NGS data. It is based on genotype likelihoods. The program is written in R.
Installation
wget http://popgen.dk/albrecht/kristian/tool_download.zip unzip tool_download.zip OR simply use SHINY: http://popgen.dk:443/kristian/admixpca_human/
Run example
tool.zip contains all files needed to execute FASTNGSAdmixPCA. The sample is from the HAPMAP project. In need of more samples, one can find a couple more samples in http://popgen.dk/albrecht/kristian/ The Rscript below executes the tool. all output is directed to a output_folder that is created in the process. To see the preset: Rscript FastNGSAdmixPCA.r
Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz
All arguments can be altered. To alter the reference populations, one need to write comma separated populations to the refpops argument as shown below
Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz refpops=YRI,JPT,CHB,CEU
To get an overview of available reference populations, one can make a dry run
Rscript FastNGSAdmixPCA.r infile=TRUE dryrun=TRUE
Input Files
Input files are contains genotype likelihoods in genotype likelihood beagle input file format [1]. We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format.
The example below show how to make a beagle file of genotype likelihood using ANGSD.
HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood
Example of a beagle genotype likelihood input file for 3 individuals.
marker allele1 allele2 Ind0 Ind0 Ind0 1_14000023 1 0 0.941 0.058 0.000 1_14000072 2 3 0.709 0.177 0.112 1_14000113 0 2 0.855 0.106 0.037 1_14000202 2 0 0.835 0.104 0.060 ...
version 2
Input files are genotype likelihoods in the genotype likelihood beagle input file format [2]. Or called genotypes in the binary plink files (*.bed) format [3] We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format and plink2 for handling plink files.
The example below show how to make a beagle file of genotype likelihood using ANGSD.
HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood
Example of a beagle genotype likelihood input file for 1 individual.
marker allele1 allele2 Ind0 Ind0 Ind0 1_14000023 1 0 0.941 0.058 0.000 1_14000072 2 3 0.709 0.177 0.112 1_14000113 0 2 0.855 0.106 0.037 1_14000202 2 0 0.835 0.104 0.060 ...
A provided SNP.sites file has been included.
The program also needs frequencies of a reference panel with the populations for which admixture proportions should be estimated,
for instance from 1000 G or HGDP, or another custom made reference panel. The program also needs a file telling the size of each reference panel population.
There is an R script called plinkToRefV2.R, that can convert a plink file to a reference panel and size of reference panel populations.
Example:
Rscript plinkToRefV2.R plinkFile
This generates 3 files, a reference panel named refPanel_plinkFile.txt, a number of individuals file called nInd_plinkFile.txt and a plinkFile.sites file with chr pos and minor major alleles.
fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected.
I lifted the dataset hg19 using the program liftOver, I then translated snpNames to rs names, using 1000G data, generating a unique name for each site via "chr-pos-A1-A2" (where A1 and A2 are alphabetically sorted).
Furthermore I removed sites with more than 5 % missing and a MAF below 5 %, and only autosomal sites. I selected 5 populations French, Han, Karitiana, Papuan and Yoruba to have representation for most of the world. Furthermore I made sure that I only used unadmixed individuals within each population.
An example of a command running fastNGSadmix with a beagle file and a chosen K of 3:
./fastNGSadmix -likes indi_genotypelikelihood.beagle -fname refPanel.txt -Nname nInd.txt -outfiles indi_genotypelikelihood.beagle -K 3
If the reference panel has more than K populations, the program will just use the K first populations. You can pick which populations should be analyzed via the "-whichPops" option, where you write the names of the population comma seperated, K has to be same number as selected populations.
It then produces two files indi_genotypelikelihood.beagle.qopt with the admixture proportions and indi_genotypelikelihood.beagle.log.
Or with a plink file:
./fastNGSadmix -plink plinkFile -fname refPanel.txt -Nname nInd.txt -outfiles plinkFile -K 3
A whole list of options can be explored by running fastNGSadmix without any input:
./fastNGSadmix
The option "-conv" specifies the number of convergence runs, with a new random starting point for each run. This is useful to test for convergence. The program will execute a maximum of 10 times. Another option is "-boot" which specifies the number of bootstrap runs, where random sites are sampled for each run. This is useful for generating a confidence interval of your estimate. The program will execute a maximum of 100 times. For faster inference of admixture proportions the "-Qconv" option can be set to 1, this bases the converge criteria on change in the admixture proportions values. The threshold of this can be set with "-Qtol". By default the method adjusting the frequencies is used, to use the unadjusted approach set "-doAdjust 0". The accelerated EM algorithm is used by default, to use regular EM algorithm set "-method 0".
OPTIONS OPTIONS OPTIONS!??
Custom refpanel can be supplied, has to look like this, where the 5 first columns have to be, then populations frequencies:
chr,pos,name,A0,A1
The frequencies have to be of the A0 allele.
Basically the files have to look like this: bgl rs1 A B GL(AA) GL(AB) GL(BB) Then ref 1 1 rs1 B A f(B)
Then solution is this: bgl rs1 A B GL(AA) GL(AB) GL(BB) Then ref 1 1 rs1 B A 1-f(B)
Then prepFreqs.R will take care of preparing the files properly.
Example of running prepFreqs.R:
Rscript prepFreqs.R indi_genotypelikelihood.bgl
Can also specify other populations than the default 5 ones in the reference panel. Also a custom made reference panel can be supplied (has to be .Rdata file).
Rscript prepFreqs.R indi_genotypelikelihood.bgl Pop1,Pop2,Pop3,Pop4 customRefPanel
indi_genotypelikelihood
Then run prepFreqs.R to get the proper beagle, refpanel and nInd files for the analysis.
Then run fastNGSadmix.
All the awesome options with the program.
So bgl
rs1 A B GL(AA) GL(AB) GL(BB)
Then ref
1 1 rs1 A B f(A)
(So if the 3 columns with genotype likelihoods in the beagle file is coded like this AA AB BB, then the frequencies should be of the A allele.)
Furthermore a file with the number of individuals in each reference population should be supplied.
Then a lot of different options and filters can be specified:
(TO BE CONTINUED...)