FastNgsAdmixOld: Difference between revisions
No edit summary |
|||
Line 48: | Line 48: | ||
=version 2= | =version 2= | ||
This page contains information about the program called fastNGSadmix which is a very fast tool for finding admixture proportions from NGS data of a single individual and a method for doing PCA of NGS data, using the estimated admixture proporitions. It is based on genotype likelihoods. The admixture estimation part is written in C++ and the PCA part is written in R. | |||
=Installation= | |||
<pre> | |||
SOMETHING SIMILAR I GUESS/PERHAPS IN PUBLIC | |||
wget http://popgen.dk/albrecht/kristian/tool_download.zip | |||
unzip tool_download.zip | |||
OR simply use SHINY: | |||
http://popgen.dk:443/kristian/admixpca_human/ | |||
</pre> | |||
=Input Files= | |||
Input files are genotype likelihoods in the genotype likelihood beagle input file format [http://faculty.washington.edu/browning/beagle/beagle.html]. Or called genotypes in the binary plink files (*.bed) format [https://www.cog-genomics.org/plink2] | Input files are genotype likelihoods in the genotype likelihood beagle input file format [http://faculty.washington.edu/browning/beagle/beagle.html]. Or called genotypes in the binary plink files (*.bed) format [https://www.cog-genomics.org/plink2] | ||
We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format and plink2 for handling plink files. | We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format and plink2 for handling plink files. | ||
Line 69: | Line 84: | ||
A provided SNP.sites file has been included. | A provided SNP.sites file has been included. | ||
</pre> | |||
Line 81: | Line 98: | ||
This generates 3 files, a reference panel named refPanel_plinkFile.txt, a number of individuals file called nInd_plinkFile.txt and a plinkFile.sites file with chr pos and minor major alleles. | This generates 3 files, a reference panel named refPanel_plinkFile.txt, a number of individuals file called nInd_plinkFile.txt and a plinkFile.sites file with chr pos and minor major alleles. | ||
fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected. | fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected. | ||
Line 90: | Line 108: | ||
Furthermore I made sure that I only used unadmixed individuals within each population. | Furthermore I made sure that I only used unadmixed individuals within each population. | ||
=Run example= | |||
An example of a command running fastNGSadmix with a beagle file and a chosen K of 3: | An example of a command running fastNGSadmix with a beagle file and a chosen K of 3: | ||
Line 98: | Line 117: | ||
If the reference panel has more than K populations, the program will just use the K first populations. | If the reference panel has more than K populations, the program will just use the K first populations. | ||
You can pick which populations should be analyzed via the "-whichPops" option, where you write the names of the population comma seperated, K has to be same number as selected populations. | You can pick which populations should be analyzed via the "-whichPops" option, where you write the names of the population comma seperated, K has to be same number as selected populations. | ||
It should also be noted that the program quits if there are duplicate sites (based on | It should also be noted that the program quits if there are duplicate sites (based on chr_pos ID) in the input or the reference panel. | ||
It then produces two files indi_genotypelikelihood.beagle.qopt with the admixture proportions and indi_genotypelikelihood.beagle.log. | It then produces two files indi_genotypelikelihood.beagle.qopt with the admixture proportions and indi_genotypelikelihood.beagle.log. | ||
Line 121: | Line 140: | ||
OPTIONS OPTIONS OPTIONS!?? | OPTIONS OPTIONS OPTIONS!?? | ||
######### | |||
### PCA part | |||
######## | |||
Revision as of 16:26, 7 February 2017
This page contains information about the program called FastNGSadmixPCA, which is a very fast tool for finding admixture proportions from NGS data of a single individual to incorporate into PCA of NGS data. It is based on genotype likelihoods. The program is written in R.
Installation
wget http://popgen.dk/albrecht/kristian/tool_download.zip unzip tool_download.zip OR simply use SHINY: http://popgen.dk:443/kristian/admixpca_human/
Run example
tool.zip contains all files needed to execute FASTNGSAdmixPCA. The sample is from the HAPMAP project. In need of more samples, one can find a couple more samples in http://popgen.dk/albrecht/kristian/ The Rscript below executes the tool. all output is directed to a output_folder that is created in the process. To see the preset: Rscript FastNGSAdmixPCA.r
Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz
All arguments can be altered. To alter the reference populations, one need to write comma separated populations to the refpops argument as shown below
Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz refpops=YRI,JPT,CHB,CEU
To get an overview of available reference populations, one can make a dry run
Rscript FastNGSAdmixPCA.r infile=TRUE dryrun=TRUE
Input Files
Input files are contains genotype likelihoods in genotype likelihood beagle input file format [1]. We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format.
The example below show how to make a beagle file of genotype likelihood using ANGSD.
HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood
Example of a beagle genotype likelihood input file for 3 individuals.
marker allele1 allele2 Ind0 Ind0 Ind0 1_14000023 1 0 0.941 0.058 0.000 1_14000072 2 3 0.709 0.177 0.112 1_14000113 0 2 0.855 0.106 0.037 1_14000202 2 0 0.835 0.104 0.060 ...
version 2
This page contains information about the program called fastNGSadmix which is a very fast tool for finding admixture proportions from NGS data of a single individual and a method for doing PCA of NGS data, using the estimated admixture proporitions. It is based on genotype likelihoods. The admixture estimation part is written in C++ and the PCA part is written in R.
Installation
SOMETHING SIMILAR I GUESS/PERHAPS IN PUBLIC wget http://popgen.dk/albrecht/kristian/tool_download.zip unzip tool_download.zip OR simply use SHINY: http://popgen.dk:443/kristian/admixpca_human/
Input Files
Input files are genotype likelihoods in the genotype likelihood beagle input file format [2]. Or called genotypes in the binary plink files (*.bed) format [3] We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format and plink2 for handling plink files.
The example below show how to make a beagle file of genotype likelihood using ANGSD.
HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood
Example of a beagle genotype likelihood input file for 1 individual.
marker allele1 allele2 Ind0 Ind0 Ind0 1_14000023 1 0 0.941 0.058 0.000 1_14000072 2 3 0.709 0.177 0.112 1_14000113 0 2 0.855 0.106 0.037 1_14000202 2 0 0.835 0.104 0.060 ...
Where the marker should be chr_pos, instead of rs ID.
A provided SNP.sites file has been included.
The program also needs frequencies of a reference panel with the populations for which admixture proportions should be estimated,
for instance from 1000 G or HGDP, or another custom made reference panel. The program also needs a file telling the size of each reference panel population.
There is an R script called plinkToRefV2.R, that can convert a plink file to a reference panel and size of reference panel populations.
Example:
Rscript plinkToRefV2.R plinkFile
This generates 3 files, a reference panel named refPanel_plinkFile.txt, a number of individuals file called nInd_plinkFile.txt and a plinkFile.sites file with chr pos and minor major alleles.
fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected.
I lifted the dataset hg19 using the program liftOver, I then translated snpNames to rs names, using 1000G data, generating a unique name for each site via "chr-pos-A1-A2" (where A1 and A2 are alphabetically sorted).
Furthermore I removed sites with more than 5 % missing and a MAF below 5 %, and only autosomal sites. I selected 5 populations French, Han, Karitiana, Papuan and Yoruba to have representation for most of the world. Furthermore I made sure that I only used unadmixed individuals within each population.
Run example
An example of a command running fastNGSadmix with a beagle file and a chosen K of 3:
./fastNGSadmix -likes indi_genotypelikelihood.beagle -fname refPanel.txt -Nname nInd.txt -outfiles indi_genotypelikelihood.beagle -K 3
If the reference panel has more than K populations, the program will just use the K first populations. You can pick which populations should be analyzed via the "-whichPops" option, where you write the names of the population comma seperated, K has to be same number as selected populations. It should also be noted that the program quits if there are duplicate sites (based on chr_pos ID) in the input or the reference panel.
It then produces two files indi_genotypelikelihood.beagle.qopt with the admixture proportions and indi_genotypelikelihood.beagle.log.
Or with a plink file:
./fastNGSadmix -plink plinkFile -fname refPanel.txt -Nname nInd.txt -outfiles plinkFile -K 3
A whole list of options can be explored by running fastNGSadmix without any input:
./fastNGSadmix
The option "-conv" specifies the number of convergence runs, with a new random starting point for each run. This is useful to test for convergence. The program will execute a maximum of 10 times. Another option is "-boot" which specifies the number of bootstrap runs, where random sites are sampled for each run. This is useful for generating a confidence interval of your estimate. The program will execute a maximum of 100 times. For faster inference of admixture proportions the "-Qconv" option can be set to 1, this bases the converge criteria on change in the admixture proportions values. The threshold of this can be set with "-Qtol". By default the "-Qtol" threshold is 0.0000001, it should not be put lower than this and generally this is only for when wanting a fast overview of the data, as it is less precise than the likelihood based convergence. By default the method adjusting the frequencies is used, to use the unadjusted approach set "-doAdjust 0". The accelerated EM algorithm is used by default, to use regular EM algorithm set "-method 0".
OPTIONS OPTIONS OPTIONS!??
- PCA part
Custom refpanel can be supplied, has to look like this, where the 5 first columns have to be, then populations frequencies:
chr,pos,name,A0,A1
The frequencies have to be of the A0 allele.
Basically the files have to look like this: bgl rs1 A B GL(AA) GL(AB) GL(BB) Then ref 1 1 rs1 B A f(B)
Then solution is this: bgl rs1 A B GL(AA) GL(AB) GL(BB) Then ref 1 1 rs1 B A 1-f(B)
Then prepFreqs.R will take care of preparing the files properly.
Example of running prepFreqs.R:
Rscript prepFreqs.R indi_genotypelikelihood.bgl
Can also specify other populations than the default 5 ones in the reference panel. Also a custom made reference panel can be supplied (has to be .Rdata file).
Rscript prepFreqs.R indi_genotypelikelihood.bgl Pop1,Pop2,Pop3,Pop4 customRefPanel
indi_genotypelikelihood
Then run prepFreqs.R to get the proper beagle, refpanel and nInd files for the analysis.
Then run fastNGSadmix.
All the awesome options with the program.
So bgl
rs1 A B GL(AA) GL(AB) GL(BB)
Then ref
1 1 rs1 A B f(A)
(So if the 3 columns with genotype likelihoods in the beagle file is coded like this AA AB BB, then the frequencies should be of the A allele.)
Furthermore a file with the number of individuals in each reference population should be supplied.
Then a lot of different options and filters can be specified:
(TO BE CONTINUED...)