FastNgsAdmixOld
This page describes FastNGSadmixPCA, a very fast tool for finding admixture proportions from the NGS data of a single individual, to incorporate into a PCA of NGS data. It is based on genotype likelihoods. The program is written in R.
Installation
wget http://popgen.dk/albrecht/kristian/tool_download.zip
unzip tool_download.zip
OR simply use SHINY: http://popgen.dk:443/kristian/admixpca_human/
Run example
The downloaded zip file contains all files needed to run FastNGSadmixPCA. The sample is from the HapMap project. If more samples are needed, a few more can be found at http://popgen.dk/albrecht/kristian/
To see the preset arguments, run the script without arguments:
Rscript FastNGSAdmixPCA.r
The Rscript command below runs the tool. All output is written to an output_folder that is created in the process.
Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz
All arguments can be altered. To change the reference populations, pass a comma-separated list of populations to the refpops argument, as shown below.
Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz refpops=YRI,JPT,CHB,CEU
To get an overview of the available reference populations, do a dry run:
Rscript FastNGSAdmixPCA.r infile=TRUE dryrun=TRUE
Input Files
Input files contain genotype likelihoods in the beagle genotype likelihood input file format [1]. We recommend [ANGSD] for easy conversion of next-generation sequencing data to beagle format.
The example below shows how to make a beagle file of genotype likelihoods using ANGSD.
HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood
Example of a beagle genotype likelihood input file for a single individual (three genotype likelihood columns per individual).
marker allele1 allele2 Ind0 Ind0 Ind0
1_14000023 1 0 0.941 0.058 0.000
1_14000072 2 3 0.709 0.177 0.112
1_14000113 0 2 0.855 0.106 0.037
1_14000202 2 0 0.835 0.104 0.060
...
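The layout above (three leading columns, then three genotype likelihoods per individual) can be checked before running the tool. The following is a minimal sketch, not part of FastNGSadmixPCA: `check_beagle` is a hypothetical helper, and the tolerance of 0.01 assumes that the three likelihoods per individual are normalised to sum to roughly 1, as they do in the example rows.

```shell
#!/bin/sh
# check_beagle: hypothetical helper (not shipped with FastNGSadmixPCA).
# Usage: check_beagle FILE   (FILE may be plain text or end in .gz)
# Returns 0 if the layout looks consistent, non-zero otherwise.
check_beagle() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk '
        NR == 1 {
            ncol = NF
            # header: 3 leading columns plus 3 likelihood columns per individual
            if (ncol < 6 || (ncol - 3) % 3 != 0) {
                print "bad header: " ncol " columns"
                bad = 1
                exit
            }
            next
        }
        NF != ncol {
            print "line " NR ": expected " ncol " columns, got " NF
            bad = 1
            next
        }
        {
            # assumption: each triple of likelihoods sums to about 1
            for (i = 4; i <= NF; i += 3) {
                s = $i + $(i + 1) + $(i + 2)
                if (s < 0.99 || s > 1.01) {
                    print "line " NR ": likelihoods sum to " s
                    bad = 1
                }
            }
        }
        END { exit bad + 0 }
    '
}
```

For example, `check_beagle indi_genotypelikelihood.beagle.gz && echo OK` would print OK only if every line has the same number of columns as the header and every triple sums to about 1.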
version 2
A SNP.sites file is included with the program. This file, along with the reference panel and genotypes, is taken from the curated dataset of Lazaridis et al. (2014).
I lifted the dataset (hg19) using the program liftOver. I then translated the SNP names to rs names using 1000G data, generating a unique name for each site of the form "chr-pos-A1-A2" (where A1 and A2 are sorted alphabetically).
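The naming scheme can be illustrated with a short sketch. `make_site_name` is a hypothetical helper written for this page, not the script that was actually used; it assumes single-letter alleles and leaves the chromosome label exactly as given.

```shell
#!/bin/sh
# make_site_name: hypothetical helper illustrating the "chr-pos-A1-A2" scheme.
# Usage: make_site_name CHR POS ALLELE ALLELE   (alleles in any order)
make_site_name() {
    awk -v chr="$1" -v pos="$2" -v a="$3" -v b="$4" 'BEGIN {
        # put the two alleles in alphabetical order
        if (a > b) { t = a; a = b; b = t }
        print chr "-" pos "-" a "-" b
    }'
}
```

With this sketch, `make_site_name 1 14000023 T C` and `make_site_name 1 14000023 C T` both print `1-14000023-C-T`, so the name does not depend on the order in which the alleles are listed.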
A custom reference panel can be supplied. The first five columns must be the following, followed by one column of allele frequencies per population:
chr,pos,name,A0,A1
The frequencies must be those of the A0 allele.
Then run prepFreqs.R to produce the proper beagle, refpanel and nInd files for the analysis.
Then run fastNGSadmix.
All the awesome options with the program.
This program needs the genotype likelihoods in the beagle file format. It also needs the allele frequencies of a reference panel containing the populations for which admixture proportions should be estimated, for instance from 1000G or HGDP, or from a custom-made reference panel. Note that the frequencies in the reference panel must be those of the major allele in the beagle file.
(So if the 3 columns with genotype likelihoods in the beagle file are coded as AA AB BB, then the frequencies should be of the A allele.)
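If a reference frequency refers to the other allele of a biallelic site, it can be converted by taking 1 minus the frequency. The sketch below shows that step for a single value; `orient_freq` is a hypothetical helper and not part of fastNGSadmix, and it assumes that the site is biallelic and that the two alleles are written in the same coding in both files.

```shell
#!/bin/sh
# orient_freq: hypothetical helper (not part of fastNGSadmix).
# Usage: orient_freq MAJOR FREQ_ALLELE FREQ
#   MAJOR        major allele in the beagle file (the "A" in AA AB BB)
#   FREQ_ALLELE  allele that the reference frequency refers to
#   FREQ         frequency of FREQ_ALLELE in one reference population
# Prints the frequency of MAJOR, assuming a biallelic site.
orient_freq() {
    awk -v major="$1" -v ref="$2" -v f="$3" 'BEGIN {
        if (major == ref) print f + 0   # already the major allele
        else              print 1 - f   # other allele: flip the frequency
    }'
}
```

For example, `orient_freq A G 0.25` prints `0.75`, while `orient_freq A A 0.25` leaves the value at `0.25`.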
Furthermore, a file with the number of individuals in each reference population must be supplied.
An example of a command:
./fastNGSadmix -likes Yoruba10Japanese65Han25_3000000_d10_N10_GL.txt -fname Yoruba10Japanese65Han25_3000000_d10_N10_Ref.txt -Nname sYoruba10Japanese65Han25_3000000_d10_N10_nInd.txt -outfiles Yoruba10Japanese65Han25_3000000_d10_N10
Then a lot of different options and filters can be specified:
(TO BE CONTINUED...)