FastNGSadmix: Difference between revisions
No edit summary |
|||
Line 109: | Line 109: | ||
<pre> | <pre> | ||
./fastNGSadmix -plink plinkFile -fname | |||
#frequencies from reference panel | |||
REF=data/refPanel_humanOrigins_7worldPops.txt | |||
#number of individuals in each population of reference panel | |||
NIND=data/nInd_humanOrigins_7worldPops.txt | |||
./fastNGSadmix -plink plinkFile -fname $REF -Nname $NIND -outfiles plinkFile -K 3 +whichPops French,Han,Yoruba | |||
</pre> | </pre> | ||
Revision as of 10:59, 24 February 2017
This page contains information about the program fastNGSadmix, a very fast tool for finding admixture proportions from NGS data of a single individual and a method for doing PCA of NGS data, using the estimated admixture proportions. It is based on genotype likelihoods. It also read plink files. The admixture estimation part is written in C++ and the PCA part is written in R.
Download
The program can be downloaded from github:
https://github.com/e-jorsboe/fastNGSadmix
git clone https://github.com/e-jorsboe/fastNGSadmix.git; cd fastNGSadmix make
So far it has only been tested on Linux systems.
Data for the analyses as well as example data can be downloaded from:
wget popgen.dk/software/download/fastNGSadmix/data.tar.gz wget popgen.dk/software/download/fastNGSadmix/example.tar.gz
Use curl if you are on a MAC
They can be unpacked thus:
tar -xzf data.tar.gz tar -xzf example.tar.gz
The data folder both has an already made reference panel (more on this in the Input files section) for the admixture estimation and genotypes in plink format for the PCA analysis.
Quick start
#file with genotype likelihoods GL=example/NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz #frequencies from reference panel REF=data/refPanel_humanOrigins_7worldPops.txt #number of individuals in each population of reference panel NIND=data/nInd_humanOrigins_7worldPops.txt #Estimate admixture proportions ./fastNGSadmix -likes $GL -fname $REF -Nname $NIND -outfiles NA12763_CEU -K 3
Input Files
Input files are genotype likelihoods in the genotype likelihood beagle input file format [1]. Or called genotypes in the binary plink files (*.bed) format [2] We recommend ANGSD for easy transformation of Next-generation sequencing data to the beagle format and plink2 for handling plink files. See the ANGSD wiki for installing and other documentation
The example below shows how to make a beagle file of genotype likelihoods using ANGSD.
#BAM/CRAM file BAM=example/smallNA12874.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam #sites in the referene panel including major and minor allele SITES=data/humanOrigins_7worldPops.sites #run ANGSD ./angsd -i $BAM -GL 2 -sites $SITES -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out outName
Example of a beagle genotype likelihood input file for 1 individual.
marker allele1 allele2 Ind0 Ind0 Ind0 1_14000023 1 0 0.941 0.058 0.000 1_14000072 2 3 0.709 0.177 0.112 1_14000113 0 2 0.855 0.106 0.037 1_14000202 2 0 0.835 0.104 0.060 ...
Where the marker column SHOULD be chr_pos, instead of rs ID.
A provided .sites file has been included, in the data folder.
Running FastNGSadmix
An example of a command running fastNGSadmix with a beagle file and a chosen K of 3:
#genotype likelihoods from sequencing file GL=example/NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz #frequencies from reference panel REF=data/refPanel_humanOrigins_7worldPops.txt #number of individuals in each population of reference panel NIND=data/nInd_humanOrigins_7worldPops.txt ./fastNGSadmix -likes $GL -fname $REF -Nname $NIND -outfiles NA12763_CEU -K 3
If the reference panel has more than K populations, the program will just use the K first populations. You can pick which populations should be analyzed via the "-whichPops" option, where you write the names of the population comma seperated, K has to be same number as selected populations. It should also be noted that the program quits if there are duplicate sites (based on chr_pos ID) in the input or the reference panel.
It then produces two files NA12763_CEU.qopt with the admixture proportions and a log file NA12763_CEU.log.
Or with a plink file (no provided example plink file), with also the +whichPops option:
#frequencies from reference panel REF=data/refPanel_humanOrigins_7worldPops.txt #number of individuals in each population of reference panel NIND=data/nInd_humanOrigins_7worldPops.txt ./fastNGSadmix -plink plinkFile -fname $REF -Nname $NIND -outfiles plinkFile -K 3 +whichPops French,Han,Yoruba
A whole list of options can be explored by running fastNGSadmix without any input:
./fastNGSadmix
- -conv [int]
The option "-conv" specifies the number of convergence runs, with a new random starting point for each run. This is useful to test for convergence. The program will execute a maximum of 10 times.
- -boot [int]
Another option is "-boot" which specifies the number of bootstrap runs, where random sites are sampled for each run. This is useful for generating a confidence interval of your estimate, for example doing the empirical 0.025 and 0.975 quantiles. The maximum number of bootstraps is 10000.
- -doAdjust [int]
By default the method adjusting the frequencies is used (see more in the paper), to use the unadjusted approach set "-doAdjust 0". The accelerated EM algorithm is used by default, to use regular EM algorithm set "-method 0".
- -Qconv [int]
For faster inference of admixture proportions the "-Qconv" option can be set to 1, this bases the converge criteria on change in the admixture proportions values. The threshold of this can be set with "-Qtol".
- -Qtol [float]
By default the "-Qtol" threshold is 0.0000001, it should not be put lower than this and generally this is only for when wanting a fast overview of the data, as it is less precise than the likelihood based convergence.
PCA
After having run the admixture estimation, a PCA analysis can be run using the estimated admixture proportions to account for population structure in the PCA. The PCA method is implemented in R and is run using the script fastNGSAdmixPCA.R.
The method needs the estimated admixture proportions, the analyzed beagle or plink file as well as the genotypes of the reference panel used (as binary plink files).
GL=example/NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz Rscript R/fastNGSadmixPCA.R likes=$GL qopt=NA12763_CEU.qopt out=NA12763_CEU geno=data/humanOrigins_7worldPops
Or using analyzed plink files.
Rscript R/fastNGSadmixPCA.R plinkFile=plinkFile qopt=plinkFile.qopt out=plinkFile geno=data/humanOrigins_7worldPops
By default the populations of the analyzed *.qopt file are used. It can be specified which PCs should be plotted (default PC1 and PC2), by for example giving PCs=2,3 as an argument (if PC2 and PC3). The script also generates barplots of the admixture proportions supplied with confidence intervals.
The snpStats R package, is also needed for running this script.
Making a reference panel
The program needs a reference panel of population specific frequencies, populations for which admixture proportions should be estimated, for instance from 1000 G or HGDP, or another custom made reference panel. The program also needs a file telling the size of each reference panel population. There is an R script plinkToRef.R in the data folder, that can convert a plink file to a reference panel and size of reference panel populations. Example:
Rscript R/plinkToRef.R data/humanOrigins_7worldPops
This generates 3 files, a reference panel named data/refPanel_humanOrigins_7worldPops.txt, a number of individuals file called data/nInd_humanOrigins_7worldPops.txt and a data/humanOrigins_7worldPops.sites file with chr pos and minor major alleles. The provided .fam file must have the group or population specified as FID, meaning first column and then the individual ID as the second column:
French F1 0 0 1 1 French F2 0 0 1 1 Yoruba Y1 0 0 2 1 Yoruba Y2 0 0 1 1 Yoruba Y3 0 0 2 1 ...
A second argument of a MAF threshold can be supplied, meaning all sites where the MAF is below this in any population is removed.
Rscript R/plinkToRef.R data/humanOrigins_7worldPops 0.05
The snpStats R package, is needed for running the plinkToRef.R script.
fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated Human Origins dataset was selected.
The dataset was lifted to hg19 using the program liftOver, The SNPs then got rs IDs, using 1000G data, generating a unique name for each site via "chr-pos-A1-A2" (where A1 and A2 are alphabetically sorted).
Furthermore sites with more than % missing and a MAF below 5 % were removed, only autosomal sites were kept. 7 populations were selectedÆ French, Han, Chukchi, Karitiana, Papuan, Sindhi and Yoruba to have representation for most of the world. Furthermore it was made sure that there were only unadmixed individuals within each population.