FastNgsAdmixOld: Difference between revisions

From software
Jump to navigation Jump to search
m (Albrecht moved page FastNgsAdmix to FastNgsAdmixOld without leaving a redirect: obsolute)
 
(21 intermediate revisions by one other user not shown)
Line 1: Line 1:
This page contains information about the program called FastNGSadmixPCA, which is a very fast tool for finding admixture proportions from NGS data of a single individual to incorporate into PCA of NGS data. It is based on genotype likelihoods. The program is written in R.
This page contains information about the program called fastNGSadmix which is a very fast tool for finding admixture proportions from NGS data of a single individual and a method for doing PCA of NGS data, using the estimated admixture proporitions. It is based on genotype likelihoods. The admixture estimation part is written in C++ and the PCA part is written in R.


=Installation=
=Installation=
The program + more can be downloaded from:
https://github.com/e-jorsboe/fastNGSadmix
So far it has only been tested on linux systems.
=Input Files=
Input files are genotype likelihoods in the genotype likelihood beagle input file format [http://faculty.washington.edu/browning/beagle/beagle.html]. Or called genotypes in the binary plink files (*.bed) format [https://www.cog-genomics.org/plink2]
We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format and plink2 for handling plink files.
The example below show how to make a beagle file of genotype likelihood using ANGSD.
<pre>
<pre>
wget http://popgen.dk/albrecht/kristian/tool_download.zip
HOME$ ./angsd -i example/smallNA12874.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam -GL 2 -sites data/humanOrigins_7worldPops.sites -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out example/smallNA12874.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam
unzip tool_download.zip
OR simply use SHINY:
http://popgen.dk:443/kristian/admixpca_human/
</pre>
</pre>


=Run example=
Example of a beagle genotype likelihood input file for 1 individual.
tool.zip contains all files needed to execute FASTNGSAdmixPCA. The sample is from the HAPMAP project. In need of more samples, one can find a couple more samples in http://popgen.dk/albrecht/kristian/
<pre>
The Rscript below executes the tool. all output is directed to a output_folder that is created in the process. To see the preset: Rscript FastNGSAdmixPCA.r
marker      allele1  allele2  Ind0      Ind0    Ind0
1_14000023      1      0      0.941    0.058    0.000
1_14000072      2      3      0.709    0.177    0.112
1_14000113      0      2      0.855    0.106    0.037
1_14000202      2      0      0.835    0.104    0.060
...
</pre>
 
Where the marker should be chr_pos, instead of rs ID.
 
A provided SNP.sites file has been included.
 
The program needs a reference panel of population specific frequencies, populations for which admixture proportions should be estimated,
for instance from 1000 G or HGDP, or another custom made reference panel. The program also needs a file telling the size of each reference panel population.
There is an R script plinkToRef.R, that can convert a plink file to a reference panel and size of reference panel populations.
Example:
 
<pre>
Rscript R/plinkToRef.R data/humanOrigins_7worldPops
</pre>
 
This generates 3 files, a reference panel named refPanel_plinkFile.txt, a number of individuals file called nInd_plinkFile.txt and a plinkFile.sites file with chr pos and minor major alleles.
The .fam file must have the group or population specified as FID, meaning first column and then the individual ID as the second column:
 
<pre>
<pre>
Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz
French F1 0 0 1 1
French F2 0 0 1 1
Yoruba Y1 0 0 2 1
Yoruba Y2 0 0 1 1
Yoruba Y3 0 0 2 1
...
</pre>
</pre>
All arguments can be altered. To alter the reference populations, one need to write comma separated populations to the refpops argument as shown below
 
A second argument of MAF threshold can be supplied, meaning all sites where the MAF is below this in any population is removed.


<pre>
<pre>
Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz refpops=YRI,JPT,CHB,CEU
Rscript R/plinkToRef.R data/humanOrigins_7worldPops 0.05
</pre>
</pre>


To get an overview of available reference populations, one can make a dry run
The '''snpStats R package''', is needed for running the plinkToRef.R script.
 
fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected.
 
I lifted the dataset hg19 using the program liftOver, I then translated snpNames to rs names, using 1000G data, generating a unique name for each site via "chr-pos-A1-A2" (where A1 and A2 are alphabetically sorted).
 
Furthermore I removed sites with more than 5 % missing and a MAF below 5 %, and kept only autosomal sites.
I selected 7 populations French, Han, Chukchi, Karitiana, Papuan, Sindhi and Yoruba to have representation for most of the world.
Furthermore I made sure that I only used unadmixed individuals within each population.
 
=Run example=
'''Admixture estimation'''
 
An example of a command running fastNGSadmix with a beagle file and a chosen K of 3:


<pre>
<pre>
Rscript FastNGSAdmixPCA.r infile=TRUE dryrun=TRUE
./fastNGSadmix -likes NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz -fname data/refPanel_humanOrigins_7worldPops.txt -Nname data/nInd_humanOrigins_7worldPops.txt -outfiles NA12763_CEU -K 3
</pre>
</pre>


If the reference panel has more than K populations, the program will just use the K first populations.
You can pick which populations should be analyzed via the "-whichPops" option, where you write the names of the population comma seperated, K has to be same number as selected populations.
It should also be noted that the program quits if there are duplicate sites (based on chr_pos ID) in the input or the reference panel.


=Input Files=
It then produces two files NA12763_CEU.qopt with the admixture proportions and NA12763_CEU.log.
Input files are contains genotype likelihoods in genotype likelihood beagle input file format [http://faculty.washington.edu/browning/beagle/beagle.html]. We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format.
 
Or with a plink file (no provided example plink file):


The example below show how to make a beagle file of genotype likelihood using ANGSD.
<pre>
<pre>
HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood
./fastNGSadmix -plink plinkFile -fname data/refPanel_humanOrigins_7worldPops.txt -Nname data/nInd_humanOrigins_7worldPops.txt -outfiles plinkFile -K 3
</pre>
</pre>


Example of a beagle genotype likelihood input file for 3 individuals.
A whole list of options can be explored by running fastNGSadmix without any input:
 
<pre>
<pre>
marker      allele1  allele2  Ind0      Ind0    Ind0
./fastNGSadmix
1_14000023      1      0      0.941    0.058    0.000
1_14000072      2      3      0.709    0.177    0.112
1_14000113      0      2      0.855    0.106    0.037
1_14000202      2      0      0.835    0.104    0.060
...
</pre>
</pre>


The option "-conv" specifies the number of convergence runs, with a new random starting point for each run. This is useful to test for convergence. The program will execute a maximum of 10 times.
Another option is "-boot" which specifies the number of bootstrap runs, where random sites are sampled for each run. This is useful for generating a confidence interval of your estimate. The program will execute a maximum of 100 times.
For faster inference of admixture proportions the "-Qconv" option can be set to 1, this bases the converge criteria on change in the admixture proportions values. The threshold of this can be set with "-Qtol".
By default the "-Qtol" threshold is 0.0000001, it should not be put lower than this and generally this is only for when wanting a fast overview of the data, as it is less precise than the likelihood based convergence.
By default the method adjusting the frequencies is used, to use the unadjusted approach set "-doAdjust 0". The accelerated EM algorithm is used by default, to use regular EM algorithm set "-method 0".


=version 2=


This program needs the genotype likelihoods in the beagle file format. It also needs frequencies of a reference panel with the populations for which admixture proportions should be estimated,
'''PCA'''
for instance from 1000 G or HGDP, or a custom made reference panel, it should be noted that the frequencies in the reference panel should be of the major allele in the beagle file.


(So if the 3 columns with genotype likelihoods in the beagle file is coded like this AA AB BB, then the frequencies should be of the A allele.)
After having run the admixture estimation, a PCA analysis can be run using the estimated admixture proportions to account for population structure in the PCA.
The PCA method is implemented in R and is run using the script fastNGSAdmixPCA.R.


Furthermore a file with the number of individuals in each reference population should be supplied.
The method needs the estimated admixture proportions, the analyzed beagle or plink file as well as the genotypes of the reference panel used (as binary plink files).
(refPanel is the binary plink files, refPanel.bim, refPanel.fam and refPanel.bed)
 
<pre>
Rscript R/fastNGSAdmixPCA.R likes=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz  qopt=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz.qopt out=NA12763_CEU  geno=data/humanOrigins_7worldPops
</pre>


An example of a command:
Or using analyzed plink files.


<pre>
<pre>
./fastNGSadmix -likes Yoruba10Japanese65Han25_3000000_d10_N10_GL.txt -fname Yoruba10Japanese65Han25_3000000_d10_N10_Ref.txt -Nname sYoruba10Japanese65Han25_3000000_d10_N10_nInd.txt -outfiles Yoruba10Japanese65Han25_3000000_d10_N10
Rscript R/fastNGSAdmixPCA.R plinkFile=plinkFile qopt=plinkFile.qopt out=plinkFile  geno=data/humanOrigins_7worldPops
</pre>
</pre>


Then a lot of different options and filters can be specified:
By default the populations of the analyzed *.qopt file are used. It can be specified which PCs should be plotted (default PC1 and PC2), by for example giving PCs=2,3 as an argument (if PC2 and PC3).
The script also generates barplots of the admixture proportions supplied with confidence intervals.


(TO BE CONTINUED...)
The '''snpStats''' R package, is needed for running this script.

Latest revision as of 16:27, 3 May 2017

This page contains information about the program called fastNGSadmix which is a very fast tool for finding admixture proportions from NGS data of a single individual and a method for doing PCA of NGS data, using the estimated admixture proporitions. It is based on genotype likelihoods. The admixture estimation part is written in C++ and the PCA part is written in R.

Installation

The program + more can be downloaded from:

https://github.com/e-jorsboe/fastNGSadmix

So far it has only been tested on linux systems.

Input Files

Input files are genotype likelihoods in the genotype likelihood beagle input file format [1]. Or called genotypes in the binary plink files (*.bed) format [2] We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format and plink2 for handling plink files.

The example below show how to make a beagle file of genotype likelihood using ANGSD.

HOME$ ./angsd -i example/smallNA12874.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam -GL 2 -sites data/humanOrigins_7worldPops.sites -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out example/smallNA12874.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam

Example of a beagle genotype likelihood input file for 1 individual.

marker       allele1  allele2   Ind0      Ind0    Ind0
1_14000023      1       0       0.941    0.058    0.000
1_14000072      2       3       0.709    0.177    0.112
1_14000113      0       2       0.855    0.106    0.037
1_14000202      2       0       0.835    0.104    0.060
...

Where the marker should be chr_pos, instead of rs ID.

A provided SNP.sites file has been included.

The program needs a reference panel of population specific frequencies, populations for which admixture proportions should be estimated, for instance from 1000 G or HGDP, or another custom made reference panel. The program also needs a file telling the size of each reference panel population. There is an R script plinkToRef.R, that can convert a plink file to a reference panel and size of reference panel populations. Example:

Rscript R/plinkToRef.R data/humanOrigins_7worldPops

This generates 3 files, a reference panel named refPanel_plinkFile.txt, a number of individuals file called nInd_plinkFile.txt and a plinkFile.sites file with chr pos and minor major alleles. The .fam file must have the group or population specified as FID, meaning first column and then the individual ID as the second column:

French F1 0 0 1 1
French F2 0 0 1 1
Yoruba Y1 0 0 2 1
Yoruba Y2 0 0 1 1
Yoruba Y3 0 0 2 1
...

A second argument of MAF threshold can be supplied, meaning all sites where the MAF is below this in any population is removed.

Rscript R/plinkToRef.R data/humanOrigins_7worldPops 0.05

The snpStats R package, is needed for running the plinkToRef.R script.

fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected.

I lifted the dataset hg19 using the program liftOver, I then translated snpNames to rs names, using 1000G data, generating a unique name for each site via "chr-pos-A1-A2" (where A1 and A2 are alphabetically sorted).

Furthermore I removed sites with more than 5 % missing and a MAF below 5 %, and kept only autosomal sites. I selected 7 populations French, Han, Chukchi, Karitiana, Papuan, Sindhi and Yoruba to have representation for most of the world. Furthermore I made sure that I only used unadmixed individuals within each population.

Run example

Admixture estimation

An example of a command running fastNGSadmix with a beagle file and a chosen K of 3:

./fastNGSadmix -likes NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz -fname data/refPanel_humanOrigins_7worldPops.txt -Nname data/nInd_humanOrigins_7worldPops.txt -outfiles NA12763_CEU -K 3

If the reference panel has more than K populations, the program will just use the K first populations. You can pick which populations should be analyzed via the "-whichPops" option, where you write the names of the population comma seperated, K has to be same number as selected populations. It should also be noted that the program quits if there are duplicate sites (based on chr_pos ID) in the input or the reference panel.

It then produces two files NA12763_CEU.qopt with the admixture proportions and NA12763_CEU.log.

Or with a plink file (no provided example plink file):

./fastNGSadmix -plink plinkFile -fname data/refPanel_humanOrigins_7worldPops.txt -Nname data/nInd_humanOrigins_7worldPops.txt -outfiles plinkFile -K 3

A whole list of options can be explored by running fastNGSadmix without any input:

./fastNGSadmix

The option "-conv" specifies the number of convergence runs, with a new random starting point for each run. This is useful to test for convergence. The program will execute a maximum of 10 times. Another option is "-boot" which specifies the number of bootstrap runs, where random sites are sampled for each run. This is useful for generating a confidence interval of your estimate. The program will execute a maximum of 100 times. For faster inference of admixture proportions the "-Qconv" option can be set to 1, this bases the converge criteria on change in the admixture proportions values. The threshold of this can be set with "-Qtol". By default the "-Qtol" threshold is 0.0000001, it should not be put lower than this and generally this is only for when wanting a fast overview of the data, as it is less precise than the likelihood based convergence. By default the method adjusting the frequencies is used, to use the unadjusted approach set "-doAdjust 0". The accelerated EM algorithm is used by default, to use regular EM algorithm set "-method 0".


PCA

After having run the admixture estimation, a PCA analysis can be run using the estimated admixture proportions to account for population structure in the PCA. The PCA method is implemented in R and is run using the script fastNGSAdmixPCA.R.

The method needs the estimated admixture proportions, the analyzed beagle or plink file as well as the genotypes of the reference panel used (as binary plink files). (refPanel is the binary plink files, refPanel.bim, refPanel.fam and refPanel.bed)

Rscript R/fastNGSAdmixPCA.R likes=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz  qopt=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz.qopt out=NA12763_CEU  geno=data/humanOrigins_7worldPops

Or using analyzed plink files.

Rscript R/fastNGSAdmixPCA.R plinkFile=plinkFile qopt=plinkFile.qopt out=plinkFile  geno=data/humanOrigins_7worldPops

By default the populations of the analyzed *.qopt file are used. It can be specified which PCs should be plotted (default PC1 and PC2), by for example giving PCs=2,3 as an argument (if PC2 and PC3). The script also generates barplots of the admixture proportions supplied with confidence intervals.

The snpStats R package, is needed for running this script.