NgsAdmixTutorial: Difference between revisions
| (19 intermediate revisions by 3 users not shown) | |||
| Line 1: | Line 1: | ||
| We will go through a simple and more complex example on how to use NGSadmix with visualization of the data. | We will go through a simple and more complex example on how to use NGSadmix with visualization of the data. | ||
| User need to to add -printInfo 1 in order to get information about sites retained for analysis. | |||
| ==Example of NGSadmix - very small data set== | ==Example of NGSadmix - very small data set== | ||
| Line 26: | Line 28: | ||
| <code>mkdir Results</code> | <code>mkdir Results</code> | ||
| <code>cd ..</code> | |||
| ===Set the paths to your local directories=== | ===Set the paths to your local directories=== | ||
| Line 42: | Line 46: | ||
| We will use a very reduced data set | We will use a very reduced data set. | ||
| *10 individuals from each population: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).   | |||
| *a very reduced genome 30 x 100k random regions across the autosomes | |||
| *each individual is sequenced at 2-6X | |||
| <table class="muse-table" border="2" cellpadding="5"> | <table class="muse-table" border="2" cellpadding="5"> | ||
| Line 64: | Line 68: | ||
| You can either download the beagle input files or create them yourself from bam files | You can either download the beagle input files or create them yourself from bam files. Therefore you should choose either 'Download the beagle genotype likelihood input file' or 'Create the the beagle genotype likelihood input file using ANGSD'. | ||
| ====Download the beagle genotype likelihood input file==== | ====Download the beagle genotype likelihood input file==== | ||
| NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.   | NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you. | ||
|   Download the files and move them to your input folder (for example, $IN_DIR): | A file with the information about the population is also provided. | ||
| Download the files and move them to your input folder (for example, $IN_DIR): | |||
| <code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code> | <code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code> | ||
| Line 76: | Line 82: | ||
| <code>mv Demo1pop.info $IN_DIR</code> | <code>mv Demo1pop.info $IN_DIR</code> | ||
| ====Create the the beagle genotype likelihood input file using ANGSD==== | ====Create the the beagle genotype likelihood input file using ANGSD==== | ||
| Line 105: | Line 110: | ||
| <code> paste -d " " <( cut -f 5 -d"." all.files ) <(cut -f 1 -d"." all.files | xargs -n1 basename) > $IN_DIR/Demo1pop.info </code> | <code> paste -d " " <( cut -f 5 -d"." all.files ) <(cut -f 1 -d"." all.files | xargs -n1 basename) > $IN_DIR/Demo1pop.info </code> | ||
| ====  | ==== View population information file ==== | ||
| To view a summary of the population information file, cut the first column, sort and count: | To view a summary of the population information file, cut the first column, sort and count: | ||
| Line 115: | Line 120: | ||
| <code> cut -f1 -d" " $IN_DIR/Demo1pop.info  > $OUT_DIR/poplabel </code> | <code> cut -f1 -d" " $IN_DIR/Demo1pop.info  > $OUT_DIR/poplabel </code> | ||
| ===  | ==== View the genotype likelihood beagle file ==== | ||
| *In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.   | *In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.   | ||
| Line 121: | Line 126: | ||
| *In order to see the first 10 columns and 10 lines of the input file, type: | *In order to see the first 10 columns and 10 lines of the input file, type: | ||
| <code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code> | :<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code> | ||
| *Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line): | *Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line): | ||
| <code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code> | :<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code> | ||
| === Let’s analyze the input file === | === Let’s analyze the input file === | ||
| Line 147: | Line 152: | ||
| *'''Estimated allele frequency''' | *'''Estimated allele frequency''' | ||
| A zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus). | ::A zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus). | ||
| We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command: | ::We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command: | ||
| <code>zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5</code> | ::<code>zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5</code> | ||
| *'''Estimated admixture proportions''' | *'''Estimated admixture proportions''' | ||
| A .qopt file, that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual). | ::A .qopt file, that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual). | ||
| ::To obtain the estimated admixture proportions for the first 5 individuals, type the following command: | |||
| ::<code>head -n 5 Demo1NGSadmix.qopt</code> | |||
| === Plot the results in R === | === Plot the results in R === | ||
| Line 176: | Line 173: | ||
| <code>cd $OUT_DIR</code> | <code>cd $OUT_DIR</code> | ||
| Type “R” in the terminal and press enter and paste the following code into R: | |||
| <pre> | <pre> | ||
| # Get ID and pop info for each individual | # Get ID and pop info for each individual | ||
| pop<-scan("poplabel",what="theFuck") | pop<-scan("poplabel",what="theFuck") | ||
| # Read inferred admixture proportions file | # Read inferred admixture proportions file | ||
| Line 192: | Line 187: | ||
| barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Demo1 Admixture proportions",cex.names=0.75) | barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Demo1 Admixture proportions",cex.names=0.75) | ||
| </pre> | </pre> | ||
| The y-axis of the plot show the admixture proportions and the individuals in the samples are plotted in the x-axis.   | The y-axis of the plot show the admixture proportions and the individuals in the samples are plotted in the x-axis.   | ||
| Line 239: | Line 233: | ||
| ''Note: please make sure you have created and set up the working directories as indicated in the previous tutorial.'' | ''Note: please make sure you have created and set up the working directories as indicated in the previous tutorial.'' | ||
| ===Download=== | ===Download Files=== | ||
| Download and copy the files to your input folder (for example, $IN_DIR) | Download and copy the files to your input folder (for example, $IN_DIR) | ||
| < | <pre> | ||
| wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz | |||
| wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info | |||
| mv Demo2input.gz $IN_DIR | |||
| mv Demo2pop.info $OUT_DIR</pre> | |||
| make sure you are back in your original folder (not the $OUT_DIR folder) | make sure you are back in your original folder (not the $OUT_DIR folder) | ||
| Line 257: | Line 249: | ||
| ===Run NGSadmix=== | ===Run NGSadmix=== | ||
| Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21), and set the prefix of the output files to Demo2NGSadmixK3. For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters] | Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21), and set the prefix of the output files to Demo2NGSadmixK3. For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters] | ||
| <code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3</code> | <code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3</code> | ||
Latest revision as of 23:59, 6 June 2022
We will go through a simple and more complex example on how to use NGSadmix with visualization of the data.
User need to to add -printInfo 1 in order to get information about sites retained for analysis.
Example of NGSadmix - very small data set
In our first example, we will infer admixture proportions for low depth NGS data using a small dataset from 30 human samples.
Set paths to software
Every time you open a new terminal window, set directories to all required programs and the data you will use depending on where you installed them
Set the path to NGSadmix, for example:
NGSADMIX=~/Software/NGSadmix
Test the link
ls $NGSADMIX
Create directories
Create the directories that will be used for working:
mkdir Demo
cd Demo
mkdir Data
mkdir Results
cd ..
Set the paths to your local directories
IN_DIR=Demo/Data
OUT_DIR=Demo/Results	
Test the links
ls $IN_DIR
ls $OUT_DIR
The test data
We will use a very reduced data set.
- 10 individuals from each population: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).
- a very reduced genome 30 x 100k random regions across the autosomes
- each individual is sequenced at 2-6X
| CEU | Europeans (mostly of British ancestry) | 
| JPT | East Asian - Japanese individuals | 
| YRI | West African - Nigerian Yoruba individuals | 
You can either download the beagle input files or create them yourself from bam files. Therefore you should choose either 'Download the beagle genotype likelihood input file' or 'Create the the beagle genotype likelihood input file using ANGSD'.
Download the beagle genotype likelihood input file
NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you. A file with the information about the population is also provided.
Download the files and move them to your input folder (for example, $IN_DIR):
wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info
mv Demo1input.gz $IN_DIR
mv Demo1pop.info $IN_DIR
Create the the beagle genotype likelihood input file using ANGSD
first set the path to ANGSD (change ~/github/angsd/ to the path on your system)
 ANGSD=~/github/angsd/angsd 
test that you have the rigth path and that ANGSD is installed
 $ANGSD
Download small BAM files and extract files
 wget http://popgen.dk/software/download/NGSadmix/data/smallerbams.tar 
 tar xf smallerbams.tar 
Make a list with the bam file names
find smallerbams/ |  grep bam$ > all.files
calculate genotype likelihoods for polymorphic sites using ANGSD (see ANGSD website for further information)
$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/Demo1input.gz -P 5 
create population information file
 paste -d " " <( cut -f 5 -d"." all.files ) <(cut -f 1 -d"." all.files | xargs -n1 basename) > $IN_DIR/Demo1pop.info 
View population information file
To view a summary of the population information file, cut the first column, sort and count:
cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c
Lets make a population label file and place it in the output directory
 cut -f1 -d" " $IN_DIR/Demo1pop.info  > $OUT_DIR/poplabel 
View the genotype likelihood beagle file
- In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.
- In order to see the first 10 columns and 10 lines of the input file, type:
- gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t
- Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):
- gunzip -c $IN_DIR/Demo1input.gz | wc -l
Let’s analyze the input file
To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:
$NGSADMIX -likes $IN_DIR/Demo1input.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix
For a reference on the parameters that can be used to run NGSadmix, please go to [1]
The output
The analysis performed by NGSadmix produces 4 files:
- Log likelihood of the estimates
A .log file that summarizes the run.
Let’s take a look at the log file to determine the log likelihood of the estimates achieved by NGSadmix which is called the “best like” in the file:
cat $OUT_DIR/Demo1NGSadmix.log
- Estimated allele frequency
- A zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus).
 
- We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command:
 
- zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5
 
- Estimated admixture proportions
- A .qopt file, that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual).
 
- To obtain the estimated admixture proportions for the first 5 individuals, type the following command:
 
- head -n 5 Demo1NGSadmix.qopt
 
Plot the results in R
Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:
Make sure you stand on the output directory
cd $OUT_DIR
Type “R” in the terminal and press enter and paste the following code into R:
# Get ID and pop info for each individual
pop<-scan("poplabel",what="theFuck")
# Read inferred admixture proportions file
q<-read.table("Demo1NGSadmix.qopt")
# Plot them (ordered by population)
ord = order(pop)
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Demo1 Admixture proportions",cex.names=0.75)
The y-axis of the plot show the admixture proportions and the individuals in the samples are plotted in the x-axis.
Each color represents a different ancestral population.
The proportion of each color shows the different admixture of the individuals for each ancestral population.
The plot is sorted by the population of origin of each individual in the sample, and therefore, it shows blocks with prevalence of a certain color, which represents the population to which each individual belongs.
NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast).
Example of NGSadmix with admixed populations
Now that you know how to make input data to NGSadmix, how to run NGSadmix and what the output looks like, let's try to look at a more realistic size dataset. More specifically let's try to run NGSadmix on 50000 sites from the 1000 genomes project from the following populations:
- ASW - HapMap African Americans from SW US - CEU - European individuals - CHB - Han Chinese in Beijing - JPT - Japanese individuals - YRI - Yoruba individuals from Nigeria - MXL - Mexican individuals from LA California 
The input file Demo2input.gz with genotype likelihoods from 100 individuals in .beagle format, and a file with population info are given. Note: please make sure you have created and set up the working directories as indicated in the previous tutorial.
Download Files
Download and copy the files to your input folder (for example, $IN_DIR)
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info mv Demo2input.gz $IN_DIR mv Demo2pop.info $OUT_DIR
make sure you are back in your original folder (not the $OUT_DIR folder)
Take a quick look at the population data
make a summary by cutting the first column, sorting and counting
cut -f 1 -d " " $OUT_DIR/Demo2pop.info | sort | uniq -c
Run NGSadmix
Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21), and set the prefix of the output files to Demo2NGSadmixK3. For a reference on the parameters that can be used to run NGSadmix, please go to [2]
$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3
Plot
Plot the estimated admixture proportions by running the following code in R:
Make sure you stand on the output directory
cd $OUT_DIR
Type “R” in the terminal and press enter and paste the following code into R:
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK3.qopt")
# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
Note that like in the previous example, the order of the individuals in the plot is not the same as in the .qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.
Choose a differnt K
Try to run NGSadmix with K=4 and compare the plots.
$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4
Type “R” in the terminal and press enter and paste the following code into R:
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK4.qopt")
# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=4")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)