NgsAdmixTutorial: Difference between revisions

Revision as of 10:58, 5 July 2019

NGSadmix Tutorial

Example NGSadmix - Simple In our first example, we will be using a small dataset of .bam files with low depth NGS data from 30 human samples: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).

Set directories to all required programs and the data you will use depending on where you installed them every time you open a new terminal window

Set path to NGSadmix

NGSADMIX=~/Software/NGSadmix

test the link

ls $NGSADMIX

Set path to the directory with the input files

AA=/../home/albrecht/PhDCourse

test the link

ls $AA To create the data files, please go the the following link: https://github.com/mfumagalli/ngsTools/blob/master/Scripts/data.sh

Create the directories that will be used for working: mkdir Demo cd Demo mkdir Data mkdir Results

Set paths to local directories

IN_DIR=~/Demo/Data OUT_DIR=~/Demo/Results

#Test the links ls $IN_DIR ls $OUT_DIR

NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you. A file with labels that indicate which population the individuals are sampled from is also provided as the population information. Copy the files to your input folder ($IN_DIR): cp xxxx $IN_DIR cp $AA/admixture/data/pop.info $IN_DIR

We need to prepare the population information file by cuttin the first column, sorting and counting: cut -f 1 -d " " $IN_DIR/pop.info | sort | uniq -c

Now let’s analyze the input file. In order to see the first 10 columns and 10 lines of the input file, type: gunzip -c $IN_DIR/Demo1GLs | head -n 10 | cut -fl-10 | column -t

Use this command to count the number of lines. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line): gunzip -c $IN_DIR/Demo1GLs | wc -l

To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command: $NGSadmix -likes $IN_DIR/Demo1GLs.beagle.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix

The output:

The analysis performed by NGSadmix produces 4 files:

Log likelihood of the estimates A .log file that summarizes the run. Let’s take a look at the log file to determine the log likelihood of the estimates achieved by NGSadmix which is called the “best like” in the file: cat Demo1NGSadmix.log

Estimated allele frequency A zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus). We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command: zcat Demo1NGSadmix.fopt.gz | head -n 5

Estimated admixture proportions A .qopt file, that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual). To obtain the estimated admixture proportions for the first 5 individuals, type the following command: head -n 5 Demo1NGSadmix.qopt

DemoNGSadmix.filter: is an empty file???? If the filter is used, it will show the sites that were left out or in… CHECK

Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:

Type “R” in the terminal and press enter. Paste the following code into R:

set up the working directories

Fill up a table with the IDs of the population information for each individual

pop<-read.table("pop.info", as.is=T)

Read inferred admixture proportions

q<-read.table("Demo1NGSadmix.qopt")

Plot them (ordered by population)

ord = order(pop) par(mar=c(7,4,1,1)) barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Admixture proportions",cex.names=0.75)

The y-axis of the plot show the admixture proportions and the individuals in the samples are plotted in the x-axis. Each color represents a different ancestral population. The proportion of each color shows the different admixture of the individuals for each ancestral population. The plot is sorted by the population of origin of each individual in the sample, and therefore, it shows blocks with prevalence of a certain color, which represents the population to which each individual belongs.

NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast). Results from an analysis of data from the entire genome can be seen here.

Example NGSadmix - 1000 Genome Project Data

Now you know how to run NGSadmix and what the output looks like. Let's try to look at a more realistic size dataset. More specifically let's try to run NGSadmix on data from the 1000 genomes project from the following populations. The input file for you for this dataset and a file with population information are given as in the previous example. The input is a file with genotype likelihoods from 100 individuals in .beagle format.

ASW HapMap African ancestry individuals from SW US CEU European individuals CHB Han Chinese in Beijing JPT Japanese individuals YRI Yoruba individuals from Nigeria MXL Mexican individuals from LA California

Set directories to all required programs and the data you will use depending on where you installed them every time you open a new terminal window

Set path to NGSadmix

NGSADMIX=~/Software/NGSadmix

Set path to the directory with the input files

AA=/../home/albrecht/PhDCourse

Set paths to local directories

IN_DIR=~/Demo/Data OUT_DIR=~/Demo/Results

Take a quick look at the data First try to get an overview of the dataset by copying the information file and making a summary using the following:

copy to folder

cp $AA/admixture/data/pop.info .

1. cut first column | sort | count

cut -f 1 -d " " pop.info | sort | uniq -c

Which countries are the samples from and how many samples from each? Now look at the genotype file input.gz. It is in the same format at the file we looked at in the previous example: Use the same approach as in the previous example to find out how many loci there are genotype likelihoods from Run an analysis of the data with NGSadmix Try to start an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21) and set the prefix of the output files to myownoutfilesK3 (-o myownoutfilesK3). $NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3

$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4

Next, plot the estimated admixture proportions by running the following code in R :

1. open R

1. read population labels and estimated admixture proportions

pop<-read.table("pop.info",as.is=T) q<-read.table("myownoutfilesK3.qopt")

1. order according to population

ord<-order(pop[,1]) barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions") text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T) abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

Note that - like in the previous example - the order of the individuals in the plot are not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels. Try to explain what the plot shows (what is on the axes, what do the colors mean and so on) What does the plot suggest in terms of population structure and admixture? Other K values (if you have time) Try to run NGSadmix with K=4 instead. Plot the output (if you have trouble plotting it here is the plot I got: K4 plot. Based on all the results what can you say about the Mexican samples (MXL)?

Example using ANGSD to create input file set directories to all required programs depending on where you installed them, for instance these are my paths: NGSTOOLS=~/Software/ngsTools ANGSD=~/Software/angsd NGSADMIX=~/Software/NGSadmix/NGSadmix

SAMTOOLS=~/Software/samtools-1.4.1/samtools HTSLIB=~/Software/htslib-1.4.1 FASTME=~/Software/fastme-2.1.5/src/fastme MYDIR=~/Demo/Data NGSadmix uses Genotype Likelihoods (GLs) as input, and therefore we are going to calculate the GLs from the .bam files to provide an input file in .beagle format:

Create a file that contains a list with the path to each one of the .bam files: find $BAMFOLDER | grep bam$ > ~/Demo/Data/all.bamfiles

Type in this command to very the content of the created file: cat all.bamfiles

$ANGSD -bam $IN_DIR/all.bamfiles -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/DemoGLs -P 5

wssd

asdf You can see the ID of the first individual by getting the first line of the file you created with all your original bam files in the beginning: head -n1 all.files

Based on that ID, which population does the individual come from? What does this suggest about what column to look for the frequencies for that population in the qopt file? Based on this and the frequency estimates for the first locus that you looked at earlier, what does NGSadmix estimate the allele frequency to be at the first locus in that population?

Plot results[edit] Plot in the order of the input file. admix<-t(as.matrix(read.table("myoutfiles.qopt"))) barplot(admix,col=1:3,space=0,border=NA,xlab="Individuals",ylab="admixture")

Plot using a population label file. pop<-read.table("pop.info",as.is=T) admix<-t(as.matrix(read.table("myoutfiles.qopt"))) admix<-admix[,order(pop[,1])] pop<-pop[order(pop[,1]),] h<-barplot(admix,col=1:3,space=0,border=NA,xlab="Individuals",ylab="admixture") text(tapply(1:nrow(pop),pop[,1],mean),-0.05,unique(pop[,1]),xpd=T)

@@ Line 1: / Line 1: @@
 NGSadmix Tutorial
+Example NGSadmix - Simple
+In our first example, we will be using a small dataset of .bam files with low depth NGS data from 30 human samples: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).
+Set directories to all required programs and the data you will use depending on where you installed them every time you open a new terminal window
+# Set path to NGSadmix
+NGSADMIX=~/Software/NGSadmix
+#test the link
+ls $NGSADMIX
+# Set path to the directory with the input files
+AA=/../home/albrecht/PhDCourse
+#test the link
+ls $AA
+To create the data files, please go the the following link:
+https://github.com/mfumagalli/ngsTools/blob/master/Scripts/data.sh
+Create the directories that will be used for working:
+mkdir Demo
+cd Demo
+mkdir Data
+mkdir Results
+#Set paths to local directories
+IN_DIR=~/Demo/Data
+OUT_DIR=~/Demo/Results
+		#Test the links
+ls $IN_DIR
+ls $OUT_DIR
+NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you. A file with labels that indicate which population the individuals are sampled from is also provided as the population information. Copy the files to your input folder ($IN_DIR):
+cp xxxx $IN_DIR
+cp $AA/admixture/data/pop.info $IN_DIR
+We need to prepare the population information file by cuttin the first column, sorting and counting:
+cut -f 1 -d " " $IN_DIR/pop.info | sort | uniq -c
+Now let’s analyze the input file. In order to see the first 10 columns and 10 lines of the input file, type:
+gunzip -c $IN_DIR/Demo1GLs | head -n 10 | cut -fl-10 | column -t
+Use this command to count the number of lines. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):
+gunzip -c $IN_DIR/Demo1GLs | wc -l
+To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:
+$NGSadmix -likes $IN_DIR/Demo1GLs.beagle.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix
+The output:
+The analysis performed by NGSadmix produces 4 files:
+Log likelihood of the estimates
+A .log file that summarizes the run.
+Let’s take a look at the log file to determine the log likelihood of the estimates achieved by NGSadmix which is called the “best like” in the file:
+cat Demo1NGSadmix.log
+Estimated allele frequency
+A zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus).
+We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command:
+zcat Demo1NGSadmix.fopt.gz | head -n 5
+Estimated admixture proportions
+A .qopt file, that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual).
+To obtain the estimated admixture proportions for the first 5 individuals, type the following command:
+head -n 5 Demo1NGSadmix.qopt
+DemoNGSadmix.filter: is an empty file????
+If the filter is used, it will show the sites that were left out or in…
+CHECK
+Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:
+Type “R” in the terminal and press enter.
+Paste the following code into R:
+#set up the working directories
+# Fill up a table with the IDs of the population information for each individual
+pop<-read.table("pop.info", as.is=T)
+# Read inferred admixture proportions
+q<-read.table("Demo1NGSadmix.qopt")
+# Plot them (ordered by population)
+ord = order(pop)
+par(mar=c(7,4,1,1))
+barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Admixture proportions",cex.names=0.75)
+The y-axis of the plot show the admixture proportions and the individuals in the samples are plotted in the x-axis.
+Each color represents a different ancestral population.
+The proportion of each color shows the different admixture of the individuals for each ancestral population.
+The plot is sorted by the population of origin of each individual in the sample, and therefore, it shows blocks with prevalence of a certain color, which represents the population to which each individual belongs.
+NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast). Results from an analysis of data from the entire genome can be seen here.
+Example NGSadmix - 1000 Genome Project Data
+Now you know how to run NGSadmix and what the output looks like. Let's try to look at a more realistic size dataset. More specifically let's try to run NGSadmix on data from the 1000 genomes project from the following populations. The input file for you for this dataset and a file with population information are given as in the previous example. The input is a file with genotype likelihoods from 100 individuals in .beagle format.
+ASW
+HapMap African ancestry individuals from SW US
+CEU
+European individuals
+CHB
+Han Chinese in Beijing
+JPT
+Japanese individuals
+YRI
+Yoruba individuals from Nigeria
+MXL
+Mexican individuals from LA California
+Set directories to all required programs and the data you will use depending on where you installed them every time you open a new terminal window
+# Set path to NGSadmix
+NGSADMIX=~/Software/NGSadmix
+# Set path to the directory with the input files
+AA=/../home/albrecht/PhDCourse
+#Set paths to local directories
+IN_DIR=~/Demo/Data
+OUT_DIR=~/Demo/Results
+Take a quick look at the data
+First try to get an overview of the dataset by copying the information file and making a summary using the following:
+#copy to folder
+cp $AA/admixture/data/pop.info .
+## cut first column | sort | count
+cut -f 1 -d " " pop.info | sort | uniq -c
+Which countries are the samples from and how many samples from each?
+Now look at the genotype file input.gz. It is in the same format at the file we looked at in the previous example:
+Use the same approach as in the previous example to find out how many loci there are genotype likelihoods from
+Run an analysis of the data with NGSadmix
+Try to start an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21) and set the prefix of the output files to myownoutfilesK3 (-o myownoutfilesK3).
+$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3
+$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4
+Next, plot the estimated admixture proportions by running the following code in R :
+## open R
+## read population labels and estimated admixture proportions
+pop<-read.table("pop.info",as.is=T)
+q<-read.table("myownoutfilesK3.qopt")
+## order according to population
+ord<-order(pop[,1])
+barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
+text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
+abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
+Note that - like in the previous example - the order of the individuals in the plot are not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.
+Try to explain what the plot shows (what is on the axes, what do the colors mean and so on)
+What does the plot suggest in terms of population structure and admixture?
+Other K values (if you have time)
+Try to run NGSadmix with K=4 instead.
+Plot the output (if you have trouble plotting it here is the plot I got: K4 plot.
+Based on all the results what can you say about the Mexican samples (MXL)?
+Example using ANGSD to create input file
+set directories to all required programs depending on where you installed them, for instance these are my paths:
+NGSTOOLS=~/Software/ngsTools
+ANGSD=~/Software/angsd
+NGSADMIX=~/Software/NGSadmix/NGSadmix
+SAMTOOLS=~/Software/samtools-1.4.1/samtools
+HTSLIB=~/Software/htslib-1.4.1
+FASTME=~/Software/fastme-2.1.5/src/fastme
+MYDIR=~/Demo/Data
+NGSadmix uses Genotype Likelihoods (GLs) as input, and therefore we are going to calculate the GLs from the .bam files to provide an input file in .beagle format:
+Create a file that contains a list with the path to each one of the .bam files:
+find $BAMFOLDER | grep bam$ > ~/Demo/Data/all.bamfiles
+Type in this command to very the content of the created file:
+cat all.bamfiles
+$ANGSD -bam $IN_DIR/all.bamfiles -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/DemoGLs -P 5
+wssd
+asdf
+You can see the ID of the first individual by getting the first line of the file you created with all your original bam files in the beginning:
+head -n1 all.files
+Based on that ID, which population does the individual come from?
+What does this suggest about what column to look for the frequencies for that population in the qopt file?
+Based on this and the frequency estimates for the first locus that you looked at earlier, what does NGSadmix estimate the allele frequency to be at the first locus in that population?
+Plot results[edit]
+Plot in the order of the input file.
+admix<-t(as.matrix(read.table("myoutfiles.qopt")))
+barplot(admix,col=1:3,space=0,border=NA,xlab="Individuals",ylab="admixture")
+Plot using a population label file.
+pop<-read.table("pop.info",as.is=T)
+admix<-t(as.matrix(read.table("myoutfiles.qopt")))
+admix<-admix[,order(pop[,1])]
+pop<-pop[order(pop[,1]),]
+h<-barplot(admix,col=1:3,space=0,border=NA,xlab="Individuals",ylab="admixture")
+text(tapply(1:nrow(pop),pop[,1],mean),-0.05,unique(pop[,1]),xpd=T)

NgsAdmixTutorial: Difference between revisions

Revision as of 10:58, 5 July 2019

Navigation menu