NgsAdmixTutorial: Difference between revisions

Revision as of 00:13, 12 July 2019

We will go through a simple and more complex example on how to use NGSadmix with visualization of the data.

Example of NGSadmix - Simple

In our first example, we will infer admixture proportions for low depth NGS data using a small dataset from 30 human samples.

1. Every time you open a new terminal window, set the paths to all required programs and to the data you will use, depending on where you installed them.
Set the path to NGSadmix, for example:
NGSADMIX=~/Software/NGSadmix
Test the link
ls $NGSADMIX
2. Create the directories that will be used for working:
mkdir Demo
cd Demo
mkdir Data
mkdir Results
3. Set the paths to your local directories, for example:
IN_DIR=~/Demo/Data
OUT_DIR=~/Demo/Results
Test the links
ls $IN_DIR
ls $OUT_DIR
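The folder setup in steps 2 and 3 can also be written so that it is safe to repeat in every new terminal. This is only a minimal sketch: `BASE_DIR` is a hypothetical temporary folder used so the example runs anywhere, whereas the tutorial itself uses `~/Demo`.

```shell
# Hypothetical base folder for illustration; the tutorial uses ~/Demo instead
BASE_DIR=$(mktemp -d)
IN_DIR=$BASE_DIR/Demo/Data
OUT_DIR=$BASE_DIR/Demo/Results

# -p creates missing parent folders and does not fail if they already exist
mkdir -p "$IN_DIR" "$OUT_DIR"

# Report whether each folder exists and is writable
for d in "$IN_DIR" "$OUT_DIR"; do
    if [ -d "$d" ] && [ -w "$d" ]; then
        echo "OK: $d"
    else
        echo "Missing or not writable: $d"
    fi
done
```

Using full paths for `IN_DIR` and `OUT_DIR` keeps the later commands working regardless of the folder you run them from.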
4. The Input file
NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.
We will use a very reduced data set:
-10 individuals from each population: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).
-a very reduced genome 30 x 100k random regions across the autosomes
-each individual is sequenced at 2-6X
CEU Europeans (mostly of British ancestry)
JPT East Asian - Japanese individuals
YRI West African - Nigerian Yoruba individuals
5. Population information file.
A file with labels that indicate which population the individuals are sampled from is also provided as the population information.
To create the data files, please go to the following link [1]
6. Download the files to your input folder (for example, $IN_DIR):
wget -P $IN_DIR popgen.dk/software/download/NGSadmix/data/Demo1input.gz
wget -P $IN_DIR popgen.dk/software/download/NGSadmix/data/pop.info
7. Summarize the population information file by cutting the first column (the population label), sorting, and counting:
cut -f 1 -d " " $IN_DIR/pop.info | sort | uniq -c
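To see what this pipeline does, here is a small sketch on a toy file. The file name `toy.info`, the individual IDs, and the two-column layout (population label, then ID) are assumptions made for illustration; only the first column matters for the summary.

```shell
# Toy file in a pop.info-like layout: population label, then an invented ID
TMP_DIR=$(mktemp -d)
printf 'YRI ind1\nCEU ind2\nJPT ind3\nYRI ind4\nCEU ind5\nYRI ind6\n' > "$TMP_DIR/toy.info"

# Keep the first space-separated column, sort it, and count each label
counts=$(cut -f 1 -d " " "$TMP_DIR/toy.info" | sort | uniq -c)
echo "$counts"
```

Each output line gives a count followed by a population label, so you can check that the number of individuals per population matches what you expect.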
8. Now let’s analyze the input file:
  • In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.
  • In order to see the first 10 columns and 10 lines of the input file, type:
gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t
  • Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (as the count includes the header line):
gunzip -c $IN_DIR/Demo1input.gz | wc -l
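The column layout described above also lets you work out how many individuals a file contains. The following is a minimal sketch on a toy file; the file name, marker names, header labels, and values are all invented, and the file is assumed to be tab-separated.

```shell
TMP_DIR=$(mktemp -d)

# Toy beagle-style file: 2 individuals, 3 sites (all values invented)
{
    printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n'
    printf 'chr1_100\t0\t2\t0.90\t0.09\t0.01\t0.33\t0.33\t0.34\n'
    printf 'chr1_250\t1\t3\t0.10\t0.80\t0.10\t0.98\t0.01\t0.01\n'
    printf 'chr1_400\t0\t1\t0.33\t0.33\t0.34\t0.05\t0.15\t0.80\n'
} | gzip > "$TMP_DIR/toy.beagle.gz"

# Number of sites = number of lines minus the header line
n_sites=$(( $(gunzip -c "$TMP_DIR/toy.beagle.gz" | wc -l) - 1 ))

# Number of individuals = (columns minus the 3 leading columns) / 3 GL columns each
n_cols=$(gunzip -c "$TMP_DIR/toy.beagle.gz" | head -n 1 | awk -F '\t' '{print NF}')
n_ind=$(( (n_cols - 3) / 3 ))

echo "sites: $n_sites, individuals: $n_ind"
```

Running the same two calculations on `$IN_DIR/Demo1input.gz` should give the number of loci and the number of samples in the demo data.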
9. To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:
$NGSADMIX -likes $IN_DIR/Demo1input.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix
10. The output:
The analysis performed by NGSadmix produces 4 files, all written with the prefix given to -o (here $OUT_DIR/Demo1NGSadmix):
  • Log likelihood of the estimates
A .log file that summarizes the run.
Take a look at the log file to find the log likelihood of the estimates achieved by NGSadmix, which is called “best like” in the file:
cat $OUT_DIR/Demo1NGSadmix.log
  • Estimated allele frequency
A zipped .fopt file that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus).
To see the estimated allele frequencies of the first 5 SNPs (one per line) in the three assumed ancestral populations, type the following command:
zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5
  • Estimated admixture proportions
A .qopt file that contains an estimate of each individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual).
To obtain the estimated admixture proportions for the first 5 individuals, type the following command:
head -n 5 $OUT_DIR/Demo1NGSadmix.qopt
  • Filtered sites
A .filter file (Demo1NGSadmix.filter): if the filter was used, it will show the sites that were left out.
To see the first lines of the file, type:
head -n 5 $OUT_DIR/Demo1NGSadmix.filter
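Since each line of the .qopt file holds one individual's admixture proportions, the values on a line should add up to approximately 1. A quick sanity check could look like the sketch below, which uses an invented toy file (`toy.qopt`) and an arbitrary tolerance of 0.001; the third line is deliberately wrong to show what a failure looks like.

```shell
TMP_DIR=$(mktemp -d)

# Toy .qopt-style file: one line per individual, one column per ancestral population
printf '0.98 0.01 0.01\n0.10 0.85 0.05\n0.30 0.30 0.30\n' > "$TMP_DIR/toy.qopt"

# Print the line number of every individual whose proportions do not sum to ~1
bad=$(awk '{
    s = 0
    for (i = 1; i <= NF; i++) s += $i
    if (s < 0.999 || s > 1.001) print NR
}' "$TMP_DIR/toy.qopt")

echo "Lines not summing to 1: ${bad:-none}"
```

Pointing the same awk command at your own .qopt file is a cheap way to confirm that the file was written completely before plotting it.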
11. Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:
Type “R” in the terminal and press enter and paste the following code into R:
# Read the population label of each individual
# (adjust the paths if the files are not in R's working directory)
pop<-read.table("pop.info", as.is=T)
# Read inferred admixture proportions file
q<-read.table("Demo1NGSadmix.qopt")
# Plot them (ordered by the population label in the first column)
ord <- order(pop[,1])
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord,1],las=2,ylab="Admixture proportions",cex.names=0.75)
The y-axis of the plot shows the admixture proportions, and the individuals in the sample are plotted along the x-axis.
Each color represents a different ancestral population.
The share of each color in a bar is the estimated proportion of that individual's ancestry from the corresponding ancestral population.
The plot is sorted by the population of origin of each individual, so it shows blocks dominated by a single color, each corresponding to the population to which those individuals belong.
NB: As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (we deliberately analyzed data from only a small part of the genome so that the analysis would run fast). Results from an analysis of data from the entire genome can be seen here [LINK ANDERS]

Example of NGSadmix - Complex

Now that you know what the input data for NGSadmix looks like, how to run NGSadmix, and what the output looks like, let's look at a more realistically sized dataset. More specifically, let's run NGSadmix on data from the 1000 Genomes Project from the following populations:

ASW HapMap African ancestry individuals from SW US
CEU European individuals
CHB Han Chinese in Beijing
JPT Japanese individuals
YRI Yoruba individuals from Nigeria
MXL Mexican individuals from LA California

The input file Demo2input.gz with genotype likelihoods from 100 individuals in .beagle format, and a file with population info are given. Note: please make sure you have created and set up the working directories as indicated in the previous tutorial.

1. Download and copy the files to your input folder (for example, $IN_DIR):
wget -P $IN_DIR popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget -P $IN_DIR popgen.dk/software/download/NGSadmix/data/pop.info
2. Take a quick look at the data by copying the information file to the current folder and making a summary:
Copy the file to the current folder:
cp $IN_DIR/pop.info .
Cut the first column, sort, and count:
cut -f 1 -d " " pop.info | sort | uniq -c
3. Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 CPU (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), setting the seed to 21 (-seed 21), and setting the prefix of the output files to $OUT_DIR/Demo2NGSadmixK3 (-o $OUT_DIR/Demo2NGSadmixK3):
$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3
4. Plot the estimated admixture proportions by running the following code in R:
Type “R” in the terminal, press enter, and paste the following code into R:
# read population labels and estimated admixture proportions
# (adjust the paths if the files are not in R's working directory)
pop<-read.table("pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK3.qopt")
# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
Note that, like in the previous example, the order of the individuals in the plot is not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.
Try to run NGSadmix with K=4 instead.
$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3
$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4
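Because the K=3 and K=4 runs differ only in the value of K and in the output prefix, they can be generated from a loop. The sketch below is a dry run: it only prints the commands, using the same flags as above, so it works even where NGSadmix is not installed. The fallback values for `NGSADMIX`, `IN_DIR`, and `OUT_DIR` are placeholders for illustration.

```shell
# Placeholders so the sketch runs anywhere; in the tutorial these are already set
NGSADMIX=${NGSADMIX:-NGSadmix}
IN_DIR=${IN_DIR:-Data}
OUT_DIR=${OUT_DIR:-Results}

# Build one command per value of K; echo prints it instead of running it
cmds=$(for K in 3 4; do
    echo "$NGSADMIX -likes $IN_DIR/Demo2input.gz -K $K -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK$K"
done)

echo "$cmds"
```

Once the printed commands look right, removing the `echo` and the surrounding quotes inside the loop would run them for real; adding more values after `in` would extend the comparison to other choices of K.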