Inference of admixture and population structure

A. Use of NGSadmix to infer admixture proportions for numerous individuals

Login to the server and set paths
First small example
More realistic example

Bonus Exercise: Inbreeding in the admixed individuals

A. Use of NGSadmix to infer admixture proportions for numerous individuals

In this exercise we will try to use NGSadmix to analyze two different NGS datasets.

Login to the server and set paths

First open a terminal and login to the server. Next - before running any analyses - you need to set paths to the programs and the data you will use. Do this by pasting the following into your terminal window:

# NB this must be done every time you open a new terminal

ThePath=/gdc_home2/groups/BAG18/3_wednesday

# Set path to ANGSD program
ANGSD=$ThePath/prog/angsd/angsd

# Set path to NGSadmix
NGSadmix=$ThePath/prog/angsd/misc/NGSadmix

# Set path to a bam file list with several bam files
BAMFOLDER=$ThePath/sfs/data/smallerbams/

First small example

We will first try to run an NGSadmix analysis of a small dataset consisting of bam files with low depth NGS data from 30 samples: 10 from Nigeria, West Africa (YRI), 10 from Japan (JPT) and 10 with European ancestry (CEU).

CEU	Europeans (mostly of British ancestry)
JPT	East Asian - Japanese individuals
YRI	West African - Nigerian Yoruba individuals

Due to computation We will use a very reduced data set:

Input data: bam files
10 individuals from each population
a very reduced genome 30 x 100k random regions across the autosomes
Each individual is sequenced at 2-6X

Aims:

to Generate Genotype likelihood files in the beagle format
To infer admixture proportions for low depth sequencing data

Make input data using ANGSD

The input to NGSadmix is genotype likelihoods (GLs). Therefore the first step of running an NGSadmix analysis if all you have are bams files is to calculate GLs. So let's start bying doing that. First make a file that contains the paths of all the 30 bam files:

find $BAMFOLDER |  grep bam$ > all.files

To see the content of the file you made type:

cat all.files

Now calculate GLs from all the BAM files using ANGSD by running the following command in the terminal:

$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out all -P 5

NOTE that this will take a bit of time to run (a few minutes). While waiting, let's try to understand the above command and get some info about the data. Here is an explanation of the options used in the command:

 -bam all.files : tells ANGSD that the bam files to calculate GL from are listed in the file all.files
 -GL 2 : tells ANGSD to use the GATK genotype likelihood model
 -doMajorMinor 1 : tells ANGSD to infer the major and minor alleles
 -doMaf 1 : tells ANGSD to calculate minor allele frequencies (needed by two of the options below: -SNP_pval and -minMaf)
 -SNP_pval 2e-6 : tells ANGSD to use a p-value threshold of 2e-6 for calling SNPs
 -minMapQ 30 : tells ANGSD what to require as minimum mapping quality (quality filter)
 -minQ 20 : tells ANGSD what to require as minimum base quality (quality filter)
 -minInd 25 : tells ANGSD to only output GLs for loci with data for at least 25 individuals for each site (quality filter)
 -minMaf 0.05 : tells ANGSD to only output GLS for loci with a minimum minor allele frequency of 0.05 (quality filter)
 -doGlf 2 : tells ANGSD to write the final genotype likelihoods into a file in beagle format
 -out all : tells ANGSD to call all output files it generate "all" followed by the appropriate suffix e.g. "all.beagle.gz"
 -P 5 : tells ANDSG use 5 threads (up to 500% CPU)

Explore the input data

Now let's have a look at the GL file that you have created with ANGSD. It is a "beagle format" file called all.beagle.gz - and will be the input file to NGSadmix. The first line in this file is a header line and after that it contains a line for each locus with GLs. By using the unix command wc we can count the number of lines in the file:

gunzip -c all.beagle.gz | wc -l

Use this to find out how many loci there are GLs for in the data set?

Next, to get an idea of what the GL file contains try from the command line to print the first 9 columns of the first 7 lines of the file:

gunzip -c all.beagle.gz | head -n 7 | cut -f1-9 | column -t

In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individuals. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software it does not mean that they are genotype probabilities.

Based on this, what is the most likely genotype of Ind0 in the first locus and the second locus?

Run an analysis of the data with NGSadmix

Now you know how the input looks. Next, let's try to perform an NGSadmix analyses of the GLs typing assuming the number of ancestral populations, K, is 3:

$NGSadmix -likes all.beagle.gz -K 3 -minMaf 0.05 -seed 1 -o all

While waiting for the analysis to run then make sure you understand the command. If you are in doubt seek help here. Here you can also see what other options you have when you run an NGSadmix analyses.

Explore the output

The output from the analysis you just ran is three files:

all.beagle.gz.log (a "log file" that summarizes the analysis run)
all.beagle.gz.fopt.gz (an "fopt file", which has a line for each locus that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations)
all.beagle.gz.qopt.gz (a "qopt file", which has a line for each individual that contains anestimate of the individual's ancestry proportion from each of the three assumed ancestral populations).

Let's have a look at them one at a time. First, check the log file by typing

cat all.log

What is the log likelihood of the estimates achieved by NGSadmix (called "best like" in the log file)?

Next, check the first line of the fopt file by typing:

zcat all.fopt.gz | head -n1

Based on this: what is the estimated allele frequency of the first SNP in three assumed ancestral populations?

Finally, check the first line of the qopt file and thus the estimated admixture proportions for the first individuals by typing:

head -n1 all.qopt

Based on this: does the individual look admixed?

You can see the ID of the first individual by getting the first line of the file you created with all your original bam files in the beginning:

head -n1 all.files

Based on that ID, which population does the individual come from?
What does this suggest about what column to look for the frequencies for that population in the qopt file?
Based on this and the frequency estimates for the first locus that you looked at earlier, what does NGSadmix estimate the allele frequency to be at the first locus in that population?

Plot the admixture proportion estimates

Finally, try to make a simple plot the estimated admixture proportions for all the individuals by opening the statistical program called R (which you do by typing "R" in the terminal and pressing enter) and then copy pasting the following code:

## open R
# Get ID and pop info for each individual
s<-strsplit(basename(scan("all.files",what="theFuck")),"\\.")
pop<-sapply(s,function(x){paste(x[5],x[1],sep="_")})

# Read inferred admixture proportions
q<-read.table("all.qopt")

# Plot them (ordered by population)
ord = order(pop)
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Admixture proportions",cex.names=0.75)

Note that the order of the individuals in the plot are not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to the population they are from.

Try to explain what the plot shows (what is on the axes, what do the colors mean and so on)
What does the plot suggest about whether the individuals are admixed?

NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast). Results from an analyses of data from the entire genome can be seen here.

What does that suggest about whether the individuals are admixed?

More realistic example

Now you know how to make input data to NGSadmix, how to run NGSadmix and what the output looks like. Let's try to look at a more realistic size dataset. More specifically let's try to run NGSadmix on data from the 1000 genomes project from the following populations:

ASW	HapMap African ancestry individuals from SW US
CEU	European individuals
CHB	Han Chinese in Beijing
JPT	Japanese individuals
YRI	Yoruba individuals from Nigeria
MXL	Mexican individuals from LA California

. data: Genotype likelhoods in beagle format for the first 50k SNPs

To save time we have already made the input file for you for this dataset and a file with population info.

#A file with genotype likelihoods from 100 individuals in beagle format: path:
$ThePath/admixture/data/input.gz

##A file with labels that indicate which population they are sampled from:
$ThePath/admixture/data/pop.info

Take a quick look at the data

First try to get an overview of the dataset by copying the information file and making a summary using the following:

#copy to folder
cp $ThePath/admixture/data/pop.info .

## cut first column | sort | count
cut -f 1 -d " " pop.info | sort | uniq -c

Which countries are the samples from and how many samples from each?

Now look at the genotype file input.gz. It is in the same format at the file we looked at in the previous example:

Use the same approach as in the previous example to find out how many loci there are genotype likelihoods from

Run an analysis of the data with NGSadmix

Try to start an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21) and set the prefix of the output files to myownoutfilesK3 (-o myownoutfilesK3).

 $NGSadmix -likes $ThePath/admixture/data/input.gz -K 3 -P 4 -minMaf 0.05 -seed 21 -o myownoutfilesK3

Next, plot the estimated admixture proportions by running the following code in R :

## open R

## read population labels and estimated admixture proportions
pop<-read.table("pop.info",as.is=T)
q<-read.table("myownoutfilesK3.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

Note that - like in the previous example - the order of the individuals in the plot are not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.

Try to explain what the plot shows (what is on the axes, what do the colors mean and so on)
What does the plot suggest in terms of population structure and admixture?

Other K values (if you have time)

Try to run NGSadmix with K=4 instead.
Plot the output (if you have trouble plotting it here is the plot I got: K4 plot.
Based on all the results what can you say about the Mexican samples (MXL)?

PCA with admixture aware priors

Let's try to perform PCA analysis on the same 1000 genotype genotype likelihoods that you performed admixture analysis on yesterday. First let set the path to program and the input file

# NB this must be done every time you open a new terminal
ThePath=/gdc_home2/groups/BAG18/3_wednesday



## beagle genotype likelihood file
GL1000Genomes=$ThePath/admixture/data/input.gz

## copy population information file to current folder
cp $ThePath/admixture/data/pop.info .


## load the python module
## PCAngsd
source /usr/local/anaconda/bin/activate PCangsd

What were the populations included? And how many sites?

pcangsd.py -beagle $GL1000Genomes -o input -n 100

The program estimates the covariance matrix that can then be used for PCA. look at the output from the program

How many significant PCA (see MAP test in output)?

Plot the results in R

#open R
pop<-read.table("pop.info")

C <- as.matrix(read.table("input.cov"))
 e <- eigen(C)
pdf("PCAngsd.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R

view the results with


evince PCAngsd.pdf

Compare with the estimate admixture proportions

plot

In the PCA plot can you identify the Mexicans with only European ancestry?
What about the African American with East Asian ancestry?
Based on the PCA would you have the same conclusion as the admixture proportions?

Try the same analysis but without estimating individual allele frequencies. This is the same as using the first iteration of the algorithm

pcangsd.py -beagle $GL1000Genomes -o input2 -iter 0 -n 100

Plot the results in R

#open R
pop<-read.table("pop.info")

C <- as.matrix(read.table("input2.cov"))
 e <- eigen(C)
pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R

View plot:

evince PCAngsd2.pdf

Do you see any difference?
Would any of your conclusions change? (compared to the previous PCA plot)

Let try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

 pcangsd.py -beagle $GL1000Genomes -o input -n 100 -admix -admix_alpha 50

Plot the results in R

#open R
pop<-read.table("pop.info",as.is=T)
q<-read.table("input.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R

how does this compare to the NGSadmix analysis?

Bonus Exercise: Inbreeding in the admixed individuals

Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency

pcangsd.py -beagle $GL1000Genomes -o IB0 -inbreed 2 -n 100 -iter 0

join names and results, sort the values and look at the results

paste pop.info IB0.inbreed | LC_ALL=C sort -k3g

The third column is an estimate of the inbreeding coefficient (allowing for negative)

Does any of the individuals look more inbreed than an offspring of a pair of first cousins ?
how do you interpret negative values?
The results will be affected by population structure - Why?
see any pattern of which individuals have low (negative) and high inbreeding coefficients? - can you explain the pattern?

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

pcangsd.py -beagle $GL1000Genomes -o IB -inbreed 2 -n 100

join names and results, sort the values and look at the results

paste pop.info IB.inbreed | LC_ALL=C sort -k3g

Does any of the individual look inbreed?

PCangsd and selection

For very resent selection we can look within closely related individuals for example with in Europeans

data:

Genotype likelihoods in Beagle format
~150k random SNPs with maf > 5%
Four EU populations with ~100 individuals in each

CEU	European (mostly British)
GBR	Great Britain
IBS	Iberian/Spain
TSI	Italien

First lets set the paths


## copy positions and sample information
cp $ThePath/PCangsd/data/eu1000g.sample.Info .

#set pa
EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz
wc eu1000g.sample.Info
N=424 #one line for header

Run PCangsd with to estimate the covariance matrix while jointly estimating the individuals allele frequencies

pcangsd.py -beagle $EU1000 -o EUsmall -n $N -threads 4

Plot the results in R

## R
 cov <- as.matrix(read.table("EUsmall.cov"))

 e<-eigen(cov)
 ID<-read.table("eu1000g.sample.Info",head=T)
 plot(e$vectors[,1:2],col=ID$POP)

 legend("topleft",fill=1:4,levels(ID$POP))

Does the plot look like you expected?

Since the European individuals in 1000G are not simple homogeneous disjoint populations it is hard to use PBS/FST or similar statistics to infer selection based on populating differences. However, PCA offers a good description of the differences between individuals which out having the define disjoint groups.

Now let try to use the PC to infer selection along the genome based on the PCA

pcangsd.py -beagle $EU1000 -o EUsmall -n $N -selection 1 -sites_save

plot the results

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
  qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in seleciton statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make manhatten plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhatten plot")

## zoom into region
 w<-range(which(pval<1e-7)) + c(-100,100)
 keep<-w[1]:w[2]
 plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
 p$pos[which.max(s)]

see if you can make sense of the top hit based on the genome.

Look in UCSC browser
Choose human GRCh38/hg38
search for the position of the top hit and identify the genes at that loci

[ Front | MyPhdCourse | Index ]

Updated February 14, 2018

Inference of admixture and population structure

Popgen.dk Anders Software ANGSD About

A. Use of NGSadmix to infer admixture proportions for numerous individuals

Login to the server and set paths

First small example

Make input data using ANGSD

Explore the input data

Run an analysis of the data with NGSadmix

Explore the output

Plot the admixture proportion estimates

More realistic example

Take a quick look at the data

Run an analysis of the data with NGSadmix

Other K values (if you have time)

PCA with admixture aware priors

Bonus Exercise: Inbreeding in the admixed individuals

PCangsd and selection