A. Use of NGSadmix to infer admixture proportions for numerous individuals

In this exercise we will try to use NGSadmix to analyze two different NGS datasets.

Login to the server and set paths

First open a terminal and login to the server. Next - before running any analyses - you need to set paths to the programs and the data you will use. Do this by pasting the following into your terminal window:

# NB this must be done every time you open a new terminal

ThePath=/gdc_home2/groups/BAG18/3_wednesday

# Set path to ANGSD program
ANGSD=$ThePath/prog/angsd/angsd

# Set path to NGSadmix
NGSadmix=$ThePath/prog/angsd/misc/NGSadmix

# Set path to a bam file list with several bam files
BAMFOLDER=$ThePath/sfs/data/smallerbams/

First small example

We will first try to run an NGSadmix analysis of a small dataset consisting of bam files with low depth NGS data from 30 samples: 10 from Nigeria, West Africa (YRI), 10 from Japan (JPT) and 10 with European ancestry (CEU).

CEU Europeans (mostly of British ancestry)
JPT East Asian - Japanese individuals
YRI West African - Nigerian Yoruba individuals

Due to computation We will use a very reduced data set:

Aims:

Make input data using ANGSD

The input to NGSadmix is genotype likelihoods (GLs). Therefore the first step of running an NGSadmix analysis if all you have are bams files is to calculate GLs. So let's start bying doing that. First make a file that contains the paths of all the 30 bam files:

find $BAMFOLDER |  grep bam$ > all.files

To see the content of the file you made type:

cat all.files

Now calculate GLs from all the BAM files using ANGSD by running the following command in the terminal:

$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out all -P 5

NOTE that this will take a bit of time to run (a few minutes). While waiting, let's try to understand the above command and get some info about the data. Here is an explanation of the options used in the command:

 -bam all.files : tells ANGSD that the bam files to calculate GL from are listed in the file all.files
 -GL 2 : tells ANGSD to use the GATK genotype likelihood model
 -doMajorMinor 1 : tells ANGSD to infer the major and minor alleles
 -doMaf 1 : tells ANGSD to calculate minor allele frequencies (needed by two of the options below: -SNP_pval and -minMaf)
 -SNP_pval 2e-6 : tells ANGSD to use a p-value threshold of 2e-6 for calling SNPs
 -minMapQ 30 : tells ANGSD what to require as minimum mapping quality (quality filter)
 -minQ 20 : tells ANGSD what to require as minimum base quality (quality filter)
 -minInd 25 : tells ANGSD to only output GLs for loci with data for at least 25 individuals for each site (quality filter)
 -minMaf 0.05 : tells ANGSD to only output GLS for loci with a minimum minor allele frequency of 0.05 (quality filter)
 -doGlf 2 : tells ANGSD to write the final genotype likelihoods into a file in beagle format
 -out all : tells ANGSD to call all output files it generate "all" followed by the appropriate suffix e.g. "all.beagle.gz"
 -P 5 : tells ANDSG use 5 threads (up to 500% CPU)
 

Explore the input data

Now let's have a look at the GL file that you have created with ANGSD. It is a "beagle format" file called all.beagle.gz - and will be the input file to NGSadmix. The first line in this file is a header line and after that it contains a line for each locus with GLs. By using the unix command wc we can count the number of lines in the file:

gunzip -c all.beagle.gz | wc -l

Next, to get an idea of what the GL file contains try from the command line to print the first 9 columns of the first 7 lines of the file:

gunzip -c all.beagle.gz | head -n 7 | cut -f1-9 | column -t

In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individuals. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software it does not mean that they are genotype probabilities.

Run an analysis of the data with NGSadmix

Now you know how the input looks. Next, let's try to perform an NGSadmix analyses of the GLs typing assuming the number of ancestral populations, K, is 3:

$NGSadmix -likes all.beagle.gz -K 3 -minMaf 0.05 -seed 1 -o all

Explore the output

The output from the analysis you just ran is three files:

Let's have a look at them one at a time. First, check the log file by typing

cat all.log

Next, check the first line of the fopt file by typing:

zcat all.fopt.gz | head -n1

Finally, check the first line of the qopt file and thus the estimated admixture proportions for the first individuals by typing:

head -n1 all.qopt

You can see the ID of the first individual by getting the first line of the file you created with all your original bam files in the beginning:

head -n1 all.files

Plot the admixture proportion estimates

Finally, try to make a simple plot the estimated admixture proportions for all the individuals by opening the statistical program called R (which you do by typing "R" in the terminal and pressing enter) and then copy pasting the following code:

## open R
# Get ID and pop info for each individual
s<-strsplit(basename(scan("all.files",what="theFuck")),"\\.")
pop<-sapply(s,function(x){paste(x[5],x[1],sep="_")})

# Read inferred admixture proportions
q<-read.table("all.qopt")

# Plot them (ordered by population)
ord = order(pop)
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Admixture proportions",cex.names=0.75)

Note that the order of the individuals in the plot are not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to the population they are from.

NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast). Results from an analyses of data from the entire genome can be seen here.

More realistic example

Now you know how to make input data to NGSadmix, how to run NGSadmix and what the output looks like. Let's try to look at a more realistic size dataset. More specifically let's try to run NGSadmix on data from the 1000 genomes project from the following populations:

ASW HapMap African ancestry individuals from SW US
CEU European individuals
CHB Han Chinese in Beijing
JPT Japanese individuals
YRI Yoruba individuals from Nigeria
MXL Mexican individuals from LA California

. data: Genotype likelhoods in beagle format for the first 50k SNPs

To save time we have already made the input file for you for this dataset and a file with population info.

#A file with genotype likelihoods from 100 individuals in beagle format: path:
$ThePath/admixture/data/input.gz

##A file with labels that indicate which population they are sampled from:
$ThePath/admixture/data/pop.info

Take a quick look at the data

First try to get an overview of the dataset by copying the information file and making a summary using the following:

#copy to folder
cp $ThePath/admixture/data/pop.info .

## cut first column | sort | count
cut -f 1 -d " " pop.info | sort | uniq -c

Now look at the genotype file input.gz. It is in the same format at the file we looked at in the previous example:

Run an analysis of the data with NGSadmix

Try to start an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21) and set the prefix of the output files to myownoutfilesK3 (-o myownoutfilesK3).

 $NGSadmix -likes $ThePath/admixture/data/input.gz -K 3 -P 4 -minMaf 0.05 -seed 21 -o myownoutfilesK3
 

Next, plot the estimated admixture proportions by running the following code in R :

## open R

## read population labels and estimated admixture proportions
pop<-read.table("pop.info",as.is=T)
q<-read.table("myownoutfilesK3.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

Note that - like in the previous example - the order of the individuals in the plot are not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.

Other K values (if you have time)

PCA with admixture aware priors

Let's try to perform PCA analysis on the same 1000 genotype genotype likelihoods that you performed admixture analysis on yesterday. First let set the path to program and the input file

# NB this must be done every time you open a new terminal
ThePath=/gdc_home2/groups/BAG18/3_wednesday



## beagle genotype likelihood file
GL1000Genomes=$ThePath/admixture/data/input.gz

## copy population information file to current folder
cp $ThePath/admixture/data/pop.info .


## load the python module
## PCAngsd
source /usr/local/anaconda/bin/activate PCangsd



pcangsd.py -beagle $GL1000Genomes -o input -n 100

The program estimates the covariance matrix that can then be used for PCA. look at the output from the program

Plot the results in R

#open R
pop<-read.table("pop.info")

C <- as.matrix(read.table("input.cov"))
 e <- eigen(C)
pdf("PCAngsd.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R

view the results with


evince PCAngsd.pdf

Compare with the estimate admixture proportions

plot
plot

Try the same analysis but without estimating individual allele frequencies. This is the same as using the first iteration of the algorithm

pcangsd.py -beagle $GL1000Genomes -o input2 -iter 0 -n 100

Plot the results in R

#open R
pop<-read.table("pop.info")

C <- as.matrix(read.table("input2.cov"))
 e <- eigen(C)
pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R

View plot:

evince PCAngsd2.pdf

Let try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

 pcangsd.py -beagle $GL1000Genomes -o input -n 100 -admix -admix_alpha 50

Plot the results in R

#open R
pop<-read.table("pop.info",as.is=T)
q<-read.table("input.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R

Bonus Exercise: Inbreeding in the admixed individuals

Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency

pcangsd.py -beagle $GL1000Genomes -o IB0 -inbreed 2 -n 100 -iter 0

join names and results, sort the values and look at the results

paste pop.info IB0.inbreed | LC_ALL=C sort -k3g
The third column is an estimate of the inbreeding coefficient (allowing for negative)

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

pcangsd.py -beagle $GL1000Genomes -o IB -inbreed 2 -n 100

join names and results, sort the values and look at the results

paste pop.info IB.inbreed | LC_ALL=C sort -k3g

PCangsd and selection

For very resent selection we can look within closely related individuals for example with in Europeans

data:

CEU European (mostly British)
GBR Great Britain
IBS Iberian/Spain
TSI Italien

First lets set the paths


## copy positions and sample information
cp $ThePath/PCangsd/data/eu1000g.sample.Info .

#set pa
EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz
wc eu1000g.sample.Info
N=424 #one line for header

Run PCangsd with to estimate the covariance matrix while jointly estimating the individuals allele frequencies

pcangsd.py -beagle $EU1000 -o EUsmall -n $N -threads 4

Plot the results in R

## R
 cov <- as.matrix(read.table("EUsmall.cov"))

 e<-eigen(cov)
 ID<-read.table("eu1000g.sample.Info",head=T)
 plot(e$vectors[,1:2],col=ID$POP)

 legend("topleft",fill=1:4,levels(ID$POP))

Since the European individuals in 1000G are not simple homogeneous disjoint populations it is hard to use PBS/FST or similar statistics to infer selection based on populating differences. However, PCA offers a good description of the differences between individuals which out having the define disjoint groups.

Now let try to use the PC to infer selection along the genome based on the PCA

pcangsd.py -beagle $EU1000 -o EUsmall -n $N -selection 1 -sites_save

plot the results

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
  qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in seleciton statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make manhatten plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhatten plot")

## zoom into region
 w<-range(which(pval<1e-7)) + c(-100,100)
 keep<-w[1]:w[2]
 plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
 p$pos[which.max(s)]

see if you can make sense of the top hit based on the genome.