PCAngsdTutorial: Difference between revisions

From software
Jump to navigation Jump to search
 
(50 intermediate revisions by 2 users not shown)
Line 6: Line 6:


<code>
<code>
PCANGSD=~/Software/pcangsd
PCANGSD=~/Software/pcangsd/pcangsd.py
</code>
</code>


Line 20: Line 20:
<code>mkdir Demo</code>
<code>mkdir Demo</code>


<code>cd Demo</code>


<code>mkdir Data</code>
<code>mkdir Demo/Data</code>


<code>mkdir Results</code>
<code>mkdir Demo/Results</code>


==== Set the paths to your local directories ====
==== Set the paths to your local directories ====
Line 48: Line 47:
The population information file is also provided.
The population information file is also provided.


Download the files and move them to your input folder (for example, $IN_DIR):
To download the files and move them to your input folder (for example, $IN_DIR):
 


'''*ANDERS*'''


<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>
<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>
Line 59: Line 58:


<code>mv Demo1pop.info $IN_DIR</code>
<code>mv Demo1pop.info $IN_DIR</code>
'''*ANDERS*'''


==== View the genotype likelihood beagle file ====
==== View the genotype likelihood beagle file ====
Line 90: Line 87:
==== Estimating Individual Allele Frequencies ====
==== Estimating Individual Allele Frequencies ====


<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>
<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>


Plot the results in R
Plot the results in R
Line 96: Line 93:
<pre>
<pre>
#open R
#open R
pop<-read.table("pop.info")
pop<-read.table("Demo/Data/Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
C <- as.matrix(read.table("Demo/Results/Demo1PCANGSD_1.cov"))
e <- eigen(C)
e <- eigen(C)


Line 116: Line 113:
This is the same as using the first iteration of the algorithm.
This is the same as using the first iteration of the algorithm.


<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>
<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>


Plot the results in R
Plot the results in R
Line 122: Line 119:
<pre>
<pre>
#open R
#open R
pop<-read.table("pop.info")
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)
e <- eigen(C)


Line 137: Line 134:
<code>evince PCAngsd2.pdf</code>
<code>evince PCAngsd2.pdf</code>


==== Admixture based on 1st two PC ====
==== Admixture based on two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).


<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>
<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -e 2 -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>


Plot the results in R
Plot the results in R
Line 146: Line 143:
<pre>
<pre>
#open R
#open R
pop<-read.table("pop.info",as.is=T)
##Requires previous installation of the library RcppCNPy
 
library(RcppCNPy) # Numpy library for R


###########CHECK NAME OF FILE###################
pop<-read.table("Demo/Data/Demo1pop.info",as.is=T)
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")
 
q <- npyLoad("Demo/Results/Demo2PCANGSD_3.admix.Q.npy")  


## order according to population
## order according to population
Line 163: Line 163:
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.


<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>
<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>


Join names and results, sort the values and look at the results
Join names and results, sort the values and look at the results


<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>
<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>


The third column is an estimate of the inbreeding coefficient (allowing for negative)
The third column is an estimate of the inbreeding coefficient (allowing for negative)
Line 175: Line 175:
Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA
Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA


<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>
<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>


Join names and results, sort the values and look at the results
Join names and results, sort the values and look at the results


<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>
<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>


=== Demo 2: Selection ===
=== Demo 2: Selection ===
Line 185: Line 185:
For very resent selection we can look within closely related individuals, for example within Europeans.
For very resent selection we can look within closely related individuals, for example within Europeans.


The objective is to use PCangsd to estimate the covariance matrix while jointly estimating the individuals allele frequencies.
The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.


Data file:
Data file:
Line 212: Line 212:
:</table>
:</table>


==== Download the beagle genotype likelihood input file ====
==== Download the input and sample information files ====


There is a file with the positions and sample information.
A file with the positions and sample information, and a beagle file are provided:
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)
 
And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)


Download the files and move them to your input folder (for example, $IN_DIR):
Download the files and move them to your input folder (for example, $IN_DIR):
Line 224: Line 221:


<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>
<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.small.beagle.gz)


<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>
<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.sample.Info)


<code>mv Demo2input.gz $IN_DIR</code>
<code>mv Demo2input.gz $IN_DIR</code>


<code>mv Demo2sample.info $IN_DIR</code>
<code>mv Demo2sample.info $OUT_DIR</code>


'''*ANDERS*'''
'''*ANDERS*'''


<pre>
==== Run PCAngsd ====
wc Demo2sample.info
N=424 #one line for header


#run PCANGSD
The objective is to show the differences among individuals.
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>


<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>


Plot the results in R
==== Plot the results in R ====


<pre>
<pre>
## R
## Open R
  cov <- as.matrix(read.table("EUsmall.cov"))
 
  cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))


  e<-eigen(cov)
  e<-eigen(cov)
  ID<-read.table("Demo2sample.info",head=T)
  ID<-read.table("Demo2sample.info",head=T)
pdf("PCAngsdDemo2_1.pdf")
  plot(e$vectors[,1:2],col=ID$POP)
  plot(e$vectors[,1:2],col=ID$POP)


Line 256: Line 255:
</pre>
</pre>


Since the European individuals in 1000G are not simple homogeneous disjoint populations it is hard to use PBS/FST or similar statistics to infer selection based on populating differences. However, PCA offers a good description of the differences between individuals which out having the define disjoint groups.
Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.
 
=== Infer selection along the genome ===


Now let's try to use the PC to infer selection along the genome based on the PCA
Now let's try to use the PC to infer selection along the genome based on the PCA


<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>
<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>




plot the results
==== Plot the results ====


<pre>
<pre>
library(RcppCNPy) # Numpy library for R
## function for QQplot
## function for QQplot
qqchi<-function(x,...){
qqchi<-function(x,...){
Line 274: Line 278:


### read in seleciton statistics (chi2 distributed)
### read in seleciton statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
s<-npyLoad("Demo2PCANGSD_2.selection.npy")
 
## make QQ plot to QC the test statistics
## make QQ plot to QC the test statistics
qqchi(s)
qqchi(s)
Line 282: Line 287:


## read positions (hg38)
## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")
p<-read.table("Demo2PCANGSD_2.sites",colC=c("factor","integer"),sep="_")


names(p)<-c("chr","pos")
names(p)<-c("chr","pos")


## make manhatten plot
## make manhatten plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhatten plot")
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")
 


## zoom into region
## zoom into region

Latest revision as of 17:10, 29 May 2020

We will go through some examples on how to use PCAngsd with visualization of the data.

Set the path to PCAngsd

Every time you open a new terminal window, set the paths to the program and the input file.

PCANGSD=~/Software/pcangsd/pcangsd.py

Test the link

ls $PCAngsd

Create directories

Create the directories that will be used for working:

mkdir Demo


mkdir Demo/Data

mkdir Demo/Results

Set the paths to your local directories

IN_DIR=Demo/Data

OUT_DIR=Demo/Results

Test the links

ls $IN_DIR

ls $OUT_DIR

Demo 1: Allele Frequencies

This example will perform a PCA analysis on 1000 genotype likelihoods.

Download the input and population information files

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):


wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz

wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info

mv Demo1input.gz $IN_DIR

mv Demo1pop.info $IN_DIR

View the genotype likelihood beagle file

  • In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.
  • In order to see the first 10 columns and 10 lines of the input file, type:
gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t
  • Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):
gunzip -c $IN_DIR/Demo1input.gz | wc -l

View population information file

To view a summary of the population information file, cut the first column, sort and count:

cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c

Lets make a population label file and place it in the output directory

cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel

Run PCAngsd

The program estimates the covariance matrix that can be used for PCA.

Estimating Individual Allele Frequencies

python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1

Plot the results in R

#open R
pop<-read.table("Demo/Data/Demo1pop.info")
C <- as.matrix(read.table("Demo/Results/Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R

To view the plot, type:

evince PCAngsd1.pdf

Without Estimating Individual Allele Frequencies

Try the same analysis but without estimating individual allele frequencies. This is the same as using the first iteration of the algorithm.

python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0

Plot the results in R

#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R

To view the plot, type:

evince PCAngsd2.pdf

Admixture based on two PC

Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

python $PCANGSD -beagle $IN_DIR/Demo1input.gz -e 2 -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50

Plot the results in R

#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo/Data/Demo1pop.info",as.is=T)

q <- npyLoad("Demo/Results/Demo2PCANGSD_3.admix.Q.npy") 

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R

Inbreeding in the admixed individuals

Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0

Join names and results, sort the values and look at the results

paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g

The third column is an estimate of the inbreeding coefficient (allowing for negative)

Inbreeding with individual allele frequencies

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2

Join names and results, sort the values and look at the results

paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g

Demo 2: Selection

For very resent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

  • Genotype likelihoods in Beagle format
  • ~150k random SNPs with maf > 5%
  • Four EU populations with ~100 individuals in each
CEU European (mostly British)
GBR Great Britain
IBS Iberian/Spain
TSI Italien

Download the input and sample information files

A file with the positions and sample information, and a beagle file are provided:

Download the files and move them to your input folder (for example, $IN_DIR):

*ANDERS*

wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz (/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.small.beagle.gz)

wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info (/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.sample.Info)

mv Demo2input.gz $IN_DIR

mv Demo2sample.info $OUT_DIR

*ANDERS*

Run PCAngsd

The objective is to show the differences among individuals.

python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4

Plot the results in R

 ## Open R

 cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

 e<-eigen(cov)
 ID<-read.table("Demo2sample.info",head=T)

 pdf("PCAngsdDemo2_1.pdf")

 plot(e$vectors[,1:2],col=ID$POP)

 legend("topleft",fill=1:4,levels(ID$POP))

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

Infer selection along the genome

Now let's try to use the PC to infer selection along the genome based on the PCA

python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N


Plot the results


library(RcppCNPy) # Numpy library for R

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
  qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in seleciton statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make manhatten plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")


## zoom into region
 w<-range(which(pval<1e-7)) + c(-100,100)
 keep<-w[1]:w[2]
 plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
 p$pos[which.max(s)]