PCAngsdTutorial

From software
Revision as of 00:27, 30 July 2019 by Sonia4 (talk | contribs)
Jump to navigation Jump to search

We will go through a simple and more complex example on how to use PCAngsd with visualization of the data.

PCA with admixture aware priors

This example will perform a PCA analysis on 1000 genotype likelihoods.

Every time you open a new terminal window, set the paths to the program and the input file.

Set the path to PCAngsd

PCANGSD=~/Software/pcangsd

Test the link

ls $PCAngsd

Create directories

Create the directories that will be used for working:

mkdir Demo

cd Demo

mkdir Data

mkdir Results

Set the paths to your local directories

IN_DIR=Demo/Data

OUT_DIR=Demo/Results

Test the links

ls $IN_DIR

ls $OUT_DIR

Download the beagle genotype likelihood input file

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

Download the files and move them to your input folder (for example, $IN_DIR):

*ANDERS*

wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz

wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info

mv Demo1input.gz $IN_DIR

mv Demo1pop.info $IN_DIR

*ANDERS*

View the genotype likelihood beagle file

  • In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.
  • In order to see the first 10 columns and 10 lines of the input file, type:

gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t

  • Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):

gunzip -c $IN_DIR/Demo1input.gz | wc -l

View population information file

To view a summary of the population information file, cut the first column, sort and count:

cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c

Lets make a population label file and place it in the output directory

cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel

Run PCAngsd

Demo 1: Allele Frequencies

The program estimates the covariance matrix that can then be used for PCA.

Estimating Individual Allele Frequencies

$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD

Plot the results in R

#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R

To view the plot:

evince PCAngsd1.pdf


Without Estimating Individual Allele Frequencies

Try the same analysis but without estimating individual allele frequencies. This is the same as using the first iteration of the algorithm.

$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0

Plot the results in R

#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R

To view the plot:

evince PCAngsd2.pdf

Admixture based on 1st two PC

Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50

Plot the results in R

#open R
pop<-read.table("pop.info",as.is=T)
*'''CHANGE NAME OF FILE'''
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R

Inbreeding in the admixed individuals

Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0

Join names and results, sort the values and look at the results

paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g

The third column is an estimate of the inbreeding coefficient (allowing for negative)

Inbreeding with individual allele frequencies

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2

Join names and results, sort the values and look at the results paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g

Demo 2: Selection

For very resent selection we can look within closely related individuals for example with in Europeans

Data file:

  • Genotype likelihoods in Beagle format
  • ~150k random SNPs with maf > 5%
  • Four EU populations with ~100 individuals in each
<tbody> </tbody>
CEU European (mostly British)
GBR Great Britain
IBS Iberian/Spain
TSI Italien