Scan statistics

Variability statistics
Haplotype based ( EHH ) scan statistics
Bonus exercise: Population differentiation using genotype based PBS

Variability statistics

We will use individuals sequencing in the 1000 genomes project. Even thought the individuals we sequenced at low/medium depth they were able to obtain good genotype calls for most of the SNPs in the genome (using imputation + external information). Because of computational demands we will only be looking at a 20Mb region for a small subset of the individuals. We will use 56 African individuals (YRI) and 60 European individuals (CEU). These are the individuals what overlap with the HapMap project where many selection scans in Humans have been performed. We will try to explore the LCT gene located at position 136.6 Mb.

Aim: Locate the LCT selection signal using phased genotype data

We will use the program selscan with contains many of the haplotype based methods used for scan statistics. These methods are based on phased genotpe data.

In the terminal set the paths to the data and program

#selscan program folder
SS=/data/anders/prog/selscan/bin/linux/

ThePath=/data/anders/

#VCF files
ceuVCF=$ThePath/EHH/data/ceuLCT.recode.vcf
yriVCF=$ThePath/EHH/data/yriLCT.recode.vcf

#genetic map (positions in centimorgans)
MAP=$ThePath/EHH/data/geneticV2.map

Look inside one of the VCF files e.g.

less -S $yriVCF

How can you tell that it is phased data?
The VCF file has been filters for certain types of sites. Which ones?

variability/Tajimas pi

First lets look at the variability of the data in the CEU.

If there has been positive selection at the LCT loci what do you expect?

Estimate tajimas theta (pi) in 10k windows

$SS/selscan --pi --vcf $ceuVCF --map $MAP --out ceuLCT --threads 8 --pi-win 10000

You can plot the results in R e.g.

#in R
r<-read.table("ceuLCT.pi.10000bp.out",head=F)
r<-subset(r,V3!=0)
plot(r[,1],r[,3],ylab="pi",xlab="position")
causalSNP <- 136608646
abline(v=136.5e6,col="red")
legend("topleft",fill="red","LCT")
## don't close R

Can you see the reduced variability?

To determine whether it is extreme we can compare with the rest of the region

#continue in R
hist(r[,3],br=100)
(causalWin<-subset(r,V1<causalSNP & V2>causalSNP))
abline(v=causalWin[,3],col="red")

is the variability in the LCT region extreme?
Try different windows sizes to see if it will change your results (NB! you have to change the name when reading into R)

NB!. This is based on genotypes. I will get back to how to do it for low depth data using genotype likelihoood

BONUS: If you want to see tajimas D instead

Use the browser to see how Tajima's D perform on the same data

Haplotype based ( EHH ) scan statistics

IHS

Lets see if the haplotype homozygosity does a better job. Run IHS using the command

$SS/selscan --ihs --vcf $ceuVCF --map $MAP --out ceuLCT --threads 8

The analysis will take a couple of minutes. If you cannot wait then you can copy the results with the command

cp $ThePath/run/ceuLCT.ihs* .

The output colums are : <locusID> <physicalPos> <'1' freq> <ihh1> <ihh0> <unstandardized iHS>

This statistics will be affected by the frequency of the SNPs therefore we have to normalize in frequency bins. The default in 100 bins

$SS/norm --ihs --files ceuLCT.ihs.out

The number of bins in too high for this data set since we do not have enough SNPs for each bin of allele frequencies. Therefore, redo the analysis where you reduce the number of bins to 20 with the - -bins 20 option.

Lets plot the results in R

r <- read.table("ceuLCT.ihs.out.20bins.norm",as.is=T,head=F)
names(r) <- c("locusID", "physicalPos","freq","ihh1","ihh0","unstandardizediHS")
r[which.max(r$ihh1/r$ihh0),]
causalSNP <- 136608646
#plot without frequency standardization
plot(r$physicalPos,r$unstandardizediHS);

## with standardiztion IHS=ihh0/ihh1
r$iHS<-log(r$ihh1/r$ihh0)
plot(r$physicalPos,r$iHS,ylab="iHs");
abline(v=causalSNP,col="red")

## causal SNP test statistics vs. rest of region
(causalSite<-r[which(r$physicalPos==causalSNP),])
hist(r$iHS)
abline(v=causalSite$iHS,col="red")
# close R when you are done

Lets try to use the West Africans (YRI) to normalize the IHS using XpEHH

$SS/selscan --xpehh --vcf $ceuVCF --map $MAP --vcf-ref $yriVCF   --out ceuLCT --threads 8

It will take about 10mins so you should probably copy the results instead

cp $ThePath/run/ceuLCT.xpehh* .

We also have to normalize this

$SS/norm --xpehh --files ceuLCT.xpehh.out

again you can plot the results in R.

r<-read.table("ceuLCT.xpehh.out.norm",head=T,as.is=T,row.names=NULL)
causalSNP <- 136608646
(causalSite<-r[which(r$pos==causalSNP),])
plot(r$pos,r$normxpehh)
abline(v=causalSNP,col="red")

#print the site with maximum statistic
r[which.max(r$normxpehh),]

#plot the distribution
hist(r$normxpehh)
abline(v=causalSite$normxpehh,col="red")

#get the quantile
mean(causalSite$normxpehh>r$normxpehh)

Are you more convinced that the site is under selection?

Bonus exercise: Population differentiation using genotype based PBS

Here we will use the called genotypes for the whole genome from 4 1000G populations each with more then 100 individuals. We will use a browser to explore the genome using PBS statistics.

The browser is running remotely so it is very slow so you need to have patience

make a directory called tmp in your home directory

mkdir ~/tmp

then open R and run the shiny application

.libPaths( c( .libPaths(), "/data/anders/R"))
library(shiny)

runApp("/data/anders/selectionScan/",port=3838)

This will open a browser but it will take a minute or two to start so be patient .

- When the application shows "press run analysist" then it is ready

Europeans. Let see if we can do better than the Tajima’s D by using the PBS statistics. First select 3 populations from

NAT - Native Americans (PERU+Mexico)
CHB – East Asian - Han Chinese
CEU – Central Europeans
YRI – African - Nigerians The first population is the one which branch you are investigating. The two others are the one you are comparing to. Chose CEU as the first and choose CHB and YRI as the two others.

First lets get an overview of the whole genome by making a manhattan plot

Press “Run analysis”. Note which chromosomes have extreme values

To view a single chromosome – go to PBS region

Chose the chromosome and set the starting position to -1 to get the whole chromosome
Zoom in to the peak by changing start and end position.
Locate the most extreme regions of the genome
Try the LCT gene (the mutations are locate in the adjacent MCM6 gene, chr2 136.5Mb). Remember to choose the relevant populations H - ow does this compare with Tajima’s D

If you have time you can try other genes. Here are the top ones for Humans. You can find the find the location of the genes using for example the ucsc browser https://genome-euro.ucsc.edu/cgi-bin/hgGateway

(choose human GRCh37/hg19 genome). Note that there are some population that you cannot test because the populations are not represented in the data e.g. Tibetan, Ethiopian , Inuit, Siberians.

[ Front | MyPhdCourse | Index ]

Updated April 3, 2019

Scan statistics

Popgen.dk Anders Software ANGSD About

Variability statistics

variability/Tajimas pi

Haplotype based ( EHH ) scan statistics

IHS

Bonus exercise: Population differentiation using genotype based PBS