Variability statistics
We will use individuals sequencing in the 1000 genomes project. Even thought the individuals we sequenced at low/medium depth they were able to obtain good genotype calls for most of the SNPs in the genome (using imputation + external information). Because of computational demands we will only be looking at a 20Mb region for a small subset of the individuals. We will use 56 African individuals (YRI) and 60 European individuals (CEU). These are the individuals what overlap with the HapMap project where many selection scans in Humans have been performed. We will try to explore the LCT gene located at position 136.6 Mb.
Aim: Locate the LCT selection signal using phased genotype data
We will use the program selscan with contains many of the haplotype based methods used for scan statistics. These methods are based on phased genotpe data.
In the terminal set the paths to the data and program
#selscan program folder SS=/data/anders/prog/selscan/bin/linux/ ThePath=/data/anders/ #VCF files ceuVCF=$ThePath/EHH/data/ceuLCT.recode.vcf yriVCF=$ThePath/EHH/data/yriLCT.recode.vcf #genetic map (positions in centimorgans) MAP=$ThePath/EHH/data/geneticV2.map
Look inside one of the VCF files e.g.
less -S $yriVCF
- How can you tell that it is phased data?
- The VCF file has been filters for certain types of sites. Which ones?
variability/Tajimas pi
First lets look at the variability of the data in the CEU.
- If there has been positive selection at the LCT loci what do you expect?
Estimate tajimas theta (pi) in 10k windows
$SS/selscan --pi --vcf $ceuVCF --map $MAP --out ceuLCT --threads 8 --pi-win 10000
You can plot the results in R e.g.
#in R
r<-read.table("ceuLCT.pi.10000bp.out",head=F)
r<-subset(r,V3!=0)
plot(r[,1],r[,3],ylab="pi",xlab="position")
causalSNP <- 136608646
abline(v=136.5e6,col="red")
legend("topleft",fill="red","LCT")
## don't close R
- Can you see the reduced variability?
To determine whether it is extreme we can compare with the rest of the region
#continue in R hist(r[,3],br=100) (causalWin<-subset(r,V1<causalSNP & V2>causalSNP)) abline(v=causalWin[,3],col="red")
- is the variability in the LCT region extreme?
- Try different windows sizes to see if it will change your results (NB! you have to change the name when reading into R)
NB!. This is based on genotypes. I will get back to how to do it for low depth data using genotype likelihoood
BONUS: If you want to see tajimas D instead
Use the browser to see how Tajima's D perform on the same data
Haplotype based ( EHH ) scan statistics
IHS
Lets see if the haplotype homozygosity does a better job. Run IHS using the command
$SS/selscan --ihs --vcf $ceuVCF --map $MAP --out ceuLCT --threads 8
The analysis will take a couple of minutes. If you cannot wait then you can copy the results with the command
cp $ThePath/run/ceuLCT.ihs* .
The output colums are : <locusID> <physicalPos> <'1' freq> <ihh1> <ihh0> <unstandardized iHS>
This statistics will be affected by the frequency of the SNPs therefore we have to normalize in frequency bins. The default in 100 bins
$SS/norm --ihs --files ceuLCT.ihs.out
The number of bins in too high for this data set since we do not have enough SNPs for each bin of allele frequencies. Therefore, redo the analysis where you reduce the number of bins to 20 with the - -bins 20 option.
Lets plot the results in R
r <- read.table("ceuLCT.ihs.out.20bins.norm",as.is=T,head=F)
names(r) <- c("locusID", "physicalPos","freq","ihh1","ihh0","unstandardizediHS")
r[which.max(r$ihh1/r$ihh0),]
causalSNP <- 136608646
#plot without frequency standardization
plot(r$physicalPos,r$unstandardizediHS);
## with standardiztion IHS=ihh0/ihh1
r$iHS<-log(r$ihh1/r$ihh0)
plot(r$physicalPos,r$iHS,ylab="iHs");
abline(v=causalSNP,col="red")
## causal SNP test statistics vs. rest of region
(causalSite<-r[which(r$physicalPos==causalSNP),])
hist(r$iHS)
abline(v=causalSite$iHS,col="red")
# close R when you are done
Lets try to use the West Africans (YRI) to normalize the IHS using XpEHH
$SS/selscan --xpehh --vcf $ceuVCF --map $MAP --vcf-ref $yriVCF --out ceuLCT --threads 8It will take about 10mins so you should probably copy the results instead
cp $ThePath/run/ceuLCT.xpehh* .
We also have to normalize this
$SS/norm --xpehh --files ceuLCT.xpehh.out
again you can plot the results in R.
r<-read.table("ceuLCT.xpehh.out.norm",head=T,as.is=T,row.names=NULL)
causalSNP <- 136608646
(causalSite<-r[which(r$pos==causalSNP),])
plot(r$pos,r$normxpehh)
abline(v=causalSNP,col="red")
#print the site with maximum statistic
r[which.max(r$normxpehh),]
#plot the distribution
hist(r$normxpehh)
abline(v=causalSite$normxpehh,col="red")
#get the quantile
mean(causalSite$normxpehh>r$normxpehh)
- Are you more convinced that the site is under selection?
Bonus exercise: Population differentiation using genotype based PBS
Here we will use the called genotypes for the whole genome from 4 1000G populations each with more then 100 individuals. We will use a browser to explore the genome using PBS statistics.
The browser is running remotely so it is very slow so you need to have patience
make a directory called tmp in your home directory
mkdir ~/tmp
then open R and run the shiny application
.libPaths( c( .libPaths(), "/data/anders/R"))
library(shiny)
runApp("/data/anders/selectionScan/",port=3838)
This will open a browser but it will take a minute or two to start so be patient .
- When the application shows "press run analysist" then it is ready
Europeans. Let see if we can do better than the Tajima’s D by using the PBS statistics. First select 3 populations from
- NAT - Native Americans (PERU+Mexico)
- CHB – East Asian - Han Chinese
- CEU – Central Europeans
- YRI – African - Nigerians The first population is the one which branch you are investigating. The two others are the one you are comparing to. Chose CEU as the first and choose CHB and YRI as the two others.
First lets get an overview of the whole genome by making a manhattan plot
Press “Run analysis”. Note which chromosomes have extreme values
To view a single chromosome – go to PBS region
- Chose the chromosome and set the starting position to -1 to get the whole chromosome
- Zoom in to the peak by changing start and end position.
- Locate the most extreme regions of the genome
- Try the LCT gene (the mutations are locate in the adjacent MCM6 gene, chr2 136.5Mb). Remember to choose the relevant populations H - ow does this compare with Tajima’s D
If you have time you can try other genes. Here are the top ones for Humans. You can find the find the location of the genes using for example the ucsc browser https://genome-euro.ucsc.edu/cgi-bin/hgGateway
(choose human GRCh37/hg19 genome). Note that there are some population that you cannot test because the populations are not represented in the data e.g. Tibetan, Ethiopian , Inuit, Siberians.
