ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Fst PCA: Difference between revisions

From angsd
Jump to navigation Jump to search
No edit summary
Line 5: Line 5:


We are also working on a new implementation that generalizes the above approach to multiple populations.
We are also working on a new implementation that generalizes the above approach to multiple populations.
Our program can estimate fst between two populations. And has been generalized to give all pairwise fst estimates if you supply the command with multiple populations.
;if you supply 3 populations, the program will also output the pbs statistic.


=New Fancy Version=
=New Fancy Version=

Revision as of 14:56, 30 July 2015

Matteo Fumagalli has been working on methods to estimate Fst and doing PCA/Covariance based on ANGSD output files.

The main documentation for this is found here: https://github.com/mfumagalli/ngsTools

We are also working on a new implementation that generalizes the above approach to multiple populations.

Our program can estimate fst between two populations. And has been generalized to give all pairwise fst estimates if you supply the command with multiple populations.

if you supply 3 populations, the program will also output the pbs statistic.

New Fancy Version

Two Populations real data

#this is with 2pops
#first calculate per pop saf for each populatoin
../angsd -b list1  -anc hg19ancNoChr.fa -out pop1 -dosaf 1 -gl 1
../angsd -b list2  -anc hg19ancNoChr.fa -out pop2 -dosaf 1 -gl 1
#calculate the 2dsfs prior
../misc/realSFS pop1.saf.idx pop2.saf.idx >pop1.pop2.ml
#prepare the fst for easy window analysis etc
../misc/realSFS fst index pop1.saf.idx pop2.saf.idx -sfs pop1.pop2.ml -fstout here
#get the global estimate
../misc/realSFS fst stats here.fst.idx 
-> FST.Unweight:0.069395 Fst.Weight:0.042349
#below is not tested that much, but seems to work
../misc/realSFS fst stats2 here.fst.idx -win 50000 -step 10000 >slidingwindow

3 Populations real data

In commands below im using 24 threads, because this is what I have. Adjust accordingly

#this is with 2pops
#first calculate per pop saf for each populatoin
./angsd -b list10  -anc hg19ancNoChr.fa -out pop1 -dosaf 1 -gl 1
./angsd -b list11  -anc hg19ancNoChr.fa -out pop2 -dosaf 1 -gl 1
./angsd -b list12  -anc hg19ancNoChr.fa -out pop3 -dosaf 1 -gl 1
#calculate all pairwise 2dsfs's
./misc/realSFS pop1.saf.idx pop2.saf.idx -P 24 >pop1.pop2.ml
./misc/realSFS pop1.saf.idx pop3.saf.idx -P 24 >pop1.pop3.ml
./misc/realSFS pop2.saf.idx pop3.saf.idx -P 24 >pop2.pop3.ml
#prepare the fst for easy analysis etc
./misc/realSFS fst index pop1.saf.idx pop2.saf.idx pop3.saf.idx -sfs pop1.pop2.ml -sfs pop1.pop3.ml -sfs pop2.pop3.ml -fstout here
#get the global estimate
	-> Assuming idxname:here.fst.idx
	-> Assuming .fst.gz file: here.fst.gz
	-> FST.Unweight[nObs:1666245]:0.022063 Fst.Weight:0.034513
0.022063 0.034513
	-> FST.Unweight[nObs:1666245]:0.026867 Fst.Weight:0.031989
0.026867 0.031989
	-> FST.Unweight[nObs:1666245]:0.025324 Fst.Weight:0.021118
0.025324 0.021118
	-> pbs.pop1	0.023145
	-> pbs.pop2	0.005088
	-> pbs.pop3	0.009367
#below is not tested that much, but seems to work
../misc/realSFS fst stats2 here.fst.idx -win 50000 -step 10000 >slidingwindow

In the presence of 3 populations, the program will also calculate the population branch statistics

Sliding Window output

The sliding window seems to work so we have documented it here:

Second column is chromosome, third is center of window followed by
fst.unweight(pop1,pop2) fst.weight(pop1,pop2) fst.unweight(pop1,pop3) fst.weight(pop1,pop3) fst.unweight(pop2,pop3) fst.weight(pop2,pop3)
 
(9133,58895)(14010000,14059999)(14010000,14060000)	1	14035000	0.022099	0.016387	0.026686	0.027731	0.025311	0.047920	-0.002231	0.035045	0.030353
(19114,68881)(14020000,14069999)(14020000,14070000)	1	14045000	0.022096	0.019076	0.026777	0.024238	0.025290	0.052793	-0.005220	0.041969	0.029757
(28951,78655)(14030000,14079999)(14030000,14080000)	1	14055000	0.022043	0.021025	0.026915	0.023368	0.025342	0.056975	-0.006884	0.046840	0.030530
(38928,88632)(14040000,14089999)(14040000,14090000)	1	14065000	0.022083	0.016525	0.026846	0.029560	0.025345	0.053421	-0.004116	0.039898	0.034122
(48917,98170)(14050000,14099999)(14050000,14100000)	1	14075000	0.022132	0.022082	0.026742	0.025564	0.025262	0.037071	0.005226	0.024827	0.020671
(74,49191)(14000000,14049999)(14000000,14050000)	10	14025000	0.022704	0.101955	0.026479	0.095713	0.025102	0.001924	0.103108	-0.048378	-0.002500
(9734,58555)(14010000,14059999)(14010000,14060000)	10	14035000	0.022779	0.102670	0.026425	0.088015	0.025118	0.002721	0.098870	-0.043342	-0.006738

The last 3 columns are the populations branch statistic for population1, popultion2 and population3

3 Populations simulated data

msms -ms 44 10 -t 930 -r 400 -I 3 12 14 18 -n 1 1.682020 -n 2 3.736830 -n 3 7.292050 -eg 0 2 116.010723 -eg 0 3 160.246047 -ma x 0.881098 0.561966 0.881098 x 2.797460 0.561966 2.797460 x -ej 0.028985 3 2 -en 0.028985 2 0.287184 -ema 0.028985 3 x 7.293140 x 7.293140 x x x x x -ej 0.197963 2 1 -en 0.303501 1 1 >msoutput.txt
../misc/msToGlf -in msoutput.txt -out raw -singleOut 1 -regLen 0 -depth 8 -err 0.005 
../misc/splitgl raw.glf.gz 22 1 6 >pop1.glf.gz
../misc/splitgl raw.glf.gz 22 7 13 >pop2.glf.gz
../misc/splitgl raw.glf.gz 22 14 22 >pop3.glf.gz
echo \"1 250000000\" >fai.fai
../angsd -glf pop1.glf.gz -nind 6 -doSaf 1 -out pop1 -fai fai.fai -issim 1
../angsd -glf pop2.glf.gz -nind 7 -doSaf 1 -out pop2 -fai fai.fai -issim 1
../angsd -glf pop3.glf.gz -nind 9 -doSaf 1 -out pop3 -fai fai.fai -issim 1
../misc/realSFS pop1.saf.idx >pop1.saf.idx.ml
../misc/realSFS pop2.saf.idx >pop2.saf.idx.ml
../misc/realSFS pop3.saf.idx >pop3.saf.idx.ml
../misc/realSFS pop1.saf.idx pop2.saf.idx -p 20  >pop1.pop2.saf.idx.ml
../misc/realSFS pop1.saf.idx pop3.saf.idx -p 20  >pop1.pop3.saf.idx.ml
../misc/realSFS pop2.saf.idx pop3.saf.idx -p 20  >pop2.pop3.saf.idx.ml
../misc/realSFS fst index pop1.saf.idx pop2.saf.idx -fstout pop1.pop2 -sfs pop1.pop2.saf.idx.ml
../misc/realSFS fst index pop1.saf.idx pop3.saf.idx -fstout pop1.pop3 -sfs pop1.pop3.saf.idx.ml
../misc/realSFS fst index pop2.saf.idx pop3.saf.idx -fstout pop2.pop3 -sfs pop2.pop3.saf.idx.ml
../misc/realSFS fst index pop1.saf.idx pop2.saf.idx pop3.saf.idx -fstout pop1.pop2.pop3 -sfs pop1.pop2.saf.idx.ml -sfs pop1.pop3.saf.idx.ml -sfs pop2.pop3.saf.idx.ml
../misc/realSFS fst stats pop1.pop2.fst.idx
../misc/realSFS fst stats pop1.pop3.fst.idx
../misc/realSFS fst stats pop2.pop3.fst.idx
../misc/realSFS fst stats pop1.pop2.pop3.fst.idx

Which gives the following output

$ ../misc/realSFS fst stats pop1.pop2.fst.idx
	-> You are printing the optimized SFS to the terminal consider dumping into a file
	-> E.g.: './realSFS fst stats pop1.pop2.fst.idx >sfs.ml.txt'
	-> Assuming idxname:pop1.pop2.fst.idx
	-> Assuming .fst.gz file: pop1.pop2.fst.gz
	-> FST.Unweight[nObs:51085]:0.114638 Fst.Weight:0.186980
$ ../misc/realSFS fst stats pop1.pop3.fst.idx
	-> You are printing the optimized SFS to the terminal consider dumping into a file
	-> E.g.: './realSFS fst stats pop1.pop3.fst.idx >sfs.ml.txt'
	-> Assuming idxname:pop1.pop3.fst.idx
	-> Assuming .fst.gz file: pop1.pop3.fst.gz
	-> FST.Unweight[nObs:51085]:0.121007 Fst.Weight:0.192111
$ ../misc/realSFS fst stats pop2.pop3.fst.idx
	-> You are printing the optimized SFS to the terminal consider dumping into a file
	-> E.g.: './realSFS fst stats pop2.pop3.fst.idx >sfs.ml.txt'
	-> Assuming idxname:pop2.pop3.fst.idx
	-> Assuming .fst.gz file: pop2.pop3.fst.gz
	-> FST.Unweight[nObs:51085]:0.069462 Fst.Weight:0.125002
$ ../misc/realSFS fst stats pop1.pop2.pop3.fst.idx
	-> You are printing the optimized SFS to the terminal consider dumping into a file
	-> E.g.: './realSFS fst stats pop1.pop2.pop3.fst.idx >sfs.ml.txt'
	-> Assuming idxname:pop1.pop2.pop3.fst.idx
	-> Assuming .fst.gz file: pop1.pop2.pop3.fst.gz
	-> FST.Unweight[nObs:51085]:0.114638 Fst.Weight:0.186980
	-> FST.Unweight[nObs:51085]:0.121007 Fst.Weight:0.192111
	-> FST.Unweight[nObs:51085]:0.069462 Fst.Weight:0.125002

Two populations (sim data with R implementation of functionality)

  nRep <- 100
       nPop1 <- 24
       nPop2 <- 16
       cmd <- paste("msms -ms",nPop1+nPop2,nRep,"-t 930 -r 400 -I 2",nPop1,nPop2,"0 -g 1 9.70406 -n 1 2 -n 2 1 -ma x 0.0 0.0 x -ej 0.07142857 2 1  >msoutput.txt ",sep=" ")
       system(cmd)
       ##system("msms -ms 40 1 -t 930 -r 400 -I 2 20 20 0 -g 1 9.70406 -n 1 2 -n 2 1 -ma x 0.0 0.0 x -ej 0.07142857 2 1  >msoutput.txt  ")
 source("../R/readms.output.R")
  to2dSFS <- function(p1.d,p2.d,nPop1,nPop2)
        sapply(0:nPop1,function(x) table(factor(p2.d[p1.d==x],levels=0:nPop2)))

  source("../R/readms.output.R")
        a<- read.ms.output(file="msoutput.txt")
        
        p1.d <- unlist((sapply(a$gam,function(x) colSums(x[1:nPop1,]))))
        p2.d <- unlist((sapply(a$gam,function(x) colSums(x[-c(1:nPop1),]))))
        par(mfrow=c(1,2))
        barplot(table(p1.d))
        barplot(table(p2.d))
        sfs.2d <- t(sapply(0:nPop1,function(x) table(factor(p2.d[p1.d==x],levels=0:nPop2))))
   system("../misc/msToGlf -in msoutput.txt -out raw -singleOut 1 -regLen 0 -depth 8 -err 0.005")
       system("../misc/splitgl raw.glf.gz 20 1 12 >pop1.glf.gz")
       system("../misc/splitgl raw.glf.gz 20 13 20 >pop2.glf.gz")
       system("echo \"1 250000000\" >fai.fai")
       system("../angsd -glf pop1.glf.gz -nind 12 -doSaf 1 -out pop1 -fai fai.fai -issim 1")
       system("../angsd -glf pop2.glf.gz -nind 8 -doSaf 1 -out pop2 -fai fai.fai -issim 1")
       system("../misc/realSFS pop1.saf.idx >pop1.saf.idx.ml")
       system("../misc/realSFS pop2.saf.idx >pop2.saf.idx.ml")
       system("../misc/realSFS pop1.saf.idx pop2.saf.idx -maxIter 500 -p 20  >pop1.pop2.saf.idx.ml")
getFst<-function(est){
    N1<-nrow(est)-1
    N2<-ncol(est)-1
    cat("N1: ",N1 ," N2: ",N2,"\n")
    est0<-est
    est0[1,1]<-0
    est0[N1+1,N2+1]<-0
    est0<-est0/sum(est0)
    
    aMat<<-matrix(NA,nrow=N1+1,ncol=N2+1)
    baMat<<-matrix(NA,nrow=N1+1,ncol=N2+1)
    for(a1 in 0:(N1))
        for(a2 in 0:(N2)){
            p1 <- a1/N1
            p2 <- a2/N2
            q1 <- 1 - p1
            q2 <- 1 - p2
            alpha1 <- 1 - (p1^2 + q1^2)
            alpha2 <- 1 - (p2^2 + q2^2)
            
            al <-  1/2 * ( (p1-p2)^2 + (q1-q2)^2) - (N1+N2) *  (N1*alpha1 + N2*alpha2) / (4*N1*N2*(N1+N2-1))
            bal <- 1/2 * ( (p1-p2)^2 + (q1-q2)^2) + (4*N1*N2-N1-N2)*(N1*alpha1 + N2*alpha2) / (4*N1*N2*(N1+N2-1))
            aMat[a1+1,a2+1]<<-al
            baMat[a1+1,a2+1]<<-bal
            ##  print(signif(c(a1=a1,a2=a2,p1=p1,p2=p2,al1=alpha1,al2=alpha2,al),2))
        }
    ## unweighted average of single-locus ratio estimators
    fstU <-   sum(est0*(aMat/baMat),na.rm=T)
    ## weighted average of single-locus ratio estimators
    fstW <-   sum(est0*aMat,na.rm=T)/sum(est0*baMat,na.rm=T)
    c(fstW=fstW,fstU=fstU)
}

> getFst(sfs.2d)
N1:  24  N2:  16 
      fstW       fstU 
0.11945801 0.08249571 

est <- matrix(as.integer(scan("pop1.pop2.saf.idx.ml")),byrow=T,ncol=nPop2+1)
> getFst(est)
N1:  24  N2:  16 
      fstW       fstU 
0.11925903 0.08241461 

cmd<"fst index pop1.saf.idx pop2.saf.idx -sfs pop1.pop2.saf.idx.ml  -fstout testing"
system(cmd)

##view the per site 'alpha' 'beta' if you want
cmd<-"../misc/realSFS fst print testing.fst.idx |head"

##use fancy new emperical bayes
cmd<- "../misc/realSFS fst stats testing.fst.idx "
system(cmd)
-> FST.Unweight:0.083316 Fst.Weight:0.119372

Old nice version

Fst

  1. Generate .saf files from each population using ANGSD SFS Estimation
    1. using a 2D-SFS as a prior, estimated using ngs2dSFS
    2. using marginal spectra as priors, estimated using realSFS

PCA

More information here: https://github.com/mfumagalli/ngsTools#ngscovar

cite

If you use these methods, you should cite the m. fumagalli paper http://www.ncbi.nlm.nih.gov/pubmed/23979584