ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Fst: Difference between revisions

From angsd
Jump to navigation Jump to search
 
(18 intermediate revisions by the same user not shown)
Line 2: Line 2:


;if you supply 3 populations, the program will also output the pbs statistic.
;if you supply 3 populations, the program will also output the pbs statistic.
;NB we have removed the very unusefull unweighted fst estimator in the output, and have included a header. The output example below will be updated at some point.
The procedure is
- Use angsd for calculating '''saf''' files for each population
- Use realSFS to calculate 2d sfs for each pair
- Use the above calculated 2dsfs as priors jointly with all '''safs''' from step1 to calculate '''fst''' binary files
- Use realSFS to extract the the fst values from the '''fst'''
NB;
In the latest github version there is a different fst estimator which should be preferable for small sample sizes. Feel free to try that out with
<pre>
./realSFS fst index [saf.idx's] -whichFst 1
</pre>
=Note about fst for folded spectra=
* Earlier versions of angsd/realSFS could output folded 1d sample allele frequencies which would be usefull for 1population neutrality test like Tajima. This is however not appropriate to use for calculating fst since the folding was done within population.
* We have therefore added a proper folding procedure for the optimization based on the UNFOLDED .saf.idx files generated by -doSaf. These are the ones that should be used for calculating fst.
Therefore please remember to add -fold 1 if you want angsd (the realSFS subfunction) to perform fst and pbs estimation using the folded spectra.


=Two Populations real data=
=Two Populations real data=
Line 50: Line 72:
</pre>
</pre>
In the presence of 3 populations, the program will also calculate the population branch statistics
In the presence of 3 populations, the program will also calculate the population branch statistics
===Sliding Window  output===
==Sliding Window  output==
The sliding window seems to work so we have documented it here:
The sliding window seems to work so we have documented it here:
;Second column is chromosome, third is center of window followed by:
;Second column is chromosome, third is center of window followed by:
Line 64: Line 86:
</pre>
</pre>
The last 3 columns are the populations branch statistic for population1, popultion2 and population3
The last 3 columns are the populations branch statistic for population1, popultion2 and population3
==Relative window positions?==
We allow for 3 different ways of defining window positions, these are chosen with the '''-type''' argument in realSFS


== 3 Populations simulated data==
<pre>
msms -ms 44 10 -t 930 -r 400 -I 3 12 14 18 -n 1 1.682020 -n 2 3.736830 -n 3 7.292050 -eg 0 2 116.010723 -eg 0 3 160.246047 -ma x 0.881098 0.561966 0.881098 x 2.797460 0.561966 2.797460 x -ej 0.028985 3 2 -en 0.028985 2 0.287184 -ema 0.028985 3 x 7.293140 x 7.293140 x x x x x -ej 0.197963 2 1 -en 0.303501 1 1 >msoutput.txt
../misc/msToGlf -in msoutput.txt -out raw -singleOut 1 -regLen 0 -depth 8 -err 0.005
../misc/splitgl raw.glf.gz 22 1 6 >pop1.glf.gz
../misc/splitgl raw.glf.gz 22 7 13 >pop2.glf.gz
../misc/splitgl raw.glf.gz 22 14 22 >pop3.glf.gz
echo \"1 250000000\" >fai.fai
../angsd -glf pop1.glf.gz -nind 6 -doSaf 1 -out pop1 -fai fai.fai -issim 1
../angsd -glf pop2.glf.gz -nind 7 -doSaf 1 -out pop2 -fai fai.fai -issim 1
../angsd -glf pop3.glf.gz -nind 9 -doSaf 1 -out pop3 -fai fai.fai -issim 1
../misc/realSFS pop1.saf.idx >pop1.saf.idx.ml
../misc/realSFS pop2.saf.idx >pop2.saf.idx.ml
../misc/realSFS pop3.saf.idx >pop3.saf.idx.ml
../misc/realSFS pop1.saf.idx pop2.saf.idx -p 20  >pop1.pop2.saf.idx.ml
../misc/realSFS pop1.saf.idx pop3.saf.idx -p 20  >pop1.pop3.saf.idx.ml
../misc/realSFS pop2.saf.idx pop3.saf.idx -p 20  >pop2.pop3.saf.idx.ml
../misc/realSFS fst index pop1.saf.idx pop2.saf.idx -fstout pop1.pop2 -sfs pop1.pop2.saf.idx.ml
../misc/realSFS fst index pop1.saf.idx pop3.saf.idx -fstout pop1.pop3 -sfs pop1.pop3.saf.idx.ml
../misc/realSFS fst index pop2.saf.idx pop3.saf.idx -fstout pop2.pop3 -sfs pop2.pop3.saf.idx.ml
../misc/realSFS fst index pop1.saf.idx pop2.saf.idx pop3.saf.idx -fstout pop1.pop2.pop3 -sfs pop1.pop2.saf.idx.ml -sfs pop1.pop3.saf.idx.ml -sfs pop2.pop3.saf.idx.ml
../misc/realSFS fst stats pop1.pop2.fst.idx
../misc/realSFS fst stats pop1.pop3.fst.idx
../misc/realSFS fst stats pop2.pop3.fst.idx
../misc/realSFS fst stats pop1.pop2.pop3.fst.idx
</pre>
Which gives the following output
<pre>
$ ../misc/realSFS fst stats pop1.pop2.fst.idx
-> You are printing the optimized SFS to the terminal consider dumping into a file
-> E.g.: './realSFS fst stats pop1.pop2.fst.idx >sfs.ml.txt'
-> Assuming idxname:pop1.pop2.fst.idx
-> Assuming .fst.gz file: pop1.pop2.fst.gz
-> FST.Unweight[nObs:51085]:0.114638 Fst.Weight:0.186980
$ ../misc/realSFS fst stats pop1.pop3.fst.idx
-> You are printing the optimized SFS to the terminal consider dumping into a file
-> E.g.: './realSFS fst stats pop1.pop3.fst.idx >sfs.ml.txt'
-> Assuming idxname:pop1.pop3.fst.idx
-> Assuming .fst.gz file: pop1.pop3.fst.gz
-> FST.Unweight[nObs:51085]:0.121007 Fst.Weight:0.192111
$ ../misc/realSFS fst stats pop2.pop3.fst.idx
-> You are printing the optimized SFS to the terminal consider dumping into a file
-> E.g.: './realSFS fst stats pop2.pop3.fst.idx >sfs.ml.txt'
-> Assuming idxname:pop2.pop3.fst.idx
-> Assuming .fst.gz file: pop2.pop3.fst.gz
-> FST.Unweight[nObs:51085]:0.069462 Fst.Weight:0.125002
$ ../misc/realSFS fst stats pop1.pop2.pop3.fst.idx
-> You are printing the optimized SFS to the terminal consider dumping into a file
-> E.g.: './realSFS fst stats pop1.pop2.pop3.fst.idx >sfs.ml.txt'
-> Assuming idxname:pop1.pop2.pop3.fst.idx
-> Assuming .fst.gz file: pop1.pop2.pop3.fst.gz
-> FST.Unweight[nObs:51085]:0.114638 Fst.Weight:0.186980
-> FST.Unweight[nObs:51085]:0.121007 Fst.Weight:0.192111
-> FST.Unweight[nObs:51085]:0.069462 Fst.Weight:0.125002
</pre>


==Two populations (sim data with R implementation of functionality)==
;-type 2 Use pos=1 as the leftmost position of first window. Even though there isn't any data.
<pre>
  nRep <- 100
      nPop1 <- 24
      nPop2 <- 16
      cmd <- paste("msms -ms",nPop1+nPop2,nRep,"-t 930 -r 400 -I 2",nPop1,nPop2,"0 -g 1 9.70406 -n 1 2 -n 2 1 -ma x 0.0 0.0 x -ej 0.07142857 2 1  >msoutput.txt ",sep=" ")
      system(cmd)
      ##system("msms -ms 40 1 -t 930 -r 400 -I 2 20 20 0 -g 1 9.70406 -n 1 2 -n 2 1 -ma x 0.0 0.0 x -ej 0.07142857 2 1  >msoutput.txt  ")
source("../R/readms.output.R")
  to2dSFS <- function(p1.d,p2.d,nPop1,nPop2)
        sapply(0:nPop1,function(x) table(factor(p2.d[p1.d==x],levels=0:nPop2)))


  source("../R/readms.output.R")
;-type 1 Use first position with data, as leftmost position for the first window.
        a<- read.ms.output(file="msoutput.txt")
       
        p1.d <- unlist((sapply(a$gam,function(x) colSums(x[1:nPop1,]))))
        p2.d <- unlist((sapply(a$gam,function(x) colSums(x[-c(1:nPop1),]))))
        par(mfrow=c(1,2))
        barplot(table(p1.d))
        barplot(table(p2.d))
        sfs.2d <- t(sapply(0:nPop1,function(x) table(factor(p2.d[p1.d==x],levels=0:nPop2))))
  system("../misc/msToGlf -in msoutput.txt -out raw -singleOut 1 -regLen 0 -depth 8 -err 0.005")
      system("../misc/splitgl raw.glf.gz 20 1 12 >pop1.glf.gz")
      system("../misc/splitgl raw.glf.gz 20 13 20 >pop2.glf.gz")
      system("echo \"1 250000000\" >fai.fai")
      system("../angsd -glf pop1.glf.gz -nind 12 -doSaf 1 -out pop1 -fai fai.fai -issim 1")
      system("../angsd -glf pop2.glf.gz -nind 8 -doSaf 1 -out pop2 -fai fai.fai -issim 1")
      system("../misc/realSFS pop1.saf.idx >pop1.saf.idx.ml")
      system("../misc/realSFS pop2.saf.idx >pop2.saf.idx.ml")
      system("../misc/realSFS pop1.saf.idx pop2.saf.idx -maxIter 500 -p 20  >pop1.pop2.saf.idx.ml")
getFst<-function(est){
    N1<-nrow(est)-1
    N2<-ncol(est)-1
    cat("N1: ",N1 ," N2: ",N2,"\n")
    est0<-est
    est0[1,1]<-0
    est0[N1+1,N2+1]<-0
    est0<-est0/sum(est0)
   
    aMat<<-matrix(NA,nrow=N1+1,ncol=N2+1)
    baMat<<-matrix(NA,nrow=N1+1,ncol=N2+1)
    for(a1 in 0:(N1))
        for(a2 in 0:(N2)){
            p1 <- a1/N1
            p2 <- a2/N2
            q1 <- 1 - p1
            q2 <- 1 - p2
            alpha1 <- 1 - (p1^2 + q1^2)
            alpha2 <- 1 - (p2^2 + q2^2)
           
            al <-  1/2 * ( (p1-p2)^2 + (q1-q2)^2) - (N1+N2) *  (N1*alpha1 + N2*alpha2) / (4*N1*N2*(N1+N2-1))
            bal <- 1/2 * ( (p1-p2)^2 + (q1-q2)^2) + (4*N1*N2-N1-N2)*(N1*alpha1 + N2*alpha2) / (4*N1*N2*(N1+N2-1))
            aMat[a1+1,a2+1]<<-al
            baMat[a1+1,a2+1]<<-bal
            ##  print(signif(c(a1=a1,a2=a2,p1=p1,p2=p2,al1=alpha1,al2=alpha2,al),2))
        }
    ## unweighted average of single-locus ratio estimators
    fstU <-  sum(est0*(aMat/baMat),na.rm=T)
    ## weighted average of single-locus ratio estimators
    fstW <-  sum(est0*aMat,na.rm=T)/sum(est0*baMat,na.rm=T)
    c(fstW=fstW,fstU=fstU)
}


> getFst(sfs.2d)
;-type 0 Split out the genome into blocks. And use the first window that have data for the entire window. Then we will have the same windowcenters across datasets.
N1:  24  N2:  16
=realSFS fst print=
      fstW      fstU
You can print out the precalculated A and B with
0.11945801 0.08249571


est <- matrix(as.integer(scan("pop1.pop2.saf.idx.ml")),byrow=T,ncol=nPop2+1)
''./realSFS fst print pop1.pop2.fst.idx''
> getFst(est)
N1:  24  N2:  16
      fstW      fstU
0.11925903 0.08241461


cmd<"fst index pop1.saf.idx pop2.saf.idx -sfs pop1.pop2.saf.idx.ml -fstout testing"
Assuming we have pop1.saf.idx, pop2.saf.idx.
system(cmd)
<pre>
./realSFS pop1.saf.idx pop2.saf.idx >pop1.pop2.saf.idx.ml
./realSFS fst index pop1.saf.idx pop2.saf.idx -fstout pop1.pop2 -sfs pop1.pop2.saf.idx.ml
./realSFS fst print pop1.pop2.fst.idx
</pre>


##view the per site 'alpha' 'beta' if you want
The weighted fst for a region is the ratio between the sum of As and the sum of B. The unweighted is the mean of the persite ratios.
cmd<-"../misc/realSFS fst print testing.fst.idx |head"


##use fancy new emperical bayes
A is the alpha from the reynolds 1983 (or Bhatia) and B is the alpha + beta.
cmd<- "../misc/realSFS fst stats testing.fst.idx "
system(cmd)
-> FST.Unweight:0.083316 Fst.Weight:0.119372
</pre>

Latest revision as of 02:38, 17 August 2019

Our program can estimate fst between populations. And has been generalized to give all pairwise fst estimates if you supply the command with multiple populations.

if you supply 3 populations, the program will also output the pbs statistic.
NB we have removed the very unusefull unweighted fst estimator in the output, and have included a header. The output example below will be updated at some point.

The procedure is

- Use angsd for calculating saf files for each population

- Use realSFS to calculate 2d sfs for each pair

- Use the above calculated 2dsfs as priors jointly with all safs from step1 to calculate fst binary files

- Use realSFS to extract the the fst values from the fst

NB; In the latest github version there is a different fst estimator which should be preferable for small sample sizes. Feel free to try that out with

./realSFS fst index [saf.idx's] -whichFst 1

Note about fst for folded spectra

  • Earlier versions of angsd/realSFS could output folded 1d sample allele frequencies which would be usefull for 1population neutrality test like Tajima. This is however not appropriate to use for calculating fst since the folding was done within population.
  • We have therefore added a proper folding procedure for the optimization based on the UNFOLDED .saf.idx files generated by -doSaf. These are the ones that should be used for calculating fst.

Therefore please remember to add -fold 1 if you want angsd (the realSFS subfunction) to perform fst and pbs estimation using the folded spectra.

Two Populations real data

#this is with 2pops
#first calculate per pop saf for each populatoin
../angsd -b list1  -anc hg19ancNoChr.fa -out pop1 -dosaf 1 -gl 1
../angsd -b list2  -anc hg19ancNoChr.fa -out pop2 -dosaf 1 -gl 1
#calculate the 2dsfs prior
../misc/realSFS pop1.saf.idx pop2.saf.idx >pop1.pop2.ml
#prepare the fst for easy window analysis etc
../misc/realSFS fst index pop1.saf.idx pop2.saf.idx -sfs pop1.pop2.ml -fstout here
#get the global estimate
../misc/realSFS fst stats here.fst.idx 
-> FST.Unweight:0.069395 Fst.Weight:0.042349
#below is not tested that much, but seems to work
../misc/realSFS fst stats2 here.fst.idx -win 50000 -step 10000 >slidingwindow

3 Populations real data

In commands below im using 24 threads, because this is what I have. Adjust accordingly

#this is with 2pops
#first calculate per pop saf for each populatoin
./angsd -b list10  -anc hg19ancNoChr.fa -out pop1 -dosaf 1 -gl 1
./angsd -b list11  -anc hg19ancNoChr.fa -out pop2 -dosaf 1 -gl 1
./angsd -b list12  -anc hg19ancNoChr.fa -out pop3 -dosaf 1 -gl 1
#calculate all pairwise 2dsfs's
./misc/realSFS pop1.saf.idx pop2.saf.idx -P 24 >pop1.pop2.ml
./misc/realSFS pop1.saf.idx pop3.saf.idx -P 24 >pop1.pop3.ml
./misc/realSFS pop2.saf.idx pop3.saf.idx -P 24 >pop2.pop3.ml
#prepare the fst for easy analysis etc
./misc/realSFS fst index pop1.saf.idx pop2.saf.idx pop3.saf.idx -sfs pop1.pop2.ml -sfs pop1.pop3.ml -sfs pop2.pop3.ml -fstout here
#get the global estimate
	-> Assuming idxname:here.fst.idx
	-> Assuming .fst.gz file: here.fst.gz
	-> FST.Unweight[nObs:1666245]:0.022063 Fst.Weight:0.034513
0.022063 0.034513
	-> FST.Unweight[nObs:1666245]:0.026867 Fst.Weight:0.031989
0.026867 0.031989
	-> FST.Unweight[nObs:1666245]:0.025324 Fst.Weight:0.021118
0.025324 0.021118
	-> pbs.pop1	0.023145
	-> pbs.pop2	0.005088
	-> pbs.pop3	0.009367
#below is not tested that much, but seems to work
../misc/realSFS fst stats2 here.fst.idx -win 50000 -step 10000 >slidingwindow

In the presence of 3 populations, the program will also calculate the population branch statistics

Sliding Window output

The sliding window seems to work so we have documented it here:

Second column is chromosome, third is center of window followed by
fst.unweight(pop1,pop2) fst.weight(pop1,pop2) fst.unweight(pop1,pop3) fst.weight(pop1,pop3) fst.unweight(pop2,pop3) fst.weight(pop2,pop3)
 
(9133,58895)(14010000,14059999)(14010000,14060000)	1	14035000	0.022099	0.016387	0.026686	0.027731	0.025311	0.047920	-0.002231	0.035045	0.030353
(19114,68881)(14020000,14069999)(14020000,14070000)	1	14045000	0.022096	0.019076	0.026777	0.024238	0.025290	0.052793	-0.005220	0.041969	0.029757
(28951,78655)(14030000,14079999)(14030000,14080000)	1	14055000	0.022043	0.021025	0.026915	0.023368	0.025342	0.056975	-0.006884	0.046840	0.030530
(38928,88632)(14040000,14089999)(14040000,14090000)	1	14065000	0.022083	0.016525	0.026846	0.029560	0.025345	0.053421	-0.004116	0.039898	0.034122
(48917,98170)(14050000,14099999)(14050000,14100000)	1	14075000	0.022132	0.022082	0.026742	0.025564	0.025262	0.037071	0.005226	0.024827	0.020671
(74,49191)(14000000,14049999)(14000000,14050000)	10	14025000	0.022704	0.101955	0.026479	0.095713	0.025102	0.001924	0.103108	-0.048378	-0.002500
(9734,58555)(14010000,14059999)(14010000,14060000)	10	14035000	0.022779	0.102670	0.026425	0.088015	0.025118	0.002721	0.098870	-0.043342	-0.006738

The last 3 columns are the populations branch statistic for population1, popultion2 and population3

Relative window positions?

We allow for 3 different ways of defining window positions, these are chosen with the -type argument in realSFS


-type 2 Use pos=1 as the leftmost position of first window. Even though there isn't any data.
-type 1 Use first position with data, as leftmost position for the first window.
-type 0 Split out the genome into blocks. And use the first window that have data for the entire window. Then we will have the same windowcenters across datasets.

realSFS fst print

You can print out the precalculated A and B with

./realSFS fst print pop1.pop2.fst.idx

Assuming we have pop1.saf.idx, pop2.saf.idx.

./realSFS pop1.saf.idx pop2.saf.idx >pop1.pop2.saf.idx.ml
./realSFS fst index pop1.saf.idx pop2.saf.idx -fstout pop1.pop2 -sfs pop1.pop2.saf.idx.ml
./realSFS fst print pop1.pop2.fst.idx

The weighted fst for a region is the ratio between the sum of As and the sum of B. The unweighted is the mean of the persite ratios.

A is the alpha from the reynolds 1983 (or Bhatia) and B is the alpha + beta.