Intern: Difference between revisions

From software
Jump to navigation Jump to search
No edit summary
Line 1: Line 1:
This sofware is for internal use. If you found this page then congrats. Feel free you use the methods but they might not work, give wrong results or other nasty stuff. If and when they fuck up don't contact us or complain.  
This sofware is for internal use. If you found this page den congrats. Feel free you use the methods but they might not work, give wrong results or other nasty stuff. If and then they fuck up don't contact us or complain.  


=Secret software=
=Secret software=
Line 18: Line 18:
==chisquare==
==chisquare==
A small library build using numerical recipies third edition for calculating stuff relating to the chisqdistribution
A small library build using numerical recipies third edition for calculating stuff relating to the chisqdistribution
 


==bambi/bammer==
==bambi/bammer==
Line 27: Line 27:
[[bambi.tar.gz]]
[[bambi.tar.gz]]
===Usage===
===Usage===
Program is invoked with either
Program is invoked with either:
 
<pre>
<example>
./bammer view    #samtools view/ single sample only
./bammer view    #samtools view/ single sample only
./bammer uppile  #This does emulates the samtools mpileup
./bammer uppile  #This does emulates the samtools mpileup
</example>
</pre>




Inputfiles are supplied with either
Inputfiles are supplied with either
<example>
<pre>
tmp.bam                #single bamfile
tmp.bam                #single bamfile
-b bam.list            #filelist containing bamfiles
-b bam.list            #filelist containing bamfiles
-b bam.list -nInd INT  #only select the first INT samples from bam.list
-b bam.list -nInd INT  #only select the first INT samples from bam.list
</example>
</pre>




Regions of interest can be supplied by defining a region in the following way
Regions of interest can be supplied by defining a region in the following way
<example>
<pre>
-r chr:a-b  #select chromsome chr from a, to b
-r chr:a-b  #select chromsome chr from a, to b
-r chr:-b  #select chromosome chr from beginning to be
-r chr:-b  #select chromosome chr from beginning to be
-r chr:b-  #select chromosome chr from b to end of chromosome
-r chr:b-  #select chromosome chr from b to end of chromosome
</example>
</pre>


If we want to dump different regions in the same run, this can be done with
If we want to dump different regions in the same run, this can be done with
<example>
<pre>
-rf regions.file  #every region is on a newline, has same format as above
-rf regions.file  #every region is on a newline, has same format as above
</example>
</pre>




Program can also estimate genotypelikelihoods if uppile is selected
Program can also estimate genotypelikelihoods if uppile is selected


<example>
<pre>
-GL 0    #SOAPsnp
-GL 0    #SOAPsnp
-GL 1    #samtools
-GL 1    #samtools
-GL 2    #gatk
-GL 2    #gatk
</example>
</pre>


Output is hardcode dumped in glfs.glfs, these is a binary double file. Information for first site is the first 10*sizeof(double)*nInd bytes. This is the samtools ordering AA,AC,AG,...
Output is hardcode dumped in glfs.glfs, these is a binary double file. Information for first site is the first 10*sizeof(double)*nInd bytes. This is the samtools ordering AA,AC,AG,...
Line 71: Line 70:


Below are some examples for running the program
Below are some examples for running the program
<example>
<pre>
#view commands below
#view commands below
./bammer view new2.bam
./bammer view new2.bam
Line 82: Line 81:
#samtools estimation of GL
#samtools estimation of GL
./bammer uppile -b lucampHighBam.filelist -nInd 2 -r chr1:-376920 -GL 1
./bammer uppile -b lucampHighBam.filelist -nInd 2 -r chr1:-376920 -GL 1
</example>
</pre>


** Technics
** Technics
Line 93: Line 92:


to compile
to compile
<example>
<pre>
g++ -O3 -o ieatgor ieatgor.cpp -lz
g++ -O3 -o ieatgor ieatgor.cpp -lz
</example>
</pre>


The target file must be sorted. example of target file
The target file must be sorted. example of target file
<example>
<pre>
chr1:22323:232322
chr1:22323:232322
chr10:10000:10000000
chr10:10000:10000000
chr3:10000:10000000
chr3:10000:10000000
  </example>
  </pre>


A "gor" file is a file with the first coloum being the chromosome and the second coloum being the positions. The rest of the coloums are not important. The gor file must be sorted based on the positions and chr. gz gor files are also accepted.  
A "gor" file is a file with the first coloum being the chromosome and the second coloum being the positions. The rest of the coloums are not important. The gor file must be sorted based on the positions and chr. gz gor files are also accepted.  
example of a gor file
example of a gor file
  <example>
  <pre>
chr1    1033267 2      3      know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
chr1    1033267 2      3      know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
chr1    1033268 0      2      know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
chr1    1033268 0      2      know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
Line 114: Line 113:
chr1    1033304 0      1      know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
chr1    1033304 0      1      know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
...
...
  </example>
  </pre>


run example
run example
<example>
<pre>
./ieatgor file.target file.gor
./ieatgor file.target file.gor
</example>
</pre>
the extracted line are returned to standard out.  
the extracted line are returned to standard out.  


Line 129: Line 128:


Download here [[getliners6.cpp]] .  
Download here [[getliners6.cpp]] .  
<example>
<pre>
g++ getliners5.cpp -lz -O3 -o getliners
g++ getliners5.cpp -lz -O3 -o getliners
</example>
</pre>




Line 147: Line 146:


It can be downloaded here [[abbababatest.cpp]] and can be compiled with:
It can be downloaded here [[abbababatest.cpp]] and can be compiled with:
<example>
<pre>
g++ -O3 abbababatest.cpp -o calcABBABABA -lz
g++ -O3 abbababatest.cpp -o calcABBABABA -lz
</example>
</pre>
To get all options just type  
To get all options just type  
<example>
<pre>
  ./calcABBABABA  
  ./calcABBABABA  
</example>
</pre>
An example run cmd
An example run cmd
<example>
<pre>
../calcABBABABA -file /space/genomes/refgenomes/ancestal/hg19/shortForm/chr20.ans -bam  testbamfilelistbotoV2chr20.txt -outfileprefix testbotoV2Chr20plus1x2 -blocksz 5  
../calcABBABABA -file /space/genomes/refgenomes/ancestal/hg19/shortForm/chr20.ans -bam  testbamfilelistbotoV2chr20.txt -outfileprefix testbotoV2Chr20plus1x2 -blocksz 5  
</example>
</pre>
The program outputs three files: an abbababa file (with only abbababa stats), a count file (with overall counts for the chromosome) and a summary file where the compile time, run time and the cmds and the input files used when the program was run are all listed.  
The program outputs three files: an abbababa file (with only abbababa stats), a count file (with overall counts for the chromosome) and a summary file where the compile time, run time and the cmds and the input files used when the program was run are all listed.  


Line 164: Line 163:
[[Rfun/][Folder]]
[[Rfun/][Folder]]


<example>
<pre>
  Rscript /home/software/public/software/intern/Rfun/plink.plot.R <plink output file>
  Rscript /home/software/public/software/intern/Rfun/plink.plot.R <plink output file>
</example>
</pre>




Line 176: Line 175:


Download path
Download path
<example>
<pre>
/home/software/public/software/intern/ngsAdmix/
/home/software/public/software/intern/ngsAdmix/
</example>
</pre>


*** compile
*** compile
compile with icpc or g++ (icpc much better)
compile with icpc or g++ (icpc much better)
<example>
<pre>
icpc -O3 -o ngsAdmix ngsAdmix12.cpp -lz -pthread
icpc -O3 -o ngsAdmix ngsAdmix12.cpp -lz -pthread
g++ -O3 -o ngsAdmix ngsAdmix12.cpp -lz -pthread
g++ -O3 -o ngsAdmix ngsAdmix12.cpp -lz -pthread
</example>
</pre>


*** options
*** options
options are shown with  
options are shown with  
<example>
<pre>
./ngsAdmix
./ngsAdmix
</example>
</pre>
All options also have a short form (first letter)
All options also have a short form (first letter)


Line 199: Line 198:


*** run example
*** run example
<example>
<pre>
./ngsAdmix -l dat.beagle -i 5000 -K 3 -P 5 -m 1 -o datK3
./ngsAdmix -l dat.beagle -i 5000 -K 3 -P 5 -m 1 -o datK3
</example>
</pre>




Line 209: Line 208:
You might have the change the chainFile path and the liftover path if you are not on pontus
You might have the change the chainFile path and the liftover path if you are not on pontus
Note that if two SNP map to the same position after liftover one of them is removed.  
Note that if two SNP map to the same position after liftover one of them is removed.  
<example>
<pre>
#!/bin/bash
#!/bin/bash
#make liftover file
#make liftover file
Line 258: Line 257:
$plink  --noweb --ped ${Files}Red.ped --map hg19.map --flip flipSNPs.txt --recode --make-bed --out ${Files}Hg19
$plink  --noweb --ped ${Files}Red.ped --map hg19.map --flip flipSNPs.txt --recode --make-bed --out ${Files}Hg19
$plink  --noweb --bfile ${Files}Hg19 --recode --out ${Files}Hg19
$plink  --noweb --bfile ${Files}Hg19 --recode --out ${Files}Hg19
</example>
</pre>


*** run example
*** run pre
<example>
<pre>
./liftover.sh myPlinkFile
./liftover.sh myPlinkFile
</example>
</pre>


* angsd
* angsd
Line 283: Line 282:




<example>
<pre>
Rscript plotAdmix_anders.R
Rscript plotAdmix_anders.R
Rscript plotAdmix2.R
Rscript plotAdmix2.R
</example>
</pre>




Line 293: Line 292:
plot the LD in a region. You need SNPmatrix working. If you cannot get SNPmatrix to work use Relate instead (ldsnp3 function) or if on pontus use
plot the LD in a region. You need SNPmatrix working. If you cannot get SNPmatrix to work use Relate instead (ldsnp3 function) or if on pontus use


<example>
<pre>
library(snpMatrix,lib="/home/albrecht/R/x86_64-pc-linux-gnu-library/2.15/")
library(snpMatrix,lib="/home/albrecht/R/x86_64-pc-linux-gnu-library/2.15/")
</example>
</pre>


R function:
R function:
Line 301: Line 300:


*** example
*** example
<example>
<pre>
library(snpMatrix)
library(snpMatrix)
#snp info from plink files
#snp info from plink files
Line 333: Line 332:
colR=colorRampPalette(c("white","grey","black"))(40)
colR=colorRampPalette(c("white","grey","black"))(40)
ldplot(ldinfo$rsq2,pos=plB[keep,4],main="LD (r2)",colR=colR)
ldplot(ldinfo$rsq2,pos=plB[keep,4],main="LD (r2)",colR=colR)
</example>
</pre>


This example produces the plots: [[andersLink/ldpic1.pdf][pdf]]
This example produces the plots: [[andersLink/ldpic1.pdf][pdf]]

Revision as of 16:07, 19 March 2014

This sofware is for internal use. If you found this page den congrats. Feel free you use the methods but they might not work, give wrong results or other nasty stuff. If and then they fuck up don't contact us or complain.

Secret software

Grepper

Formerly known as getliner. Used for grep'ing with in columns

Sortdep

Fast counting of integers

RefFinder

Extract bases from a fasta file

Other stuff

chisquare

A small library build using numerical recipies third edition for calculating stuff relating to the chisqdistribution


bambi/bammer

(Thorfinn, april20 2012)

Download

bambi.tar.gz

Usage

Program is invoked with either:

./bammer view     #samtools view/ single sample only
./bammer uppile   #This does emulates the samtools mpileup


Inputfiles are supplied with either

tmp.bam                #single bamfile
-b bam.list            #filelist containing bamfiles
-b bam.list -nInd INT  #only select the first INT samples from bam.list


Regions of interest can be supplied by defining a region in the following way

-r chr:a-b  #select chromsome chr from a, to b
-r chr:-b   #select chromosome chr from beginning to be
-r chr:b-   #select chromosome chr from b to end of chromosome

If we want to dump different regions in the same run, this can be done with

-rf regions.file  #every region is on a newline, has same format as above


Program can also estimate genotypelikelihoods if uppile is selected

-GL 0    #SOAPsnp
-GL 1    #samtools
-GL 2    #gatk

Output is hardcode dumped in glfs.glfs, these is a binary double file. Information for first site is the first 10*sizeof(double)*nInd bytes. This is the samtools ordering AA,AC,AG,...

Information about the position is dumped in mafs.mafs. This is NOT the MAF, but simply a 2 column file with chr tab.


Below are some examples for running the program

#view commands below
./bammer view new2.bam
./bammer view bwape_sort_rmdup.bam chr21:9483252-9672320
./bammer view bwape_sort_rmdup.bam chrY:-1000000

#mpileup below
./bammer uppile -b lucampHighBam.filelist -nInd 2 -r chr1:-376920

#samtools estimation of GL
./bammer uppile -b lucampHighBam.filelist -nInd 2 -r chr1:-376920 -GL 1
    • Technics

Program spawns a single thread, when a chunk has been read, and the thread is not finished it will wait.


  • ieatgor

Extract lines from file in "gor" format. The files can be found in this [[ieatgor/][folder]]

to compile

g++ -O3 -o ieatgor ieatgor.cpp -lz

The target file must be sorted. example of target file

chr1:22323:232322
chr10:10000:10000000
chr3:10000:10000000
 

A "gor" file is a file with the first coloum being the chromosome and the second coloum being the positions. The rest of the coloums are not important. The gor file must be sorted based on the positions and chr. gz gor files are also accepted. example of a gor file

chr1    1033267 2       3       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
chr1    1033268 0       2       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
chr1    1033279 3       1       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
chr1    1033286 0       2       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
chr1    1033297 0       2       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
chr1    1033304 0       1       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
...
 

run example

./ieatgor file.target file.gor

the extracted line are returned to standard out.

  • getliner

NB HUGE BUG, must +1 when using -l and zipped files. Sorry for inconveincace. (tsk)


This is a general tool for extract lines from newline seperated textfiles (can be gz compressed). Getliners4 allows for complement grep, use '-v 1'. Default is '-v 0'. Program can also print out basic statistics for hit/nonhit if -i filename is supplied. getliners6.cpp tokinzes on \r, so should work on windows files.

Download here getliners6.cpp .

g++ getliners5.cpp -lz -O3 -o getliners


Program can either extract specific lines, or "keys". The linenumbers to extract must be 1-indexed (first line i 1). When using keys, you must supply which column to 'grep' for. The program builds an associative array of the keys. And therefore only a single pass of the datafile is required.


Fields in the datafile can be seperated by a delimiter that can be specified at runtine with the "-d " argument. Default is space/tab. If delimter is an escape sequence it will not work and the code should be modified.


  • calcABBABABA

This is program that calculates ABBA-BABA related stats for a given chromosome.

It can be downloaded here abbababatest.cpp and can be compiled with:

g++ -O3 abbababatest.cpp -o calcABBABABA -lz

To get all options just type

 ./calcABBABABA 

An example run cmd

../calcABBABABA -file /space/genomes/refgenomes/ancestal/hg19/shortForm/chr20.ans -bam  testbamfilelistbotoV2chr20.txt -outfileprefix testbotoV2Chr20plus1x2 -blocksz 5 

The program outputs three files: an abbababa file (with only abbababa stats), a count file (with overall counts for the chromosome) and a summary file where the compile time, run time and the cmds and the input files used when the program was run are all listed.

  • plot Plink

[[Rfun/][Folder]]

 Rscript /home/software/public/software/intern/Rfun/plink.plot.R <plink output file>


  • ngsAdmix
      • download

[[ngsAdmix/][Download folder]]

Download path

/home/software/public/software/intern/ngsAdmix/
      • compile

compile with icpc or g++ (icpc much better)

icpc -O3 -o ngsAdmix ngsAdmix12.cpp -lz -pthread
g++ -O3 -o ngsAdmix ngsAdmix12.cpp -lz -pthread
      • options

options are shown with

./ngsAdmix

All options also have a short form (first letter)

- default stopping criteria is either tolerence 0.0001 for emSq
- default stopping criteria is difference in loglikelihoods of 0.1 over 50 iterations for both em and emSq
- default optimaziation is emSq
      • run example
./ngsAdmix -l dat.beagle -i 5000 -K 3 -P 5 -m 1 -o datK3


  • liftover a plink file


You might have the change the chainFile path and the liftover path if you are not on pontus Note that if two SNP map to the same position after liftover one of them is removed.

#!/bin/bash
#make liftover file
Files=$1
liftoverProg=/home/albrecht/bin/prog/liftover/liftOver
chainFile=/home/albrecht/bin/prog/liftover/chainFiles/hg18ToHg19.over.chain
plink=/home/albrecht/bin/prog/plink-1.07-x86_64/plink

if [ "$#" -eq 0 ]; then
    echo "supply a plink prefix"
    exit
fi

if [ -f $Files.map ]
then
    echo using file $Files.map
else
    echo the file ${Files}.map does not exists \(use plink --recode\)
fi

if [ -f $Files.bed ]
then
    echo using file $Files.bed
else
    echo the file $Files.bed does not exists  \(use plink --make-bed\)
fi
#make bed file for liftover
Rscript -e "options(scipen=10);map<-read.table('$Files.map',hea=F,as.is=T);res<-cbind(paste('chr',map[,1],sep=''),map[,4]-1,map[,4],map[,2],0,'+');write.table(res,file='$Files.bedfile',col=F,row=F,qu=F,sep='\t')"

#liftover
$liftoverProg $Files.bedfile $chainFile ${Files}Hg19.bedfile ${Files}Hg19.unmapped

# rm unmapped
grep -v \# ${Files}Hg19.unmapped  | cut -f4 > rmSNPs.txt
#rm duplicates
cat ${Files}Hg19.bedfile | sort -k1 | rev | uniq -d -f3 | rev | cut -f4 > dubSNP.txt
echo removing `wc -l dubSNP.txt`
cat dubSNP.txt >> rmSNPs.txt

$plink --noweb --bfile $Files --exclude rmSNPs.txt --recode --out ${Files}Red 
#change pos

Rscript -e "liftmap<-read.table('${Files}Hg19.bedfile',hea=F,colC=c('character','integer')[c(1,2,2,1,2,1)]);liftmap<-read.table('${Files}Hg19.bedfile',hea=F,as.is=T);map<-read.table('${Files}Red.map',hea=F,as.is=T); liftmap<-liftmap[liftmap[,4]%in%map[,2],];rownames(map)<-map[,2];map[liftmap[,4],4]<-liftmap[,3];map[liftmap[,4],1]<-sub('chr','',liftmap[,1]);write.table(map,file='hg19.map',col=F,row=F,qu=F)"

echo The new plink file is called  ${Files}Hg19 and are on the same strand as the original file
#flip strands 
cat ${Files}Hg19.bedfile | grep \- | cut -f4 > flipSNPs.txt
$plink  --noweb --ped ${Files}Red.ped --map hg19.map --flip flipSNPs.txt --recode --make-bed --out ${Files}Hg19
$plink  --noweb --bfile ${Files}Hg19 --recode --out ${Files}Hg19
      • run pre
./liftover.sh myPlinkFile
  • angsd

[[angsd/][Download folder]]


  • plot admix results
      • Download

[[Rfun/plotAdmix_anders.R][Get the R script]]

or the more fancy one

[[Rfun/plotAdmix2.R][fancier plotter (ida)]]

to get the options type


Rscript plotAdmix_anders.R
Rscript plotAdmix2.R


  • plot LD region from plink file

plot the LD in a region. You need SNPmatrix working. If you cannot get SNPmatrix to work use Relate instead (ldsnp3 function) or if on pontus use

library(snpMatrix,lib="/home/albrecht/R/x86_64-pc-linux-gnu-library/2.15/")

R function: <include file="andersLink/ldplot.R" markup="example">

      • example
library(snpMatrix)
#snp info from plink files
pl<-read.plink("/space/anders/greenland/data/plink/datGRDKplus")
plB<-read.table("/space/anders/greenland/data/plink/datGRDKplus.bim")

#choose region
min.pos<-70e6
max.pos<-81e6
chr=1
minMaf<-0.05
minCallrate<-0.95
depth<-400 #depth of ld calculations (window size)

# calculatate maf and callrate
info<-col.summary(pl)

#chose SNPs
table(keep <- plB[,1]==chr&plB[,4]>min.pos&plB[,4]<max.pos & info$MAF>minMaf & !is.na(info$MAF) & info$Call.rate > minCallrate)

#estimate LD
ldinfo <- ld.snp(pl[,keep],depth=depth)

#plot LD (D')
ldplot(ldinfo$dprime,pos=plB[keep,4],main="LD (D')")

#plot R squared
ldplot(ldinfo$rsq2,pos=plB[keep,4],main="LD (r2)")

#change color palette
colR=colorRampPalette(c("white","grey","black"))(40)
ldplot(ldinfo$rsq2,pos=plB[keep,4],main="LD (r2)",colR=colR)

This example produces the plots: [[andersLink/ldpic1.pdf][pdf]]