NgsRelate: Difference between revisions
Revision as of 22:25, 19 June 2015
Brief description
This page contains information about the program NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low-coverage NGS data by using genotype likelihoods. To infer relatedness you need to know the population allele frequencies and have genotype likelihoods. Both can be obtained with, for example, the program ANGSD, as shown in the example.
Installation
The primary repository is on GitHub.
Download and installation of the C program
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o NgsRelate
Run example using C
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed by the NgsRelate program.
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
# this generates an angsdput.mafs.gz and an angsdput.glf.gz.
# we need to extract the frequency column from the mafs file and remove the header;
# the mafs file is gz compressed, so it is decompressed before cut
zcat angsdput.mafs.gz | cut -f5 | sed 1d >freq
./NgsRelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
Here we specify that our binary genotype likelihood file contains 100 samples (-n 100), and that we want to run the analysis for the first two samples (-a 0 -b 1). If -a and -b are not specified, the program loops through all pairs.
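The frequency-extraction step above can also be sketched in Python, which may be convenient where zcat, cut and sed are not available. This is a minimal sketch, not part of NgsRelate or ANGSD: the function name extract_freqs is hypothetical, and it assumes the mafs file is tab-separated with a single header line and the frequency in the fifth column, as the cut -f5 and sed 1d commands imply.

```python
import gzip


def extract_freqs(mafs_path, freq_path, column=4):
    """Copy one column of a gzipped, tab-separated mafs file to a text file.

    column is zero-based, so the default 4 corresponds to `cut -f5`.
    The first line is treated as a header and skipped (the `sed 1d` step).
    Returns the number of sites written.
    """
    n_sites = 0
    with gzip.open(mafs_path, "rt") as src, open(freq_path, "w") as dst:
        next(src, None)  # drop the header line
        for line in src:
            fields = line.rstrip("\n").split("\t")
            dst.write(fields[column] + "\n")
            n_sites += 1
    return n_sites


# Usage (file names taken from the example above):
#   extract_freqs("angsdput.mafs.gz", "freq")
```

The returned site count can be compared with the number of sites in the glf file, since the two inputs must describe the same sites.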
Output file format
Example of output
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
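The first column (Pair) gives the zero-based indices of the two individuals. The next three columns (k0, k1, k2) are the estimated relatedness coefficients. These are followed by the log-likelihood (loglh), the number of iterations used (nIter), and the coverage column.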
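For downstream analysis, a result line can be split into named fields. The following is a minimal sketch, assuming the whitespace-separated layout of the example output and a pair written as (a,b) without spaces; the names PairResult and parse_result_line are hypothetical and not part of NgsRelate.

```python
from typing import NamedTuple


class PairResult(NamedTuple):
    a: int           # index of the first individual
    b: int           # index of the second individual
    k0: float
    k1: float
    k2: float
    loglh: float
    n_iter: int
    coverage: float


def parse_result_line(line):
    """Parse one data line of the output (not the header line)."""
    fields = line.split()
    a, b = (int(x) for x in fields[0].strip("()").split(","))
    k0, k1, k2, loglh = (float(x) for x in fields[1:5])
    return PairResult(a, b, k0, k1, k2, loglh, int(fields[5]), float(fields[6]))


example = "(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658"
res = parse_result_line(example)
# The three relatedness coefficients are proportions and should sum to
# about 1, up to the rounding of the printed values.
k_sum = res.k0 + res.k1 + res.k2
```

Checking that k0 + k1 + k2 is close to 1 is a cheap way to catch lines that were split or truncated incorrectly.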
Input file format
The genotype likelihood input file is binary and gz compressed. It contains log likelihood ratios encoded as doubles, with 3 values per sample. The freq file may be plain text or gz compressed.
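To inspect a genotype likelihood file by hand, the description above can be turned into a small reader. This is only a sketch under stated assumptions: that the doubles are 8 bytes and little-endian, and that the file is laid out site by site with 3 consecutive values for each sample within a site. Neither detail is stated on this page, and the function name read_glf is hypothetical.

```python
import gzip
import struct


def read_glf(path, n_samples):
    """Read a gz-compressed binary file of doubles, 3 values per sample.

    Assumed layout (not confirmed by this page): little-endian doubles,
    stored site by site, samples in order within each site.
    Returns a list of sites; each site is a list of n_samples 3-tuples.
    """
    record = struct.Struct("<%dd" % (3 * n_samples))  # one site
    sites = []
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(record.size)
            if not chunk:
                break  # clean end of file
            if len(chunk) != record.size:
                raise ValueError(
                    "file length is not a multiple of 3 * n_samples doubles"
                )
            values = record.unpack(chunk)
            sites.append([values[i:i + 3] for i in range(0, len(values), 3)])
    return sites


# Usage (file name and sample count taken from the example above):
#   sites = read_glf("angsdput.glf.gz", 100)
#   len(sites) should then equal the number of lines in the freq file.
```

A ValueError here usually points to a wrong n_samples value, which is the same kind of mismatch the program itself reports when sites and individuals do not agree between files.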
Citing and references
relateAdmix
Moltke, I., Albrechtsen, A. (2013). RelateAdmix: a software tool for estimating relatedness between admixed individuals. Bioinformatics.
ADMIXTURE
D.H. Alexander, J. Novembre, and K. Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19:1655–1664, 2009.
Change log
- 0.14 made more usable on Mac (I think). Thanks to Paul Lott for reporting it and for suggestions, and to Thorfinn Sand for changing it
- 0.13 added extra check for file exists to give instant errors + changes all printf to fprintf(stderr,
- 0.11 changed threading to a fixed pool of threads
- 0.10 optimized code
- 0.09 added error for when the number of sites and individuals does not match between files
- 0.08 fixed a bug that would sometimes print an extra line when running with multiple threads
- 0.07 fixed a small leak