|  |   | 
| (101 intermediate revisions by 2 users not shown) | 
| Line 1: | Line 1: | 
|  | =Brief description= |  | = NEW VERSION =   | 
|  | This page contains information about theprogram called NgsRelate, which can be used to infer relatedness coefficients for pairs ofindividuals for low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods. To be able to infer the relatednessyou will need toknow the population frequencies and have genotype likelihoods. This can be obtained e.g.using the program ANGSDas shown in the example below.
 |  | For the NEW version of ngsRelate that coestimates relatedness and inbreeding go to this link https://github.com/ANGSD/NgsRelate | 
|  | 
 |  | 
 | 
|  | =Download and Installation=
 |  | 
|  | Primary repository is github. https://github.com/ANGSD/NgsRelate
 |  | 
|  | <pre>
 |  | 
|  | curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
 |  | 
|  | g++ NgsRelate.cpp -O3 -lz -o ngsrelate
 |  | 
|  | </pre>
 |  | 
|  | 
 |  | 
 | 
|  | = Run example using only NGS data= |  | = OLD VERSION =   | 
|  | Assume we have file containing paths to 100 BAM/CRAM files, then we can useANGSD to estimate frequencies calculate genotype likelihoods while doing SNP calling and dumping the input files needed for the NgsRelate program
 |  | For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694 | 
|  | <pre>
 |  | 
|  | ./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 - domaf 1 -minmaf 0.05 -doGlf 3
 |  | 
|  | #thisgenerates an angsdput.mafs.gz and a angsdput.glf.gz.
 |  | 
|  | #we will need to extract the frequency column from the mafs file and remove the header
 |  | 
|  | cut -f5 angsdput.mafs.gz |sed 1d >freq
 |  | 
|  | ./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
 |  | 
|  | </pre>
 |  | 
|  | Here we specify that our binary genotype likelihood file contains 100 samples, and that we want to run the analysis for the first two samples -a 0 -b 1.
 |  | 
|  | If no -a and -b are specified it will loop through all pairs
 |  | 
|  |   |  | 
|  | = Run example using NGS data with plink population frequencies=
 |  | 
|  | In this example we show how one can estimate relatedness between a number of individuals which we have NGS data from in bam files whose paths are listed in the file "list" using genetic data from PLINK files for frequency estimation. 
 |  | 
|  |   |  | 
|  | Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with HapMap3 data in binary PLINK format (hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in the HapMap dataset. Then using first PLINK and then ANGSD we can produce allele frequency estimates in a format that NgsRelate can use as follows:
 |  | 
|  | <pre>
 |  | 
|  | ### extract population from huge binary plink file
 |  | 
|  | plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam  --make-bed --out hapmap3Hg19LWK --noweb
 |  | 
|  |   |  | 
|  | ### calculate frequencies for the subpopulation
 |  | 
|  | plink --bfile  hapmap3Hg19LWK --freq --noweb --out LWKsub
 |  | 
|  |   |  | 
|  | ### we are only interested in the output from the snpsites with the seq data. so we extract the chr,pos,major,minor
 |  | 
|  | cut -f1,4-6  hapmap3Hg19LWK.bim >forAngsd.txt
 |  | 
|  |   |  | 
|  | ### index the file for angsd
 |  | 
|  | angsd sites index forAngsd.txt
 |  | 
|  |   |  | 
|  | </pre>
 |  | 
|  | Afterwards we can use ANGSD to calculate genotype likelihoods      
 |  | 
|  | <pre>
 |  | 
|  | ### calculate genotype likelihoods for the bam files whose paths are listed in the file 'list'
 |  | 
|  | angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
 |  | 
|  | ### this generates the output files angsdput.glf.gz and a angsdput.glf.pos.gz.
 |  | 
|  | </pre>
 |  | 
|  |   |  | 
|  | Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
 |  | 
|  | <pre>
 |  | 
|  | ### extract the frequencies and sync it to the angsd output
 |  | 
|  | ngsrelate extract_freq angsdput.glf.pos.gz files/hapmap3Hg19LWK.bim files/LWKsub.frq >freq
 |  | 
|  |   |  | 
|  | ### run ngsrelate 
 |  | 
|  | ngsrelate  -g angsdput.glf.gz -n 6 -f freq >res
 |  | 
|  | </pre>
 |  | 
|  | The final relatedness estimates will then be available in the file called "res".
 |  | 
|  |   |  | 
|  | =Output=
 |  | 
|  | Example of output of with two samples
 |  | 
|  | <pre>
 |  | 
|  | Pair	k0	k1	k2	loglh	nIter	coverage
 |  | 
|  | (0,1)	0.673213	0.326774	0.000013	-1710940.769941	19	0.814658
 |  | 
|  | </pre>
 |  | 
|  |   |  | 
|  | Example of output  with 6 samples:
 |  | 
|  | <pre>
 |  | 
|  | cat resi
 |  | 
|  | Pair	k0	k1	k2	loglh	nIter	coverage
 |  | 
|  | (0,1)	0.675337	0.322079	0.002584	-1710946.832375	10	0.813930
 |  | 
|  | (0,2)	0.458841	0.526377	0.014782	-1666215.528333	10	0.808822
 |  | 
|  | (0,3)	1.000000	0.000000	0.000000	-1743992.363193	-1	0.816266
 |  | 
|  | (0,4)	1.000000	0.000000	0.000000	-1759202.971213	-1	0.818856
 |  | 
|  | (0,5)	1.000000	0.000000	0.000000	-1550475.615322	-1	0.752663
 |  | 
|  | (1,2)	0.007111	0.991020	0.001868	-1580995.130867	10	0.806912
 |  | 
|  | (1,3)	1.000000	0.000000	0.000000	-1728859.988212	-1	0.814272
 |  | 
|  | (1,4)	1.000001	-0.000001	0.000000	-1744055.203870	9	0.816887
 |  | 
|  | (1,5)	1.000000	0.000000	0.000000	-1536858.187440	-1	0.750917
 |  | 
|  | (2,3)	1.000000	0.000000	0.000000	-1705157.832621	-1	0.809297
 |  | 
|  | (2,4)	1.000000	0.000000	0.000000	-1719681.338365	-1	0.811804
 |  | 
|  | (2,5)	1.000000	0.000000	0.000000	-1517388.260612	-1	0.746903
 |  | 
|  | (3,4)	0.547602	0.439423	0.012975	-1743899.789842	10	0.819276
 |  | 
|  | (3,5)	0.265819	0.482953	0.251228	-1467343.087647	10	0.754637
 |  | 
|  | (4,5)	0.004655	0.995345	-0.000000	-1473415.049411	8	0.755734
 |  | 
|  | </pre>
 |  | 
|  |   |  | 
|  | The first column contain the individuals that was used for the analysis . The next three columns are the estimated relatedness coefficient. The fifth column is the log of the likelihood of the MLE. The sixth column is the number of iterations required to find the MLE, and finally the seventh column is fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals, and the minor allele frequency (MAF) above the threshold (default is 0.05 but may also user specified).
 |  | 
|  |   |  | 
|  | = Input file format = |  | 
|  | The input files are binary gz compressed, log like ratios encoded as double. 3 values per sample.
 |  | 
|  | The freq file is allowed to be gz compressed.
 |  | 
|  |   |  | 
|  | = Citing and references =
 |  | 
|  |   |  | 
|  | = Changelog =
 |  | 
|  | See github for log
 |  |