|   |   | 
| (79 intermediate revisions by 2 users not shown) | 
| Line 1: | Line 1: | 
|  | =Brief description= |  | = NEW VERSION =   | 
|  | This page contains information about theprogram called NgsRelate, which can be used to infer relatedness coefficients for pairs ofindividuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatednessyou will need toknow the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.
 |  | For the NEW version of ngsRelate that coestimates relatedness and inbreeding go to this link https://github.com/ANGSD/NgsRelate | 
|  | 
 |  | 
 | 
|  | =How to download and install=
 |  | 
|  | The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a linux system with curl and g++ installed NgsRelate can be downloaded and installed as follows:  
 |  | 
|  | <pre>
 |  | 
|  | curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
 |  | 
|  | g++ NgsRelate.cpp -O3 -lz -o ngsrelate
 |  | 
|  | </pre>
 |  | 
|  | 
 |  | 
 | 
|  | =Run examples= |  | = OLD VERSION =   | 
|  | Below is two examples of how NgsRelate can be used to estimate relatedness from NGS data. To be able to run all steps of the examples you need to have the programs ANGSD and PLINK installed. 
 |  | For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694 | 
|  | NB. If you just want to try out NgsRelate without going through all the intermediate steps of generating input data (which takes some time!) you can download the final input data for NgsRelate used in the last example here: . And you can find the expected output from the same example here: .
 |  | 
|  |   |  | 
|  | == Run example using only NGS data==
 |  | 
|  | Assume we have file containing paths to 100 BAM/CRAM files. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling and in the end produce the the input files needed for the NgsRelate program as follows:
 |  | 
|  | <pre>
 |  | 
|  | ### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotpe likelihoods (angsdput.glf.gz).
 |  | 
|  | angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 - domaf 1 -minmaf 0.05 -doGlf 3
 |  | 
|  |   |  | 
|  | ### Then we extract the frequency column from theallele frequency file and remove the header (to make it in the format NgsRelate needs)
 |  | 
|  | cut -f5 angsdput.mafs.gz |sed 1d >freq
 |  | 
|  | </pre>
 |  | 
|  | Once we have these files we can useNgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (0 and 1) we can do it using the following command: 
 |  | 
|  | <pre>
 |  | 
|  | ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
 |  | 
|  | </pre>
 |  | 
|  | Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the number of the two individuals after the options "-a" and "-b" . If -a and -b are not specified NgsRelate will loop through all pairs of individuals in the input file.
 |  | 
|  |   |  | 
|  | = Run example using NGS data with population frequencies estimated from genetic data from PLINK files =
 |  | 
|  | In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation. 
 |  | 
|  |   |  | 
|  | Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in thisdataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
 |  | 
|  | <pre>
 |  | 
|  | ### extract individuals from LWK from huge binary plink file
 |  | 
|  | plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam  --make-bed --out hapmap3Hg19LWK --noweb
 |  | 
|  |   |  | 
|  | ### calculate frequencies for this population
 |  | 
|  | plink --bfile  hapmap3Hg19LWK --freq --noweb --out LWKsub
 |  | 
|  | </pre>
 |  | 
|  | Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency in     
 |  | 
|  | <pre>
 |  | 
|  | ### extract the chr,pos,major,minor information about the sites we have frequency info from into a file 
 |  | 
|  | ### (so we can extract data from these sites from the NGS data files) 
 |  | 
|  | cut -f1,4-6  hapmap3Hg19LWK.bim >forAngsd.txt
 |  | 
|  |   |  | 
|  | ### index this file for angsd
 |  | 
|  | angsd sites index forAngsd.txt
 |  | 
|  |   |  | 
|  | ### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files 
 |  | 
|  | ### (assuming the paths to the bam files are listed in the file 'list'):
 |  | 
|  | angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
 |  | 
|  | ### this generates the output files angsdput.glf.gz and a angsdput.glf.pos.gz.
 |  | 
|  | </pre>
 |  | 
|  |   |  | 
|  | Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
 |  | 
|  | <pre>
 |  | 
|  | ### extract the frequencies and sync it to the angsd output
 |  | 
|  | ngsrelate extract_freq angsdput.glf.pos.gz files/hapmap3Hg19LWK.bim files/LWKsub.frq >freq
 |  | 
|  |   |  | 
|  | ### run ngsrelate 
 |  | 
|  | ngsrelate  -g angsdput.glf.gz -n 6 -f freq >res
 |  | 
|  | </pre>
 |  | 
|  | The final relatedness estimates will then be available in the file called "res".
 |  | 
|  |   |  | 
|  | =Output=
 |  | 
|  | Example of output of with two samples
 |  | 
|  | <pre>
 |  | 
|  | Pair	k0	k1	k2	loglh	nIter	coverage
 |  | 
|  | (0,1)	0.673213	0.326774	0.000013	-1710940.769941	19	0.814658
 |  | 
|  | </pre>
 |  | 
|  |   |  | 
|  | Example of output  with 6 samples:
 |  | 
|  | <pre>
 |  | 
|  | cat res
 |  | 
|  | Pair	k0	k1	k2	loglh	nIter	coverage
 |  | 
|  | (0,1)	0.675337	0.322079	0.002584	-1710946.832375	10	0.813930
 |  | 
|  | (0,2)	0.458841	0.526377	0.014782	-1666215.528333	10	0.808822
 |  | 
|  | (0,3)	1.000000	0.000000	0.000000	-1743992.363193	-1	0.816266
 |  | 
|  | (0,4)	1.000000	0.000000	0.000000	-1759202.971213	-1	0.818856
 |  | 
|  | (0,5)	1.000000	0.000000	0.000000	-1550475.615322	-1	0.752663
 |  | 
|  | (1,2)	0.007111	0.991020	0.001868	-1580995.130867	10	0.806912
 |  | 
|  | (1,3)	1.000000	0.000000	0.000000	-1728859.988212	-1	0.814272
 |  | 
|  | (1,4)	1.000001	-0.000001	0.000000	-1744055.203870	9	0.816887
 |  | 
|  | (1,5)	1.000000	0.000000	0.000000	-1536858.187440	-1	0.750917
 |  | 
|  | (2,3)	1.000000	0.000000	0.000000	-1705157.832621	-1	0.809297
 |  | 
|  | (2,4)	1.000000	0.000000	0.000000	-1719681.338365	-1	0.811804
 |  | 
|  | (2,5)	1.000000	0.000000	0.000000	-1517388.260612	-1	0.746903
 |  | 
|  | (3,4)	0.547602	0.439423	0.012975	-1743899.789842	10	0.819276
 |  | 
|  | (3,5)	0.265819	0.482953	0.251228	-1467343.087647	10	0.754637
 |  | 
|  | (4,5)	0.004655	0.995345	-0.000000	-1473415.049411	8	0.755734
 |  | 
|  | </pre>
 |  | 
|  |   |  | 
|  | The first column contain the individuals that was used for the analysis . The next three columns are the estimated relatedness coefficient. The fifth column is the log of the likelihood of the MLE. The sixth column is the number of iterations required to find the MLE, and finally the seventh column is fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals, and the minor allele frequency (MAF) above the threshold (default is 0.05 but may also user specified).
 |  | 
|  |   |  | 
|  | = Input file format = |  | 
|  | The input files are binary gz compressed, log like ratios encoded as double. 3 values per sample.
 |  | 
|  | The freq file is allowed to be gz compressed.
 |  | 
|  |   |  | 
|  | = Citing and references =
 |  | 
|  |   |  | 
|  | = Changelog =
 |  | 
|  | See github for log
 |  |