NgsRelate
Revision as of 13:45, 3 October 2018
This page refers to the new v2 of NgsRelate, which coestimates relatedness and inbreeding. For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694
Brief description
This page contains information about the program NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low-coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To infer relatedness you will need population allele frequencies and genotype likelihoods. Both can be obtained, e.g., using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.
The method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract
How to download and install
The source code for NgsRelate is deposited on GitHub: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git and g++ installed, NgsRelate can be downloaded and installed as follows:
git clone https://github.com/SAMtools/htslib
git clone https://github.com/ANGSD/ngsRelate
cd htslib/; make; cd ../ngsRelate; make HTSSRC=../htslib/
Run examples
Below are examples of how NgsRelate can be used to coestimate relatedness and inbreeding from NGS data.
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
The output should be a file called newres that contains relatedness estimates for all 15 pairs among the 6 individuals. A copy of this file can be found here http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.
Run example 0: using only VCF/BCF files
Run example 1: using only NGS data
Assume we have a file, filelist, containing paths to 100 BAM/CRAM files; one line per BAM/CRAM file. Then we can use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 | sed 1d > freq
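To illustrate the extraction step on its own, here is a toy version. The header and data lines below are made-up stand-ins for angsdput.mafs.gz (the real file from ANGSD may contain additional columns such as a SNP p-value), but the frequency estimate NgsRelate needs sits in the 5th column:

```shell
# Toy stand-in for angsdput.mafs.gz; values and the exact set of columns are
# made up for illustration only
printf 'chromo\tposition\tmajor\tminor\tknownEM\tnInd\n1\t14000\tG\tA\t0.282476\t50\n1\t14001\tA\tT\t0.056715\t48\n' | gzip -c > angsdput.mafs.gz

# Same extraction as above: keep the 5th column and drop the header
gunzip -c angsdput.mafs.gz | cut -f5 | sed 1d > freq
cat freq
# -> 0.282476
#    0.056715
```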
Once we have these files we can use NgsRelate to estimate relatedness for any pair of individuals. E.g., if we want to estimate relatedness between the first two individuals (numbered from 0, so individuals 0 and 1), we can do it using the following command:
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f", and the numbers of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.
NEW: Note that, if you want, you can also input a file with the IDs of the individuals (one ID per line) in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will also contain these IDs and not just the numbers of the samples (one can actually just use that exact file, however the IDs then tend to be a bit long). This can be done with the optional flag -z followed by the filename.
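One simple way to build such an ID file is to derive it from 'filelist' itself. This is only a sketch under the assumption that each line is a path ending in .bam; the paths below are made up:

```shell
# Made-up filelist with one BAM path per line
printf 'data/S1.bam\ndata/S42.bam\n' > filelist

# Strip directories and the .bam extension to get one ID per line,
# in the same order as filelist
sed 's:.*/::; s:\.bam$::' filelist > ids
cat ids
# -> S1
#    S42
```

NgsRelate can then be run with the extra flag -z ids.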
Run example 3: using frequencies from 1000genomes vcf files
We want to run NgsRelate using population frequencies from Europe, which we will extract from the 1000 Genomes Project VCF files.
#Assuming that we have perchr called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
  IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
  echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' | awk '{if(\$5>0) print \$0 }' | sort -S 50% -u -k1,2 > EUR_AF/${f}.frq"
done | parallel

##We merge into one file
cat EUR_AF/1.frq > EUR_AF/ALL.frq
for i in `seq 2 22`
do
  cat EUR_AF/${i}.frq >> EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz | cut -f1-4 | gzip -c > EUR_AF/sites.txt.gz
./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data.
#The parser expects a header, so make a dummy file
echo "header" | gzip -c > new
cat EUR_AF/ALL.frq.gz >> new
ngsRelate extract_freq new angsdput.glf.pos.gz > myfreq
Output format
NEW: Example of output from an analysis of two samples run without the optional -z:
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
And the same analysis run with the optional flag -z followed by the name of a file with IDs (where the first two IDs are S1 and S42):
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
Example of output with 6 samples:
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
The first two columns contain information about which two individuals were used for the analysis. The third column contains the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The seventh column is the log likelihood of the ML estimate. The eighth column is the number of iterations the maximization algorithm used to find the MLE, and finally the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood).
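Summary statistics can be computed directly from these columns. For example, the relatedness coefficient r = k1/2 + k2 and the kinship coefficient k1/4 + k2/2 (the standard relations for the Cotterman coefficients) can be pulled out with awk; the two-line result file below is copied from the example output above:

```shell
# Two-line result file in the v2 output format (values copied from the
# example output above)
printf 'a b nSites k0 k1 k2 loglh nIter coverage\n0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930\n' > demo.res

# r = k1/2 + k2 and kinship = k1/4 + k2/2, computed per pair
sed 1d demo.res | awk '{printf "%s %s r=%.6f kinship=%.6f\n", $1, $2, $5/2+$6, $5/4+$6/2}'
# -> 0 1 r=0.163400 kinship=0.081700
```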
For OLD versions of the program (from before June 28, 2017):
Example of output with two samples:
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
Example of output with 6 samples:
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
The first column contains information about which individuals were used for the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log likelihood of the ML estimate. The sixth column is the number of iterations the maximization algorithm used to find the MLE, and finally the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood).
Input file format
NgsRelate takes two files as input: a file with genotype likelihoods and a file with frequencies for the sites there are genotype likelihoods for. The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log-transformed genotype likelihood for each of the 3 possible genotypes, encoded as doubles), and it needs to be in binary format and gz compressed. The frequency file needs to contain one line per site with the allele frequency of that site.
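A quick consistency check on the genotype likelihood file can be sketched in shell: with 3 genotype likelihoods stored as 8-byte doubles per individual per site, the uncompressed file must be exactly nSites × nInd × 3 × 8 bytes. The file below is a synthetic placeholder, not real data:

```shell
NSITES=100
NIND=6

# Synthetic placeholder of the right size (real files come from ANGSD -doGlf 3)
head -c $((NSITES * NIND * 3 * 8)) /dev/zero | gzip -c > demo.glf.gz

# The uncompressed size must match nSites * nInd * 3 doubles of 8 bytes each
BYTES=$(gunzip -c demo.glf.gz | wc -c)
[ "$BYTES" -eq $((NSITES * NIND * 3 * 8)) ] && echo "size OK: $BYTES bytes"
```

If the check fails, the stated number of individuals (-n) likely does not match the file.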
Help and additional options
To get help and a list of all options simply type
./ngsrelate
Citing and references
Changelog
Important recent changes:
- We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
- The output format has been changed to a more R friendly format (no ":" and parenthesis)
- The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
- We have fixed -m 1 so the estimates can no longer be negative
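The -s 1 flip is equivalent to transforming the frequency file by hand; a minimal sketch with made-up frequencies:

```shell
# Three made-up allele frequencies, one per line as in the freq input file
printf '0.05\n0.30\n0.95\n' > freq

# Flip each frequency f to 1-f, which is what -s 1 does internally
awk '{printf "%.2f\n", 1-$1}' freq > freq.flipped
cat freq.flipped
# -> 0.95
#    0.70
#    0.05
```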
See github for the full change log.
Bugs/Improvements
- Make a better error message if files do not exist when using the extract_freq option