IBSrelate: Difference between revisions

From software
Jump to navigation Jump to search
 
(24 intermediate revisions by the same user not shown)
Line 56: Line 56:


=== Specify an allele at each site ===
=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file.
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
<pre>
# download and index the ancestral state fasta file
# download and index the ancestral state fasta file
Line 64: Line 64:


=== Generate a saf (site allele frequency likelihood) file for each individual ===
=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, as well as specify a specific genotype likelihood model (-GL 2).  This specific values worked well for this data set, and seem to be reasonable defaults, but the best values may vary by data set.  
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, as well as specify a specific genotype likelihood model (-GL 2).  These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts).  The output of these are not evaluated here, but they should be examined be part of a general QC process for NGS data.
In the commands below I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts).  The output of these are not evaluated here, but they should be examined be part of a QC process for NGS data.


<pre>
<pre>
Line 139: Line 138:


=== run IBS, this will analyse each pair of individuals ===
=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The man page for misc/ibs [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].
<pre>
<pre>
$IBS -glf example.glf.gz \
$IBS -glf example.glf.gz \
Line 146: Line 148:
</pre>
</pre>


=== Examine the results ===
=== Examine the IBS results ===
<pre>
<pre>
Rscript \
Rscript \
Line 157: Line 159:
cat all.filelist
cat all.filelist
</pre>
</pre>
=Frequently asked Questions=
== I run out of RAM running IBS or realSFS on my entire data set, what should I do?  ==
You can split the data set up by chromosome (or contig), run the IBS or SFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes. 
There a few possible concerns with this approach.  1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors.    2) Also, without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife.  You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.
== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure.  This involves dividing your data set into '''X''' blocks that can be of unequal sizes.  Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.
One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent.  One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.
== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes?  ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes.  If you have enough RAM, you can use all of your data at once, otherwise you will need to split your data into pieces, analyze each piece, and combine them afterwards.
== How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates? ==
Probably in multiple, complicated ways!
But one to watch out for are regions of your reference genome that are attracting aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals.  This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) are the strongest signals we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) that the pair of individuals has the maximum possible allelic identity at the site. A upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.
If possible these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with a HWE test (see [https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13019 here] for a method of testing HWE on low depth data that is robust to population structure).  Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.
One thing we found with the human data in the paper was that applying a "mappability" filter improved our estimates of R0, R1, or KING-robust kinship.  This is essentially a scan of the reference genome that reports where in the genome it is possible to uniquely align reads, and needs to be calibrated to your read lengths.  We used [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 GEM]
But there are also [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237805 new programs] that claim to do the same.
== How will inbreeding affects the estimates produces here? ==
Inbreeding within one or both individuals of a pair can affect estimates of R0, R1, or KING-robust kinship, but there isn't a simple answer that covers all possibilities.
If only one of the individuals is inbred, the pair of individuals may appear less related than otherwise, as the pair of individuals may have an elevated number of alternate homozygous genotypes and a reduced number of shared heterozygous genotypes. 
When considering inbreeding, it can be useful to compare the heterozygosities of a pair of individuals.  This is possible from the output of realSFS or IBS, and is facilitated by the parsing scripts included above.
== What data type is required for these analysis? ==
The best starting point are aligned sequencing reads (bam files).  The above example shows how to process these into all of the intermediate and final outputs.
The program [http://academic.oup.com/gigascience/article-abstract/8/5/giz034/5481763 ngsRelateV2] can also be used to estimate R0, R1, or KING-robust kinship.  It can take multiple formats including BAM, VCF and GLF. Please notice that ngsRelateV2 has its own implementation of the something very close to the SFS-based method described in the paper, and so it assumes that you know one of the alleles that is present at each site.


=Citation=
=Citation=

Latest revision as of 13:36, 24 September 2019

This page contains information about the method IBSrelate, a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954


Calculating statistics from the output of IBS and realSFS

IBS and realSFS are two methods implemented in ANGSD [1] that can be used to estimate the allele sharing genotype distribution for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of IBS and realSFS and produce estimates of R0, R1 and KING-robust kinship.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R


Application to the ANGSD example data

The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

Setup

You will need installations and both ANGSD and samtools, as well as Rscript (part of R).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

Set up shell variables

# set paths to the analysis programs, may need to be replaced your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"

Get the example data

The example data set has small bam files from ten individuals.

# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam;do samtools index $i;done

realSFS method

Setup

# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R

Specify an allele at each site

For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).

# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz

Generate a saf (site allele frequency likelihood) file for each individual

In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, as well as specify a specific genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these are not evaluated here, but they should be examined be part of a general QC process for NGS data.

# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
  # parse out the sample name
  base="$(basename -- $b)"
  sample="${base%%.mapped.*}"
  SAMPLES+=("$sample")
  echo $sample
  echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
  $ANGSD -b ${s}.filelist.ind \
  -anc hg19ancNoChr.fa.gz \
  -minMapQ 30 -minQ 20 -GL 2 \
  -doSaf 1 -doDepth 1 -doCounts 1 \
  -out ${s}
done

run realSFS on each pair of indiviudals

Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.

for i in {0..9}; do
  for j in {0..9}; do
    if (( i < j)); then
      sample1=${SAMPLES[i]}
      sample2=${SAMPLES[j]}
      $realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
    fi
  done
done

Parse the results for a single pair of individuals

Below shows how to use the read_realSFS() function in R to parse the output 2dsfs file generated by realSFS. It will be easier to open an interactive R session, source read_realSFS.R, and use the function read_realSFS() to parse the file into data frame to then explore, plot, or export.

Rscript \
  -e "source('./read_realSFS.R')" \
  -e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
  -e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
  -e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"

IBS Method

Setup

# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist

make a genotype likelihood file (glf) containing all individuals

see here for a description of the file format produced by -doGLF 1.

$ANGSD -b all.filelist \
  -minMapQ 30 -minQ 20 -GL 2 \
  -doGlf 1 \
  -out example

run IBS, this will analyse each pair of individuals

Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual). The man page for misc/ibs here.

$IBS -glf example.glf.gz \
  -model 0 \
  -nInd 10 -allpairs 1 \
  -outFileName results_IBS/ibs.model0.results

Examine the IBS results

Rscript \
  -e "source('./read_IBS.R')" \
  -e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
  -e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist

Frequently asked Questions

I run out of RAM running IBS or realSFS on my entire data set, what should I do?

You can split the data set up by chromosome (or contig), run the IBS or SFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Also, without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.

How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates?

Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into X blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these X distinct sets method citation. This is a similar approach to generating Z-scores for D-statistics[2] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the X blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.

Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes?

Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once, otherwise you will need to split your data into pieces, analyze each piece, and combine them afterwards.

How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates?

Probably in multiple, complicated ways!

But one to watch out for are regions of your reference genome that are attracting aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals. This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) are the strongest signals we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) that the pair of individuals has the maximum possible allelic identity at the site. A upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.

If possible these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with a HWE test (see here for a method of testing HWE on low depth data that is robust to population structure). Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.

One thing we found with the human data in the paper was that applying a "mappability" filter improved our estimates of R0, R1, or KING-robust kinship. This is essentially a scan of the reference genome that reports where in the genome it is possible to uniquely align reads, and needs to be calibrated to your read lengths. We used GEM But there are also new programs that claim to do the same.

How will inbreeding affects the estimates produces here?

Inbreeding within one or both individuals of a pair can affect estimates of R0, R1, or KING-robust kinship, but there isn't a simple answer that covers all possibilities.

If only one of the individuals is inbred, the pair of individuals may appear less related than otherwise, as the pair of individuals may have an elevated number of alternate homozygous genotypes and a reduced number of shared heterozygous genotypes.

When considering inbreeding, it can be useful to compare the heterozygosities of a pair of individuals. This is possible from the output of realSFS or IBS, and is facilitated by the parsing scripts included above.

What data type is required for these analysis?

The best starting point are aligned sequencing reads (bam files). The above example shows how to process these into all of the intermediate and final outputs.

The program ngsRelateV2 can also be used to estimate R0, R1, or KING-robust kinship. It can take multiple formats including BAM, VCF and GLF. Please notice that ngsRelateV2 has its own implementation of the something very close to the SFS-based method described in the paper, and so it assumes that you know one of the alleles that is present at each site.

Citation

Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954


Bibtex

@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}