IBSrelate: Difference between revisions
Line 144: | Line 144: | ||
=== Examine the results === | === Examine the results === | ||
<pre> | |||
Rscript \ | Rscript \ | ||
-e "source('./read_IBS.R')" \ | -e "source('./read_IBS.R')" \ | ||
Line 152: | Line 153: | ||
# (zero-indexed) | # (zero-indexed) | ||
cat all.filelist | cat all.filelist | ||
</pre> | |||
Line 181: | Line 182: | ||
<pre>{realSFS} ./data/1000G_aln/saf/chromosomes/NA19042_chr{CHR}.saf.idx ./data/1000G_aln/saf/chromosomes/NA19027_chr{CHR}.saf.idx -r {CHR} -P 2 -tole 1e-10 > ./data/1000G_aln/saf/chromosomes/NA19042_NA19027_chr{CHR}.2dsfs</pre> | <pre>{realSFS} ./data/1000G_aln/saf/chromosomes/NA19042_chr{CHR}.saf.idx ./data/1000G_aln/saf/chromosomes/NA19027_chr{CHR}.saf.idx -r {CHR} -P 2 -tole 1e-10 > ./data/1000G_aln/saf/chromosomes/NA19042_NA19027_chr{CHR}.2dsfs</pre> | ||
==== Use the above R script ('''read_realSFS.R''') to interpret the output for each pair of individuals ==== | ==== Use the above R script ('''read_realSFS.R''') to interpret the output for each pair of individuals ==== | ||
=== IBS method=== | === IBS method=== |
Revision as of 14:16, 23 September 2019
This page contains information about the method IBSrelate, a method to identify pairs of related individuals without requiring population allele frequencies.
On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC). For further details, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954
Calculating statistics from the output of IBS and realSFS
IBS and realSFS are two methods implemented in ANGSD [1] that can be used to estimate the allele sharing genotype distribution for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of IBS and realSFS and produce estimates of R0, R1 and KING-robust kinship.
https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R
https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R
Demonstration of the IBS and realSFS methods on the ANGSD example data
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .
They are available in a Jupyter notebook here: https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .
Setup
You will need installations and both ANGSD and samtools, as well as Rscript (part of R).
Set up shell variables
# set paths to the analysis programs, # may need to be replaced your local installation(s) ANGSD="$HOME/programs/angsd/angsd" realSFS="$HOME/programs/angsd/misc/realSFS" IBS="$HOME/programs/angsd/misc/ibs" SAMTOOLS="samtools"
Get the example data
The example data set has small bam files from ten individuals. # download the example data wget http://popgen.dk/software/download/angsd/bams.tar.gz # unzip/untar and index the bam files tar xf bams.tar.gz for i in bams/*.bam;do samtools index $i;done
realSFS method
Setup
# make a directory for the results mkdir results_realsfs # get the R script to parse the realSFS output wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
Specify an allele at each site
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file.
# download and index the ancestral state fasta file wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz $SAMTOOLS faidx hg19ancNoChr.fa.gz
Generate a saf (site allele frequency likelihood) file for each individual
# make a separate bam filelist for each individual # also create a SAMPLES array for use below BAMS=./bams/*.bam SAMPLES=() for b in $BAMS; do # parse out the sample name base="$(basename -- $b)" sample="${base%%.mapped.*}" SAMPLES+=("$sample") echo $sample echo $b > ${sample}.filelist.ind done # run doSAF on each individual for s in "${SAMPLES[@]}"; do $ANGSD -b ${s}.filelist.ind \ -anc hg19ancNoChr.fa.gz \ -minMapQ 30 -minQ 20 -GL 2 \ -doSaf 1 -doDepth 1 -doCounts 1 \ -out ${s} done
run realSFS on each pair of indiviudals
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
for i in {0..9}; do for j in {0..9}; do if (( i < j)); then sample1=${SAMPLES[i]} sample2=${SAMPLES[j]} $realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs fi done done
Parse the results for a single pair of individuals
Below shows how to use the read_realSFS() function in R to parse the output 2dsfs file generated by realSFS.
Rscript \ -e "source('./read_realSFS.R')" \ -e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \ -e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \ -e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
IBS Method
Setup
# make a directory for the results mkdir results_IBS # get the R script to parse the IBS output wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R ## make a bam filelists containing all individuals ls bams/*.bam > all.filelist
make a genotype likelihood file (glf) containing all individuals
$ANGSD -b all.filelist \ -minMapQ 30 -minQ 20 -GL 2 \ -doGlf 1 \ -out example
run IBS, this will analyse each pair of individuals
$IBS -glf example.glf.gz \ -model 0 \ -nInd 10 -allpairs 1 \ -outFileName results_IBS/ibs.model0.results
Examine the results
Rscript \ -e "source('./read_IBS.R')" \ -e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \ -e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])" # the IBS method in ANGSD indexes individuals as they appear in the filelist # (zero-indexed) cat all.filelist
make a consensus sequence (fasta) from one of the individuals
Here the *.list file contains paths to the bam files for NA19042. A separate consensus should be created for each chromosome. This step is optional, the reference sequence used for alignment can also be used.
{ANGSD} -b ./data/1000G_aln/NA19042.mapped.ILLUMINA.bwa.LWK.low_coverage.20130415.list \ -r {CHR} -minMapQ 30 -minQ 20 -setMinDepth 3 -doFasta 2 -doCounts 1 -out ./data/consensus.NA19042.chr{CHR}
make *.saf files
- .saf files are needed for each chromosome within each individual.
The *.list file contains paths to the bam files for NA19027. The file GEM_mappability1_75mer.angsd gives the sites passing the GEM mappability filter in a bed-like format, as required by ANGSD (see here: [2])
{ANGSD} -b ./data/1000G_aln/NA19027.mapped.ILLUMINA.bwa.LWK.low_coverage.20130415.list \ -r {CHR} \ -ref ./data/1000G_aln/hs37d5.fa \ -anc ./data/consensus.NA19042.chr{CHR}.fa.gz \ -sites ./data/1000G_aln/GEM_mappability1_75mer.angsd \ -minMapQ 30 -minQ 20 -GL 2 \ -doSaf 1 -doDepth 1 -doCounts 1 \ -out ./data/1000G_aln/saf/chromosomes/NA19027_chr{CHR}
run realSFS for each pair of individuals
This will generate a 2-dimensional site-frequency spectrum. The command below runs realSFS for NA19042 and NA19027. Run for each chromosome for each pair of individuals.
{realSFS} ./data/1000G_aln/saf/chromosomes/NA19042_chr{CHR}.saf.idx ./data/1000G_aln/saf/chromosomes/NA19027_chr{CHR}.saf.idx -r {CHR} -P 2 -tole 1e-10 > ./data/1000G_aln/saf/chromosomes/NA19042_NA19027_chr{CHR}.2dsfs
Use the above R script (read_realSFS.R) to interpret the output for each pair of individuals
IBS method
make a genotype likelihood file
The file bamlist.all.txt contains paths to the bam files for each individual, one per individual. The file GEM_mappability1_75mer.angsd gives the sites passing the GEM mappability filter in a bed-like format, as required by ANGSD (see here: [3]) The output will contain genotype likelihoods for each individual at each site (*.glf.gz). Run for each chromosome.
{ANGSD} -b ./data/1000G_aln/bamlist.all.txt \ -r {CHR} \ -sites ./data/1000G_aln/GEM_mappability1_75mer.angsd \ -minMapQ 30 -minQ 20 -GL 2 \ -doGlf 1 \ -out ./data/1000G_aln/GLF/chromosomes/chr{CHR}
run IBS
Here there are 5 individuals in the glf file (-nInd 5), and we want to evaluate at each pair (-allpairs 1), using IBS model 0 (-model 0).
{IBS} -glf ./data/1000G_aln/GLF/chromosomes/chr{CHR}.glf.gz \ -seed {CHR} -maxSites 300000000 -model 0 \ -nInd 5 -allpairs 1 \ -outFileName ./data/1000G_aln/GLF/chromosomes/chr{CHR}.model0
Use the above R script (read_IBS.R) to interpret the output of IBS for each pair of individuals
Citation
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954
Bibtex
@article{doi:10.1111/mec.14954, author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida}, title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data}, journal = {Molecular Ecology}, volume = {0}, number = {ja}, pages = {}, doi = {10.1111/mec.14954}, url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954}, eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954}, }