software - User contributions [en]

AsaMap

2026-03-24T07:32:33Z

Albrecht: Replaced content with "The program is available and described on github: https://github.com/e-jorsboe/asaMap"

The program is available and described on github:

https://github.com/e-jorsboe/asaMap

NgsAdmix

2026-03-24T07:28:40Z

Albrecht: Replaced content with "NGSadmix is a tool for estimating individual admixture proportions low depth sequencing data based on genotype likelihoods The software including tutorials can be found here https://github.com/aalbrechtsen/NGSadmix"

NGSadmix is a tool for estimating individual admixture proportions from low-depth sequencing data based on genotype likelihoods.

The software, including tutorials, can be found here:
https://github.com/aalbrechtsen/NGSadmix

EvalAdmix

2024-03-04T17:57:07Z

Albrecht:

'''IMPORTANT''': version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data that caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness, this results in the last samples in the analysis having a correlation of nan with all other samples, but it might also have more subtle effects whenever there is some level of missingness. If you have analyses from previous versions based on genotype data with any missingness, it might be a good idea to re-run them after updating. The bug did not affect the genotype likelihoods implementation, so if you based your analyses on genotype likelihoods you do not need to worry. If you applied it to genotype data without any missingness, you also do not need to worry.

evalAdmix evaluates the results of an admixture analysis (i.e. the result of applying [https://genome.cshlp.org/content/19/9/1655.long ADMIXTURE], [https://web.stanford.edu/group/pritchardlab/structure.html STRUCTURE], [http://www.popgen.dk/software/index.php/NgsAdmix NGSadmix] or similar). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can be either called genotypes in [https://www.cog-genomics.org/plink/1.9/formats#bed binary plink format] or genotype likelihoods in [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format beagle format].

The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.

[[File:evalAdmix.jpg|thumb|550px]]

==Download and Installation==

evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]

<pre>git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
</pre>

==Quick start==
:<code> ./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10 </code>
:<code> ./evalAdmix -plink inputPlinkPrefix -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10 </code>

* '''-beagle''' beagle file of genotype likelihoods
* '''-plink''' binary plink file prefix with genotype data
* '''-fname''' file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
* '''-qname''' file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
* '''-o''' prefix of output file names
* '''-P''' Number of threads used
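
As a rough illustration, the quick-start invocations can also be assembled from a script. The sketch below is not part of evalAdmix: the helper name <code>build_evaladmix_cmd</code> and its arguments are hypothetical, and it only uses the flags listed above.

```python
import shlex


def build_evaladmix_cmd(fname, qname, out, threads=1, beagle=None, plink=None,
                        exe="./evalAdmix"):
    """Assemble an evalAdmix argument list (hypothetical helper, not part of evalAdmix)."""
    # Exactly one input type must be given: genotype likelihoods or PLINK genotypes.
    if (beagle is None) == (plink is None):
        raise ValueError("give exactly one of beagle= or plink=")
    input_args = ["-beagle", beagle] if beagle is not None else ["-plink", plink]
    return [exe, *input_args, "-fname", fname, "-qname", qname,
            "-o", out, "-P", str(threads)]


cmd = build_evaladmix_cmd(
    fname="myoutfiles.fopt.gz", qname="myoutfiles.qopt",
    out="evaladmixOut.corres", threads=10, beagle="inputBeagleFile.gz")
print(shlex.join(cmd))
```

The resulting list could be passed to <code>subprocess.run(cmd, check=True)</code> once the evalAdmix binary has been built.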

==Parameters==

<pre>./evalAdmix </pre>

<pre>
Arguments:
Required:
-plink path to binary plink file (excluding the .bed)
or
-beagle path to beagle file containing genotype likelihoods (alternative to -plink)

-fname path to ancestral population frequencies file
-qname path to admixture proportions file

Optional:

-o name of the output file

Setup (optional):

-P 1 number of threads
-autosomeMax 23 autosome ends with this chromsome
-nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
-useSites 1.0 proportion of sites to use to calculate correlation of residuals
-useInds filename path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
-misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
-minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
</pre>
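
The <code>-useInds</code> file described above can be written with a few lines of Python. This is a minimal sketch under assumptions: the sample IDs are made up, the helper <code>write_use_inds</code> is hypothetical, and the file is assumed to have no header line.

```python
import csv
import io


def write_use_inds(handle, ids, keep):
    """Write one 'ID<TAB>flag' line per individual; flag is 1 to include, 0 to exclude."""
    keep = set(keep)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for ind in ids:
        writer.writerow([ind, 1 if ind in keep else 0])


# Hypothetical sample IDs; in practice they could be taken from the .fam file.
ids = ["ind1", "ind2", "ind3", "ind4"]
buf = io.StringIO()
write_use_inds(buf, ids, keep=["ind1", "ind3", "ind4"])
use_inds_text = buf.getvalue()
print(use_inds_text, end="")
```

Replacing the <code>io.StringIO</code> buffer with an opened file gives a file whose path can be passed to <code>-useInds</code>.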

==Input Files==

===Plink===
Genotype data files in binary PLINK format (.bed .fam .bim).
===Beagle genotype likelihoods===
The input file contains genotype likelihoods in the .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html] and can be compressed with gzip.
==== BAM files ====
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]

==== VCF files ====
If you already have a VCF file that contains genotype likelihood information, it should be possible to convert it to a .beagle file with vcftools [https://vcftools.github.io/man_latest.html]:

<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
</pre>
The chromosome has to be specified.

You can also use bcftools' [https://samtools.github.io/bcftools/bcftools.html] 'query' option for generating a .beagle file from a .vcf file.

==Output File==
The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:

NA 0.008609 -0.006919 0.002731 0.020224<br />
0.008609 NA 0.000033 0.004968 -0.008470<br />
-0.006919 0.000033 NA 0.006982 0.005664<br />
0.002731 0.004968 0.006982 NA 0.000521<br />
0.020224 -0.008470 0.005664 0.000521 NA
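
To get a quick numerical impression of such a matrix outside R, it can be parsed directly. The following is a minimal sketch that uses the example values shown above; the helper name <code>parse_corres</code> is hypothetical and the summary statistics are just one possible choice.

```python
import math

# Example matrix from above, as it would appear in the output file (tab delimited).
corres_text = """NA\t0.008609\t-0.006919\t0.002731\t0.020224
0.008609\tNA\t0.000033\t0.004968\t-0.008470
-0.006919\t0.000033\tNA\t0.006982\t0.005664
0.002731\t0.004968\t0.006982\tNA\t0.000521
0.020224\t-0.008470\t0.005664\t0.000521\tNA
"""


def parse_corres(text):
    """Parse the matrix into a list of rows, mapping NA to NaN."""
    return [[math.nan if tok == "NA" else float(tok) for tok in line.split()]
            for line in text.strip().splitlines()]


r = parse_corres(corres_text)
n = len(r)

# Sanity checks: the matrix should be square and symmetric.
assert all(len(row) == n for row in r)
assert all(r[i][j] == r[j][i] for i in range(n) for j in range(i))

# Summarise the off-diagonal entries, counting each pair of individuals once.
pairs = [(abs(r[i][j]), i, j) for i in range(n) for j in range(i + 1, n)]
mean_abs = sum(p[0] for p in pairs) / len(pairs)
largest, i_max, j_max = max(pairs)
print(f"mean |correlation| = {mean_abs:.6f}")
print(f"largest |correlation| = {largest:.6f} (individuals {i_max + 1} and {j_max + 1})")
```

For a real run, <code>corres_text</code> would be replaced by the contents of the evalAdmix output file.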

==Run command example==

=== Genotype data ===
Download the input file containing genotypes in binary plink format
<pre>
wget https://popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget https://popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget https://popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
</pre>

Run ADMIXTURE [http://software.genetics.ucla.edu/admixture/] to obtain admixture proportions

<pre>admixture admixTjeck2.bed 3 -j20</pre>

::Genotypes file prefix admixTjeck2.bed
::Number of Ancestral Populations K = 3
::Number of CPU threads = 20

Run evalAdmix
<pre>./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 </pre>

::Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
::Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
::Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
::Computer cores = 20 (-P 20).

Plot results in R
[[File:evalAdmixK3.Q.png|thumb|frame|admixture proportions]]
[[File:evalAdmixK3.cor.png|thumb|frame|evalAdmix correlations]]
<pre>
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q",stringsAsFactors=T)

palette(c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928","#1B9E77","#999999"))

# order according to population and plot the ADMIXTURE results
ord<-orderInds(pop = as.vector(pop[,2]), q = q)

#make barplot
plotAdmix(q,ord=ord,pop=pop[,2])

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>
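
The correlation matrix can also be summarised per population pair without plotting. The sketch below is an illustration only: the function <code>mean_corres_by_pop</code> is hypothetical, the toy matrix and labels are made up, and it is not a reimplementation of <code>plotCorRes</code>.

```python
import math
from collections import defaultdict


def mean_corres_by_pop(r, pop):
    """Average the residual correlations within and between population labels.

    r   : square matrix (list of lists) with NaN on the diagonal
    pop : one population label per individual, in matrix order
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(r)
    for i in range(n):
        for j in range(i + 1, n):
            if math.isnan(r[i][j]):
                continue  # skip missing entries
            key = tuple(sorted((pop[i], pop[j])))
            sums[key] += r[i][j]
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy input (made-up numbers, not real evalAdmix output)
nan = math.nan
r = [[nan, 0.04, -0.02, -0.01],
     [0.04, nan, -0.03, -0.02],
     [-0.02, -0.03, nan, 0.05],
     [-0.01, -0.02, 0.05, nan]]
pop = ["A", "A", "B", "B"]
means = mean_corres_by_pop(r, pop)
for key, value in sorted(means.items()):
    print(key, round(value, 4))
```

In this toy case the within-population means are positive and the between-population mean is negative, which is the pattern described in the introduction for a poorly fitting model.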

=== Low depth sequencing data ===
Download the input file containing genotype likelihoods in beagle format
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
</pre>

Execute [[NgsAdmix |NGSadmix]] to obtain admixture proportions
<pre>./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05</pre>

::Input file = Demo2input.gz
::Ancestral Populations K=3
::Computer cores = 20 (-P 20).
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5% (-minMaf 0.05)

Run evalAdmix
<pre>./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 </pre>

::Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
::Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
::Admixture proportions file myoutfiles.qopt (-qname myoutfiles.qopt).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix results
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals

plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

==Citation==

[https://doi.org/10.1111/1755-0998.13171 Evaluation of Model Fit of Inferred Admixture Proportions]

File:EvalAdmix.jpg

2024-01-11T10:10:26Z

Albrecht:

EvalAdmix

2024-01-11T10:09:38Z

Albrecht:

'''IMPORTANT''': version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data that caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness, this results in the last samples in the analysis having a correlation of nan with all other samples, but it might also have more subtle effects whenever there is some level of missingness. If you have analyses from previous versions based on genotype data with any missingness, it might be a good idea to re-run them after updating. The bug did not affect the genotype likelihoods implementation, so if you based your analyses on genotype likelihoods you do not need to worry. If you applied it to genotype data without any missingness, you also do not need to worry.

evalAdmix evaluates the results of an admixture analysis (i.e. the result of applying [https://genome.cshlp.org/content/19/9/1655.long ADMIXTURE], [https://web.stanford.edu/group/pritchardlab/structure.html STRUCTURE], [http://www.popgen.dk/software/index.php/NgsAdmix NGSadmix] or similar). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can be either called genotypes in [https://www.cog-genomics.org/plink/1.9/formats#bed binary plink format] or genotype likelihoods in [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format beagle format].

The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.

[[File:evalAdmix.jpg|thumb|550px]]

==Download and Installation==

evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]

<pre>git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
</pre>

==Quick start==
:<code> ./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10 </code>
:<code> ./evalAdmix -plink inputPlinkPrefix -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10 </code>

* '''-beagle''' beagle file of genotype likelihoods
* '''-plink''' binary plink file prefix with genotype data
* '''-fname''' file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
* '''-qname''' file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
* '''-o''' prefix of output file names
* '''-P''' Number of threads used

==Parameters==

<pre>./evalAdmix </pre>

<pre>
Arguments:
Required:
-plink path to binary plink file (excluding the .bed)
or
-beagle path to beagle file containing genotype likelihoods (alternative to -plink)

-fname path to ancestral population frequencies file
-qname path to admixture proportions file

Optional:

-o name of the output file

Setup (optional):

-P 1 number of threads
-autosomeMax 23 autosome ends with this chromsome
-nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
-useSites 1.0 proportion of sites to use to calculate correlation of residuals
-useInds filename path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
-misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
-minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
</pre>

==Input Files==

===Plink===
Genotype data files in binary PLINK format (.bed .fam .bim).
===Beagle genotype likelihoods===
The input file contains genotype likelihoods in the .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html] and can be compressed with gzip.
==== BAM files ====
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]

==== VCF files ====
If you already have a VCF file that contains genotype likelihood information, it should be possible to convert it to a .beagle file with vcftools [https://vcftools.github.io/man_latest.html]:

<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
</pre>
The chromosome has to be specified.

You can also use bcftools' [https://samtools.github.io/bcftools/bcftools.html] 'query' option for generating a .beagle file from a .vcf file.

==Output File==
The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:

NA 0.008609 -0.006919 0.002731 0.020224<br />
0.008609 NA 0.000033 0.004968 -0.008470<br />
-0.006919 0.000033 NA 0.006982 0.005664<br />
0.002731 0.004968 0.006982 NA 0.000521<br />
0.020224 -0.008470 0.005664 0.000521 NA

==Run command example==

=== Genotype data ===
Download the input file containing genotypes in binary plink format
<pre>
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
</pre>

Run ADMIXTURE [http://software.genetics.ucla.edu/admixture/] to obtain admixture proportions

<pre>admixture admixTjeck2.bed 3 -j20</pre>

::Genotypes file prefix admixTjeck2.bed
::Number of Ancestral Populations K = 3
::Number of CPU threads = 20

Run evalAdmix
<pre>./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 </pre>

::Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
::Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
::Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
::Computer cores = 20 (-P 20).

Plot results in R
[[File:evalAdmixK3.Q.png|thumb|frame|admixture proportions]]
[[File:evalAdmixK3.cor.png|thumb|frame|evalAdmix correlations]]
<pre>
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q",stringsAsFactors=T)

palette(c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928","#1B9E77","#999999"))

# order according to population and plot the ADMIXTURE results
ord<-orderInds(pop = as.vector(pop[,2]), q = q)

#make barplot
plotAdmix(q,ord=ord,pop=pop[,2])

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

=== Low depth sequencing data ===
Download the input file containing genotype likelihoods in beagle format
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
</pre>

Execute [[NgsAdmix |NGSadmix]] to obtain admixture proportions
<pre>./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05</pre>

::Input file = Demo2input.gz
::Ancestral Populations K=3
::Computer cores = 20 (-P 20).
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5% (-minMaf 0.05)

Run evalAdmix
<pre>./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 </pre>

::Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
::Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
::Admixture proportions file myoutfiles.qopt (-qname myoutfiles.qopt).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix results
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals

plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

==Citation==

[https://doi.org/10.1111/1755-0998.13171 Evaluation of Model Fit of Inferred Admixture Proportions]

File:Pcangsd pca.png

2023-10-24T11:32:12Z

Albrecht:

PCAngsd

2023-10-24T11:26:25Z

Albrecht:

PCAngsd is a program that estimates the covariance matrix and individual allele frequencies for low-depth next-generation sequencing (NGS) data in structured/heterogeneous populations using principal component analysis (PCA) to perform multiple population genetic analyses using genotype likelihoods. Since version 0.98, PCAngsd was re-written to be based on Cython for computational bottlenecks and parallelization.

The main method was published in 2018 and can be found here: [https://www.genetics.org/content/210/2/719]

The HWE test was published in 2019 and can be found here: [https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019]
[[File:Pcangsd_admix3.gif|frame]]

[[File:Pcangsd_pca.png|thumb|400px|Simulated low depth NGS data of 3 populations]]

=Overview=
Framework for analyzing low-depth next-generation sequencing (NGS) data in heterogeneous/structured populations using principal component analysis (PCA). Population structure is inferred by estimating individual allele frequencies in an iterative approach using a truncated SVD model. The covariance matrix is estimated using the estimated individual allele frequencies as prior information for the unobserved genotypes in low-depth NGS data.

The estimated individual allele frequencies can further be used to account for population structure in other probabilistic methods. PCAngsd can perform the following analyses:
*Covariance matrix
*Admixture estimations
*Inbreeding coefficients (both per-individual and per-site)
*HWE test
*Genome-wide selection scan
*Genotype calling
*Estimate NJ tree of samples

Older versions of PCAngsd can be found here [https://github.com/Rosemeis/pcangsd/releases/].

=Download and Installation=

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended. Installation has only been tested on Linux systems.

Get PCAngsd and build
<pre>
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd/
python setup.py build_ext --inplace
</pre>
Install dependencies:

The required Python packages are easily installed using the pip command and the 'requirements.txt' file included in the 'pcangsd' folder.

<pre>
pip install --user -r requirements.txt
</pre>

=Quick start=

PCAngsd is used by running the main caller file pcangsd.py. To see all available options use the following command:
<pre>
python pcangsd.py -h

# Genotype likelihoods using 64 threads
python pcangsd.py -beagle input.beagle.gz -out output -threads 64

# PLINK files (using file-prefix, *.bed, *.bim, *.fam)
python pcangsd.py -plink input.plink -out output -threads 64
</pre>

PCAngsd accepts either genotype likelihoods in Beagle format or PLINK genotype files. Beagle files can be generated from BAM files using [http://popgen.dk/angsd ANGSD]. For inference of population structure in genotype data with non-random missingness, we recommend our [http://www.popgen.dk/software/index.php/EMU EMU] software, which performs accelerated EM-PCA but offers fewer functionalities than PCAngsd (#soon).

PCAngsd will mostly output files in binary Numpy format (.npy) with a few exceptions. In order to read files in python:
<pre>
import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix (text)
D = np.load("output.selection.npy") # Reads PC based selection statistics
</pre>

R can also read Numpy matrices using the "RcppCNPy" R library:
<pre>
library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
D <- npyLoad("output.selection.npy") # Reads PC based selection statistics
</pre>
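
Once the covariance matrix has been read, principal components are commonly obtained by an eigendecomposition. The following is a minimal sketch assuming NumPy is installed; the helper <code>pca_from_cov</code> is hypothetical and the 3x3 toy matrix merely stands in for the matrix loaded from "output.cov".

```python
import numpy as np


def pca_from_cov(C):
    """Eigendecompose a symmetric covariance matrix.

    Returns eigenvalues in descending order, the matching eigenvectors
    (as columns) and each eigenvalue's share of the eigenvalue sum.
    """
    vals, vecs = np.linalg.eigh(C)      # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1]      # reorder to descending
    vals, vecs = vals[order], vecs[:, order]
    return vals, vecs, vals / vals.sum()


# Toy matrix standing in for np.genfromtxt("output.cov")
C = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]])
vals, vecs, explained = pca_from_cov(C)
print(vals)
print(explained)
```

The first columns of <code>vecs</code> can then be plotted against each other, one point per individual, to visualise population structure.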

An example of generating genotype likelihoods in [http://popgen.dk/angsd ANGSD] and outputting them in the required Beagle text format:

<pre>
./angsd -GL 2 -out input -nThreads 4 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

=Tutorial=

Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]

=Options=
<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>

==General usage==
; -beagle [Beagle file]
Input file of genotype likelihoods in Beagle format (.beagle.gz).
; -filter [Text file]
Input file of 1's or 0's whether to keep individuals or not.
; -plink [Prefix for binary PLINK files]
Path to PLINK files using ONLY their prefix (.bed, .bim, .fam).
; -plink_error [float]
Incorporate errors into genotypes by specifying rate as argument.
; -minMaf [float]
Minimum minor allele frequency threshold. (Default: 0.05)
; -maf_iter [int]
Maximum number of EM iterations for computing the population allele frequencies (Default: 200).
; -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).
; -iter [int]
Maximum number of iterations for estimation of individual allele frequencies (Default: 100).
; -tole [float]
Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).
; -hwe [.lrt.npy file]
Input file of LRT binary file from previous PCAngsd run to filter based on HWE.
; -hwe_tole [float]
Threshold for HWE filtering of sites.
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).
; -pi [.pi.npy file]
Load previous estimation of individual allele frequencies to skip covariance estimation.
; -maf_save
Choose to save estimated population allele frequencies (Binary). Numpy format (.npy).
; -pi_save
Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy). Can be used with the '-pi' command.
; -dosage_save
Choose to save estimated genotype dosages (Binary). Numpy format (.npy).
; -post_save
Choose to save the posterior genotype probabilities. Beagle format (.beagle).
; -sites_save
Choose to save the kept sites after filtering which is useful for downstream analysis. Outputs a file of 1's and 0's for keeping a site or not, respectively.
; -threads [int]
Specify the number of thread(s) to use (Default: 1).
; -out [output prefix]
Fileprefix for all output files created by PCAngsd (Default: "pcangsd").

==Selection==
Perform PC-based genome-wide selection scans using posterior expectations of the genotypes (genotype dosages):

; -selection
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome-wide selection scan along all significant PCs. Outputs the selection statistics, which must be converted to p-values by the user. Each column reflects the selection statistics along a tested PC, and they are χ²-distributed with 1 degree of freedom.

; -pcadapt
Using an extended model of [https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12592 pcadapt]. Performs a genome-wide selection scan across all significant PCs. Outputs the z-scores, which must be converted to test statistics with the provided script 'pcangsd/scripts/pcadapt.R'; the test statistics are χ²-distributed with K degrees of freedom.

; -snp_weights
Output the SNP weights of the significant K eigenvectors.
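
Since the statistics from '-selection' are χ²-distributed with 1 degree of freedom, converting them to p-values only needs the upper tail of that distribution. The sketch below uses the Python standard library; the function name <code>chi2_sf_1df</code> and the example statistics are made up for illustration. It does not apply to the '-pcadapt' output, which has K degrees of freedom.

```python
import math


def chi2_sf_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) with Z standard normal.
    """
    if stat < 0:
        raise ValueError("chi-square statistics are non-negative")
    return math.erfc(math.sqrt(stat / 2.0))


# Made-up statistics for three sites along one PC (not real PCAngsd output)
stats = [0.0, 3.841458820694124, 30.0]
pvals = [chi2_sf_1df(s) for s in stats]
for s, p in zip(stats, pvals):
    print(f"{s:.3f}\t{p:.3e}")
```

If SciPy is available, <code>scipy.stats.chi2.sf(D, df=1)</code> performs the same conversion on a whole NumPy array such as the one loaded from "output.selection.npy".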

==Inbreeding==
; -inbreedSites
Estimate per-site inbreeding coefficients accounting for population structure and perform a likelihood ratio test for detecting sites deviating from HWE [https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019].

; -inbreedSamples
Estimate per-individual inbreeding coefficients accounting for population structure which is based on an extension of [http://genome.cshlp.org/content/23/11/1852.full ngsF] for structured populations.

; -inbreed_iter [int]
Maximum number of iterations for inbreeding EM algorithm. (Default: 200)

; -inbreed_tole [float]
Tolerance value for inbreeding EM algorithm in estimating inbreeding coefficients. (Default: 1e-4)

==Call genotypes==
Genotypes can be called from posterior genotype probabilities by incorporating the individual allele frequencies as prior information.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. '-inbreedSamples' must also be called for using this option.
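
The idea of threshold-based calling can be sketched as follows. This is an illustration under assumptions, not PCAngsd's implementation: the function <code>call_genotype</code>, the use of -9 as the missing code, the "greater than or equal" comparison and the example posteriors are all hypothetical.

```python
def call_genotype(post, threshold, missing=-9):
    """Return the most probable genotype (0, 1 or 2) if its posterior probability
    reaches the threshold, otherwise the missing code (-9 is an assumption here)."""
    best = max(range(3), key=lambda g: post[g])
    return best if post[best] >= threshold else missing


# Made-up posteriors (P(G=0), P(G=1), P(G=2)) for three individuals at one site
posteriors = [(0.02, 0.95, 0.03), (0.45, 0.40, 0.15), (0.90, 0.09, 0.01)]
calls = [call_genotype(p, threshold=0.9) for p in posteriors]
print(calls)
```

A higher threshold yields fewer but more confident calls; individuals whose best genotype falls below it are left uncalled.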

==Admixture==
Individual admixture proportions and ancestral allele frequencies can be estimated assuming K ancestral populations using an accelerated mini-batch NMF method.

; -admix
Toggles admixture estimations. Estimates admixture proportions and ancestral allele frequencies.
; -admix_K [int]
Not recommended. Override the number of ancestry components (K) to use, instead of using K=e-1.
; -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 200)
; -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
; -admix_alpha [float]
Specify alpha (sparseness regularization parameter). (Default: 0)
; -admix_auto [float]
Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.
; -admix_seed [int]
Specify seed for random initializations of factor matrices in admixture estimations.

==Tree==
; -tree
Construct a neighbour-joining tree of samples from the covariance matrix estimated based on individual allele frequencies.
; -tree_samples
Provide a list of sample names of all individuals to construct a beautiful tree.

=Citation=
Our methods for inferring population structure have been published in GENETICS:

[http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]

Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:

[https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

File:Pcangsd admix3.gif

2023-10-24T11:25:27Z

Albrecht:

PCAngsd

2023-10-24T11:12:40Z

Albrecht:

PCAngsd is a program that estimates the covariance matrix and individual allele frequencies for low-depth next-generation sequencing (NGS) data in structured/heterogeneous populations using principal component analysis (PCA) to perform multiple population genetic analyses using genotype likelihoods. Since version 0.98, PCAngsd was re-written to be based on Cython for computational bottlenecks and parallelization.

The main method was published in 2018 and can be found here: [https://www.genetics.org/content/210/2/719]

The HWE test was published in 2019 and can be found here: [https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019]
[[File:Pcangsd_admix.gif|frame]]
[[File:Pcangsd_admix3.gif|frame]]

[[File:Pcangsd_pca.png|thumb|400px|Simulated low depth NGS data of 3 populations]]

=Overview=
Framework for analyzing low-depth next-generation sequencing (NGS) data in heterogeneous/structured populations using principal component analysis (PCA). Population structure is inferred by estimating individual allele frequencies in an iterative approach using a truncated SVD model. The covariance matrix is estimated using the estimated individual allele frequencies as prior information for the unobserved genotypes in low-depth NGS data.

The estimated individual allele frequencies can further be used to account for population structure in other probabilistic methods. PCAngsd can perform the following analyses:
*Covariance matrix
*Admixture estimations
*Inbreeding coefficients (both per-individual and per-site)
*HWE test
*Genome-wide selection scan
*Genotype calling
*Estimate NJ tree of samples

Older versions of PCAngsd can be found here [https://github.com/Rosemeis/pcangsd/releases/].

=Download and Installation=

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended. Installation has only been tested on Linux systems.

Get PCAngsd and build
<pre>
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd/
python setup.py build_ext --inplace
</pre>
Install dependencies:

The required Python packages are easily installed using the pip command and the 'requirements.txt' file included in the 'pcangsd' folder.

<pre>
pip install --user -r requirements.txt
</pre>

=Quick start=

PCAngsd is used by running the main caller file pcangsd.py. To see all available options use the following command:
<pre>
python pcangsd.py -h

# Genotype likelihoods using 64 threads
python pcangsd.py -beagle input.beagle.gz -out output -threads 64

# PLINK files (using file-prefix, *.bed, *.bim, *.fam)
python pcangsd.py -plink input.plink -out output -threads 64
</pre>

PCAngsd accepts either genotype likelihoods in Beagle format or PLINK genotype files. Beagle files can be generated from BAM files using [http://popgen.dk/angsd ANGSD]. For inference of population structure in genotype data with non-random missingness, we recommend our [http://www.popgen.dk/software/index.php/EMU EMU] software, which performs accelerated EM-PCA but has fewer functionalities than PCAngsd (#soon).
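Before running PCAngsd it can be useful to check how many individuals and sites a Beagle file contains, for example to confirm that a '-filter' file has one entry per individual. The following is a minimal sketch, assuming the usual ANGSD Beagle layout (one header line, three identifier columns, then three genotype likelihood columns per individual). The function name <code>beagle_dimensions</code> is only illustrative and is not part of PCAngsd.

```python
import gzip


def beagle_dimensions(path):
    """Return (n_individuals, n_sites) for a gzipped Beagle genotype likelihood file.

    Assumes one header line and three likelihood columns per individual
    after the marker, allele1 and allele2 columns.
    """
    with gzip.open(path, "rt") as handle:
        header = handle.readline().split()
        n_gl_columns = len(header) - 3
        if n_gl_columns <= 0 or n_gl_columns % 3 != 0:
            raise ValueError("unexpected number of columns in Beagle header")
        # Every remaining non-empty line is one site.
        n_sites = sum(1 for line in handle if line.strip())
    return n_gl_columns // 3, n_sites
```

For example, <code>beagle_dimensions("input.beagle.gz")</code> would return the number of individuals and sites in the input used in the commands above.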

PCAngsd will mostly output files in binary Numpy format (.npy) with a few exceptions. In order to read files in python:
<pre>
import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix (text)
D = np.load("output.selection.npy") # Reads PC based selection statistics
</pre>
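The covariance matrix is usually turned into principal components by an eigendecomposition. The sketch below shows one way to do this with NumPy. The function name <code>principal_components</code> is hypothetical, and the "variance explained" is simply each eigenvalue divided by the sum of all eigenvalues, which is an assumption of this example rather than something PCAngsd reports.

```python
import numpy as np


def principal_components(cov, n_components=2):
    """Return (eigenvalues, eigenvectors, variance_explained) for the top components."""
    cov = np.asarray(cov, dtype=float)
    # eigh is appropriate because a covariance matrix is symmetric;
    # it returns eigenvalues in ascending order, so sort them descending.
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    explained = evals / evals.sum()
    return evals[:n_components], evecs[:, :n_components], explained[:n_components]


# Usage, assuming a PCAngsd run with "-out output":
# C = np.genfromtxt("output.cov")
# evals, pcs, explained = principal_components(C)
# Each row of pcs holds the coordinates of one individual on the top PCs.
```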

R can also read Numpy matrices using the "RcppCNPy" R library:
<pre>
library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
D <- npyLoad("output.selection.npy") # Reads PC based selection statistics
</pre>

An example of generating genotype likelihoods in [http://popgen.dk/angsd ANGSD] and outputting them in the required Beagle text format:

<pre>
./angsd -GL 2 -out input -nThreads 4 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

=Tutorial=

Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]

=Options=
<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>

==General usage==
; -beagle [Beagle file]
Input file of genotype likelihoods in Beagle format (.beagle.gz).
; -filter [Text file]
Input file of 1's and 0's indicating whether or not to keep each individual.
; -plink [Prefix for binary PLINK files]
Path to PLINK files using ONLY their prefix (.bed, .bim, .fam).
; -plink_error [float]
Incorporate errors into genotypes by specifying rate as argument.
; -minMaf [float]
Minimum minor allele frequency threshold. (Default: 0.05)
; -maf_iter [int]
Maximum number of EM iterations for computing the population allele frequencies (Default: 200).
; -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).
; -iter [int]
Maximum number of iterations for estimation of individual allele frequencies (Default: 100).
; -tole [float]
Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).
; -hwe [.lrt.npy file]
Input file of LRT binary file from previous PCAngsd run to filter based on HWE.
; -hwe_tole [float]
Threshold for HWE filtering of sites.
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).
; -pi [.pi.npy file]
Load previous estimation of individual allele frequencies to skip covariance estimation.
; -maf_save
Choose to save estimated population allele frequencies (Binary). Numpy format (.npy).
; -pi_save
Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy). Can be used with the '-pi' command.
; -dosage_save
Choose to save estimated genotype dosages (Binary). Numpy format (.npy).
; -post_save
Choose to save the posterior genotype probabilities. Beagle format (.beagle).
; -sites_save
Choose to save the kept sites after filtering which is useful for downstream analysis. Outputs a file of 1's and 0's for keeping a site or not, respectively.
; -threads [int]
Specify the number of thread(s) to use (Default: 1).
; -out [output prefix]
Fileprefix for all output files created by PCAngsd (Default: "pcangsd").

==Selection==
Perform PC-based genome-wide selection scans using posterior expectations of the genotypes (genotype dosages):

; -selection
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome-wide selection scan along all significant PCs. Outputs the selection statistics, which the user must convert to p-values. Each column reflects the selection statistics along a tested PC, and they are χ²-distributed with 1 degree of freedom.

; -pcadapt
Using an extended model of [https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12592 pcadapt]. Performs a genome-wide selection scan across all significant PCs. Outputs the z-scores, which must be converted to test statistics with the provided script 'pcangsd/scripts/pcadapt.R'. The test statistics are χ²-distributed with K degrees of freedom.

; -snp_weights
Output the SNP weights of the significant K eigenvectors.
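The statistics written by '-selection' still have to be converted to p-values by the user. Since each statistic is χ²-distributed with 1 degree of freedom, the upper-tail probability can be computed with the complementary error function. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of PCAngsd.

```python
import math


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def selection_pvalues(stats):
    """Convert rows of per-PC selection statistics (sites x PCs) into p-values."""
    return [[chi2_1df_pvalue(s) for s in row] for row in stats]


# Usage, assuming the statistics were loaded with NumPy as a sites x PCs array:
# D = np.load("output.selection.npy")
# p = selection_pvalues(D.tolist())
```

With SciPy installed, <code>scipy.stats.chi2.sf(D, 1)</code> gives the same p-values directly on the array.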

==Inbreeding==
; -inbreedSites
Estimate per-site inbreeding coefficients accounting for population structure and perform a likelihood ratio test for detecting sites deviating from HWE [https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019].

; -inbreedSamples
Estimate per-individual inbreeding coefficients accounting for population structure which is based on an extension of [http://genome.cshlp.org/content/23/11/1852.full ngsF] for structured populations.

; -inbreed_iter [int]
Maximum number of iterations for inbreeding EM algorithm. (Default: 200)

; -inbreed_tole [float]
Tolerance value for inbreeding EM algorithm in estimating inbreeding coefficients. (Default: 1e-4)

==Call genotypes==
Genotypes can be called from posterior genotype probabilities by incorporating the individual allele frequencies as prior information.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. '-inbreedSamples' must also be called for using this option.

==Admixture==
Individual admixture proportions and ancestral allele frequencies can be estimated assuming K ancestral populations using an accelerated mini-batch NMF method.

; -admix
Toggles admixture estimations. Estimates admixture proportions and ancestral allele frequencies.
; -admix_K [int]
Not recommended. Override the number of ancestry components (K) to use, instead of using K=e+1.
; -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 200)
; -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
; -admix_alpha [float]
Specify alpha (sparseness regularization parameter). (Default: 0)
; -admix_auto [float]
Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.
; -admix_seed [int]
Specify seed for random initializations of factor matrices in admixture estimations.

==Tree==
; -tree
Construct a neighbour-joining tree of samples from the covariance matrix estimated based on individual allele frequencies.
; -tree_samples
Provide a list of sample names of all individuals to construct a beautiful tree.

=Citation=
Our methods for inferring population structure have been published in GENETICS:

[http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]

Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:

[https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

EvalAdmix

2022-12-27T09:27:51Z

Albrecht: /* Genotype data */

'''IMPORTANT''': version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data, which caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness, this would result in the last samples in the analysis having a correlation of nan with all other samples, but it might have more subtle effects whenever there is any level of missingness. If you have analyses from previous versions based on genotype data with any missingness, it might be a good idea to re-run them after updating. The bug did not affect the genotype likelihoods implementation, so if you based the analyses on genotype likelihoods you do not need to worry. If you applied it to genotype data without any missingness, you also do not need to worry.

evalAdmix evaluates the results of an admixture analysis (i.e. the result of applying [https://genome.cshlp.org/content/19/9/1655.long ADMIXTURE], [https://web.stanford.edu/group/pritchardlab/structure.html STRUCTURE], [http://www.popgen.dk/software/index.php/NgsAdmix NGSadmix] and similar). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can be either called genotypes in [https://www.cog-genomics.org/plink/1.9/formats#bed binary plink format] or genotype likelihoods in [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format beagle format].

The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.
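To illustrate what a correlation of residuals is, the sketch below computes it for called genotypes without missing data. The residual of an individual at a site is the observed genotype minus the genotype predicted by the admixture model (twice the individual allele frequency). This is a simplified illustration only: it skips the frequency correction that evalAdmix applies (so it corresponds roughly to the biased calculation without correction) and it is not the evalAdmix implementation. All names are hypothetical.

```python
import math


def residual_correlation(genotypes, q, f):
    """Uncorrected correlation of residuals between all pairs of individuals.

    genotypes: N x M genotypes coded 0/1/2 (no missing data)
    q:         N x K admixture proportions
    f:         M x K ancestral allele frequencies
    """
    n, m, k = len(genotypes), len(f), len(q[0])
    # Residual = observed genotype - 2 * individual allele frequency.
    resid = [
        [genotypes[i][j] - 2.0 * sum(q[i][a] * f[j][a] for a in range(k))
         for j in range(m)]
        for i in range(n)
    ]

    def pearson(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / math.sqrt(sxx * syy)

    # Diagonal entries are left undefined, as in the evalAdmix output.
    cor = [[math.nan] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            cor[i][j] = cor[j][i] = pearson(resid[i], resid[j])
    return cor
```

In this toy setting, two individuals whose genotypes deviate from the model prediction in the same direction at the same sites get a positive correlation, and individuals deviating in opposite directions get a negative one.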

[[File:evalAdmix.png|thumb|550px]]

==Download and Installation==

evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]

<pre>git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
</pre>

==Quick start==
:<code> ./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10 </code>
:<code> ./evalAdmix -plink inputPlinkPrefix -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10 </code>

* '''-beagle''' beagle file of genotype likelihoods
* '''-plink''' binary plink file prefix with genotype data
* '''-fname''' file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
* '''-qname''' file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
* '''-o''' prefix of output file names
* '''-P''' Number of threads used

==Parameters==

<pre>./evalAdmix </pre>

<pre>
Arguments:
Required:
-plink path to binary plink file (excluding the .bed)
or
-beagle path to beagle file containing genotype likelihoods (alternative to -plink)

-fname path to ancestral population frequencies file
-qname path to admixture proportions file

Optional:

-o name of the output file

Setup (optional):

-P 1 number of threads
-autosomeMax 23 autosome ends with this chromsome
-nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
-useSites 1.0 proportion of sites to use to calculate correlation of residuals
-useInds filename path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
-misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
-minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
</pre>

==Input Files==

===Plink===
Genotype data files in binary PLINK format (.bed .fam .bim).
===Beagle genotype likelihoods===
The input file contains genotype likelihoods in .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html] and can be compressed with gzip.
==== BAM files ====
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]

==== VCF files ====
If you already have a VCF file that contains genotype likelihood information, it should be possible to convert it to a .beagle file via vcftools [https://vcftools.github.io/man_latest.html]

<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
</pre>
Chromosome has to be specified.

You can also use bcftools' [https://samtools.github.io/bcftools/bcftools.html] 'query' option for generating a .beagle file from a .vcf file.

==Output File==
The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:

NA 0.008609 -0.006919 0.002731 0.020224<br />
0.008609 NA 0.000033 0.004968 -0.008470<br />
-0.006919 0.000033 NA 0.006982 0.005664<br />
0.002731 0.004968 0.006982 NA 0.000521<br />
0.020224 -0.008470 0.005664 0.000521 NA
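A matrix in this layout can also be read outside R. The following is a minimal sketch in Python (standard library only) that parses the matrix, turning the NA diagonal into NaN, and summarises it as the mean absolute correlation over all pairs. The summary statistic and the function names are choices made for this example, not something evalAdmix outputs.

```python
import math


def parse_corres(text):
    """Parse a whitespace-delimited correlation matrix; 'NA' entries become NaN."""
    matrix = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        matrix.append([math.nan if x == "NA" else float(x) for x in fields])
    return matrix


def mean_abs_offdiag(matrix):
    """Mean absolute correlation over all pairs i < j, skipping NaN entries."""
    values = [
        abs(matrix[i][j])
        for i in range(len(matrix))
        for j in range(i + 1, len(matrix))
        if not math.isnan(matrix[i][j])
    ]
    return sum(values) / len(values) if values else math.nan


# Usage, assuming the output file name used in the examples on this page:
# with open("output.corres.txt") as handle:
#     r = parse_corres(handle.read())
# print(mean_abs_offdiag(r))
```

Values close to 0 are expected when the data fit the admixture model well.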

==Run command example==

=== Genotype data ===
Download the input file containing genotypes in binary plink format
<pre>
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
</pre>

Run ADMIXTURE [http://software.genetics.ucla.edu/admixture/] to obtain admixture proportions

<pre>admixture admixTjeck2.bed 3 -j20</pre>

::Genotypes file prefix admixTjeck2.bed
::Number of Ancestral Populations K = 3
::Number of CPU threads = 20

Run evalAdmix
<pre>./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 </pre>

::Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
::Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
::Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
::Computer cores = 20 (-P 20).

Plot results in R
[[File:evalAdmixK3.Q.png|thumb|frame|admixture proportions]]
[[File:evalAdmixK3.cor.png|thumb|frame|evalAdmix correlations]]
<pre>
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q",stringsAsFactors=T)

palette(c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928","#1B9E77","#999999"))

# order according to population and plot the ADMIXTURE results
ord<-orderInds(pop = as.vector(pop[,2]), q = q)

#make barplot
plotAdmix(q,ord=ord,pop=pop[,2])

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

=== Low depth sequencing data ===
Download the input file containing genotype likelihoods in beagle format
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
</pre>

Execute [[NgsAdmix |NGSadmix]] to obtain admixture proportions
<pre>./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05</pre>

::Input file = Demo2input.gz
::Ancestral Populations K=3
::Computer cores = 20 (-P 20).
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5% (-minMaf 0.05)

Run evalAdmix
<pre>./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 </pre>

::Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
::Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
::Admixture proportions file myoutfiles.qopt (-qname myoutfiles.qopt).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix results
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals

plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

==Citation==

[https://doi.org/10.1111/1755-0998.13171 Evaluation of Model Fit of Inferred Admixture Proportions]

File:EvalAdmixK3.Q.png

2022-12-27T09:26:56Z

Albrecht: Albrecht uploaded a new version of File:EvalAdmixK3.Q.png

File:EvalAdmixK3.cor.png

2022-12-27T09:25:56Z

Albrecht:

File:EvalAdmixK3.Q.png

2022-12-27T09:25:35Z

Albrecht:

EvalAdmix

2022-12-27T09:25:20Z

Albrecht: /* Genotype data */

'''IMPORTANT''': version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data, which caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness, this would result in the last samples in the analysis having a correlation of nan with all other samples, but it might have more subtle effects whenever there is any level of missingness. If you have analyses from previous versions based on genotype data with any missingness, it might be a good idea to re-run them after updating. The bug did not affect the genotype likelihoods implementation, so if you based the analyses on genotype likelihoods you do not need to worry. If you applied it to genotype data without any missingness, you also do not need to worry.

evalAdmix evaluates the results of an admixture analysis (i.e. the result of applying [https://genome.cshlp.org/content/19/9/1655.long ADMIXTURE], [https://web.stanford.edu/group/pritchardlab/structure.html STRUCTURE], [http://www.popgen.dk/software/index.php/NgsAdmix NGSadmix] and similar). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can be either called genotypes in [https://www.cog-genomics.org/plink/1.9/formats#bed binary plink format] or genotype likelihoods in [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format beagle format].

The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.

[[File:evalAdmix.png|thumb|550px]]

==Download and Installation==

evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]

<pre>git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
</pre>

==Quick start==
:<code> ./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10 </code>
:<code> ./evalAdmix -plink inputPlinkPrefix -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10 </code>

* '''-beagle''' beagle file of genotype likelihoods
* '''-plink''' binary plink file prefix with genotype data
* '''-fname''' file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
* '''-qname''' file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
* '''-o''' prefix of output file names
* '''-P''' Number of threads used

==Parameters==

<pre>./evalAdmix </pre>

<pre>
Arguments:
Required:
-plink path to binary plink file (excluding the .bed)
or
-beagle path to beagle file containing genotype likelihoods (alternative to -plink)

-fname path to ancestral population frequencies file
-qname path to admixture proportions file

Optional:

-o name of the output file

Setup (optional):

-P 1 number of threads
-autosomeMax 23 autosome ends with this chromsome
-nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
-useSites 1.0 proportion of sites to use to calculate correlation of residuals
-useInds filename path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
-misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
-minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
</pre>

==Input Files==

===Plink===
Genotype data files in binary PLINK format (.bed .fam .bim).
===Beagle genotype likelihoods===
The input file contains genotype likelihoods in .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html] and can be compressed with gzip.
==== BAM files ====
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]

==== VCF files ====
If you already have a VCF file that contains genotype likelihood information, it should be possible to convert it to a .beagle file via vcftools [https://vcftools.github.io/man_latest.html]

<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
</pre>
Chromosome has to be specified.

You can also use bcftools' [https://samtools.github.io/bcftools/bcftools.html] 'query' option for generating a .beagle file from a .vcf file.

==Output File==
The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:

NA 0.008609 -0.006919 0.002731 0.020224<br />
0.008609 NA 0.000033 0.004968 -0.008470<br />
-0.006919 0.000033 NA 0.006982 0.005664<br />
0.002731 0.004968 0.006982 NA 0.000521<br />
0.020224 -0.008470 0.005664 0.000521 NA

==Run command example==

=== Genotype data ===
Download the input file containing genotypes in binary plink format
<pre>
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
</pre>

Run ADMIXTURE [http://software.genetics.ucla.edu/admixture/] to obtain admixture proportions

<pre>admixture admixTjeck2.bed 3 -j20</pre>

::Genotypes file prefix admixTjeck2.bed
::Number of Ancestral Populations K = 3
::Number of CPU threads = 20

Run evalAdmix
<pre>./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 </pre>

::Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
::Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
::Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q",stringsAsFactors=T)

palette(c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928","#1B9E77","#999999"))

# order according to population and plot the ADMIXTURE results
ord<-orderInds(pop = as.vector(pop[,2]), q = q)

#make barplot
plotAdmix(q,ord=ord,pop=pop[,2])

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>
[[File:evalAdmixK3.Q.png|thumb|frame|admixture proportions]]
[[File:evalAdmixK3.cor.png|thumb|frame|evalAdmix correlations]]

=== Low depth sequencing data ===
Download the input file containing genotype likelihoods in beagle format
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
</pre>

Execute [[NgsAdmix |NGSadmix]] to obtain admixture proportions
<pre>./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05</pre>

::Input file = Demo2input.gz
::Ancestral Populations K=3
::Computer cores = 20 (-P 20).
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5% (-minMaf 0.05)

Run evalAdmix
<pre>./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 </pre>

::Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
::Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
::Admixture proportions file myoutfiles.qopt (-qname myoutfiles.qopt).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix results
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals

plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

==Citation==

[https://doi.org/10.1111/1755-0998.13171 Evaluation of Model Fit of Inferred Admixture Proportions]

EvalAdmix

2022-12-27T09:16:14Z

Albrecht: /* Run command example */

'''IMPORTANT''': version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data, which caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness, this would result in the last samples in the analysis having a correlation of nan with all other samples, but it might have more subtle effects whenever there is any level of missingness. If you have analyses from previous versions based on genotype data with any missingness, it might be a good idea to re-run them after updating. The bug did not affect the genotype likelihoods implementation, so if you based the analyses on genotype likelihoods you do not need to worry. If you applied it to genotype data without any missingness, you also do not need to worry.

evalAdmix evaluates the results of an admixture analysis (i.e. the result of applying [https://genome.cshlp.org/content/19/9/1655.long ADMIXTURE], [https://web.stanford.edu/group/pritchardlab/structure.html STRUCTURE], [http://www.popgen.dk/software/index.php/NgsAdmix NGSadmix] and similar). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can be either called genotypes in [https://www.cog-genomics.org/plink/1.9/formats#bed binary plink format] or genotype likelihoods in [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format beagle format].

The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.

[[File:evalAdmix.png|thumb|550px]]

==Download and Installation==

evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]

<pre>git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
</pre>

==Quick start==
:<code> ./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10 </code>
:<code> ./evalAdmix -plink inputPlinkPrefix -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10 </code>

* '''-beagle''' beagle file of genotype likelihoods
* '''-plink''' binary plink file prefix with genotype data
* '''-fname''' file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
* '''-qname''' file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
* '''-o''' prefix of output file names
* '''-P''' Number of threads used

==Parameters==

<pre>./evalAdmix </pre>

<pre>
Arguments:
Required:
-plink path to binary plink file (excluding the .bed)
or
-beagle path to beagle file containing genotype likelihoods (alternative to -plink)

-fname path to ancestral population frequencies file
-qname path to admixture proportions file

Optional:

-o name of the output file

Setup (optional):

-P 1 number of threads
-autosomeMax 23 autosome ends with this chromsome
-nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
-useSites 1.0 proportion of sites to use to calculate correlation of residuals
-useInds filename path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
-misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
-minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
</pre>

==Input Files==

===Plink===
Genotype data files in binary PLINK format (.bed .fam .bim).
===Beagle genotype likelihoods===
The input file contains genotype likelihoods in .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html] and can be compressed with gzip.
==== BAM files ====
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]

==== VCF files ====
If you already have a VCF file that contains genotype likelihood information, it should be possible to convert it to a .beagle file via vcftools [https://vcftools.github.io/man_latest.html]

<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
</pre>
Chromosome has to be specified.

You can also use bcftools' [https://samtools.github.io/bcftools/bcftools.html] 'query' option for generating a .beagle file from a .vcf file.

==Output File==
The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:

NA 0.008609 -0.006919 0.002731 0.020224<br />
0.008609 NA 0.000033 0.004968 -0.008470<br />
-0.006919 0.000033 NA 0.006982 0.005664<br />
0.002731 0.004968 0.006982 NA 0.000521<br />
0.020224 -0.008470 0.005664 0.000521 NA
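
A matrix in this layout can be scanned for pairs of individuals with a large absolute correlation. The sketch below assumes a whitespace- or tab-delimited file with NA on the diagonal; the function name flag_pairs and the cutoff are arbitrary choices for illustration, not recommended values.

```shell
#!/bin/sh
# Sketch: list pairs of individuals whose residual correlation exceeds a cutoff.
corres=$(mktemp)
# Toy 3 x 3 matrix using the first values of the example above
printf 'NA\t0.008609\t-0.006919\n0.008609\tNA\t0.000033\n-0.006919\t0.000033\tNA\n' > "$corres"

flag_pairs() {
  # $1 = correlation matrix file, $2 = cutoff on the absolute correlation
  awk -v cut="$2" '{
    for (j = NR + 1; j <= NF; j++) {   # upper triangle only, the matrix is symmetric
      v = $j + 0
      if ($j != "NA" && (v > cut || v < -cut)) print NR, j, $j
    }
  }' "$1"
}

# Prints row index, column index and correlation for each flagged pair
flag_pairs "$corres" 0.005
```

Row and column indices refer to the order of individuals in the input data.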

==Run command example==

=== Genotype data ===
Download the input file containing genotypes in binary plink format
<pre>
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
</pre>

Run ADMIXTURE [http://software.genetics.ucla.edu/admixture/] to obtain admixture proportions

<pre>admixture admixTjeck2.bed 3 -j20</pre>

::Genotypes file admixTjeck2.bed
::Number of Ancestral Populations K = 3
::Number of CPU threads = 20

Run evalAdmix
<pre>./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 </pre>

::Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
::Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
::Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q",stringsAsFactors=T)

palette(c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928","#1B9E77","#999999"))

# order according to population and plot the ADMIXTURE results
ord<-orderInds(pop = as.vector(pop[,2]), q = q)

#make barplot
plotAdmix(q,ord=ord,pop=pop[,2])

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

=== Low depth sequencing data ===
Download the input file containing genotype likelihoods in beagle format
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
</pre>

Execute [[NgsAdmix |NGSadmix]] to obtain admixture proportions
<pre>./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05</pre>

::Input file = Demo2input.gz
::Ancestral Populations K=3
::Computer cores = 20 (-P 20).
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5% (-minMaf 0.05)

Run evalAdmix
<pre>./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 </pre>

::Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
::Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
::Admixture proportions file myoutfiles.qopt (-qname myoutfiles.qopt).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix results
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r<-read.table("output.corres.txt")

# Plot correlation of residuals

plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

==Citation==

[https://doi.org/10.1111/1755-0998.13171 Evaluation of Model Fit of Inferred Admixture Proportions]

EvalAdmix

2022-12-27T09:13:09Z

Albrecht: /* Genotype data */

'''IMPORTANT''': version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data, which caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness, this would result in the last samples in the analysis having a correlation of nan with all other samples, but it might also have more subtle effects whenever there is any missingness. If you have analyses from previous versions based on genotype data with any missingness, it might be a good idea to re-run them after updating. The bug did not affect the genotype likelihood implementation, so if you based your analyses on genotype likelihoods you do not need to worry. If you applied it to genotype data without any missingness, you also do not need to worry.

evalAdmix evaluates the results of an admixture analysis (e.g. the result of applying [https://genome.cshlp.org/content/19/9/1655.long ADMIXTURE], [https://web.stanford.edu/group/pritchardlab/structure.html STRUCTURE], [http://www.popgen.dk/software/index.php/NgsAdmix NGSadmix] or similar software). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can be either called genotypes in [https://www.cog-genomics.org/plink/1.9/formats#bed binary plink format] or genotype likelihoods in [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format beagle format].

The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.
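
To see the pattern described above in numbers, the correlations can be averaged within and between populations. The sketch below uses made-up values and hypothetical population labels (one label per individual, in the same order as the matrix rows); the function name mean_by_pop is also only illustrative.

```shell
#!/bin/sh
# Sketch: mean correlation of residuals within and between populations.
tmp=$(mktemp -d)
# Hypothetical labels, one per individual, in matrix row order
printf 'popA\npopA\npopB\n' > "$tmp/pop.txt"
# Made-up 3 x 3 correlation matrix with NA on the diagonal
printf 'NA\t0.04\t-0.02\n0.04\tNA\t-0.03\n-0.02\t-0.03\tNA\n' > "$tmp/corres.txt"

mean_by_pop() {
  # $1 = population labels file, $2 = correlation matrix file
  awk 'NR == FNR { pop[FNR] = $1; next }
       {
         for (j = FNR + 1; j <= NF; j++) {   # upper triangle only
           if ($j == "NA") continue
           a = pop[FNR]; b = pop[j]
           key = (a <= b) ? a " " b : b " " a   # same key for A-B and B-A
           sum[key] += $j; n[key]++
         }
       }
       END { for (k in sum) printf "%s %.4f\n", k, sum[k] / n[k] }' "$1" "$2" | sort
}

mean_by_pop "$tmp/pop.txt" "$tmp/corres.txt"
```

In this toy case the within-population mean is positive and the between-population mean is negative, which is the kind of signal that suggests a poor model fit.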

[[File:evalAdmix.png|thumb|550px]]

==Download and Installation==

evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]

<pre>git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
</pre>

==Quick start==
:<code> ./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10 </code>
:<code> ./evalAdmix -plink inputPlinkPrefix -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10 </code>

* '''-beagle''' beagle file of genotype likelihoods
* '''-plink''' binary plink file prefix with genotype data
* '''-fname''' file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
* '''-qname''' file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
* '''-o''' prefix of output file names
* '''-P''' Number of threads used
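
Since both the frequency file and the admixture proportions file have one column per ancestral population, a quick consistency check before running evalAdmix can catch mismatched files. This is a sketch with toy files and hypothetical helper names (n_cols, n_rows), not part of evalAdmix itself.

```shell
#!/bin/sh
# Sketch: check that the -fname and -qname files agree on the number of populations K.
tmp=$(mktemp -d)
# Toy files with made-up values: 2 individuals, 3 sites, K = 2
printf '0.9 0.1\n0.2 0.8\n' > "$tmp/toy.Q"
printf '0.1 0.5\n0.3 0.7\n0.6 0.2\n' > "$tmp/toy.P"

n_cols() { awk 'NR == 1 { print NF; exit }' "$1"; }   # columns in the first row
n_rows() { awk 'END { print NR }' "$1"; }             # number of rows

k_q=$(n_cols "$tmp/toy.Q")
k_p=$(n_cols "$tmp/toy.P")
if [ "$k_q" -eq "$k_p" ]; then
  echo "K=$k_q individuals=$(n_rows "$tmp/toy.Q") sites=$(n_rows "$tmp/toy.P")"
else
  echo "K mismatch: Q has $k_q columns, P has $k_p" >&2
fi
```

A gzip-compressed frequency file would have to be decompressed first, for example with gzip -dc, before counting rows and columns.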

==Parameters==

<pre>./evalAdmix </pre>

<pre>
Arguments:
Required:
-plink path to binary plink file (excluding the .bed)
or
-beagle path to beagle file containing genotype likelihoods (alternative to -plink)

-fname path to ancestral population frequencies file
-qname path to admixture proportions file

Optional:

-o name of the output file

Setup (optional):

-P 1 number of threads
-autosomeMax 23 autosome ends with this chromsome
-nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
-useSites 1.0 proportion of sites to use to calculate correlation of residuals
-useInds filename path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
-misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
-minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
</pre>

==Input Files==

===Plink===
Genotype data files in binary PLINK format (.bed .fam .bim).
===Beagle genotype likelihoods===
The input file contains genotype likelihoods in .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html] and can be compressed with gzip.
==== BAM files ====
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]

==== VCF files ====
If you already have a VCF file that contains genotype likelihood information, it should be possible to convert it to a .beagle file with vcftools [https://vcftools.github.io/man_latest.html]:

<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
</pre>
The chromosome has to be specified with --chr.

You can also use the 'query' command of bcftools [https://samtools.github.io/bcftools/bcftools.html] to generate a .beagle file from a .vcf file.

==Output File==
The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:

NA 0.008609 -0.006919 0.002731 0.020224<br />
0.008609 NA 0.000033 0.004968 -0.008470<br />
-0.006919 0.000033 NA 0.006982 0.005664<br />
0.002731 0.004968 0.006982 NA 0.000521<br />
0.020224 -0.008470 0.005664 0.000521 NA

==Run command example==

=== Genotype data ===
Download the input file containing genotypes in binary plink format
<pre>
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
</pre>

Run ADMIXTURE [http://software.genetics.ucla.edu/admixture/] to obtain admixture proportions

<pre>admixture admixTjeck2.bed 3 -j20</pre>

::Genotypes file admixTjeck2.bed
::Number of Ancestral Populations K = 3
::Number of CPU threads = 20

Run evalAdmix
<pre>./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 </pre>

::Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
::Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
::Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q",stringsAsFactors=T)

palette(c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928","#1B9E77","#999999"))

# order according to population and plot the ADMIXTURE results
ord<-orderInds(pop = as.vector(pop[,2]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,2],mean)),-0.05,unique(pop[ord,2]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,2]),function(x){sum(pop[ord,2]==x)})),col=1,lwd=1.2)

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

=== Low depth sequencing data ===
Download the input file containing genotype likelihoods in beagle format
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
</pre>

Execute [[NgsAdmix |NGSadmix]] to obtain admixture proportions
<pre>./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05</pre>

::Input file = Demo2input.gz
::Ancestral Populations K=3
::Computer cores = 20 (-P 20).
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5% (-minMaf 0.05)

Run evalAdmix
<pre>./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 </pre>

::Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
::Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
::Admixture proportions file myoutfiles.qopt (-qname myoutfiles.qopt).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix results
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r<-read.table("output.corres.txt")

# Plot correlation of residuals

plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

==Citation==

[https://doi.org/10.1111/1755-0998.13171 Evaluation of Model Fit of Inferred Admixture Proportions]

EvalAdmix

2022-12-27T09:02:57Z

Albrecht: /* Genotype data */

'''IMPORTANT''': version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data, which caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness, this would result in the last samples in the analysis having a correlation of nan with all other samples, but it might also have more subtle effects whenever there is any missingness. If you have analyses from previous versions based on genotype data with any missingness, it might be a good idea to re-run them after updating. The bug did not affect the genotype likelihood implementation, so if you based your analyses on genotype likelihoods you do not need to worry. If you applied it to genotype data without any missingness, you also do not need to worry.

evalAdmix evaluates the results of an admixture analysis (e.g. the result of applying [https://genome.cshlp.org/content/19/9/1655.long ADMIXTURE], [https://web.stanford.edu/group/pritchardlab/structure.html STRUCTURE], [http://www.popgen.dk/software/index.php/NgsAdmix NGSadmix] or similar software). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can be either called genotypes in [https://www.cog-genomics.org/plink/1.9/formats#bed binary plink format] or genotype likelihoods in [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format beagle format].

The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.

[[File:evalAdmix.png|thumb|550px]]

==Download and Installation==

evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]

<pre>git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
</pre>

==Quick start==
:<code> ./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10 </code>
:<code> ./evalAdmix -plink inputPlinkPrefix -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10 </code>

* '''-beagle''' beagle file of genotype likelihoods
* '''-plink''' binary plink file prefix with genotype data
* '''-fname''' file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
* '''-qname''' file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
* '''-o''' prefix of output file names
* '''-P''' Number of threads used

==Parameters==

<pre>./evalAdmix </pre>

<pre>
Arguments:
Required:
-plink path to binary plink file (excluding the .bed)
or
-beagle path to beagle file containing genotype likelihoods (alternative to -plink)

-fname path to ancestral population frequencies file
-qname path to admixture proportions file

Optional:

-o name of the output file

Setup (optional):

-P 1 number of threads
-autosomeMax 23 autosome ends with this chromsome
-nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
-useSites 1.0 proportion of sites to use to calculate correlation of residuals
-useInds filename path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
-misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
-minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
</pre>

==Input Files==

===Plink===
Genotype data files in binary PLINK format (.bed .fam .bim).
===Beagle genotype likelihoods===
The input file contains genotype likelihoods in .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html] and can be compressed with gzip.
==== BAM files ====
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]

==== VCF files ====
If you already have a VCF file that contains genotype likelihood information, it should be possible to convert it to a .beagle file with vcftools [https://vcftools.github.io/man_latest.html]:

<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
</pre>
The chromosome has to be specified with --chr.

You can also use the 'query' command of bcftools [https://samtools.github.io/bcftools/bcftools.html] to generate a .beagle file from a .vcf file.

==Output File==
The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:

NA 0.008609 -0.006919 0.002731 0.020224<br />
0.008609 NA 0.000033 0.004968 -0.008470<br />
-0.006919 0.000033 NA 0.006982 0.005664<br />
0.002731 0.004968 0.006982 NA 0.000521<br />
0.020224 -0.008470 0.005664 0.000521 NA

==Run command example==

=== Genotype data ===
Download the input file containing genotypes in binary plink format
<pre>
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
</pre>

Run ADMIXTURE [http://software.genetics.ucla.edu/admixture/] to obtain admixture proportions

<pre>admixture admixTjeck2.bed 3 -j20</pre>

::Genotypes file admixTjeck2.bed
::Number of Ancestral Populations K = 3
::Number of CPU threads = 20

Run evalAdmix
<pre>./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 </pre>

::Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
::Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
::Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q")

# order according to population and plot the ADMIXTURE results
ord<-orderInds(pop = as.vector(pop[,2]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,2],mean)),-0.05,unique(pop[ord,2]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,2]),function(x){sum(pop[ord,2]==x)})),col=1,lwd=1.2)

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

=== Low depth sequencing data ===
Download the input file containing genotype likelihoods in beagle format
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
</pre>

Execute [[NgsAdmix |NGSadmix]] to obtain admixture proportions
<pre>./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05</pre>

::Input file = Demo2input.gz
::Ancestral Populations K=3
::Computer cores = 20 (-P 20).
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5% (-minMaf 0.05)

Run evalAdmix
<pre>./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 </pre>

::Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
::Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
::Admixture proportions file myoutfiles.qopt (-qname myoutfiles.qopt).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix results
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r<-read.table("output.corres.txt")

# Plot correlation of residuals

plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

==Citation==

[https://doi.org/10.1111/1755-0998.13171 Evaluation of Model Fit of Inferred Admixture Proportions]

EvalAdmix

2022-12-27T09:02:24Z

Albrecht: /* Genotype data */

'''IMPORTANT''': version 0.95 (updated on 30/06/2021) fixes a bug in the implementation for genotype data, which caused displacement of genotypes between samples when a site had missing data. When all sites have some missingness, this would result in the last samples in the analysis having a correlation of nan with all other samples, but it might also have more subtle effects whenever there is any missingness. If you have analyses from previous versions based on genotype data with any missingness, it might be a good idea to re-run them after updating. The bug did not affect the genotype likelihood implementation, so if you based your analyses on genotype likelihoods you do not need to worry. If you applied it to genotype data without any missingness, you also do not need to worry.

evalAdmix evaluates the results of an admixture analysis (e.g. the result of applying [https://genome.cshlp.org/content/19/9/1655.long ADMIXTURE], [https://web.stanford.edu/group/pritchardlab/structure.html STRUCTURE], [http://www.popgen.dk/software/index.php/NgsAdmix NGSadmix] or similar software). It only needs the input genotype data used for the previous admixture analysis and the output of that analysis (admixture proportions and ancestral population frequencies). The genotype input data can be either called genotypes in [https://www.cog-genomics.org/plink/1.9/formats#bed binary plink format] or genotype likelihoods in [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format beagle format].

The output is a pairwise correlation of residuals matrix between individuals. The correlation will be close to 0 in case of a good fit of the data to the admixture model. When individuals do not fit the model, individuals with similar demographic histories (i.e. usually individuals from the same population) will be positively correlated; and individuals with different histories but that are modelled as sharing one or more ancestral populations as admixture sources will have a negative correlation. Positive correlation between a pair of individuals might also be due to relatedness.

[[File:evalAdmix.png|thumb|550px]]

==Download and Installation==

evalAdmix can be installed from [https://github.com/GenisGE/evalAdmix github]

<pre>git clone https://github.com/GenisGE/evalAdmix.git
cd evalAdmix
make
</pre>

==Quick start==
:<code> ./evalAdmix -beagle inputBeagleFile.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -o evaladmixOut.corres -P 10 </code>
:<code> ./evalAdmix -plink inputPlinkPrefix -fname inputPlinkPrefix.K.P -qname inputPlinkPrefix.K.Q -o evaladmixOut.corres -P 10 </code>

* '''-beagle''' beagle file of genotype likelihoods
* '''-plink''' binary plink file prefix with genotype data
* '''-fname''' file with ancestral frequencies (space delimited, rows are sites and columns ancestral populations)
* '''-qname''' file with admixture proportions (space delimited, rows are individuals and columns ancestral populations)
* '''-o''' prefix of output file names
* '''-P''' Number of threads used

==Parameters==

<pre>./evalAdmix </pre>

<pre>
Arguments:
Required:
-plink path to binary plink file (excluding the .bed)
or
-beagle path to beagle file containing genotype likelihoods (alternative to -plink)

-fname path to ancestral population frequencies file
-qname path to admixture proportions file

Optional:

-o name of the output file

Setup (optional):

-P 1 number of threads
-autosomeMax 23 autosome ends with this chromsome
-nIts 5 number of iterations to do for frequency correction; if set to 0 calculates correlation without correction (fast but biased)
-useSites 1.0 proportion of sites to use to calculate correlation of residuals
-useInds filename path to tab delimited file with first column containing all individuals ID and second column with 1 to include individual in analysis and 0 otherwise (default all individuals are included)
-misTol 0.05 tolerance for considering site as missing when using genotype likelihoods. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
-minMaf 0.05 minimum minor allele frequency to keep site. Use same value as used in NGSadmix to keep compatibility when using genotype likelihoods (-beagle)
</pre>

==Input Files==

===Plink===
Genotype data files in binary PLINK format (.bed .fam .bim).
===Beagle genotype likelihoods===
The input file contains genotype likelihoods in .beagle file format [http://faculty.washington.edu/browning/beagle/beagle.html] and can be compressed with gzip.
==== BAM files ====
If you have BAM files you can use [[ANGSD]] to produce genotype likelihoods in .beagle format. Please
see [http://www.popgen.dk/angsd/index.php/Beagle_input Creation of Beagle files with ANGSD]

==== VCF files ====
If you already have a VCF file that contains genotype likelihood information, it should be possible to convert it to a .beagle file with vcftools [https://vcftools.github.io/man_latest.html]:

<pre>
vcftools --vcf input.vcf --out test --BEAGLE-GL --chr 1,2
</pre>
The chromosome has to be specified with --chr.

You can also use the 'query' command of bcftools [https://samtools.github.io/bcftools/bcftools.html] to generate a .beagle file from a .vcf file.

==Output File==
The analysis performed by evalAdmix produces one file, containing a tab delimited N times N symmetric correlation matrix, where column i in line j contains the correlation of residuals between individual i and j, and the diagonal values (self-correlation) are set to NA:

NA 0.008609 -0.006919 0.002731 0.020224<br />
0.008609 NA 0.000033 0.004968 -0.008470<br />
-0.006919 0.000033 NA 0.006982 0.005664<br />
0.002731 0.004968 0.006982 NA 0.000521<br />
0.020224 -0.008470 0.005664 0.000521 NA

==Run command example==

=== Genotype data ===
Download the input file containing genotypes in binary plink format
<pre>
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bed
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.bim
wget http://pontus.popgen.dk/albrecht/open/admixTjeck/plink/admixTjeck2.fam
</pre>

Run ADMIXTURE [http://software.genetics.ucla.edu/admixture/] to obtain admixture proportions

<pre>admixture admixTjeck2.bed 3 -j20</pre>

::Genotypes file admixTjeck2.bed
::Ancestral Populations K = 3

Run evalAdmix
<pre>./evalAdmix -plink admixTjeck2 -fname admixTjeck2.3.P -qname admixTjeck2.3.Q -P 20 </pre>

::Genotypes file prefix admixTjeck2 (-plink admixTjeck2).
::Ancestral Populations frequency file admixTjeck2.3.P (-fname admixTjeck2.3.P).
::Admixture proportions file admixTjeck2.3.Q (-qname admixTjeck2.3.Q).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")

# read population labels and estimated admixture proportions
pop<-read.table("admixTjeck2.fam")
q<-read.table("admixTjeck2.3.Q")

# order according to population and plot the ADMIXTURE results
ord<-orderInds(pop = as.vector(pop[,2]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,2],mean)),-0.05,unique(pop[ord,2]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,2]),function(x){sum(pop[ord,2]==x)})),col=1,lwd=1.2)

r<-as.matrix(read.table("output.corres.txt"))

# Plot correlation of residuals
plotCorRes(cor_mat = r, pop = as.vector(pop[,2]), ord=ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

=== Low depth sequencing data ===
Download the input file containing genotype likelihoods in beagle format
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
</pre>

Execute [[NgsAdmix |NGSadmix]] to obtain admixture proportions
<pre>./NGSadmix -likes Demo2input.gz -K 3 -P 20 -o myoutfiles -minMaf 0.05</pre>

::Input file = Demo2input.gz
::Ancestral Populations K=3
::Computer cores = 20 (-P 20).
::Output prefix = myoutfiles (-o myoutfiles)
::SNPs with MAF > 5% (-minMaf 0.05)

Run evalAdmix
<pre>./evalAdmix -beagle Demo2input.gz -fname myoutfiles.fopt.gz -qname myoutfiles.qopt -P 20 </pre>

::Genotype likelihoods file Demo2input.gz (-beagle Demo2input.gz).
::Ancestral Populations frequency file myoutfiles.fopt.gz (-fname myoutfiles.fopt.gz).
::Admixture proportions file myoutfiles.qopt (-qname myoutfiles.qopt).
::Computer cores = 20 (-P 20).

Plot results in R
<pre>
source("visFuns.R")
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("myoutfiles.qopt")

# order according to population and plot the NGSadmix results
ord<-orderInds(pop = as.vector(pop[,1]), q = q)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r<-read.table("output.corres.txt")

# Plot correlation of residuals

plotCorRes(cor_mat = r, pop = as.vector(pop[,1]), ord = ord, title="Evaluation of 1000G admixture proportions with K=3", max_z=0.1, min_z=-0.1)

</pre>

==Citation==

[https://doi.org/10.1111/1755-0998.13171 Evaluation of Model Fit of Inferred Admixture Proportions]

PCAngsdTutorial

2020-05-29T16:10:28Z

Albrecht: /* Admixture based on two PC */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd/pcangsd.py
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>mkdir Demo/Data</code>

<code>mkdir Demo/Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>
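
The setup steps above can be collected into one snippet. This is only a sketch: it works in a scratch directory so that nothing existing is touched, and the helper <code>check_path</code> is a made-up name, not part of PCAngsd. The program path is the one used in this tutorial and will differ on your machine.

```shell
# Work in a scratch directory so the sketch does not touch existing files.
cd "$(mktemp -d)"

# Path used in this tutorial; adjust to where pcangsd.py lives on your system.
PCANGSD=~/Software/pcangsd/pcangsd.py
IN_DIR=Demo/Data
OUT_DIR=Demo/Results

# -p creates parent directories as needed and does not fail if they already exist.
mkdir -p "$IN_DIR" "$OUT_DIR"

# Hypothetical helper: report whether a path exists.
# Shell variable names are case sensitive, so $PCANGSD and $PCAngsd are different variables.
check_path() {
  if [ -e "$1" ]; then echo "ok: $1"; else echo "missing: $1"; fi
}

check_path "$PCANGSD"
check_path "$IN_DIR"
check_path "$OUT_DIR"
```

Variables set this way only last for the current terminal session, which is why they have to be set again in every new window.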

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci for which there are GLs plus one, because the count includes the header line:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
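
The layout described above can be checked with a few lines of awk. The following is a sketch on a tiny synthetic file: the file name <code>toy.beagle.gz</code>, the sample names, the GL values and the tolerance of 0.01 are all made up for illustration, and the function names are not part of any tool.

```shell
# Build a tiny synthetic beagle file: 3 leading columns plus 3 GL columns per individual.
tmp=$(mktemp -d)
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n'    >  "$tmp/toy.beagle"
printf 'chr1_100\t0\t2\t0.90\t0.09\t0.01\t0.333333\t0.333333\t0.333333\n' >> "$tmp/toy.beagle"
printf 'chr1_200\t1\t3\t0.00\t0.50\t0.50\t0.80\t0.15\t0.05\n'             >> "$tmp/toy.beagle"
gzip "$tmp/toy.beagle"

# Number of individuals = (number of columns - 3) / 3, read from the header line.
beagle_n_ind() {
  gunzip -c "$1" | awk -F'\t' 'NR == 1 { print (NF - 3) / 3; exit }'
}

# Number of sites = number of lines minus the header line.
beagle_n_sites() {
  gunzip -c "$1" | awk 'END { print NR - 1 }'
}

# Count GL triplets whose sum deviates from 1 by more than the tolerance.
beagle_bad_triplets() {
  gunzip -c "$1" | awk -F'\t' -v tol=0.01 'NR > 1 {
    for (i = 4; i <= NF; i += 3) {
      s = $i + $(i + 1) + $(i + 2)
      if (s < 1 - tol || s > 1 + tol) bad++
    }
  } END { print bad + 0 }'
}

beagle_n_ind "$tmp/toy.beagle.gz"
beagle_n_sites "$tmp/toy.beagle.gz"
beagle_bad_triplets "$tmp/toy.beagle.gz"
```

The same functions can be pointed at <code>$IN_DIR/Demo1input.gz</code>, assuming that file is tab-separated like the toy example.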

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>
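
As a sketch of what the summary above produces, the snippet below runs the same pipeline on a toy label file. The sample names and the two-column layout of the toy file are assumptions for illustration, and <code>pop_counts</code> and <code>n_labels</code> are made-up helper names.

```shell
# Toy label file (made-up sample names): population in column 1, sample ID in column 2.
tmp=$(mktemp -d)
printf 'YRI sampleA\nCEU sampleB\nYRI sampleC\n' > "$tmp/toy_pop.info"

# Per-population counts, printed as NAME=COUNT to avoid the variable padding of uniq -c.
pop_counts() {
  cut -f 1 -d " " "$1" | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Total number of labelled individuals.
n_labels() {
  awk 'END { print NR }' "$1"
}

pop_counts "$tmp/toy_pop.info"
n_labels "$tmp/toy_pop.info"
```

The total from <code>n_labels</code> should equal the number of individuals in the beagle file, since the plots pair each label with one row of the results.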

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo/Data/Demo1pop.info",stringsAsFactors=TRUE) # labels as factors (no longer the default since R 4.0)
C <- as.matrix(read.table("Demo/Results/Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
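pop<-read.table("Demo/Data/Demo1pop.info",stringsAsFactors=TRUE) # labels as factors (no longer the default since R 4.0)
C <- as.matrix(read.table("Demo/Results/Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))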
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -e 2 -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo/Data/Demo1pop.info",as.is=T)

q <- npyLoad("Demo/Results/Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

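Join the names and the results, sort by the estimate, and look at the output. The paths assume the files are in <code>$IN_DIR</code> and <code>$OUT_DIR</code>:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed). Note that a <code>.npy</code> file is a binary NumPy array, so if <code>paste</code> prints unreadable output, load the file with <code>npyLoad</code> in R (as in the admixture example above) and combine it with the labels there.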

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

A file with the positions and sample information, and a beagle file are provided:

Download the files, then move the beagle file to your input folder ($IN_DIR) and the sample information file to your output folder ($OUT_DIR), as in the commands below:

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

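cov <- as.matrix(read.table("Demo/Results/Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo/Results/Demo2sample.info",head=T,stringsAsFactors=TRUE) # POP as a factor so it can be used for colours

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))
dev.off() # close the PDF device so the file is written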
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to scan for selection along the genome:

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>

library(RcppCNPy) # Numpy library for R

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, 1 degree of freedom)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
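
The <code>lambda</code> value printed by <code>qqchi</code> above is the median of the test statistics divided by the median of a chi-square distribution with one degree of freedom (about 0.4549). A minimal sketch of the same calculation in shell follows. It assumes the statistics have first been exported to a plain text file with one value per line (PCAngsd itself writes a binary <code>.npy</code> file), and the file names, values and function name are made up for illustration.

```shell
# Genomic inflation factor: median(statistics) / median of chi-square with 1 df.
# 0.4549364 is qchisq(0.5, 1), the constant qqchi computes in R.
genomic_lambda() {
  sort -g "$1" | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # median: middle value for odd counts, mean of the two middle values for even counts
      m = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "%.2f\n", m / 0.4549364
    }'
}

# Toy input (made-up values), one test statistic per line.
tmp=$(mktemp -d)
printf '2.0\n0.1\n0.4549364\n' > "$tmp/stats_odd.txt"
printf '3.0\n0.2\n0.6\n0.4\n'  > "$tmp/stats_even.txt"

genomic_lambda "$tmp/stats_odd.txt"
genomic_lambda "$tmp/stats_even.txt"
```

A value close to 1 suggests that the bulk of the test statistics follow the expected null distribution, which is what the QQ plot checks visually.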

PCAngsdTutorial

2020-05-29T16:09:49Z

Albrecht: /* Admixture based on 1st two PC */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd/pcangsd.py
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>mkdir Demo/Data</code>

<code>mkdir Demo/Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci for which there are GLs plus one, because the count includes the header line:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo/Data/Demo1pop.info")
C <- as.matrix(read.table("Demo/Results/Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo/Data/Demo1pop.info",as.is=T)

q <- npyLoad("Demo/Results/Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (allowing for negative)

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

A file with the positions and sample information, and a beagle file are provided:

Download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.small.beagle.gz)

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.sample.Info)

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's try to use the PC to infer selection along the genome based on the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>

library(RcppCNPy) # Numpy library for R

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2020-05-29T16:08:27Z

Albrecht: /* Admixture based on 1st two PC */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd/pcangsd.py
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>mkdir Demo/Data</code>

<code>mkdir Demo/Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci for which there are GLs plus one, because the count includes the header line:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo/Data/Demo1pop.info")
C <- as.matrix(read.table("Demo/Results/Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo/Data/Demo1pop.info",as.is=T)

q <- npyLoad("Demo/Results/Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (allowing for negative)

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

A file with the positions and sample information, and a beagle file are provided:

Download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.small.beagle.gz)

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.sample.Info)

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's try to use the PC to infer selection along the genome based on the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>

library(RcppCNPy) # Numpy library for R

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2020-05-29T16:07:11Z

Albrecht: /* Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd/pcangsd.py
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>mkdir Demo/Data</code>

<code>mkdir Demo/Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci for which there are GLs plus one, because the count includes the header line:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo/Data/Demo1pop.info")
C <- as.matrix(read.table("Demo/Results/Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

A file with the positions and sample information, and a beagle file are provided:

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.small.beagle.gz)

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.sample.Info)

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
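ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=1:4,levels(ID$POP))
dev.off() # close the PDF device so the file is written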
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to infer selection along the genome:

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>

library(RcppCNPy) # Numpy library for R

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

## convert test statistic to p-value (upper tail avoids rounding to 0 for large statistics)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2020-05-29T16:06:03Z

Albrecht: /* Demo 1: Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd/pcangsd.py
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>mkdir Demo/Data</code>

<code>mkdir Demo/Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
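
The statement that the three GL values of each individual sum to one at every site can be checked directly. The following is a rough sketch on a made-up toy file (the file name, the values and the tolerance of 0.001 are assumptions chosen for illustration); it counts the GL triples whose sum deviates from one.

```shell
# Build a tiny, made-up beagle file: 2 individuals, 2 sites (tab separated)
tmpdir=$(mktemp -d)
{
  printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n'
  printf '1_100\t0\t2\t0.90\t0.09\t0.01\t0.33\t0.33\t0.34\n'
  printf '1_200\t1\t3\t0.10\t0.80\t0.10\t0.05\t0.15\t0.80\n'
} | gzip -c > "$tmpdir/toy.beagle.gz"

# For every site and individual, add up the three GL columns and count
# the triples whose sum is further than the tolerance away from 1
bad=$(gunzip -c "$tmpdir/toy.beagle.gz" | awk -F'\t' '
  NR > 1 {
    for (i = 4; i <= NF; i += 3) {
      s = $i + $(i + 1) + $(i + 2)
      if (s < 0.999 || s > 1.001) n++
    }
  }
  END { print n + 0 }')

echo "triples not summing to one: $bad"
```

A count of zero only confirms the normalization; as noted above, it does not turn the values into genotype probabilities.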

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor is needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor is needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

A file with the positions and sample information, and a beagle file are provided:

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.small.beagle.gz)

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.sample.Info)

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
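ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=1:4,levels(ID$POP))
dev.off() # close the PDF device so the file is written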
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to infer selection along the genome:

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>

library(RcppCNPy) # Numpy library for R

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

## convert test statistic to p-value (upper tail avoids rounding to 0 for large statistics)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2020-05-29T16:04:40Z

Albrecht: /* Download the input and population information files */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd/pcangsd.py
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>mkdir Demo/Data</code>

<code>mkdir Demo/Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $OUT_DIR</code>

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>
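
The plots further down color each individual by the matching row of the population file, which presumably requires one row per individual in the same order as the beagle file. The following sketch checks that the two counts agree; the toy files, their names and their contents are invented for illustration.

```shell
tmpdir=$(mktemp -d)

# Made-up population file: population label, then sample name, separated by a space
printf 'POP_A ind0\nPOP_A ind1\nPOP_B ind2\n' > "$tmpdir/toy.pop.info"

# Made-up beagle header for the same three individuals (tab separated)
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\tInd2\tInd2\tInd2\n' \
  | gzip -c > "$tmpdir/toy.beagle.gz"

# Rows in the population file
n_labels=$(( $(wc -l < "$tmpdir/toy.pop.info") ))

# Individuals in the beagle file = (number of header columns - 3) / 3
n_ind=$(gunzip -c "$tmpdir/toy.beagle.gz" | head -n 1 | awk -F'\t' '{ print (NF - 3) / 3 }')

if [ "$n_labels" -eq "$n_ind" ]; then
  status="match"
else
  status="mismatch"
fi
echo "$n_labels labels, $n_ind individuals: $status"
```

A mismatch would usually mean that the wrong population file is being paired with the input, so it is worth checking before plotting.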

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor is needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor is needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

A file with the positions and sample information, and a beagle file are provided:

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.small.beagle.gz)

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.sample.Info)

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
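ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=1:4,levels(ID$POP))
dev.off() # close the PDF device so the file is written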
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to infer selection along the genome:

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>

library(RcppCNPy) # Numpy library for R

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

## convert test statistic to p-value (upper tail avoids rounding to 0 for large statistics)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2020-05-29T16:03:48Z

Albrecht: /* Create directories */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd/pcangsd.py
</code>

Test the link

<code>
ls $PCANGSD
</code>
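
Shell variable names are case-sensitive, so the name used when testing the path has to match the name used when setting it. The following sketch illustrates this with the path from above; the helper variables <code>mixed_case</code> and <code>upper_case</code> are introduced only for this illustration.

```shell
# Make sure the mixed-case name is not set from an earlier session
unset PCAngsd

# Path as set above; adjust it to where pcangsd.py is installed
PCANGSD=~/Software/pcangsd/pcangsd.py

# PCAngsd was never assigned, so it expands to an empty string
mixed_case="${PCAngsd:-}"
upper_case="${PCANGSD:-}"

if [ -z "$mixed_case" ]; then
  echo 'PCAngsd is empty; use PCANGSD instead'
fi

# Report whether the script is actually present at that path
if [ -f "$upper_case" ]; then
  echo "found: $upper_case"
else
  echo "not found: $upper_case"
fi
```

An empty variable makes <code>ls</code> list the current directory instead of reporting an error, which is why a wrong name is easy to miss.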

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>mkdir Demo/Data</code>

<code>mkdir Demo/Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $OUT_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor is needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor is needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

A file with the positions and sample information, and a beagle file are provided:

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.small.beagle.gz)

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.sample.Info)

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
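ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=1:4,levels(ID$POP))
dev.off() # close the PDF device so the file is written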
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to infer selection along the genome:

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>

library(RcppCNPy) # Numpy library for R

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, so very large statistics are not rounded to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2020-05-29T16:02:52Z

Albrecht:

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd/pcangsd.py
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $OUT_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the header line is included in the count):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
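The column layout described above can also be checked programmatically. The following is a minimal Python sketch, not part of PCAngsd: the function name <code>check_beagle_normalization</code>, the tolerance and the example path are assumptions made for illustration. It counts sites and individuals and reports how many GL triplets do not sum to one.

```python
import gzip


def check_beagle_normalization(lines, tol=1e-3):
    """Return (n_sites, n_individuals, n_bad_triplets) for Beagle GL lines.

    `lines` is any iterable of text lines, header first (a list or an open file).
    A triplet is "bad" when its three genotype likelihoods do not sum to ~1.
    """
    lines = iter(lines)
    header = next(lines).split()
    # The first three columns are marker, allele1 and allele2;
    # every individual then takes three GL columns.
    n_ind = (len(header) - 3) // 3
    n_sites = 0
    n_bad = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        n_sites += 1
        gl = [float(x) for x in fields[3:]]
        for i in range(n_ind):
            if abs(sum(gl[3 * i:3 * i + 3]) - 1.0) > tol:
                n_bad += 1
    return n_sites, n_ind, n_bad


# Hypothetical usage on the tutorial file (the path is an assumption):
# with gzip.open("Data/Demo1input.gz", "rt") as handle:
#     print(check_beagle_normalization(handle))
```

The number of sites returned should be one less than the line count from <code>wc -l</code>, since the header is not counted as a site.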

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $OUT_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code>cut -f1 -d" " $OUT_DIR/Demo1pop.info > $OUT_DIR/poplabel</code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on the first two PCs ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (allowing for negative)
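Files ending in <code>.npy</code> are binary NumPy arrays, so <code>paste</code> may not give readable columns for them. As a hedged alternative, the sketch below pairs each sample label with its inbreeding estimate and sorts by the estimate. The function name <code>rank_by_inbreeding</code> is hypothetical, and it is an assumption that the <code>.inbreed.npy</code> file holds one value per individual in the same order as the population information file.

```python
def rank_by_inbreeding(labels, f_values):
    """Pair sample labels with inbreeding estimates, sorted from lowest to highest F."""
    if len(labels) != len(f_values):
        raise ValueError("need exactly one inbreeding estimate per sample label")
    pairs = zip(labels, (float(f) for f in f_values))
    return sorted(pairs, key=lambda pair: pair[1])


# Hypothetical usage (requires numpy; file names follow the tutorial):
# import numpy as np
# f = np.load("Demo1PCANGSD_4.inbreed.npy")
# with open("Demo1pop.info") as handle:
#     labels = [line.strip() for line in handle]
# for label, value in rank_by_inbreeding(labels, f):
#     print(label, value)
```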

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

A file with the positions and sample information, and a beagle file are provided:

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.small.beagle.gz)

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>
(/home/albrecht/oldPhDCourse/PCangsd/data/eu1000g.sample.Info)

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
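## POP must be a factor so it can be used for colours and legend labels
ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE)

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=1:4,levels(ID$POP))
dev.off() # close the pdf device so the file is written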
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to infer selection along the genome.

<code>python $PCANGSD -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>

library(RcppCNPy) # Numpy library for R

## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, so very large statistics are not rounded to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
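The <code>lambda</code> value printed by the QQ plot function is the median of the observed statistics divided by the median of a χ² distribution with 1 degree of freedom. A minimal Python sketch of the same calculation is given below. The function name <code>inflation_factor</code> is an assumption for illustration, and the input is assumed to be a plain sequence of non-negative statistics for one PC.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom, i.e. qchisq(0.5, 1) in R.
# A chi-square(1) variable is a squared standard normal, so its median is the
# squared 75% quantile of the standard normal distribution.
CHI2_1DF_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(stats):
    """Return median(stats) / median(chi-square with 1 df)."""
    return statistics.median(stats) / CHI2_1DF_MEDIAN
```

A value close to 1 suggests that the bulk of the test statistics follows the expected distribution; values well above 1 indicate inflation.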

PCAngsd

2020-05-29T15:59:32Z

Albrecht: /* Quick start */

PCAngsd is a program that estimates the covariance matrix from low depth next-generation sequencing (NGS) data in structured/heterogeneous populations. It uses principal component analysis (PCA) in an iterative procedure based on genotype likelihoods, and the results can be used for multiple population genetic analyses.

Since version 0.98, PCAngsd was re-written to be based on Cython for computational bottlenecks and parallelization and is now compatible with any newer Python version.

The method was published in 2018 and can be found here: [https://www.genetics.org/content/210/2/719]
[[File:Pcangsd_admix.gif|frame]]

[[File:Pcangsd_pca.png|thumb|400px|Simulated low depth NGS data of 3 populations]]

=Overview=
Framework for analyzing low depth next-generation sequencing (NGS) data in heterogeneous populations using principal component analysis (PCA). Population structure is inferred to detect the number of significant principal components, which are used to estimate individual allele frequencies using genotype dosages in an SVD model. The estimated individual allele frequencies are then used in a probabilistic framework to update the genotype dosages such that an updated set of individual allele frequencies can be estimated iteratively based on inferred population structure. A covariance matrix can be estimated using the updated prior information of the estimated individual allele frequencies.

The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on the same Bayesian principle. PCAngsd can perform the following analyses:
*Covariance matrix
*Genotype calling
*Admixture
*Inbreeding coefficients (both per-individual and per-site)
*HWE test
*Genome selection scan
*Kinship matrix

The older version, based on the Numba library (only working with Python 2.7) is still available in version 0.973 and can be found here [https://github.com/Rosemeis/pcangsd/releases/tag/0.973].

=Download and Installation=

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.
Installation has only been tested on Linux systems.

Get PCAngsd and build
<pre>
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd/
python setup.py build_ext --inplace
</pre>
Install dependencies:

The required set of Python packages is easily installed using the pip command and the requirements.txt file included in the pcangsd folder.

<code>pip install --user -r requirements.txt</code>

=Quick start=

<pre>

# Download the input beagle file with genotype likelihoods
wget popgen.dk/software/download/NGSadmix/data/input.gz

# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle input.gz -o test1 -threads 10

# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle input.gz -admix -o test2 -threads 10

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle input.gz -inbreed 2 -o test3 -threads 10

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle input.gz -selection -o test4 -threads 10

# perform PCA on covariance matrix in R
## open R
C <- as.matrix(read.table("test1.cov"))
e <- eigen(C)
plot(e$vectors[,1:2],xlab="PC1",ylab="PC2")
</pre>

==Detailed Examples and Tutorial==

Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]

=Input=
PCAngsd takes genotype likelihoods in [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] format as its primary input. Since version 0.9, binary PLINK files are also supported: genotypes are automatically converted into a genotype likelihood matrix, and the user can incorporate an error model.

[http://popgen.dk/angsd ANGSD] can easily be used to compute genotype likelihoods and output them in the required Beagle format.

<pre>
./angsd -GL 2 -out data -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

See [http://popgen.dk/angsd ANGSD] for more information on how to compute the genotype likelihoods and call SNPs.

=Output=

Since version 0.98, PCAngsd's output is only in binary Numpy format (.npy) except for the covariance matrix.

In order to read files in python:

<pre>
import numpy as np
S = np.load("output.selection.npy") # Reads results from selection scan
</pre>

R can also read Numpy matrices using the "RcppCNPy" library:
<pre>
library(RcppCNPy)
S <- npyLoad("output.selection.npy") # Reads results from selection scan
</pre>

=Using PCAngsd=

All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.

<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>

==Estimation of individual allele frequencies==
; -beagle [Beagle filename]
Input file of genotype likelihoods in Beagle format (.beagle.gz).
; -plink [Prefix for binary PLINK files]
Path to PLINK files using their prefix (.bed, .bim, .fam).
; -plink_error [float]
Incorporate error model for PLINK genotypes.
; -minMaf [float]
Minimum minor allele frequency threshold. (Default: 0.05)
; -iter [int]
Maximum number of iterations for estimation of individual allele frequencies (Default: 100).
; -tole [float]
Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).
; -maf_iter [int]
Maximum number of EM iterations for computing the population allele frequencies (Default: 100).
; -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).
; -o [prefix]
Set the prefix for all output files created by PCAngsd (Default: "pcangsd").
; -indf_save
Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy).
; -dosage_save
Choose to save estimated genotype dosages (Binary). Numpy format (.npy).
; -sites_save
Choose to save the marker IDs after performing filtering using population allele frequencies. Especially useful for selection scans and per-site inbreeding coefficients.
; -post_save
Choose to save the posterior genotype probabilities. Beagle format (.beagle).
; -threads [int]
Specify the number of thread(s) to use (Default: 1).

==Call genotypes==
Genotypes can be called from posterior genotype probabilities incorporating the individual allele frequencies as prior information.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. '''-inbreed [int]''' is required, since individual inbreeding coefficients must have been estimated prior to calling genotypes using that information.
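To illustrate what calling genotypes with a threshold can look like, here is a minimal Python sketch for a single individual at a single site. It is an illustration under stated assumptions and not PCAngsd's implementation: the function name <code>call_genotype</code>, the rule "call the most probable genotype if its posterior reaches the threshold", and the missing code <code>-9</code> are all assumptions.

```python
def call_genotype(posteriors, threshold, missing=-9):
    """Call a genotype (0, 1 or 2 copies of allele2) from three posterior probabilities.

    Returns the most probable genotype if its posterior is at least `threshold`,
    otherwise the `missing` code (assumed here to be -9).
    """
    if len(posteriors) != 3:
        raise ValueError("expected three posterior genotype probabilities")
    best = max(range(3), key=lambda g: posteriors[g])
    return best if posteriors[best] >= threshold else missing
```

For example, posteriors of (0.01, 0.04, 0.95) with a threshold of 0.9 give genotype 2 under this rule, while (0.40, 0.35, 0.25) would be set to missing.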

==Admixture==
Individual admixture proportions and population-specific allele frequencies can be estimated based on assuming K ancestral populations using an accelerated mini-batch NMF method.

; -admix
Toggles admixture estimations. Individual ancestry proportions are saved (Binary). Numpy format (.npy).
; -admix_alpha [float-list]
Specify alpha (sparseness regularization parameter). Can be specified as a sequence to try several alpha's in a single run (Default: 0).
; -admix_auto [float]
Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.
; -admix_seed [int-list]
Specify seed for random initializations of factor matrices in admixture estimations. Can be specified as a sequence to try several different seeds in a single run.
; -admix_K [int]
Not recommended. Manually specify the number of ancestral populations to use in admixture estimations (overrides number chosen from '''-e'''). Structure explained by individual allele frequencies may therefore not reflect the manually chosen K. It is recommended to adjust '''-e''' instead of '''-admix_K'''.
; -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 200)
; -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
; -admix_batch [int]
Specify the number of batches to use in NMF method. (Default: 10)
; -admix_save
Choose to save the population-specific allele frequencies (Binary). Numpy format (.npy).

==Inbreeding==
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods. However, -inbreed 1 is recommended for low depth cases.

; -inbreed 1
Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].

; -inbreed 2
A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].

; -inbreed 3
Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].

; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)

; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)

Per-site inbreeding coefficients incorporating population structure alongside likelihood ratio tests for HWE can be computed as follows:

; -inbreedSites

Use likelihood ratio tests (.lrt.sites.gz) generated from '''-inbreedSites''' to filter out variable sites using a given threshold for HWE test p-value:

; -hwe [LRT filename]

; -hwe_tole [float]
Tolerance value for HWE test. (Default: 1e-6)

==Selection==
A genome selection scan can be computed based on posterior expectations of the genotypes (genotype dosages):

; -selection
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome selection scan along all significant PCs. Outputs the selection statistics, which must be converted to p-values by the user. Each column reflects the selection statistics along a tested PC, and they are χ²-distributed with 1 degree of freedom.
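Since the statistics are χ²-distributed with 1 degree of freedom, the upper-tail p-value has a closed form: P(X > s) = erfc(sqrt(s/2)). The sketch below applies it column by column using only the Python standard library. The function names are hypothetical, and the input is assumed to be rows of non-negative statistics (one row per site, one column per tested PC), for example obtained from the <code>.selection.npy</code> file.

```python
import math


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # A chi-square(1) variable is a squared standard normal, so
    # P(X > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def pvalues_per_pc(stat_rows):
    """Convert rows of selection statistics (one column per PC) to p-values."""
    return [[chi2_1df_pvalue(s) for s in row] for row in stat_rows]
```

Using the upper tail directly avoids computing 1 minus a value very close to 1, which would round very small p-values to zero.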

==Relatedness==
Estimate the kinship matrix based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate]:

; -kinship
Automatically estimated if '''-inbreed 3''' has been selected.

Remove related individuals based on the kinship matrix of a previous run:
; -relate [Kinship filename]
; -relate_tole [float]
Threshold for kinship coefficients for removing individuals (Default: 0.0625).

=Citation=
Our methods for inferring population structure have been published in GENETICS:

[http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]

Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:

[https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

PCAngsd

2020-05-29T15:56:45Z

Albrecht: /* Output */

PCAngsd is a program that estimates the covariance matrix from low depth next-generation sequencing (NGS) data in structured/heterogeneous populations. It uses principal component analysis (PCA) in an iterative procedure based on genotype likelihoods, and the results can be used for multiple population genetic analyses.

Since version 0.98, PCAngsd was re-written to be based on Cython for computational bottlenecks and parallelization and is now compatible with any newer Python version.

The method was published in 2018 and can be found here: [https://www.genetics.org/content/210/2/719]
[[File:Pcangsd_admix.gif|frame]]

[[File:Pcangsd_pca.png|thumb|400px|Simulated low depth NGS data of 3 populations]]

=Overview=
Framework for analyzing low depth next-generation sequencing (NGS) data in heterogeneous populations using principal component analysis (PCA). Population structure is inferred to detect the number of significant principal components, which are used to estimate individual allele frequencies using genotype dosages in an SVD model. The estimated individual allele frequencies are then used in a probabilistic framework to update the genotype dosages such that an updated set of individual allele frequencies can be estimated iteratively based on inferred population structure. A covariance matrix can be estimated using the updated prior information of the estimated individual allele frequencies.

The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on the same Bayesian principle. PCAngsd can perform the following analyses:
*Covariance matrix
*Genotype calling
*Admixture
*Inbreeding coefficients (both per-individual and per-site)
*HWE test
*Genome selection scan
*Kinship matrix

The older version, based on the Numba library (only working with Python 2.7) is still available in version 0.973 and can be found here [https://github.com/Rosemeis/pcangsd/releases/tag/0.973].

=Download and Installation=

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.
Installation has only been tested on Linux systems.

Get PCAngsd and build
<pre>
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd/
python setup.py build_ext --inplace
</pre>
Install dependencies:

The required set of Python packages is easily installed using the pip command and the requirements.txt file included in the pcangsd folder.

<code>pip install --user -r requirements.txt</code>

=Quick start=

<pre>

# Download the input beagle file with genotype likelihoods
wget popgen.dk/software/download/NGSadmix/data/input.gz

# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle input.gz -o test1 -threads 10

# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle input.gz -admix -o test2 -threads 10

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle input.gz -inbreed 2 -o test3 -threads 10

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle input.gz -selection -o test4 -threads 10
</pre>

==Detailed Examples and Tutorial==

Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]

=Input=
PCAngsd takes genotype likelihoods in [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] format as its primary input. Since version 0.9, binary PLINK files are also supported: genotypes are automatically converted into a genotype likelihood matrix, and the user can incorporate an error model.

[http://popgen.dk/angsd ANGSD] can easily be used to compute genotype likelihoods and output them in the required Beagle format.

<pre>
./angsd -GL 2 -out data -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

See [http://popgen.dk/angsd ANGSD] for more information on how to compute the genotype likelihoods and call SNPs.

=Output=

Since version 0.98, PCAngsd's output is only in binary Numpy format (.npy) except for the covariance matrix.

In order to read files in python:

<pre>
import numpy as np
S = np.load("output.selection.npy") # Reads results from selection scan
</pre>

R can also read Numpy matrices using the "RcppCNPy" library:
<pre>
library(RcppCNPy)
S <- npyLoad("output.selection.npy") # Reads results from selection scan
</pre>

=Using PCAngsd=

All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.

<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>

==Estimation of individual allele frequencies==
; -beagle [Beagle filename]
Input file of genotype likelihoods in Beagle format (.beagle.gz).
; -plink [Prefix for binary PLINK files]
Path to PLINK files using their prefix (.bed, .bim, .fam).
; -plink_error [float]
Incorporate error model for PLINK genotypes.
; -minMaf [float]
Minimum minor allele frequency threshold. (Default: 0.05)
; -iter [int]
Maximum number of iterations for estimation of individual allele frequencies (Default: 100).
; -tole [float]
Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).
; -maf_iter [int]
Maximum number of EM iterations for computing the population allele frequencies (Default: 100).
; -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).
; -o [prefix]
Set the prefix for all output files created by PCAngsd (Default: "pcangsd").
; -indf_save
Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy).
; -dosage_save
Choose to save estimated genotype dosages (Binary). Numpy format (.npy).
; -sites_save
Choose to save the marker IDs after performing filtering using population allele frequencies. Especially useful for selection scans and per-site inbreeding coefficients.
; -post_save
Choose to save the posterior genotype probabilities. Beagle format (.beagle).
; -threads [int]
Specify the number of thread(s) to use (Default: 1).

==Call genotypes==
Genotypes can be called from posterior genotype probabilities incorporating the individual allele frequencies as prior information.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. '''-inbreed [int]''' is required, since individual inbreeding coefficients must have been estimated prior to calling genotypes using that information.

==Admixture==
Individual admixture proportions and population-specific allele frequencies can be estimated based on assuming K ancestral populations using an accelerated mini-batch NMF method.

; -admix
Toggles admixture estimations. Individual ancestry proportions are saved (Binary). Numpy format (.npy).
; -admix_alpha [float-list]
Specify alpha (sparseness regularization parameter). Can be specified as a sequence to try several alpha's in a single run (Default: 0).
; -admix_auto [float]
Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.
; -admix_seed [int-list]
Specify seed for random initializations of factor matrices in admixture estimations. Can be specified as a sequence to try several different seeds in a single run.
; -admix_K [int]
Not recommended. Manually specify the number of ancestral populations to use in admixture estimations (overrides number chosen from '''-e'''). Structure explained by individual allele frequencies may therefore not reflect the manually chosen K. It is recommended to adjust '''-e''' instead of '''-admix_K'''.
; -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 200)
; -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
; -admix_batch [int]
Specify the number of batches to use in NMF method. (Default: 10)
; -admix_save
Choose to save the population-specific allele frequencies (Binary). Numpy format (.npy).

==Inbreeding==
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods. However, -inbreed 1 is recommended for low depth cases.

; -inbreed 1
Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].

; -inbreed 2
A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].

; -inbreed 3
Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].

; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)

; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)

Per-site inbreeding coefficients incorporating population structure alongside likelihood ratio tests for HWE can be computed as follows:

; -inbreedSites

Use likelihood ratio tests (.lrt.sites.gz) generated from '''-inbreedSites''' to filter out variable sites using a given threshold for HWE test p-value:

; -hwe [LRT filename]

; -hwe_tole [float]
Tolerance value for HWE test. (Default: 1e-6)

==Selection==
A genome selection scan can be computed based on posterior expectations of the genotypes (genotype dosages):

; -selection
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome selection scan along all significant PCs. Outputs the selection statistics, which must be converted to p-values by the user. Each column reflects the selection statistics along a tested PC, and they are χ²-distributed with 1 degree of freedom.

==Relatedness==
Estimate the kinship matrix using a method based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate]:

; -kinship
Automatically estimated if '''-inbreed 3''' has been selected.

Remove related individuals based on the kinship matrix of a previous run:
; -relate [Kinship filename]
; -relate_tole [float]
Threshold for kinship coefficients for removing individuals (Default: 0.0625).

=Citation=
Our methods for inferring population structure have been published in GENETICS:

[http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]

Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:

[https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

PCAngsd

2020-05-29T15:51:56Z

Albrecht: /* Download and Installation */

PCAngsd is a program that estimates the covariance matrix for low depth next-generation sequencing (NGS) data in structured/heterogeneous populations using principal component analysis (PCA) to perform multiple population genetic analyses using an iterative procedure based on genotype likelihoods.

Since version 0.98, PCAngsd was re-written to be based on Cython for computational bottlenecks and parallelization and is now compatible with any newer Python version.

The method was published in 2018 and can be found here: [https://www.genetics.org/content/210/2/719]
[[File:Pcangsd_admix.gif|frame]]

[[File:Pcangsd_pca.png|thumb|400px|Simulated low depth NGS data of 3 populations]]

=Overview=
Framework for analyzing low depth next-generation sequencing (NGS) data in heterogeneous populations using principal component analysis (PCA). Population structure is inferred to detect the number of significant principal components, which is used to estimate individual allele frequencies using genotype dosages in an SVD model. The estimated individual allele frequencies are then used in a probabilistic framework to update the genotype dosages such that an updated set of individual allele frequencies can be estimated iteratively based on inferred population structure. A covariance matrix can be estimated using the updated prior information of the estimated individual allele frequencies.
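
The dosage update described above can be illustrated with a minimal sketch. It assumes that the genotype prior is given by Hardy-Weinberg proportions at the individual allele frequency and that the dosage is the posterior expected genotype. The function name <code>genotype_dosage</code> and its arguments are hypothetical and are not part of the PCAngsd code base.

```python
def genotype_dosage(likelihoods, pi):
    """Posterior expected genotype for one individual at one site.

    likelihoods: (L0, L1, L2), genotype likelihoods for carrying
                 0, 1 or 2 copies of the minor allele.
    pi:          individual allele frequency, used here as a
                 Hardy-Weinberg prior (an assumption of this sketch).
    """
    prior = ((1.0 - pi) ** 2, 2.0 * pi * (1.0 - pi), pi ** 2)
    joint = [lik * pr for lik, pr in zip(likelihoods, prior)]
    total = sum(joint)
    posterior = [j / total for j in joint]
    # Expected number of minor alleles under the posterior
    return posterior[1] + 2.0 * posterior[2]


# Uninformative likelihoods fall back on the prior mean, 2 * pi
print(genotype_dosage((1 / 3, 1 / 3, 1 / 3), 0.2))
```

With uninformative likelihoods the dosage is driven entirely by the individual allele frequency, which is why refining those frequencies matters most for low depth data.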

The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on the same Bayesian principle. PCAngsd can perform the following analyses:
*Covariance matrix
*Genotype calling
*Admixture
*Inbreeding coefficients (both per-individual and per-site)
*HWE test
*Genome selection scan
*Kinship matrix

The older version, based on the Numba library (only working with Python 2.7) is still available in version 0.973 and can be found here [https://github.com/Rosemeis/pcangsd/releases/tag/0.973].

=Download and Installation=

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.
Installation has only been tested on Linux systems.

Get PCAngsd and build
<pre>
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd/
python setup.py build_ext --inplace
</pre>
Install dependencies:

The required set of Python packages is easily installed using the pip command and the requirements.txt file included in the pcangsd folder.

<code>pip install --user -r requirements.txt</code>

=Quick start=

<pre>

# Download the input beagle file with genotype likelihoods
wget popgen.dk/software/download/NGSadmix/data/input.gz

# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle input.gz -o test1 -threads 10

# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle input.gz -admix -o test2 -threads 10

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle input.gz -inbreed 2 -o test3 -threads 10

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle input.gz -selection -o test4 -threads 10
</pre>

==Detailed Examples and Tutorial==

Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]

=Input=
PCAngsd takes genotype likelihoods in [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] format as input. Since version 0.9, binary PLINK files are also accepted: the genotypes are automatically converted into a genotype likelihood matrix, where the user can incorporate an error model.

[http://popgen.dk/angsd ANGSD] can easily be used to compute genotype likelihoods and output them in the required Beagle format.

<pre>
./angsd -GL 2 -out data -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

See [http://popgen.dk/angsd ANGSD] for more information on how to compute the genotype likelihoods and call SNPs.

=Output=

Since version 0.98, PCAngsd's output is only in binary Numpy format (.npy) except for the covariance matrix.

In order to read files in python:

<pre>
import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix
S = np.load("output.selection.npy") # Reads results from selection scan
</pre>

R can also read Numpy matrices using the "RcppCNPy" library:
<pre>
library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
S <- npyLoad("output.selection.npy") # Reads results from selection scan
</pre>

=Using PCAngsd=

All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.

<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>

==Estimation of individual allele frequencies==
; -beagle [Beagle filename]
Input file of genotype likelihoods in Beagle format (.beagle.gz).
; -plink [Prefix for binary PLINK files]
Path to PLINK files using their prefix (.bed, .bim, .fam).
; -plink_error [float]
Incorporate error model for PLINK genotypes.
; -minMaf [float]
Minimum minor allele frequency threshold. (Default: 0.05)
; -iter [int]
Maximum number of iterations for estimation of individual allele frequencies (Default: 100).
; -tole [float]
Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).
; -maf_iter [int]
Maximum number of EM iterations for computing the population allele frequencies (Default: 100).
; -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).
; -o [prefix]
Set the prefix for all output files created by PCAngsd (Default: "pcangsd").
; -indf_save
Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy).
; -dosage_save
Choose to save estimated genotype dosages (Binary). Numpy format (.npy).
; -sites_save
Choose to save the marker IDs after performing filtering using population allele frequencies. Especially useful for selection scans and per-site inbreeding coefficients.
; -post_save
Choose to save the posterior genotype probabilities. Beagle format (.beagle).
; -threads [int]
Specify the number of thread(s) to use (Default: 1).

==Call genotypes==
Genotypes can be called from posterior genotype probabilities incorporating the individual allele frequencies as prior information.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. '''-inbreed [int]''' is required, since individual inbreeding coefficients must have been estimated prior to calling genotypes using that information.
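
As a rough illustration of threshold-based calling, the sketch below computes posterior genotype probabilities from genotype likelihoods and a prior based on the individual allele frequency, optionally adjusted by an inbreeding coefficient. The function names, the standard inbreeding-adjusted prior and the missing-genotype code <code>-9</code> are assumptions made for this example; they are not taken from the PCAngsd source.

```python
def genotype_prior(pi, f=0.0):
    """Prior for 0, 1, 2 copies of the minor allele at individual
    allele frequency pi and inbreeding coefficient f.
    f = 0 gives Hardy-Weinberg proportions."""
    hom_major = (1.0 - pi) ** 2 + pi * (1.0 - pi) * f
    het = 2.0 * pi * (1.0 - pi) * (1.0 - f)
    hom_minor = pi ** 2 + pi * (1.0 - pi) * f
    return (hom_major, het, hom_minor)


def call_genotype(likelihoods, pi, threshold, f=0.0, missing=-9):
    """Return the most probable genotype if its posterior probability
    reaches the threshold, otherwise the (assumed) missing code."""
    prior = genotype_prior(pi, f)
    joint = [lik * pr for lik, pr in zip(likelihoods, prior)]
    total = sum(joint)
    if total == 0.0:
        return missing
    posterior = [j / total for j in joint]
    best = max(range(3), key=lambda g: posterior[g])
    return best if posterior[best] >= threshold else missing


likes = (0.9, 0.09, 0.01)
print(call_genotype(likes, pi=0.5, threshold=0.8))  # confident enough to call
print(call_genotype(likes, pi=0.5, threshold=0.9))  # below threshold, set to missing
```

A higher threshold trades more missing genotypes for fewer wrong calls, which is the main consideration when choosing the value passed to '''-geno''' or '''-genoInbreed'''.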

==Admixture==
Individual admixture proportions and population-specific allele frequencies can be estimated based on assuming K ancestral populations using an accelerated mini-batch NMF method.

; -admix
Toggles admixture estimations. Individual ancestry proportions are saved (Binary). Numpy format (.npy).
; -admix_alpha [float-list]
Specify alpha (sparseness regularization parameter). Can be specified as a sequence to try several alphas in a single run (Default: 0).
; -admix_auto [float]
Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.
; -admix_seed [int-list]
Specify seed for random initializations of factor matrices in admixture estimations. Can be specified as a sequence to try several different seeds in a single run.
; -admix_K [int]
Not recommended. Manually specify the number of ancestral populations to use in admixture estimations (overrides number chosen from '''-e'''). Structure explained by individual allele frequencies may therefore not reflect the manually chosen K. It is recommended to adjust '''-e''' instead of '''-admix_K'''.
; -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 200)
; -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
; -admix_batch [int]
Specify the number of batches to use in NMF method. (Default: 10)
; -admix_save
Choose to save the population-specific allele frequencies (Binary). Numpy format (.npy).

==Inbreeding==
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods. However, -inbreed 1 is recommended for low depth cases.

; -inbreed 1
Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].

; -inbreed 2
A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].

; -inbreed 3
Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].

; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)

; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)

Per-site inbreeding coefficients incorporating population structure alongside likelihood ratio tests for HWE can be computed as follows:

; -inbreedSites
Use likelihood ratio tests (.lrt.sites.gz) generated from '''-inbreedSites''' to filter out variable sites using a given threshold for HWE test p-value:

; -hwe [LRT filename]

; -hwe_tole [float]
Tolerance value for HWE test. (Default: 1e-6)

==Selection==
A genome selection scan can be computed based on posterior expectations of the genotypes (genotype dosages):

; -selection
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome selection scan along all significant PCs. Outputs the selection statistics, which the user must convert to p-values. Each column reflects the selection statistics along a tested PC, and the statistics are χ²-distributed with 1 degree of freedom.
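
Since the conversion to p-values is left to the user, here is a minimal sketch of one way to do it using only the Python standard library. It relies on the identity that, for a χ² distribution with 1 degree of freedom, the upper tail probability equals erfc(sqrt(x / 2)). The helper names are hypothetical, and the input is assumed to be a nested list (or array rows) with one row per site and one column per tested PC.

```python
import math


def chi2_1df_pvalue(stat):
    """Upper tail probability of a chi-square statistic with 1 degree
    of freedom: P(X > stat) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))


def selection_pvalues(stats):
    """Convert a table of selection statistics (rows = sites,
    columns = tested PCs) into p-values."""
    return [[chi2_1df_pvalue(s) for s in row] for row in stats]


# Two sites tested along two PCs (made-up statistics for illustration)
example = [[0.5, 3.84], [25.0, 1.2]]
for row in selection_pvalues(example):
    print(row)
```

If SciPy is installed, <code>scipy.stats.chi2.sf(S, 1)</code> applied to the matrix loaded with <code>np.load</code> gives the same p-values in a single vectorized call.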

==Relatedness==
Estimate the kinship matrix using a method based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate]:

; -kinship
Automatically estimated if '''-inbreed 3''' has been selected.

Remove related individuals based on the kinship matrix of a previous run:
; -relate [Kinship filename]
; -relate_tole [float]
Threshold for kinship coefficients for removing individuals (Default: 0.0625).
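
To make the role of the threshold concrete, the sketch below prunes individuals from a kinship matrix with a simple greedy rule: repeatedly drop the individual that has the most relatives above the threshold until no related pair remains. This greedy rule is an assumption chosen for illustration and is not necessarily the procedure PCAngsd uses; the function name <code>prune_related</code> is hypothetical.

```python
def prune_related(kinship, threshold=0.0625):
    """Return the sorted indices of individuals to keep so that no
    remaining pair has a kinship coefficient above the threshold.

    kinship: square, symmetric matrix given as a list of lists.
    """
    keep = set(range(len(kinship)))
    while True:
        # Count, for each kept individual, its relatives still kept
        counts = {
            i: sum(1 for j in keep if j != i and kinship[i][j] > threshold)
            for i in sorted(keep)
        }
        worst = max(counts, key=counts.get, default=None)
        if worst is None or counts[worst] == 0:
            break
        keep.remove(worst)
    return sorted(keep)


# Individual 1 is related to both 0 and 2, so removing it resolves both pairs
K = [
    [0.50, 0.25, 0.00],
    [0.25, 0.50, 0.10],
    [0.00, 0.10, 0.50],
]
print(prune_related(K))
```

The default of 0.0625 corresponds to the expected kinship coefficient of first cousins, so pairs more closely related than that are broken up.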

=Citation=
Our methods for inferring population structure have been published in GENETICS:

[http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]

Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:

[https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

PCAngsd

2020-05-29T15:50:32Z

Albrecht: /* Overview */

PCAngsd is a program that estimates the covariance matrix for low depth next-generation sequencing (NGS) data in structured/heterogeneous populations using principal component analysis (PCA) to perform multiple population genetic analyses using an iterative procedure based on genotype likelihoods.

Since version 0.98, PCAngsd was re-written to be based on Cython for computational bottlenecks and parallelization and is now compatible with any newer Python version.

The method was published in 2018 and can be found here: [https://www.genetics.org/content/210/2/719]
[[File:Pcangsd_admix.gif|frame]]

[[File:Pcangsd_pca.png|thumb|400px|Simulated low depth NGS data of 3 populations]]

=Overview=
Framework for analyzing low depth next-generation sequencing (NGS) data in heterogeneous populations using principal component analysis (PCA). Population structure is inferred to detect the number of significant principal components, which is used to estimate individual allele frequencies using genotype dosages in an SVD model. The estimated individual allele frequencies are then used in a probabilistic framework to update the genotype dosages such that an updated set of individual allele frequencies can be estimated iteratively based on inferred population structure. A covariance matrix can be estimated using the updated prior information of the estimated individual allele frequencies.

The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on the same Bayesian principle. PCAngsd can perform the following analyses:
*Covariance matrix
*Genotype calling
*Admixture
*Inbreeding coefficients (both per-individual and per-site)
*HWE test
*Genome selection scan
*Kinship matrix

The older version, based on the Numba library (only working with Python 2.7) is still available in version 0.973 and can be found here [https://github.com/Rosemeis/pcangsd/releases/tag/0.973].

=Download and Installation=

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.
Installation has only been tested on Linux systems.

It is assumed that OpenMP is installed [https://www.openmp.org/].

1. Log in to your server using ssh in your terminal window.

2. Create the directory where you will install your software and enter it, such as:

<pre>
mkdir ~/Software
cd ~/Software
</pre>

3. Download the source code:

<pre>
git clone https://github.com/Rosemeis/pcangsd.git
</pre>

4. Configure, Compile and Install:

<pre>
cd pcangsd/
python setup.py build_ext --inplace
</pre>

5. Install dependencies:

The required set of Python packages is easily installed using the pip command and the requirements.txt file included in the pcangsd folder.

<code>pip install --user -r requirements.txt</code>

=Quick start=

<pre>

# Download the input beagle file with genotype likelihoods
wget popgen.dk/software/download/NGSadmix/data/input.gz

# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle input.gz -o test1 -threads 10

# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle input.gz -admix -o test2 -threads 10

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle input.gz -inbreed 2 -o test3 -threads 10

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle input.gz -selection -o test4 -threads 10
</pre>

==Detailed Examples and Tutorial==

Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]

=Input=
PCAngsd takes genotype likelihoods in [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] format as input. Since version 0.9, binary PLINK files are also accepted: the genotypes are automatically converted into a genotype likelihood matrix, where the user can incorporate an error model.

[http://popgen.dk/angsd ANGSD] can easily be used to compute genotype likelihoods and output them in the required Beagle format.

<pre>
./angsd -GL 2 -out data -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

See [http://popgen.dk/angsd ANGSD] for more information on how to compute the genotype likelihoods and call SNPs.

=Output=

Since version 0.98, PCAngsd's output is only in binary Numpy format (.npy) except for the covariance matrix.

In order to read files in python:

<pre>
import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix
S = np.load("output.selection.npy") # Reads results from selection scan
</pre>

R can also read Numpy matrices using the "RcppCNPy" library:
<pre>
library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
S <- npyLoad("output.selection.npy") # Reads results from selection scan
</pre>

=Using PCAngsd=

All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.

<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>

==Estimation of individual allele frequencies==
; -beagle [Beagle filename]
Input file of genotype likelihoods in Beagle format (.beagle.gz).
; -plink [Prefix for binary PLINK files]
Path to PLINK files using their prefix (.bed, .bim, .fam).
; -plink_error [float]
Incorporate error model for PLINK genotypes.
; -minMaf [float]
Minimum minor allele frequency threshold. (Default: 0.05)
; -iter [int]
Maximum number of iterations for estimation of individual allele frequencies (Default: 100).
; -tole [float]
Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).
; -maf_iter [int]
Maximum number of EM iterations for computing the population allele frequencies (Default: 100).
; -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).
; -o [prefix]
Set the prefix for all output files created by PCAngsd (Default: "pcangsd").
; -indf_save
Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy).
; -dosage_save
Choose to save estimated genotype dosages (Binary). Numpy format (.npy).
; -sites_save
Choose to save the marker IDs after performing filtering using population allele frequencies. Especially useful for selection scans and per-site inbreeding coefficients.
; -post_save
Choose to save the posterior genotype probabilities. Beagle format (.beagle).
; -threads [int]
Specify the number of thread(s) to use (Default: 1).

==Call genotypes==
Genotypes can be called from posterior genotype probabilities incorporating the individual allele frequencies as prior information.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. '''-inbreed [int]''' is required, since individual inbreeding coefficients must have been estimated prior to calling genotypes using that information.

==Admixture==
Individual admixture proportions and population-specific allele frequencies can be estimated based on assuming K ancestral populations using an accelerated mini-batch NMF method.

; -admix
Toggles admixture estimations. Individual ancestry proportions are saved (Binary). Numpy format (.npy).
; -admix_alpha [float-list]
Specify alpha (sparseness regularization parameter). Can be specified as a sequence to try several alphas in a single run (Default: 0).
; -admix_auto [float]
Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.
; -admix_seed [int-list]
Specify seed for random initializations of factor matrices in admixture estimations. Can be specified as a sequence to try several different seeds in a single run.
; -admix_K [int]
Not recommended. Manually specify the number of ancestral populations to use in admixture estimations (overrides number chosen from '''-e'''). Structure explained by individual allele frequencies may therefore not reflect the manually chosen K. It is recommended to adjust '''-e''' instead of '''-admix_K'''.
; -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 200)
; -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
; -admix_batch [int]
Specify the number of batches to use in NMF method. (Default: 10)
; -admix_save
Choose to save the population-specific allele frequencies (Binary). Numpy format (.npy).

==Inbreeding==
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods. However, -inbreed 1 is recommended for low depth cases.

; -inbreed 1
Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].

; -inbreed 2
A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].

; -inbreed 3
Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].

; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)

; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)

Per-site inbreeding coefficients incorporating population structure alongside likelihood ratio tests for HWE can be computed as follows:

; -inbreedSites
Use likelihood ratio tests (.lrt.sites.gz) generated from '''-inbreedSites''' to filter out variable sites using a given threshold for HWE test p-value:

; -hwe [LRT filename]

; -hwe_tole [float]
Tolerance value for HWE test. (Default: 1e-6)

==Selection==
A genome selection scan can be computed based on posterior expectations of the genotypes (genotype dosages):

; -selection
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome selection scan along all significant PCs. Outputs the selection statistics, which the user must convert to p-values. Each column reflects the selection statistics along a tested PC, and the statistics are χ²-distributed with 1 degree of freedom.

==Relatedness==
Estimate the kinship matrix using a method based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate]:

; -kinship
Automatically estimated if '''-inbreed 3''' has been selected.

Remove related individuals based on the kinship matrix of a previous run:
; -relate [Kinship filename]
; -relate_tole [float]
Threshold for kinship coefficients for removing individuals (Default: 0.0625).

=Citation=
Our methods for inferring population structure have been published in GENETICS:

[http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]

Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:

[https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

PCAngsd

2020-05-29T15:49:16Z

Albrecht:

PCAngsd is a program that estimates the covariance matrix for low depth next-generation sequencing (NGS) data in structured/heterogeneous populations using principal component analysis (PCA) to perform multiple population genetic analyses using an iterative procedure based on genotype likelihoods.

Since version 0.98, PCAngsd was re-written to be based on Cython for computational bottlenecks and parallelization and is now compatible with any newer Python version.

The method was published in 2018 and can be found here: [https://www.genetics.org/content/210/2/719]
[[File:Pcangsd_admix.gif|frame]]

[[File:Pcangsd_pca.png|thumb|400px|Simulated low depth NGS data of 3 populations]]

=Overview=

Based on population structure inference, PCAngsd is able to detect the number of significant principal components, which is then used to estimate individual allele frequencies using genotype dosages in an SVD model. These individual allele frequencies can be used in various population genetic methods for heterogeneous populations, such that PCAngsd can perform PCA (estimate covariance matrix), call genotypes, estimate individual admixture proportions, estimate inbreeding coefficients (per-individual and per-site) and perform a genome selection scan using principal components.

The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on the same Bayesian principle. PCAngsd can perform the following analyses:
*Covariance matrix
*Genotype calling
*Admixture
*Inbreeding coefficients (both per-individual and per-site)
*HWE test
*Genome selection scan
*Kinship matrix

The older version, based on the Numba library (only working with Python 2.7) is still available in version 0.973 and can be found here [https://github.com/Rosemeis/pcangsd/releases/tag/0.973].

=Download and Installation=

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.
Installation has only been tested on Linux systems.

It is assumed that OpenMP is installed [https://www.openmp.org/].

1. Log in to your server using ssh in your terminal window.

2. Create the directory where you will install your software and enter it, such as:

<pre>
mkdir ~/Software
cd ~/Software
</pre>

3. Download the source code:

<pre>
git clone https://github.com/Rosemeis/pcangsd.git
</pre>

4. Configure, Compile and Install:

<pre>
cd pcangsd/
python setup.py build_ext --inplace
</pre>

5. Install dependencies:

The required set of Python packages is easily installed using the pip command and the requirements.txt file included in the pcangsd folder.

<code>pip install --user -r requirements.txt</code>

=Quick start=

<pre>

# Download the input beagle file with genotype likelihoods
wget popgen.dk/software/download/NGSadmix/data/input.gz

# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle input.gz -o test1 -threads 10

# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle input.gz -admix -o test2 -threads 10

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle input.gz -inbreed 2 -o test3 -threads 10

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle input.gz -selection -o test4 -threads 10
</pre>

==Detailed Examples and Tutorial==

Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]

=Input=
PCAngsd takes genotype likelihoods in [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] format as input. Since version 0.9, binary PLINK files are also accepted: the genotypes are automatically converted into a genotype likelihood matrix, where the user can incorporate an error model.

[http://popgen.dk/angsd ANGSD] can easily be used to compute genotype likelihoods and output them in the required Beagle format.

<pre>
./angsd -GL 2 -out data -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

See [http://popgen.dk/angsd ANGSD] for more information on how to compute the genotype likelihoods and call SNPs.

=Output=

Since version 0.98, PCAngsd's output is only in binary Numpy format (.npy) except for the covariance matrix.

In order to read files in python:

<pre>
import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix
S = np.load("output.selection.npy") # Reads results from selection scan
</pre>

R can also read Numpy matrices using the "RcppCNPy" library:
<pre>
library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
S <- npyLoad("output.selection.npy") # Reads results from selection scan
</pre>

=Using PCAngsd=

All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.

<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>

==Estimation of individual allele frequencies==
; -beagle [Beagle filename]
Input file of genotype likelihoods in Beagle format (.beagle.gz).
; -plink [Prefix for binary PLINK files]
Path to PLINK files using their prefix (.bed, .bim, .fam).
; -plink_error [float]
Incorporate error model for PLINK genotypes.
; -minMaf [float]
Minimum minor allele frequency threshold. (Default: 0.05)
; -iter [int]
Maximum number of iterations for estimation of individual allele frequencies (Default: 100).
; -tole [float]
Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).
; -maf_iter [int]
Maximum number of EM iterations for computing the population allele frequencies (Default: 100).
; -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).
; -o [prefix]
Set the prefix for all output files created by PCAngsd (Default: "pcangsd").
; -indf_save
Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy).
; -dosage_save
Choose to save estimated genotype dosages (Binary). Numpy format (.npy).
; -sites_save
Choose to save the marker IDs after performing filtering using population allele frequencies. Especially useful for selection scans and per-site inbreeding coefficients.
; -post_save
Choose to save the posterior genotype probabilities. Beagle format (.beagle).
; -threads [int]
Specify the number of thread(s) to use (Default: 1).

==Call genotypes==
Genotypes can be called from posterior genotype probabilities incorporating the individual allele frequencies as prior information.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. '''-inbreed [int]''' is required, since individual inbreeding coefficients must have been estimated prior to calling genotypes using that information.

==Admixture==
Individual admixture proportions and population-specific allele frequencies can be estimated based on assuming K ancestral populations using an accelerated mini-batch NMF method.

; -admix
Toggles admixture estimations. Individual ancestry proportions are saved (Binary). Numpy format (.npy).
; -admix_alpha [float-list]
Specify alpha (sparseness regularization parameter). Can be specified as a sequence to try several alphas in a single run (Default: 0).
; -admix_auto [float]
Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.
; -admix_seed [int-list]
Specify seed for random initializations of factor matrices in admixture estimations. Can be specified as a sequence to try several different seeds in a single run.
; -admix_K [int]
Not recommended. Manually specify the number of ancestral populations to use in admixture estimations (overrides number chosen from '''-e'''). Structure explained by individual allele frequencies may therefore not reflect the manually chosen K. It is recommended to adjust '''-e''' instead of '''-admix_K'''.
; -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 200)
; -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
; -admix_batch [int]
Specify the number of batches to use in NMF method. (Default: 10)
; -admix_save
Choose to save the population-specific allele frequencies (Binary). Numpy format (.npy).

==Inbreeding==
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods. However, -inbreed 1 is recommended for low depth cases.

; -inbreed 1
Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].

; -inbreed 2
A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].

; -inbreed 3
Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].

; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)

; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)

Per-site inbreeding coefficients incorporating population structure, alongside likelihood ratio tests for HWE, can be computed as follows:

; -inbreedSites

Use the likelihood ratio tests (.lrt.sites.gz) generated by '''-inbreedSites''' to filter out variable sites using a given threshold for the HWE test p-value:

; -hwe [LRT filename]

; -hwe_tole [float]
Tolerance value for HWE test. (Default: 1e-6)

==Selection==
A genome selection scan can be computed based on posterior expectations of the genotypes (genotype dosages):

; -selection
Uses an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA] to perform a genome-wide selection scan along all significant PCs. The output contains the selection statistics, which the user must convert to p-values. Each column reflects the selection statistics along a tested PC, and they are χ²-distributed with 1 degree of freedom.
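Since the selection statistics are χ²-distributed with 1 degree of freedom, the conversion to p-values can be sketched with the Python standard library alone. The function names and the example values below are made up for illustration; loading the actual output file is left out.

```python
import math


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def selection_pvalues(rows):
    """Convert a table of statistics (rows = sites, columns = PCs) to p-values."""
    return [[chi2_1df_pvalue(s) for s in row] for row in rows]


if __name__ == "__main__":
    stats = [[0.5, 3.84], [12.0, 0.1]]  # made-up statistics for two sites and two PCs
    for row in selection_pvalues(stats):
        print(["%.3g" % p for p in row])
```

If SciPy is available, scipy.stats.chi2.sf(stat, 1) gives the same upper-tail probability.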

==Relatedness==
Estimate the kinship matrix using a method based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate]:

; -kinship
Automatically estimated if '''-inbreed 3''' has been selected.

Remove related individuals based on the kinship matrix of a previous run:
; -relate [Kinship filename]
; -relate_tole [float]
Threshold for kinship coefficients for removing individuals (Default: 0.0625).
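To make the effect of a kinship threshold concrete, here is a minimal sketch of one possible pruning strategy: repeatedly drop the individual with the most relatives above the threshold until no related pair remains. This greedy rule is an assumption chosen for illustration and is not necessarily how PCAngsd selects individuals; the function name prune_related is hypothetical.

```python
def prune_related(kinship, threshold=0.0625):
    """Return sorted indices of individuals kept after greedy pruning.

    kinship: square, symmetric list of lists of pairwise kinship coefficients.
    NOTE: illustrative greedy rule only; not taken from the PCAngsd source.
    """
    keep = set(range(len(kinship)))
    while True:
        # For each kept individual, count kept relatives above the threshold.
        counts = {
            i: sum(1 for j in keep if j != i and kinship[i][j] > threshold)
            for i in keep
        }
        if not counts or max(counts.values()) == 0:
            return sorted(keep)
        # Ties are broken in favour of dropping the lowest index.
        worst = max(sorted(counts), key=counts.get)
        keep.remove(worst)


if __name__ == "__main__":
    k = [
        [0.50, 0.25, 0.00],
        [0.25, 0.50, 0.01],
        [0.00, 0.01, 0.50],
    ]
    print(prune_related(k))  # individuals 0 and 1 are related, so one of them is dropped
```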

=Citation=
Our methods for inferring population structure have been published in GENETICS:

[http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]

Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:

[https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

EMU

2020-04-27T09:48:57Z

Albrecht: /* Run example */

This page contains information about EMU (EM-PCA for Ultra-low Coverage Sequencing Data). EMU infers population structure in the presence of missingness and works for haploid, pseudo-haploid and diploid genotype datasets. Due to its iterative nature, EMU is able to infer population structure even for ultra-low coverage sequencing datasets with very high missingness rates, and it can handle non-random missingness patterns where other existing methods fail. We use a procedure of low-rank approximations based on randomized PCA to iteratively update population structure in a very efficient manner.

EMU is written in Python and Cython and is freely available on Github. We have also implemented a very memory-efficient variant of EMU (EMU-mem) for large-scale datasets that uses the 2-bit data structures of PLINK binary file formats.
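For readers unfamiliar with the 2-bit layout of PLINK .bed files, the sketch below decodes a single byte into its four genotypes. It follows the published PLINK 1 binary format and is not EMU's own code; the function name decode_bed_byte and the use of None for missing are hypothetical choices for this example.

```python
# PLINK .bed 2-bit codes, read from the low-order bits of each byte upwards:
#   00 = homozygous for the first allele, 01 = missing,
#   10 = heterozygous,                    11 = homozygous for the second allele
CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def decode_bed_byte(byte):
    """Decode one .bed byte into four genotypes (copies of the first allele, None = missing)."""
    return [CODE_TO_A1_COUNT[(byte >> shift) & 0b11] for shift in (0, 2, 4, 6)]


if __name__ == "__main__":
    print(decode_bed_byte(0b11100100))  # four samples packed into a single byte
```

Storing four genotypes per byte is what makes this representation four times smaller than an 8-bit matrix.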

=Download=

The program can be downloaded from Github:
https://github.com/Rosemeis/emu

<pre>
git clone https://github.com/Rosemeis/emu.git
</pre>

See github for more information regarding installation.
Server-side usage is recommended.

==Quick start==
<pre>
# See all options in EMU
python emu.py -h

# Infer population structure using 2 eigenvectors and 64 threads from binary PLINK files (.bed, .bim, .fam)
python emu.py -plink plink_prefix -e 2 -t 64 -accel -o plink_emu

# Or directly from NumPy array input
python emu.py -npy matrix.npy -e 2 -t 64 -accel -o npy_emu

# Use EMU-mem variant
python emu_mem.py -plink plink_prefix -e 2 -t 64 -accel
</pre>

=Input=
EMU uses either binary PLINK files as input (RECOMMENDED!) or saved NumPy genotype matrices in 8-bit format (numpy.int8). EMU-mem only accepts PLINK files as input due to the 2-bit data structures. If you prefer the NumPy format, you can use the script provided on Github for conversion (convertMat.py).

=Using EMU=
We highly recommend using EM acceleration at all times (-accel). You can save the factor matrices (-indf_save) from a run to use as the starting point of a new run (-w, -s, -u). For convenience, we have also implemented the PC-based selection scan of Galinsky et al. 2016 (-selection). MAF filtering is possible, but it is recommended (and ASSUMED!) that it is done beforehand.

==Options==
; -plink [Prefix for binary PLINK files]
Path to PLINK files using their prefix (.bed, .bim, .fam).
; -npy [Numpy.int8 matrix format]
Path to NumPy matrix (.npy).
; -e [int]
Number of eigenvectors to use in optimization.
; -k [int]
Number of eigenvectors to output, if different from -e.
; -m [int]
Maximum number of iterations (Default: 100).
; -m_tole [float]
Tolerance for convergence of iterative procedure (Default: 5e-7).
; -t [int]
Number of threads to use (Default: 1).
; -maf [float]
Minimum minor allele frequency threshold (Default: 0.00).
; -selection
Perform genome-wide PC-based selection scan (Galinsky et al. 2016).
; -maf_save
Save the estimated minor allele frequencies.
; -bool_save
Save boolean vector of filtered sites based on MAF.
; -indf_save
Save estimated factor matrices (W, S, U).
; -index [file]
Provide index of individuals for guiding initialization (np.int8 format).
; -svd [string]
Select which low-rank SVD method to use, halko/arpack (Default: 'halko').
; -svd_power [int]
Number of power iterations to use in low-rank SVD (Default: 3).
; -w [file]
Provide starting point, left singular matrix (.w.npy).
; -s [file]
Provide starting point, singular values (.s.npy).
; -u [file]
Provide starting point, right singular matrix (.u.npy).
; -accel
Use EM acceleration (Highly recommended!).
; -o [string]
Prefix for all output files (Default: 'emu').
; -cost
Output Frobenius each iteration (DEBUG).
; -cost_step
Use acceleration based on Frobenius (DEBUG).

==Options in EMU-mem==
-maf, -bool_save, -svd, -cost, -cost_step functions are not available for EMU-mem. MAF filtering has to be performed beforehand, which is easily done in PLINK (--maf 0.05).
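As a reminder of what a MAF filter does, the following sketch computes the minor allele frequency per site from diploid genotypes and flags the sites that pass a threshold. It is a simplified illustration under the assumption of 0/1/2 genotype coding with None for missing values, not the code used by EMU or PLINK; both function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF at one site from diploid genotypes coded 0/1/2, with None for missing."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2.0 * len(observed))
    return min(freq, 1.0 - freq)


def maf_filter(sites, min_maf=0.05):
    """Return one boolean per site: True if the site passes the MAF threshold."""
    return [minor_allele_frequency(site) >= min_maf for site in sites]


if __name__ == "__main__":
    sites = [[0, 0, 0, 0], [0, 1, 2, None], [0, 0, 0, 1]]
    print(maf_filter(sites))
```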

=Run example=

Download data
<pre>
wget popgen.dk/software/download/fastNGSadmix/data.tar.gz
tar -xzf data.tar.gz
</pre>

run
<pre>
python emu.py -plink data/humanOrigins_7worldPops -e 4 -t 4 -accel -o plink_emu
</pre>

plot
<pre>
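library(RcppCNPy)
vec <- npyLoad("plink_emu.eigenvecs.npy") # Reads in eigenvectors
fam <- read.table("data/humanOrigins_7worldPops.fam", header=FALSE)
# Population labels (first .fam column) as an explicit factor,
# so that colours and legend also work in R >= 4.0
pop <- factor(fam[,1])
plot(vec[,1:2], col=as.integer(pop), xlab="PC1", ylab="PC2")
legend("center", legend=levels(pop), fill=seq_along(levels(pop)))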
</pre>

=Citation=
TBA

EMU

2020-04-27T09:30:36Z

Albrecht:

This page contains information about EMU (EM-PCA for Ultra-low Coverage Sequencing Data). EMU infers population structure in the presence of missingness and works for haploid, pseudo-haploid and diploid genotype datasets. Due to its iterative nature, EMU is able to infer population structure even for ultra-low coverage sequencing datasets with very high missingness rates, and it can handle non-random missingness patterns where other existing methods fail. We use a procedure of low-rank approximations based on randomized PCA to iteratively update population structure in a very efficient manner.

EMU is written in Python and Cython and is freely available on Github. We have also implemented a very memory-efficient variant of EMU (EMU-mem) for large-scale datasets that uses the 2-bit data structures of PLINK binary file formats.

=Download=

The program can be downloaded from Github:
https://github.com/Rosemeis/emu

<pre>
git clone https://github.com/Rosemeis/emu.git
</pre>

See github for more information regarding installation.
Server-side usage is recommended.

==Quick start==
<pre>
# See all options in EMU
python emu.py -h

# Infer population structure using 2 eigenvectors and 64 threads from binary PLINK files (.bed, .bim, .fam)
python emu.py -plink plink_prefix -e 2 -t 64 -accel -o plink_emu

# Or directly from NumPy array input
python emu.py -npy matrix.npy -e 2 -t 64 -accel -o npy_emu

# Use EMU-mem variant
python emu_mem.py -plink plink_prefix -e 2 -t 64 -accel
</pre>

=Input=
EMU uses either binary PLINK files as input (RECOMMENDED!) or saved NumPy genotype matrices in 8-bit format (numpy.int8). EMU-mem only accepts PLINK files as input due to the 2-bit data structures. If you prefer the NumPy format, you can use the script provided on Github for conversion (convertMat.py).

=Using EMU=
We highly recommend using EM acceleration at all times (-accel). You can save the factor matrices (-indf_save) from a run to use as the starting point of a new run (-w, -s, -u). For convenience, we have also implemented the PC-based selection scan of Galinsky et al. 2016 (-selection). MAF filtering is possible, but it is recommended (and ASSUMED!) that it is done beforehand.

==Options==
; -plink [Prefix for binary PLINK files]
Path to PLINK files using their prefix (.bed, .bim, .fam).
; -npy [Numpy.int8 matrix format]
Path to NumPy matrix (.npy).
; -e [int]
Number of eigenvectors to use in optimization.
; -k [int]
Number of eigenvectors to output, if different from -e.
; -m [int]
Maximum number of iterations (Default: 100).
; -m_tole [float]
Tolerance for convergence of iterative procedure (Default: 5e-7).
; -t [int]
Number of threads to use (Default: 1).
; -maf [float]
Minimum minor allele frequency threshold (Default: 0.00).
; -selection
Perform genome-wide PC-based selection scan (Galinsky et al. 2016).
; -maf_save
Save the estimated minor allele frequencies.
; -bool_save
Save boolean vector of filtered sites based on MAF.
; -indf_save
Save estimated factor matrices (W, S, U).
; -index [file]
Provide index of individuals for guiding initialization (np.int8 format).
; -svd [string]
Select which low-rank SVD method to use, halko/arpack (Default: 'halko').
; -svd_power [int]
Number of power iterations to use in low-rank SVD (Default: 3).
; -w [file]
Provide starting point, left singular matrix (.w.npy).
; -s [file]
Provide starting point, singular values (.s.npy).
; -u [file]
Provide starting point, right singular matrix (.u.npy).
; -accel
Use EM acceleration (Highly recommended!).
; -o [string]
Prefix for all output files (Default: 'emu').
; -cost
Output Frobenius each iteration (DEBUG).
; -cost_step
Use acceleration based on Frobenius (DEBUG).

==Options in EMU-mem==
-maf, -bool_save, -svd, -cost, -cost_step functions are not available for EMU-mem. MAF filtering has to be performed beforehand, which is easily done in PLINK (--maf 0.05).

=Run example=

<pre>
wget popgen.dk/software/download/fastNGSadmix/data.tar.gz
tar -xzf data.tar.gz

</pre>

=Citation=
TBA

IBSrelate

2019-09-24T12:36:56Z

Albrecht: /* Examine the results */

This page contains information about the method '''IBSrelate''', a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954

= Calculating statistics from the output of IBS and realSFS =
'''IBS''' and '''realSFS''' are two methods implemented in ANGSD [http://www.popgen.dk/angsd/index.php/ANGSD] that can be used to estimate the allele sharing ''genotype distribution'' for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of '''IBS''' and '''realSFS''' and produce estimates of '''R0''', '''R1''' and '''KING-robust kinship'''.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R

= Application to the ANGSD example data =
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

== Setup ==
You will need installations of both [http://www.popgen.dk/angsd/index.php/ANGSD ANGSD] and [http://www.htslib.org/ samtools], as well as Rscript (part of [https://www.r-project.org/ R]).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

=== Set up shell variables ===
<pre>
# set paths to the analysis programs, may need to be replaced with your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"
</pre>

=== Get the example data ===
The example data set has small bam files from ten individuals.
<pre>
# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam; do "$SAMTOOLS" index "$i"; done
</pre>

== realSFS method ==
=== Setup ===
<pre>
# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
</pre>

=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz
</pre>

=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, and specify a genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these is not evaluated here, but it should be examined as part of a general QC process for NGS data.

<pre>
# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
# parse out the sample name
base="$(basename -- $b)"
sample="${base%%.mapped.*}"
SAMPLES+=("$sample")
echo $sample
echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
$ANGSD -b ${s}.filelist.ind \
-anc hg19ancNoChr.fa.gz \
-minMapQ 30 -minQ 20 -GL 2 \
-doSaf 1 -doDepth 1 -doCounts 1 \
-out ${s}
done
</pre>

=== run realSFS on each pair of individuals ===
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
<pre>
for i in {0..9}; do
for j in {0..9}; do
if (( i < j)); then
sample1=${SAMPLES[i]}
sample2=${SAMPLES[j]}
$realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
fi
done
done
</pre>
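The nested loop above visits each unordered pair exactly once, which for 10 individuals gives 45 pairs. The same enumeration can be sketched in Python as a quick sanity check on the number of output files to expect; the sample names used here are placeholders, not the names in the example data.

```python
from itertools import combinations


def unique_pairs(samples):
    """Return every unordered pair of samples exactly once, in input order."""
    return list(combinations(samples, 2))


if __name__ == "__main__":
    samples = ["ind%d" % i for i in range(10)]  # placeholder sample names
    pairs = unique_pairs(samples)
    print(len(pairs))           # number of 2dsfs files to expect
    print(pairs[0], pairs[-1])  # first and last pair
```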

=== Parse the results for a single pair of individuals ===
The commands below show how to use the read_realSFS() function in R to parse the 2dsfs output file generated by realSFS. It may be easier to open an interactive R session, source read_realSFS.R, and use read_realSFS() to parse the file into a data frame that you can then explore, plot, or export.
<pre>
Rscript \
-e "source('./read_realSFS.R')" \
-e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
-e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
-e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
</pre>

== IBS Method ==

=== Setup ===
<pre>
# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist
</pre>

=== make a genotype likelihood file (glf) containing all individuals ===
See [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Output_genotype_likelihoods here] for a description of the file format produced by -doGlf 1.
<pre>
$ANGSD -b all.filelist \
-minMapQ 30 -minQ 20 -GL 2 \
-doGlf 1 \
-out example
</pre>

=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The man page for misc/ibs is [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].

<pre>
$IBS -glf example.glf.gz \
-model 0 \
-nInd 10 -allpairs 1 \
-outFileName results_IBS/ibs.model0.results
</pre>
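The figure of 100 pairwise genotypes follows from the 10 unordered diploid genotypes that can be formed from the four bases. The short sketch below enumerates them; the ordering shown is simply the one produced by itertools and is not claimed to match the column order of the ibs output.

```python
from itertools import combinations_with_replacement, product

ALLELES = "ACGT"

# 10 unordered two-allele genotypes per individual: AA, AC, AG, AT, CC, ...
GENOTYPES = ["".join(g) for g in combinations_with_replacement(ALLELES, 2)]

# 10 x 10 = 100 genotype combinations for a pair of individuals
PAIRS = list(product(GENOTYPES, repeat=2))

if __name__ == "__main__":
    print(len(GENOTYPES), GENOTYPES)
    print(len(PAIRS))
```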

=== Examine the IBS results ===
<pre>
Rscript \
-e "source('./read_IBS.R')" \
-e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
-e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist
</pre>

=Frequently asked Questions=

== I run out of RAM running IBS or realSFS on my entire data set, what should I do? ==
You can split the data set up by chromosome (or contig), run the IBS or SFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There are a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Also, without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it is important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.
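The per-chromosome summation described above can be sketched as follows. The example assumes that the values 'A' to 'I' are the nine cells of the 3x3 pairwise genotype table in row-major order (so that C and G are the opposite-homozygote cells and E is the shared-heterozygote cell), and it uses the definitions of R0, R1 and KING-robust kinship as I understand them from the paper. Please check both assumptions against read_IBS.R and read_realSFS.R before relying on them; the function names and numbers are made up.

```python
def combine_counts(per_chromosome):
    """Sum the nine genotype-pair values (A..I) element-wise across chromosomes."""
    return [sum(values) for values in zip(*per_chromosome)]


def derived_stats(counts):
    """R0, R1 and KING-robust kinship from the combined values A..I (assumed order)."""
    A, B, C, D, E, F, G, H, I = counts
    ibs0 = C + G          # opposite homozygotes
    ibs1 = B + D + F + H  # one heterozygote, one homozygote
    hethet = E            # both individuals heterozygous
    return {
        "R0": ibs0 / hethet,
        "R1": hethet / (ibs0 + ibs1),
        "Kin": (hethet - 2 * ibs0) / (ibs1 + 2 * hethet),
    }


if __name__ == "__main__":
    chr1 = [100, 10, 1, 10, 20, 10, 1, 10, 100]  # made-up values for one chromosome
    chr2 = [100, 10, 1, 10, 20, 10, 1, 10, 100]  # made-up values for another
    print(derived_stats(combine_counts([chr1, chr2])))
```

Summing first and taking ratios afterwards is deliberate: averaging per-chromosome ratios would weight small chromosomes too heavily.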

== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into '''X''' blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.
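A minimal sketch of a weighted block jackknife for KING-robust kinship is given below. It takes one vector of values 'A' to 'I' per block (for example, per chromosome), weights each block by its number of sites, and uses the delete-m jackknife formulas as I recall them from the method citation above. The order of the nine values, the kinship formula, and the function names are assumptions made for this example, so treat it as a starting point rather than a reference implementation. It requires at least two blocks.

```python
import math


def king_kinship(counts):
    """KING-robust kinship from values A..I (assumed row-major 3x3 order)."""
    A, B, C, D, E, F, G, H, I = counts
    return (E - 2 * (C + G)) / (B + D + F + H + 2 * E)


def weighted_block_jackknife(blocks, stat=king_kinship):
    """Return (bias-corrected estimate, standard error) from per-block A..I values."""
    g = len(blocks)
    total = [sum(col) for col in zip(*blocks)]
    n = sum(total)                  # total weight (number of sites)
    theta = stat(total)             # estimate from all blocks together
    loo, h = [], []
    for block in blocks:
        rest = [t - b for t, b in zip(total, block)]
        loo.append(stat(rest))      # estimate with this block left out
        h.append(n / sum(block))    # h_j = n / m_j, with m_j the block weight
    theta_jack = g * theta - sum((1 - 1 / hj) * tj for hj, tj in zip(h, loo))
    pseudo = [hj * theta - (hj - 1) * tj for hj, tj in zip(h, loo)]
    var = sum((p - theta_jack) ** 2 / (hj - 1) for p, hj in zip(pseudo, h)) / g
    return theta_jack, math.sqrt(var)


if __name__ == "__main__":
    blocks = [  # made-up values for three blocks of unequal size
        [100, 10, 1, 10, 20, 10, 1, 10, 100],
        [200, 30, 5, 25, 50, 20, 5, 25, 180],
        [50, 5, 0, 6, 12, 4, 1, 5, 60],
    ]
    est, se = weighted_block_jackknife(blocks)
    print("kinship = %.4f, se = %.4f" % (est, se))
```

An approximate 95% confidence interval is then often taken as the estimate plus or minus 1.96 standard errors, which relies on a normal approximation and on the blocks being independent.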

== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once, otherwise you will need to split your data into pieces, analyze each piece, and combine them afterwards.

== How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates? ==
Probably in multiple, complicated ways!

But one thing to watch out for is regions of your reference genome that attract aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals. This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) is the strongest signal we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) the pair of individuals has the maximum possible allelic identity at the site. An upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.

If possible these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with a HWE test (see [https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13019 here] for a method of testing HWE on low depth data that is robust to population structure). Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.

One thing we found with the human data in the paper was that applying a "mappability" filter improved our estimates of R0, R1, and KING-robust kinship. This is essentially a scan of the reference genome that reports where in the genome it is possible to uniquely align reads, and it needs to be calibrated to your read lengths. We used [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 GEM], but there are also [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237805 new programs] that claim to do the same.

== How will inbreeding affect the estimates produced here? ==
Inbreeding within one or both individuals of a pair can affect estimates of R0, R1, or KING-robust kinship, but there isn't a simple answer that covers all possibilities.

If only one of the individuals is inbred, the pair of individuals may appear less related than otherwise, as the pair of individuals may have an elevated number of alternate homozygous genotypes and a reduced number of shared heterozygous genotypes.

When considering inbreeding, it can be useful to compare the heterozygosities of a pair of individuals. This is possible from the output of realSFS or IBS, and is facilitated by the parsing scripts included above.

== What data type is required for these analyses? ==
The best starting point is aligned sequencing reads (bam files). The above example shows how to process these into all of the intermediate and final outputs.

The program [http://academic.oup.com/gigascience/article-abstract/8/5/giz034/5481763 ngsRelateV2] can also be used to estimate R0, R1, or KING-robust kinship. It can take multiple formats including BAM, VCF and GLF. Please note that ngsRelateV2 has its own implementation of something very close to the SFS-based method described in the paper, and so it assumes that you know one of the alleles that is present at each site.

=Citation=
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954

==Bibtex==
<pre>
@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}
</pre>

IBSrelate

2019-09-24T12:36:44Z

Albrecht: /* run IBS, this will analyse each pair of individuals */

This page contains information about the method '''IBSrelate''', a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954

= Calculating statistics from the output of IBS and realSFS =
'''IBS''' and '''realSFS''' are two methods implemented in ANGSD [http://www.popgen.dk/angsd/index.php/ANGSD] that can be used to estimate the allele sharing ''genotype distribution'' for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of '''IBS''' and '''realSFS''' and produce estimates of '''R0''', '''R1''' and '''KING-robust kinship'''.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R

= Application to the ANGSD example data =
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

== Setup ==
You will need installations of both [http://www.popgen.dk/angsd/index.php/ANGSD ANGSD] and [http://www.htslib.org/ samtools], as well as Rscript (part of [https://www.r-project.org/ R]).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

=== Set up shell variables ===
<pre>
# set paths to the analysis programs, may need to be replaced with your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"
</pre>

=== Get the example data ===
The example data set has small bam files from ten individuals.
<pre>
# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam; do "$SAMTOOLS" index "$i"; done
</pre>

== realSFS method ==
=== Setup ===
<pre>
# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
</pre>

=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz
</pre>

=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, and specify a genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these is not evaluated here, but it should be examined as part of a general QC process for NGS data.

<pre>
# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
# parse out the sample name
base="$(basename -- $b)"
sample="${base%%.mapped.*}"
SAMPLES+=("$sample")
echo $sample
echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
$ANGSD -b ${s}.filelist.ind \
-anc hg19ancNoChr.fa.gz \
-minMapQ 30 -minQ 20 -GL 2 \
-doSaf 1 -doDepth 1 -doCounts 1 \
-out ${s}
done
</pre>

=== run realSFS on each pair of individuals ===
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
<pre>
for i in {0..9}; do
for j in {0..9}; do
if (( i < j)); then
sample1=${SAMPLES[i]}
sample2=${SAMPLES[j]}
$realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
fi
done
done
</pre>

=== Parse the results for a single pair of individuals ===
The commands below show how to use the read_realSFS() function in R to parse the 2dsfs output file generated by realSFS. It may be easier to open an interactive R session, source read_realSFS.R, and use read_realSFS() to parse the file into a data frame that you can then explore, plot, or export.
<pre>
Rscript \
-e "source('./read_realSFS.R')" \
-e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
-e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
-e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
</pre>

== IBS Method ==

=== Setup ===
<pre>
# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist
</pre>

=== make a genotype likelihood file (glf) containing all individuals ===
See [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Output_genotype_likelihoods here] for a description of the file format produced by -doGlf 1.
<pre>
$ANGSD -b all.filelist \
-minMapQ 30 -minQ 20 -GL 2 \
-doGlf 1 \
-out example
</pre>

=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The man page for misc/ibs is [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].

<pre>
$IBS -glf example.glf.gz \
-model 0 \
-nInd 10 -allpairs 1 \
-outFileName results_IBS/ibs.model0.results
</pre>

=== Examine the results ===
<pre>
Rscript \
-e "source('./read_IBS.R')" \
-e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
-e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist
</pre>

=Frequently asked Questions=

== I run out of RAM running IBS or realSFS on my entire data set, what should I do? ==
You can split the data set up by chromosome (or contig), run the IBS or SFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There are a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Also, without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it is important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.

== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into '''X''' blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.

== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once, otherwise you will need to split your data into pieces, analyze each piece, and combine them afterwards.

== How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates? ==
Probably in multiple, complicated ways!

One thing to watch out for is regions of your reference genome that attract aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals. This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) is the strongest signal we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) the pair of individuals has the maximum possible allelic identity at the site. An upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.

If possible these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with a HWE test (see [https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13019 here] for a method of testing HWE on low depth data that is robust to population structure). Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.

One thing we found with the human data in the paper was that applying a "mappability" filter improved our estimates of R0, R1, and KING-robust kinship. This is essentially a scan of the reference genome that reports where in the genome it is possible to uniquely align reads, and it needs to be calibrated to your read lengths. We used [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 GEM], but there are also [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237805 new programs] that claim to do the same.

== How will inbreeding affect the estimates produced here? ==
Inbreeding within one or both individuals of a pair can affect estimates of R0, R1, or KING-robust kinship, but there isn't a simple answer that covers all possibilities.

If only one of the individuals is inbred, the pair of individuals may appear less related than otherwise, as the pair of individuals may have an elevated number of alternate homozygous genotypes and a reduced number of shared heterozygous genotypes.

When considering inbreeding, it can be useful to compare the heterozygosities of a pair of individuals. This is possible from the output of realSFS or IBS, and is facilitated by the parsing scripts included above.

== What data type is required for these analysis? ==
The best starting point is aligned sequencing reads (bam files). The example above shows how to process these into all of the intermediate and final outputs.

The program [http://academic.oup.com/gigascience/article-abstract/8/5/giz034/5481763 ngsRelateV2] can also be used to estimate R0, R1, or KING-robust kinship. It can take multiple formats including BAM, VCF and GLF. Please note that ngsRelateV2 has its own implementation of something very close to the SFS-based method described in the paper, and so it assumes that you know one of the alleles that is present at each site.

=Citation=
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954

==Bibtex==
<pre>
@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}
</pre>

IBSrelate

2019-09-24T12:35:12Z

Albrecht: /* What data type is required for these analysis? */

This page contains information about the method '''IBSrelate''', a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954

= Calculating statistics from the output of IBS and realSFS =
'''IBS''' and '''realSFS''' are two methods implemented in ANGSD [http://www.popgen.dk/angsd/index.php/ANGSD] that can be used to estimate the allele sharing ''genotype distribution'' for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of '''IBS''' and '''realSFS''' and produce estimates of '''R0''', '''R1''' and '''KING-robust kinship'''.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R

= Application to the ANGSD example data =
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

== Setup ==
You will need installations of both [http://www.popgen.dk/angsd/index.php/ANGSD ANGSD] and [http://www.htslib.org/ samtools], as well as Rscript (part of [https://www.r-project.org/ R]).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

=== Set up shell variables ===
<pre>
# set paths to the analysis programs, may need to be replaced with your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"
</pre>
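Before starting, it can help to confirm that each program path set above actually resolves to something executable. The following is only a minimal sketch: check_tools is a hypothetical helper name, not part of ANGSD or samtools, and it relies only on the standard shell builtin command -v.

```shell
# check_tools NAME_OR_PATH...: hypothetical helper (not part of ANGSD).
# Prints a message for every program that cannot be found or executed,
# and returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage sketch with the variables defined above:
#   check_tools "$ANGSD" "$realSFS" "$IBS" "$SAMTOOLS" Rscript
```

If the function reports a missing program, adjust the corresponding variable to point at your local installation.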

=== Get the example data ===
The example data set has small bam files from ten individuals.
<pre>
# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam; do "$SAMTOOLS" index "$i"; done
</pre>

== realSFS method ==
=== Setup ===
<pre>
# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
</pre>

=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz
</pre>

=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, and specify a genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these is not evaluated here, but it should be examined as part of a general QC process for NGS data.

<pre>
# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
# parse out the sample name
base="$(basename -- $b)"
sample="${base%%.mapped.*}"
SAMPLES+=("$sample")
echo $sample
echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
$ANGSD -b ${s}.filelist.ind \
-anc hg19ancNoChr.fa.gz \
-minMapQ 30 -minQ 20 -GL 2 \
-doSaf 1 -doDepth 1 -doCounts 1 \
-out ${s}
done
</pre>

=== run realSFS on each pair of individuals ===
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
<pre>
for i in {0..9}; do
for j in {0..9}; do
if (( i < j)); then
sample1=${SAMPLES[i]}
sample2=${SAMPLES[j]}
$realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
fi
done
done
</pre>
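The loop above hard-codes the ten example individuals (indices 0 to 9). For other data sets, the same "each pair just once" logic can be sketched without fixed indices. The helper name list_pairs is hypothetical and only illustrates the enumeration; it does not call realSFS itself.

```shell
# list_pairs NAME...: hypothetical helper, not part of ANGSD.
# Prints every unordered pair of the given names once, tab-separated,
# by pairing the first name with all later names and then dropping it.
list_pairs() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        for other in "$@"; do
            printf '%s\t%s\n' "$first" "$other"
        done
    done
}

# Usage sketch with the variables defined earlier on this page (bash):
#   list_pairs "${SAMPLES[@]}" | while read -r sample1 sample2; do
#       $realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
#   done
```

With n individuals this yields n(n-1)/2 pairs, so 45 pairs for the ten example individuals.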

=== Parse the results for a single pair of individuals ===
The command below shows how to use the read_realSFS() function in R to parse the 2dsfs output file generated by realSFS. For further work it will be easier to open an interactive R session, source read_realSFS.R, and use read_realSFS() to parse the file into a data frame that you can then explore, plot, or export.
<pre>
Rscript \
-e "source('./read_realSFS.R')" \
-e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
-e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
-e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
</pre>

== IBS Method ==

=== Setup ===
<pre>
# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist
</pre>

=== make a genotype likelihood file (glf) containing all individuals ===
See [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Output_genotype_likelihoods here] for a description of the file format produced by -doGlf 1.
<pre>
$ANGSD -b all.filelist \
-minMapQ 30 -minQ 20 -GL 2 \
-doGlf 1 \
-out example
</pre>

=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The man page for misc/ibs is [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].

<pre>
$IBS -glf example.glf.gz \
-model 0 \
-nInd 10 -allpairs 1 \
-outFileName results_IBS/ibs.model0.results
</pre>

=== Examine the results ===
<pre>
Rscript \
-e "source('./read_IBS.R')" \
-e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
-e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist
</pre>
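Because the individuals are zero-indexed in file list order, it can be convenient to print the index next to each bam path when looking up which individuals a row of the results refers to. A minimal sketch, where number_filelist is a hypothetical helper name:

```shell
# number_filelist FILE: hypothetical helper, not part of ANGSD.
# Prefixes every line of FILE with its zero-based line index,
# which is how the IBS output is described as numbering individuals.
number_filelist() {
    awk '{ printf "%d\t%s\n", NR - 1, $0 }' "$1"
}

# Usage sketch:
#   number_filelist all.filelist
```

The ind1 and ind2 values printed by the R command above can then be matched against the first column.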

=Frequently asked Questions=

== I run out of RAM running IBS or realSFS on my entire data set, what should I do? ==
You can split the data set up by chromosome (or contig), run the IBS or SFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There are a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Also, without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it is important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.
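The combining step can be sketched with awk. This is an illustration under stated assumptions rather than a replacement for the R parsing scripts: it assumes each per-chromosome realSFS output holds the nine values on one line in the order 'A' to 'I', and it uses the definitions R0 = (C+G)/E, R1 = E/(B+D+F+H+C+G) and KING-robust kinship = (E-2(C+G))/(B+D+F+H+2E). The function name sum_2dsfs is hypothetical.

```shell
# sum_2dsfs FILE...: hypothetical helper, not part of ANGSD.
# Assumes every input line holds the nine values A..I in that order.
# Sums each value over all lines of all files, then computes the three
# statistics from the summed values (not from per-chromosome ratios).
sum_2dsfs() {
    awk '
    { for (k = 1; k <= 9; k++) s[k] += $k }
    END {
        ibs0   = s[3] + s[7]                  # C + G
        ibs1   = s[2] + s[4] + s[6] + s[8]    # B + D + F + H
        hethet = s[5]                         # E
        # Note: R0 is undefined when E is zero; this sketch does not guard against it.
        printf "R0=%.4f R1=%.4f Kin=%.4f\n", ibs0 / hethet, hethet / (ibs0 + ibs1), (hethet - 2 * ibs0) / (ibs1 + 2 * hethet)
    }' "$@"
}

# Usage sketch (file names are placeholders):
#   sum_2dsfs chr1.2dsfs chr2.2dsfs chr3.2dsfs
```

Summing first and taking ratios afterwards matters here: averaging per-chromosome ratios would weight small chromosomes as heavily as large ones.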

== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into '''X''' blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.
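As a rough illustration of the procedure, the sketch below computes a weighted (delete-m) block jackknife standard error for KING-robust kinship with awk. Several things here are assumptions and not taken from the paper's code: the input file has one line per block containing the nine values 'A' to 'I', the block weight is the number of sites in the block (the sum of its nine values), and the function name jackknife_kin is hypothetical. R0 and R1 could be handled the same way by swapping the statistic.

```shell
# jackknife_kin FILE: hypothetical helper, not part of ANGSD.
# FILE: one line per block, nine values A..I per line (at least two blocks).
# Prints the full-data kinship estimate and its jackknife standard error.
jackknife_kin() {
    awk '
    function kin(v,    ibs0, ibs1) {
        ibs0 = v[3] + v[7]
        ibs1 = v[2] + v[4] + v[6] + v[8]
        return (v[5] - 2 * ibs0) / (ibs1 + 2 * v[5])
    }
    {
        g = NR
        m[g] = 0
        for (k = 1; k <= 9; k++) { x[g, k] = $k; tot[k] += $k; m[g] += $k }
        n += m[g]
    }
    END {
        theta = kin(tot)                      # estimate from all blocks
        est = g * theta
        for (j = 1; j <= g; j++) {
            for (k = 1; k <= 9; k++) loo[k] = tot[k] - x[j, k]
            th[j] = kin(loo)                  # estimate with block j left out
            est -= (1 - m[j] / n) * th[j]     # weighted jackknife estimate
        }
        for (j = 1; j <= g; j++) {
            h = n / m[j]
            tau = h * theta - (h - 1) * th[j] # pseudo-value for block j
            var += (tau - est) ^ 2 / (h - 1)
        }
        printf "Kin=%.4f SE=%.4f\n", theta, sqrt(var / g)
    }' "$1"
}

# Usage sketch (blocks.txt is a placeholder, e.g. one line per chromosome):
#   jackknife_kin blocks.txt
```

An approximate 95% confidence interval would then be the estimate plus or minus 1.96 standard errors, subject to the independence caveats described above.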

== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once, otherwise you will need to split your data into pieces, analyze each piece, and combine them afterwards.

== How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates? ==
Probably in multiple, complicated ways!

One thing to watch out for is regions of your reference genome that attract aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals. This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) is the strongest signal we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) the pair of individuals has the maximum possible allelic identity at the site. An upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.

If possible these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with a HWE test (see [https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13019 here] for a method of testing HWE on low depth data that is robust to population structure). Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.

One thing we found with the human data in the paper was that applying a "mappability" filter improved our estimates of R0, R1, and KING-robust kinship. This is essentially a scan of the reference genome that reports where in the genome it is possible to uniquely align reads, and it needs to be calibrated to your read lengths. We used [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 GEM], but there are also [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237805 new programs] that claim to do the same.

== How will inbreeding affect the estimates produced here? ==
Inbreeding within one or both individuals of a pair can affect estimates of R0, R1, or KING-robust kinship, but there isn't a simple answer that covers all possibilities.

If only one of the individuals is inbred, the pair of individuals may appear less related than otherwise, as the pair of individuals may have an elevated number of alternate homozygous genotypes and a reduced number of shared heterozygous genotypes.

When considering inbreeding, it can be useful to compare the heterozygosities of a pair of individuals. This is possible from the output of realSFS or IBS, and is facilitated by the parsing scripts included above.
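For a quick look outside R, per-individual heterozygosity can also be sketched directly from a pairwise 2dsfs file. This assumes the nine values are ordered 'A' to 'I' with the first individual's genotype changing slowest (values 4 to 6 are sites where individual 1 is heterozygous, values 2, 5 and 8 are sites where individual 2 is heterozygous). That ordering is an assumption to check against the parsing scripts, and pair_het is a hypothetical helper name.

```shell
# pair_het FILE: hypothetical helper, not part of ANGSD.
# FILE: a 2dsfs file with nine values A..I on one line.
# Prints the fraction of sites at which each individual is heterozygous.
pair_het() {
    awk '{
        total = 0
        for (k = 1; k <= 9; k++) total += $k
        # individual 1 heterozygous: D, E, F; individual 2 heterozygous: B, E, H
        printf "het1=%.4f het2=%.4f\n", ($4 + $5 + $6) / total, ($2 + $5 + $8) / total
    }' "$1"
}

# Usage sketch:
#   pair_het results_realsfs/smallNA06985_smallNA11830.2dsfs
```

A large difference between the two values for a pair is one hint that inbreeding (or a data quality issue) may be affecting one of the individuals.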

== What data type is required for these analysis? ==
The best starting point is aligned sequencing reads (bam files). The example above shows how to process these into all of the intermediate and final outputs.

The program [http://academic.oup.com/gigascience/article-abstract/8/5/giz034/5481763 ngsRelateV2] can also be used to estimate R0, R1, or KING-robust kinship. It can take multiple formats including BAM, VCF and GLF. Please note that ngsRelateV2 has its own implementation of something very close to the SFS-based method described in the paper, and so it assumes that you know one of the alleles that is present at each site.

=Citation=
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954

==Bibtex==
<pre>
@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}
</pre>

IBSrelate

2019-09-24T12:34:21Z

Albrecht: /* Frequently asked Questions */

This page contains information about the method '''IBSrelate''', a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954

= Calculating statistics from the output of IBS and realSFS =
'''IBS''' and '''realSFS''' are two methods implemented in ANGSD [http://www.popgen.dk/angsd/index.php/ANGSD] that can be used to estimate the allele sharing ''genotype distribution'' for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of '''IBS''' and '''realSFS''' and produce estimates of '''R0''', '''R1''' and '''KING-robust kinship'''.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R

= Application to the ANGSD example data =
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

== Setup ==
You will need installations of both [http://www.popgen.dk/angsd/index.php/ANGSD ANGSD] and [http://www.htslib.org/ samtools], as well as Rscript (part of [https://www.r-project.org/ R]).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

=== Set up shell variables ===
<pre>
# set paths to the analysis programs, may need to be replaced with your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"
</pre>

=== Get the example data ===
The example data set has small bam files from ten individuals.
<pre>
# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam; do "$SAMTOOLS" index "$i"; done
</pre>

== realSFS method ==
=== Setup ===
<pre>
# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
</pre>

=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz
</pre>

=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, and specify a genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these is not evaluated here, but it should be examined as part of a general QC process for NGS data.

<pre>
# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
# parse out the sample name
base="$(basename -- $b)"
sample="${base%%.mapped.*}"
SAMPLES+=("$sample")
echo $sample
echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
$ANGSD -b ${s}.filelist.ind \
-anc hg19ancNoChr.fa.gz \
-minMapQ 30 -minQ 20 -GL 2 \
-doSaf 1 -doDepth 1 -doCounts 1 \
-out ${s}
done
</pre>

=== run realSFS on each pair of individuals ===
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
<pre>
for i in {0..9}; do
for j in {0..9}; do
if (( i < j)); then
sample1=${SAMPLES[i]}
sample2=${SAMPLES[j]}
$realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
fi
done
done
</pre>

=== Parse the results for a single pair of individuals ===
The command below shows how to use the read_realSFS() function in R to parse the 2dsfs output file generated by realSFS. For further work it will be easier to open an interactive R session, source read_realSFS.R, and use read_realSFS() to parse the file into a data frame that you can then explore, plot, or export.
<pre>
Rscript \
-e "source('./read_realSFS.R')" \
-e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
-e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
-e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
</pre>

== IBS Method ==

=== Setup ===
<pre>
# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist
</pre>

=== make a genotype likelihood file (glf) containing all individuals ===
See [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Output_genotype_likelihoods here] for a description of the file format produced by -doGlf 1.
<pre>
$ANGSD -b all.filelist \
-minMapQ 30 -minQ 20 -GL 2 \
-doGlf 1 \
-out example
</pre>

=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The man page for misc/ibs is [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].

<pre>
$IBS -glf example.glf.gz \
-model 0 \
-nInd 10 -allpairs 1 \
-outFileName results_IBS/ibs.model0.results
</pre>

=== Examine the results ===
<pre>
Rscript \
-e "source('./read_IBS.R')" \
-e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
-e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist
</pre>

=Frequently asked Questions=

== I run out of RAM running IBS or realSFS on my entire data set, what should I do? ==
You can split the data set up by chromosome (or contig), run the IBS or SFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There are a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Also, without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it is important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.

== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into '''X''' blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.

== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once, otherwise you will need to split your data into pieces, analyze each piece, and combine them afterwards.

== How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates? ==
Probably in multiple, complicated ways!

One thing to watch out for is regions of your reference genome that attract aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals. This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) is the strongest signal we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) the pair of individuals has the maximum possible allelic identity at the site. An upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.

If possible these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with a HWE test (see [https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13019 here] for a method of testing HWE on low depth data that is robust to population structure). Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.

One thing we found with the human data in the paper was that applying a "mappability" filter improved our estimates of R0, R1, and KING-robust kinship. A mappability filter is essentially a scan of the reference genome that reports where in the genome it is possible to uniquely align reads, and it needs to be calibrated to your read lengths. We used [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 GEM], but there are also [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237805 new programs] that claim to do the same.

== How will inbreeding affect the estimates produced here? ==
Inbreeding within one or both individuals of a pair can affect estimates of R0, R1, or KING-robust kinship, but there isn't a simple answer that covers all possibilities.

If only one of the individuals is inbred, the pair of individuals may appear less related than otherwise, as the pair of individuals may have an elevated number of alternate homozygous genotypes and a reduced number of shared heterozygous genotypes.

When considering inbreeding, it can be useful to compare the heterozygosities of a pair of individuals. This is possible from the output of realSFS or IBS, and is facilitated by the parsing scripts included above.
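
As a rough illustration of that comparison, the sketch below computes the heterozygosity of each individual directly from the nine pairwise genotype counts. The function name <code>pair_heterozygosity</code> is made up for this example, and the ordering of the counts is an assumption (stated in the comments) that should be checked against the parsing scripts before relying on which value belongs to which individual.

```shell
# Sketch: per-individual heterozygosity from the nine pairwise genotype counts.
# Assumption (verify against the parsing scripts): the counts are ordered A..I
# as a 3x3 table read row by row, with rows = genotype of individual 1 and
# columns = genotype of individual 2. Under that assumption D, E, F are sites
# where individual 1 is heterozygous and B, E, H are sites where individual 2 is.
pair_heterozygosity() {
    awk 'NF == 9 {
        n = 0
        for (k = 1; k <= 9; k++) n += $k      # total number of sites
        if (n == 0) next                       # skip empty input lines
        printf "het1=%.4f het2=%.4f\n", ($4 + $5 + $6) / n, ($2 + $5 + $8) / n
    }' "$@"
}

# Toy counts (invented numbers, not real data):
printf '100 10 0 30 40 10 0 10 0\n' | pair_heterozygosity
```

A large difference between the two values for a pair may be worth a closer look before interpreting R0, R1, or KING-robust kinship for that pair.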

== What data type is required for these analyses? ==
The best starting point is aligned sequencing reads (bam files). The above example shows how to process these into all of the intermediate and final outputs.

The program [http://academic.oup.com/gigascience/article-abstract/8/5/giz034/5481763 ngsRelateV2] can also be used to estimate R0, R1, or KING-robust kinship. It can take multiple formats including GLF, BAM, and VCF. Please note that ngsRelateV2 has its own implementation of something very close to the SFS-based method described in the paper, and so it assumes that you know one of the alleles that is present at each site.

=Citation=
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954

==Bibtex==
<pre>
@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}
</pre>

IBSrelate

2019-09-24T12:27:41Z

Albrecht: /* How will inbreeding affects the estimates produces here? */

This page contains information about the method '''IBSrelate''', a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954

= Calculating statistics from the output of IBS and realSFS =
'''IBS''' and '''realSFS''' are two methods implemented in ANGSD [http://www.popgen.dk/angsd/index.php/ANGSD] that can be used to estimate the allele sharing ''genotype distribution'' for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of '''IBS''' and '''realSFS''' and produce estimates of '''R0''', '''R1''' and '''KING-robust kinship'''.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R

= Application to the ANGSD example data =
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

== Setup ==
You will need installations of both [http://www.popgen.dk/angsd/index.php/ANGSD ANGSD] and [http://www.htslib.org/ samtools], as well as Rscript (part of [https://www.r-project.org/ R]).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

=== Set up shell variables ===
<pre>
# set paths to the analysis programs; these may need to be replaced with the paths of your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"
</pre>

=== Get the example data ===
The example data set has small bam files from ten individuals.
<pre>
# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam; do "$SAMTOOLS" index "$i"; done
</pre>

== realSFS method ==
=== Setup ===
<pre>
# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
</pre>

=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz
</pre>

=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, and specify a genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these is not evaluated here, but it should be examined as part of a general QC process for NGS data.

<pre>
# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
# parse out the sample name
base="$(basename -- $b)"
sample="${base%%.mapped.*}"
SAMPLES+=("$sample")
echo $sample
echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
$ANGSD -b ${s}.filelist.ind \
-anc hg19ancNoChr.fa.gz \
-minMapQ 30 -minQ 20 -GL 2 \
-doSaf 1 -doDepth 1 -doCounts 1 \
-out ${s}
done
</pre>

=== run realSFS on each pair of individuals ===
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
<pre>
for i in {0..9}; do
for j in {0..9}; do
if (( i < j)); then
sample1=${SAMPLES[i]}
sample2=${SAMPLES[j]}
$realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
fi
done
done
</pre>
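
The loop above hard-codes the indices 0 to 9. For a data set with a different number of individuals, a small helper that lists every unordered pair can be used instead. This is only a sketch: <code>list_pairs</code> is a hypothetical helper, not part of ANGSD or the supplementary scripts, and the sample names in the example call are placeholders.

```shell
# Sketch: print every unordered pair of the arguments, one tab-separated pair
# per line. Works for any number of names, so no indices are hard-coded.
list_pairs() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift                                  # remaining arguments are the partners
        for other in "$@"; do
            printf '%s\t%s\n' "$first" "$other"
        done
    done
}

# Example with three placeholder names (gives three pairs):
list_pairs ind1 ind2 ind3

# In the walkthrough, the output could drive realSFS along these lines (bash):
# list_pairs "${SAMPLES[@]}" | while read -r s1 s2; do
#     $realSFS ${s1}.saf.idx ${s2}.saf.idx > ./results_realsfs/${s1}_${s2}.2dsfs
# done
```

With ten individuals this yields the same 45 pairs as the indexed loop.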

=== Parse the results for a single pair of individuals ===
The command below shows how to use the read_realSFS() function in R to parse the 2dsfs file generated by realSFS. For further exploration, it will be easier to open an interactive R session, source read_realSFS.R, and use the function read_realSFS() to parse the file into a data frame that you can then explore, plot, or export.
<pre>
Rscript \
-e "source('./read_realSFS.R')" \
-e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
-e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
-e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
</pre>

== IBS Method ==

=== Setup ===
<pre>
# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist
</pre>

=== make a genotype likelihood file (glf) containing all individuals ===
See [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Output_genotype_likelihoods here] for a description of the file format produced by -doGlf 1.
<pre>
$ANGSD -b all.filelist \
-minMapQ 30 -minQ 20 -GL 2 \
-doGlf 1 \
-out example
</pre>

=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The documentation for misc/ibs is [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].

<pre>
$IBS -glf example.glf.gz \
-model 0 \
-nInd 10 -allpairs 1 \
-outFileName results_IBS/ibs.model0.results
</pre>

=== Examine the results ===
<pre>
Rscript \
-e "source('./read_IBS.R')" \
-e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
-e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist
</pre>

=Frequently asked Questions=

== I run out of RAM running IBS or realSFS on my entire data set, what should I do? ==
You can split the data set up by chromosome (or contig), run the IBS or SFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There are a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Also, without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it is important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.
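
The combining step can be sketched with awk. The function names <code>sum_2dsfs</code> and <code>stats_from_counts</code> and the file names in the comments are hypothetical. The sketch assumes that each per-chromosome file holds a single line of nine numbers read as the categories 'A' to 'I', and that the three formulas match those in the paper and in read_realSFS.R; please check both assumptions against the parsing scripts.

```shell
# Sketch: add up per-chromosome 2D-SFS files element by element.
# Assumption: each input file contains one line with nine numbers (A..I).
sum_2dsfs() {
    awk '
        NF == 9 { for (k = 1; k <= 9; k++) total[k] += $k }
        END {
            for (k = 1; k <= 9; k++) printf "%s%.6f", (k > 1 ? " " : ""), total[k]
            printf "\n"
        }
    ' "$@"
}

# Sketch: compute the three statistics from one line of nine summed counts.
# The formulas are written as I understand them from the paper; there is no
# guard here for E = 0 or an empty denominator.
stats_from_counts() {
    awk '{
        A=$1; B=$2; C=$3; D=$4; E=$5; F=$6; G=$7; H=$8; I=$9
        r0  = (C + G) / E
        r1  = E / (B + D + F + H + C + G)
        kin = (E - 2 * (C + G)) / (B + D + F + H + 2 * E)
        printf "R0=%.4f R1=%.4f Kin=%.4f\n", r0, r1, kin
    }'
}

# Hypothetical usage, one 2dsfs file per chromosome for the same pair:
# sum_2dsfs chr1.pair.2dsfs chr2.pair.2dsfs | stats_from_counts
```

Summing the counts first and forming the ratios once is the point of the sketch: averaging per-chromosome ratios would not give the same result.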

== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into '''X''' blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.
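
To make the leave-one-block-out idea concrete, here is a simplified sketch for KING-robust kinship. It is an unweighted delete-one jackknife, not the weighted procedure cited above, so it is only reasonable when blocks are of similar size. The function name <code>jackknife_kin</code> and the input layout (one line per block, nine counts 'A' to 'I' per line) are assumptions made for this example, as is the kinship formula, which should be checked against the paper.

```shell
# Sketch: unweighted delete-one block jackknife for KING-robust kinship.
# Assumed input: one line per block (e.g. per chromosome), nine counts A..I.
jackknife_kin() {
    awk '
        # KING-robust kinship from an array of nine counts (index 1..9 = A..I)
        function kin(c) {
            return (c[5] - 2 * (c[3] + c[7])) / (c[2] + c[4] + c[6] + c[8] + 2 * c[5])
        }
        NF == 9 {
            n++
            for (k = 1; k <= 9; k++) { block[n, k] = $k; total[k] += $k }
        }
        END {
            if (n < 2) { print "need at least two blocks" > "/dev/stderr"; exit 1 }
            # estimate with each block left out in turn
            for (i = 1; i <= n; i++) {
                for (k = 1; k <= 9; k++) loo[k] = total[k] - block[i, k]
                theta[i] = kin(loo)
                mean += theta[i] / n
            }
            # jackknife standard error: sqrt((n-1)/n * sum of squared deviations)
            for (i = 1; i <= n; i++) ss += (theta[i] - mean) ^ 2
            se = sqrt((n - 1) / n * ss)
            printf "Kin=%.4f SE=%.4f\n", kin(total), se
        }
    ' "$@"
}

# Hypothetical usage: blocks.txt has one line of nine counts per chromosome
# jackknife_kin blocks.txt
```

The same pattern applies to R0 and R1 by swapping the function that turns nine counts into a statistic.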

== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once; otherwise you will need to split your data into pieces, analyze each piece, and combine the results afterwards.

== How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates? ==
Probably in multiple, complicated ways!

One thing to watch out for is regions of your reference genome that attract aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals. This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) is the strongest signal we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) the pair of individuals has the maximum possible allelic identity at the site. An upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.

If possible these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with a HWE test (see [https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13019 here] for a method of testing HWE on low depth data that is robust to population structure). Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.

One thing we found with the human data in the paper was that applying a "mappability" filter improved our estimates of R0, R1, and KING-robust kinship. A mappability filter is essentially a scan of the reference genome that reports where in the genome it is possible to uniquely align reads, and it needs to be calibrated to your read lengths. We used [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 GEM], but there are also [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237805 new programs] that claim to do the same.

== How will inbreeding affect the estimates produced here? ==
Inbreeding within one or both individuals of a pair can affect estimates of R0, R1, or KING-robust kinship, but there isn't a simple answer that covers all possibilities.

If only one of the individuals is inbred, the pair of individuals may appear less related than otherwise, as the pair of individuals may have an elevated number of alternate homozygous genotypes and a reduced number of shared heterozygous genotypes.

When considering inbreeding, it can be useful to compare the heterozygosities of a pair of individuals. This is possible from the output of realSFS or IBS, and is facilitated by the parsing scripts included above.

=Citation=
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954

==Bibtex==
<pre>
@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}
</pre>

IBSrelate

2019-09-24T12:03:59Z

Albrecht: /* How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates? */

This page contains information about the method '''IBSrelate''', a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954

= Calculating statistics from the output of IBS and realSFS =
'''IBS''' and '''realSFS''' are two methods implemented in ANGSD [http://www.popgen.dk/angsd/index.php/ANGSD] that can be used to estimate the allele sharing ''genotype distribution'' for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of '''IBS''' and '''realSFS''' and produce estimates of '''R0''', '''R1''' and '''KING-robust kinship'''.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R

= Application to the ANGSD example data =
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

== Setup ==
You will need installations of both [http://www.popgen.dk/angsd/index.php/ANGSD ANGSD] and [http://www.htslib.org/ samtools], as well as Rscript (part of [https://www.r-project.org/ R]).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

=== Set up shell variables ===
<pre>
# set paths to the analysis programs; these may need to be replaced with the paths of your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"
</pre>

=== Get the example data ===
The example data set has small bam files from ten individuals.
<pre>
# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam; do "$SAMTOOLS" index "$i"; done
</pre>

== realSFS method ==
=== Setup ===
<pre>
# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
</pre>

=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz
</pre>

=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, and specify a genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these is not evaluated here, but it should be examined as part of a general QC process for NGS data.

<pre>
# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
# parse out the sample name
base="$(basename -- $b)"
sample="${base%%.mapped.*}"
SAMPLES+=("$sample")
echo $sample
echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
$ANGSD -b ${s}.filelist.ind \
-anc hg19ancNoChr.fa.gz \
-minMapQ 30 -minQ 20 -GL 2 \
-doSaf 1 -doDepth 1 -doCounts 1 \
-out ${s}
done
</pre>

=== run realSFS on each pair of individuals ===
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
<pre>
for i in {0..9}; do
for j in {0..9}; do
if (( i < j)); then
sample1=${SAMPLES[i]}
sample2=${SAMPLES[j]}
$realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
fi
done
done
</pre>

=== Parse the results for a single pair of individuals ===
The command below shows how to use the read_realSFS() function in R to parse the 2dsfs file generated by realSFS. For further exploration, it will be easier to open an interactive R session, source read_realSFS.R, and use the function read_realSFS() to parse the file into a data frame that you can then explore, plot, or export.
<pre>
Rscript \
-e "source('./read_realSFS.R')" \
-e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
-e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
-e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
</pre>

== IBS Method ==

=== Setup ===
<pre>
# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist
</pre>

=== make a genotype likelihood file (glf) containing all individuals ===
See [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Output_genotype_likelihoods here] for a description of the file format produced by -doGlf 1.
<pre>
$ANGSD -b all.filelist \
-minMapQ 30 -minQ 20 -GL 2 \
-doGlf 1 \
-out example
</pre>

=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The documentation for misc/ibs is [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].

<pre>
$IBS -glf example.glf.gz \
-model 0 \
-nInd 10 -allpairs 1 \
-outFileName results_IBS/ibs.model0.results
</pre>

=== Examine the results ===
<pre>
Rscript \
-e "source('./read_IBS.R')" \
-e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
-e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist
</pre>

=Frequently asked Questions=

== I run out of RAM running IBS or realSFS on my entire data set, what should I do? ==
You can split the data set up by chromosome (or contig), run the IBS or SFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There are a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Also, without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it is important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.

== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into '''X''' blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.

== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once; otherwise you will need to split your data into pieces, analyze each piece, and combine the results afterwards.

== How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates? ==
Probably in multiple, complicated ways!

One thing to watch out for is regions of your reference genome that attract aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals. This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) is the strongest signal we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) the pair of individuals has the maximum possible allelic identity at the site. An upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.

If possible these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with a HWE test (see [https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13019 here] for a method of testing HWE on low depth data that is robust to population structure). Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.

One thing we found with the human data in the paper was that applying a "mappability" filter improved our estimates of R0, R1, and KING-robust kinship. A mappability filter is essentially a scan of the reference genome that reports where in the genome it is possible to uniquely align reads, and it needs to be calibrated to your read lengths. We used [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 GEM], but there are also [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237805 new programs] that claim to do the same.

== How will inbreeding affect the estimates produced here? ==

=Citation=
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954

==Bibtex==
<pre>
@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}
</pre>

IBSrelate

2019-09-24T12:03:48Z

Albrecht: /* How can reference genome assembly errors affect R0, R1, or KING-robust kinship estimates? */

This page contains information about the method '''IBSrelate''', a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954

= Calculating statistics from the output of IBS and realSFS =
'''IBS''' and '''realSFS''' are two methods implemented in ANGSD [http://www.popgen.dk/angsd/index.php/ANGSD] that can be used to estimate the allele sharing ''genotype distribution'' for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of '''IBS''' and '''realSFS''' and produce estimates of '''R0''', '''R1''' and '''KING-robust kinship'''.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R

= Application to the ANGSD example data =
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

== Setup ==
You will need installations of both [http://www.popgen.dk/angsd/index.php/ANGSD ANGSD] and [http://www.htslib.org/ samtools], as well as Rscript (part of [https://www.r-project.org/ R]).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

=== Set up shell variables ===
<pre>
# set paths to the analysis programs; these may need to be replaced with the paths of your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"
</pre>

=== Get the example data ===
The example data set has small bam files from ten individuals.
<pre>
# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam; do "$SAMTOOLS" index "$i"; done
</pre>

== realSFS method ==
=== Setup ===
<pre>
# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
</pre>

=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz
</pre>

=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, and specify a genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these is not evaluated here, but it should be examined as part of a general QC process for NGS data.

<pre>
# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
# parse out the sample name
base="$(basename -- $b)"
sample="${base%%.mapped.*}"
SAMPLES+=("$sample")
echo $sample
echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
$ANGSD -b ${s}.filelist.ind \
-anc hg19ancNoChr.fa.gz \
-minMapQ 30 -minQ 20 -GL 2 \
-doSaf 1 -doDepth 1 -doCounts 1 \
-out ${s}
done
</pre>

=== run realSFS on each pair of individuals ===
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
<pre>
for i in {0..9}; do
for j in {0..9}; do
if (( i < j)); then
sample1=${SAMPLES[i]}
sample2=${SAMPLES[j]}
$realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
fi
done
done
</pre>
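The loop above hard-codes the indices 0 to 9. As a minimal sketch of the same pairing logic for any number of samples, the hypothetical helper <code>list_pairs</code> below (not part of ANGSD) prints each unordered pair of its arguments exactly once. It only lists the pairs; the realSFS call shown above would still need to be run for each line of its output.

```shell
# list_pairs: print each unordered pair of the arguments exactly once,
# one pair per line, separated by a space (hypothetical helper)
list_pairs() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        # pair the removed name with every name that is still left
        for other in "$@"; do
            printf '%s %s\n' "$first" "$other"
        done
    done
}

# example with three made-up sample names
list_pairs sampleA sampleB sampleC
```

In the example data set this could be called as <code>list_pairs "${SAMPLES[@]}"</code>, which for 10 individuals gives 45 pairs.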

=== Parse the results for a single pair of individuals ===
The command below shows how to use the read_realSFS() function in R to parse the 2dsfs file generated by realSFS. For exploring the results, it will be easier to open an interactive R session, source read_realSFS.R, and use read_realSFS() to parse the file into a data frame that you can then explore, plot, or export.
<pre>
Rscript \
-e "source('./read_realSFS.R')" \
-e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
-e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
-e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
</pre>

== IBS Method ==

=== Setup ===
<pre>
# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist
</pre>

=== make a genotype likelihood file (glf) containing all individuals ===
See [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Output_genotype_likelihoods here] for a description of the file format produced by -doGlf 1.
<pre>
$ANGSD -b all.filelist \
-minMapQ 30 -minQ 20 -GL 2 \
-doGlf 1 \
-out example
</pre>

=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The documentation for misc/ibs is [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].

<pre>
$IBS -glf example.glf.gz \
-model 0 \
-nInd 10 -allpairs 1 \
-outFileName results_IBS/ibs.model0.results
</pre>

=== Examine the results ===
<pre>
Rscript \
-e "source('./read_IBS.R')" \
-e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
-e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist
</pre>
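Because the individuals are zero-indexed but line numbers in a file are usually counted from 1, it is easy to be off by one when looking up a sample by hand. The sketch below shows one way to do the lookup; the function name <code>ind_to_bam</code> is hypothetical, and it assumes the filelist has one bam path per line in the order given to ANGSD.

```shell
# ind_to_bam: print the line of a filelist that corresponds to a
# zero-based individual index (hypothetical helper)
ind_to_bam() {
    filelist=$1
    ind=$2
    # sed counts lines from 1, so shift the zero-based index by one
    sed -n "$((ind + 1))p" "$filelist"
}
```

For example, <code>ind_to_bam all.filelist 0</code> prints the bam file of the first individual.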

=Frequently asked Questions=

== I run out of RAM running IBS or realSFS on my entire data set, what should I do? ==
You can split the data set up by chromosome (or contig), run the IBS or realSFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There are a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it is important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.
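The summing step can be sketched as follows. Both function names are hypothetical. The sketch assumes that each per-chromosome realSFS output file holds a single line whose first nine values are 'A' to 'I', in the order read by read_realSFS.R. The formulas for R0, R1 and KING-robust kinship are written as I understand them from the paper, with E the shared heterozygous sites, C + G the IBS0 sites, and B + D + F + H the IBS1 sites; treat read_realSFS.R as the reference. Note that the ratios are computed only after summing, not per chromosome.

```shell
# sum_2dsfs: add up the first nine columns ('A' to 'I') of one or more
# single-line 2dsfs files and print the nine totals on one line
sum_2dsfs() {
    awk '{ for (k = 1; k <= 9; k++) total[k] += $k }
         END {
             for (k = 1; k <= 9; k++)
                 printf "%.6f%s", total[k], (k < 9 ? " " : "\n")
         }' "$@"
}

# derived_stats: read nine values 'A' to 'I' from standard input and
# print R0, R1 and KING-robust kinship (formulas assumed, see lead-in);
# it fails with a division by zero if E is 0
derived_stats() {
    awk '{
        ibs0 = $3 + $7                # C + G
        ibs1 = $2 + $4 + $6 + $8      # B + D + F + H
        hethet = $5                   # E
        printf "R0=%.4f R1=%.4f Kin=%.4f\n",
            ibs0 / hethet,
            hethet / (ibs0 + ibs1),
            (hethet - 2 * ibs0) / (ibs1 + 2 * hethet)
    }'
}
```

With hypothetical per-chromosome files such as <code>pair.chr1.2dsfs</code> and <code>pair.chr2.2dsfs</code>, the two steps could be chained as <code>sum_2dsfs pair.chr*.2dsfs | derived_stats</code>.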

== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into '''X''' blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.
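As a rough sketch of the procedure described above, the hypothetical function <code>jackknife_kin</code> below computes KING-robust kinship with a delete-one, weighted block jackknife standard error. It is my reading of the cited method, not code from the paper. It assumes one input line per block (for example per chromosome) holding the nine values 'A' to 'I', uses the sum of the nine values as the block size, and uses the kinship formula (E - 2(C+G)) / (B+D+F+H + 2E). It needs at least two blocks.

```shell
# jackknife_kin: read one line per block with the nine values A to I and
# print the kinship estimate from all blocks plus a jackknife SE
jackknife_kin() {
    awk '
    function kin(v,    ibs0, ibs1) {
        ibs0 = v[3] + v[7]
        ibs1 = v[2] + v[4] + v[6] + v[8]
        return (v[5] - 2 * ibs0) / (ibs1 + 2 * v[5])
    }
    {
        g++
        for (k = 1; k <= 9; k++) {
            block[g, k] = $k
            total[k] += $k
            m[g] += $k          # block size = number of sites in block
        }
        n += m[g]
    }
    END {
        theta = kin(total)
        # leave-one-block-out estimates and the jackknife estimate
        jack = g * theta
        for (j = 1; j <= g; j++) {
            for (k = 1; k <= 9; k++) rest[k] = total[k] - block[j, k]
            loo[j] = kin(rest)
            jack -= (1 - m[j] / n) * loo[j]
        }
        # variance from the pseudo-values, weighted by block size
        for (j = 1; j <= g; j++) {
            h = n / m[j]
            pseudo = h * theta - (h - 1) * loo[j]
            var += (pseudo - jack) ^ 2 / (h - 1)
        }
        var /= g
        printf "Kin=%.4f SE=%.4f\n", theta, sqrt(var)
    }'
}
```

The same pattern applies to R0 and R1 by swapping the ratio inside <code>kin()</code>. The caveat above still holds: the standard error is only meaningful if the blocks are close to independent.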

== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once, otherwise you will need to split your data into pieces, analyze each piece, and combine them afterwards.

== How can the quality of the reference genome assembly affect R0, R1, or KING-robust kinship estimates? ==
Probably in multiple, complicated ways!

One thing to watch out for is regions of your reference genome that attract aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals. This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) is the strongest signal we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) the pair of individuals has the maximum possible allelic identity at the site. An upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.

If possible, these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with an HWE test (see [https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13019 here] for a method of testing HWE on low depth data that is robust to population structure). Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.

One thing we found with the human data in the paper was that applying a "mappability" filter improved our estimates of R0, R1, and KING-robust kinship. This is essentially a scan of the reference genome that reports where in the genome it is possible to uniquely align reads, and it needs to be calibrated to your read lengths. We used [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 GEM], but there are also [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6237805 new programs] that claim to do the same.

== How will inbreeding affect the estimates produced here? ==

=Citation=
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954

==Bibtex==
<pre>
@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}
</pre>

IBSrelate

2019-09-24T11:54:07Z

Albrecht: /* How can reference genome assembly errors affect R0, R1, or KING-robust kinship estimates? */

This page contains information about the method '''IBSrelate''', a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954

= Calculating statistics from the output of IBS and realSFS =
'''IBS''' and '''realSFS''' are two methods implemented in ANGSD [http://www.popgen.dk/angsd/index.php/ANGSD] that can be used to estimate the allele sharing ''genotype distribution'' for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of '''IBS''' and '''realSFS''' and produce estimates of '''R0''', '''R1''' and '''KING-robust kinship'''.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R

= Application to the ANGSD example data =
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

== Setup ==
You will need installations of both [http://www.popgen.dk/angsd/index.php/ANGSD ANGSD] and [http://www.htslib.org/ samtools], as well as Rscript (part of [https://www.r-project.org/ R]).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

=== Set up shell variables ===
<pre>
# set paths to the analysis programs; these may need to be replaced with the paths of your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"
</pre>

=== Get the example data ===
The example data set has small bam files from ten individuals.
<pre>
# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam; do $SAMTOOLS index "$i"; done
</pre>

== realSFS method ==
=== Setup ===
<pre>
# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
</pre>

=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz
</pre>

=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, and specify a genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these is not evaluated here, but it should be examined as part of a general QC process for NGS data.

<pre>
# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
# parse out the sample name
base="$(basename -- $b)"
sample="${base%%.mapped.*}"
SAMPLES+=("$sample")
echo $sample
echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
$ANGSD -b ${s}.filelist.ind \
-anc hg19ancNoChr.fa.gz \
-minMapQ 30 -minQ 20 -GL 2 \
-doSaf 1 -doDepth 1 -doCounts 1 \
-out ${s}
done
</pre>

=== run realSFS on each pair of individuals ===
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
<pre>
for i in {0..9}; do
for j in {0..9}; do
if (( i < j)); then
sample1=${SAMPLES[i]}
sample2=${SAMPLES[j]}
$realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
fi
done
done
</pre>

=== Parse the results for a single pair of individuals ===
The command below shows how to use the read_realSFS() function in R to parse the 2dsfs file generated by realSFS. For exploring the results, it will be easier to open an interactive R session, source read_realSFS.R, and use read_realSFS() to parse the file into a data frame that you can then explore, plot, or export.
<pre>
Rscript \
-e "source('./read_realSFS.R')" \
-e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
-e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
-e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
</pre>

== IBS Method ==

=== Setup ===
<pre>
# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist
</pre>

=== make a genotype likelihood file (glf) containing all individuals ===
See [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Output_genotype_likelihoods here] for a description of the file format produced by -doGlf 1.
<pre>
$ANGSD -b all.filelist \
-minMapQ 30 -minQ 20 -GL 2 \
-doGlf 1 \
-out example
</pre>

=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The documentation for misc/ibs is [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].

<pre>
$IBS -glf example.glf.gz \
-model 0 \
-nInd 10 -allpairs 1 \
-outFileName results_IBS/ibs.model0.results
</pre>

=== Examine the results ===
<pre>
Rscript \
-e "source('./read_IBS.R')" \
-e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
-e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist
</pre>

=Frequently asked Questions=

== I run out of RAM running IBS or realSFS on my entire data set, what should I do? ==
You can split the data set up by chromosome (or contig), run the IBS or realSFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There are a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it is important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.

== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into '''X''' blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.

== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once, otherwise you will need to split your data into pieces, analyze each piece, and combine them afterwards.

== How can reference genome assembly errors affect R0, R1, or KING-robust kinship estimates? ==
Probably in multiple, complicated ways!

One thing to watch out for is regions of your reference genome that attract aligned reads originating from multiple, distinct parts of the genome. These regions can exhibit elevated heterozygosity across all individuals. This is particularly problematic for our analyses, as shared heterozygosity between a pair of individuals (sites where both individuals are heterozygotes) is the strongest signal we have that the individuals are related. This is because shared heterozygosity shows two important things about a site: 1) the site is variable (segregating) in the population, and 2) the pair of individuals has the maximum possible allelic identity at the site. An upward bias in shared heterozygosity will result in KING-robust kinship estimates biased toward 0.5, elevated values of R1, and lower values of R0.

If possible, these regions of the reference genome should be excluded from the analysis. With access to multiple individuals, it may be possible to identify them by estimating heterozygosity across all individuals, or with an HWE test (see [https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13019 here] for a method of testing HWE on low depth data that is robust to population structure). Without access to multiple individuals, it will be difficult to resolve these issues except through a better reference genome.

== How will inbreeding affect the estimates produced here? ==

=Citation=
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954

==Bibtex==
<pre>
@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}
</pre>

IBSrelate

2019-09-24T11:32:38Z

Albrecht: /* Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? */

This page contains information about the method '''IBSrelate''', a method to identify pairs of related individuals without requiring population allele frequencies.

On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).

For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954

= Calculating statistics from the output of IBS and realSFS =
'''IBS''' and '''realSFS''' are two methods implemented in ANGSD [http://www.popgen.dk/angsd/index.php/ANGSD] that can be used to estimate the allele sharing ''genotype distribution'' for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of '''IBS''' and '''realSFS''' and produce estimates of '''R0''', '''R1''' and '''KING-robust kinship'''.

https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R

https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R

= Application to the ANGSD example data =
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .

They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .

== Setup ==
You will need installations of both [http://www.popgen.dk/angsd/index.php/ANGSD ANGSD] and [http://www.htslib.org/ samtools], as well as Rscript (part of [https://www.r-project.org/ R]).

The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.

=== Set up shell variables ===
<pre>
# set paths to the analysis programs; these may need to be replaced with the paths of your local installation(s)
ANGSD="$HOME/programs/angsd/angsd"
realSFS="$HOME/programs/angsd/misc/realSFS"
IBS="$HOME/programs/angsd/misc/ibs"
SAMTOOLS="samtools"
</pre>

=== Get the example data ===
The example data set has small bam files from ten individuals.
<pre>
# download the example data
wget http://popgen.dk/software/download/angsd/bams.tar.gz

# unzip/untar and index the bam files
tar xf bams.tar.gz
for i in bams/*.bam; do $SAMTOOLS index "$i"; done
</pre>

== realSFS method ==
=== Setup ===
<pre>
# make a directory for the results
mkdir results_realsfs

# get the R script to parse the realSFS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
</pre>

=== Specify an allele at each site ===
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
<pre>
# download and index the ancestral state fasta file
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
$SAMTOOLS faidx hg19ancNoChr.fa.gz
</pre>

=== Generate a saf (site allele frequency likelihood) file for each individual ===
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, and specify a genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these is not evaluated here, but it should be examined as part of a general QC process for NGS data.

<pre>
# make a separate bam filelist for each individual
# also create a SAMPLES array for use below
BAMS=./bams/*.bam
SAMPLES=()
for b in $BAMS; do
# parse out the sample name
base="$(basename -- $b)"
sample="${base%%.mapped.*}"
SAMPLES+=("$sample")
echo $sample
echo $b > ${sample}.filelist.ind
done

# run doSAF on each individual
for s in "${SAMPLES[@]}"; do
$ANGSD -b ${s}.filelist.ind \
-anc hg19ancNoChr.fa.gz \
-minMapQ 30 -minQ 20 -GL 2 \
-doSaf 1 -doDepth 1 -doCounts 1 \
-out ${s}
done
</pre>

=== run realSFS on each pair of individuals ===
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
<pre>
for i in {0..9}; do
for j in {0..9}; do
if (( i < j)); then
sample1=${SAMPLES[i]}
sample2=${SAMPLES[j]}
$realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs
fi
done
done
</pre>

=== Parse the results for a single pair of individuals ===
The command below shows how to use the read_realSFS() function in R to parse the 2dsfs file generated by realSFS. For exploring the results, it will be easier to open an interactive R session, source read_realSFS.R, and use read_realSFS() to parse the file into a data frame that you can then explore, plot, or export.
<pre>
Rscript \
-e "source('./read_realSFS.R')" \
-e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \
-e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \
-e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
</pre>

== IBS Method ==

=== Setup ===
<pre>
# make a directory for the results
mkdir results_IBS

# get the R script to parse the IBS output
wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R

## make a file with paths to the bam(s) for all individuals
ls bams/*.bam > all.filelist
</pre>

=== make a genotype likelihood file (glf) containing all individuals ===
See [http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Output_genotype_likelihoods here] for a description of the file format produced by -doGlf 1.
<pre>
$ANGSD -b all.filelist \
-minMapQ 30 -minQ 20 -GL 2 \
-doGlf 1 \
-out example
</pre>

=== run IBS, this will analyse each pair of individuals ===
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual).
The documentation for misc/ibs is [http://www.popgen.dk/angsd/index.php/Genotype_Distribution here].

<pre>
$IBS -glf example.glf.gz \
-model 0 \
-nInd 10 -allpairs 1 \
-outFileName results_IBS/ibs.model0.results
</pre>

=== Examine the results ===
<pre>
Rscript \
-e "source('./read_IBS.R')" \
-e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \
-e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])"

# the IBS method in ANGSD indexes individuals as they appear in the filelist
# (zero-indexed)
cat all.filelist
</pre>

=Frequently asked Questions=

== I run out of RAM running IBS or realSFS on my entire data set, what should I do? ==
You can split the data set up by chromosome (or contig), run the IBS or realSFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.

There are a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it is important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.

== How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates? ==
Confidence intervals around R0, R1, and KING-robust kinship can be generated with a weighted block jackknife procedure. This involves dividing your data set into '''X''' blocks that can be of unequal sizes. Confidence intervals are then generated by leaving out each block and computing R0, R1, and KING-robust kinship on these '''X''' distinct sets [https://link.springer.com/article/10.1023/A:1008800423698 method citation]. This is a similar approach to generating Z-scores for D-statistics[https://www.genetics.org/content/192/3/1065.short] that are informative about admixture.

One important difference here is that this procedure assumes that the observations within each of the '''X''' blocks are independent from the observations in all the other pieces. For relatives, this implies that identity-by-descent (IBD) tracts should not extend into multiple blocks, as blocks within the same IBD tract will not be independent. One easy solution if your reference genome contains chromosomes is to assign each chromosome to a distinct block. However, if your reference genome only has smaller contigs, it will be difficult to interpret the confidence intervals produced by a block jackknife procedure like this as observations may not be fully independent. This issue is also discussed in the paper.

== Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes? ==
Yes, point estimates of R0, R1, or KING-robust kinship do not require a reference genome that is assembled into chromosomes. If you have enough RAM, you can use all of your data at once, otherwise you will need to split your data into pieces, analyze each piece, and combine them afterwards.

== How can reference genome assembly errors affect R0, R1, or KING-robust kinship estimates? ==

== How will inbreeding affect the estimates produced here? ==

=Citation=
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954

==Bibtex==
<pre>
@article{doi:10.1111/mec.14954,
author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida},
title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data},
journal = {Molecular Ecology},
volume = {0},
number = {ja},
pages = {},
doi = {10.1111/mec.14954},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954},
}
</pre>

IBSrelate

2019-09-24T11:29:39Z