This page contains information about EMU (EM-PCA for Ultra-low Coverage Sequencing Data). EMU infers population structure in the presence of missingness and works for both haploid, psuedo-haploid and diploid genotype datasets. Due to EMUs iterative nature, it is able to infer population structure even for datasets of ultra-low coverage sequencing data with very high missingness rates in addition to being able to handle non-random missingness patterns where other existing methods fail. We use a procedure of low-rank approximations based on randomized PCA to iteratively update population structure in a very efficient manner.

EMU is written in Python and Cython and is freely available on Github. We have also implemented a very memory-efficient variant of EMU (EMU-mem) for large-scale datasets that uses the 2-bit data structures of PLINK binary file formats.

Download

The program can be downloaded from Github: https://github.com/Rosemeis/emu

git clone https://github.com/Rosemeis/emu.git

See github for more information regarding installation. Server-side usage is recommended.

Quick start

# See all options in EMU
python emu.py -h

# Infer population structure using 2 eigenvectors and 64 threads from binary PLINK files (.bed, .bim, .fam)
python emu.py -plink plink_prefix -e 2 -t 64 -accel -o plink_emu

# Or directly from NumPy array input
python emu.py -npy matrix.npy -e 2 -t 64 -accel -o npy_emu

# Use EMU-mem variant
python emu_mem.py -plink plink_prefix -e 2 -t 64 -accel

Input

EMU use either binary PLINK files as input (RECOMMENDED!) or saved NumPy genotype matrices in 8-bit format (numpy.int8). EMU-mem will only accept PLINK files as input due to the 2-bit data structures. If NumPy format should be preferred, you can use the script provided on Github for conversion (convertMat.py).

Using EMU

We highly recommend to use EM acceleration at all times (-accel). You can save factor matrices (-indf_save) from a run to use as starting point in a new run (-w, -s, -u). Due to convenience we have also implemented the PC-based selection scan of Galinsky et al. 2016 (-selection). MAF filtering is possible but it is recommended (and ASSUMED!) to do beforehand.

Options

-plink [Prefix for binary PLINK files]

Path to PLINK files using their prefix (.bed, .bim, .fam).

-npy [Numpy.int8 matrix format]

Path to NumPy matrix (.npy).

-e [int]

Number of eigenvectors to use in optimization.

-k [int]

Number of eigenvectors to output if user wants different than -e.

-m [int]

Maximum number of iterations (Default: 100).

-m_tole [float]

Tolerance for covergence of iterative procedure (Default: 5e-7).

-t [int]

Number of threads to use (Default: 1).

-maf [float]

Minimum minor allele frequency threshold (Default: 0.00).

-selection

Perform genome-wide PC-based selection scan (Galinsky et al. 2016).

-maf_save

Save the estimated minor allele frequencies.

-bool_save

Save boolean vector of filtered sites based on MAF.

-indf_save

Save estimated factor matrices (W, S, U).

-index [file]

Provide index of individuals for guiding initialization (np.int8 format).

-svd [string]

Select which low-rank SVD method to use, halko/arpack (Default: 'halko').

-svd_power [int]

Number of power iterations to use in low-rank SVD (Default: 3).

-w [file]

Provide starting point, left singular matrix (.w.npy).

-s [file]

Provide starting point, singular values (.s.npy).

-u [file]

Provide starting point, right singular matrix (.u.npy).

-accel

Use EM acceleration (Highly recommended!).

-o [string]

Prefix for all output files (Default: 'emu').

-cost

Output Frobenius each iteration (DEBUG).

-cost_step

Use acceleration based on Frobenius (DEBUG).

Options in EMU-mem

-maf, -bool_save, -svd, -cost, -cost_step functions are not available for EMU-mem. MAF filtering has to be performed beforehand, which is easily done in PLINK (--maf 0.05).

Citation

TBA

EMU

Contents