My research interests are statistical and computational methods for the analysis of genomic data, including methods for multi-locus association studies, methods for detecting and correcting for population stratification, detecting selection on disease susceptibility genes, locus-dependent methods for modelling identity by descent, and various topics in the analysis of second-generation sequencing data. Below are some of the studies I have done, some I am currently involved in, and some that I am planning to do.

Methods for population genetics

Generalising methods for NGS and admixed data

Due to the uncertainty in next-generation sequencing (NGS) data, many of the standard methods in population genetics are not directly applicable to NGS data. I am involved in projects developing methods appropriate for NGS data:

Identity by descent

Genetic relatedness between pairs of individuals plays an important role in several fields of genetic research, including forensic and medical genetics. Most genome-wide association studies are based on the assumption that all individuals analyzed are unrelated, and if related individuals are not removed or controlled for, they can lead to highly inflated false positive rates. Relatedness between a pair of individuals is usually described using the concept of identity-by-descent (IBD), which is genetic identity due to recent common ancestry. More specifically, relatedness between two non-inbred individuals is often described by the fractions of the genome in which the two individuals share 0, 1 or 2 alleles IBD. There are numerous estimators of relatedness, based on both moment estimators \citep{Ritland96,Lynch99,Wang02,plink} and maximum likelihood (ML) estimators \citep{Thompson75,Milligan03}. Additionally, there are several hidden Markov model based methods that estimate IBD sharing locally along the genome (e.g. \citealp{Albrechtsen09,Browning10,Moltke11}), which, if applied to the whole genome, also give rise to relatedness estimates.
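As an illustration of the ML approach, the following minimal sketch estimates the three IBD-sharing fractions for one pair with an EM algorithm. It assumes a homogeneous population, independent biallelic loci, known allele frequencies, error-free genotypes coded as 0, 1 or 2 copies of one allele, and non-inbred individuals. The function names \texttt{emission} and \texttt{estimate\_k} are hypothetical and do not refer to any of the cited software.

```python
import numpy as np

def emission(g1, g2, p):
    """P(g1, g2 | z alleles IBD) for z = 0, 1, 2 at one biallelic site.

    g1, g2 are counts (0, 1, 2) of the allele with frequency p.
    """
    f = np.array([1.0 - p, p])                      # f[a] = frequency of allele a
    hwe = np.array([f[0] ** 2, 2 * f[0] * f[1], f[1] ** 2])
    e0 = hwe[g1] * hwe[g2]                          # no allele shared IBD
    # one shared allele s, plus one independent allele in each individual
    e1 = sum(f[s] * f[a] * f[b]
             for s in (0, 1) for a in (0, 1) for b in (0, 1)
             if s + a == g1 and s + b == g2)
    e2 = hwe[g1] if g1 == g2 else 0.0               # both alleles shared IBD
    return np.array([e0, e1, e2])

def estimate_k(geno1, geno2, freqs, n_iter=200):
    """EM estimate of (k0, k1, k2) treating the IBD state per site as latent."""
    E = np.array([emission(a, b, p) for a, b, p in zip(geno1, geno2, freqs)])
    k = np.full(3, 1.0 / 3.0)                       # uniform starting point
    for _ in range(n_iter):
        post = E * k                                # E-step: unnormalised posteriors
        post /= post.sum(axis=1, keepdims=True)
        k = post.mean(axis=0)                       # M-step: average posterior
    return k
```

Called as \texttt{estimate\_k(geno1, geno2, freqs)}, the function returns the estimated fractions in the order $(k_0, k_1, k_2)$. A plain EM is used here for brevity; it makes no attempt at acceleration.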

Relatedness in admixed individuals

All of the above-mentioned methods are based on the assumption that the analyzed individuals are from a homogeneous population, and it has been shown that violating this assumption can lead to marked misestimation of relatedness \citep{Anderson07,king,Thornton12}. Recently, \citet{Thornton12} suggested using a method-of-moments estimator called REAP to account for admixture. We suggest using a more direct maximum likelihood method, called RelateAdmix, to solve this problem, and suggest an accelerated EM algorithm \citep{Varadhan08} to make it computationally feasible. The goal is to infer relatedness from the genotypes and admixture proportions, assuming we know the allele frequencies in the $K$ ancestral populations. The likelihood can be written with the IBD state and the ancestral state as latent variables.
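As a sketch of this structure, and not necessarily the exact form used in RelateAdmix, the likelihood for one pair of individuals could be written with both latent variables summed out at each locus. All symbols are notation assumed here for illustration: $k = (k_0, k_1, k_2)$ are the IBD-sharing fractions, $Z_j$ is the IBD state and $A_j$ the ancestral states of the alleles at locus $j$, $G^1_j$ and $G^2_j$ are the two genotypes, $q^1$ and $q^2$ are the admixture proportions of the two individuals, $f_j$ are the allele frequencies in the $K$ ancestral populations, and the $M$ loci are treated as independent.

```latex
\begin{equation}
L(k) = \prod_{j=1}^{M} \sum_{z=0}^{2} k_z
       \sum_{a} P\left(G^1_j, G^2_j \mid Z_j = z, A_j = a, f_j\right)
                P\left(A_j = a \mid Z_j = z, q^1, q^2\right)
\end{equation}
```

Under this form the inner sum over ancestral states can be computed once per locus and IBD state, so only the mixture weights $k_z$ need to be updated in each EM iteration.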

Mendelian disorders

I have used relatedness to detect Mendelian disorders, or to identify the founder IBD tract in which the mutation resides, in several projects \citep{Albrechtsen09,Hansen10,Hansen09,Hansen09b,Dad10}. In these projects, and in the ones I am currently working on, relatedness is used to infer the location of the disease locus. This is extremely powerful when a founder mutation causes the same disease in seemingly unrelated individuals.

\end{description}

NGS data analysis

ANGSD

During the last five years I have worked with NGS data on many projects. Many of the methods developed for these projects \citep{Albrechtsen13,Nielsen12,Skotte12,Rasmussen11,Kim11,Li10,Orlando13} I have implemented in collaboration with Thorfinn Korneliussen and Rasmus Nielsen. These methods, together with many others, have resulted in one of the largest and most flexible software packages for large-scale NGS analysis of nuclear genomes. The software, ANGSD (Analysis of Next Generation Sequencing Data), can be found at \url{www.popgen.dk/angsd} and is continually being developed.

Type-specific errors in genotype likelihoods

One of the core components of analyzing NGS data is the genotype likelihood \citep{nielsen11}. Genotype likelihoods are often constructed such that all relevant information about the observed sequencing data at a site is summarised into 10 values, one for each possible diploid genotype. A major limitation of current genotype likelihood implementations is the lack of proper modelling of type-specific errors. The error structure is usually modelled only as a function of the base quality score, the observed base, the machine cycle and the strand \citep{soapsnp,maq,samtools,GATK}. This simple treatment of the error structure can be very problematic, especially for ancient DNA, where certain error types such as G$\rightarrow$A are many-fold more frequent than others such as G$\rightarrow$T. Additionally, the magnitude of the error rates depends on the distance to both the 3' and the 5' end of the read. Therefore, modelling the errors based only on machine cycle and base type is not enough. The number of parameters needed to model the type-specific error for every combination of strand, base quality, and distance to the 3' and 5' ends is very large (many millions of combinations), and not all combinations will be observed. Therefore, I suggest modelling this as a multinomial logistic regression problem with the observed base as the response and the reference base, strand, base quality, 3' distance and 5' distance as predictors. A smoothing function such as a polynomial can then be used to obtain better estimates for the combinations with few or no observations. This framework directly models the type-specific errors and is fast and efficient enough to be optimised on an entire genome. With better modelling of the type-specific errors we will be able to perform analyses based on genotype likelihoods even on problematic samples such as ancient DNA samples.
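To make the 10 values concrete, the sketch below computes genotype log-likelihoods at a single site under the simplest kind of error model: each read base is wrong with the probability implied by its Phred quality score, and an error is equally likely to produce any of the three other bases. That equal split is exactly the assumption that type-specific modelling would replace. The function name \texttt{genotype\_log\_likelihoods} is hypothetical, reads are assumed independent, and quality scores are assumed to be greater than zero.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
GENOTYPES = list(combinations_with_replacement(BASES, 2))

def genotype_log_likelihoods(bases, quals):
    """Return {genotype: log P(reads | genotype)} for one site.

    bases: observed read bases, e.g. "AAGA"
    quals: Phred base quality scores (> 0), one per base
    """
    logliks = {}
    for a1, a2 in GENOTYPES:
        ll = 0.0
        for b, q in zip(bases, quals):
            e = 10 ** (-q / 10)                    # error probability from Phred score
            p1 = 1 - e if b == a1 else e / 3       # read sampled from allele a1
            p2 = 1 - e if b == a2 else e / 3       # read sampled from allele a2
            ll += math.log(0.5 * p1 + 0.5 * p2)    # each allele sampled with prob 1/2
        logliks[a1 + a2] = ll
    return logliks
```

In a type-specific model the term \texttt{e / 3} would be replaced by an error probability that depends on the pair of true and observed bases and on covariates such as strand and position in the read.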

Ancient DNA

There are several factors that make ancient DNA an interesting area to work in. The challenges in analyzing ancient samples are often fairly unique, and no existing tool sets are applicable. Therefore, it is necessary to develop statistical methods for each project. During the last couple of years I have been involved in several major projects \citep{Rasmussen10,Rasmussen11,Orlando13}. Recent ones include the ancient Upper Paleolithic Siberian Mal'ta (lead author: Maanasa Raghavan) and the Paleo-Indian Clovis culture (lead author: Morten Rasmussen).

Methods for association studies

Population specific effects

The association study in the Greenlandic population (see Glucose regulation in Greenland below) is not a standard GWAS. The individuals are highly admixed (see figure \ref{fig:man}, right) and, due to the small population size, highly related. We model the correlation between each pair of individuals using a mixed model with a correlation structure that is proportional to the identity-by-state (IBS) matrix \citep{Zhou12} estimated from the SNP chip data. The mixed model for $n$ individuals is

where $X$ is the fixed effects and $G$ is the $n \times n$ IBS matrix. Both $\rho$ and $\tau$ are scalars. The IBS matrix contains both the relatedness (kinship) information and the population structure information. This solves the problem of performing GWAS in small isolated populations. When performing GWAS in a small admixed population we are often only interested in one of the ancestries. In the case of the Greenlandic population we are only interested in the Inuit part and not the European part. The above model assumes that the different ancestral populations have the same effect sizes. This is unlikely to hold, in particular because of the very different linkage disequilibrium patterns. Therefore, it would be desirable to model the relatedness (kinship) and the ancestry (population structure) separately. To do that, the kinship matrix must be known. I already have promising results on how to do that (see Relatedness in admixed individuals above). I want to use a model similar to the one above, but use only the kinship matrix for the random effect structure and then model the ancestry directly as a latent variable in the fixed effect part, as we have done previously \citep{Skotte12}. The ancestry proportions are needed but can easily be inferred using, for example, the method of \citet{Alexander09}, and the optimization will be done in a similar fashion to \citet{Kang08}. This will allow for different effect sizes in the Inuit part and the European part of the population, which should increase the power to detect an association and improve the interpretability of the model.
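For orientation, one common way to write a mixed model using the symbols defined in this subsection is sketched below. This is an assumed parameterisation and not necessarily the one used in the study: $y$ (the phenotype vector), $\beta$ (the fixed-effect coefficients) and $I_n$ (the $n \times n$ identity matrix) are notation introduced here for illustration.

```latex
\begin{equation}
y \sim N\left(X\beta,\; \rho G + \tau I_n\right)
\end{equation}
```

In this form $\rho$ scales the part of the covariance that follows the IBS matrix and $\tau$ scales the independent residual variance. The extension proposed in this subsection would replace $G$ with a kinship matrix and move the ancestry into the fixed-effect part.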

Method for modelling interactions

Most common diseases are multifactorial and influenced by several factors that can be of both genetic and environmental origin. Even though conditions such as diabetes harbour strong genetic components, there is no single single-nucleotide polymorphism (SNP), or small set of SNPs, that explains most of the genetic variance for these disorders. It is hypothesized that much of the genetic variation may be caused by the interaction (epistasis) of multiple SNPs and by interaction with environmental conditions. One method we are working on is a Bayesian method using Markov chain Monte Carlo (MCMC) to overcome some of the problems described. This method explores sets of effects (risk sets) which increase the risk, or the phenotypic value, for individuals who fulfil the criterion defined by the sets. A risk set may contain one or more genetic or environmental conditions. The MCMC method then provides a probability that a particular risk set exists, i.e. that the conditions specified by the risk set truly cause an increase in the phenotypic value or a higher disease risk. Methods that explore such a large range of models (combinations of effects) often have very little power because they do not efficiently combine the evidence for association from different models. The new Bayesian method will address this problem by combining information from many different models, for example by evaluating the effect of all possible interactions when testing the effect of a single SNP.

Gene based association studies

Another method uses a more conventional frequentist mixed-model framework. The objective of this method is to develop a way to detect associations when multiple SNPs in the same gene have an effect on the trait. In candidate gene approaches, multiple SNPs are typically typed in and around the same gene. The relevant scientific question is then not just which SNPs affect the trait, but also which genes affect the trait. When each SNP in a gene has a small effect, but the combined effect of all the SNPs in the gene is large, conventional methods have very low power to detect the effect. For example, a strong effect may not be observed before both copies of a gene carried by an individual have received sufficiently many deleterious mutations to impair the function of the protein. In such cases, the effect of each mutation may be small, and the power of traditional statistical methods is drastically reduced. However, by using more realistic genetic models and non-linear statistical models, it is possible to develop methods with high statistical power to detect such effects. The method we are developing is based on calculating the probability of observing the phenotypic data under a specific genetic model which can account for the combined effect of multiple SNPs. The method will sum over all possible SNP combinations that could be pathogenic (affect the trait) to form a likelihood ratio test of the hypothesis of no phenotypic effect of any of the mutations. Another variant of this method is a Bayesian model selection scheme that essentially works in the same way as the mixed model, but instead of summing over the possible SNP combinations the method will include (or exclude) SNPs from the model using reversible jump MCMC. The method can be either purely Bayesian or based on maximum likelihood.

Locus-dependent identity by descent

When estimating inbreeding or relatedness, most methods ignore both the dependencies between the identity by descent (IBD) states at adjacent loci and the linkage between adjacent markers. We are working on developing methods that allow for linkage disequilibrium between markers and that will give better estimates by modelling the inbreeding and/or relatedness tracts along the genome. This will be done using hidden Markov models and using inferred haplotypes instead of the observed genotypes. We are using Thompson's likelihood function with Jacquard's IBD coefficients.
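The idea of modelling IBD tracts along the genome can be sketched with a deliberately simplified two-state hidden Markov model for inbreeding in a single individual. The sketch below uses observed genotypes, not inferred haplotypes, and assumes no linkage disequilibrium between markers, so it illustrates the starting point and not the method under development. The function name \texttt{inbreeding\_posterior} and the parameters \texttt{F} (inbreeding coefficient), \texttt{rate} (state switches per base pair) and \texttt{eps} (genotyping error) are hypothetical.

```python
import numpy as np

def inbreeding_posterior(geno, freqs, pos, F=0.05, rate=1e-6, eps=0.01):
    """Posterior probability of IBD (autozygosity) at each site, plus log-likelihood.

    geno: genotypes coded 0/1/2, freqs: allele frequencies, pos: positions in bp.
    """
    geno = np.asarray(geno, dtype=int)
    p = np.asarray(freqs, dtype=float)
    pos = np.asarray(pos, dtype=float)
    q, n, idx = 1.0 - p, len(geno), np.arange(len(geno))
    # Emissions: not IBD follows Hardy-Weinberg; IBD is homozygous unless an error occurred
    hwe = np.stack([q ** 2, 2 * p * q, p ** 2], axis=1)
    ibd = np.stack([(1 - eps) * q, np.full(n, eps), (1 - eps) * p], axis=1)
    emit = np.stack([hwe[idx, geno], ibd[idx, geno]], axis=1)
    init = np.array([1 - F, F])                     # stationary distribution

    def transition(d):
        stay = np.exp(-rate * d)                    # no switch event over distance d
        return stay * np.eye(2) + (1 - stay) * np.tile(init, (2, 1))

    # Scaled forward pass
    fwd, scale = np.zeros((n, 2)), np.zeros(n)
    fwd[0] = init * emit[0]
    scale[0] = fwd[0].sum()
    fwd[0] /= scale[0]
    for t in range(1, n):
        fwd[t] = (fwd[t - 1] @ transition(pos[t] - pos[t - 1])) * emit[t]
        scale[t] = fwd[t].sum()
        fwd[t] /= scale[t]
    # Scaled backward pass
    bwd = np.ones((n, 2))
    for t in range(n - 2, -1, -1):
        T = transition(pos[t + 1] - pos[t])
        bwd[t] = T @ (emit[t + 1] * bwd[t + 1]) / scale[t + 1]
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)
    return post[:, 1], np.log(scale).sum()
```

Long runs of homozygous sites receive a high posterior probability of IBD, while isolated heterozygous sites are explained by the non-IBD state. Accounting for linkage disequilibrium would require the emission at each site to depend on neighbouring markers, which this sketch does not attempt.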

Applied disease mapping

A molecular understanding of the pathology of diseases is a prerequisite for future rational treatment and prevention of many common diseases. Association mapping has recently been the most successful approach for identifying genetic variants that are associated with common diseases. Using microarrays or next-generation sequencing (NGS), hundreds of thousands, or even millions, of single nucleotide polymorphisms (SNPs) can be genotyped relatively cheaply and in a short amount of time. Genome-wide association studies (GWAS), which take advantage of this technology to detect novel disease-associated SNPs, have been successfully applied to many diseases, including common diseases like type 2 diabetes, obesity, and many cancers.

Glucose regulation in Greenland

Association studies have traditionally been performed in large populations, such as Europeans, but there are several advantages to using smaller populations. The Greenlandic population is genetically very different from the European populations that most disease studies performed to date have focused on. Therefore, the Greenlandic population may contain disease-causing mutations that are not present in any of the well-studied European populations, and mutations with large effects may have higher frequencies, providing an increase in power to detect the effect. Another great advantage is that this population has a large amount of linkage disequilibrium (LD) compared to, e.g., the Danish population. In GWAS not all variants are interrogated. Instead, only a subset of common SNPs is typed, and these act as proxies for other variants through LD. Due to the higher amount of LD we will be able to capture, and thus indirectly test, more of the variants. So far no systematic genome-wide genetic studies have been performed in this population. We have access to blood samples from ca. 5000 individuals for whom we have a range of mainly metabolic and behaviour-related phenotypes. Glucose regulation in Europeans and Inuit differs dramatically. The two-hour glucose level following an oral glucose tolerance test (2-h OGTT) is a phenotype that reflects the ability to remove glucose from the blood stream. We have performed an initial GWAS on 3000 Greenlanders genotyped on the inexpensive MetaboChip (200k SNPs) and found a novel variant that associates with 2-h glucose. The effect size for this common variant is by far the largest observed for this trait so far, with the largest of the many known ones being 20-40 times smaller. We have replicated this finding in another Greenlandic cohort and have found the causal variant.
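How well a typed SNP acts as a proxy for an untyped variant is commonly summarised by the squared correlation $r^2$ between the two. The minimal sketch below computes it from unphased genotypes coded as 0, 1 or 2, which is a widely used approximation to the haplotype-based $r^2$. The function name \texttt{genotype\_r2} is hypothetical, and both SNPs are assumed to be polymorphic in the sample.

```python
import numpy as np

def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    r = np.corrcoef(g1, g2)[0, 1]      # off-diagonal entry of the 2x2 correlation matrix
    return r ** 2
```

A value close to 1 means that testing the typed SNP is nearly equivalent to testing the untyped variant, which is why a population with more LD allows more variants to be tested indirectly from the same chip.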