# Algorithms for Human Genetics

### Bonnie Kirkpatrick

###
EECS Department

University of California, Berkeley

Technical Report No. UCB/EECS-2011-25

April 5, 2011

### http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-25.pdf

Whereas Mendel used breeding experiments and painstakingly counted peas, modern biology increasingly requires computational tools. In the late 1800's probability and experimental genetics were the critical tools for discovering the gene. Today, the combined use of statistical and computational methods to make genetic and genomic discoveries has increased after the discovery of the DNA double-helix and the development of sequencing methods. By examining relationships among individuals using computational tools, geneticists have been able to understand the biological mechanisms that produce genetic diversity, map ancestral movements of populations, reconstruct ancestral genomes, and identify relatives. Furthermore, models in genetics have inspired advances in computer science, notably the model for inheritance in families is an early example of a graphical model and helped inspire the sum-product algorithm. In human genetics there are two main ways to model relatedness: evolutionary relationships between people and closer, family relationships. Evolutionary relationships, from the domain of population genetics, occur through a distant relative and leave small traces of the relationship in the genome. Family relationships are typically much closer and leave much larger traces in the genome. This thesis examines algorithms for both types of relationships. For evolutionarily related individuals, this thesis presents the perfect phylogeny and coalescent and then examines two related questions. The first is related to privacy of genetic data used for research purposes. In order to share data from studies while hopefully maintaining the privacy of study participants, geneticists have released the summary statistics of the data. A natural question, whether individuals can be detected in the summary data, is answered in the affirmative by using a perfect phylogeny model. The second question is how to construct perfect phylogenies from haplotypes where there is missing data. We introduce a polynomial-time algorithm for enumerating such phylogenies. This algorithm can be used to compute the probability of the data as an expectation over possible coalescent genealogies. Recent relationships are modeled using a family tree, or pedigree graph. Traditionally, geneticists construct these graphs from genealogical records in a very tedious process of examining records. As an alternative to manual methods, this thesis addresses the problem of automatically constructing pedigree graphs from genetic data. The most obvious way to reconstruct pedigrees from genetic data is to use a structured machine learning approach, similar to phylogenetic reconstruction. That method would involve a search over the space of pedigree graphs where the objective is to find the pedigree graph with the highest likelihood of generating the observed data. Unfortunately, this is not a good way to proceed for two reasons: the space of pedigree graphs is exponential, and the likelihood calculation has exponential running time. The likelihood calculation given genotype data is known to be NP-hard. This thesis presents several exponential algorithms related to computing the likelihood. Since likelihood-based approaches seem completely infeasible, a completely different approach is introduced. We focus on the problem of inferring relationships between a set of living individuals with available identity-by-descent data. For convenience, we assume that the inferred pedigree is monogamous without inter-generational mating. Two heuristic and practical pedigree reconstruction methods are introduced, one for inbred pedigrees and the other for outbred pedigrees. This work immediately reveals another important problem, that of evaluating the resulting inferred pedigree against a ground-truth pedigree. This can be done either by determining whether the two pedigrees are isomorphic or by finding the edit distance between the two pedigrees.

**Advisor:** Richard M. Karp

BibTeX citation:

@phdthesis{Kirkpatrick:EECS-2011-25, Author = {Kirkpatrick, Bonnie}, Title = {Algorithms for Human Genetics}, School = {EECS Department, University of California, Berkeley}, Year = {2011}, Month = {Apr}, URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-25.html}, Number = {UCB/EECS-2011-25}, Abstract = {Whereas Mendel used breeding experiments and painstakingly counted peas, modern biology increasingly requires computational tools. In the late 1800's probability and experimental genetics were the critical tools for discovering the gene. Today, the combined use of statistical and computational methods to make genetic and genomic discoveries has increased after the discovery of the DNA double-helix and the development of sequencing methods. By examining relationships among individuals using computational tools, geneticists have been able to understand the biological mechanisms that produce genetic diversity, map ancestral movements of populations, reconstruct ancestral genomes, and identify relatives. Furthermore, models in genetics have inspired advances in computer science, notably the model for inheritance in families is an early example of a graphical model and helped inspire the sum-product algorithm. In human genetics there are two main ways to model relatedness: evolutionary relationships between people and closer, family relationships. Evolutionary relationships, from the domain of population genetics, occur through a distant relative and leave small traces of the relationship in the genome. Family relationships are typically much closer and leave much larger traces in the genome. This thesis examines algorithms for both types of relationships. For evolutionarily related individuals, this thesis presents the perfect phylogeny and coalescent and then examines two related questions. The first is related to privacy of genetic data used for research purposes. In order to share data from studies while hopefully maintaining the privacy of study participants, geneticists have released the summary statistics of the data. A natural question, whether individuals can be detected in the summary data, is answered in the affirmative by using a perfect phylogeny model. The second question is how to construct perfect phylogenies from haplotypes where there is missing data. We introduce a polynomial-time algorithm for enumerating such phylogenies. This algorithm can be used to compute the probability of the data as an expectation over possible coalescent genealogies. Recent relationships are modeled using a family tree, or pedigree graph. Traditionally, geneticists construct these graphs from genealogical records in a very tedious process of examining records. As an alternative to manual methods, this thesis addresses the problem of automatically constructing pedigree graphs from genetic data. The most obvious way to reconstruct pedigrees from genetic data is to use a structured machine learning approach, similar to phylogenetic reconstruction. That method would involve a search over the space of pedigree graphs where the objective is to find the pedigree graph with the highest likelihood of generating the observed data. Unfortunately, this is not a good way to proceed for two reasons: the space of pedigree graphs is exponential, and the likelihood calculation has exponential running time. The likelihood calculation given genotype data is known to be NP-hard. This thesis presents several exponential algorithms related to computing the likelihood. Since likelihood-based approaches seem completely infeasible, a completely different approach is introduced. We focus on the problem of inferring relationships between a set of living individuals with available identity-by-descent data. For convenience, we assume that the inferred pedigree is monogamous without inter-generational mating. Two heuristic and practical pedigree reconstruction methods are introduced, one for inbred pedigrees and the other for outbred pedigrees. This work immediately reveals another important problem, that of evaluating the resulting inferred pedigree against a ground-truth pedigree. This can be done either by determining whether the two pedigrees are isomorphic or by finding the edit distance between the two pedigrees.} }

EndNote citation:

%0 Thesis %A Kirkpatrick, Bonnie %T Algorithms for Human Genetics %I EECS Department, University of California, Berkeley %D 2011 %8 April 5 %@ UCB/EECS-2011-25 %U http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-25.html %F Kirkpatrick:EECS-2011-25