Research Page

Evolutionary models and algorithms for ancestral sequence reconstruction

Both linguistics and biology face scientific questions that require reconstructing the ancestral forms of discrete sequences from their modern descendants. In linguistics, these questions are about the words that appeared in the protolanguages from which modern languages evolved. Linguists painstakingly reconstruct these words by hand using knowledge of the relationships between languages and the plausibility of sound changes. In biology, analogous questions concern the DNA, RNA, or protein sequences of ancestral genes and genomes. By reconstructing ancestral sequences and the evolutionary paths between them, biologists can make inferences about the evolution of gene function and the nature of the environment in which they evolved. While computational methods are becoming widely used in bioinformatics and computational linguistics, the problem of accurately reconstructing ancestral words and genes still poses a considerable challenge.

The goal of this line of work is to improve the accuracy of ancestral sequence reconstruction in both linguistic and biological settings. Our main contributions are: (1) Three probabilistic models of language change that capture many important phenomena not modeled in previous work. Each model in this sequence is increasingly accurate but correspondingly more computationally expansive. Each model was extensively evaluated empirically. (2) A novel algorithm to perform inference in these hard combinatorial spaces as well as its theoretical and empirical analysis. (3) A quantitative methodology and a new data set for evaluating reconstruction models and algorithms. (4) As a bonus application, we obtained state-of-the-art numbers for multiple sequence alignment systems based on an evolutionary model---a task central to computational biology.

This work has several concrete applications. In historical linguistics (Gray et al., 2003), it can be used to study migration of ancient populations, to help deciphering scripts of dead languages and to elucidate phonetic characteristics of the language drift process. In biology (Huelsenbeck et al., 2001), it enables inference of extinct species' phenotypical properties (for instance optimal growth temperature) as well as the function of modern structures (for instance for finding active sites in proteins). Lexicon filling is another application useful in Machine Translation.

Relevant publications:

Approximate inference in structured combinatorial spaces

A recurrent theme in my research is the need for efficient and accurate approximate inference algorithms. Indeed, approximating expectations in large combinatorial spaces is often the computational bottle-neck at the core of many applications. I have worked on this challenging research problem in the context of applied domains (machine translation, multiple sequence alignment, statistical parsing, and phylogenetic inference) as well as within the theoretical frameworks of variational and MCMC algorithms.

The main contributions in this branch of research are: (1) a collection of novel algorithms for inference in large combinatorial spaces where routine application of MCMC or variational approximations is not possible (for example, when the conditional independence assumptions cannot be expressed in a graphical model). (2) A general purpose framework to approximate high-degree dynamic programs. These dynamic programs often arise in machine translation, statistical parsing and many statistical Natural Language Processing (NLP) tasks. The proposed solution is elegant and simple to implement. (3) A new algorithm to optimize mean-field objective functions, and (4) a theoretical analysis and extended empirical evaluation of all of these algorithms.

Relevant publications:

Training statistical machine translation models

Besides phylogenetics and multiple sequence alignment, I have also been interested in applications related to machine translation. My general goal here is to improve the training procedure of translation models in order to increase translation performance as measure by human evaluation and automatic translation metrics such as the BLEU score.

I have worked on two specific methods for doing so: a Bayesian non-parametric approach based on the Dirichlet Process (DP) prior, and a discriminative approach based on an extension of the classical perceptron algorithm. The former approach yields compact translation tables but is computationally expansive, while the latter allows fast learning on large scale data sets, but at the cost of considerable inflation of the translation tables. I am currently working on a novel inference procedure based on Sequential Monte Carlo that has the potential to combine the advantages of both.

Relevant publications:

  • Alexandre Bouchard-Côté, Slav Petrov and Dan Klein. (2009) Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs. Advances in Neural Information Processing Systems 22 (NIPS). Vancouver, Canada.
    In press
  • John DeNero, Alexandre Bouchard-Côté and Dan Klein. (2008) Sampling Alignment Structure under a Bayesian Translation Model. Proceedings of the 2008 Conference on Empirical Methods on Natural Language Processing (EMNLP08). Waikiki, USA.
    [paper]
  • Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. (2006) An End-to-End Discriminative Approach to Machine Translation. Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL06). Sydney, Australia.
    [paper]