Electrical Engineering
      and Computer Sciences

Electrical Engineering and Computer Sciences


UC Berkeley


2010 Research Summary

Evolutionary models and algorithms for ancestral sequence reconstruction

View Current Project Information

Alexandre Bouchard-Cote, Percy Shuo Liang, Daniel Klein and Thomas Griffiths

NDSEG Fellowship, Microsoft, National Science Foundation, FQRNT and NSERC

Both linguistics and biology face scientific questions that require reconstructing the ancestral forms of discrete sequences from their modern descendants. In linguistics, these questions are about the words that appeared in the protolanguages from which modern languages evolved. Linguists painstakingly reconstruct these words by hand using knowledge of the relationships between languages and the plausibility of sound changes. In biology, analogous questions concern the DNA, RNA, or protein sequences of ancestral genes and genomes. By reconstructing ancestral sequences and the evolutionary paths between them, biologists can make inferences about the evolution of gene function and the nature of the environment in which they evolved. While computational methods are becoming widely used in bioinformatics and computational linguistics, the problem of accurately reconstructing ancestral words and genes still poses a considerable challenge.

The goal of this line of work is to improve the accuracy of ancestral sequence reconstruction in both linguistic and biological settings. Our main contributions are: (1) Three probabilistic models of language change that capture many important phenomena not modeled in previous work. Each model in this sequence is increasingly accurate but correspondingly more computationally expansive. Each model was extensively evaluated empirically. (2) A novel algorithm to perform inference in these hard combinatorial spaces as well as its theoretical and empirical analysis. (3) A quantitative methodology and a new data set for evaluating reconstruction models and algorithms. (4) As a bonus application, we obtained state-of-the-art numbers for multiple sequence alignment systems based on an evolutionary model---a task central to computational biology.

This work has several concrete applications. In historical linguistics (Gray et al., 2003), it can be used to study migration of ancient populations, to help deciphering scripts of dead languages and to elucidate phonetic characteristics of the language drift process. In biology (Huelsenbeck et al., 2001), it enables inference of extinct species' phenotypical properties (for instance optimal growth temperature) as well as the function of modern structures (for instance for finding active sites in proteins). Lexicon filling is another application useful in Machine Translation.

Alexandre Bouchard-Côté, Thomas L. Griffiths and Dan Klein. (2009) Improved Reconstruction of Protolanguage Word Forms.
Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL09). Boulder, USA.
Alexandre Bouchard-Côté, Michael I. Jordan and Dan Klein. (2009) Efficient Inference in Phylogenetic InDel Trees.
Advances in Neural Information Processing Systems 21 (NIPS). Vancouver, Canada.
Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths and Dan Klein. (2008) A Probabilistic Approach to Language Change.
Advances in Neural Information Processing Systems 20 (NIPS). Vancouver, Canada.
Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. (2007) A Probabilistic Approach to Diachronic Phonology.
Proceedings of the 2007 Conference on Empirical Methods on Natural Language Processing (EMNLP07). Prague, Czech Republic.