Predicting protein molecular function

Barbara Engelhardt

U. of California (Berkeley)


The number of known nucleotide sequences encoding proteins is growing at an extraordinarily fast rate due to technologies developed in the last decade that enable rapid sequence acquisition. Such rapid acquisition is a prelude to understanding the molecular function and tertiary structure of these protein sequences, and from there to an understanding of the role these proteins play in a particular organism. The experimental technologies that enable us to understand molecular function have not progressed as fast as those for sequencing. One important role of computational biology is to make accurate predictions for molecular function based on the protein's sequence alone.

Phylogenomics is a field of study that approaches the problem of protein molecular function prediction from an evolutionary perspective. In particular, a phylogenomic analysis transfers existing (but sparse) molecular function annotations to a query protein based on a reconciled phylogeny, which explicitly represents the evolutionary relationships of a set of related proteins. In my dissertation, I formalize the phylogenomics methodology as a statistical graphical model of molecular function evolution. Within this framework, we can predict protein molecular function from protein sequence alone. Molecular function evolution is represented as a simple continuous-time Markov chain, and the random variables at each node in the tree correspond to a set of functional terms from the Gene Ontology. The model is encapsulated in a framework called SIFTER (Statistical Inference of Function Through Evolutionary Relationships).
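The idea of propagating sparse annotations through a phylogeny with a continuous-time Markov chain can be illustrated with a toy example. The sketch below is not SIFTER's model: it assumes a single function that is either present (1) or absent (0), a symmetric two-state chain with one hypothetical rate parameter, and a uniform prior at the root, whereas SIFTER works with sets of Gene Ontology terms. The tree, the protein names, and every function name here are made up for illustration. The posterior for the query protein is computed with the standard pruning (upward message passing) recursion on the tree.

```python
import math

def transition(rate, t):
    """2x2 transition matrix of a symmetric two-state CTMC over branch length t."""
    same = 0.5 + 0.5 * math.exp(-2.0 * rate * t)
    return [[same, 1.0 - same], [1.0 - same, same]]

def partial_likelihood(node, evidence, rate):
    """Return [L(0), L(1)]: probability of the evidence below `node` given its state."""
    if "children" not in node:
        state = evidence.get(node["name"])
        if state is None:
            return [1.0, 1.0]  # unannotated leaf contributes no information
        return [1.0 if s == state else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, length in node["children"]:
        child_l = partial_likelihood(child, evidence, rate)
        p = transition(rate, length)
        for s in (0, 1):
            result[s] *= sum(p[s][c] * child_l[c] for c in (0, 1))
    return result

def posterior_has_function(tree, evidence, query, rate=1.0):
    """P(query leaf = 1 | annotated leaves), assuming a uniform root prior."""
    joint = []
    for s in (0, 1):
        ev = dict(evidence)
        ev[query] = s  # clamp the query leaf to state s
        l = partial_likelihood(tree, ev, rate)
        joint.append(0.5 * l[0] + 0.5 * l[1])
    return joint[1] / (joint[0] + joint[1])

# Hypothetical tree: "query" is a close relative of annotated protein "p1",
# while "p2" sits on a long branch. Each child is (subtree, branch_length).
tree = {"children": [
    ({"children": [({"name": "p1"}, 0.1), ({"name": "query"}, 0.1)]}, 0.2),
    ({"name": "p2"}, 1.0),
]}
evidence = {"p1": 1, "p2": 0}  # sparse annotations: 1 = has the function

print(posterior_has_function(tree, evidence, "query"))
```

Because the query sits on a short branch next to the annotated protein `p1`, its posterior leans toward `p1`'s annotation; with no annotations at all, the symmetric chain and uniform prior give a posterior of 0.5.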

SIFTER has performed well on a number of diverse protein families compared with standard annotation transfer methods and other phylogenomics-based approaches. SIFTER has been applied to the complete genomes of 46 fungal species and is able to make molecular function predictions for a large percentage of the predicted proteins in these genomes. Moreover, these predictions allow us to explore some genomic comparisons for fungi. Motivated by the high cost of characterization experiments, active learning techniques have also been applied to SIFTER's protein function predictions, with good results.
