Predicting protein molecular function
Abstract
The number of known nucleotide sequences encoding proteins is growing
at an extraordinarily fast rate due to technologies developed in the
last decade that enable rapid sequence acquisition. Such rapid
acquisition is a prelude to understanding the molecular function and
tertiary structure of these protein sequences, and from there to an
understanding of the role these proteins play in a particular
organism. The experimental technologies that enable us to understand
molecular function have not progressed as quickly as those for
sequencing. One important role of computational biology is therefore
to predict molecular function accurately from a protein's sequence
alone.
Phylogenomics is a field of study that approaches the problem of
protein molecular function prediction from an evolutionary
perspective. In particular, a phylogenomic analysis transfers existing
(but sparse) molecular function annotations to a query protein based
on a reconciled phylogeny, which explicitly represents the
evolutionary relationships of a set of related proteins. In my
dissertation, I formalize the phylogenomics methodology as a
statistical graphical model of molecular function evolution. Within
this framework, molecular function can be predicted from protein
sequence alone. Molecular function evolution is represented as a
simple continuous-time Markov chain, and the random variables at each
node in the tree correspond to a set of functional terms from the Gene
Ontology. The model is implemented in a framework called SIFTER
(Statistical Inference of Function Through Evolutionary
Relationships).
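
The idea of propagating sparse annotations over a phylogeny with a continuous-time Markov chain can be illustrated with a deliberately small sketch. It is not SIFTER's actual model: it assumes a single binary state per node (the protein has a given function or lacks it), made-up gain and loss rates, and a hypothetical three-leaf tree with invented branch lengths. The stationary distribution of the chain is used as the root prior, which is also an assumption. The names transition_matrix, partial_likelihood, and posterior_has_function are illustrative only.

```python
import math


def transition_matrix(gain, loss, t):
    """Transition probabilities P(t) of a two-state CTMC.

    State 0 = function absent, state 1 = function present.
    gain is the 0->1 rate, loss is the 1->0 rate (both assumed values).
    """
    total = gain + loss
    decay = math.exp(-total * t)
    pi0, pi1 = loss / total, gain / total  # stationary distribution
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]


def partial_likelihood(node, evidence, gain, loss):
    """Return [L(0), L(1)]: probability of the evidence below `node`
    given each possible state of `node` (a pruning-style recursion)."""
    name, children = node
    if not children:
        if name in evidence:
            return [1.0 if s == evidence[name] else 0.0 for s in (0, 1)]
        return [1.0, 1.0]  # unannotated leaf contributes no evidence
    like = [1.0, 1.0]
    for child, branch_length in children:
        p = transition_matrix(gain, loss, branch_length)
        child_like = partial_likelihood(child, evidence, gain, loss)
        for s in (0, 1):
            like[s] *= sum(p[s][c] * child_like[c] for c in (0, 1))
    return like


def posterior_has_function(tree, evidence, query, gain, loss):
    """Posterior probability that the query leaf has the function,
    obtained by clamping the query to each state and normalizing."""
    total = gain + loss
    root_prior = [loss / total, gain / total]  # assumption: stationary prior
    joint = []
    for s in (0, 1):
        clamped = dict(evidence)
        clamped[query] = s
        like = partial_likelihood(tree, clamped, gain, loss)
        joint.append(sum(root_prior[r] * like[r] for r in (0, 1)))
    return joint[1] / (joint[0] + joint[1])


# Hypothetical tree: each node is (name, [(child, branch_length), ...]).
tree = (
    "root",
    [
        (("ancestor", [(("query", []), 0.1), (("near_homolog", []), 0.1)]), 0.2),
        (("far_homolog", []), 0.5),
    ],
)

# Sparse annotations: the close relative has the function, the distant one lacks it.
evidence = {"near_homolog": 1, "far_homolog": 0}

posterior = posterior_has_function(tree, evidence, "query", gain=1.0, loss=1.0)
print("P(query has function | annotations) =", posterior)
```

In this toy setting the annotation on the closely related leaf outweighs the conflicting annotation on the distant leaf, because short branches leave less time for the state to change. Handling many Gene Ontology terms jointly, as described above, would require a larger state space than this two-state chain.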
SIFTER has performed well on a number of diverse protein families
compared with standard annotation transfer methods and other
phylogenomics-based approaches. SIFTER has been applied to the
complete genomes of 46 fungal species, and makes molecular function
predictions for a large percentage of the predicted proteins in these
genomes. Moreover, these predictions make it possible to explore
genomic comparisons among fungi. Motivated by the high cost of
characterization experiments, active learning techniques have also
been applied to SIFTER's protein function predictions, with good
results.