EECS Joint Colloquium Distinguished Lecture Series

Wednesday, January 28, 2004
Hewlett Packard Auditorium, 306 Soda Hall
4:00-5:00 p.m.

Eran Segal

Stanford University


Rich Probabilistic Models for Genomic Data




Genomic datasets, spanning many organisms and data types, are rapidly being produced, creating new opportunities for understanding the molecular mechanisms underlying human disease and for studying complex biological processes on a global scale. Transforming these immense amounts of data into biological information is a challenging task. We address this challenge by presenting a statistical modeling language, based on Bayesian networks, for representing heterogeneous biological entities and modeling the mechanisms by which they interact. We use statistical learning approaches to learn the details of these models (structure and parameters) automatically from raw genomic data. Biological insights are then derived directly from the learned model.

In this talk, I will describe three applications of this framework to the study of gene regulation:

* Understanding the process by which DNA patterns (motifs) in the control regions of genes play a role in controlling their activity. Using only DNA sequence and gene expression data as input, these models recovered many of the known motifs in yeast and several known motif combinations in human.
* Finding regulatory modules and their actual regulator genes directly from gene expression data. Some of the predictions from this analysis were tested successfully in the wet-lab, suggesting regulatory roles for three previously uncharacterized proteins.
* Combining gene expression profiles from several organisms for a more robust prediction of gene function and regulatory pathways, and for studying the degree to which regulatory relationships have been conserved across evolution.


Mr. Segal works on computational biology, focusing on exploiting genomic data for the study of real-world biological problems. He also develops visualization and browsing tools that are easily accessible to biologists, including GeneXPress, a generic software environment for visualization and statistical analysis of heterogeneous genomic data. Segal holds a B.Sc. in Computer Science from Tel Aviv University and is currently a Ph.D. candidate at Stanford (Computer Science, with a Ph.D. minor in genetics), working with Daphne Koller.