Abstracts for Marti A. Hearst --SIMS
The EECS Research Summary for 2003
BioText - Infrastructure for Mining of Biological Text
Gaurav Bhalotia and Ariel Schwartz
(Professor Marti A. Hearst --SIMS)
GAANN Fellowship and Genentech
The BioText project's main goal is to provide an intelligent information
extraction and retrieval system for use in biomedical and genomics research. The
system would enable fast and flexible access to text-based information needed
by biological scientists, and would also provide an efficient, modular
infrastructure for NLP scientists developing text-mining and text-analysis
algorithms [1-3].
We are working on the design and implementation of the system’s
infrastructure. Our main interest is in extending object-relational databases to
support the special requirements of information extraction from biomedical text.
Current plans are for the system to include:
- Efficient storage representation and access methods for
semantically annotated text. The system should be able to
efficiently store and access different layers of annotation of the
same text, including part-of-speech tagging, resolving definitions
of acronyms [3], and mapping to a domain-specific ontology [2]. It
should also provide an API for updating and retrieving
annotations, and for registering NLP algorithms into the
system.
- Extended query semantics. The system should combine a relational
query language with keyword-based search, and should support
synonyms and definition substitution as part of the query language
and access methods. It should support queries such as "find
ligands that bind to a given receptor" and "find all documents
that reference a specific gene and its synonyms."
- Ranking results based on their relevance to the query.
Relevance metrics should incorporate support and confidence
information, time of publication, number of references, and other
tunable parameters. The challenge here is to integrate the processes
of ranking and result discovery.
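
The three planned features above can be illustrated with a minimal sketch.
It is not the BioText design: the table names (document, annotation,
synonym), the column names, the sample sentences, and the publication years
are all hypothetical, and SQLite stands in for an object-relational database.
The sketch stores annotation layers as standoff spans over the same text,
expands a gene name to its synonyms inside a relational query, and ranks
results with a deliberately simple stand-in metric (number of matched surface
forms, then recency).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical tables and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE document (doc_id INTEGER PRIMARY KEY, body TEXT, year INTEGER);
        -- One row per annotated character span; layer separates annotation types.
        CREATE TABLE annotation (
            doc_id INTEGER, layer TEXT, start_pos INTEGER, end_pos INTEGER, label TEXT);
        -- Maps a canonical name to each surface form that may appear in text.
        CREATE TABLE synonym (canonical TEXT, surface TEXT);
        """
    )
    # Invented sample text and years, for illustration only.
    conn.executemany(
        "INSERT INTO document VALUES (?, ?, ?)",
        [
            (1, "BRCA1 interacts with RAD51 in DNA repair.", 2001),
            (2, "Mutations in the breast cancer 1 gene raise cancer risk.", 2002),
            (3, "RAD51 forms filaments on single-stranded DNA.", 2000),
        ],
    )
    # Two annotation layers over the same span (characters 0-5 of document 1).
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
        [(1, "pos", 0, 5, "NNP"), (1, "ontology", 0, 5, "gene")],
    )
    conn.executemany(
        "INSERT INTO synonym VALUES (?, ?)",
        [("BRCA1", "BRCA1"), ("BRCA1", "breast cancer 1")],
    )
    return conn


def layers_for_span(conn, doc_id, start_pos, end_pos):
    """Return {layer: label} for every annotation on exactly this span."""
    rows = conn.execute(
        "SELECT layer, label FROM annotation "
        "WHERE doc_id = ? AND start_pos = ? AND end_pos = ?",
        (doc_id, start_pos, end_pos),
    ).fetchall()
    return dict(rows)


def find_documents(conn, gene):
    """Return ids of documents mentioning the gene or any of its synonyms.

    Ranking is a placeholder: documents matching more surface forms come
    first, and ties are broken by more recent publication year.
    """
    rows = conn.execute(
        """
        SELECT d.doc_id, d.year, COUNT(*) AS matched_forms
        FROM document d
        JOIN synonym s ON s.canonical = ?
        WHERE instr(lower(d.body), lower(s.surface)) > 0
        GROUP BY d.doc_id
        ORDER BY matched_forms DESC, d.year DESC
        """,
        (gene,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = build_demo_db()
    print(layers_for_span(demo, 1, 0, 5))
    print(find_documents(demo, "BRCA1"))
```

Keeping annotations in a separate table keyed by character offsets lets new
layers be added without rewriting the stored text. In a fuller system the
substring match would presumably be replaced by a proper text index.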