Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Adaptive Sentence Boundary Disambiguation

David D. Palmer and Marti A. Hearst

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-94-797
February 1994

http://www.eecs.berkeley.edu/Pubs/TechRpts/1994/CSD-94-797.pdf

Labeling of sentence boundaries is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them, most systems use brittle, special-purpose regular expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that uses a lexicon with part-of-speech probabilities and a feed-forward neural network. After training for less than one minute, the method correctly labels over 98.5% of sentence boundaries in a corpus of over 27,000 sentence-boundary marks. We show the method to be efficient and easily adaptable to different text genres, including single-case texts.
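
The general idea in the abstract (part-of-speech probabilities for the words around a punctuation mark, fed to a feed-forward network) can be illustrated with a minimal Python sketch. This is not the report's implementation: the tag set `TAGS`, the toy `LEXICON` and its probabilities, the capitalization flag, the context width `k`, the unknown-word fallback, and the untrained placeholder weights are all assumptions made for illustration. In a real system the weights would be learned from labeled text.

```python
import math

# Hypothetical tag set and toy lexicon (word -> P(tag | word)); values are illustrative only.
TAGS = ("noun", "verb", "abbreviation", "number", "other")
LEXICON = {
    "dr": {"abbreviation": 0.9, "noun": 0.1},
    "smith": {"noun": 1.0},
    "arrived": {"verb": 1.0},
    "he": {"other": 1.0},
    "left": {"verb": 0.7, "other": 0.3},
}


def descriptor(token):
    """Return the token's tag probabilities followed by a capitalization flag."""
    word = token.lower().rstrip(".")
    if word.isdigit():
        probs = {"number": 1.0}
    else:
        # Assumed fallback: treat unknown words as nouns.
        probs = LEXICON.get(word, {"noun": 1.0})
    vec = [probs.get(tag, 0.0) for tag in TAGS]
    vec.append(1.0 if token[:1].isupper() else 0.0)
    return vec


def context_vector(tokens, i, k=2):
    """Concatenate descriptors of the k tokens before and the k tokens after position i."""
    pad = [0.0] * (len(TAGS) + 1)  # used when the window runs past the text
    vec = []
    for j in list(range(i - k, i)) + list(range(i + 1, i + k + 1)):
        vec.extend(descriptor(tokens[j]) if 0 <= j < len(tokens) else pad)
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer of sigmoid units and a single sigmoid output in (0, 1)."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


tokens = ["Dr.", "Smith", "arrived", ".", "He", "left", "."]
x = context_vector(tokens, 3, k=2)  # context around the first standalone "."

# Placeholder (untrained) weights: 2 hidden units, one output unit.
w_hidden = [[0.1] * len(x), [-0.1] * len(x)]
b_hidden = [0.0, 0.0]
w_out = [0.5, -0.5]
b_out = 0.0

score = forward(x, w_hidden, b_hidden, w_out, b_out)
is_boundary = score > 0.5  # the threshold is also an assumption
```

With trained weights, `score` would be read as the network's confidence that the punctuation mark at position `i` ends a sentence. Because the input is built from tag probabilities and not from the words themselves, the input size stays fixed regardless of vocabulary.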


BibTeX citation:

@techreport{Palmer:CSD-94-797,
    Author = {Palmer, David D. and Hearst, Marti A.},
    Title = {Adaptive Sentence Boundary Disambiguation},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {1994},
    Month = {Feb},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/1994/6317.html},
    Number = {UCB/CSD-94-797},
    Abstract = {Labeling of sentence boundaries is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them most systems use brittle, special-purpose regular expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that uses a lexicon with part-of-speech probabilities and a feed-forward neural network. After training for less than one minute, the method correctly labels over 98.5% of sentence boundaries in a corpus of over 27,000 sentence-boundary marks. We show the method to be efficient and easily adaptable to different text genres, including single-case texts.}
}

EndNote citation:

%0 Report
%A Palmer, David D.
%A Hearst, Marti A.
%T Adaptive Sentence Boundary Disambiguation
%I EECS Department, University of California, Berkeley
%D 1994
%@ UCB/CSD-94-797
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/1994/6317.html
%F Palmer:CSD-94-797