Adaptive Sentence Boundary Disambiguation

David D. Palmer and Marti A. Hearst

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-94-797
February 1994

http://www.eecs.berkeley.edu/Pubs/TechRpts/1994/CSD-94-797.pdf

Labeling of sentence boundaries is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them, most systems use brittle, special-purpose regular expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that uses a lexicon with part-of-speech probabilities and a feed-forward neural network. After training for less than one minute, the method correctly labels over 98.5% of sentence boundaries in a corpus of over 27,000 sentence-boundary marks. We show the method to be efficient and easily adaptable to different text genres, including single-case texts.
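
The general idea in the abstract (look up part-of-speech probabilities for the words around a punctuation mark, then let a small feed-forward network decide whether the mark ends a sentence) can be illustrated with a minimal sketch. This is not the authors' implementation: the tag set, the toy lexicon, the one-word context window, the network size, and all names (TAGS, LEXICON, descriptor, features, TinyNet) are assumptions made up for illustration only.

```python
import math
import random

# Hypothetical coarse tag set and toy lexicon: word -> P(tag | word).
TAGS = ["noun", "verb", "abbrev", "proper", "other"]
LEXICON = {
    "dr": {"abbrev": 0.9, "noun": 0.1},
    "mr": {"abbrev": 1.0},
    "smith": {"proper": 1.0},
    "jones": {"proper": 1.0},
    "left": {"verb": 0.8, "other": 0.2},
    "barked": {"verb": 1.0},
    "the": {"other": 1.0},
    "he": {"other": 1.0},
}


def descriptor(word):
    """POS-probability vector for one token; unknown words get a uniform prior."""
    probs = LEXICON.get(word.lower().rstrip("."))
    if probs is None:
        return [1.0 / len(TAGS)] * len(TAGS)
    return [probs.get(tag, 0.0) for tag in TAGS]


def features(prev_word, next_word):
    """Concatenate the descriptors of the tokens on either side of the mark."""
    return descriptor(prev_word) + descriptor(next_word)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """One hidden layer, sigmoid units, trained by plain online backpropagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # The last weight in each row is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = x + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        y = sigmoid(sum(w * v for w, v in zip(self.w2, h + [1.0])))
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.5):
        h, y = self.forward(x)
        xb, hb = x + [1.0], h + [1.0]
        d_out = y - target  # cross-entropy gradient at the output pre-activation
        # Hidden deltas are computed before the output weights are updated.
        d_hid = [d_out * self.w2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
        for j in range(len(hb)):
            self.w2[j] -= lr * d_out * hb[j]
        for j, row in enumerate(self.w1):
            for i in range(len(xb)):
                row[i] -= lr * d_hid[j] * xb[i]


def mean_loss(net, data):
    """Average cross-entropy of the network over (features, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, t in data:
        y = net.predict(x)
        total -= t * math.log(y + eps) + (1 - t) * math.log(1 - y + eps)
    return total / len(data)


# Toy training set: (word before the period, word after it, 1 = sentence boundary).
EXAMPLES = [("Dr.", "Smith", 0), ("Mr.", "Jones", 0),
            ("left.", "The", 1), ("barked.", "He", 1)]
DATA = [(features(p, n), label) for p, n, label in EXAMPLES]

net = TinyNet(n_in=2 * len(TAGS), n_hidden=3)
loss_before = mean_loss(net, DATA)
for _ in range(200):
    for x, t in DATA:
        net.train_step(x, t)
loss_after = mean_loss(net, DATA)
```

After training, net.predict(features("Dr.", "Smith")) returns a score between 0 and 1, with higher values meaning "sentence boundary". Because the input is built from part-of-speech probabilities rather than hand-written rules, retraining on a different lexicon or corpus is the only change needed to adapt the sketch to another genre, which is the property the abstract emphasizes.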


BibTeX citation:

@techreport{Palmer:CSD-94-797,
    Author = {Palmer, David D. and Hearst, Marti A.},
    Title = {Adaptive Sentence Boundary Disambiguation},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {1994},
    Month = {Feb},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/1994/6317.html},
    Number = {UCB/CSD-94-797},
    Abstract = {Labeling of sentence boundaries is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them, most systems use brittle, special-purpose regular expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that uses a lexicon with part-of-speech probabilities and a feed-forward neural network. After training for less than one minute, the method correctly labels over 98.5% of sentence boundaries in a corpus of over 27,000 sentence-boundary marks. We show the method to be efficient and easily adaptable to different text genres, including single-case texts.}
}

EndNote citation:

%0 Report
%A Palmer, David D.
%A Hearst, Marti A.
%T Adaptive Sentence Boundary Disambiguation
%I EECS Department, University of California, Berkeley
%D 1994
%@ UCB/CSD-94-797
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/1994/6317.html
%F Palmer:CSD-94-797