EECS Joint Colloquium Distinguished Lecture Series

Wednesday, March 3, 2004
Hewlett Packard Auditorium, 306 Soda Hall
4:00-5:00 p.m.

Dan Klein

Stanford University


Unsupervised Learning of Natural Language Syntax


Abstract:


There is precisely one complete language processing system to date: the human brain. Though there is debate over how much built-in bias human learners bring to the task, we clearly acquire language in a primarily unsupervised fashion. Computational approaches to language processing, by contrast, are almost exclusively supervised, relying on hand-labeled corpora for training. This reliance is largely due to repeated failures of unsupervised approaches: in particular, the problem of learning syntax (grammar) from completely unannotated text has received a great deal of attention for well over a decade, with little in the way of positive results.

We argue that previous methods for this task have generally failed because of the representations they used. Overly complex models are easily distracted by non-syntactic correlations (such as topical associations), while overly simple models are not rich enough to capture important first-order properties of language (such as directionality, adjacency, and valence). We describe several syntactic representations that are designed to capture the basic character of natural language syntax as directly as possible. With these representations, high-quality parses can be learned from surprisingly little text, with no labeled examples and no language-specific biases.

Our results are the first to show above-baseline performance in unsupervised parsing, and they exceed the baseline by a wide margin in multiple languages. These grammar-learning methods are useful in their own right, since parsed corpora exist for only a small number of languages. More generally, most high-level NLP tasks, such as machine translation and question answering, lack richly annotated corpora, making unsupervised methods extremely appealing even for common languages like English.
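As a concrete illustration of the directionality, adjacency, and valence properties mentioned above, the following is a minimal sketch of a dependency-style generative model in which each head word decides, separately on each side (directionality) and conditioned on whether it has already taken an argument there (adjacency/valence), whether to stop or to attach another argument. All words, probability values, and the example parse are hypothetical illustration values, not parameters or results from the talk.

```python
# Toy dependency model with valence-style stop/attach decisions.
# All parameter values below are made up for illustration only.

# P_ROOT[word]: probability the sentence root is `word`.
P_ROOT = {"saw": 0.9, "dogs": 0.05, "cats": 0.05}

# P_STOP[(head, direction, adjacent)]: probability `head` stops taking
# arguments in that direction; `adjacent` is True before any argument
# has been taken on that side (this is the valence conditioning).
P_STOP = {
    ("saw", "L", True): 0.1,  ("saw", "L", False): 0.9,
    ("saw", "R", True): 0.2,  ("saw", "R", False): 0.9,
    ("dogs", "L", True): 0.9, ("dogs", "L", False): 0.95,
    ("dogs", "R", True): 0.9, ("dogs", "R", False): 0.95,
    ("cats", "L", True): 0.9, ("cats", "L", False): 0.95,
    ("cats", "R", True): 0.9, ("cats", "R", False): 0.95,
}

# P_ATTACH[(head, direction)]: distribution over argument words.
P_ATTACH = {
    ("saw", "L"): {"dogs": 0.8, "cats": 0.2},
    ("saw", "R"): {"cats": 0.8, "dogs": 0.2},
}

def tree_prob(head, left_args, right_args):
    """Probability that `head` generates the given argument subtrees.

    left_args / right_args are lists of (word, left, right) triples,
    ordered outward from the head."""
    p = 1.0
    for direction, args in (("L", left_args), ("R", right_args)):
        adjacent = True
        for word, largs, rargs in args:
            p *= 1 - P_STOP[(head, direction, adjacent)]  # continue
            p *= P_ATTACH[(head, direction)][word]        # pick argument
            p *= tree_prob(word, largs, rargs)            # recurse
            adjacent = False
        p *= P_STOP[(head, direction, adjacent)]          # finally stop
    return p

# Probability of the parse  dogs <- saw -> cats  for "dogs saw cats":
parse_p = P_ROOT["saw"] * tree_prob("saw",
                                    [("dogs", [], [])],
                                    [("cats", [], [])])
```

In the actual unsupervised setting, probabilities like these would not be fixed by hand; they would be re-estimated from raw text (for example, with EM), and the learned preferences for attachment direction and argument count are exactly the first-order regularities the abstract argues a representation must be able to express.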

Biography:

Dan Klein works on unsupervised language induction and large-scale machine learning for NLP, including statistical parsing, information extraction, fast inference in large dynamic programs, and automatic clustering. He holds a BA from Cornell University (in computer science, linguistics, and math) and a master's degree in linguistics from Oxford University. He received a Best Paper Award at ACL 2003.