Statistical Parsing Fifteen Years Later

Eugene Charniak
Brown University

Abstract

The creation of the Penn Treebank fifteen years ago revolutionized work in parsing, that is, determining the syntactic structure of natural-language sentences. The treebank's 1,000,000 words of human-parsed text suggested applying statistical machine learning techniques to the problem, and I and others followed this suggestion. This research program has proved remarkably successful. Indeed, for English, and for "standard" newspaper text, the problem can almost be considered solved, insofar as there are several parsers on the web that can produce quite acceptable parses for all the articles in, say, today's New York Times. The bulk of this talk will describe what has led to this happy state of affairs. At the end we will look at where new work in the area is going. As you might expect, it is largely on non-English or non-standard text.

Eugene Charniak is University Professor of Computer Science at Brown University and past chair of the department. He received his A.B. degree in Physics from the University of Chicago and a Ph.D. in Computer Science from M.I.T. He has published four books, the most recent being Statistical Language Learning. He is a Fellow of the American Association for Artificial Intelligence and was previously a Councilor of the organization. His research has always been in the area of language understanding or technologies that relate to it. Over the last 15 years he has been interested in statistical techniques for many areas of language processing, including parsing, discourse, and anaphora.