Berkeley Electrical Engineering and Computer Sciences

The ability to speak and understand language is taken for granted, yet it has proved to be among the most challenging skills for a computer to master. One problem is that the grammatical rules of the "natural" languages developed by humans are not only complex, requiring thousands of pages of text to describe, but also highly ambiguous. For instance, "time flies like an arrow" is easily understood from context to mean that time moves quickly, just as an arrow does. Yet to a software program that lacks a human's knowledge and experience, such a phrase is baffling. Is "time" the subject and "flies" the verb, or vice versa? Or are "time flies" a particular kind of insect that favors arrows? "Our brains suppress the ambiguities," says Dan Klein, a natural language expert who joined the computer science faculty in 2005.
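The ambiguity can be made concrete with a small chart parser. The sketch below runs the classic CYK algorithm over a toy grammar invented for this example (far smaller than any real system's grammar, and not one used by Klein); it counts how many distinct parses each grammar category has over "time flies like an arrow" and finds three complete readings.

```python
# Toy grammar, invented for illustration. Bare nouns are also tagged NP
# so that one-word noun phrases work without unary rules.
lexicon = {
    "time":  {"N", "V", "NP"},   # noun, verb ("time the race"), or bare noun phrase
    "flies": {"N", "V", "NP"},
    "like":  {"V", "P"},         # verb ("insects like arrows") or preposition
    "an":    {"Det"},
    "arrow": {"N"},
}
rules = [
    ("NP", ("Det", "N")),  # an arrow
    ("NP", ("N", "N")),    # "time flies", the insect reading
    ("PP", ("P", "NP")),   # like an arrow
    ("VP", ("V", "NP")),   # like [an arrow]
    ("VP", ("V", "PP")),   # flies [like an arrow]
    ("S",  ("NP", "VP")),  # declarative sentence
    ("S",  ("VP", "PP")),  # imperative: time [flies] [like an arrow]
]

def parse_counts(words):
    """CYK chart parsing: chart[i][j] maps each category to the number
    of distinct parses it has over the span words[i:j]."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):           # fill in single-word spans
        for tag in lexicon[word]:
            chart[i][i + 1][tag] = 1
    for span in range(2, n + 1):               # then ever-larger spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # every split point
                for parent, (left, right) in rules:
                    ways = chart[i][k].get(left, 0) * chart[k][j].get(right, 0)
                    if ways:
                        chart[i][j][parent] = chart[i][j].get(parent, 0) + ways
    return chart[0][n]

print(parse_counts("time flies like an arrow".split())["S"])  # 3 distinct parses
```

The three sentence parses correspond to the readings above: time moving like an arrow, "time flies" the insects being fond of arrows, and an imperative order to time flies the way one would time an arrow.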


EECS Professor Dan Klein. (Photo by Peg Skorpinski)
Natural language processing tools are useful not only for practical tasks but also for gaining insight into how humans learn language. One mystery is whether human brains deduce the rules of a language from scratch or are pre-wired for it. Noam Chomsky, the famous linguist, has long posited that syntax is innate. Klein's groundbreaking Ph.D. thesis work showed, however, that syntax can be induced merely from exposure to naturally occurring sentences.

Computer programs for understanding language make use of a grammar, a set of rules governing sentence construction in the language. The best-performing grammars have been obtained from supervised machine learning methods, which require labeled data as input. These labeled data sets, in which every sentence is parsed by hand, are expensive to assemble, are often full of errors, and are simply unavailable for many languages and types of text sources.

An alternative is to devise programs that learn the rules directly from a large set of unlabeled data by detecting patterns in the way sentences are put together. But attempts at unsupervised learning, going back over twenty years, had been abject failures, showing little improvement over random guessing. In fact, the parsing accuracies of grammars obtained from unsupervised algorithms were lower than those of even quite trivial baseline grammars, which, for many researchers, seemed to confirm Chomsky's hypothesis.
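One standard trivial baseline for English is a right-branching guess, which simply nests each word with everything to its right (the function name here is mine; the strategy itself is the well-known baseline):

```python
def right_branching(words):
    """Trivial parsing baseline: attach each word to the entire phrase on
    its right. Because English structure is largely right-branching, this
    blind guess outscored many early unsupervised grammars."""
    if len(words) == 1:
        return words[0]
    return (words[0], right_branching(words[1:]))

print(right_branching("time flies like an arrow".split()))
# ('time', ('flies', ('like', ('an', 'arrow'))))
```

That a strategy requiring no learning at all could beat learned grammars is what made the early results look so discouraging.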

While a graduate student at Stanford, Klein wondered why so many efforts that seemed to make sense performed so poorly. "When someone has an idea that basically makes sense and you build a system around it and it's a failure, there must be a deep reason for this," he says. So, he decided to do a systematic study of previous unsupervised models to figure out why they didn't work.

He was able to point to several widespread, faulty assumptions. For instance, the models posited that pairs of words that frequently appear together must be related syntactically. "This is wrong," Klein says, noting that a more likely reason for two words to co-occur is that they are used to describe the same topic—for example, "ballot" and "president."
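The distinction is easy to see with raw counts. In the sketch below, run over a mini-corpus invented for illustration, "ballot" and "president" co-occur in every sentence even though neither word ever modifies the other; the signal is topical, not syntactic.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus, invented for this example.
corpus = [
    "the president won the ballot",
    "the ballot confirmed the president",
    "voters mailed the ballot before the president spoke",
]

pair_counts = Counter()
for sentence in corpus:
    words = set(sentence.split()) - {"the"}          # drop the stopword
    for a, b in combinations(sorted(words), 2):      # every unordered pair
        pair_counts[(a, b)] += 1

# The pair co-occurs in all three sentences, yet in none of them is one
# word a syntactic dependent of the other -- they merely share a topic.
print(pair_counts[("ballot", "president")])  # 3
```

A model that reads syntax off such counts will link topically related words instead of grammatically related ones, which is one reason the early systems fared so poorly.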

Klein went on to address this issue and others, creating a tool that vastly outperforms previous unsupervised learning systems, exceeding even supervised baseline grammars. "You can learn more than people expected from less than what they thought was required," he says.

Since coming to Berkeley, Klein has been looking at other ways to simplify and improve language tools. For instance, a longstanding approach to supervised learning has been to "lexicalize" the grammars—that is, to have rules that are specific to each word in the language rather than covering broad syntactic categories. Lexicalized grammars work well but are overly specific and unwieldy to use. Klein showed that the lexicalized approach can be beaten by a far simpler method that starts with a basic grammar with word categories and systematically refines it, automatically discovering linguistic features that improve the grammar. This approach turns out to be simpler, faster, and more accurate, giving what is currently the world's best supervised parser.
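The refinement idea can be sketched in a few lines. The real system learns which refined rules matter (and merges unhelpful splits back together); the fragment below, with an invented three-rule grammar, shows only the first step of splitting each coarse category into subcategories and expanding the rules over them.

```python
from itertools import product

# A coarse toy grammar, invented for illustration.
coarse_rules = [
    ("S",  ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
]

def split(symbol, k):
    """Replace one coarse category with k refined subcategories."""
    return [f"{symbol}-{i}" for i in range(k)]

def refine(rules, k=2):
    """Expand every rule over all subcategory combinations. A learning
    step (omitted here) would then discover which refined rules carry
    probability mass -- e.g. that subject and object noun phrases
    behave differently -- and discard the rest."""
    refined = []
    for parent, (left, right) in rules:
        for p, l, r in product(split(parent, k), split(left, k), split(right, k)):
            refined.append((p, (l, r)))
    return refined

print(len(refine(coarse_rules)))  # 3 rules x 2^3 subcategory choices = 24
```

Because the splits are discovered automatically from data, the grammar grows only as fine-grained as the evidence warrants, rather than carrying a separate rule for every word as lexicalized grammars do.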

Klein and his students are now applying the refined grammars from this approach to the problem of automatic language translation.