Learning and Inference for Hierarchically Split Probabilistic Context-Free Grammars
Slav Orlinov Petrov and Daniel Klein
Parsing is the task of analyzing the grammatical structure of natural language. Given a sequence of words, a parser identifies units such as subjects, verbs, and objects, and determines the relations between these units according to some grammar formalism. Our work has focused on learning probabilistic context-free grammars (PCFGs) and then using these grammars to assign a sequence of words its most likely parse tree.
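To make the task concrete, here is a toy sketch of finding the most likely parse under a PCFG with the standard CKY algorithm. The grammar, lexicon, and rule probabilities below are illustrative assumptions, not the grammars learned in the paper:

```python
import math

# Toy PCFG in Chomsky normal form; probabilities are made up for illustration.
BINARY = {  # parent -> list of ((left child, right child), probability)
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("Det", "N"), 1.0)],
    "VP": [(("V", "NP"), 1.0)],
}
LEXICON = {  # preterminal -> {word: probability}
    "Det": {"the": 0.5, "a": 0.5},
    "N":   {"dog": 0.5, "cat": 0.5},
    "V":   {"saw": 1.0},
}

def cky(words):
    """Fill a CKY chart: chart[i][j][sym] = (best log-prob, backpointer)."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # width-1 spans: lexicon
        for tag, dist in LEXICON.items():
            if w in dist:
                chart[i][i + 1][tag] = (math.log(dist[w]), w)
    for span in range(2, n + 1):                       # wider spans: binary rules
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, rules in BINARY.items():
                    for (l, r), p in rules:
                        if l in chart[i][k] and r in chart[k][j]:
                            score = (math.log(p)
                                     + chart[i][k][l][0]
                                     + chart[k][j][r][0])
                            if parent not in chart[i][j] or score > chart[i][j][parent][0]:
                                chart[i][j][parent] = (score, (k, l, r))
    return chart

def build(chart, sym, i, j):
    """Follow backpointers to recover the most likely tree as nested tuples."""
    _, bp = chart[i][j][sym]
    if isinstance(bp, str):                            # lexical entry
        return (sym, bp)
    k, l, r = bp
    return (sym, build(chart, l, i, k), build(chart, r, k, j))

words = "the dog saw a cat".split()
chart = cky(words)
tree = build(chart, "S", 0, len(words))
```

Running this recovers the expected bracketing, with "the dog" as the subject NP and "saw a cat" as the VP.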
Learning a refined grammar can be seen as a search for an optimal grammar consistent with a coarse training set (treebank). We describe a method in which a minimal initial grammar is hierarchically refined using an EM algorithm, yielding accurate, compact, and interpretable grammars. The resulting grammars are extremely compact compared to those of other high-performance parsers, yet our parser achieves the best published accuracies on several languages, as well as the best generative parsing results on English. In addition, we give an associated coarse-to-fine inference scheme that vastly reduces inference time with no loss in test set accuracy.
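One round of the hierarchical refinement can be sketched as follows: every nonterminal is split into two subsymbols, each original rule's probability mass is divided evenly among the refined combinations, and a small random jitter breaks the symmetry so a subsequent EM pass can specialize the subsymbols. This is a minimal sketch of the split step only, with an assumed rule representation; EM retraining (and the paper's merging of useless splits) is omitted:

```python
import random

def split_grammar(rules, seed=0):
    """One round of symbol splitting: every nonterminal X becomes X-0 and X-1.

    `rules` maps a binary rule (parent, left, right) to its probability.
    Each rule X -> Y Z with probability p spawns eight refined rules
    (two subsymbols per parent, four child combinations per parent
    subsymbol); the mass p is divided evenly, then jittered by ~1 percent
    to break symmetry before EM retraining (not shown) specializes them.
    """
    rng = random.Random(seed)
    refined = {}
    for (parent, left, right), p in rules.items():
        for i in range(2):
            for j in range(2):
                for k in range(2):
                    noise = 1 + 0.01 * (rng.random() - 0.5)
                    refined[(f"{parent}-{i}", f"{left}-{j}", f"{right}-{k}")] = p / 4 * noise
    return refined

refined = split_grammar({("S", "NP", "VP"): 1.0})
```

Each parent subsymbol's outgoing mass stays close to the original rule's probability; the jitter is small enough that the next EM step can renormalize and then pull the subsymbols apart.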
This approach is applicable more broadly to other problems in which, in a similar way, the observed training structure is coarser than the true underlying process.
- S. Petrov and D. Klein, "Improved Inference for Unlexicalized Parsing," Proceedings of HLT-NAACL, 2007.
- S. Petrov and D. Klein, "Learning and Inference for Hierarchically Split PCFGs," Proceedings of AAAI (Nectar Track), 2007.