Electrical Engineering and Computer Sciences


UC Berkeley


2009 Research Summary

Jointly Constraining Parsing and Word Alignment on Bitexts


David Burkett, John Blitzer and Daniel Klein

Most modern systems for syntactic machine translation require training data in the form of a bitext with word alignments and syntactic parses of one or both sides. Typically, word alignments and parses are generated in a preprocessing phase using independent word aligners and monolingual parsers. However, word alignments and parses are not, in fact, independent, and so it should be possible to improve both by imposing some system of mutual constraints.

Joint Parsing

Recently, we developed a model for jointly parsing a bitext using features derived from a pair of baseline monolingual parsers, from the candidate parses themselves, and from the posterior probabilities of a standard word alignment model over the sentence pairs. The key intuition is shown in the example below, where a state-of-the-art English parser has chosen an incorrect structure (a) that is incompatible with the (correctly chosen) output of a comparable Chinese parser.
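The summary does not give the model's exact form, but its core operation can be sketched as a log-linear reranker over pairs drawn from the two monolingual k-best lists. Here `extract_features` is a hypothetical helper standing in for the real feature functions (parser scores, alignment posteriors, etc.):

```python
def score_parse_pair(weights, features):
    """Log-linear score: weighted sum of the pair's feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def best_parse_pair(en_candidates, zh_candidates, weights, extract_features):
    """Rerank every (English, Chinese) pair drawn from the two k-best lists."""
    return max(
        ((en, zh) for en in en_candidates for zh in zh_candidates),
        key=lambda pair: score_parse_pair(weights, extract_features(*pair)),
    )
```

Because the candidate lists are small (e.g., 100-best per side), exhaustively scoring all pairs is feasible in this sketch; the actual system may prune or factor this search.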

Our model learns the appropriate correspondences between languages by inducing a latent alignment between tree structures. It is trained by iteratively finding the optimal tree alignment for each pair of candidate parses and then optimizing feature weights under those alignments. Using this technique, we are able to improve F1 by 1.8 points on in-domain Chinese sentences and by 2.5 points on out-of-domain English sentences. Furthermore, by using our joint parsing model to preprocess the input to a syntactic MT system, we are able to improve BLEU by 2.4 points over the same system trained with parses from our baseline monolingual parsers [1].
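The alternating training described above might be sketched as follows. The summary does not specify the optimizer, so a simple perceptron-style update stands in for it, and `featurize` is a hypothetical helper that folds the latent tree-alignment search into feature extraction (returning each pair's features under its best alignment given the current weights):

```python
def score(weights, feats):
    """Log-linear score under the current weights."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def perceptron_update(weights, gold_feats, guess_feats, lr=1.0):
    """Move weights toward the gold pair's features, away from the guess's."""
    new = dict(weights)
    for k, v in gold_feats.items():
        new[k] = new.get(k, 0.0) + lr * v
    for k, v in guess_feats.items():
        new[k] = new.get(k, 0.0) - lr * v
    return new

def train(training_data, featurize, num_iters=5):
    """training_data: list of (candidate_pairs, gold_pair).

    featurize(pair, weights) returns the pair's features under its optimal
    latent tree alignment given the current weights (hypothetical helper)."""
    weights = {}
    for _ in range(num_iters):
        for candidates, gold in training_data:
            # Step 1: pick the model's current best pair (alignment search
            # is hidden inside featurize).
            guess = max(candidates,
                        key=lambda p: score(weights, featurize(p, weights)))
            # Step 2: update weights when the model's guess is wrong.
            if guess != gold:
                weights = perceptron_update(
                    weights, featurize(gold, weights), featurize(guess, weights))
    return weights
```

The alternation is the key point: the latent alignments and the feature weights each depend on the other, so neither can be fixed in advance.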

Constraining Parsing and Word Alignment

We are currently investigating methods for training models that incorporate constraints between parses on both sides of a bitext and an alignment between the words of the sentences (and possibly the tree structures in the candidate parses). We look forward to presenting these results soon.

Figure 1: Two possible parse pairs for a Chinese-English sentence pair. The parses in (a) are chosen by independent monolingual statistical parsers, but only the Chinese side is correct. The gold English parse shown in (b) is further down in the 100-best list, despite being more consistent with the gold Chinese parse. The circles show where the two parses differ. Note that in (b), the ADVP and PP nodes correspond nicely to Chinese tree nodes, whereas the correspondence for nodes in (a), particularly the SBAR node, is less clear.

[1] D. Burkett and D. Klein, "Two Languages are Better than One (for Syntactic Parsing)," Proceedings of EMNLP, 2008.