Unsupervised Word Alignments for Machine Translation
John Sturdy DeNero, Percy Shuo Liang, Daniel Klein and Ben Taskar
National Science Foundation
Word alignment, a component of machine translation, is the task of identifying which words correspond to one another in the sentence pairs of a sentence-aligned parallel corpus. We have developed state-of-the-art supervised techniques, as well as unsupervised techniques that are competitive even with supervised methods. More recently, we have worked to correct pathological alignment errors that affect syntactic translation.
We recently released the BerkeleyAligner, an unsupervised word aligner, as an open-source project.
The core innovation we have explored is a training method we call cross-EM, which jointly trains two conditional alignment models (the classic HMM word alignment models, one for each translation direction) and propagates information between them. Our joint training technique reduced alignment error rate (AER) by 32%, handily outperforming GIZA++. Software to train and decode these models appears below.
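To illustrate what propagating information between two directional models can look like, here is a minimal sketch. It assumes each model has already produced a matrix of link posteriors over the same sentence pair, and it combines them by elementwise product so that only links both models support keep a high score. The function names, the matrix layout, and the threshold value are hypothetical choices for illustration; this is not the BerkeleyAligner API, and it omits the HMM inference and EM updates entirely.

```python
import numpy as np


def agreement_posteriors(post_ef, post_fe):
    """Combine two directional link posteriors by elementwise product.

    post_ef[i, j] and post_fe[i, j] are both assumed to score the same
    link (source word i, target word j), each under a different model.
    """
    post_ef = np.asarray(post_ef, dtype=float)
    post_fe = np.asarray(post_fe, dtype=float)
    if post_ef.shape != post_fe.shape:
        raise ValueError("posterior matrices must have the same shape")
    return post_ef * post_fe


def decode_links(posteriors, threshold=0.5):
    """Keep every link whose combined score reaches the threshold."""
    rows, cols = np.nonzero(np.asarray(posteriors) >= threshold)
    return {(int(i), int(j)) for i, j in zip(rows, cols)}


# Toy posteriors for a two-word sentence pair (made-up numbers).
post_ef = [[0.9, 0.1], [0.2, 0.8]]
post_fe = [[0.8, 0.3], [0.1, 0.9]]
links = decode_links(agreement_posteriors(post_ef, post_fe), threshold=0.5)
```

In this toy case only the two diagonal links survive, because each off-diagonal link receives a low score from at least one of the two models.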
Syntax-Sensitive Alignments
In syntax-based translation systems, the training corpus is typically parsed and aligned independently. Thus, alignment errors can violate the constituent structure of the parse tree.
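One way to make a constituent violation concrete is the following sketch. It assumes an alignment is a set of (English index, foreign index) links and a constituent is a half-open span of English word positions, and it flags links that attach an English word outside the constituent to a foreign word inside the range the constituent covers. The function name and data layout are hypothetical; this is a common consistency check, not necessarily the criterion used in our model or decoding heuristic.

```python
def violating_links(links, span):
    """Return links that break the given English constituent.

    links: set of (english_index, foreign_index) pairs.
    span:  (start, end) half-open range of English word positions.
    """
    start, end = span
    # Foreign positions linked to words inside the constituent.
    covered = [f for e, f in links if start <= e < end]
    if not covered:
        return set()
    lo, hi = min(covered), max(covered)
    # A link violates the constituent if an outside English word
    # reaches into the foreign range that the constituent covers.
    return {
        (e, f)
        for e, f in links
        if not (start <= e < end) and lo <= f <= hi
    }


# English words 1 and 2 form a constituent; word 3 aligns into its range.
example_links = {(0, 0), (1, 1), (2, 3), (3, 2)}
bad = violating_links(example_links, span=(1, 3))
```

Here the link (3, 2) is flagged: the constituent covers foreign positions 1 through 3, yet English word 3 lies outside the constituent and aligns to foreign position 2.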
We have developed an unsupervised model and a decoding heuristic that together eliminate such errors. Software based on this work also appears below.
- The BerkeleyAligner includes software for cross-EM training of unsupervised models, syntax-sensitive distortion models, and a suite of decoding heuristics.
- Cross-EM Aligner: an unsupervised word aligner, written in Java, based on the "Alignment by Agreement" paper.
- P. Liang, B. Taskar, and D. Klein, "Alignment by Agreement," Proceedings of NAACL, 2006.
- J. DeNero and D. Klein, "Tailoring Word Alignments to Syntactic Machine Translation," Proceedings of ACL, 2007.