Electrical Engineering
      and Computer Sciences

Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Syntactic Agreement in Bilingual Corpora

David Burkett

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2012-187
August 15, 2012

http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-187.pdf

The task of automatic machine translation (MT) is the focus of a huge variety of active research efforts, both because of the intrinsic utility of this difficult task, and the theoretical and linguistic insights that arise from modeling relationships between natural languages. However, MT systems that leverage syntactic information are only recently becoming practical, and in a typical system of this sort, syntactic information is generated by monolingual parsers; the task of explicitly modeling syntactic relationships between target and source languages is yet to be fully explored. This thesis investigates the problem of finding syntactic parse trees of target and/or source sentences that are more appropriate for use in a syntactic MT system. Two basic methodologies are explored. First, we present a sequence of two statistical models that leverage bilingual information to improve the linguistic quality of syntactic parses, as measured by their ability to replicate human-generated gold-standard annotations. The first model uses word to word alignments as an external source of information, while the second models the alignments jointly. These models are both quite effective at improving the intrinsic quality of the parse trees, and the second model additionally improves word alignment performance. However, while the two models achieve similar parsing improvements, we find that improving parses in conjunction with word alignments is much more helpful for the downstream machine translation task. In the next part of the thesis, we explore this finding further by investigating the effects on MT performance of agreement between parse trees and word alignments. We present a simple method for transforming input trees in a way that ignores gold-standard annotations, concentrating instead on improving syntactic agreement directly. In experiments, we find that though we obviously lose fidelity to more linguistically informed treebank annotation guidelines, this transformation-based approach yields the strongest improvements in syntactic machine translation.

Advisor: Daniel Klein


BibTeX citation:

@phdthesis{Burkett:EECS-2012-187,
    Author = {Burkett, David},
    Title = {Syntactic Agreement in Bilingual Corpora},
    School = {EECS Department, University of California, Berkeley},
    Year = {2012},
    Month = {Aug},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-187.html},
    Number = {UCB/EECS-2012-187},
    Abstract = {The task of automatic machine translation (MT) is the focus of a huge variety of active research efforts, both because of the intrinsic utility of this difficult task, and the theoretical and linguistic insights that arise from modeling relationships between natural languages. However, MT systems that leverage syntactic information are only recently becoming practical, and in a typical system of this sort, syntactic information is generated by monolingual parsers; the task of explicitly modeling syntactic relationships between target and source languages is yet to be fully explored.

This thesis investigates the problem of finding syntactic parse trees of target and/or source sentences that are more appropriate for use in a syntactic MT system. Two basic methodologies are explored.

First, we present a sequence of two statistical models that leverage bilingual information to improve the linguistic quality of syntactic parses, as measured by their ability to replicate human-generated gold-standard annotations. The first model uses word to word alignments as an external source of information, while the second models the alignments jointly. These models are both quite effective at improving the intrinsic quality of the parse trees, and the second model additionally improves word alignment performance. However, while the two models achieve similar parsing improvements, we find that improving parses in conjunction with word alignments is much more helpful for the downstream machine translation task.

In the next part of the thesis, we explore this finding further by investigating the effects on MT performance of <em>agreement</em> between parse trees and word alignments. We present a simple method for transforming input trees in a way that ignores gold-standard annotations, concentrating instead on improving syntactic agreement directly. In experiments, we find that though we obviously lose fidelity to more linguistically informed treebank annotation guidelines, this transformation-based approach yields the strongest improvements in syntactic machine translation.}
}

EndNote citation:

%0 Thesis
%A Burkett, David
%T Syntactic Agreement in Bilingual Corpora
%I EECS Department, University of California, Berkeley
%D 2012
%8 August 15
%@ UCB/EECS-2012-187
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-187.html
%F Burkett:EECS-2012-187