Defense Event

Integrating Syntax and Word Alignment in Syntax-Based Machine Translation

Victoria Li Fossum

Friday, April 02, 2010
11:30am - 1:30pm
School of Education Room 2327

About the Event

Training a string-to-tree syntax-based statistical machine translation system to translate from a source language (e.g., Chinese or Arabic) into a target language (e.g., English) requires the following resources: a parallel corpus (a large set of example sentences in the source language that have been translated into the target language by a human); a word alignment (a word-to-word correspondence for each source-target sentence pair); and a parse tree (a syntactic representation) of each sentence in the target language. From these training examples, the system learns to translate source-language sequences of words into target-language trees.

To ensure broad coverage, the parallel corpus of training examples must be sufficiently large (on the order of millions of sentence pairs). Manually annotating such large corpora would be prohibitively time-consuming. Instead, these corpora must be word-aligned and parsed automatically.

There are two problems with existing approaches to automatic word alignment and parsing for syntax-based machine translation. First, these processes are noisy and introduce errors that impact translation quality. Second, these processes are typically performed independently of one another. Because each process produces constraints that can be used to guide the other, integrating them more closely can be expected to improve the accuracy of each.

In this thesis, we address these two problems as follows: first, we improve upon the accuracy of a state-of-the-art parser; second, we use word alignments to improve parse accuracy; third, we use parses to improve word alignment accuracy; and fourth, we optimize parses and word alignments simultaneously. We examine the impact of each of these methods upon parse quality, alignment quality, and translation quality in a downstream syntax-based machine translation system.

Our results demonstrate that more closely integrating word alignment and syntactic parsing can indeed improve the accuracy of each process and, in some cases, lead to an improvement in translation quality relative to a state-of-the-art syntax-based statistical machine translation system.
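
As a rough illustration of how a word alignment and a target-side parse constrain one another, the sketch below represents one training example and checks which parse constituents are consistent with the alignment, using the standard span-consistency test from phrase extraction. It is not taken from the thesis: the names TrainingExample, is_consistent, and consistent_constituents are hypothetical, and the source tokens "s0" and "s1" are placeholders rather than real source-language words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    """One sentence pair with its word alignment and target-side parse (hypothetical layout)."""
    source: tuple          # source-language words
    target: tuple          # target-language words
    alignment: frozenset   # links as (source_index, target_index)
    constituents: tuple    # parse constituents as (label, start, end), end exclusive


def is_consistent(example, start, end):
    """Check whether the target span [start, end) is consistent with the alignment.

    The span is consistent if at least one link touches it and every source
    word inside the projected source range links only to target words that
    fall inside the span.
    """
    src = {s for (s, t) in example.alignment if start <= t < end}
    if not src:
        return False
    lo, hi = min(src), max(src)
    return all(start <= t < end
               for (s, t) in example.alignment
               if lo <= s <= hi)


def consistent_constituents(example):
    """Return the parse constituents whose spans are consistent with the alignment."""
    return [c for c in example.constituents
            if is_consistent(example, c[1], c[2])]


# Toy example with placeholder source tokens (an assumption, not real data).
constituents = (("NP", 0, 2), ("VP", 2, 3), ("S", 0, 3))
clean = TrainingExample(
    source=("s0", "s1"),
    target=("the", "cat", "sleeps"),
    alignment=frozenset({(0, 0), (0, 1), (1, 2)}),
    constituents=constituents,
)
# Same sentence pair with one spurious link added: s1 -> "cat".
noisy = TrainingExample(
    source=clean.source,
    target=clean.target,
    alignment=clean.alignment | {(1, 1)},
    constituents=constituents,
)

print(consistent_constituents(clean))
print(consistent_constituents(noisy))
```

In this toy case a single spurious alignment link makes the NP and VP constituents inconsistent, leaving only the full-sentence span. That is one simple way to see why errors in either annotation can affect what a syntax-based system is able to learn, and why each annotation can serve as a constraint on the other.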

Additional Information

Sponsor(s): Steven P. Abney

Open to: Public