Building and Using Parallel Texts:
Data Driven Machine Translation and Beyond

HLT-NAACL 2003 Workshop, May 31, 2003

A follow-up to this workshop, again featuring a word alignment shared task, is currently taking place as an ACL 2005 workshop.


Index
  • Workshop Proceedings
  • Resources for Word Alignment
  • Workshop Program
  • Call for Papers
  • Call for Late-Breaking Papers
  • Shared Task
  • Program Committee
  • Important Dates
  • Submission instructions
  • Camera ready format
  • HLT-NAACL 2003
  • ACL
  • Workshop Proceedings

    Invited Talk

    On the Pleasure of Being Bi-textual OR My Life in Parallel Text [ppt]
    Elliott Macklovitch, Laboratoire RALI, Université de Montréal


    Word Alignment Shared Task

    An Evaluation Exercise for Word Alignment [pdf], [ps], [bib]
    Rada Mihalcea and Ted Pedersen

    ProAlign: Shared Task System Description [pdf], [ps], [bib]
    Dekang Lin and Colin Cherry

    Word Alignment Based on Bilingual Bracketing [pdf], [ps], [bib]
    Bing Zhao and Stephan Vogel

    Statistical Translation Alignment with Compositionality Constraints [pdf], [ps], [bib]
    Michel Simard and Philippe Langlais

    Reducing Parameter Space for Word Alignment [pdf], [ps], [bib]
    Herve Dejean, Eric Gaussier, Cyril Goutte and Kenji Yamada

    Word Alignment Baselines [pdf], [ps], [bib]
    John C. Henderson

    TREQ-AL: A Word Alignment System with Limited Language Resources [pdf], [ps], [bib]
    A new version describing the TREQ-AL system after bug fixes is also available: [pdf], [ps]
    Dan Tufis, Ana-Maria Barbu, Radu Ion

    The Duluth Word Alignment System [pdf], [ps], [bib]
    Bridget Thomson McInnes and Ted Pedersen

    Phrase-based Evaluation of Word-to-Word Alignments [pdf], [ps], [bib]
    Michael Carl and Sisay Fissaha

    Regular Papers

    Bootstrapping Parallel Corpora [pdf], [ps], [bib]
    Chris Callison-Burch and Miles Osborne

    Retrieving Meaning-equivalent Sentences for Example-based Rough Translation [pdf], [ps], [bib]
    Mitsuo Shimohata and Eiichiro Sumita and Yuji Matsumoto

    Word Selection for EBMT based on Monolingual Similarity and Translation Confidence [pdf], [ps], [bib]
    Eiji Aramaki and Sadao Kurohashi and Hideki Kashioka and Hideki Tanaka

    Translation Spotting for Translation Memories [pdf], [ps], [bib]
    Michel Simard

    Learning Sequence-to-Sequence Correspondences from Parallel Corpora via Sequential Pattern Mining [pdf], [ps], [bib]
    Kaoru Yamamoto and Taku Kudo and Yuta Tsuboi and Yuji Matsumoto

    Efficient Optimization for Bilingual Sentence Alignment Based on Linear Regression [pdf], [ps], [bib]
    Bing Zhao and Klaus Zechner and Stephan Vogel and Alex Waibel

    POS-Tagger for English Vietnamese Bilingual Corpus [pdf], [ps], [bib]
    Dinh Dien and Hoang Kiem

    Acquisition of English-Chinese Transliterated Word Pairs from Parallel-Aligned Texts using a Statistical Machine Transliteration Model [pdf], [ps], [bib]
    Chun-Jen Lee and Jason S. Chang

    Input Sentence Splitting and Translating [pdf], [ps], [bib]
    Takao Doi and Eiichiro Sumita


    Short Papers

    An LSA Implementation Against Parallel Texts in French and English [pdf], [ps], [bib]
    Katri A. Clodfelder

    Aligning and Using an English-Inuktitut Parallel Corpus [pdf], [ps], [bib]
    Joel Martin and Howard Johnson and Benoit Farley and Anna Maclachlan

    Comparing the Sentence Alignment Yield from Two News Corpora Using a Dictionary-Based Alignment System [pdf], [ps], [bib]
    Stephen Nightingale and Hideki Tanaka





  • Resources for Word Alignment

    These resources were used in the word alignment shared task held at the HLT/NAACL 2003 workshop on "Building and Using Parallel Texts". We are making available (1) the guidelines given to participants in this task, (2) training, trial, and test data for the English-French and Romanian-English language pairs, and (3) evaluation software for word alignment systems. Both data and software are distributed without any warranty. If you use any of these data or software, please give proper attribution (see below for the citation suggested for each data set). For questions or concerns regarding these resources, please send an email to Rada Mihalcea or Ted Pedersen.
    • Guidelines for the shared task.
    • Training data
    • Trial data (see the section below for proper attribution)
    • Test data
      • Romanian-English test data. These word alignments were created by Rada Mihalcea and Ted Pedersen. If you are using this data set, please acknowledge with the following citation:
        Rada Mihalcea and Ted Pedersen, "An Evaluation Exercise for Word Alignment". Proc. of the HLT/NAACL workshop on "Building and Using Parallel Texts: Data Driven Machine Translation and Beyond", Edmonton, Canada, May 2003.
      • English-French test data. These word alignments were created by Franz Och and Hermann Ney. If you are using this data set, please acknowledge with the following citation:
        Franz Josef Och, Hermann Ney. "Improved Statistical Alignment Models". Proc. of the 38th Annual Meeting of the Association for Computational Linguistics, pp. 440-447, Hongkong, China, October 2000.
    • Code for alignment evaluation, and for format validation of alignment files.
    • General text alignment resources

    Other Resources

    These are resources that were created and kindly made available by participants in the workshop.
    • English-Inuktitut parallel corpus. If you use this parallel corpus, please acknowledge with the following citation:
      Joel Martin, Howard Johnson, Benoit Farley, and Anna Maclachlan, "Aligning and Using an English-Inuktitut Parallel Corpus". Proc. of the HLT/NAACL workshop on "Building and Using Parallel Texts: Data Driven Machine Translation and Beyond", Edmonton, Canada, May 2003.
    • Phrase-based evaluations. Data (dictionaries) and evaluation code. If you make use of this code, please acknowledge with the following citation:
      Michael Carl and Sisay Fissaha, "Phrase-based Evaluation of Word-to-Word Alignments". Proc. of the HLT/NAACL workshop on "Building and Using Parallel Texts: Data Driven Machine Translation and Beyond", Edmonton, Canada, May 2003.
    • Japanese-English parallel text (more than 150,000 aligned sentences). Users of this corpus should acknowledge:
      Masao Utiyama and Hitoshi Isahara. "Reliable Measures for Aligning Japanese-English News Articles and Sentences", Proc. of the ACL conference, Sapporo, Japan, 2003.
      This is the corpus used in: Kaoru Yamamoto and Taku Kudo and Yuta Tsuboi and Yuji Matsumoto, "Learning Sequence-to-Sequence Correspondences from Parallel Corpora via Sequential Pattern Mining", Proc. of the HLT/NAACL workshop on "Building and Using Parallel Texts: Data Driven Machine Translation and Beyond", Edmonton, Canada, May 2003.





  • Workshop Program and Registration

    Registration is now open. You can register either online, using the HLT/NAACL registration form, or on site.

    Workshop Program (pdf)
    8:45-9:00 Welcome
    Invited Talk
    9:00-10:00 On the Pleasure of Being Bi-textual OR My Life in Parallel Text
    Elliott Macklovitch
    Word Alignment Shared Task
    10:00-10:20 An Evaluation Exercise for Word Alignment
    Rada Mihalcea and Ted Pedersen
    10:20-10:30 ProAlign: Shared Task System Description
    Dekang Lin and Colin Cherry
    10:30-11:00 Break
    11:00-11:10 Word Alignment Based on Bilingual Bracketing
    Bing Zhao and Stephan Vogel
    11:10-11:20 Statistical Translation Alignment with Compositionality Constraints
    Michel Simard and Philippe Langlais
    11:20-11:30 Reducing Parameter Space for Word Alignment
    Herve Dejean, Eric Gaussier, Cyril Goutte and Kenji Yamada
    11:30-11:40 Word Alignment Baselines
    John C. Henderson
    11:40-11:50 The Duluth Word Alignment System
    Bridget Thomson McInnes and Ted Pedersen
    Regular Papers
    11:50-12:10 Bootstrapping Parallel Corpora
    Chris Callison-Burch and Miles Osborne
    12:10-12:30 Retrieving Meaning-equivalent Sentences for Example-based Rough Translation
    Mitsuo Shimohata and Eiichiro Sumita and Yuji Matsumoto
    12:30-2:00 Lunch
    2:00-2:20 Word Selection for EBMT based on Monolingual Similarity and Translation Confidence
    Eiji Aramaki and Sadao Kurohashi and Hideki Kashioka and Hideki Tanaka
    2:20-2:40 Translation Spotting for Translation Memories
    Michel Simard
    2:40-3:00 Learning Sequence-to-Sequence Correspondences from Parallel Corpora via Sequential Pattern Mining
    Kaoru Yamamoto and Taku Kudo and Yuta Tsuboi and Yuji Matsumoto
    3:00-3:20 Efficient Optimization for Bilingual Sentence Alignment Based on Linear Regression
    Bing Zhao and Klaus Zechner and Stephan Vogel and Alex Waibel
    3:20-4:00 Break
    4:00-4:20 Acquisition of English-Chinese Transliterated Word Pairs from Parallel-Aligned Texts using a Statistical Machine Transliteration Model
    Chun-Jen Lee and Jason S. Chang
    4:20-4:40 Input Sentence Splitting and Translating
    Takao Doi and Eiichiro Sumita
    Short Papers
    4:40-4:55 An LSA Implementation Against Parallel Texts in French and English
    Katri A. Clodfelder
    4:55-5:10 Aligning and Using an English-Inuktitut Parallel Corpus
    Joel Martin and Howard Johnson and Benoit Farley and Anna Maclachlan
    5:10-5:25 Comparing the Sentence Alignment Yield from Two News Corpora Using a Dictionary-Based Alignment System
    Stephen Nightingale and Hideki Tanaka


    Invited Talk

    On the Pleasure of Being Bi-textual OR My Life in Parallel Text
    Elliott Macklovitch
    Laboratoire RALI
    Université de Montréal

    Beginning with the predicate translate and certain basic properties of the translation relation, I will propose a definition of the notion of bi-text and then go on to provide a brief history of the young branch of computational linguistics that focuses on the exploitation of parallel texts. The remainder of the talk will highlight some of the more important work conducted over the last ten years on the tasks of building and exploiting parallel corpora. Most of the discussion will involve applications that target translation automation, particularly those applications developed by our group in Montreal, but I will also touch on the uses of parallel text for non-translation related tasks, such as word-sense disambiguation.








    Call for Papers

    The goal of this workshop is to provide a forum for researchers working on problems related to the creation and use of parallel text. Recent events have demonstrated once again the importance of inter-language communication and reinforced the need for advances in machine translation (MT) and multi-lingual processing tools.

    The workshop will be centered around the problem of building and using parallel corpora, which are vital resources for efficiently deriving multi-lingual text processing tools. In addition to regular papers, the workshop also includes a shared task that will result in a comparative evaluation of word alignment techniques.

    We invite submissions of papers addressing any of the following issues:
    • Construction of parallel corpora, including the automatic identification and harvesting of parallel corpora from the Web
    • Methods to evaluate the quality of parallel corpora and word alignments
    • Tools for processing parallel corpora, including automatic sentence alignment, word alignment, phrase alignment, detection of omissions and gaps in translations, and others
    • Using parallel corpora for data driven Machine Translation
    • Using parallel corpora for the derivation of language processing tools in new languages
    • Using parallel corpora for automatic corpus annotation
    • Language learning applied to parallel corpora
    • Translation memory systems as a source of aligned corpora
    While we invite submissions addressing any of the above topics, or related issues, we particularly welcome work involving parallel corpora addressing languages with scarce resources.

    We expect to make arrangements with a journal in Natural Language Processing or Computational Linguistics for a special issue that will include selected papers from this workshop.

    Invited Speaker

    Elliott Macklovitch, University of Montreal






  • Call for Late-Breaking (Short) Papers

    Short (Late-Breaking) Papers at the HLT-NAACL Workshop on Building and Using Parallel Text provide a venue for authors to present late-breaking results. Submitted Short Papers will be carefully evaluated on the basis of originality, significance, technical soundness, and clarity of exposition.

    Short late-breaking papers are due on April 1, 2003, at 5pm your local time.

    Short papers are restricted to 4 pages in length. They must be submitted in camera-ready format; see www.hlt-naacl03.org/format.html. Short papers that are not in PDF or are incorrectly formatted may be rejected on that basis. Authors are strongly encouraged to use the LaTeX style files or MSWord equivalents available on the website.

    Submissions must describe original, completed, unpublished work, and include concrete evaluation results when appropriate. See full paper submission information for topics of interest.

    Reviewing will not be blind. Because we need camera-ready formatted papers, authors must include their identifying information on the paper. This is to accommodate the late-breaking format; we need time to review the papers and get the accepted papers to the printer in time.

    Note that the notification date for acceptance and rejection is April 7, with the final camera-ready copy due on April 10.

    See below for detailed submission instructions.






  • Shared Task

    All researchers who have a word alignment system available are invited to participate in the shared task, individually or as part of a team.

    Participants in the shared task will be provided with common sets of training data, consisting of Romanian-English and French-English parallel texts. Participants will be given approximately one month to train their systems with this data, and then previously held out test data will be released. Participants will run their alignment system on this test data and submit their results, which will be evaluated using a common set of metrics.
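
    The metrics themselves are not listed on this page. Purely as an illustration, the sketch below computes precision, recall, F-measure, and alignment error rate (AER) over sure and probable gold links, in the style of Och and Ney (2000). The function name and the representation of links as (source index, target index) pairs are assumptions made for this example; they are not the format or interface of the official evaluation code.

```python
def alignment_scores(predicted, sure, probable):
    """Score predicted links against gold sure (S) and probable (P) links.

    Each argument is a collection of (source_index, target_index) pairs.
    Sure links are treated as a subset of probable links, so P is
    widened to include S before scoring.
    """
    a = set(predicted)
    s = set(sure)
    p = set(probable) | s
    if not a or not s:
        raise ValueError("need at least one predicted link and one sure link")

    precision = len(a & p) / len(a)  # predicted links that are at least probable
    recall = len(a & s) / len(s)     # sure links that were recovered
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    # AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "aer": aer,
    }


# Hypothetical three-word sentence pair, for illustration only.
example = alignment_scores(
    predicted={(1, 1), (2, 3), (3, 4)},
    sure={(1, 1), (2, 2)},
    probable={(3, 4)},
)
```

    In this sketch a predicted link counts toward precision if it matches any gold link, sure or probable, but only sure links count toward recall, so a system is not penalized for skipping links the annotators themselves marked as uncertain.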

    Everyone interested in the shared task is invited to join the workshop mailing list at http://groups.yahoo.com/group/wpt03/ (the list is open to anyone interested in word alignment, whether or not they participate in the shared task). A list of general text alignment resources is also provided.

    The registration form is now available here. All active participants who intend to submit word alignment results by March 25 are required to register. During the test period (March 18 - March 25), test data will be released only to registered participants!

    Last day to register for participation in the shared task: March 21.

    The results submission form is available here.

    Shared task timetable

    Activity                        Availability
    Complete guidelines             February 7
    Training data                   February 14
    Trial data                      February 14
    Test data                       March 18
    Submission of results           March 25
    Results back to participants    March 28
    Submission of short papers      April 1


    Data and Resources







    Organization Committee

    Rada Mihalcea, University of North Texas
    Ted Pedersen, University of Minnesota, Duluth

    Program Committee

    Lars Ahrenberg, Linkoping University
    Nicoletta Calzolari, University of Pisa
    Tim Chklovski, Massachusetts Institute of Technology
    Mona Diab, University of Maryland
    Ulrich Germann, University of Southern California / Information Sciences Institute
    Daniel Gildea, University of Pennsylvania
    Maria das Gracas Volpe Nunes, University of Sao Paulo
    Nancy Ide, Vassar College
    Philippe Langlais, University of Montreal
    Lucia Helena Machado Rino, Federal University of Sao Carlos
    Eduard Hovy, University of Southern California / Information Sciences Institute
    Elliott Macklovitch, University of Montreal
    Daniel Marcu, University of Southern California / Information Sciences Institute
    Dan Melamed, New York University
    Magnus Merkel, Linkoping University
    Ruslan Mitkov, University of Wolverhampton
    Hermann Ney, RWTH Aachen
    Grace Ngai, Hong Kong Polytechnic University
    Franz Och, University of Southern California / Information Sciences Institute
    Kemal Oflazer, Sabanci University
    Kishore Papineni, IBM
    Jessie Pinkham, Microsoft Research
    Andrei Popescu-Belis, ISSCO/TIM/ETI University of Geneva
    Florence Reeder, MITRE
    Philip Resnik, University of Maryland
    Antonio Ribeiro, European Commission, Joint Research Centre, Italy
    Michel Simard, University of Montreal
    Harold Somers, University of Manchester Institute of Science and Technology
    Arturo Trujillo, Canon Research Centre Europe
    Dan Tufis, RACAI, Romania
    Jean Veronis, University of Provence
    Clare Voss, Army Research Lab
    Yorick Wilks, University of Sheffield






  • Submission instructions

    Submissions should consist of regular full papers of max. 7 pages, or late-breaking short papers of max. 4 pages, formatted following the HLT-NAACL 2003 guidelines. In addition, teams participating in the word alignment shared task are invited to submit short papers (max. 4 pages) describing their systems and/or evaluation methodology.

    Send your submission (a PDF file) to both:

    Rada Mihalcea
    University of North Texas
    rada@cs.unt.edu

    and

    Ted Pedersen
    University of Minnesota, Duluth
    tpederse@d.umn.edu

    Important dates

    • Deadline for regular paper submissions: March 10
    • Deadline for shared task registration: March 21
    • Deadline for results submissions: March 25 (shared task)
    • Deadline for late-breaking short papers: April 1
    • Deadline for short paper submissions: April 1 (shared task)
    • Notification of acceptance for regular papers: April 1
    • Deadline for camera-ready papers: April 10

  • Camera ready format

    Camera ready papers should follow these guidelines:
    • 10pt Times-Roman font for the body of the text
    • No page numbers
    • Standard US-letter size, not A4
    • Authors' names and affiliations appear on the paper
    • PDF format
    • Regular papers are restricted to 8 pages; short/late-breaking papers are restricted to 4 pages.
    • Follow the HLT/NAACL formatting guidelines.
    Deadline for sending in your camera-ready paper is April 10.
    Send your camera-ready paper (a PDF file) to both:

    Rada Mihalcea
    University of North Texas
    rada@cs.unt.edu

    and

    Ted Pedersen
    University of Minnesota, Duluth
    tpederse@d.umn.edu

    Authors of accepted papers should also fill in, sign, and fax this copyright form to the attention of Ted Pedersen, fax (218) 726-8240. The copyright form should reach us no later than April 30, 2003.