University of Sussex

Finding Paraphrases for Dialogue Utterances Using a Multilingual Parallel Movie Subtitle Corpus

Speaker

Rob Koeling

Affiliation

Sussex

Abstract

I will describe a method for finding paraphrases for common dialogue utterances using a multilingual corpus of movie subtitles. Paraphrases are found by 1) finding (potential) translations of an utterance in the corpus and 2) translating these translations back into the original language. The results of the second step are the potential paraphrases. Even though the basic model produces good results, we show that a few simple constraints reduce the probability of most noisy candidates to such an extent that a simple threshold becomes very effective at removing them, while retaining a wide variety of good paraphrases. Similar methods have been proposed in the Machine Translation literature, but we improve on them by exploiting the multilingual nature of the corpus. Cross-checking across languages allows us to formulate consistency constraints, which prove to be very effective.
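The two-step procedure can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the model presented in the talk: the toy translation tables and their probabilities are invented, the pivot score (summing p(translation | utterance) times p(candidate | translation) over all translations) follows the formulation commonly used in the Machine Translation literature, and the consistency constraint is approximated by requiring that a candidate be reached through a minimum number of pivot languages. The names FORWARD, BACKWARD, pivot_scores and paraphrases are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy tables (invented numbers, not corpus estimates).
# FORWARD[lang][utterance][translation] = p(translation | utterance)
FORWARD = {
    "de": {"thank you": {"danke": 0.8, "vielen dank": 0.2}},
    "fr": {"thank you": {"merci": 0.9, "je vous en prie": 0.1}},
}
# BACKWARD[lang][translation][candidate] = p(candidate | translation)
BACKWARD = {
    "de": {
        "danke": {"thank you": 0.6, "thanks": 0.4},
        "vielen dank": {"thank you": 0.5, "thanks a lot": 0.5},
    },
    "fr": {
        "merci": {"thank you": 0.5, "thanks": 0.5},
        "je vous en prie": {"you are welcome": 1.0},
    },
}


def pivot_scores(utterance, lang):
    """Steps 1 and 2: translate into `lang`, then back into the source language."""
    scores = defaultdict(float)
    for foreign, p_forward in FORWARD[lang].get(utterance, {}).items():
        for candidate, p_back in BACKWARD[lang].get(foreign, {}).items():
            if candidate != utterance:  # the utterance itself is not a paraphrase
                scores[candidate] += p_forward * p_back
    return dict(scores)


def paraphrases(utterance, min_languages=2, threshold=0.1):
    """Keep candidates supported by enough pivot languages and a high enough score."""
    per_lang = {lang: pivot_scores(utterance, lang) for lang in FORWARD}
    candidates = set().union(*per_lang.values())
    result = {}
    for candidate in candidates:
        found = [s[candidate] for s in per_lang.values() if candidate in s]
        if len(found) < min_languages:  # assumed cross-language consistency check
            continue
        score = sum(found) / len(per_lang)  # average over all pivot languages
        if score >= threshold:
            result[candidate] = score
    return result


print(paraphrases("thank you"))
```

With these toy tables, "you are welcome" is reachable only through French and "thanks a lot" only through German, so the consistency check discards both before the threshold is applied, while "thanks" is supported by both languages and is kept.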

The method is characterized by the fact that it is:

Unsupervised: we don't need manually annotated data in order to train a model to produce the results.

Applicable to any common dialogue utterance: even though we focused on the queries that were used in a previous stage of the project, we showed that with minimal effort additional queries can be handled as well.

Not restricted to English: this subtitle corpus makes it possible to formulate queries in different languages and produce paraphrases for these languages.

Potentially applicable to a wider range of paraphrasing problems. Therefore it might well be of interest to researchers outside the dialogue field (e.g. machine translation).

