University of Sussex

Unsupervised Bidirectional Estimation for Noisy-Channel Models

Speaker

Khalil Simaan

Affiliation

Amsterdam

Abstract

Shannon's Noisy-Channel model, which describes how a corrupted message might be reconstructed, has been the cornerstone of much work in statistical language and speech processing. The model factors into two components: a language model that characterizes the original message and a channel model that describes the channel's corruptive process. The data for training this model consist of a pair of corpora, one of messages and the other of observations (in parallel corpora, the messages and observations are aligned pairwise).
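
The two-component factorization described above can be written out with Bayes' rule. The symbols m (message) and o (observation) are notation chosen here for illustration and do not come from the talk itself.

```latex
% Assumed notation: m = original message, o = corrupted observation.
% P(o) does not depend on m, so it drops out of the maximization.
\hat{m} \;=\; \arg\max_{m} P(m \mid o)
        \;=\; \arg\max_{m} \frac{P(m)\, P(o \mid m)}{P(o)}
        \;=\; \arg\max_{m}\;
              \underbrace{P(m)}_{\text{language model}}\;
              \underbrace{P(o \mid m)}_{\text{channel model}}
```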

The standard approach to estimating the parameters of the channel model is unsupervised Maximum-Likelihood estimation on the observation data, usually approximated using the Expectation-Maximization (EM) algorithm. Under the EM algorithm the model parameters are fitted only to data from one side of the channel: the language model parameters depend solely on data from the message side, and the channel model parameters are chosen to maximize the likelihood of the data from the observation side alone. However, the Noisy-Channel model can be formulated in two directions, with each side of the data serving in turn as the message side while the other serves as the observation side. Because of weak language models, asymmetric channel models and sparse data, estimating these two directional models with EM often leads to suboptimal estimates. In this work we show that it is better to maximize the likelihood of the total data *at both ends of the noisy-channel* under a single set of parameters that governs both directional models. We derive a corresponding bidirectional EM algorithm and show that it gives better performance than standard EM on three tasks: (1) word-based translation by estimating a probabilistic lexicon trained on non-parallel corpora, (2) adaptation of a part-of-speech tagger between related languages, and (3) last-minute results on word alignment under the IBM models and the commonly used HMM model (Giza++).
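
The contrast between one-directional and two-directional estimation can be sketched in symbols. This is only one plausible reading of the abstract, not the speaker's actual formulation: the corpora M (messages) and O (observations), the shared parameter vector θ, and the simple sum of the two log-likelihoods are all assumptions made here for illustration.

```latex
% Assumed notation: M = message-side corpus, O = observation-side corpus,
% \theta = channel parameters, \theta' = parameters of the reverse channel.

% Standard EM, direction M -> O: only the observation side is fitted.
\mathcal{L}_{M \to O}(\theta)
  \;=\; \sum_{o \in O} \log \sum_{m} P(m)\, P_{\theta}(o \mid m)

% The reverse direction O -> M, estimated independently under standard EM.
\mathcal{L}_{O \to M}(\theta')
  \;=\; \sum_{m \in M} \log \sum_{o} P(o)\, P_{\theta'}(m \mid o)

% Hypothetical bidirectional objective: both directional models are
% derived from a single \theta, and the data at both ends are fitted jointly.
\hat{\theta} \;=\; \arg\max_{\theta}\;
  \bigl[\, \mathcal{L}_{M \to O}(\theta) + \mathcal{L}_{O \to M}(\theta) \,\bigr]
```

Under this reading, the only change from standard EM is that the two directional likelihoods share parameters, so evidence from each side of the channel constrains the model used in the other direction.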
