University of Sussex

Particle Language Modelling for Arabic Speech Recognition

Speaker

Bilal Khaliq

Affiliation

Sussex

Abstract

Because Arabic is highly inflectional and morphologically complex, Arabic text data suffers from two key problems for Automatic Speech Recognition: data sparsity and high out-of-vocabulary (OOV) rates. Data sparsity is a problem for standard N-gram models because it reduces the number of instances of many words while increasing the number of N-grams required. And because Arabic generates many more unique words than English, its OOV rate is higher, so an Arabic corpus must be much larger to achieve an OOV rate similar to that of an English corpus.
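
As a rough illustration of the OOV problem described above, the sketch below computes an OOV rate as the fraction of test tokens that never occur in the training text. The function name `oov_rate` and the transliterated toy tokens are assumptions made for illustration only; they are not taken from the talk or its data.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that do not appear in the training vocabulary."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    vocabulary = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return unseen / len(test_tokens)


# Hypothetical toy data: inflected forms of a word seen in training
# still count as OOV because they are distinct surface strings.
train = ["kitab", "alkitab", "qalam"]
test = ["kitab", "wakitab", "kitabuhum", "qalam"]
print(f"OOV rate: {oov_rate(train, test):.2f}")
```

In a morphologically rich language, each new combination of stem and affix is a new surface word, which is why the rate falls more slowly with corpus size than it does for English.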

To address these two problems, a statistical technique for building sub-word units, or 'particles', as modelling units was previously developed and successfully applied to Russian. In this talk I will examine the utility of particle language models for Arabic, a language with morphological characteristics similar to those of Russian. Further, the models were evaluated using Word Error Rates from speech recognition experiments, a more reliable measure of performance than the perplexity values used in the evaluation for Russian.
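
For readers unfamiliar with the metric, Word Error Rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recogniser's output, divided by the number of reference words. The following is a minimal sketch under that standard definition; the function name `word_error_rate` is an assumption for illustration and is not the evaluation tooling used in the experiments.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


print(word_error_rate("a b c d", "a x c"))
```

Unlike perplexity, which scores a language model on text alone, this measure depends on the output of a full recognition run, so it reflects the model's effect on the task itself.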

