University of Sussex

Measures of Text Semantic Similarity

Speaker

Rada Mihalcea

Affiliation

North Texas

Abstract

Measures of text similarity have long been used in a variety of applications, including information retrieval, text classification, word sense disambiguation, extractive summarization, and, more recently, automatic evaluation of machine translation and text summarization. With a few exceptions, the typical approach to finding the similarity between two text segments is to use a simple lexical matching method and produce a similarity score based on the number of lexical units that occur in both input segments (usually referred to as the 'vectorial model'). While successful to a certain degree, these lexical similarity methods cannot always identify the semantic similarity of texts. For instance, there is an obvious similarity between the text segments "I own a dog" and "I have an animal," but most current text similarity metrics will fail to identify any kind of connection between these texts.
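To make the limitation concrete, here is a minimal sketch of lexical matching as a bag-of-words cosine score. The function names and the optional stopword set are assumptions made for illustration; they are not taken from the talk.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b, stopwords=frozenset()):
    """Cosine of the angle between two raw term-count vectors."""
    vec_a = Counter(t for t in tokenize(text_a) if t not in stopwords)
    vec_b = Counter(t for t in tokenize(text_b) if t not in stopwords)
    # Only words that occur in both segments contribute to the dot product.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The only shared token is "i", so the score is low.
print(cosine_similarity("I own a dog", "I have an animal"))  # 0.25

# With function words removed, nothing overlaps at all.
print(cosine_similarity("I own a dog", "I have an animal",
                        stopwords=frozenset({"i", "a", "an"})))  # 0.0
```

The score depends entirely on shared surface forms, so "dog"/"animal" and "own"/"have" contribute nothing.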

In this talk, I will describe our work in developing methods for measuring the semantic similarity of texts using corpus-based and knowledge-based measures of similarity. Given that a large fraction of the information available today, on the Web or elsewhere, consists of short text snippets (e.g., abstracts of scientific documents, image captions, product descriptions), in this work we focus on measuring the semantic similarity of short texts. Through experiments performed on a paraphrase data set, we show that the semantic similarity method outperforms methods based on simple lexical matching, resulting in significant error rate reductions with respect to the traditional vector-based similarity metric.
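One way to see how word-level similarity can be lifted to text level is the sketch below. It is an illustration under stated assumptions, not necessarily the method presented in the talk: each word is paired with its most similar word in the other segment, and the scores are averaged in both directions. The `TOY_WORD_SIM` table and its values are invented for the example; in practice they would come from a corpus-based or knowledge-based word similarity measure, and words would typically be weighted (for example by specificity) rather than counted equally.

```python
# Hypothetical word-to-word similarity scores, invented for illustration only.
TOY_WORD_SIM = {
    frozenset({"own", "have"}): 0.8,
    frozenset({"dog", "animal"}): 0.7,
}


def word_similarity(w1, w2):
    """Return 1.0 for identical words, a table value if listed, else 0.0."""
    if w1 == w2:
        return 1.0
    return TOY_WORD_SIM.get(frozenset({w1, w2}), 0.0)


def directed_similarity(words_a, words_b):
    """Average, over words in A, of the best match each one finds in B."""
    if not words_a or not words_b:
        return 0.0
    best = [max(word_similarity(a, b) for b in words_b) for a in words_a]
    return sum(best) / len(best)


def semantic_similarity(text_a, text_b):
    """Symmetric score: mean of the two directed similarities."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return 0.5 * (directed_similarity(words_a, words_b)
                  + directed_similarity(words_b, words_a))


print(semantic_similarity("I own a dog", "I have an animal"))
```

With these toy values the pair receives a clearly non-zero score even though only "I" is shared, because "own"/"have" and "dog"/"animal" are matched through the word-level measure rather than by surface form.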
