Project Participants
- Prof John Tait, Computing and Information Systems, University of Sunderland
- Dr John Carroll, Cognitive and Computing Sciences, University of Sussex
Project Description
Improving access to written language on the Information Superhighway has been identified as a priority by both the Technology Foresight Programme and the EPSRC Speech and Language Programme. One barrier to accessing written material on the World Wide Web (WWW), and on the Information Superhighway more generally, is that most of it is in English, often employing an extensive vocabulary and a sophisticated style. This can make the text difficult or impossible to understand for people who are not fluent in English as a foreign language, or for people who have language disabilities.
We propose to help widen access to the Information Superhighway by building a computer system which takes in English (newspaper) text across the WWW and outputs a simplified version with broadly similar meaning. For example, uncommon or unusual words are replaced with more common or familiar synonyms, and difficult-to-follow syntactic constructs are replaced with simpler ones (e.g. passive to active). We will also evaluate the utility of the system in a practical situation (with people who have aphasia, which impairs their comprehension of written English), and make appropriate tools developed in the course of the project available to the wider Speech and Language Community.
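The lexical substitution step described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the frequency table `FREQ`, the synonym table `SYNONYMS`, and the function names are all hypothetical, and the sketch ignores inflection and word sense, both of which a real system would need to handle.

```python
import re

# Hypothetical word frequencies (higher = more common); illustration only.
FREQ = {"buy": 900, "purchase": 120, "help": 1500, "assist": 200,
        "start": 1100, "commence": 30}

# Hypothetical synonym sets; a real system might draw these from a thesaurus.
SYNONYMS = {"purchase": ["buy"], "assist": ["help"],
            "commence": ["start", "begin"]}


def simplify_word(word: str) -> str:
    """Return the most frequent word among `word` and its listed synonyms."""
    candidates = [word] + SYNONYMS.get(word, [])
    return max(candidates, key=lambda w: FREQ.get(w, 0))


def simplify(text: str) -> str:
    """Replace each all-lowercase word with its most frequent synonym."""
    return re.sub(r"\b[a-z]+\b", lambda m: simplify_word(m.group(0)), text)


print(simplify("They will commence work to assist residents"))
```

Words with no entry in the synonym table are left unchanged, so the sketch only ever swaps a word for a listed alternative that the frequency table rates as more common.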
Objectives
- To determine by practical experiment whether the proposed system is of assistance to target users in accessing textual information, and to determine, if appropriate, what further developments would be required to bring such a system into full operational use.
- To produce a more powerful and robust set of text analysis tools than has previously been available to the speech and language community, and to make them available to that community.
- To construct a natural language analysis system able to take in raw newspaper text and analyse it sufficiently rapidly and accurately to allow improved comprehension by the target experimental user group (in the context of the remainder of the system).
- To construct a simplifying natural language generation system which can take in the output of the analyser and generate appropriately simplified equivalent English text sufficiently rapidly and accurately to allow improved comprehension by the target experimental user group.
The project was funded by the UK Engineering and Physical Sciences Research Council (EPSRC) under the 1996 Research Programme in Speech and Language. It ran from August 1997 to July 2000 (Sunderland, GR/L50105) and from February 1998 to September 2001 (Sussex, GR/L53175).
Selected Publications
Practical Simplification of English Newspaper Text to Assist Aphasic Readers
AAAI-98 Workshop on Integrating Artificial Intelligence and Assistive Technology, Madison, Wisconsin, 1998.
Aphasia is a disability of language processing often suffered by people as a result of a stroke or head injury. In order to assist aphasic readers we are developing a system which automatically simplifies English newspaper texts as available on the Internet. The system combines state-of-the-art natural language processing tools with innovative research on text simplification. We present the architecture of the system, discuss the analysis of newspaper text and a number of criteria for simplification. In addition, we provide some initial implementation details and propose an evaluation method.
Can Subcategorisation Probabilities Help a Statistical Parser?
in 6th ACL/SIGDAT Workshop on Very Large Corpora, Montreal, Canada, 1998.
Research into the automatic acquisition of lexical information from corpora is starting to produce large-scale computational lexicons containing data on the relative frequencies of subcategorisation alternatives for individual verbal predicates. However, the empirical question of whether this type of frequency information can in practice improve the accuracy of a statistical parser has not yet been answered. In this paper we describe an experiment with a wide-coverage statistical grammar and parser for English and subcategorisation frequencies acquired from ten million words of text which shows that this information can significantly improve parse accuracy.
Aiding Communication for Aphasic People by Simplifying the Text of a Local Newspaper
Platform Presentation at Communication Matters National Symposium (CM'98): Augmentative and Alternative Communication, University of Lancaster, UK, 1998.
Simplifying Newspaper Stories for Readers with Language Impairments
Poster Presentation at the First National Showcase of the Best of British Science, Engineering and Technology by Younger Researchers from University, Industrial and Government Laboratories: SET'99, House of Commons, London, 1999.
Simplifying Text for Language-Impaired Readers
in 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99), Bergen, Norway, 1999.
Automatic text simplification for language-impaired readers is a relatively unexplored area in natural language processing. We describe a generic system for text simplification (currently at the prototype stage) incorporating a range of state-of-the-art language processing tools. We are applying the system to help people with aphasia (various language impairments, typically occurring as a result of a stroke or head injury) to understand English newspaper articles.
Corpus Annotation for Parser Evaluation
in EACL-99 Post-Conference Workshop on Linguistically Interpreted Corpora (LINC-99), Bergen, Norway, 1999.
We describe a recently developed corpus annotation scheme for evaluating parsers that avoids shortcomings of current methods. The scheme encodes grammatical relations between heads and dependents, and has been used to mark up a new public-domain corpus of naturally occurring English text. We show how the corpus can be used to evaluate the accuracy of a robust parser, and relate the corpus to extant resources.
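As a rough illustration of how a corpus annotated with grammatical relations could be used to score a parser, the Python sketch below compares a set of gold-standard relations with a set of parser-produced relations. The use of precision, recall and F-score is an assumption here rather than something stated in the abstract, and the relation labels and the `score` function are hypothetical.

```python
from typing import Set, Tuple

# A grammatical relation as (relation label, head, dependent).
# The labels used below are illustrative, not the scheme's actual inventory.
GR = Tuple[str, str, str]


def score(gold: Set[GR], predicted: Set[GR]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted relations against gold."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {("subj", "chased", "dog"), ("obj", "chased", "cat"),
        ("det", "dog", "the")}
predicted = {("subj", "chased", "dog"), ("obj", "chased", "cat"),
             ("det", "cat", "the")}  # determiner attached to the wrong head

p, r, f = score(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Because each relation is scored independently, a single misattachment costs one relation rather than invalidating the whole analysis of the sentence.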
Syntactic Simplification of Newspaper Text for Aphasic Readers
in ACM SIGIR'99: Workshop on Customised Information Delivery, University of California, Berkeley, 1999.
This paper describes SYSTAR (SYntactic Simplification of Text for Aphasic Readers), a system which automatically resolves 8 anaphors and replaces them with original noun phrases, replaces 4 deictic pronouns, and simplifies certain syntactic constructions, to aid comprehension for aphasic readers. SYSTAR is designed to interface between an analyser (which pre-processes, part-of-speech tags, lemmatises and parses text) and a lexical simplifier, morphological generator and post-processor. Syntactic simplification in SYSTAR is a two-step process: pattern unification followed by generation. Training and test data are articles downloaded from the Internet website of a local newspaper, "the Echo".
Automatic Text Simplification for Readers with Aphasia
Platform Presentation at the British Aphasiology Society Biennial International Conference, City University, London, 1999.
The Application of Assistive Technology in Facilitating the Comprehension of Newspaper Text by Aphasic People
in C. Buehler & H. Knops (Eds.) Assistive Technology on the Threshold of the New Millennium, Assistive Technology Research Series, volume 6, IOS Press, The Netherlands, 1999.
Robust, Applied Morphological Generation
in 1st International Natural Language Generation Conference (INLG'2000), Mitzpe Ramon, Israel, 2000.
In practical natural language generation systems it is often advantageous to have a separate component that deals purely with morphological processing. We present such a component: a fast and robust morphological generator for English based on finite-state techniques that generates a word form given a specification of the lemma, part-of-speech, and the type of inflection required. We describe how this morphological generator is used in a prototype system for automatic simplification of English newspaper text, and discuss practical morphological and orthographic issues we have encountered in generation of unrestricted text within this application.
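The interface described in this abstract (lemma, part-of-speech and inflection type in, word form out) can be sketched in Python as follows. This is a toy rule-based illustration, not the finite-state generator the paper presents: the function name, the part-of-speech codes (`"V"`, `"N"`), the inflection codes (`"s"`, `"ing"`, `"ed"`, `"pl"`) and the irregular-form table are all hypothetical, and consonant doubling (e.g. "stop" to "stopped") is deliberately left out.

```python
# Hypothetical table of irregular forms, keyed by (lemma, pos, inflection).
IRREGULAR = {
    ("go", "V", "ed"): "went",
    ("be", "V", "s"): "is",
    ("child", "N", "pl"): "children",
}

VOWELS = "aeiou"


def _ends_consonant_y(lemma: str) -> bool:
    """True for lemmas such as 'carry' that end in consonant + y."""
    return len(lemma) > 1 and lemma[-1] == "y" and lemma[-2] not in VOWELS


def _add_s(lemma: str) -> str:
    """Apply the regular -s/-es/-ies spelling rules."""
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        return lemma + "es"
    if _ends_consonant_y(lemma):
        return lemma[:-1] + "ies"
    return lemma + "s"


def generate(lemma: str, pos: str, inflection: str) -> str:
    """Return the inflected word form for a lemma (toy rules only)."""
    key = (lemma, pos, inflection)
    if key in IRREGULAR:
        return IRREGULAR[key]
    if (pos, inflection) in {("V", "s"), ("N", "pl")}:
        return _add_s(lemma)
    if (pos, inflection) == ("V", "ing"):
        # Drop a final silent 'e' (make -> making) but keep 'ee' (see -> seeing).
        if lemma.endswith("e") and not lemma.endswith("ee") and len(lemma) > 2:
            return lemma[:-1] + "ing"
        return lemma + "ing"
    if (pos, inflection) == ("V", "ed"):
        if lemma.endswith("e"):
            return lemma + "d"
        if _ends_consonant_y(lemma):
            return lemma[:-1] + "ied"
        return lemma + "ed"
    raise ValueError(f"unsupported request: {key}")


print(generate("carry", "V", "s"), generate("make", "V", "ing"),
      generate("go", "V", "ed"))
```

Because the regular rules apply to any string, the sketch still produces a form for lemmas it has never seen, which loosely mirrors the robustness to unknown words that the abstract emphasises.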
Word Sense Disambiguation Using Automatically Acquired Verbal Preferences
Computers and the Humanities, 34(1-2). 109-114. 2000.
The selectional preferences of verbal predicates are an important component of a computational lexicon. They have frequently been cited as being useful for WSD, alongside other sources of knowledge. We evaluate automatically acquired selectional preferences on the level playing field provided by SENSEVAL to examine to what extent they help in WSD.
Applied Morphological Processing of English
Natural Language Engineering, 7(3). 207-223. 2001.
We describe two newly developed computational tools for morphological processing: a program for analysis of English inflectional morphology, and a morphological generator, automatically derived from the analyser. The tools are fast, being based on finite-state techniques, have wide coverage, incorporating data from various corpora and machine readable dictionaries, and are robust, in that they are able to deal effectively with unknown words. The tools are freely available. We evaluate the accuracy and speed of both tools and discuss a number of practical applications in which they have been put to use.
Synonymy in Collocation Extraction
in Proceedings of the Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations at NAACL'01, Pittsburgh, PA, 2001.
This paper describes the use of WordNet in a new technique for collocation extraction. The approach is based on restrictions on the possible substitutions for synonyms within candidate phrases. Following a general discussion of collocations and their applications, current extraction techniques are briefly described. This is followed by a detailed description of the new approach and results and evaluation of experiments that utilise WordNet as a source of synonymic information.
Disambiguating Noun and Verb Senses Using Automatically Acquired Selectional Preferences
in Proceedings of the SENSEVAL-2 Workshop at ACL/EACL'01, Toulouse, France, 2001.
Our system for the SENSEVAL-2 all-words task uses automatically acquired selectional preferences to sense-tag subject and object head nouns, along with the associated verbal predicates. The selectional preferences comprise probability distributions over WordNet nouns, and these distributions are conditioned on WordNet verb classes. The conditional distributions are used directly to disambiguate the head nouns. We use prior distributions and Bayes' rule to compute the highest-probability verb class, given a noun class. We also use anaphora resolution and the 'one sense per discourse' heuristic to cover nouns and verbs not occurring in these relationships in the target text. The selectional preferences are acquired without recourse to sense-tagged data, so our system is unsupervised.
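The Bayes' rule step mentioned in this abstract can be sketched in Python as below. Given a prior over verb classes and, for each verb class, a distribution over noun classes, the posterior for a verb class is proportional to the prior multiplied by the conditional probability of the observed noun class. The class names, the probabilities and the function `best_verb_class` are invented for illustration and are not taken from the paper.

```python
from typing import Dict, Tuple


def best_verb_class(
    noun_class: str,
    prior: Dict[str, float],
    likelihood: Dict[str, Dict[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the most probable verb class given a noun class, plus the posterior.

    prior[v]          : P(verb class v)
    likelihood[v][n]  : P(noun class n | verb class v)
    """
    # Unnormalised posterior: P(v) * P(n | v) for each verb class v.
    joint = {v: prior[v] * likelihood[v].get(noun_class, 0.0) for v in prior}
    evidence = sum(joint.values())  # P(n), the normalising constant
    if evidence == 0.0:
        raise ValueError(f"noun class {noun_class!r} has zero probability")
    posterior = {v: j / evidence for v, j in joint.items()}
    return max(posterior, key=posterior.get), posterior


# Toy numbers, invented for illustration only.
prior = {"consume": 0.3, "create": 0.7}
likelihood = {
    "consume": {"food": 0.8, "artifact": 0.2},
    "create": {"food": 0.1, "artifact": 0.9},
}

best, posterior = best_verb_class("food", prior, likelihood)
print(best)
```

In this toy example the less frequent verb class wins for the noun class "food", because the conditional probability outweighs the prior.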