Current projects
The Ergonomics of Electronic Patient Records
Electronic patient records contain a mixture of coded information and free text. We will develop generalisable methods for the identification and interrogation of potentially important data "concealed" in free text, use the results to enhance coded data, and evaluate the utility of this approach. Through user-centred methods, we will explore what influences clinicians' choice between recording free text and using standard codes (e.g. 002.23 Appendicectomy), and how information needs to be stored for it to be useful to, and retrievable by, clinicians. Natural Language Processing (NLP) will be used to search the free text of large quantities of anonymised patient records, and to enhance coded data with pseudo-codes. Statistical methods will be used to explore the impact of integrating the additional information on (a) prevalence estimates (rheumatoid arthritis), and (b) estimates of dates of first relevant presentation (ovarian cancer). A visualization tool for the integrated graphical display of coded and NLP-generated data will be developed. It will be used to validate the novel data through clinician and researcher review, and thus to explore the value of these techniques in improving the quality and accessibility of information in electronic patient records. This is a joint project with The Brighton and Sussex Medical School and a number of other partners.
PI (NLP Workstrand): John Carroll
PI (Visualization Workstrand): Donia Scott
Research fellow: Rob Koeling
SWAT (Semantic Web Authoring Tool)
The project aims to open up the semantic web to a wide audience through novel techniques that allow viewing and editing of semantic web representation languages in ordinary natural language, as opposed to the methods currently used, such as source coding or graphical interfaces, which require significant training.
PI (Sussex): Donia Scott
Past projects
Ranking Word Senses for Disambiguation: Models and Applications
The most accurate techniques for word sense disambiguation to date are those which are trained on text in which each word has been manually annotated with its intended meaning. A major shortcoming of these methods, though, is that accuracy is strongly correlated with the quantity of training data available, and this is in short supply because its production is very labour-intensive. In this project we developed novel ways of estimating the frequency distributions of senses of words from raw (unannotated) text. This was a joint project with Informatics, University of Edinburgh.
PIs (Sussex): Diana McCarthy, John Carroll
Research fellow: Rob Koeling
PI (Edinburgh): Mirella Lapata
COGENT: Controlled Generation of Text
With current NLP technology, embedding natural language generation into applications involves hand-crafting and special-purpose tuning by experts, which is non-portable, non-scalable, time-consuming and expensive. In this project, we investigated reflective techniques for controlling wide-coverage generation effectively. This was a joint project with the University of Brighton.
PIs (Sussex): David Weir, John Carroll
Research fellow: Daniel Paiva
DPhil student: Eva Esteve Ferrer
PI (Brighton): Roger Evans
Natural Habitats
The pervasive computing environment of the future will provide a wide variety of networked services. The value of such services will be greatly enhanced if the user is able to compose them, that is, to link them up in ways that are tailored to their own particular environment. This project investigated how NLP techniques can help make service composition a possibility for non-technical users, focusing on the development of an interactive service composition tool that uses a natural language interface.
PIs: David Weir, Bill Keller, Ian Wakeman
Research fellows: Julie Weeds, Tim Owen
DPhil students: Thom Heslop, James Dowdall
MEANING: Developing Multilingual Web-scale Language Technologies
In this project we collected and analysed language data from the WWW on a large scale, in order to build more comprehensive multilingual lexical knowledge bases to support improved word sense disambiguation.
PI: John Carroll
Research fellows: Rob Koeling, Diana McCarthy
DPhil student: Xinglong Wang
DEEP THOUGHT: Hybrid Deep and Shallow Methods for Knowledge-Intensive Information Extraction
This project investigated methods for combining robust shallow methods for language analysis with deep semantic processing. The approach was demonstrated in business intelligence, automated email processing and document production support applications.
PI: John Carroll
Research fellow: Alex Fang
Visiting Researchers: Stephan Oepen, Naoki Yoshinaga
RASP: Robust Accurate Statistical Parsing
This project was concerned with improving the accuracy and robustness of syntactic parsers. Particular areas worked on were automated grammar and lexicon induction, parser evaluation, and statistical models of disambiguation.
PI: John Carroll
Research fellow: Diana McCarthy
DPhil student: Mark McLauchlan
LUCY
The project, sponsored by ESRC, developed an electronic database of structurally analysed modern written English, including not only the "polished" writing of published books and magazines but also the writing of young children and teenagers.
PI: Geoffrey Sampson
Research fellows: Anna Babarczy, Alan Morris
PSET: Practical Simplification of English Text
The project built a prototype system which took in English newspaper text across the WWW, and output a simplified version with broadly similar meaning. The intended users were people with aphasia, which impairs their comprehension of written English.
PI: John Carroll
Research fellows: Diana McCarthy, Guido Minnen
DPhil student: Darren Pearce
CHRISTINE
The CHRISTINE Corpus comprises a socially-representative annotated sample of current spontaneous speech, applying the annotation standards devised in the SUSANNE project (see below) to create resources for studying structure in present-day British language. It includes various extensions of the annotation scheme to identify the many structural features particular to speech. The Corpus is freely available.
PI: Geoffrey Sampson
LEXSYS: Analysis of Naturally-occurring English Text with Stochastic Lexicalized Grammars
The project developed a robust wide-coverage parsing system for English text, exploiting a combination of statistical techniques involving online corpora, inheritance hierarchies for imposing structure on NLP data, and lexicalised grammars.
PIs: David Weir, John Carroll
POLYLEX
The project developed a trilingual lexicon for the core vocabulary of Dutch, English and German, using inheritance networks to share information across the languages at all levels of linguistic description.
PIs: Gerald Gazdar, Lynne Cahill
SPARKLE: Shallow Parsing for Acquisition of Lexical Knowledge
The project developed shallow parsing technology for English, together with corpus-based lexical acquisition techniques, for deployment by collaborators in prototype multilingual information retrieval and speech dialogue systems.
PI: John Carroll
SUSANNE: Surface and Underlying Structural Analysis of Natural English
The project designed an annotation scheme for English, and produced a 130,000-word corpus of written (American) English annotated in accordance with the scheme. The SUSANNE Corpus is freely available without formalities for use by researchers anywhere.
PI: Geoffrey Sampson
POETIC: POrtable Extendable Traffic Information Collator
The POETIC project involved the development of a research prototype software system, based on natural language processing and expert system technology, which accepts 'live' police reports about traffic incidents, recognises information of relevance to other motorists, formulates suitable advisory messages, and coordinates message delivery to motorists via media such as paging, cellular radio, and the Radio Data System.