Speaker
Affiliation
Open University
Abstract
The background to this research is a system, called ENIGMA, which generates cryptic crossword clues based on wordplay puzzles. A clue is typically a short clause, with some ellipsis, which presents a wordplay puzzle to the reader, such as an anagram, when interpreted symbolically following a set of conventions. The symbolic reading is disguised by the fact that the clue appears to be, in the words of Azed, "a piece of English prose". A given puzzle for a particular word can be rendered symbolically in many ways, usually between 10^7 and 10^14, and only a very small fraction of these renderings will also happen to appear to be meaningful fragments of English. ENIGMA explores this search space using syntactic and semantic constraints as heuristics and returns the renderings which, it is hoped, will appear to be grammatical and meaningful.
Building the data sources behind the semantic constraints raises challenging research questions. Existing data sets describing selectional constraints, such as VerbNet, only contain a small number of very broad semantic classes, whereas manually curated resources are narrow and time-consuming to construct. In this talk I describe the process of extracting and evaluating two data sources from the British National Corpus (BNC). The first determines the strength and character of the thematic association between pairs of words; the second defines the domain and/or range of a small set of syntactic dependencies. The thematic association algorithm is based on word distance in the corpus, measured over a set of concentric windows ranging from ±1 to ±1000 words in size. For a given pair the system returns a boolean result indicating whether or not a thematic association is implied by the data in the corpus and, if an association is implied, whether the context is most commonly at word-boundary, phrase or document level. The second data set was constructed by running a statistical parser over the BNC and generalizing the domain/range of each dependency over WordNet. The raw output from the corpus was disambiguated using the WordNet lexicographer file numbers as stand-in domains, and generalized using a cautious mixture of arc distance and coverage.
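As a rough illustration of the concentric-window idea, the following Python sketch counts how often two words co-occur and buckets each co-occurrence by the smallest window that contains it. It is not ENIGMA's implementation: the window radii, the function names `window_counts` and `most_common_band`, and the rule of picking the most frequent band are all assumptions made for the example, and the abstract does not say how the boolean association decision is actually taken.

```python
from collections import Counter

# Hypothetical window radii (in words). The abstract only states that the
# windows range from +-1 to +-1000; the intermediate sizes are assumed.
WINDOWS = (1, 5, 10, 100, 1000)


def window_counts(tokens, a, b, windows=WINDOWS):
    """Count co-occurrences of words a and b in a token list.

    Each (a, b) pair of positions is assigned to the smallest window
    radius that contains it; pairs further apart than the largest
    window are ignored.
    """
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    counts = Counter()
    for i in pos_a:
        for j in pos_b:
            distance = abs(i - j)
            if distance == 0:
                continue  # same token (only possible when a == b)
            for w in sorted(windows):
                if distance <= w:
                    counts[w] += 1
                    break
    return counts


def most_common_band(counts):
    """Return the window radius with the most co-occurrences, or None."""
    if not counts:
        return None
    return max(counts, key=counts.get)


if __name__ == "__main__":
    text = "the cat sat on the mat while the dog slept".split()
    print(window_counts(text, "cat", "mat"))
    print(most_common_band(window_counts(text, "cat", "sat")))
```

In a sketch like this, a pair that mostly falls in the ±1 band would suggest a word-boundary context, while one that mostly falls in the widest bands would suggest a document-level context. A real system would also need to compare the observed counts against what chance alone would predict before declaring an association.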
Both data sources are designed to provide information about plausible rather than prototypical collocations, and this poses some awkward problems for evaluation. The process of generalizing the syntactic dependency data also highlighted some common difficulties in working with corpus data, such as polysemy, figurative language and synecdoche. It also illustrated the shortcomings of using a static one-dimensional hierarchy such as WordNet to explore a wide range of different interactions between words, when each interaction is based on a different subset of the features of the underlying concept.