Well-formed substring tables

For the purposes of this chapter, we will equip ourselves with Grammar5 as a fragment of a simple phrase structure grammar for English.


BOX:




Rule {simple sentence formation} S -> NP VP. Rule {intransitive verb} VP -> IV. Rule {intransitive verb plus PP complement} VP -> IV PP. Rule {transitive verb} VP -> TV NP. Rule {transitive verb plus PP complement} VP -> TV NP PP. Rule {transitive verb plus VP complement} VP -> TV NP VP. Rule {simple noun phrase} NP -> Det N. Rule {noun phrase with PP complement} NP -> Det N PP. Rule {simple prepositional phrase} PP -> P NP.

Word the: <cat> = Det. Word her: <cat> = Det. Word her: <cat> = NP. Word they: <cat> = NP. Word nurses: <cat> = NP. Word nurses: <cat> = N. Word book: <cat> = N. Word travel: <cat> = N. Word report: <cat> = N. Word report: <cat> = IV. Word hear: <cat> = TV. Word see: <cat> = TV. Word on: <cat> = P.



Grammar5 will assign structures to such sentences as:


Nurses hear her. The nurses report. They see the book on the nurses. They hear her report on the nurses.

Since the grammar has no number agreement mechanism and no case distinctions, it will also assign structures to a range of ungrammatical strings, such as '*The book hear they', as well as a typical collection of grammatical but silly sentences, such as 'They hear the book'. These apparent shortcomings are not our focus of interest here, however. Our concern initially is with the nature of the encoding of information about strings that a parser can glean from this grammar.

Consider the problem confronted by a top-down parser faced with the string 'they saw the nurses report'. Ignoring the parsing of the subject noun phrase, the top of the search tree looks as shown in Figure 6.1(a). (For brevity, we have assumed that the parser is intelligent with lexical categories and does not waste time looking for lexical items that are not present.) The main part of the tree (circled) shows the parser's efforts in looking for just a noun phrase after the transitive verb 'saw'. All the paths in this part of the tree lead to failure, because there is an intransitive verb after the noun phrase, but nevertheless along the way a noun phrase is successfully parsed. The worrying thing is that an almost identical tree is to be found when we look more closely at the next subtree (Figure 6.1(b)). Apart from the bottom left node, this tree is exactly like the previous one, except that there is an extra PP in front of each ':'. So, the parser is having to do exactly the same work looking for a noun phrase in this part of the tree. Moreover, the same thing happens yet again in the third subtree, although this time there will be an extra VP in front of each ':', but the work done looking for the noun phrase will be the same. Here, perhaps the extra work may not matter much, but parsing a noun phrase can sometimes be a lengthy process, and it would be wasteful to carry out the same work three times.


In this example, we have discovered that there is redundancy in the parsing search space; that is, there are repetitions of essentially the same situation coming up again and again. How can we alleviate this problem? The trouble is that the parsers we have described have no memory of what they have done before. If our parser is exploring the search tree depth first from left to right, it will successfully parse the noun phrase 'the nurses' in the first subtree. But since this subtree leads to failure, everything in it will be discarded and the work will have to begin again in the second subtree. If the parser had kept some record of what it had found, it would not have had to repeat the work of finding it again.

The key idea, then, is for the parser to have some way of recording the constituents (and their structure) that it has found. We have already discussed the most familiar way of encoding structural information - namely, by means of trees such as that shown in Figure 6.2. Trees provide an excellent way of representing the completed structure of an unambiguous string that is admitted by the grammar, but our problem here is in representing partial structures that may or may not be part of a final complete parse. Just because we find a possible noun phrase such as 'nurses' in one part of the search tree does not mean in general that this will have to be a noun phrase in a complete parse since 'nurses' may be a verb or part of a larger noun phrase like 'the nurses'. So, we want to be able to store information about alternative partial analyses of the string, without committing ourselves to a premature decision as to which are correct.


There are a number of other occasions where such an ability might also come in useful. For instance, suppose we want to see the information that Grammar5 can provide about a string that it does not admit, such as:


The nurses book her travel.

Our grammar has some things it can say about this string, but it cannot assign a single tree to it. How then can we represent what it does have to say? One obvious answer is as a sequence of trees, as shown in Figure 6.3. We might well want to represent such a string in a natural language front end which is confronted with ill-formed input, a grammatical construction that falls outside its own set of rules, a mispelling, or a word that is not in its lexicon. In such cases, a robust system will want the syntactic component to tell the semantic and pragmatic components as much as it has been able to find out about the syntactic structure, rather than simply breaking, reporting 'No parse available', and leaving other components to try and sort out the mess on their own on the basis of an unanalyzed string of words. This is really just like the situation in our original example, where we wanted a similar kind of communication between the computations in different parts of the parsing search tree. In the case of ungrammatical input, the grammar can only say less than a single complete tree would say. We also find strings where there is more to say than can be said by a single complete tree:


They hear the report on the travel.

The grammar assigns two complete trees to this string. How then can we represent this fact? One obvious answer is as a set of (complete) trees, as illustrated in Figure 6.4. Again, a practical natural language understanding system needs to be able to represent syntactic ambiguities: they are ubiquitous but can often be resolved by the semantics or pragmatics, at least in principle. But suppose we encounter a string where the grammar can assign no complete tree, but can assign multiple, mutually inconsistent, partial analyses, as, for example, in the following example:


They see his report on her.

Putting together our two previous answers, we seem to be led in the direction of a data structure consisting of a set of sequences of trees. But if we were to construct such a set for the example just given, or similar examples, then a problem will become apparent: even if the number of possible partial analyses is quite small, enumerating all possible sequences that cover the whole input string could result in a very large set. This is connected with the problem of ambiguities being multiplicative, which was discussed in Chapter 5.



The solution to these problems of structural representation lies in representing everything that would be required to enumerate the possible tree sequences; that is, the possible partial trees and what portions of the string they might account for, without actually doing the enumeration. The resulting representation is called a well-formed substring table (WFST). Consider the beginning and end of the string to be numbered 0 and n, respectively, and the gaps between words to be numbered from 1 to n-1 from left to right. Now, a WFST simply tells us, for each pair of points i, j (0 <= i < j <= n), what categories can span the substring of words found between i and j.

One way to think of a WFST, and the one that we shall adopt, is as a directed (each arc has a particular direction), acyclic (containing no cycles) graph with unique first and last nodes, a graph whose nodes are (conveniently and conventionally) labelled from 0 (the first node) to n (the last), where n is, as before, the number of words in the string, and whose arcs are labelled with syntactic categories and words. For illustration, Figure 6.5 shows the cases we have just considered represented in this fashion. It can be readily seen that a WFST can represent all three cases with equal facility and with no redundancy.


We can represent a WFST as a set of arcs, where an arc is a structure with the following attributes:




<START> = ... some integer ... <FINISH> = ... some integer ... <LABEL> = ... some category ...

A phrase structure tree of the conventional kind is also a directed acyclic graph, of course, but one in which the nodes are labelled with categories and the arcs are unlabelled. In our WFST, the arcs carry the category labels and the nodes have essentially arbitrary names, although we have used integers and exploited the '<' relation on integers to implicitly encode the head and tail of the (directed) arcs. Notice that a WFST encodes

the order of phrases directly: a constituent with an arc going from i to j in the table precedes a constituent with an arc from m to n just in case j < m. This is in contrast to trees that encode ordering directly for sister constituents only. Trees do, however, encode immediate dominance relations such as, which phrases are parts of which other phrases, directly whereas simple WFSTs, such as those considered here, do not. In fact, strictly speaking, our WFSTs as presented so far do not, by themselves, suffice to recover all the information encoded in our more verbose tree-based data structures. Simply by looking at the WFST, you cannot be sure of the rule (or rules) that legitimates a particular arc. Consider the WFST shown in Figure 6.6. Is the covering H warranted in virtue of H dominating D and E or in virtue of H dominating F and G? Without knowing what rules are in the grammar, we cannot tell. However, we can keep this information available if we make the label of an arc a parse tree - or a category, together with the sequence of arcs that correspond to the immediate subphrases - instead of a simple category.


How can a WFST be used by a top-down parser? We require that successfully found phrases be recorded in the table at some point. When a phrase of a particular category is then sought at a particular position in the string, the table is consulted. If there is an entry for the right kind of phrase starting at the right place, then the entries in the table are used as the possible analyses, rather than the work having to be done again. This strategy relies, however, on a particular convention about entering partial analyses into the table; namely, that an entry for a given category, starting at a given position, is not entered until all such entries can be entered. Otherwise, the second piece of computation, relying on the fruits of the first, will not incorporate all possible analyses into its investigations. A convention of this kind will be necessary even for a recognizer, because different possible analyses for a category, starting at a given position, may have different end positions.

WFSTs can be used by any kind of parser: they do not commit one to parsing bottom up, left to right, or breadth first, although they permit one to do any of these things (or their familiar antitheses), and in any combination and any degree. The data structure itself is completely neutral, just as the more familiar phrase structure tree is. The use of a WFST simply allows a parser to refrain from rediscovering things that it has already found out.

We have motivated the WFST representation on a variety of intuitive grounds, but it turns out that the use of a WFST is also suggested by the complexity results on parsing of non-deterministic CF-PSGs. What the various n\u3\d CF-PSG parsing algorithms have in common is the implicit or explicit use of a WFST, and this single factor suffices to account for their cubic worst-case efficiency. Simpler algorithms that do not use a WFST typically have an exponential worst case. The intuition is a straightforward one: by storing intermediate results and by always testing for the presence of a category in the store before trying to build it, a WFST-based parser can avoid ever having to do the same thing more than once. The price to be paid is in the space consumed, but a WFST does not contain any redundancy, so this space demand is never worse than a multiple of n\u2\d (for a single parse).

And since memory is cheap and natural language sentences are short, the price is an easily affordable one. Although they are very suggestive, these complexity results must be treated with care, unfortunately, as they reflect the complexity of finding the first parse only of a string. In practice, we will often be interested in finding all parses of a string, and the complexity of this, when the language itself is allowed to contain arbitrary ambiguity, is always exponential. Nevertheless, the complexity results do give an indication of the computational savings resulting from the use of a WFST and, as a result, many non-deterministic natural language parsers since the early 1970s have used a WFST of some sort or another.

Exercise 6.1

Send us a comment.



[Contents] [Previous] [Next]
This document was translated by troff2html v0.21 on October 22, 1996.