A notation for networks

An FSTN description has three components:
(1) A name for the network.

(2) A collection of declarations.
(3) A collection of arc descriptions.

The first component, the name, is only of mnemonic value when we are discussing FSTNs, since no other component makes reference to it. However, as we shall see in Chapter 3, the name of a network plays an absolutely crucial role in the definition of RTNs. It is introduced here for consistency and for completeness. The relevant statement of our language simply consists of the word 'Name' followed by a name for the network (any string of one or more characters), followed by a colon. For example:




Name TO-LAUGH:


The second component, the declarations, consists of

a single obligatory initial states declaration and a single obligatory final states declaration. Each declaration consists of a word indicating what is being declared and a list of the terms being declared, separated by commas. Thus, for instance:




Final 1,2,3

declares 1, 2 and 3 to be final states. So, a set of declarations for the FSTN in Figure 2.1 would look like this:




Initial 1 Final 4

It is often very convenient to be able to employ a single expression to abbreviate some subset of symbols and so we will introduce a type of declaration explicitly for this purpose. An abbreviation declaration consists of the abbreviatory expression, which must be distinct from any symbol followed by the word 'abbreviates', followed by a colon and a list of symbols, separated by commas and terminated by a period. The use of this type of declaration will be clearer in the light of some examples:




V abbreviates: a, e, i, o, u. C abbreviates: b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z.

We will put no orthographic restrictions on symbols, states or abbreviations: any character or string of characters, upper or lowercase, may appear in any role.

When other things are equal, we will adhere to a convention of using integers for states, lowercase (strings of) letters for symbols and uppercase (strings of) letters for abbreviations. But this is merely the convention used here - it is not to be thought of as part of our definition of the formal language.

The third major component of our FSTN description consists of a set of one or more arc descriptions, each of which has exactly the same form. An arc description consists of the word 'From' followed by a state, followed by the word 'to', followed by another state (not necessarily distinct from the first state), followed by the word 'by', followed by a single item. In this context, an item can be a symbol, an abbreviation (declared elsewhere) or the hash symbol ('#', pronounced 'jumping' in this context). This sounds much more cumbersome than it looks. Here, to illustrate, are the arc descriptions that would be needed to describe Figure 2.1:




From 1 to 2 by h From 2 to 3 by a From 3 to 2 by h From 3 to 4 by !

And here is the complete description of the FSTN in Figure 2.3:




Name TO-LAUGH-3: Initial 1 Final 4 From 1 to 2 by h From 2 to 3 by a From 3 to 1 by # From 3 to 4 by !.

Notice that our choice of symbols so far has always been restricted to single letters and '!'. But nothing in the theory of FSTNs restricts us in this way, and, in fact, we can define an even simpler laughing machine by using a single multi-character symbol:




Name TO-LAUGH-4: Initial 1 Final 2 From 1 to 1 by ha From 1 to 2 by !.

This FSTN illustrates another aspect of FSTNs that we have not explicitly mentioned up to this point; namely, the possibility of an arc looping back to the state that it started out from. This is very useful when we wish to allow for unrestricted repetition of a symbol. The following FSTN, ENGLISH-1 uses such loops to good effect in a machine that will produce or recognize simple English sentences.



BOX:




Name ENGLISH-1: Initial 1 Final 9 From 1 to 3 by NP From 1 to 2 by DET From 2 to 3 by N From 3 to 4 by BV From 4 to 5 by ADV From 4 to 5 by # From 5 to 6 by DET From 5 to 7 by DET From 5 to 8 by # From 6 to 6 by MOD From 6 to 7 by ADJ From 7 to 9 by N From 8 to 8 by MOD From 8 to 9 by ADJ From 9 to 4 by CNJ From 9 to 1 by CNJ.

NP abbreviates: kim, sandy, lee. DET abbreviates: a, the, her. N abbreviates: consumer, man, woman. BV abbreviates: is, was. CNJ abbreviates: and, or. ADJ abbreviates: happy, stupid. MOD abbreviates: very. ADV abbreviates: often, always, sometimes.



This network uses some abbreviations corresponding to basic syntactic categories used in the description of English syntax. What is a syntactic category? Well, at its simplest, a syntactic category is the name we give to collections of expressions of the language that have the same distribution. So, for example, 'Sandy' and 'the person you spoke to' are both noun phrases: they may or may not refer to the same person, but regardless of that, they have exactly the same syntactic privileges of occurrence. Construct any English sentence you like using the proper name 'Sandy' and you will be able to construct another perfectly grammatical English sentence by substituting 'the person you spoke to' for 'Sandy' in the sentence you first constructed. The two sentences may mean different things, but they will both be grammatical. Examples of syntactic categories that linguists commonly make use of are noun phrase (NP), sentence (S), verb phrase (VP), verb (V), and so on. We can also think of more detailed descriptions, such as plural noun phrase, interrogative sentence, passive verb phrase and tensed verb, as examples of more specified syntactic categories. Notice that in our example, the two noun phrases that we claimed to be intersubstitutable were actually both singular noun phrases. Although a singular noun phrase can sometimes be substituted for a plural one (or conversely) while preserving grammaticality, this is not always the case.

For the purposes of our illustrative example, a word is an NP (noun phrase) if it can stand alone as the subject of a sentence. By N we mean common noun, as distinct from proper noun. Determiners (DET) are words that can come before (common) nouns, as in 'the woman', but adjectives (ADJ) like 'stupid' can interpose. The category BV is used here for parts of the verb 'be' which can be optionally followed by an adverb (ADV) such as 'often'. A word of category MOD is used to modify adjectives, changing 'stupid' into 'very stupid', for instance. Finally, the category CNJ (conjunction) is used for words that can join two sentences into a single larger sentence. Here are some example strings that can be recognized by this network:


Kim was happy. Lee is a consumer and often very very stupid. Sandy was sometimes a stupid man and Kim is always the happy woman.

Our indentation is conventional, not part of the language definition, but adhering to it will make some of the complex networks exhibited in the next chapter a good deal easier to read. It will be convenient to have a name for this formal language in the course of this chapter and the next, and we shall refer to it as NATR (network and transducer representation).

We have not, as yet, said anything explicitly about what exactly it is that NATR means. For the most part, this is too obvious to need saying, but in the case of abbreviations we have a bit of notation that does not map one-to-one into our existing graphic representation. However, our abbreviations are exactly what their name suggests. Thus, for example,




From 1 to 2 by A. A abbreviates: a, b, c.

is wholly synonymous with




From 1 to 2 by a From 1 to 2 by b From 1 to 2 by c.

and thus corresponds to the FSTN fragment shown in Figure 2.4(a).

Since fragments like this get to look very messy when more than two or three arcs join a given pair of states, we shall, henceforth, import our abbreviations into the graphic representation whenever it is convenient to do so. So, we may exhibit the fragment of Figure 2.4(b)

in place of the unabbreviated fragment in Figure 2.4(a) above.


Restricting ourselves to the deterministic type of FSTN for the moment, we can introduce another formalism for (partially) representing (deterministic) FSAs used for recognition. This is the state-transition table, a matrix that exhibits the transition function of an FSA in a very clear and readily implemented manner. The vertical axis lists the states and the horizontal axis lists the symbols. The states in the matrix represent the state the machine moves into if it starts in the state given by the vertical coordinate and consumes the symbol given by the horizontal coordinate. Here is the state-transition table that corresponds to Figure 2.1:




h a ! _______ 1 | 2 0 0 2 | 0 3 0 3 | 2 0 4 4 | 0 0 0

A zero indicates that the automaton cannot proceed, and will thus be unable to accept an input string that would have the effect of forcing it into that position in the matrix. So, for instance, the first row of the matrix shows that if the machine is in state 1 the only possibility is a transition to state 2, which requires the consumption of the symbol 'h'. Similarly the third row shows that from state 3 there are two possibilities, corresponding to the symbols 'h' and '!'. Bear in mind that, as it stands, the state-transition table is an incomplete representation of the automaton, since it does not indicate which states are initial and which are final.

Now, at last, we are in a position to directly address the problem with which we started, that of natural language graphotactics (the definition of possible words in terms of permissible letter sequences). One graphotactic constraint that is familiar to most users of English is that the letter 'q' must be followed by a 'u'. Another example from English is the requirement that a word-initial 'st' only be followed by a vowel, a 'y' or an 'r' (we ignore the word 'sthenic' on the grounds that no self-respecting native speaker of English would dream of using it). Constraints of this kind are easy to code up in an FSTN.

ENG-MONOSYL is a first attempt at an FSTN for the orthographic forms of possible one syllable English words.


BOX:


Name ENG-MONOSYL: Initial 1,2 Final 3,4,5

From 1 to 2 by C0 From 2 to 3 by V From 3 to 4 by C8 From 4 to 5 by s From 1 to 7 by C3 From 7 to 2 by w From 1 to 6 by C2 From 6 to 2 by l From 6 to 5 by # From 1 to 5 by C1 From 5 to 2 by r From 1 to 8 by s From 8 to 5 by C4 From 8 to 2 by C5 From 3 to 9 by l From 3 to 10 by s From 3 to 11 by C7 From 9 to 4 by C6 From 10 to 4 by C4 From 11 to 4 by th.

V abbreviates: a, ae, ai, au, e, ea, ee, ei, eu, i, ia, ie, o, oa, oe, oi, oo, ou, u, ua, ue, ui. C0 abbreviates: b, c, ch, d, f, g, h, j, k, l, m, n, p, qu, r, s, sh, t, th, v, w, x, y, z. C1 abbreviates: d, sh, th. C2 abbreviates: b, c, f, g, k. C3 abbreviates: d, g, h, t, th. C4 abbreviates: c, k, p, t. C5 abbreviates: c, k, l, m, n, p, pl, qu, t, w. C6 abbreviates: b, f, m. C7 abbreviates: d, f, l, n, x. C8 abbreviates: b, c, ch, ck, d, f, g, h, k, l, m, mp, mph, n, ng, p, que, r, s, sh, t, th, v, w, x, y, z.



It is important to remember that FSTNs such as the above only attempt to define possible English (or Polish or Spanish) words, not actual English (or Polish or Spanish) words. In the case of Polish and Spanish, it is rather likely that an FSTN for the actual words could be defined, but it would be a massive undertaking to attempt to construct it by hand. And there are languages, such as Bambara, for whose actual word set it seems that no FSTN could be given. But for many practical purposes, an FSTN for possible words is more use than one for the actual words. Speakers make words up constantly, but their neologisms are always possible words of the language in question. Thus, an English speaker will hardly notice the neologism in 'her secretary nixonized the tapes' but will immediately react to the third item in 'her secretary nxioniezd the tapes'.

We have been discussing graphotactics, but FSTNs are not restricted in their application to this, linguistically rather barren, level of description. For most languages, although not all, it makes sense to postulate a unit of syntactic organization between the letter and the word. This unit is known as a morpheme (roughly speaking, the minimal meaningful unit). Here, as elsewhere in this book, we restrict our attention to the written form of languages (In the spoken form, we have individual sounds, known as phones, rather than letters of an alphabet, and these sounds typically group into higher level sound units such as syllables.) Rather than attempt to define the notion of a morpheme, we will simply exhibit the component morphemes of some English words:




print-s print-ed print-ing re-print im-print slow-ly in-de-cipher-able

English does not have a rich morphological structure but many languages allow the speaker to express in a single word something that an English speaker would require a whole sentence to say. For example, the Swahili word 'wametulipa' translates into English as 'they have paid us', while 'unamsumbua' translates as 'you are annoying him'. It is often possible to capture the permissible morpheme sequences of a language, or of part of a language, by means of an FSTN. SWAHILI-1 is an FSTN for a subset of Swahili words.


BOX:




Name SWAHILI-1: Initial 1 Final 5 From 1 to 2 by SUBJ From 2 to 3 by TENSE From 3 to 4 by OBJ From 4 to 5 by STEM.

SUBJ abbreviates: ni, u, a, tu, wa. OBJ abbreviates: ni, ku, m, tu, wa. TENSE abbreviates: ta, na, me, li. STEM abbreviates: penda, piga, sumbua, lipa.



This FSTN will recognize each of 100 morphological variants of four Swahili verbs.

This is pleasing, but is it useful? The answer is probably no. After all, if we are concerned with Swahili words at this kind of level of detail, as opposed to wondering whether some string of characters was a possible Swahili word, then we are almost certain to want more than the yes or no that an FSTN can provide. To get more information, we need a parser or a transducer, not a recognizer. We will look at finite-state transducers in some detail in a subsequent section of this chapter.

At this point, let us take stock and recapitulate with some practical rules of thumb for the specification of FSAs by FSTNs:
(i) If we want to get from state i to state j without moving across a symbol in the input, or putting a symbol on the output, then we connect i and j with a jump arc.
(ii) If we want to get from state i to state j while having the option to move across, or print, a symbol s then we connect i and j with two arcs, one a jump arc and the other labelled with s.
(iii) If we are in state i and wish to move across, or print, zero or more occurrences of a symbol s, then we connect i to itself with an arc labelled with s.
(iv) If we are in state i and wish to move across, or print, one or more occurrences of a symbol s, then we connect i to a new node j with an arc labelled with s, and then we connect j back to i with a jump arc.
(v) If we are in state i and wish to move across, or print, one or more occurrences of a string S that can be reached between states i and j, then we connect j back to i with a jump arc.
(vi) If the presence of even one occurrence of the string is optional in the latter case, then i should also be connected to j with a jump arc, pointing in the opposite direction to the one already introduced.
These are simply heuristics: they may not have the intended effect if they interact with other features of the FSTN, most obviously if any of the named states, or intermediate states, are also final states.

Exercise 2.1

Send us a comment.



[Contents] [Previous] [Next]
This document was translated by troff2html v0.21 on October 22, 1996.