An FSA of the kind we have discussed, used to analyze some existing input, is a recognizer, not a parser or a transducer, so all it can do is decide on the well-formedness of a string. If it can reach the end of the string in a final state, then the string is well formed; if it cannot reach the end of the string, or if cannot reach the end of the string and simultaneously be in a final state, then the string is not well formed. That is all the information it provides. To get more information, we need a parser or a transducer, not a recognizer.
Although we can interpret an FSTN as a machine that produces a simple yes or no output for any input, with some small extensions to the notation, we can interpret one as a finite-state transducer (FST). An FST is a more interesting kind of FSA that allows a string of output symbols to be produced as an input string is recognized.
One way to think of an FST is as a special kind of FSA that inspects two (or more) symbols at a time and proceeds accordingly. To put some intuitive flesh on this rather barren statement, we describe a family of hypothetical children's games. Imagine a long, straight pavement made up of coloured, square paving stones set two abreast. At any given time, apart from when a player actually makes a move forward, the player must have their left foot on a left-side paving stone and their right foot on the right-side paving stone that is immediately adjacent to it. And, at any given time, the player must be in one of a (finite) number of states. For simplicity, we will assume just two states: that of having your hand in your pocket, and that of not having it there. Each variant of the game consists of a collection of rules, of which the following are examples:
If your left foot is on a red paving stone and your right foot is on a green paving stone and your hand is in your pocket, then you can advance to the next pair of stones and keep your hand in your pocket.
If your left foot is on a blue paving stone and your right foot is on a blue paving stone and your hand is not in your pocket, then you can advance to the next pair of stones and put your hand in your pocket.
Given such a collection of rules, the player's goal is to move from the beginning of the pavement to the end, without making any move that is not expressly permitted by the rules. A child playing this game will be implementing an FST.
As we have just presented it, an FST is a kind of inspector that runs along checking if two strings of symbols stand in an appropriate correspondence. As such, it seems very little different, and hardly more useful, than an FSA running as a recognizer. But just as an FSTN can be interpreted both as a string recognizer and a string generator, so a network representing an FST can be interpreted, for instance, as both a string correspondence checker and as a machine that reads one string while writing another. We will use the term FST to refer to any such finite-state machine that makes use of multiple tapes, even though it is the latter interpretation that is of most interest and utility in the natural language domain. How, then, are we to conceive of an FST applying to a linguistic domain? Here is an example of the kind of rule that we might use in a linguistic application:
If your current English word is 'fish' and you are in state 13, then you may print 'poisson' and advance to the next English word and shift into state 7.
But before we get overambitious and attempt English-French machine translation with an FST, let us step back and equip ourselves with a notation for talking about FSTs.
An FST can be specified by an FSTN whose arcs are labelled with pairs of symbols, as opposed to single symbols, and as before we will blur the distinction between the abstract notation, which describes sets of pairs of strings, and the possible machines that it might specify. To specify an FST, then, the simplest thing to do is just to augment the NATR language we already have for defining FSTNs. Let us imagine that the symbols we are dealing with are being read from, or printed on to, tapes. There is no need to make any changes in our formalism for expressing initial and final states. But we do need to allow our arc description statements to specify pairs of symbols as labels. We will use expressions like A_a, where 'A' is a tape 1 symbol and 'a' is a tape 2 symbol, for such symbol pairs.
In fact, there is no need to restrict ourselves to two tapes. Both our syntax for transducer symbol declarations and our notation for symbol pairs are designed to generalize to the n-tape situation, for n greater than 2. Hence, A_a_alpha could be used for a symbol triple in a three-tape situation, for example. However, in what follows, we will restrict ourselves to two-tape machines.
An arc description statement for an FST will appear thus:
From 1 to 2 by A_a
What about cases where we wish to read, or write, a symbol on one tape, but do nothing on the other? Here, we can use our jumping symbol; so, #_a and A_# will be well-formed pairs. (Thus, the hash remains a special symbol, part of our language for talking about FSTs. If you want to have the FST manipulate the real hash symbol, then it will need to be set off in quotes.)
This minor generalization from single symbols to pairs - or, more generally, n-tuples - of symbols is all we really need to define FSTs.
Here is an example of an abbreviation for an FST:
DIGIT abbreviates: one_un, two_deux, three_trois, four_quatre, five_cinq, six_six, seven_sept, eight_huit, nine_neuf, ten_dix.
Having introduced these additions to our existing FSTN language, we can consider
ENG_FRE-1, which is
an example of a complete FST specified in this augmented NATR notation.
Name ENG_FRE-1: Initial 1 Final 5 From 1 to 2 by WHERE From 2 to 3 by BV From 3 to 4 by DET From 4 to 5 by NOUN. WHERE abbreviates: where_ou. BV abbreviates: is_est. DET abbreviates: the_#. NOUN abbreviates: exit_la sortie, policeman_le gendarme, shop_la boutique, toilet_la toilette.
The little FST described in ENG_FRE-1 would be completely trivial were it not for one wrinkle: in French the form of the determiner or definite article varies with the gender of the noun it accompanies, whereas English shows no such variation. This FST gets round the problem by letting the English 'the' map into nothing, and then requiring English nouns to map into the relevant French noun preceded by a determiner marked for the appropriate gender. Thus la sortie and le gendarme, for example, are single symbols as far as this FST is concerned. This is not a very elegant solution to the problem since it obfuscates the correspondence that holds between the English 'the' and French 'la/le'.
An alternative FST for this English-French translation task,
one that is arguably more perspicuous, despite the
introduction of an additional state and two additional arcs,
is given in ENG_FRE-2.
Name ENG_FRE-2: Initial 1 Final 5 From 1 to 2 by WHERE From 2 to 3 by BV From 3 to 4 by DET-FEMN From 4 to 5 by N-FEMN From 3 to 6 by DET-MASC From 6 to 5 by N-MASC. WHERE abbreviates: where_ou. BV abbreviates: is_est. DET-FEMN abbreviates: the_la. DET-MASC abbreviates: the_le. N-FEMN abbreviates: exit_sortie, shop_boutique, toilet_toilette. N-MASC abbreviates: policeman_gendarme.
Notice in ENG_FRE-2 how the gender distinction has been, in effect, encoded into the states: if we traverse the net via state 4, then we must have a feminine determiner and a feminine noun, whereas if we go via state 6, then we must have a masculine determiner and a masculine noun. There are no other possibilities.
Notice also that if we use this transducer to translate from French to English, then it will operate deterministically - it will never need to make a choice. But if we use it to translate from English to French, its operation will not be deterministic: in translating 'Where is the policeman?' it will be faced with a choice when it reaches the determiner. It can either traverse the DET-MASC arc or the DET-FEMN arc and it has no basis for deciding which. If it goes the DET-FEMN route, then it will fail when it reaches the N-FEMN arc, since policeman has no French counterpart in N-FEMN. So, any algorithm that we devise to employ FSTs that exhibit such non-determinism will need to incorporate either an ability to backtrack or an ability to explore multiple arcs in parallel. It is worth noting that our earlier and uglier English-French FST was deterministic in both directions.
The examples of FSTs that we have just been playing with are misleading in
that nobody nowadays would dream of attempting to do serious English-French
machine translation with an FST, for reasons that will begin to emerge in
the final section of this chapter and which should be self-evident by the
time you have reached the end of this book. But FSTs are potentially
well suited to providing efficient solutions to certain small self-contained
areas of linguistic analysis. Examples that spring to mind are the
translation or interpretation of number names and time of day expressions
(although not in all languages), text to speech transduction in languages with
fairly well-behaved orthographies and the inflectional analysis of word
forms. We will look briefly at the latter here, returning to our earlier
Swahili example, reconstructed as an FST mapping between the Swahili
morphemes and reasonably perspicuous representations of their syntactic and
semantic content (SWAHILI-2).
Name SWAHILI-2: Initial 10 Final 90 From 10 to 21 by Subj_ni From 10 to 22 by Subj_u From 10 to 23 by Subj_a From 10 to 24 by Subj_tu From 10 to 25 by Subj_wa From 21 to 31 by 1ST From 22 to 31 by 2ND From 23 to 31 by 3RD From 24 to 32 by 1ST From 25 to 32 by 3RD From 31 to 40 by SING From 32 to 40 by PLUR From 40 to 50 by TENSE From 50 to 61 by Obj_ni From 50 to 62 by Obj_ku From 50 to 63 by Obj_m From 50 to 64 by Obj_tu From 50 to 65 by Obj_wa From 61 to 71 by 1ST From 62 to 71 by 2ND From 63 to 71 by 3RD From 64 to 72 by 1ST From 65 to 72 by 3RD From 71 to 80 by SING From 72 to 80 by PLUR From 80 to 90 by STEM.
1ST abbreviates: 1st_#. 2ND abbreviates: 2nd_#. 3RD abbreviates: 3rd_#. SING abbreviates: Sing_#. PLUR abbreviates: Plur_#. TENSE abbreviates: Future_ta, Present_na, Perfect_me, Past_li. STEM abbreviates: LIKE_penda, BEAT_piga, ANNOY_sumbua, PAY_lipa.
This network will map a Swahili expression like 'wa-me-ni-sumbua' (on the second tape) into 'Subj-3rd-Plur-Perfect-Obj-1st-Sing-ANNOY' (on the first tape) or conversely, and it can be seen why this might well be a useful thing to do in a system designed to analyze or synthesize inflected Swahili words. Notice that the FST given is deterministic when we use it to map from Swahili into our analysis expressions, but not when we go in the opposite direction. Notice also that we have been cheating in our discussion so far: real Swahili words come to us as 'wamenisumbua' not as 'wa-me-ni-sumbua' with all the morpheme breaks conveniently marked. But we can solve this problem with the FST machinery that we already have: we simply need to transduce 'w a m e n i s u m b u a' into 'wa me ni sumbua'.
An FST is a finite-state machine that deals with two tapes. Thus, we can amend our programs to deal with FSTs, mainly by changing the tape-handling procedures. Let us represent the 'tape' of a transducer as a list of the two actual tapes, each of which is a normal list of symbols. Now the label on a network arc will need to specify constraints on the next symbol on both tapes. We can show this by providing a pair of labels (list of two labels) or an abbreviation that stands for such a pair. For instance, here is the network and abbreviations for the ENG_FRE-1 example: Code:eng_fre1.p
[Final 5] [From 1 to 2 by WHERE] [From 2 to 3 by BE] [From 3 to 4 by DET] [From 4 to 5 by NOUN]] -> eng_fre1;
[[WHERE abbreviates [where ou]] [BE abbreviates [is est]] [DET abbreviates [the #]] [NOUN abbreviates [exit 'la sortie'] [policeman 'le gendarme'] [shop 'la boutique'] [toilet 'la toilette']]] -> abbreviations;
As before, the tape-handling procedures will depend on whether we wish to do recognition or generation. With transducers, there are in fact more possibilities, as we may wish to recognize inputs on both tapes, generate output on both tapes or recognize input on one and produce output on the other. Here is the tape-handling procedure for the case where we wish to recognize input on the first tape and produce output on the second: Code:fstrans.p
define transduce_move(label, tape); vars input output newinput newoutput l1 l2 e exps; tape --> [?input ?output]; if label matches [?l1 ?l2] then [% recognize_move(l1, input) %] -> newinput; unless newinput = [] then hd(newinput) -> newinput; ;;; there is only one possibility for newoutput in [% generate_move(l2, output) %] do [^newinput ^newoutput] endfor endunless elseif label = "#" then tape elseif abbreviations matches [== [^label abbreviates ??exps] ==] then for e in exps do transduce_move(e, tape) endfor endif enddefine;
We have made use here of the recognize_move and generate_move procedures previously developed for finite-state machines that are recognizing and generating (these appear in lib fstape). In a network for an FST, a label on an arc will be (possibly an abbreviation for) a pair of labels. We need to attempt to move the first (input) tape in accordance with the first label (this is what recognize_move does). If this is successful, we need to produce output on the second tape in accordance with what the second label dictates (this is what generate_move does). If all goes well, we finally glue together the new input tape with each output tape in turn to give the list of new tapes.
Here is the rest of the program. As before, we have a main procedure and a next procedure. The initial double-tape consists of the words to be analyzed (the input tape) and an empty list (the output tape so far). The terminating condition on the double-tape is that the first component must be empty. In this case, the second component is the complete sequence generated on the second tape. This program stops as soon as one successful network traversal has occurred (as in the recognize case).
define transduce_next(node, tape, network); vars newnode label newtape output; if tape matches [[] ?output] and member(node, final_nodes(network)) then output; exitfrom(transduce) else foreach [From ^node to ?newnode by ?label] in transitions(network) do for newtape in [%transduce_move(label, tape)%] do transduce_next(newnode, newtape, network) endfor endforeach endif enddefine;
define transduce(network, tape); vars i; for i in initial_nodes(network) do transduce_next(i, [^tape []], network) endfor; false enddefine;
Send us a comment.