| Path | Value |
| syn cat | verb |
| syn type | main |
| syn form | present participle |
| mor form | love ing |
Word1:
<syn cat> = verb
<syn type> = main
<syn form> = present participle
<mor form> = love ing.
Here, angle brackets <...> delimit paths. Note that values can be
atomic or they can consist of sequences of atoms, as the two last
lines of the example illustrate. Node names and atoms are
distinct but essentially arbitrary classes of tokens in DATR. In
this document (and elsewhere) we distinguish them by a simple case
convention - node
names start with an uppercase letter, atoms do not. As a first
approximation, nodes can be
thought of as denoting partial functions from paths (sequences of
atoms) to values (sequences of atoms). This is an approximation
since it ignores the role of global contexts - see Section
4, below.
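As a rough illustration of the ``partial function'' reading, Word1 can be modelled as a finite mapping from paths to values. This is only a sketch: the Python names (WORD1, value_at) are ours, not part of DATR, and atoms are represented as plain strings.

```python
# Word1 as a partial function from paths to values.
# Paths and values are both sequences of atoms (tuples of strings here).
WORD1 = {
    ("syn", "cat"): ("verb",),
    ("syn", "type"): ("main",),
    ("syn", "form"): ("present", "participle"),
    ("mor", "form"): ("love", "ing"),
}

def value_at(node, path):
    """Return the value listed for path, or None where the node is undefined."""
    return node.get(tuple(path))

print(value_at(WORD1, ["mor", "form"]))  # ('love', 'ing')
print(value_at(WORD1, ["mor", "root"]))  # None: no statement for this path
```

The second lookup returns nothing because the function is partial: Word1 simply has no statement for that path.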
In itself, this tiny fragment of DATR is not persuasive, apparently allowing only for the specification of words by simple listing of path/value statements for each one. It seems that if we wished to describe the passive form of love we would have to write:
Word2:
<syn cat> = verb
<syn type> = main
<syn form> = passive participle
<mor form> = love ed.
This does not seem very helpful: the whole point of a lexical
description
language is to capture generalisations and avoid the kind of duplication
evident in the specification of Word1 and Word2. And indeed, we
shall shortly introduce an inheritance mechanism which allows us to do
just that. But there is one sense in which this listing approach is
exactly what we want: it represents the actual information we generally
wish to access from the description. So in a sense we do want all the
above statements to be present in our description; what we want to avoid
is repeated specification of the common elements.
This problem is overcome in DATR in the following way: such exhaustively listed path/value statements are indeed present in a description, but typically only implicitly present. Their presence is a logical consequence of a second set of statements, which have the concise, generalisation-capturing properties we expect. To make the distinction sharp, we call the first type of statement extensional and the second type definitional. Syntactically, the distinction is made with the equality operator: for extensional statements (as above), we use =, while for definitional statements we use ==. And, although our first example of DATR consisted entirely of extensional statements, almost all the remaining examples will be definitional. The semantics of the DATR language binds the two together in a declarative fashion, allowing us to concentrate on concise definitions of the network structure from which the extensional ``results'' can be read off.
Our first step towards a more concise account of Word1 and Word2 is simply to change the extensional statements to definitional ones:
Word1:
<syn cat> == verb
<syn type> == main
<syn form> == present participle
<mor form> == love ing.
Word2:
<syn cat> == verb
<syn type> == main
<syn form> == passive participle
<mor form> == love ed.
This is possible because DATR respects the unsurprising condition that
if at some node a value is specifically defined for a path with a
definitional statement, then the corresponding extensional statement
also holds. So the statements we previously made concerning Word1
and Word2 remain true, but now only implicitly true.
Although this change does not itself make the description more concise, it allows us to introduce other ways of describing values in definitional statements, in addition to simply specifying them. Such value descriptors will include inheritance specifications which allow us to gather together the properties that Word1 and Word2 have solely by virtue of being verbs. We start by introducing a VERB node:
VERB:
<syn cat> == verb
<syn type> == main.
and then redefine Word1 and Word2
to inherit their verb properties from it. A direct
encoding for this is as follows:
Word1:
<syn cat> == VERB:<syn cat>
<syn type> == VERB:<syn type>
<syn form> == present participle
<mor form> == love ing.
Word2:
<syn cat> == VERB:<syn cat>
<syn type> == VERB:<syn type>
<syn form> == passive participle
<mor form> == love ed.
In these revised definitions the right hand side of the <syn cat>
statement is not a direct value specification, but instead an
inheritance descriptor. This is the simplest form of DATR inheritance: it just specifies a new node and path from which to obtain
the required value. It can be glossed roughly as ``the value associated
with <syn cat> at Word1 is the same as the value associated with
<syn cat> at VERB''. Thus from VERB:<syn cat> ==
verb it now follows that Word1:<syn cat> ==
verb. And hence also the extensional version,
Word1:<syn cat> = verb.
However, this modification to our analysis seems to make it less rather than more concise. It can be improved in two ways. The first is really just a syntactic trick: if the path on the right hand side is the same as the path on the left hand side it can be omitted. So we can replace VERB:<syn type>, in the example above, with just VERB. We can also extend this abbreviation strategy to cover cases like the following, where the path on the right hand side is different but the node is the same:
Come:
<mor root> == come
<mor past participle> == Come:<mor root>.
In this case we can simply omit the node:
Come:
<mor root> == come
<mor past participle> == <mor root>.
The other improvement introduces one of the most important features of
DATR - specification by default. Recall that paths are
sequences of attributes. If we understand paths to start at their left
hand end, we can construct a notion of path extension: a path
P2 extends a path P1 if and only if all the attributes of
P1 occur in the same order at the left hand end of P2 (so <a1
a2 a3> extends <>, <a1>, <a1 a2> and <a1 a2 a3>, but not <a2>,
<a1 a3>, etc.). If we now consider the (finite) set of paths
occurring in definitional statements associated with some node, that set
will not include all possible paths (of which there are infinitely
many). So the question arises of what we can say about paths for which
there is no specific definition. For some path P1 not defined at
node N, there are two cases to consider: either P1 is the
extension of some path defined at N or it is not. The latter case
is easiest - there is simply no definition for P1 at N
(hence N can be a partial function, as already noted above).
But in the former case, where P1 extends some P2 which
is defined at N, P1 assumes a definition ``by default''.
If P2 is the only path defined at N which P1 extends,
then P1 takes its definition from the definition of P2. If
P1 extends several paths defined at N, it takes its
definition from the most specific (i.e., the longest) of the paths that
it extends.
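The selection rule just described can be sketched in a few lines. The function names (extends, defining_path) are illustrative assumptions, and paths are modelled as tuples of atoms.

```python
def extends(p2, p1):
    """True if path p2 extends path p1, i.e. p1 is a leading subpath of p2."""
    return tuple(p2[:len(p1)]) == tuple(p1)

def defining_path(defined_paths, query):
    """Return the longest defined path that the query extends, or None."""
    candidates = [tuple(p) for p in defined_paths if extends(query, p)]
    return max(candidates, key=len) if candidates else None

query = ("a1", "a2", "a3")
print(extends(query, ()))                                 # True
print(extends(query, ("a1", "a3")))                       # False
print(defining_path([("a1",), ("a1", "a2")], query))      # ('a1', 'a2')
print(defining_path([("a1",)], ("a2",)))                  # None: undefined
```

Two distinct defined paths of the same length cannot both be leading subpaths of one query, so the longest candidate is always unique.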
In the present example, this mode of default specification can be applied as follows. We have two statements at Word1 which (after applying the abbreviation introduced above) both inherit from VERB:
Word1:
<syn cat> == VERB
<syn type> == VERB.
Because they have a common leading subpath <syn>, we can collapse them
into a single statement about <syn> alone:
Word1:
<syn> == VERB.
If this were the entire definition of Word1, the default mechanism
would ensure that all extensions of <syn> (including the two that
concern us here) would be given the same definition - inheritance from
VERB. But in our example, of course, there are other statements
concerning Word1. If we add these back in, the complete definition
looks like this:
Word1:
<syn> == VERB
<syn form> == present participle
<mor form> == love ing.
The paths <syn type> and <syn cat> (and also many others, such as
<syn cat foo>, <syn baz>) obtain their definitions from <syn>
using the default mechanism just introduced, and so inherit from
VERB. But <syn form>, being explicitly defined, is exempt from
this default behaviour, and so retains its value definition,
present participle. And any extensions of <syn form> obtain
their definitions from <syn form> rather than <syn> (since it is a
more specific leading subpath), and so will have the value present
participle also.
The net effect of this definition for Word1 can be glossed as ``Word1 stipulates its morphological form to be love ing and inherits values for its syntactic features from VERB, except for <syn form> which is present participle''. More generally, this mechanism allows us to define nodes differentially: by inheritance from default specifications, augmented by any non-default settings associated with the node at hand. In fact, the Word1 example can take this default inheritance one step further, by inheriting everything (not just <syn>) from VERB, except for the specifically mentioned values:
Word1:
<> == VERB
<syn form> == present participle
<mor form> == love ing.
Here the empty path <> is a leading subpath of every path, and so
acts as a ``catch all'' - any path for which no more specific
definition at Word1 exists will inherit from VERB.
Inheritance via the empty path is ubiquitous in real DATR lexicons
but it should be remembered that the empty path has no special formal
status in the language.
In this way Word1 and Word2 can both inherit their general verbal properties from VERB. But of course these two particular forms have more in common than simply being verbs - they are both instances of the same verb, love. By introducing an abstract Love lexeme, we can provide a site for properties shared by all forms of love (in this simple example, just its morphological root and the fact that it is a verb).
VERB:
<syn cat> == verb
<syn type> == main.
Love:
<> == VERB
<mor root> == love.
Word1:
<> == Love
<syn form> == present participle
<mor form> == <mor root> ing.
Word2:
<> == Love
<syn form> == passive participle
<mor form> == <mor root> ed.
So now Word1 inherits from Love rather than
VERB (but Love inherits from VERB, so the latter's
definitions are still present at Word1). However, instead of
explicitly including the atom love in the morphological form, the
value definition includes the descriptor <mor root>. This descriptor
is equivalent to Word1:<mor root> and, since <mor root> is not
defined at Word1, the empty path definition applies, causing it to
inherit from Love:<mor root>, and thereby return the expected value,
love. Notice here that each element of a value can be
defined entirely independently of the others; for <mor form> we now
have an inheritance descriptor for the first element and a simple value
for the second.
Our toy fragment is beginning to look somewhat more respectable: a single node for abstract verbs, a node for each abstract verb lexeme, and then individual nodes for each morphological form of each verb. But there is still more that can be done. Our focus on a single lexeme has meant that one class of redundancy has remained hidden. The line
<mor form> == <mor root> ing
will occur in every present participle form of every verb. But it is a
completely generic statement that can be applied to all English present
participle verb forms. So can we not replace it with a single statement
in the VERB node? Using the mechanisms we have seen so far, the
answer is no. The statement would have to be (i), which is equivalent
to (ii), whereas the effect we want is (iii):
(i) VERB:<mor form> == <mor root> ing
(ii) VERB:<mor form> == VERB:<mor root> ing
(iii) VERB:<mor form> == Word1:<mor root> ing
Using (i) or (ii), we would end up with the same morphological root for every verb (or more likely no value at all, since it is hard to imagine what value VERB:<mor root> might plausibly be given), rather than a different one for each. And of course, we cannot simply use (iii) as it is, since that only applies to the particular word described by Word1, namely loving.
The problem is that the inheritance mechanism we have been using is local, in the sense that it can only be used to inherit either from a specifically named node (and/or path), or relative to the local context of the node (and/or path) at which it is defined. What we need is a way of specifying inheritance relative to the original node/path specification whose value we are trying to determine, rather than the one we have reached by following inheritance links. We shall refer to this original specification as the query we are attempting to evaluate, and the node and path associated with this query as the global context. Strictly speaking, the query node and path form just the initial global context, since as we shall see in Section 3.2.2 below, the global context can change during inheritance processing. Global inheritance, that is, inheritance relative to the global context, is indicated in DATR by using quoted ("...") descriptors, and we can use it to extend our definition of VERB as follows:
VERB:
<syn cat> == verb
<syn type> == main
<mor form> == "<mor root>" ing.
Here we have added a definition for <mor form> which contains the
quoted path "<mor root>". Roughly speaking, this is to be
interpreted as ``inherit the value of <mor root> from the node
originally queried''. With this extra definition, we no longer need a
<mor form> definition in Word1, so it just becomes:
Word1:
<> == Love
<syn form> == present participle.
To see how this global inheritance works, consider evaluating the query
Word1:<mor form>. Since <mor form> is not defined at Word1, it
will inherit from VERB via Love. This specifies inheritance of
<mor root> from the query node, which in this case is
Word1. The path <mor root> is not defined at Word1 but inherits the
value love from Love. Finally, the definition of <mor form>
at VERB adds an explicit ing, resulting in a value of
love ing for Word1:<mor form>. However, had we begun
evaluation at, say, a daughter of the lexeme Eat, we would have been
directed from VERB:<mor form> back to the original daughter of
Eat to determine its <mor root>, which would be inherited from
Eat itself. So we would have ended up with the value eat ing.
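The evaluation just traced can be mimicked by a small interpreter. What follows is a minimal sketch under stated assumptions, not a DATR implementation: the Python encoding (Local, Quoted, THEORY, evaluate) is ours; the definition of Eat and the node name Word4 for its daughter are hypothetical, since the text gives neither; and the global context is reduced to the query node, which never changes here.

```python
from collections import namedtuple

Local = namedtuple("Local", "node path")  # node or path is None when omitted
Quoted = namedtuple("Quoted", "path")     # "<...>": evaluated at the query node

THEORY = {
    "VERB": {
        ("syn", "cat"): ["verb"],
        ("syn", "type"): ["main"],
        ("mor", "form"): [Quoted(("mor", "root")), "ing"],
    },
    "Love": {(): [Local("VERB", None)], ("mor", "root"): ["love"]},
    "Eat": {(): [Local("VERB", None)], ("mor", "root"): ["eat"]},  # assumed
    "Word1": {(): [Local("Love", None)],
              ("syn", "form"): ["present", "participle"]},
    "Word4": {(): [Local("Eat", None)],                            # hypothetical
              ("syn", "form"): ["present", "participle"]},
}

def evaluate(node, path, query_node=None):
    """Evaluate node:<path>; query_node is the (fixed) global context."""
    path = tuple(path)
    query_node = query_node or node
    # Definition by default: use the longest defined leading subpath.
    defined = [p for p in THEORY[node] if path[:len(p)] == p]
    if not defined:
        raise KeyError(f"{node}:<{' '.join(path)}> is undefined")
    lhs = max(defined, key=len)
    rest = path[len(lhs):]  # the unmatched extension carries over to the RHS
    value = []
    for d in THEORY[node][lhs]:
        if isinstance(d, str):          # atom: part of the value as it stands
            value.append(d)
        elif isinstance(d, Local):      # local inheritance
            target_path = (lhs if d.path is None else d.path) + rest
            value.extend(evaluate(d.node or node, target_path, query_node))
        else:                           # quoted path: back to the query node
            value.extend(evaluate(query_node, d.path + rest, query_node))
    return tuple(value)

print(evaluate("Word1", ("mor", "form")))  # ('love', 'ing')
print(evaluate("Word4", ("mor", "form")))  # ('eat', 'ing')
```

The only difference between the two kinds of inheritance in this sketch is the node at which the recursive call starts: the named (or current) node for a local descriptor, the query node for a quoted one. No cycle detection is attempted, so a circular description would recurse without end.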
The analysis is now almost the way we would like it to be. However, by moving <mor form> from Word1 to VERB, we have introduced a new problem: we have frozen in the present participle as the (default) value of <mor form> for all verbs. Clearly, if we want to specify other forms at the same level of generality, then <mor form> is currently misnamed: it should be <mor present participle>, so that we can add <mor past participle>, <mor present tense>, etc. If we make this change, then the VERB node will look like this:
VERB:
<syn cat> == verb
<syn type> == main
<mor past> == "<mor root>" ed
<mor passive> == "<mor past>"
<mor present> == "<mor root>"
<mor present participle> == "<mor root>" ing
<mor present tense sing three> == "<mor root>" s.
In adding these new specifications, we have added a little extra
structure as well. The passive form is asserted to be the same as the
past form - the use of global inheritance here ensures that irregular
or subregular
past forms result in irregular or subregular passive forms,
as we shall see shortly.
The paths introduced for the present forms illustrate another use of
default definition. We assume that the morphology of present tense
forms is specified with paths of five attributes, the fourth specifying
number, the fifth, person. Here we define default present morphology to
be simply the root, and this generalises to all the longer forms, except
the present participle and the third person singular.
So now for Love, the following extensional statements hold, inter alia:
Love:
<syn cat> = verb
<syn type> = main
<mor present tense sing one> = love
<mor present tense sing two> = love
<mor present tense sing three> = love s
<mor present tense plur> = love
<mor present participle> = love ing
<mor past tense sing one> = love ed
<mor past tense sing two> = love ed
<mor past tense sing three> = love ed
<mor past tense plur> = love ed
<mor past participle> = love ed
<mor passive participle> = love ed.
There remains one last problem in the definitions of Word1 and Word2. The morphological form of Word1 is now given by <mor present participle>. Similarly, Word2's morphological form is given by <mor passive participle>. There is no longer a unique path representing morphological form. But this can be corrected by the addition of a single statement to VERB:
VERB:
<mor form> == "<mor "<syn form>">".
This statement employs a DATR construct, the evaluable path,
which we have not encountered before. The right hand side consists of a
(global) path specification, one of whose component attributes is itself
a descriptor, to be evaluated before the outer path can be. The effect
of the above statement is to say that <mor form> globally inherits
from the path given by the atom mor followed by the global value
of <syn form>. For Word1, <syn form> is present
participle, so <mor form> inherits from <mor present participle>.
But for Word2, <mor form> inherits from <mor passive
participle>. Effectively, the <syn form> is being used as a parameter
to control which specific form should be considered the
morphological form. Evaluable paths may themselves be global (as in
our example) or local and their evaluable components may also
involve global or local reference.
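The parameterising role of <syn form> can be pictured as a splice: the embedded descriptor is evaluated first and its atoms are inserted into the outer path. The sketch below assumes a stand-in lookup table (SYN_FORM) in place of real global inheritance, and the names build_path and resolve are illustrative only.

```python
def build_path(template, resolve):
    """Expand an evaluable path: atoms are kept as they are, and each
    embedded descriptor is replaced by the atoms of its value."""
    path = []
    for part in template:
        if isinstance(part, str):
            path.append(part)
        else:
            path.extend(resolve(part))
    return tuple(path)

# Stand-in for the global value of <syn form> at each query node.
SYN_FORM = {
    "Word1": ("present", "participle"),
    "Word2": ("passive", "participle"),
}

template = ["mor", ("syn", "form")]  # models <mor "<syn form>">
for word in ("Word1", "Word2"):
    print(word, build_path(template, lambda _path: SYN_FORM[word]))
# Word1 ('mor', 'present', 'participle')
# Word2 ('mor', 'passive', 'participle')
```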
Our analysis now looks like this:
VERB:
<syn cat> == verb
<syn type> == main
<mor form> == "<mor "<syn form>">"
<mor past> == "<mor root>" ed
<mor passive> == "<mor past>"
<mor present> == "<mor root>"
<mor present participle> == "<mor root>" ing
<mor present tense sing three> == "<mor root>" s.
Love:
<> == VERB
<mor root> == love.
Word1:
<> == Love
<syn form> == present participle.
Word2:
<> == Love
<syn form> == passive participle.
The entire analysis is somewhat larger than the original, but it encodes
all the past and present tense forms as well as all three participial
forms. More importantly, almost all the information is in the VERB
node and is common to many verb lexemes. Linguistically,
the analysis is still not
abstract enough since it fails to encode the morphotactic generalisation
that, by default, an inflected English word consists of a root optionally
followed by a suffix. Such generalisations are easy enough to state in
DATR but would entail more elaboration of our running example than
its expository purpose requires. Indeed, the other nodes are as
small as they reasonably could be: Love simply states that it is a
verb with morphological root love and Word1 simply states that
it is a present participle instance of Love.
Of course, Love is a completely regular verb. But DATR's capacity for definition by default allows subregular and irregular lexemes to be concisely represented also. As an example, consider the class of verbs which take en as their past participle ending: hew, mow, saw, sew, etc. (Our orthographic representations here presuppose some basic ``spelling rules'': thus love ed is spelt loved, love ing is spelt loving and mow en is spelt mown. If we had chosen to represent roots and suffixes as letter sequences rather than as atoms then it would have been possible to implement the necessary spelling rules in a finite state transducer written in DATR itself; see, for example, the one presented in Section 6.3, below.) We can represent this subregularity with a new verbal node which defaults to VERB, but overrides just the past participle morphology:
EN_VERB:
<> == VERB
<mor past participle> == "<mor root>" en.
Relevant individual verb lexemes then inherit from this
node instead of directly from VERB:
Mow:
<> == EN_VERB
<mor root> == mow.
Sew:
<> == EN_VERB
<mor root> == sew.
As noted above, the passive forms of these subregular verbs will also
now be correct, because of the use of a global cross-reference to the
past participle form in the VERB node. So for example, the
definition of the passive form of sew is:
Word3:
<> == Sew
<syn form> == passive participle.
If we seek to establish the <mor form> of Word3, we are sent
up the hierarchy of nodes, first to Sew, then to EN_VERB,
and then to VERB. Here we encounter
"<mor "<syn form>">" which resolves to
"<mor passive participle>" in virtue of the embedded global
reference to <syn form> at Word3. This means we now
have to establish the value of <mor passive participle> at Word3.
Again, we ascend the hierarchy to VERB and find ourselves
referred to the global descriptor "<mor past participle>".
This takes us back to Word3, from where we again climb,
first to Sew, then to EN_VERB. Here, <mor past participle>
is given as the sequence "<mor root>" en. This
leads us to look for the <mor root> of Word3 which we find
at Sew giving the result we seek:
Word3:
<mor form> = sew en.
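The derivation above can be checked mechanically with a small interpreter covering definition by default, local and quoted descriptors, and evaluable paths. This is a sketch under stated assumptions rather than a DATR implementation: the Python encoding (Local, Quoted, Q, inherit, THEORY) is ours, the global context is reduced to the query node and never changes, and the unmatched extension of a query path is appended to every path on the right hand side, including embedded ones, which is our reading of the default mechanism.

```python
from collections import namedtuple

Local = namedtuple("Local", "node path")  # path is None when omitted
Quoted = namedtuple("Quoted", "path")     # "<...>": evaluated at the query node

def Q(*parts):
    """Quoted path; parts are atoms or embedded descriptors."""
    return Quoted(parts)

def inherit(node):
    """Right hand side consisting of a bare node reference."""
    return [Local(node, None)]

THEORY = {
    "VERB": {
        ("syn", "cat"): ["verb"],
        ("syn", "type"): ["main"],
        ("mor", "form"): [Q("mor", Q("syn", "form"))],
        ("mor", "past"): [Q("mor", "root"), "ed"],
        ("mor", "passive"): [Q("mor", "past")],
        ("mor", "present"): [Q("mor", "root")],
        ("mor", "present", "participle"): [Q("mor", "root"), "ing"],
        ("mor", "present", "tense", "sing", "three"): [Q("mor", "root"), "s"],
    },
    "EN_VERB": {(): inherit("VERB"),
                ("mor", "past", "participle"): [Q("mor", "root"), "en"]},
    "Love": {(): inherit("VERB"), ("mor", "root"): ["love"]},
    "Sew": {(): inherit("EN_VERB"), ("mor", "root"): ["sew"]},
    "Word1": {(): inherit("Love"), ("syn", "form"): ["present", "participle"]},
    "Word2": {(): inherit("Love"), ("syn", "form"): ["passive", "participle"]},
    "Word3": {(): inherit("Sew"), ("syn", "form"): ["passive", "participle"]},
}

def expand(parts, node, lhs, rest, qnode):
    """Evaluate embedded descriptors and splice their atoms into the path."""
    path = []
    for part in parts:
        if isinstance(part, str):
            path.append(part)
        else:
            path.extend(eval_desc(part, node, lhs, rest, qnode))
    return path

def eval_desc(d, node, lhs, rest, qnode):
    """Value of one right hand side descriptor, as a list of atoms."""
    if isinstance(d, str):
        return [d]
    if isinstance(d, Local):
        target = d.node or node
        base = lhs if d.path is None else expand(d.path, node, lhs, rest, qnode)
    else:  # Quoted: continue from the query node
        target = qnode
        base = expand(d.path, node, lhs, rest, qnode)
    return list(evaluate(target, tuple(base) + rest, qnode))

def evaluate(node, path, qnode=None):
    """Evaluate node:<path>, using the longest defined leading subpath."""
    path = tuple(path)
    qnode = qnode or node
    lhs = max((p for p in THEORY[node] if path[:len(p)] == p),
              key=len, default=None)
    if lhs is None:
        raise KeyError(f"{node}:<{' '.join(path)}> is undefined")
    rest = path[len(lhs):]
    value = []
    for d in THEORY[node][lhs]:
        value.extend(eval_desc(d, node, lhs, rest, qnode))
    return tuple(value)

for word in ("Word1", "Word2", "Word3"):
    print(word, evaluate(word, ("mor", "form")))
# Word1 ('love', 'ing')
# Word2 ('love', 'ed')
# Word3 ('sew', 'en')
```

Word2 and Word3 differ only in the lexeme they inherit from, yet the passive participle comes out as love ed in one case and sew en in the other, because the quoted reference to <mor past> sends evaluation back to the query node, from which the EN_VERB override is reachable for Word3 only.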
Irregularity can be treated as just the limiting case of subregularity.
Orthographically, the form does could simply be treated as regular
(from do s). However, we have chosen to stipulate it here since,
although the spelling appears regular, the phonology is not, so
in a lexicon that defined phonological forms it would need to be
stipulated. The morphology of Do can thus be specified as follows:
Do:
<> == VERB
<mor root> == do
<mor past> == did
<mor past participle> == done
<mor present tense sing three> == does.
Likewise, the morphology of Be can be specified as follows:
Be:
<> == EN_VERB
<mor root> == be
<mor present tense sing one> == am
<mor present tense sing three> == is
<mor present tense plur> == are
<mor past tense sing one> == <mor past tense sing three>
<mor past tense sing three> == was
<mor past tense plur> == were.
In this section we have moved from simple attribute/value listings to a compact, generalisation-capturing representation for a fragment of English verbal morphology. In so doing, we have seen examples of most of the important ingredients of DATR: local and global descriptors, definition by default, and evaluable paths.
