ana - automatic a/an decisions
Deciding whether a word in text should be preceded by a or an is a surprisingly complex task. The school rule of `an before words beginning with a vowel letter (aeiou), a otherwise' does not work in a wide range of cases. It needs to be modified to `an before words which start with a vowel sound, a for a consonant sound'.
It is therefore necessary to somehow map from the orthography (the letters) of the word to its phonetic realisation (its pronunciation). Fortunately, there are several 'rules' that can be used to detect these exceptions in most cases. However, ultimately, one has to resort to memorising exceptions. ana attempts to perform the ``a/an decision'' automatically.
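As a rough illustration of the sound-based rule backed by memorised exceptions, the Python sketch below consults two small exception sets before falling back on the school rule. The function name and both sets are hypothetical and far from complete; this is not ana's actual rule set, only a minimal picture of the idea.

```python
# Hypothetical exception lists for illustration only; real coverage needs far more entries.
CONSONANT_SOUND_EXCEPTIONS = {"one", "once", "unix", "user", "european", "university"}
VOWEL_SOUND_EXCEPTIONS = {"hour", "honest", "honour", "heir"}


def choose_article(word):
    """Return 'a' or 'an' for a word, checking the exceptions before the letter rule."""
    lowered = word.lower()
    if lowered in VOWEL_SOUND_EXCEPTIONS:
        return "an"
    if lowered in CONSONANT_SOUND_EXCEPTIONS:
        return "a"
    # School rule as the fallback: a vowel letter means 'an'.
    return "an" if lowered[:1] in set("aeiou") else "a"


for example in ["hello", "one", "eight", "hour"]:
    print(choose_article(example), example)
```

The exceptions are checked first because they are exactly the cases where the letter rule gives the wrong answer.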
ana's default operation is to take a list of words either from the command-line or from standard input and determine whether each word begins with a vowel sound or not. The output consists of each word preceded by ana's choice of a or an followed by a tab. This behaviour is demonstrated in the examples below.
command line
ana hello goodbye
response
a hello a goodbye
command line
ana one two three
response
a one a two a three
Note that in this example, the school rule (`an before vowel letters') would not have worked for one.
If no command-line arguments are supplied then ana reads lines from standard input (stdin). This is called stdin-mode. Each line is split on white-space and each word processed in turn:
command line
ana
typed text
seven eight nine ten
response
a seven an eight a nine a ten
This functionality is only useful for demonstration purposes. The real utility of ana comes from stream processing.
ana has three distinct modes of operation. Two of these, command-line mode and stdin-mode, are discussed above. The third, stream mode, is the most complex but by far the most useful.
Use the option -stream to enable stream mode. ana will then read one-line sentences from stdin and correct all a/an decisions. This is called `plain stream mode'.
command line
ana -stream
typed text
Eat a apple and an banana daily
response
Eat an apple and a banana daily
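The behaviour shown above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not ana's implementation: the names AAN_PATTERN, starts_with_vowel_sound and correct_line are hypothetical, and the vowel test is a letter-based stand-in for ana's sound-based decision.

```python
import re

# Whole-word a/an, the white space after it, and the following word.
AAN_PATTERN = re.compile(r"\b(an?)(\s+)(\S+)", re.IGNORECASE)


def starts_with_vowel_sound(word):
    # Letter-based stand-in; leading punctuation such as quotes is skipped first.
    stripped = re.sub(r"^\W+", "", word)
    return stripped[:1].lower() in set("aeiou")


def correct_line(line):
    """Rewrite every a/an in one sentence according to the word that follows it."""
    def fix(match):
        old, gap, target = match.groups()
        new = "an" if starts_with_vowel_sound(target) else "a"
        if old[0].isupper():
            new = new.capitalize()  # keep a sentence-initial capital
        return new + gap + target

    return AAN_PATTERN.sub(fix, line)


print(correct_line("Eat a apple and an banana daily"))
```

Requiring white space straight after the article is what stops the pattern from firing on words such as `and'.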
`Plain' is an example of a named stream specification. The option -stream is an alias for the more general option -name with an argument of `plain'. The -name option takes at least one argument, which is the name of the type of stream. At present this can be one of three: plain, tag or sgml. These are discussed in more detail below.
In tagged text, each word is followed by an indication of its part-of-speech (PoS). For example:
a_AT0 elephant_NP1 ate_VBZ a_AT0 orange_NP1
In order to use ana to correct a/an decisions here, it is necessary to use `tag stream mode':
command line
ana -name tag _ AT0
typed text
a_AT0 elephant_NP1 ate_VBZ a_AT0 orange_NP1
response
an_AT0 elephant_NP1 ate_VBZ an_AT0 orange_NP1
The command line specifies that the character (really a regular expression) `_' separates the word from the tag and the a/an tag is `AT0'. For more details on specifying arguments for tag stream mode, see the full description of -name below.
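To make the roles of the separator and the a/an tag concrete, here is a hedged Python sketch of tag stream matching. The function names are hypothetical, the vowel test is letter-based only, and the pattern is an approximation rather than the expressions ana builds internally. The two arguments are assumed to contain no capturing parentheses.

```python
import re


def starts_with_vowel_sound(word):
    # Letter-based stand-in; ana's real decision is sound-based.
    return word[:1].lower() in set("aeiou")


def make_tag_corrector(tag_sep_re="_", tag_re="AT0"):
    """Build a function that corrects a/an in word<sep>tag text."""
    pattern = re.compile(
        r"\b(an?)"                                          # the article itself
        + "((?:" + tag_sep_re + ")(?:" + tag_re + "))"      # its separator and tag
        + r"(\s+)"                                          # gap before the target
        + r"(\S+?)(?=" + tag_sep_re + ")"                   # target, up to its separator
    )

    def fix(match):
        _, tag, gap, target = match.groups()
        article = "an" if starts_with_vowel_sound(target) else "a"
        return article + tag + gap + target

    return lambda line: pattern.sub(fix, line)


correct = make_tag_corrector("_", "AT0")
print(correct("a_AT0 elephant_NP1 ate_VBZ a_AT0 orange_NP1"))
```

Because the article must be followed immediately by the separator and the tag, a word such as `ate_VBZ' is never mistaken for a/an.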
Many corpora and system output use SGML (or variants) to delimit words, sentences, etc. ana has a stream mode that handles this type of output.
command line
ana -name sgml WORD
typed text (on one line)
<WORD>a</WORD> <WORD>elephant</WORD> <WORD>ate</WORD> <WORD>a</WORD> <WORD>orange</WORD>
response (on one line)
<WORD>an</WORD> <WORD>elephant</WORD> <WORD>ate</WORD> <WORD>an</WORD> <WORD>orange</WORD>
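The same idea carries over to SGML-delimited words. The Python sketch below is illustrative only: make_sgml_corrector is a hypothetical name, the element name is treated as a literal string rather than a regular expression, attributes are not handled, and the vowel test is letter-based.

```python
import re


def starts_with_vowel_sound(word):
    # Letter-based stand-in; ana's real decision is sound-based.
    return word[:1].lower() in set("aeiou")


def make_sgml_corrector(element="WORD"):
    """Build a function that corrects a/an wrapped in <element> ... </element>."""
    open_tag = re.escape("<" + element + ">")
    close_tag = re.escape("</" + element + ">")
    pattern = re.compile(
        "(" + open_tag + ")(an?)(" + close_tag + ")"   # the article in its element
        + "(.*?)"                                      # whatever separates the two words
        + "(" + open_tag + ")(.+?)(" + close_tag + ")" # the target word in its element
    )

    def fix(match):
        open1, _, close1, gap, open2, target, close2 = match.groups()
        article = "an" if starts_with_vowel_sound(target) else "a"
        return open1 + article + close1 + gap + open2 + target + close2

    return lambda line: pattern.sub(fix, line)


correct = make_sgml_corrector("WORD")
print(correct("<WORD>a</WORD> <WORD>elephant</WORD>"))
```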
ana takes several command-line options which should provide enough functionality to handle a variety of text formats requiring a/an correction. The full syntax of ana is shown below, followed by a thorough description of the command-line options.
ana
[ -aan a an [ aan_regexp ] ]
[ -capitals ]
[ -delimiters [ sentence_start_re ] sentence_end_re ]
[ -help ]
[ -line ]
[ -multiline ]
[ -name plain ]
[ -name tag [ [ tag_sep_re ] tag_re ] ]
[ -name sgml [ [ attrib_re ] element_re ] ]
[ -regexps re1 re2 [ re3 [ re4 [ re5 ] ] ] ]
[ -stream ]
[ -transpose ]
[ -verbosity [ level ] ]
[ -word regexp ]
[ -- ]
[ word ... ]
Option processing continues up to the first `--' in the argument list (if present). Options must precede non-option command-line arguments.
Some options implicitly set variables specific to other options. For example, many of the options are only appropriate in stream mode, so using these will implicitly switch to stream processing. The descriptions below state which variables are being set implicitly.
All options to ana can use initial letter abbreviations. Initial letters can be `bundled'. For example:
ana -sm
is equivalent to
ana -s -m
Bundling options which take arguments is handled appropriately but, for the sake of clarity, is not advised.
The majority of the options described below take arguments, some of which are optional. However, the options are all `greedy'. If an option takes between 2 and 5 arguments then if 5 are available, they are used.
It is advised that `--' is used to terminate a complicated command-line syntax.
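The reason for this advice can be seen in a small Python sketch of greedy argument collection. It is hypothetical: the function name is invented, and the assumption that collection stops at `--' or at anything beginning with `-' is mine, not a documented detail of ana's option parser.

```python
def take_option_args(argv, index, min_args, max_args):
    """Greedily collect arguments for the option at argv[index].

    Returns the collected arguments and the index of the first unconsumed item.
    Assumption: collection stops at '--' or at anything starting with '-'.
    """
    taken = []
    position = index + 1
    while (position < len(argv)
           and len(taken) < max_args
           and not argv[position].startswith("-")):
        taken.append(argv[position])
        position += 1
    if len(taken) < min_args:
        raise ValueError(f"{argv[index]} needs at least {min_args} arguments")
    return taken, position


# Without '--' the word 'hello' is swallowed as the optional third argument.
print(take_option_args(["-aan", "C", "V", "hello"], 0, 2, 3))
# With '--' the option stops after its two required arguments.
print(take_option_args(["-aan", "C", "V", "--", "hello"], 0, 2, 3))
```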
Option arguments that are regular expressions are indicated in the syntax specifications by the presence of re. Specifying these can be problematic for a variety of reasons.
Firstly, to prevent interpretation by the shell, regular expression arguments should be enclosed in single quotes. This applies to most Unix shells.
It is worthwhile reading the regular expression section of the perl manuals. Try:
man perlre
The -verbosity option can be very useful for viewing the processing of the regular expressions specified. Verbosity is automatically switched on when most higher-level options such as -regexps are used.
Since ana uses back-referencing, if any regular expressions specified use parentheses then these must be:
(?: ... )
The presence of `?:' suppresses back-referencing.
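The effect is easy to reproduce outside ana. The Python sketch below (Python's re module accepts the same `(?: ... )' syntax as Perl) prepends a user-supplied left context to a pattern whose first group is meant to capture the target word. The function name is hypothetical.

```python
import re


def target_after(left_context_re, text):
    """Return what group 1 captures when a user-supplied left context is prepended."""
    match = re.search(left_context_re + r"(\w+)", text)
    return match.group(1) if match else None


# Non-capturing parentheses leave group 1 pointing at the target word.
print(target_after(r"(?:this|that) ", "this example"))
# Capturing parentheses shift the numbering, so group 1 is now the context.
print(target_after(r"(this|that) ", "this example"))
```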
If you have any problems, please get in touch with the author (see below).
Option Syntax:
-aan a an
-aan a an aan_regexp
This option specifies which two strings should be used to detect and/or replace a/an.
-aan a an
The regular expression used to search for a/an in text will be: `(a|an)'. If the target word starts with a consonant sound, a is used, otherwise an is used. See Example: -aan (1) below.
-aan a an aan_regexp
The regular expression used to search for a/an in text will be aan_regexp. a will be used for consonant sounding words, an for vowel sounding words. See Example: -aan (2) below.
Option Syntax:
-capitals
By default, ana expects the text it is processing to be lower-case. This enables it to detect acronyms accurately since the regular expressions rely on cases changing throughout the word.
For source texts that are completely in upper case, the acronym rules would lead to many incorrect a/an decisions. This option suppresses the application of the acronym regular expressions. This will not eliminate all incorrect decisions but it is the more accurate of the two alternatives.
Option Syntax:
-delimiters sentence_start_re sentence_end_re
In multiline mode, sentences are delimited by two regular expressions. Depending on the named mode (if any), these are set to the following defaults:
Mode | Sentence Start | Sentence End
--------+----------------+-------------
plain | "" | "\."
tag | "" | "._[^_]*"
sgml | "<SENTENCE>" | "</SENTENCE>"
This option explicitly specifies both the start-of-sentence and end-of-sentence regular expressions, over-riding the defaults above.
Note that this option must be used if -regexps is the most recent specification of the 5 regular expressions and multiline mode is in operation.
See Example: -delimiters (1) and (2) below as well as the following examples for -multiline: (1) and (2).
Option Syntax:
-help
Displays the syntax of ana.
Option Syntax:
-line
Indicates that each line should be considered a separate sentence. By default, ana operates in line mode.
Option Syntax:
-multiline
The default behaviour of ana in stream mode is to consider each input line to be one sentence. For many applications, this is not the case. The -multiline option relaxes this one-line constraint so that sentences can be allowed to cover several lines.
In order to split the stream into sentences, ana must know how to find the start and end of a sentence. Default regular expressions are assumed depending on the named mode unless over-ridden by the -delimiters option.
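A minimal Python sketch of delimiter-based sentence splitting follows. It assumes the whole input is available as one string and uses the plain-mode defaults from the -delimiters table; split_sentences is a hypothetical name and ana's own buffering may work differently.

```python
import re


def split_sentences(text, start_re="", end_re=r"\."):
    """Return every span running from a start delimiter to the nearest end delimiter."""
    pattern = re.compile(
        "(?:" + start_re + ")(.*?)(?:" + end_re + ")",
        re.DOTALL,  # let a sentence run across line breaks
    )
    return [match.group(0) for match in pattern.finditer(text)]


# Plain defaults: no start marker, a full stop ends the sentence.
print(split_sentences("a egg in\nan cup. a owl saw\nan fox."))
# SGML-style delimiters.
print(split_sentences("<SENTENCE>a egg</SENTENCE>\n<SENTENCE>an cup</SENTENCE>",
                      "<SENTENCE>", "</SENTENCE>"))
```

Each returned span can then be corrected as a single sentence, regardless of how many input lines it covered.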
Option Syntax:
-name plain
-name tag [ [ tag_sep_re ] tag_re ]
-name sgml [ [ attrib_re ] element_re ]
ana currently recognises three different `named streams': plain, tag and sgml. The arguments that these streams take (if any) are described below.
-name plain
Detects a/an in plain text. This is equivalent to -stream.
-name tag
Detects a/an in tagged text. The character `_' is assumed to separate a word from its tag and the a/an tag is AT0 (as in the CLAWS5 tagset).
Regular Expressions:
regexp | value
--------+------
left1 | \b
right1 | _AT0
sep | .+?
left2 | \b
right2 | _
See Example: -name (3) below.
-name tag tag_re
The supplied argument over-rides the default AT0 regular expression for matching a/an tags.
Regular Expressions:
regexp | value
--------+------
left1 | \b
right1 | _<tag_re>
sep | .+?
left2 | \b
right2 | _
See Example: -name (4) below.
-name tag tag_sep_re tag_re
The first argument over-rides the default `_' regular expression separating a word from its tag and the second over-rides the AT0 default a/an tag.
Regular Expressions:
regexp | value
--------+------
left1 | \b
right1 | <tag_sep_re><tag_re>
sep | .+?
left2 | \b
right2 | <tag_sep_re>
See Example: -name (5) below.
-name sgml
This uses the default sgml-stream specification. Words are surrounded by:
<WORD> ... </WORD>.
See Example: -name (6) below.
-name sgml element_re
The supplied element over-rides the default WORD. So words are surrounded by:
<element_re> ... </element_re>.
See Example: -name (7) below.
-name sgml attrib_re element_re
Words will be set in one of the following contexts:
<element_re attrib_re=... >
or:
<element_re attrib_re=... > </element_re>.
See Example: -name (8) below.
For a discussion of how regular expressions are used in ana, see the section on regular expression matching below.
Option Syntax:
-regexps left1 right1
-regexps left1 right1 sep
-regexps left1 right1 sep left2
-regexps left1 right1 sep left2 right2
Omitted arguments use the following defaults:
regexp | default
--------+--------
sep | .+?
left2 | left1
right2 | right1
This option provides a way of completely specifying the 5 regular expressions used for identifying the location of a/an and target words. See Examples: -regexps (1) and (2) below.
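The way the five expressions and their defaults fit together can be sketched in Python as follows. This is an approximation under stated assumptions: make_corrector is a hypothetical name, the a/an expression is fixed to `an?', the vowel test is letter-based, and the supplied expressions must not contain capturing parentheses.

```python
import re


def starts_with_vowel_sound(word):
    # Letter-based stand-in; ana's real decision is sound-based.
    return word[:1].lower() in set("aeiou")


def make_corrector(left1, right1, sep=r".+?", left2=None, right2=None):
    """Combine the five context expressions into one a/an correcting function."""
    # Omitted target-word contexts default to the a/an contexts.
    left2 = left1 if left2 is None else left2
    right2 = right1 if right2 is None else right2
    pattern = re.compile(
        "(" + left1 + ")(an?)(" + right1 + ")"            # a/an and its contexts
        + "(" + sep + ")"                                 # separator
        + "(" + left2 + r")([^\s]+?)(" + right2 + ")"     # target word and its contexts
    )

    def fix(match):
        l1, _, r1, gap, l2, target, r2 = match.groups()
        article = "an" if starts_with_vowel_sound(target) else "a"
        return l1 + article + r1 + gap + l2 + target + r2

    return lambda text: pattern.sub(fix, text)


print(make_corrector("<", ">")("<an> <sentence> <with> <chevrons>"))
print(make_corrector("<", ">", r"\.")("<an> <sentence><with> <an>.<dot>"))
```

Every piece is captured so that the replacement can put the contexts and separator back unchanged and alter only the article.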
Option Syntax:
-stream
This is equivalent to -name plain.
Option Syntax:
-transpose
Using this option swaps the order of a/an and the target word. This means that the five regular expressions (discussed in the section on regular expression matching below) are now used as follows:
left1   the left context for the target word
right1  the right context for the target word
sep     the characters separating the two words and their contexts
left2   the left context for a/an
right2  the right context for a/an
Note that this has no effect outside stream mode.
Option Syntax:
-verbosity
-verbosity level
Turns verbosity on. The default is level 1. Higher (or lower) verbosity levels can be set by supplying an optional integer argument. At present, verbosity levels range from 0 to 2.
ana has been designed with `cut-and-paste' in mind. The regular expressions used utilise only basic regular expressions constructs thus aiding integration into other applications that may not have regular expression facilities as powerful as those in Perl 5.
Option Syntax:
-word word_regexp
By default, the regular expression used to match target words is `[^\s]+?', i.e. any sequence of one or more non-white-space characters. The question mark after the `+' means that the regular expression will match as little of the string as possible. Since it is normally followed by the right context, the match is stretched just far enough to cover the next word.
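The difference between the minimal and the greedy form shows up when two words are not separated by white space. The Python sketch below uses chevrons as left and right context; first_target is a hypothetical helper, and Python's `+?' behaves like Perl's here.

```python
import re


def first_target(word_re, text):
    """Return the first word captured between '<' and '>' using word_re."""
    match = re.search("<(" + word_re + ")>", text)
    return match.group(1) if match else None


# Minimal match: stops at the first right context.
print(first_target(r"[^\s]+?", "<sentence><with>"))
# Greedy match: runs on to the last right context it can reach.
print(first_target(r"[^\s]+", "<sentence><with>"))
```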
command line
ana -stream -aan "C" "V"
typed text
this is C example of V decision
response
this is V example of C decision
command line
ana -capitals
typed text
A MPHIL DEGREE
response
A MPHIL DEGREE
Notice how stream mode is implicitly set through the use of the -capitals option.
command line
ana -stream
typed text
A MPHIL DEGREE
response
An MPHIL DEGREE
Since input is assumed to be in lower-case, upper-case words are treated as acronyms.
command line
ana -stream
typed text
A MPhil degree
response
An MPhil degree
The correct output given expected input.
command line
ana -multiline
typed text
a egg in an cup.
response
an egg in a cup.
command line
ana -multiline -name tag
typed text
an_AT0 woman_NP1 bought_VBZ an_AT0 unix_JJ box_NP1 ._.
response
a_AT0 woman_NP1 bought_VBZ a_AT0 unix_JJ box_NP1 ._.
command line
ana -name plain
typed text
a example sentence
response
an example sentence
command line
ana -name plain
typed text
an "quoted" example
response
a "quoted" example
command line
ana -name tag
typed text
an_AT0 "_PUNCT default_JJ "_PUNCT tag_NN1
response
a_AT0 "_PUNCT default_JJ "_PUNCT tag_NN1
By default, AT0 is the a/an tag and `_' is the word/tag separation character (can be a regular expression).
command line
ana -name tag AT1
typed text
an_AT1 default_JJ tag_NN1
response
a_AT1 default_JJ tag_NN1
Over-riding the default AT0 for the a/an tag, AT1 is now used. The default `_' is still the word/tag separation character (can be a regular expression).
command line
ana -name tag '\s*#\s*' ARTICLE
typed text
a # ARTICLE exception #WORD
response
an # ARTICLE exception #WORD
Over-riding the default AT0 for the a/an tag, ARTICLE is now used. A `#' optionally surrounded by any amount of white space is used as the word/tag separation regular expression.
command line
ana -name sgml
typed text
<WORD>an</WORD> <WORD>table</WORD>
response
<WORD>a</WORD> <WORD>table</WORD>
command line
ana -name sgml W
typed text
<W>an</W> <W>table</W>
response
<W>a</W> <W>table</W>
command line
ana -name sgml ORTH W
typed text
<W ORTH="a"> <W ORTH=elephant></W>
response
<W ORTH="an"> <W ORTH=elephant></W>
command line
ana -regexps '<' '>'
typed text
<an> <sentence> <with> <chevrons>
response
<a> <sentence> <with> <chevrons>
Notice how the arguments are quoted to prevent interpretation by the shell.
command line
ana -regexps '<' '>' '\.'
typed text
<an> <sentence><with> <an>.<dot>
response
<an> <sentence><with> <a>.<dot>
The third argument to the option -regexps specifies the separator regular expression. Here the expression consists only of a single dot. This must be preceded by a backslash to prevent its special interpretation within regular expressions.
In the example, the first two words an and sentence are not matched since they are separated by a space, not a dot.
command line
ana -t
-a ' consonant' ' vowel' ""
-r "orth:\s*\b" "\b" ".*?" "decision:" "\s*"
typed text
orth: hour decision:
response
orth: hour decision: vowel
The command-line is shown spread across several lines for clarity. This would in reality all be on the same line.
Internally, the way in which ana determines which word is a/an and which is the `target word' is through the use of several regular expressions.
Each of these words uses two regular expressions. These specify the left-context and the right context. For example, in plain stream mode, a/an is identified by \b(a|an)\b (where \b matches word boundaries in Perl). Here the left context is `\b' as is the right context.
A further regular expression (the fifth) is used to specify how the words (and their contexts) are separated from each other. In most cases, this will be white space.
So in total, 5 regular expressions are needed:
left1   the left context for a/an
right1  the right context for a/an
sep     the characters separating the two words and their contexts
left2   the left context for the target word
right2  the right context for the target word
The named stream modes provide an easy way of specifying these without worrying about regular expressions. However, for some text formats which are not covered by the named stream modes, it may be necessary to specify the regular expressions completely. Option -regexps provides a way of doing this.
/usr/bin/ana executable + documentation
ana uses simple regular expressions to detect acronyms based on the case of the letters in the word. For such words, the a/an decision is based on whether the first letter is pronounced like a vowel or a consonant, e.g. `an FBI agent', `a UFO'. A problem occurs if all the source text is upper-case, since ana will then mistakenly detect acronyms.
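A small Python sketch shows both the rule and the problem. The acronym test used here (a word beginning with two upper-case letters) is a hypothetical simplification, not ana's actual expression, and the names are invented for illustration.

```python
import re

# Letters whose spoken names begin with a vowel sound ("ef", "aitch", "em", ...).
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")

# Hypothetical acronym test: the word starts with two upper-case letters.
ACRONYM = re.compile(r"^[A-Z]{2}")


def looks_like_acronym(word):
    return ACRONYM.match(word) is not None


def article_for_acronym(word):
    """Choose a/an from the spoken name of the acronym's first letter."""
    return "an" if word[0] in VOWEL_SOUND_LETTERS else "a"


for word in ["FBI", "UFO", "MPhil"]:
    print(article_for_acronym(word), word)

# In all-upper-case text an ordinary word is indistinguishable from an acronym.
print(looks_like_acronym("DEGREE"), looks_like_acronym("degree"))
```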
One work around is to down-case the source text. This will improve the accuracy of correction but no acronyms will be detected. If necessary, the text can then be up-cased after a/an processing.
For example:
cat myfile | tr A-Z a-z | ana <options> | tr a-z A-Z
A future version of ana may include a (probabilistic) decision on whether any word is `pronounceable', which would solve this problem.
The author wishes to thank John Carroll for keeping the code from contracting an even more dangerous strain of featuritis (!) and Sam Simpson for all her support as well as the name ana instead of aan.
Darren Pearce <Darren.Pearce@sussex.ac.uk>