NAME

ana - automatic a/an decisions


DESCRIPTION

Deciding whether a word in text should be preceded by a or an is a surprisingly complex task. The school rule of `an before words beginning with a vowel letter (aeiou), a otherwise' does not work in a wide range of cases. It needs to be modified to `an before words which start with a vowel sound, a for a consonant sound'.

It is therefore necessary to somehow map from the orthography (the letters) of the word to its phonetic realisation (its pronunciation). Fortunately, there are several `rules' that can be used to detect these exceptions in most cases. However, ultimately, one has to resort to memorising exceptions. ana attempts to perform the `a/an decision' automatically.
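
The combination of a general rule with memorised exceptions can be sketched in Python as follows. This is only an illustration: the names EXCEPTIONS and choose_article and the three listed words are assumptions made for the example and are not taken from ana's own rule set.

```python
# Toy a/an chooser: memorised exceptions first, then the vowel-letter rule.
# EXCEPTIONS and choose_article are illustrative names, not part of ana.
EXCEPTIONS = {
    "one": "a",     # vowel letter, consonant sound
    "unix": "a",    # vowel letter, consonant sound
    "hour": "an",   # consonant letter, vowel sound (silent h)
}

VOWEL_LETTERS = set("aeiou")


def choose_article(word):
    """Return 'a' or 'an' for the given word."""
    key = word.lower()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    # Fall back to the school rule when no exception is known.
    return "an" if key[:1] in VOWEL_LETTERS else "a"


if __name__ == "__main__":
    for w in ["hello", "one", "eight", "hour"]:
        print(choose_article(w) + "\t" + w)
```

A table of whole words is the crudest possible form of `memorising exceptions'; it is used here only to keep the sketch short.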


BASIC OPERATION

ana's default operation is to take a list of words either from the command-line or from standard input and determine whether each word begins with a vowel sound or not. The output consists of each word preceded by ana's choice of a or an followed by a tab. This behaviour is demonstrated in the examples below.


Example: ana hello goodbye

command line

 ana hello goodbye

response

 a       hello
 a       goodbye


Example: ana one two three

command line

 ana one two three

response

 a       one
 a       two
 a       three

Note that in this example, the school rule (`an before vowel letters') would not have worked for one.


Example: ana

If no command-line arguments are supplied then ana reads lines from standard input (stdin). This is called stdin-mode. Each line is split on white-space and each word processed in turn:

command line

 ana

typed text

 seven eight nine ten

response

 a       seven
 an      eight
 a       nine
 a       ten

This functionality is only useful for demonstration purposes. The real utility of ana comes from stream processing.
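
The word-by-word processing and the tab-separated output format described above can be sketched in Python as follows. The function names are hypothetical, and the decision function is a placeholder that applies only the vowel-letter rule; it stands in for ana's real decision procedure.

```python
def first_letter_rule(word):
    """Placeholder decision: the naive vowel-letter rule, not ana's rules."""
    return "an" if word[:1].lower() in set("aeiou") else "a"


def decide_lines(lines, decide=first_letter_rule):
    """Split each line on white-space and pair every word with a or an."""
    output = []
    for line in lines:
        for word in line.split():
            # Each output line is the chosen article, a tab, then the word.
            output.append(decide(word) + "\t" + word)
    return output


if __name__ == "__main__":
    for row in decide_lines(["seven eight nine ten"]):
        print(row)
```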


STREAM MODE

ana has three distinct modes of operation. Two of these, command-line mode and stdin-mode, are discussed above. The third, which is the most complex but by far the most useful, is stream mode.

Use the option -stream to enable stream mode. ana will then read one-line sentences from stdin and correct all a/an decisions. This is called `plain stream mode'.


Example: ana -stream

command line

 ana -stream

typed text

 Eat a apple and an banana daily

response

 Eat an apple and a banana daily

`Plain' is an example of a named stream specification. The option -stream is an alias for the more general option -name with an argument of `plain'. The -name option takes at least one argument which is the name of the type of stream. At present this can be one of three: plain, tag or sgml. These are discussed in more detail below.
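
As a rough sketch of what plain stream correction involves, the Python fragment below joins the plain-stream expressions listed under -name plain (\b around a/an, .+? as separator, \b around the target word) into one pattern and rewrites each article. The names PLAIN, naive_article and correct_plain are assumptions for this example, and the decision function applies the vowel-letter rule only.

```python
import re

# left1 \b, a/an, right1 \b, sep .+?, left2 \b, target word, right2 \b.
# The match is case-sensitive, so only lower-case a/an is corrected here.
PLAIN = re.compile(r"\b(?P<aan>a|an)\b(?P<sep>.+?)\b(?P<word>[^\s]+?)\b")


def naive_article(word):
    """Stand-in decision (vowel-letter rule only); ana's rules are richer."""
    return "an" if word[:1].lower() in set("aeiou") else "a"


def correct_plain(sentence, decide=naive_article):
    """Rewrite each a/an according to the word that follows it."""
    def fix(match):
        word = match.group("word")
        return decide(word) + match.group("sep") + word
    return PLAIN.sub(fix, sentence)


if __name__ == "__main__":
    print(correct_plain("Eat a apple and an banana daily"))
```

Because the separator is .+? and not a single space, punctuation between the article and the target word (such as an opening quotation mark) is skipped over.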


TAG STREAM MODE

In tagged text, each word is followed by an indication of its part-of-speech (PoS). For example:

        a_AT0 elephant_NP1 ate_VBZ a_AT0 orange_NP1

In order to use ana to correct a/an decisions here, it is necessary to use `tag stream mode':


Example: ana -name tag _ AT0

command line

 ana -name tag _ AT0

typed text

 a_AT0 elephant_NP1 ate_VBZ a_AT0 orange_NP1

response

 an_AT0 elephant_NP1 ate_VBZ an_AT0 orange_NP1

The command line specifies that the character (really a regular expression) `_' separates the word from the tag and the a/an tag is `AT0'. For more details on specifying arguments for tag stream mode, see the full description of -name below.
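
The way the two arguments feed into the matching can be sketched in Python. The sketch assumes the five expressions listed under -name tag; the function names and the vowel-letter decision function are hypothetical and are not ana's implementation.

```python
import re


def tag_regexps(tag_sep_re="_", tag_re="AT0"):
    """Return (left1, right1, sep, left2, right2) for tag stream mode."""
    return (r"\b", tag_sep_re + tag_re, r".+?", r"\b", tag_sep_re)


def naive_article(word):
    """Stand-in decision (vowel-letter rule only); ana's rules are richer."""
    return "an" if word[:1].lower() in set("aeiou") else "a"


def correct_tagged(sentence, tag_sep_re="_", tag_re="AT0", decide=naive_article):
    """Correct a/an in tagged text, leaving tags and spacing untouched."""
    left1, right1, sep, left2, right2 = tag_regexps(tag_sep_re, tag_re)

    def group(regexp):
        # Non-capturing wrapper so that contexts cannot disturb the groups.
        return "(?:" + regexp + ")"

    pattern = re.compile(
        group(left1) + "(?P<aan>a|an)" + group(right1)
        + group(sep)
        + group(left2) + r"(?P<word>[^\s]+?)" + group(right2)
    )

    def fix(match):
        whole = match.group(0)
        start = match.start("aan") - match.start()
        end = match.end("aan") - match.start()
        # Replace only the article; everything else is copied verbatim.
        return whole[:start] + decide(match.group("word")) + whole[end:]

    return pattern.sub(fix, sentence)


if __name__ == "__main__":
    print(correct_tagged("a_AT0 elephant_NP1 ate_VBZ a_AT0 orange_NP1"))
```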


SGML STREAM MODE

Many corpora and system output use SGML (or variants) to delimit words, sentences, etc. ana has a stream mode that handles this type of output.


Example: ana -name sgml WORD

command line

 ana -name sgml WORD

typed text (on one line)

 <WORD>a</WORD> <WORD>elephant</WORD>
 <WORD>ate</WORD> <WORD>a</WORD> <WORD>orange</WORD>

response (on one line)

 <WORD>an</WORD> <WORD>elephant</WORD>
 <WORD>ate</WORD> <WORD>an</WORD> <WORD>orange</WORD>


SYNTAX

ana takes several command-line options which should provide enough functionality to handle a variety of text formats requiring a/an correction. The full syntax of ana is shown below followed by a thorough description of the command-line options.

ana

[ -aan a an [ aan_regexp ] ]

[ -capitals ]

[ -delimiters [ sentence_start_re ] sentence_end_re ]

[ -help ]

[ -line ]

[ -multiline ]

[ -name plain ]

[ -name tag [ [ tag_sep_re ] tag_re ] ]

[ -name sgml [ [ attrib_re ] element_re ] ]

[ -regexps re1 re2 [ re3 [ re4 [ re5 ] ] ] ]

[ -stream ]

[ -transpose ]

[ -verbosity [ level ] ]

[ -word regexp ]

[ -- ]

[ word ... ]


OPTION PROCESSING

Option processing continues up to the first `--' in the argument list (if present). Options must precede non-option command-line arguments.


Implicit Settings

Some options implicitly set variables specific to other options. For example, many of the options are only appropriate in stream mode so using these will implicitly switch to stream processing. The descriptions below state which variables are being set implicitly.


Abbreviations

All options to ana can use initial letter abbreviations. Initial letters can be `bundled'. For example:

 ana -sm

is equivalent to

 ana -s -m

Bundling options which take arguments is handled appropriately but, for the sake of clarity, is not advised.


Option Arguments

The majority of the options described below take arguments, some of which are optional. However, the options are all `greedy': each one consumes as many arguments as it can. If an option takes between 2 and 5 arguments and 5 are available, all 5 are used.

It is advised that `--' is used to terminate a complicated command-line syntax.


Regular Expression Arguments

Option arguments that are regular expressions are indicated in the syntax specifications by the presence of re. Specifying these can be problematic for a variety of reasons.

Firstly, to prevent interpretation by the shell, regular expression arguments should be enclosed in single quotes. This applies to most Unix shells.

It is worthwhile reading the regular expression section of the perl manuals. Try:

 man perlre

The -verbosity option can be very useful for viewing the processing of the regular expressions specified. Verbosity is automatically switched on when most higher-level options like -regexps are used.

Since ana uses back-referencing, if any regular expressions specified use parentheses then these must be:

 (?: ... )

The presence of `?:' suppresses back-referencing.
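
The effect of an unguarded pair of parentheses can be seen in the following Python sketch. Python's re module shares the (?: ... ) syntax with Perl. The pattern layout and the helper name build are assumptions made for illustration and are not ana's internal expressions.

```python
import re

SENTENCE = "an_AT1 table_NN1"


def build(tag_re):
    """Splice a user-supplied tag expression between two capturing groups."""
    # Group 1 is a/an; the target word is meant to be group 2.
    return re.compile(r"\b(a|an)_" + tag_re + r"\s+([^\s]+?)_")


capturing = build(r"(AT0|AT1)")        # adds an extra numbered group
non_capturing = build(r"(?:AT0|AT1)")  # leaves the numbering alone

m1 = capturing.search(SENTENCE)
m2 = non_capturing.search(SENTENCE)

if __name__ == "__main__":
    print(capturing.groups, repr(m1.group(2)))
    print(non_capturing.groups, repr(m2.group(2)))
```

With the capturing form, group 2 holds the tag and the target word has moved to group 3; with the non-capturing form, group 2 is still the target word.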

If you have any problems, please get in touch with the author (see below).


OPTIONS


-aan

Option Syntax:

-aan a an

-aan a an aan_regexp

This option specifies which two strings should be used to detect and/or replace a/an.

-aan a an

The regular expression used to search for a/an in text will be: `(a|an)'. If the target word starts with a consonant sound, a is used, otherwise an is used. See Example: -aan (1) below.

-aan a an aan_regexp

The regular expression used to search for a/an in text will be aan_regexp. a will be used for consonant sounding words, an for vowel sounding words. See Example: -aan (2) below.


-capitals

Option Syntax:

-capitals

By default, ana expects the text it is processing to be lower-case. This enables it to detect acronyms accurately since the regular expressions rely on the case changing within the word.

For source texts that are completely in upper case, the acronym rules would lead to many incorrect a/an decisions. This option suppresses the application of the acronym regular expressions. This will not eliminate all incorrect decisions but is the more accurate of the two alternatives.

See Example: -capitals (1), (2) and (3) below.


-delimiters

Option Syntax:

-delimiters sentence_start_re sentence_end_re

In multiline mode, sentences are delimited by two regular expressions. Depending on the named mode (if any), these are set to the following defaults:

        Mode    | Sentence Start | Sentence End
        --------+----------------+-------------
        plain   | ""             | "\."
        tag     | ""             | "._[^_]*"
        sgml    | "<SENTENCE>"   | "</SENTENCE>"

This option specifies explicitly both the start-of-sentence and end-of-sentence regular expressions, over-riding the defaults above.

Note that this option must be used if -regexps is the most recent specification of the 5 regular expressions and multiline mode is in operation.

See Example: -delimiters (1) and (2) below as well as the following examples for -multiline: (1) and (2).
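
How a pair of delimiter expressions can carve a multi-line stream into sentences is sketched below in Python. The function split_sentences is hypothetical, and the approach (a lazy match between start and end, with the dot allowed to cross newlines) is an assumption, not a description of ana's code. Only the plain and sgml defaults from the table above are exercised.

```python
import re


def split_sentences(text, start_re, end_re):
    """Return the spans running from a sentence start to the nearest end."""
    pattern = re.compile(
        "(?:" + start_re + ").*?(?:" + end_re + ")",
        re.DOTALL,  # let the sentence body run across line breaks
    )
    # Text after the last end delimiter is not returned by this sketch.
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    plain = "a egg in an \ncup. An owl."
    print(split_sentences(plain, "", r"\."))

    sgml = "<SENTENCE>a\negg</SENTENCE>\n<SENTENCE>an cup</SENTENCE>"
    print(split_sentences(sgml, "<SENTENCE>", "</SENTENCE>"))
```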


-help

Option Syntax:

-help

Displays the syntax of ana.


-line

Option Syntax:

-line

Indicates that each line should be considered a separate sentence. By default, ana operates in line mode.


-multiline

Option Syntax:

-multiline

The default behaviour of ana in stream mode is to consider each input line to be one sentence. For many applications, this is not the case. The -multiline option relaxes this one-line constraint so that sentences can be allowed to cover several lines.

In order to split the stream into sentences, ana must know how to find the start and end of a sentence. Default regular expressions are assumed depending on the named mode unless over-ridden by the -delimiters option.

See Example: -multiline (1) and (2) below.


-name

Option Syntax:

-name plain

-name tag [ [ tag_sep_re ] tag_re ]

-name sgml [ [ attrib_re ] element_re ]

ana currently recognises three different `named streams': plain, tag and sgml. The arguments that these streams take (if any) are described below.

-name plain

Corrects a/an in plain text.

Regular Expressions:

 left1    \b
 right1   \b
 sep      .+?
 left2    \b
 right2   \b

See Examples: -name (1) and (2) below.

-name tag

Detects a/an in tagged text. The character `_' is assumed to separate a word from its tag and the a/an tag is AT0 (as in the CLAWS5 tagset).

Regular Expressions:

 left1    \b
 right1   _AT0
 sep      .+?
 left2    \b
 right2   _

See Example: -name (3) below.

-name tag tag_re

The supplied argument over-rides the default AT0 regular expression for matching a/an tags.

Regular Expressions:

 left1    \b
 right1   _<tag_re>
 sep      .+?
 left2    \b
 right2   _

See Example: -name (4) below.

-name tag tag_sep_re tag_re

The first argument over-rides the default `_' regular expression separating a word from its tag and the second over-rides the AT0 default a/an tag.

Regular Expressions:

 left1    \b
 right1   <tag_sep_re><tag_re>
 sep      .+?
 left2    \b
 right2   <tag_sep_re>

See Example: -name (5) below.

-name sgml

This uses the default sgml-stream specification. Words are surrounded by:

<WORD> ... </WORD>.

See Example: -name (6) below.

-name sgml element_re

The supplied element over-rides the default WORD. So words are surrounded by:

<element_re> ... </element_re>.

See Example: -name (7) below.

-name sgml attrib_re element_re

Words will be set in one of the following contexts:

<element_re attrib_re=... >

or:

<element_re attrib_re=... > </element_re>.

See Example: -name (8) below.


-regexps

For a discussion of how regular expressions are used in ana, see the section on regular expression matching below.

Option Syntax:

-regexps left1 right1

-regexps left1 right1 sep

-regexps left1 right1 sep left2

-regexps left1 right1 sep left2 right2

Omitted arguments use the following defaults:

        regexp  | default
        --------+--------
        sep     | .+?
        left2   | left1
        right2  | right1

This option provides a way of completely specifying the 5 regular expressions used for identifying the location of a/an and target words. See Examples: -regexps (1) and (2) below.
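
The defaulting of omitted arguments can be sketched in Python as follows. The function names are hypothetical; the defaults are those in the table above, and the a/an and target-word expressions are the documented defaults `(a|an)' and `[^\s]+?'. How ana joins the pieces internally is an assumption.

```python
import re


def expand_regexps(left1, right1, sep=None, left2=None, right2=None):
    """Fill in omitted -regexps arguments with the documented defaults."""
    if sep is None:
        sep = r".+?"
    if left2 is None:
        left2 = left1
    if right2 is None:
        right2 = right1
    return left1, right1, sep, left2, right2


def compile_pattern(left1, right1, sep, left2, right2,
                    aan_re="a|an", word_re=r"[^\s]+?"):
    """Join the five contexts around the a/an and target-word groups."""
    def group(regexp):
        return "(?:" + regexp + ")"
    return re.compile(
        group(left1) + "(?P<aan>" + aan_re + ")" + group(right1)
        + group(sep)
        + group(left2) + "(?P<word>" + word_re + ")" + group(right2)
    )


if __name__ == "__main__":
    pattern = compile_pattern(*expand_regexps("<", ">"))
    match = pattern.search("<an> <sentence> <with> <chevrons>")
    print(match.group("aan"), match.group("word"))
```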


-stream

This is equivalent to -name plain.


-transpose

Using this option swaps the order of a/an and the target word. This means that the five regular expressions (discussed in the REGULAR EXPRESSION MATCHING section below) will now be used as follows:

 left1  the left context for the target word
 right1 the right context for the target word
 sep    the characters separating the two words and their contexts
 left2  the left context for a/an
 right2 the right context for a/an

Note this has no effect if not in stream mode.

See Examples: -transpose (1) and (2).
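
The transposed ordering can be sketched in Python using the arguments of Example: -transpose (1): the target word comes first, and an empty a/an expression marks the position where the decision is written. All function names here are hypothetical, and the vowel-sound test is a toy with one memorised exception, not ana's rule set.

```python
import re


def transposed_pattern(left1, right1, sep, left2, right2,
                       aan_re, word_re=r"[^\s]+?"):
    """Build a pattern in which the target word precedes a/an."""
    def group(regexp):
        return "(?:" + regexp + ")"
    return re.compile(
        group(left1) + "(?P<word>" + word_re + ")" + group(right1)
        + group(sep)
        + group(left2) + "(?P<aan>" + aan_re + ")" + group(right2)
    )


def starts_with_vowel_sound(word):
    """Toy test: vowel letter, plus one memorised exception."""
    w = word.lower()
    return w in {"hour"} or w[:1] in set("aeiou")


def annotate(line, pattern, a, an):
    """Write the a or an string into the (possibly empty) a/an slot."""
    def fix(match):
        whole = match.group(0)
        start = match.start("aan") - match.start()
        end = match.end("aan") - match.start()
        choice = an if starts_with_vowel_sound(match.group("word")) else a
        return whole[:start] + choice + whole[end:]
    return pattern.sub(fix, line)


PATTERN = transposed_pattern(r"orth:\s*\b", r"\b", r".*?",
                             "decision:", r"\s*", aan_re="")

if __name__ == "__main__":
    print(annotate("orth: hour    decision:", PATTERN, " consonant", " vowel"))
```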


-verbosity

Option Syntax:

-verbosity

-verbosity level

Turns verbosity on. The default is level 1. Higher (or lower) verbosity levels can be set by supplying an optional integer argument. At present, verbosity levels range from 0 to 2.

ana has been designed with `cut-and-paste' in mind. The regular expressions used utilise only basic regular expression constructs, thus aiding integration into other applications that may not have regular expression facilities as powerful as those in Perl 5.


-word

Option Syntax:

-word word_regexp

By default, the regular expression used to match target words is `[^\s]+?', i.e. any sequence of one or more non-white-space characters. The question mark after the `+' means that the regular expression will match as little of the string as possible. Since it is normally followed by the right context, the match is stretched until that right context is found, which usually covers the whole of the next word.
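
The behaviour of the lazy quantifier can be checked with a few lines of Python, whose re module treats `+?' in the same way as Perl. The sample strings and variable names are invented for the illustration.

```python
import re

WORD = r"[^\s]+?"   # the documented default target-word expression

# On its own, the lazy quantifier stops after a single character.
alone = re.match(WORD, "apple pie").group(0)

# Followed by a right context, it stretches until that context matches.
with_boundary = re.match("(" + WORD + r")\b", "apple pie").group(1)
with_tag_sep = re.match("(" + WORD + ")_", "apple_NN1 pie_NN1").group(1)

if __name__ == "__main__":
    print(repr(alone), repr(with_boundary), repr(with_tag_sep))
```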


EXAMPLES


Example: -aan (1)

command line

 ana -stream -aan "C" "V"

typed text

 this is C example of V decision

response

 this is V example of C decision


Example: -capitals (1)

command line

 ana -capitals

typed text

 A MPHIL DEGREE

response

 A MPHIL DEGREE

Notice how stream mode is implicitly set through the use of the -capitals option.


Example: -capitals (2)

command line

 ana -stream

typed text

 A MPHIL DEGREE

response

 An MPHIL DEGREE

Since input is assumed to be in lower-case, upper-case words are treated as acronyms.


Example: -capitals (3)

command line

 ana -stream

typed text

 A MPhil degree

response

 An MPhil degree

The correct output given expected input.


Example: -multiline (1)

command line

 ana -multiline

typed text

 a egg in an 
 cup.

response

 an egg in a
 cup.


Example: -multiline (2)

command line

 ana -multiline -name tag

typed text

 an_AT0 woman_NP1
 bought_VBZ an_AT0 unix_JJ 
 box_NP1 
 ._.

response

 a_AT0 woman_NP1
 bought_VBZ a_AT0 unix_JJ
 box_NP1
 ._.


Example: -name (1)

command line

 ana -name plain

typed text

 a example sentence

response

 an example sentence


Example: -name (2)

command line

 ana -name plain

typed text

 an "quoted" example

response

 a "quoted" example


Example: -name (3)

command line

 ana -name tag

typed text

 an_AT0 "_PUNCT default_JJ "_PUNCT tag_NN1

response

 a_AT0 "_PUNCT default_JJ "_PUNCT tag_NN1

By default, AT0 is the a/an tag and `_' is the word/tag separation character (can be a regular expression).


Example: -name (4)

command line

 ana -name tag AT1

typed text

 an_AT1 default_JJ tag_NN1

response

 a_AT1 default_JJ tag_NN1

Over-riding the default AT0 for the a/an tag, AT1 is now used. The default `_' is still the word/tag separation character (can be a regular expression).


Example: -name (5)

command line

 ana -name tag '\s*#\s*' ARTICLE

typed text

 a # ARTICLE exception  #WORD

response

 an # ARTICLE exception  #WORD

Over-riding the default AT0 for the a/an tag, ARTICLE is now used. A `#' optionally surrounded by any amount of white space is used as the word/tag separation regular expression.


Example: -name (6)

command line

 ana -name sgml

typed text

 <WORD>an</WORD> <WORD>table</WORD>

response

 <WORD>a</WORD> <WORD>table</WORD>


Example: -name (7)

command line

 ana -name sgml W

typed text

 <W>an</W> <W>table</W>

response

 <W>a</W> <W>table</W>


Example: -name (8)

command line

 ana -name sgml ORTH W

typed text

 <W ORTH="a"> <W ORTH=elephant></W>

response

 <W ORTH="an"> <W ORTH=elephant></W>


Example: -regexps (1)

command line

 ana -regexps '<' '>'

typed text

 <an> <sentence> <with> <chevrons>

response

 <a> <sentence> <with> <chevrons>

Notice how the arguments are quoted to prevent interpretation by the shell.


Example: -regexps (2)

command line

 ana -regexps '<' '>' '\.'

typed text

 <an> <sentence><with> <an>.<dot>

response

 <an> <sentence><with> <a>.<dot>

The third argument to the option -regexps specifies the separator regular expression. Here the expression consists only of a single dot. This must be preceded by a backslash to prevent its special interpretation within regular expressions.

In the example, the first two words an and sentence are not matched since they are separated by a space, not a dot.


Example: -transpose (1)

command line

 ana -t
     -a ' consonant' ' vowel' ""
     -r "orth:\s*\b" "\b" ".*?" "decision:" "\s*"

typed text

 orth: hour    decision:

response

 orth: hour    decision: vowel

The command-line is shown spread across several lines for clarity. This would in reality all be on the same line.


REGULAR EXPRESSION MATCHING

Internally, the way in which ana determines which word is a/an and which is the `target word' is through the use of several regular expressions.

Each of these words uses two regular expressions. These specify the left-context and the right context. For example, in plain stream mode, a/an is identified by \b(a|an)\b (where \b matches word boundaries in Perl). Here the left context is `\b' as is the right context.

A further regular expression (the fifth) is used to specify how the words (and their contexts) are separated from each other. In most cases, this will be white space.

So in total, 5 regular expressions are needed:

 left1  the left context for a/an
 right1 the right context for a/an
 sep    the characters separating the two words and their contexts
 left2  the left context for the target word
 right2 the right context for the target word

The named stream modes provide an easy way of specifying these without worrying about regular expressions. However, for some text formats which are not covered by the named stream modes, it may be necessary to specify the regular expressions completely. Option -regexps provides a way of doing this. See Regular Expression Arguments above for notes on shell quoting and back-references.


FILES

 /usr/bin/ana           executable + documentation


KNOWN BUGS


Acronyms and Capitalisation

ana uses simple regular expressions to detect acronyms based on the case of the letters in the word. For such words, the a/an decision is based on whether its first letter is pronounced like a vowel or a consonant, e.g. `an FBI agent', `a UFO'. A problem occurs if all the source text is upper-case since ana will mistakenly detect acronyms.
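
The problem can be reproduced with a small Python sketch. The acronym test used here (a word that opens with two capital letters) and the set of letters whose spoken names begin with a vowel sound are assumptions chosen for the illustration; ana's actual expressions are not shown in this document.

```python
import re

# Hypothetical case-based test: a word that opens with two capitals.
ACRONYM = re.compile(r"[A-Z][A-Z]")

# Letters whose spoken names begin with a vowel sound ("eff", "em", ...).
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")


def is_acronym(word):
    """Guess from letter case alone whether word is an acronym."""
    return ACRONYM.match(word) is not None


def acronym_article(word):
    """Choose a/an from the spoken name of the first letter."""
    return "an" if word[:1] in VOWEL_SOUND_LETTERS else "a"


if __name__ == "__main__":
    for w in ["FBI", "UFO", "MPhil", "UMBRELLA", "umbrella"]:
        if is_acronym(w):
            print(w, "-> acronym,", acronym_article(w))
        else:
            print(w, "-> ordinary word")
```

In all-capitals text, UMBRELLA passes the acronym test and is given a (as if spelled out letter by letter), whereas the down-cased umbrella is left to the ordinary rules.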

One work around is to down-case the source text. This will improve the accuracy of correction but no acronyms will be detected. If necessary, the text can then be up-cased after a/an processing.

For example:

 cat myfile | tr A-Z a-z | ana <options> | tr a-z A-Z

A future version of ana may include a (probabilistic) decision on whether any word is `pronounceable' and will therefore solve this problem...


THANKS

The author wishes to thank John Carroll for keeping the code from contracting an even more dangerous strain of featuritis (!) and Sam Simpson for all her support as well as the name ana instead of aan.


AUTHOR

Darren Pearce <Darren.Pearce@sussex.ac.uk>