This page gives some idea of the performance and output possibilities for the RASP system described in
Briscoe, E., J. Carroll and R. Watson (2006) The Second Release of the RASP System. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Sydney, Australia.
The main script for running the system pipes input text through tokenisation, tagging, lemmatization and parsing in turn, each stage producing an intermediate file that forms the input to the next (see the paper for more details). Below we illustrate the system's performance when parsing the abstract of the paper (unmodified) and an extract from Lewis Carroll's poem Jabberwocky. The system documentation gives further detail on the output and representations; however, the references here are to published resources, to help potential users decide whether to download the system.
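The four-stage pipeline can be sketched as follows. This is a toy illustration only -- the stage functions are trivial stand-ins, not the actual RASP components -- but it mirrors the way the main script chains the stages via intermediate files:

```python
import os, tempfile

# Toy stand-ins for the four RASP stages; the real components are
# separate programs that the run script connects together.
def tokenise(text):  # separate (some) punctuation from adjacent words
    return text.replace(",", " ,").replace("!", " !").replace(".", " .")

def tag(text):       # dummy tagger: give every token the same tag
    return " ".join(t + ":NN1" for t in text.split())

def lemmatise(text): # crude lemmatiser: lowercase the word, keep the tag
    return " ".join(w.lower() + ":" + t
                    for w, _, t in (f.partition(":") for f in text.split()))

def parse(text):     # dummy parser: a flat bracketing
    return "(T " + text + ")"

def pipeline(text):
    # Each stage writes an intermediate file that the next stage
    # reads, mirroring the behaviour of the main RASP script.
    data = text
    for stage in (tokenise, tag, lemmatise, parse):
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            f.write(stage(data))
        with open(path) as f:
            data = f.read()
        os.unlink(path)
    return data

print(pipeline("Beware the Jabberwock, my son!"))
```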
Here are the two texts input to the system:
We describe a robust accurate domain-independent approach to statistical parsing incorporated into the new release of the ANLT toolkit, and publicly available as a research tool. The system has been used to parse many well known corpora in order to produce data for lexical acquisition efforts; it has also been used as a component in an open-domain question answering project. The performance of the system is competitive with that of statistical parsers using highly lexicalised parse selection models. However, we plan to extend the system to improve parse coverage, depth and accuracy.
'Twas brillig, and the slithy toves did gyre and gimble in the wabe: all mimsy were the borogoves, and the mome raths outgrabe. Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun the frumious Bandersnatch! He took his vorpal sword in hand: long time the manxome foe he sought -- so rested he by the Tumtum tree, and stood awhile in thought. And, as in uffish thought he stood, the Jabberwock, with eyes of flame, came whiffling through the tulgey wood, and burbled as it came! One, two! One, two! And through and through the vorpal blade went snicker-snack! He left it dead, and with its head he went galumphing back. And, has thou slain the Jabberwock? Come to my arms, my beamish boy! O frabjous day! Callooh! Callay! He chortled in his joy.
Initially the system marks text sentence boundaries and performs some basic tokenisation, such as separating punctuation from adjacent words. Here is the result of tokenisation of the first few lines of Jabberwocky:
^ 'Twas brillig , and the slithy toves did gyre and gimble in the wabe : all mimsy were the borogoves , and the mome raths outgrabe . ^ Beware the Jabberwock , my son ! ^ The jaws that bite , the claws that catch !
Note that ^ indicates a text sentence boundary rather than a grammatical clause boundary -- the last text sentence does not contain a verb; note also that tokenisation is partial: 'Twas is left as a single token and will be dealt with by the part of speech tagger's unknown word mechanism.
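The boundary marking and partial tokenisation shown above can be approximated with a short sketch. This is a simplification for illustration only, not the system's actual tokeniser; note that it, too, leaves 'Twas as a single token:

```python
import re

def mark_and_tokenise(text):
    # Toy approximation: split into text sentences after ., ! or ?,
    # prefix each with the ^ boundary marker, and separate punctuation
    # from adjacent words. Apostrophes stay inside the token, so
    # 'Twas remains a single token, as in the RASP output.
    out = []
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[\w'-]+|[^\w\s]", sent)
        out.append("^ " + " ".join(tokens))
    return "\n".join(out)

print(mark_and_tokenise("Beware the Jabberwock, my son! "
                        "The jaws that bite, the claws that catch!"))
```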
2. Part of Speech Tagging
The system can be run with either forced-choice or threshold-based part of speech (PoS) tagging: in the former, only the most probable tag per word is retained; in the latter, the set of all sufficiently probable tags per word is retained. In the latter case, each line of the tagger output contains an input token followed by a set of tag:probability pairs. The tagset is close to CLAWS C7 (see e.g. Appendix C of Jurafsky, D. and Martin, J., Speech and Language Processing, Prentice-Hall, 2000, for more details), although it is in fact a cut-down version of the CLAWS C2 tagset.
Here is the threshold-based tagger output for the first sentence of the Abstract and of Jabberwocky:
We PPIS2:1 describe VV0:1 a &FO:3.91751e-06 AT1:0.99643[*+] DD:0.00170341 DD1:0.00154953 DD2:0.000125332 ZZ1:0.000187731 robust JJ:1 accurate JJ:1 domain-independent JB:4.27006e-05 NN1:0.999957[*+] approach NN1:0.95978[*+] VV0:0.0402202 to II:0.984823[*+] RG:0.00895171 RL:3.60516e-05 RR:0.000813173 TO:0.00537631 statistical JJ:1 parsing JJ:0.0141884 NN1:0.985168[*+] VVG:0.000643686 incorporated VVD:0.0280124 VVN:0.971988[*+] into II:1 the AT:0.999096[*+] DD:0.00085798 DD1:3.47521e-05 II:5.16195e-06 RR:5.76977e-06 new JJ:1 release NN1:0.999898[*+] VV0:0.000102003 of CC:7.8153e-05 II:0.0216708 IO:0.978184[*+] RG:4.65131e-07 RR:6.66912e-05 the AT:0.999942[*+] DD:4.74281e-05 DD1:3.6536e-06 II:4.92432e-06 RR:1.82499e-06 ANLT NP1:1 toolkit NN1:0.961152[*+] VV0:0.0388478 , ,:1 and CC:0.99969[*+] DD1:2.23782e-05 RA:1.21806e-05 RR:0.000275511 publicly RR:1 available JJ:1 as CC:0.0772009 CS:0.0123809 CSA:0.311123 CST:0.000850357 II:0.593427[*+] RG:0.00300239 RR:0.00152167 VBZ:0.00049398 a &FO:4.78692e-05 AT1:0.988266[*+] DD:0.00219514 DD1:0.00892991 DD2:4.08907e-05 ZZ1:0.000520336 research NN1:1 tool NN1:1 . .:1
'Twas JJ:0.578276[*+] NN1:0.291789 NN2:0.0235856 RR:0.0183698 VV0:0.0376568 VVD:0.0143868 VVG:0.0191048 VVN:0.016831 brillig JJ:0.0346606 NN1:0.779672[*+] NN2:0.153797 RR:0.0154362 VV0:0.00410459 VVD:0.00498216 VVG:0.00109019 VVN:0.00625671 , ,:1 and CC:0.999814[*+] DD1:2.85526e-06 RA:2.76221e-05 RR:0.000155313 the AT:0.998003[*+] DD:0.00168198 DD1:4.90258e-05 II:9.20987e-05 RR:0.00017411 slithy JJ:1 toves NN2:1[*+] VVZ:2.34475e-308 did VDD:1 gyre JJ:0.00192548 NN1:0.0255841 NN2:0.00465777 RR:0.146937 VV0:0.820896[*+] VVD:4.71777e-308 VVG:4.90337e-308 VVN:8.37397e-308 and CC:0.998344[*+] DD1:0.000306631 RA:1.48231e-05 RR:0.00133476 gimble JJ:0.0352231 NN1:0.11773 VV0:0.847046[*+] in BTO:2.22516e-308 CS:0.00117028 II:0.967104[*+] RP:0.0312571 RR:0.000468871 the AT:0.998995[*+] DD:0.00085473 DD1:8.79189e-05 II:2.65759e-05 RR:3.58596e-05 wabe JJ:0.0285352 NN1:0.871951[*+] NN2:0.0724562 RR:0.0113386 VV0:0.00148405 VVD:0.000650002 VVG:0.00391722 VVN:0.00966793 : ::1 all DB:0.886853[*+] RR:0.113147 mimsy JJ:1 were VBDR:1 the AT:0.996412[*+] DD:0.00319549 DD1:6.88545e-07 II:0.000249481 RR:0.000142569 borogoves NN2:0.999969[*+] VVZ:3.06735e-05 , ,:1 and CC:0.999814[*+] DD1:2.84829e-06 RA:2.76318e-05 RR:0.000155233 the AT:0.998658[*+] DD:0.00115523 DD1:8.81901e-05 II:6.13875e-05 RR:3.68615e-05 mome NN1:0.999367[*+] VV0:0.000633201 raths NN2:1 outgrabe JJ:0.00292627 NN1:0.0248626 NN2:0.00489069 RR:0.0170931 VV0:0.00557639 VVD:0.562453[*+] VVG:0.0686917 VVN:0.313506
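The tag:probability format above can be read back mechanically. The following sketch (not part of the distributed system) groups each token with its alternatives and recovers the most probable tag per word; the format details are inferred from the examples above:

```python
def read_tagged(line):
    """Group each input token with its tag:probability alternatives.
    A field counts as a tag:probability pair if the part after its
    last colon parses as a number (after stripping the [*+] marker
    that flags the tag the forced-choice tagger would select)."""
    entries, current = [], None
    for field in line.split():
        tag, sep, prob = field.rpartition(":")
        if prob.endswith("[*+]"):
            prob = prob[:-4]
        try:
            p = float(prob)
            is_pair = bool(sep)
        except ValueError:
            is_pair = False
        if is_pair:
            current[1][tag] = p
        else:                       # a new input token
            current = (field, {})
            entries.append(current)
    return entries

tagged = read_tagged("We PPIS2:1 describe VV0:1 a AT1:0.996[*+] DD:0.0017")
best = [(token, max(tags, key=tags.get)) for token, tags in tagged]
print(best)   # [('We', 'PPIS2'), ('describe', 'VV0'), ('a', 'AT1')]
```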
Next the tagger output is lemmatized, based on the tags assigned to word tokens. See Briscoe and Carroll (2002) for further details and a reference to a detailed paper describing this module. For example, here are fragments of the results of lemmatizing the first sentence of the Abstract and Jabberwocky:
The The_DD:0.0693774 The_AT:0.930623 system system_NN1:1 has have+s_VHZ:0.999999 ha+s_NN1:6.32064e-07 been be+en_VBN:1 used used_JJ:0.00203297 use+ed_VVD:4.58937e-308 use+ed_VVN:0.997965 use+ed_VMK:9.89752e-307 used_NN1:2.21346e-06 ... parsing parsing_NN1:0.985168 parsing_JJ:0.0141884 parse+ing_VVG:0.000643686
'Twas 'Twa+s_VV0:0.0376568 'Twas_RR:0.0183698 'Twa+s_VVN:0.016831 'Twa+s_NN1:0.291789 'Twa+s_NN2:0.0235856 'Twa+s_VVD:0.0143868 'Twa+s_VVG:0.0191048 'Twas_JJ:0.578276 brillig brillig_VV0:0.00410459 brillig_VVN:0.00625671 brillig_NN1:0.779672 brillig_NN2:0.153797 brillig_VVD:0.00498216 brillig_JJ:0.0346606 brillig_VVG:0.00109019 brillig_RR:0.0154362 .... borogoves borogove+s_VVZ:3.06735e-05 borogove+s_NN2:0.999969
(Note that lemmatization can occasionally be misled by unknown words and incorrect tokenisation.)
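Each lemmatizer field packs a lemma, an optional affix, a tag and a probability into one string. Here is a sketch of decoding that format, inferred from the fragments above (not the system's own reader):

```python
def decode(field):
    """Split e.g. 'use+ed_VVN:0.997965' into (lemma, affix, tag,
    probability); a field without '+' has no affix."""
    form_tag, _, prob = field.rpartition(":")
    form, _, tag = form_tag.rpartition("_")
    lemma, _, affix = form.partition("+")
    return lemma, affix or None, tag, float(prob)

print(decode("use+ed_VVN:0.997965"))
print(decode("The_AT:0.930623"))
```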
The probabilistic parser analyses the PoS tag sequence or chart of initial more probable tags and generates a parse forest representation containing all possible subanalyses with associated probabilities. From this representation it is able to construct the n-best syntactic trees or (weighted) grammatical relations.
Trees can be output in a variety of formats, as labelled bracketings with rule names or category names as labels, and with (optionally sequentially numbered) morphologically analysed words with or without the relevant PoS tag as leaves. Here is the top ranked tree for the first sentence of the Abstract with rule names as node labels:
(|T/txt-sc1/-+| (|S/np_vp| |We:1_PPIS2| (|V1/v_np| |describe:2_VV0| (|NP/det_n1| |a:3_AT1| (|N1/n1_pp1| (|N1/ap_n1/-| (|AP/a1| (|A1/a| |robust:4_JJ|)) (|N1/ap_n1/-| (|AP/a1| (|A1/a| |accurate:5_JJ|)) (|N1/n_n1| |domain-independent:6_NN1| (|N1/n| |approach:7_NN1|)))) (|PP/p1| (|P1/p_n1| |to:8_II| (|N1/ap_n1/-| (|AP/a1| (|A1/a| |statistical:9_JJ|)) (|N1/n_ppart| |parsing:10_NN1| (|V1/v_ap| |incorporate+ed:11_VVN| (|AP/a1| (|A1/adv_a1| (|AP/a1| (|A1/a| (|A/pp_adv-coord/+| (|PP/p1| (|P1/p_np| |into:12_II| (|NP/det_n1| |the:13_AT| (|N1/ap_n1/-| (|AP/a1| (|A1/a| |new:14_JJ|)) (|N1/n_pp-of| |release:15_NN1| (|PP/p1| (|P1/p_np| |of:16_IO| (|NP/det_n1| |the:17_AT| (|N1/n-name_n1| |ANLT:18_NP1| (|N1/n| |toolkit:19_NN1|)))))))))) |,:20_,| (|A/cj-end_a/-| |and:21_CC| |publicly:22_RR|)))) (|A1/a_pp-as| |available:23_JJ| (|PP/p1| (|P1/p_np| |as:24_CSA| (|NP/det_n1| |a:25_AT1| (|N1/n_n1| |research:26_NN1| (|N1/n| |tool:27_NN1|))))))))))))))))) (|End-punct3/-| |.:28_.|))
Rule names are mnemonic, the capitalised part indicating the mother category and the part after the slash usually indicating immediate daughters delimited by an underscore. The analytic scheme is based on X-bar theory within a feature-based phrase structure framework. The full grammar and associated manuals supplied with the distributed system give further details.
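As an illustration, this naming convention can be unpacked mechanically. The sketch below (not part of the system) splits a rule name into its mother category and daughter labels, treating any further slash-delimited parts as extra annotations:

```python
def rule_parts(name):
    """Split a rule name such as 'NP/det_n1' into its mother category
    and underscore-delimited daughter labels; trailing '/'-delimited
    parts (e.g. the '-' in 'N1/ap_n1/-') are annotations, ignored
    here."""
    mother, *rest = name.split("/")
    daughters = rest[0].split("_") if rest and rest[0] else []
    return mother, daughters

print(rule_parts("NP/det_n1"))   # ('NP', ['det', 'n1'])
print(rule_parts("N1/ap_n1/-"))  # ('N1', ['ap', 'n1'])
```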
Additionally or alternatively, a set of grammatical relations (GRs) associated with a particular analysis can be output. These consist of a named relation, a head and dependent, and possibly extra parameters depending on the relation involved. The details of this scheme are described here.
Here is the set of GRs corresponding to the tree above:
(|ncsubj| |describe:2_VV0| |We:1_PPIS2| _) (|dobj| |describe:2_VV0| |approach:7_NN1|) (|det| |approach:7_NN1| |a:3_AT1|) (|ncmod| _ |approach:7_NN1| |to:8_II|) (|dobj| |to:8_II| |parsing:10_NN1|) (|ncmod| _ |parsing:10_NN1| |statistical:9_JJ|) (|passive| |incorporate+ed:11_VVN|) (|ncsubj| |incorporate+ed:11_VVN| |parsing:10_NN1| |obj|) (|xmod| _ |parsing:10_NN1| |incorporate+ed:11_VVN|) (|xcomp| _ |incorporate+ed:11_VVN| |available:23_JJ|) (|ncmod| _ |available:23_JJ| |and:21_CC|) (|ncmod| _ |available:23_JJ| |as:24_CSA|) (|dobj| |as:24_CSA| |tool:27_NN1|) (|det| |tool:27_NN1| |a:25_AT1|) (|ncmod| _ |tool:27_NN1| |research:26_NN1|) (|conj| |and:21_CC| |into:12_II|) (|conj| |and:21_CC| |publicly:22_RR|) (|dobj| |into:12_II| |release:15_NN1|) (|det| |release:15_NN1| |the:13_AT|) (|ncmod| _ |release:15_NN1| |new:14_JJ|) (|iobj| |release:15_NN1| |of:16_IO|) (|dobj| |of:16_IO| |toolkit:19_NN1|) (|det| |toolkit:19_NN1| |the:17_AT|) (|ncmod| _ |toolkit:19_NN1| |ANLT:18_NP1|) (|ncmod| _ |approach:7_NN1| |robust:4_JJ|) (|ncmod| _ |approach:7_NN1| |accurate:5_JJ|) (|ncmod| _ |approach:7_NN1| |domain-independent:6_NN1|)
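A GR set like this is straightforward to read into a dependency structure. The sketch below (an illustration, not the system's own reader) collects binary relations into a head-to-dependents map, based on the format shown above:

```python
import re

def read_grs(text):
    """Collect binary GRs as head -> [(relation, dependent)].
    The '_' filler slot is skipped, as are unary GRs such as
    (|passive| ...)."""
    deps = {}
    for body in re.findall(r"\(([^()]*)\)", text):
        parts = [p.strip("|") for p in body.split()]
        relation, slots = parts[0], [p for p in parts[1:] if p != "_"]
        if len(slots) >= 2:
            deps.setdefault(slots[0], []).append((relation, slots[1]))
    return deps

grs = read_grs("(|ncsubj| |describe:2_VV0| |We:1_PPIS2| _) "
               "(|ncmod| _ |approach:7_NN1| |to:8_II|)")
print(grs)
```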
Finally, the system can output weighted GRs yielded by the n-best parses of the input. In this case the set of GRs does not define a complete and consistent directed graph of relations over the input, but may include alternative weighted GRs corresponding to competing subanalyses. The weights are calculated on the basis of the proportion of analyses supporting a specific GR and their probability. Here are the weighted GRs for the same sentence:
1.267666e-6 (|det| |available:23_JJ| |a:3_AT1|) 1.131550e-2 (|cmod| _ |approach:7_NN1| |to:8_II|) 2.596484e-6 (|ccomp| _ |approach:7_NN1| |incorporate+ed:11_VVN|) 7.694205e-2 (|ncmod| _ |incorporate+ed:11_VVN| |as:24_CSA|) 0.979621 (|ncsubj| |incorporate+ed:11_VVN| |parsing:10_NN1| |obj|) 2.438365e-6 (|ncmod| _ |incorporate+ed:11_VVN| |to:8_II|) 2.015810e-2 (|xcomp| _ |describe:2_VV0| |available:23_JJ|) 0.999657 (|det| |approach:7_NN1| |a:3_AT1|) 0.999658 (|ncmod| _ |approach:7_NN1| |accurate:5_JJ|) 1.766963e-3 (|ncmod| _ |incorporate+ed:11_VVN| |into:12_II|) 9.842899e-6 (|ncmod| _ |parsing:10_NN1| |as:24_CSA|) 4.680834e-3 (|ncmod| _ |describe:2_VV0| |to:8_II|) 1.548369e-5 (|ncmod| _ |parsing:10_NN1| |into:12_II|) 4.799288e-3 (|ncsubj| |incorporate+ed:11_VVN| |approach:7_NN1| _) 1.0 (|iobj| |release:15_NN1| |of:16_IO|) 3.684531e-9 (|xcomp| _ |describe:2_VV0| |accurate:5_JJ|) 0.979621 (|passive| |incorporate+ed:11_VVN|) 1.0 (|ncmod| _ |release:15_NN1| |new:14_JJ|) 8.630036e-7 (|ncmod| _ |approach:7_NN1| |as:24_CSA|) 3.420709e-4 (|ncmod| _ |domain-independent:6_NN1| |robust:4_JJ|) 3.093032e-6 (|conj| |and:21_CC| |incorporate+ed:11_VVN|) 1.707050e-7 (|ccomp| _ |statistical:9_JJ| |incorporate+ed:11_VVN|) 0.970213 (|xcomp| _ |incorporate+ed:11_VVN| |available:23_JJ|) 1.267666e-6 (|dobj| |describe:2_VV0| |available:23_JJ|) 2.641884e-3 (|iobj| |describe:2_VV0| |as:24_CSA|) 9.627485e-3 (|ncmod| _ |approach:7_NN1| |available:23_JJ|) 0.979621 (|xmod| _ |parsing:10_NN1| |incorporate+ed:11_VVN|) 3.418469e-6 (|ncmod| _ |incorporate+ed:11_VVN| |and:21_CC|) 3.684531e-9 (|ncsubj| |accurate:5_JJ| |robust:4_JJ| _) 1.0 (|conj| |and:21_CC| |publicly:22_RR|) 1.557665e-2 (|ncsubj| |incorporate+ed:11_VVN| |parsing:10_NN1| _) 1.0 (|dobj| |of:16_IO| |toolkit:19_NN1|) 1.885989e-8 (|dobj| |describe:2_VV0| |robust:4_JJ|) 0.999658 (|ncmod| _ |approach:7_NN1| |domain-independent:6_NN1|) 1.0 (|det| |release:15_NN1| |the:13_AT|) 3.420709e-4 (|dobj| |describe:2_VV0| |domain-independent:6_NN1|)
0.999997 (|ncmod| _ |parsing:10_NN1| |statistical:9_JJ|) 3.100951e-3 (|ccomp| _ |describe:2_VV0| |to:8_II|) 2.799488e-2 (|iobj| |incorporate+ed:11_VVN| |into:12_II|) 1.0 (|ncmod| _ |tool:27_NN1| |research:26_NN1|) 1.0 (|det| |toolkit:19_NN1| |the:17_AT|) 2.015810e-2 (|ncsubj| |available:23_JJ| |approach:7_NN1| _) 1.267666e-6 (|ncmod| _ |available:23_JJ| |approach:7_NN1|) 2.306955e-9 (|ncsubj| |available:23_JJ| |domain-independent:6_NN1| _) 0.999997 (|ncmod| _ |available:23_JJ| |and:21_CC|) 3.420709e-4 (|ncmod| _ |domain-independent:6_NN1| |accurate:5_JJ|) 4.457276e-3 (|ccomp| _ |describe:2_VV0| |incorporate+ed:11_VVN|) 3.420709e-4 (|det| |domain-independent:6_NN1| |a:3_AT1|) 1.0 (|dobj| |into:12_II| |release:15_NN1|) 2.609070e-6 (|dobj| |to:8_II| |statistical:9_JJ|) 9.806497e-4 (|ncmod| _ |describe:2_VV0| |as:24_CSA|) 1.0 (|det| |tool:27_NN1| |a:25_AT1|) 1.0 (|ncsubj| |describe:2_VV0| |We:1_PPIS2| _) 1.0 (|ncmod| _ |toolkit:19_NN1| |ANLT:18_NP1|) 0.970223 (|conj| |and:21_CC| |into:12_II|) 2.977735e-2 (|conj| |and:21_CC| |to:8_II|) 0.984423 (|dobj| |to:8_II| |parsing:10_NN1|) 1.557404e-2 (|ccomp| _ |to:8_II| |incorporate+ed:11_VVN|) 1.885989e-8 (|det| |robust:4_JJ| |a:3_AT1|) 7.131962e-5 (|cmod| _ |describe:2_VV0| |to:8_II|) 0.919425 (|ncmod| _ |available:23_JJ| |as:24_CSA|) 3.420311e-4 (|ccomp| _ |domain-independent:6_NN1| |incorporate+ed:11_VVN|) 0.999658 (|ncmod| _ |approach:7_NN1| |robust:4_JJ|) 3.684531e-9 (|ccomp| _ |accurate:5_JJ| |incorporate+ed:11_VVN|) 1.806546e-7 (|ncsubj| |incorporate+ed:11_VVN| |to:8_II| |inv|) 0.747502 (|ncmod| _ |approach:7_NN1| |to:8_II|) 1.0 (|dobj| |as:24_CSA| |tool:27_NN1|) 0.203550 (|iobj| |describe:2_VV0| |to:8_II|) 0.995199 (|dobj| |describe:2_VV0| |approach:7_NN1|)
All the GRs with weight 1.0 are supported by 100% of the n-best analyses used (in this case 100 analyses). Thus they provide highly reliable though partial syntactic information about the input.
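The weighting scheme described above can be sketched as follows. This is an assumed formula based on the description here -- a GR's weight is taken as the probability mass of the n-best analyses containing it, normalised by the total mass -- and is not necessarily the exact RASP calculation:

```python
def weight_grs(parses):
    """parses: list of (probability, set_of_grs) pairs for the
    n-best analyses. Returns each GR's normalised supporting
    probability mass; a GR in every analysis gets weight 1.0."""
    total = sum(p for p, _ in parses)
    weights = {}
    for p, grs in parses:
        for gr in grs:
            weights[gr] = weights.get(gr, 0.0) + p
    return {gr: w / total for gr, w in weights.items()}

# Two hypothetical analyses that disagree only about the object.
parses = [(0.6, {("ncsubj", "describe", "We"), ("dobj", "describe", "approach")}),
          (0.4, {("ncsubj", "describe", "We"), ("dobj", "describe", "tool")})]
weights = weight_grs(parses)
print(weights)
```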
Here are links to the full output of the system for the Abstract and Jabberwocky:
Jabberwocky illustrates the comparatively robust behaviour of the system in the face of unknown words and of text sentences that do not correspond to grammatical clauses: although three text sentences with end-of-sentence punctuation are parsed as text fragments (`T/frag(x)'), partial analyses and GRs are still produced for all of them.
The Abstract contains no fragmentary analyses -- a testament to the authors' grammatical prose(!). There are, however, some attachment errors in the top-ranked analyses. For example, `publicly' is not attached to the adjectival phrase `available as...' in the top-ranked analysis of the first sentence, shown above.
A more objective and extensive quantitative evaluation of the system's accuracy is given by Briscoe, Carroll and Watson (2006), and in the papers it references.
Ted Briscoe, October 2002
John Carroll, November 2003
Last Modified: Ted Briscoe, September, 2006