Of these new features, the word grammar component is the most significant. The word grammar component uses a unification-based parser based on the PATR-II formalism described in Shieber 1986. Although parsers of this type have typically been used for syntactic analysis, they can also be used for morphological analysis with equal success. Just as a sentence parser produces a tree structure with words as its leaf nodes, a word parser produces a tree structure with morphemes as its leaf nodes. For example, figure 2.1 shows a parse tree of the word unbelievable as produced by PC-KIMMO's word grammar component:
Figure 2.1 A morphological parse tree
          Word
           |
          Stem
     ______|______
  PREFIX        Stem
   un+       ___|____
           Stem    SUFFIX
            |       +able
          ROOT
        be`lieve
Each node of the tree has a feature structure associated with it. The
feature structure for the top node is the most important, since it holds
the features attributable to the entire word. The feature structure
for the top node of the tree in figure 2.1
is shown in figure 2.2. It gives three
features for the word unbelievable. First, the feature
cat has the value Word, which is simply the name of the node.
Second, the feature pos has the value AJ, meaning that the
lexical category (part of speech) of the word is Adjective. And third,
the feature aform has the value POS, meaning that it is the
Positive form of the adjective (as opposed to the Comparative
-er or Superlative -est forms). If PC-KIMMO were being
called from a syntactic parser, it would return the word
unbelievable with its features to the syntactic parser.
[ cat:  Word
  head: [ pos:   AJ
          aform: POS ] ]
The word grammar component uses a file containing a grammar written by the user. A
grammar consists of context-free rules and feature constraints. An example of a
rule with constraints is shown in figure 2.3.
Figure 2.3 A word grammar rule
Word = Stem INFL
       <Stem head pos> = <INFL from_pos>
       <Word head> = <INFL head>

One obvious difference between parsing a sentence and parsing a word is that a sentence is typically already tokenized into words while a word is not tokenized into morphemes. In other words, we put white space between words but not between morphemes. In PC-KIMMO, the Recognizer uses the rules and lexicon to tokenize a word into a sequence of morphemes, which in turn is passed to the word grammar component for parsing. In its overall architecture, version 2 of PC-KIMMO now resembles the morphological parser described in Ritchie and others 1992. That parser also first tokenizes a word into morphemes and then parses the morpheme sequence with a unification-based parser. However, our unification parser differs considerably from theirs in its implementation.
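The rule in figure 2.3 can be read as a check followed by a copy. The sketch below illustrates that reading in Python, using nested dictionaries as feature structures. It is only an illustration, not PC-KIMMO's implementation: it uses a simple equality test on atomic values rather than full unification, the function names are invented for this example, and the features given for the +3SG suffix are assumed.

```python
def get_path(fs, path):
    """Follow a feature path such as ("head", "pos") through nested dicts."""
    for name in path:
        if not isinstance(fs, dict) or name not in fs:
            return None
        fs = fs[name]
    return fs


def apply_word_rule(stem, infl):
    """Apply Word = Stem INFL with the two constraints of figure 2.3.

    Returns the feature structure of the Word node, or None when the
    constraint <Stem head pos> = <INFL from_pos> is not satisfied.
    """
    stem_pos = get_path(stem, ("head", "pos"))
    from_pos = get_path(infl, ("from_pos",))
    if stem_pos is None or stem_pos != from_pos:
        return None
    # <Word head> = <INFL head>: the word takes its head features from INFL.
    return {"cat": "Word", "head": dict(infl["head"])}


fox_stem = {"cat": "Stem", "head": {"number": "SG", "pos": "N", "proper": "-"}}
plural = {"cat": "INFL", "from_pos": "N", "head": {"number": "PL", "pos": "N"}}
# Assumed feature values for the verbal suffix, for illustration only.
third_sg = {"cat": "INFL", "from_pos": "V", "head": {"pos": "V"}}

word = apply_word_rule(fox_stem, plural)
rejected = apply_word_rule(fox_stem, third_sg)
print(word)
print(rejected)
```

With these inputs the plural suffix yields a Word whose head is that of the suffix, while the verbal suffix is rejected because its from_pos does not match the stem's pos.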
There are several reasons why we added a word grammar component to PC-KIMMO.
The word grammar component offers a more powerful model of morphotactics.
PC-KIMMO version 1 used only the continuation class model of morphotactics, which was used in Koskenniemi's original model (1983). In the continuation class model, the morphotactic properties of a morpheme can be stated only in terms of the classes of morphemes that can directly follow it in a word. This meant that it was very difficult, or at least practically infeasible, to enforce certain discontinuous dependencies between morphemes. The word grammar, however, has the entire power of a context-free grammar at its disposal and can model word structure as arbitrarily complex branching trees (both left- and right-branching). The practical result is that with PC-KIMMO version 2 you can eliminate most of the bad parses that were so difficult to prevent with version 1.
The word grammar component can deduce the lexical category (part-of-speech) of a word.
PC-KIMMO version 1 could break a word into its morphemes and gloss each morpheme, but it could not tell you the category of the whole word. For example, given the word computerization, version 1 would return this analysis:
com`pute+er+ize+ation    com`pute+NR19+VR6+NR23

The original word has been broken into four morphemes with glosses, but there is no indication that the whole word is a noun. This deficiency made PC-KIMMO less usable as a front end to a syntactic parser, since a syntactic parser must know the category of each word. In version 2, the feature-passing mechanism can be used to determine the lexical category of a word.
The word grammar component can provide a full feature specification for a word.
Besides lexical category, a word grammar can also determine all features of a word that are relevant to syntactic parsing, such as tense, number, gender, and case.
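To make the limitation of the continuation class model concrete, here is a minimal Python sketch of that kind of check. The class names and the table are hypothetical and are not taken from any PC-KIMMO lexicon; the point is only that each decision looks at a single adjacent pair of morpheme classes.

```python
# Hypothetical continuation classes: each class lists the classes that may
# directly follow it. "End" marks a legal end of word.
CONTINUATIONS = {
    "Begin": {"PREFIX", "ROOT"},
    "PREFIX": {"ROOT"},
    "ROOT": {"SUFFIX", "End"},
    "SUFFIX": {"SUFFIX", "End"},
}


def accepts(classes):
    """Check a sequence of morpheme classes using only adjacent pairs."""
    previous = "Begin"
    for current in list(classes) + ["End"]:
        if current not in CONTINUATIONS.get(previous, set()):
            return False
        previous = current
    return True


print(accepts(["PREFIX", "ROOT", "SUFFIX"]))
print(accepts(["SUFFIX", "ROOT"]))
```

Because the suffix is licensed by the root alone, a restriction that depends on which prefix appeared before the root cannot be stated in this table without multiplying the classes. That is the kind of discontinuous dependency a word grammar handles directly.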
------------------------------------------------------------------------
PC-KIMMO TWO-LEVEL PROCESSOR
Version 2.0 (December 15, 1994), Copyright 1994 SIL
Type ? for help

PC-KIMMO>take englex
PC-KIMMO>load rules english
Loading rules from english.rul
PC-KIMMO>load lexicon english
Loading lexicon from english.lex
PC-KIMMO>
------------------------------------------------------------------------
The rules file and the lexicon file have now been loaded, but not the grammar file. This demonstrates that use of the word grammar component is optional. If you do not use it, then PC-KIMMO will behave just as it did in version 1. Thus you can use version 2 with your existing descriptions without having to write grammar files (however, you must convert your existing lexicon files to the new format required by version 2). To try the Recognizer without the word grammar, just type some recognize commands:
------------------------------------------------------------------------
PC-KIMMO>recognize foxes
`fox+s    `fox+PL
`fox+s    `fox+3SG
PC-KIMMO>
------------------------------------------------------------------------
Two results were returned for the word foxes. The two results are due to two analyses of the suffix +s: one is the plural suffix for nouns, the other the third person singular suffix for verbs. Our knowledge of English tells us that the first result is correct and the second is incorrect. The point to note here is that the lexicon constructed for this example does not have sufficient morphotactic constraints to disallow the incorrect analysis.
There is a new display option that displays the results of the Recognizer in an interlinear format. Type set alignment on and then rec foxes again:
------------------------------------------------------------------------
PC-KIMMO>set alignment on
PC-KIMMO>rec foxes
`fox  +s
`fox  +PL
N     INFL

`fox  +s
`fox  +3SG
N     INFL
------------------------------------------------------------------------
This display vertically aligns each morpheme of the lexical form with its gloss on the second line and its sublexicon name on the third line. Thus we can visually see that each Recognizer result is a sequence of morpheme structures. Not shown in this display, though present internally, are the features associated with each morpheme.
Now load the English word grammar and try recognizing the same word again. Type load grammar english and then rec foxes (features with empty values are not displayed):
------------------------------------------------------------------------
PC-KIMMO>load grammar english
Loading grammar from english.grm
PC-KIMMO>rec foxes
`fox  +s
`fox  +PL
N     INFL

1:
    Word
  ___|____
 Stem   INFL
  |     +s
 ROOT   +PL
 `fox
 `fox

Word:
[ cat:      Word
  clitic:   -
  head:     [ number: PL
              pos:    N ]
  root_pos: N
  root:     `fox ]
------------------------------------------------------------------------
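One important difference is that now only one result is returned, namely the one
that correctly interprets the -s suffix as a plural marker. What has
happened is this: the lexicon still allows both analyses of the suffix, but the
word grammar rejects the one in which the verbal suffix +3SG is attached to
the noun `fox.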
Thus the lexicon and grammar work together to produce the desired results. The lexicon serves to break a word into its morphemes using minimal morphotactic constraints, while the grammar applies a more powerful morphotactic mechanism that filters out any incorrect analyses allowed by the lexicon.
The Recognizer result display consists of three parts: the tokenized lexical form, the parse tree, and the feature structures. The first part is always displayed, while the other two parts are displayed only if a word grammar is in use and certain options are turned on. In the display shown above, the first part of the result display is the same as it was before the word grammar was loaded (assuming that the alignment option is still on).
The second part of the result display is the analysis tree. The nodes of the tree bear the category symbols used in the word grammar rules. The leaf nodes (ROOT and INFL) also display the lexical form and gloss of each morpheme. The tree option determines how the tree is displayed. In the display above, the tree option is set to full by default. If the tree option is set to flat, then it would be displayed as a bracketed string like this:
(Word (Stem (ROOT `fox '`fox'))(INFL +s '+PL'))

Setting the tree option to off will suppress display of the tree entirely.
The third part of the result display consists of feature structures. The features option determines how feature structures are displayed. In the display shown above, only the feature structure for the top node of the tree is shown because the features option is set to top. If it is set to all, then the feature structure for each node of the tree is shown:
Word_1:
[ cat: Word
clitic:-
head: [ number:PL
pos: N ]
root_pos:N
root: `fox ]
Stem_2:
[ cat: Stem
ajr8: -
head: [ number:SG
pos: N
proper:- ]
root_pos:N
root: `fox
reg: + ]
ROOT_3:
[ cat: ROOT
ajr8: -
gloss: `fox
head: [ number:SG
pos: N
proper:- ]
root_pos:N
lex: `fox
reg: + ]
INFL_4:
[ cat: INFL
from_pos:N
gloss: +PL
head: [ number:PL
pos: N ]
lex: +s
reg: + ]
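The flat tree format shown earlier for foxes can be produced by a simple recursive walk over a parse tree. The following Python sketch illustrates the idea. The tuple-based tree representation and the function name are assumptions made for this example and say nothing about how PC-KIMMO stores trees internally.

```python
def flat(node):
    """Render a parse tree as a bracketed string.

    A leaf is (category, lexical_form, gloss); an interior node is
    (category, [children]).
    """
    if isinstance(node[1], list):
        category, children = node
        return "(" + category + " " + "".join(flat(c) for c in children) + ")"
    category, lexform, gloss = node
    return "(" + category + " " + lexform + " '" + gloss + "')"


foxes = ("Word", [("Stem", [("ROOT", "`fox", "`fox")]), ("INFL", "+s", "+PL")])
print(flat(foxes))  # (Word (Stem (ROOT `fox '`fox'))(INFL +s '+PL'))
```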
Setting the features option to off will suppress display
of feature structures entirely.
In the example above using the word
foxes, the lexicon returned two results, one of which was
disallowed by the word grammar. In the next example, the lexicon
returns one result which is expanded into two by the grammar. First,
turn off the grammar component by typing set grammar off. This
causes the Recognizer to behave just as if no grammar were loaded. Then
type rec deer. One result is displayed.
------------------------------------------------------------------------
PC-KIMMO>set grammar off
PC-KIMMO>rec deer
`deer    `deer
------------------------------------------------------------------------
Now type set grammar on and rec deer again.
------------------------------------------------------------------------
PC-KIMMO>set grammar on
PC-KIMMO>rec deer
`deer `deer
1:
Word_4
|
Stem_5
|
ROOT_6
`deer
`deer
Word:
[ cat: Word
clitic:-
head: [ number:SG
pos: N
proper:- ]
root_pos:N
root: `deer ]
2:
Word_1
|
Stem_2
|
ROOT_3
`deer
`deer
Word:
[ cat: Word
clitic:-
head: [ number:PL
pos: N
proper:- ]
root_pos:N
root: `deer ]
------------------------------------------------------------------------
In this display, the single result from the lexicon has been given two
analyses by the word grammar. While the two trees are identical, the
feature structures for the top nodes of the trees differ: for the first
tree, the feature number has the value SG, while for the second
it has the value PL. In other words, the grammar has produced both a
singular and a plural form for deer.
The next example
demonstrates that the prefix un+ has two analyses (or that there are
two homophonous prefixes spelled un+). First, the negative
un+ as in unclear attaches to adjectives and negates
their meaning. Second, the reversive un+ as in untie
attaches to verbs and reverses their action. A word such as
unlockable has two readings due to the ambiguity of the
un+ prefix: either "not lockable" or "can be unlocked." To see
how the word grammar distinguishes these readings, type rec
unlockable:
------------------------------------------------------------------------
PC-KIMMO>rec unlockable
un+`lock+able NEG4+`lock+AJR25a
1:
          Word
           |
          Stem
      _____|_____
   PREFIX      Stem
    un+      ___|____
    NEG4+  Stem    SUFFIX
            |       +able
          ROOT      +AJR25a
         `lock
         `lock
Word:
[ cat: Word
clitic:-
head: [ aform: POS
pos: AJ ]
root_pos:V
root: `lock ]
un+`lock+able REV1+`lock+AJR25a
1:
             Word
              |
             Stem
         _____|______
       Stem       SUFFIX
     ___|____      +able
  PREFIX   Stem    +AJR25a
   un+      |
   REV1+   ROOT
          `lock
          `lock
Word:
[ cat: Word
clitic:-
head: [ aform: POS
pos: AJ ]
root_pos:V
root: `lock ]
------------------------------------------------------------------------
The two trees show how the two readings are produced. In the first tree, the
negative un+ attaches to the adjective lockable to give the
reading "not lockable." In the second tree, the reversive un+ first
attaches to the verb lock to produce unlock, which in turn is
suffixed with +able to give the reading "can be unlocked." Notice,
however, that both trees have the same feature structure for their top nodes;
in other words, unlockable is an adjective in either reading.
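The way attachment restrictions select one bracketing for each prefix can be illustrated with a small recursive parser. The Python sketch below is a simplification built only from the facts stated in the prose (negative un+ attaches to adjectives, reversive un+ to verbs, and +able turns a verb into an adjective). The tables and the function name are assumptions for this example and are not taken from the Englex grammar.

```python
# Assumed, simplified attachment facts: gloss -> (from_pos, to_pos).
PREFIXES = {"NEG4+": ("AJ", "AJ"), "REV1+": ("V", "V")}
SUFFIXES = {"+AJR25a": ("V", "AJ")}
ROOTS = {"`lock": "V"}


def parses(morphemes):
    """Return (pos, bracketing) pairs for every well-formed stem."""
    if len(morphemes) == 1:
        m = morphemes[0]
        return [(ROOTS[m], m)] if m in ROOTS else []
    results = []
    first, last = morphemes[0], morphemes[-1]
    if first in PREFIXES:  # Stem = PREFIX Stem
        from_pos, to_pos = PREFIXES[first]
        for pos, tree in parses(morphemes[1:]):
            if pos == from_pos:
                results.append((to_pos, "[" + first + " " + tree + "]"))
    if last in SUFFIXES:  # Stem = Stem SUFFIX
        from_pos, to_pos = SUFFIXES[last]
        for pos, tree in parses(morphemes[:-1]):
            if pos == from_pos:
                results.append((to_pos, "[" + tree + " " + last + "]"))
    return results


print(parses(["NEG4+", "`lock", "+AJR25a"]))
print(parses(["REV1+", "`lock", "+AJR25a"]))
```

Each morpheme sequence receives exactly one bracketing, right-branching for the negative prefix and left-branching for the reversive prefix, and both come out as adjectives, matching the two trees above.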
To try the Synthesizer function, first load the Englex lexicon as a synthesis lexicon (Macintosh users may first need to increase PC-KIMMO's memory partition):
------------------------------------------------------------------------
PC-KIMMO>load synthesis-lexicon english
Loading synthesis-lexicon from english.lex
------------------------------------------------------------------------
Now use the synthesize command with these morphological forms:
------------------------------------------------------------------------
PC-KIMMO>synthesize REV1+ `lock +AJR25a
unlockable
PC-KIMMO>syn NEG4+ `tie +ING
untying
PC-KIMMO>syn `fox +PL +GEN
foxes'
------------------------------------------------------------------------
To demonstrate that synthesis uses the grammar (if one is loaded), try this ill-formed input form:
------------------------------------------------------------------------
PC-KIMMO>syn `fox +3SG
*** NONE ***
PC-KIMMO>set grammar off
PC-KIMMO>syn `fox +3SG
foxes
PC-KIMMO>rec foxes
`fox    `fox+PL
`fox    `fox+3SG
------------------------------------------------------------------------
When the grammar is used, the form `fox +3SG is rejected, since the grammar prohibits a verbal suffix on a noun. When the grammar is turned off, the surface form foxes is returned, since this is permitted by the lexicon; this is demonstrated by recognizing the form foxes with the grammar off.
COMMENT %
LEXICON NOUN
`boy     Noun   "N(boy)"
`baby    Noun   "N(baby)"
`feet    Noun   "N(foot).PL"

Lexical entries were grouped into sublexicons declared with the keyword LEXICON; in the example above, these entries all belong to the NOUN sublexicon. Each lexical entry was composed of three fields, separated by white space and terminated by a new line. The three fields comprising an entry were the lexical item (or lexical form), the alternation name, and a gloss string. In version 2 of PC-KIMMO, these lexical entries look like this:
\lexform `boy
\sublexicon NOUN
\alternation Noun
\gloss N(boy)

\lexform `baby
\sublexicon NOUN
\alternation Noun
\gloss N(baby)

\lexform `feet
\sublexicon NOUN
\alternation Noun
\features pl irreg
\gloss N(foot).PL

Lexical entries are encoded in "field-oriented standard format." Standard format is an information interchange convention developed by the Summer Institute of Linguistics. It tags the kinds of information in ASCII text files by means of markers which begin with a backslash. Field-oriented standard format (FOSF) is a refinement of standard format geared toward representing data which has a database-like record and field structure. Using FOSF to encode lexical entries has several advantages.
FIELDCODE gloss G

This means that lexical entries can include alternative gloss fields, one of which is chosen for use when the lexicon is loaded. For example, a lexical entry might look like this:

\lexform `boy
\sublexicon NOUN
\alternation Noun
\eng boy
\sp muchacho

The field codes eng and sp mark English and Spanish gloss fields. A user who wants English glosses includes this declaration in the main lexicon file:

FIELDCODE eng G

and a user who wants Spanish glosses includes this declaration:

FIELDCODE sp G

The same strategy can be used with any field used in lexical entries.
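The effect of a FIELDCODE declaration on gloss selection can be sketched in a few lines of Python. This is an illustration of the idea only: the functions and the dictionary that stands in for the FIELDCODE declarations are assumptions, and the parser ignores details of the real lexicon loader, such as field values that themselves contain a backslash.

```python
def parse_record(text):
    """Split one FOSF record into a dict mapping marker name to field value."""
    fields = {}
    for chunk in text.split("\\")[1:]:
        marker, _, value = chunk.strip().partition(" ")
        fields[marker] = value.strip()
    return fields


def gloss_of(record, fieldcodes):
    """Return the field whose code is declared as G (gloss), if any."""
    for code, role in fieldcodes.items():
        if role == "G" and code in record:
            return record[code]
    return None


entry = parse_record(
    r"\lexform `boy \sublexicon NOUN \alternation Noun \eng boy \sp muchacho"
)
print(gloss_of(entry, {"eng": "G"}))  # boy
print(gloss_of(entry, {"sp": "G"}))   # muchacho
```

The same entry yields a different gloss depending only on which field code is mapped to G, so the lexical entries themselves never need to be edited.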
clear
Same as NEW command in version 1.
[file] compare synthesize [filespec]
Reads morphological forms (a sequence of morpheme glosses separated by spaces) from filespec, submits them to the synthesizer, and compares the resulting surface form(s) with the expected results listed in filespec.
file synthesize input-filespec [output-filespec]
Reads a list of morphological forms (a sequence of morpheme glosses separated by spaces) from input-filespec, submits them to the synthesizer, and returns each morphological form followed by the resulting surface form(s).
load grammar [filespec]
Loads a word grammar from filespec.
load synthesis-lexicon [filespec]
Loads a synthesis-lexicon from filespec.
save [filespec]
Writes the current settings to a take file named filespec. If filespec is not specified, the settings are written to a file named PCKIMMO.TAK in the current directory. On start-up, PC-KIMMO automatically tries to load default settings from PCKIMMO.TAK (or PC-KIMMO.TAK).
set alignment {on | off}
Turns alignment display mode on or off.
set ambiguities number
Limits the number of analyses produced by the word grammar to number.
set failures {on | off}
Turns grammar failure mode on or off.
set features {top | all | off}
Sets the feature display mode.
set features {full | flat}
Sets the feature display style.
set gloss {on | off}
Turns gloss display mode on or off.
set grammar {on | off}
Turns the loaded word grammar on or off.
set trim-empty-features {on | off}
Turns trimming of empty features on or off.
set tree {full | flat | indented | off}
Sets the tree display style.
set unification {on | off}
Turns feature unification in the word grammar on or off.
set warnings {on | off}
Turns warning mode on or off.
synthesize [morphological-form]
Produces surface forms from a morphological form (a sequence of morpheme glosses).