Chapter 1
Introduction to PC-KIMMO*
Last modified November 22, 1995
[ Guide contents | Next chapter: 2 Overview ]
1.1 Why do morphological parsing?
1.2 The two-level model of morphology
1.3 The KIMMO parser
1.4 The PC-KIMMO parser
1.5 Unification-based word grammar
1.6 Englex: a two-level description of English morphology
1.7 Conclusion
Figure 1.1 Main components of Karttunen's KIMMO parser
Figure 1.2 Parse tree and feature structure for enlargements
Figure 1.3 Fragment of a word grammar of English
This chapter describes PC-KIMMO, a morphological parser based on Kimmo
Koskenniemi's model of two-level morphology ( Koskenniemi
1983). While PC-KIMMO was adequate to decompose a word into morphemes,
it was not able directly to compute the part of speech of a derivationally
complex word or return a word's inflectional features--precisely the
information required for syntactic parsing. These deficiencies have now
been remedied by adding a unification-based word grammar component to
version 2 of PC-KIMMO which can provide parse trees and feature structures.
A substantial analysis of English for use with PC-KIMMO is also described.
Parsing is a standard technique used in the field of natural language
processing. When you think of parsing, you likely think of syntactic
parsing. But before a syntactic parser can parse a sentence, it must be
supplied with information about each word in the sentence. For instance, to
parse the sentence The cat chased the rat, a parser must know that
cat is a singular noun, chased is a past tense verb, and so
on. In English, such information may be supplied by a lexicon that simply
lists all wordforms with their part of speech and inflectional information
such as number and tense. Because English has a relatively simple
inflectional system, the number of forms that must be listed in such a
lexicon is manageable. Note that countable nouns such as cat have
only have two inflected forms, singular and plural, and regular verbs such
as chase have only four inflected forms: the base form, the -s
form, the -ed form, and the -ing form. But an exhaustive
lexical listing is simply not feasible for many other languages, such as
Finnish, Turkish, and Quechua, which may have hundreds of inflected forms
for each noun or verb. For such languages, one must build a word parser
that will use the morphological system of the language to compute the part
of speech and inflectional categories of any word.
Even for English a morphological parser may be necessary. Although English
has a limited inflectional system, it has very complex and productive
derivational morphology. For example, from the root compute come
derived forms such as computer, computerize, computerization,
recomputerize, noncomputerized, and so on. It is impossible to list
exhaustively in a lexicon all the derived forms (including coined terms or
inventive uses of language) that might occur in natural text.
A major breakthrough in the field of morphological parsing came in 1983
when Kimmo Koskenniemi, a Finnish computer scientist, produced his
dissertation Two-level morphology: A general computational model for
word-form recognition and generation
( Koskenniemi
1983). Koskenniemi's model of two-level morphology was based on the
traditional distinction that linguists make between morphotactics,
which enumerates the inventory of morphemes and specifies in what order
they can occur, and morphophonemics, which accounts for alternate
forms or "spellings" of morphemes according to the phonological context in
which they occur. For example, the word chased is analyzed
morphotactically as the stem chase followed by the suffix -ed.
However, the addition of the suffix -ed apparently causes the
loss of the final e of chase; thus chase and
chas are allomorphs or alternate forms of the same morpheme.
Koskenniemi's model is "two-level" in the sense that a word is represented
as a direct, letter-for-letter correspondence between its lexical or
underlying form and its surface form. For example, the word chased
is given this two-level representation (where + is a morpheme
boundary symbol and 0 is a null character):
Lexical form: c h a s e + e d
Surface form: c h a s 0 0 e d
For more on the phonological properties of the two-level model, see Antworth 1991.
Shortly after Koskenniemi's dissertation appeared, Lauri Karttunen and
others produced a LISP implementation of Koskenniemi's two-level model and
dubbed it KIMMO ( Karttunen 1983).
The main
components of the KIMMO parser are shown in Figure 1.1.
It had two analytical components: the rules component and the lexical
component, or lexicon. First, the rules component consisted of two-level
rules that accounted for regular phonological or orthographic alternations,
such as chase versus chas. Second, the lexicon listed all
morphemes (stems and affixes) in their lexical form and specified
morphotactic constraints. For example, the lexicon would have included
lexical entries for the verb stem chase and the suffix -ed,
and would have specified their relative order. Using these data
components were two processing functions, the Generator and the Recognizer.
The Generator would accept as input a lexical form such as spy+s and
return the surface form spies. The Recognizer would accept as input
a surface form such as spies and return an underlying form divided
into morphemes, namely spy+s, plus a gloss string such as
N+PLURAL.
Figure 1.1 Main components of Karttunen's KIMMO parser
In 1990, the Summer Institute of Linguistics produced PC-KIMMO version 1,
an implementation of the two-level model that closely followed Karttunen's
KIMMO (see Antworth 1990). Written in C,
it ran on
personal computers such as IBM PC compatibles and the Macintosh as well as
UNIX. PC-KIMMO was quite good at what it was designed to do--tokenize a
word into a sequence of tagged morphemes. But it had a serious deficiency:
it could not directly determine the part of speech of a word or its
inflectional categories. For example, given the word enlargements,
PC-KIMMO could tokenize it into the sequence of morphemes
en+large+ment+s and gloss each morpheme, but it could not
determine that the entire word was a plural noun. This meant that PC-KIMMO
was not adequate to act as a morphological front end to a syntactic
parser--its most desirable application.
In 1993, version 2 of PC-KIMMO was developed specifically to correct this
deficiency. It does so by adding a third analytical component, a word
grammar. The word grammar is a unification-based chart parser (based on the
PATR-II formalism described in Shieber 1986) that
provides parse trees and feature structures. The chart parser was
originally designed for syntactic parsing. Just as a sentence parser
produces a parse tree with words as its leaf nodes, a word parser produces
a parse tree with morphemes as its leaf nodes. When you parse a sentence,
it is normally already tokenized into words (since we put white space
between words); but when you parse a word, you must first tokenize it into
morphemes. This tokenizing is done by the rules and lexicon. When a surface
word is submitted to PC-KIMMO's Recognizer, the rules and lexicon analyze
the word into a sequence of morpheme structures (or possibly more than one
sequence if more than one analysis is found). A morpheme structure consists
of a lexical form, its gloss, its category, and its features. For example,
the word enlargements is tokenized into this sequence of morpheme
structures:
Form: en+ `large +ment +s
Gloss: VR1+ `large +NR25 +PL
Cat: PREFIX AJ SUFFIX INFL
Feat: [from_pos:AJ [head: [pos:AJ]] [from_pos:V [from_pos:N
head: [pos:V]] head: [pos:N]] head: [number:PL
pos: N ]]
This analysis is then passed to the word grammar, which returns the parse
tree and feature structure shown in Figure 1.2.
Figure 1.2 Parse tree and feature structure for enlargements
Word
______|_______
Stem INFL
_____|______ +s
Stem SUFFIX +PL
___|____ +ment
PREFIX Stem +NR25
en+ |
VR1+ ROOT
`large
`large
Word:
[ head: [ pos: N
number:PL ]]
While each node of the tree has a feature structure associated with it, the
feature structure for the top node is the most important, since these are
the features attributable to the entire word. The feature structure for the
word enlargements specifies two features. First, the feature
pos has the value N, meaning that the part-of-speech
(lexical category) of the word is Noun. Second, the feature number has
the value PL for plural. If PC-KIMMO were being called from a
syntactic parser, then it would return to the syntactic parser the word
enlargements with these features. In its overall architecture,
version 2 of PC-KIMMO now resembles the morphological parser described in
Ritchie and others 1992. That parser also first
tokenizes a word into morphemes and then parses the morpheme sequence with
a unification-based parser. However, our unification parser differs
considerably from theirs in its implementation.
The word grammar component uses a grammar file written by the user. A
grammar consists of context-free rules and feature constraints. The
format of the grammar closely follows Shieber's PATR-II formalism ( Shieber 1986). Figure 1.3 shows a fragment of a word grammar of English.
Figure 1.3 Fragment of a word grammar of English
;FEATURE ABBREVIATIONS:
Let pl be <head number> = PL
LET v/n be <from_pos> = V
<head pos> = N
<head number> = !SG
LET v\aj be <from_pos> = AJ
<head pos> = V
;CATEGORY TEMPLATES:
Let N be <cat> = ROOT
<head pos> = N
<head number> = !SG
Let V be <cat> = ROOT
<head pos> = V
Let AJ be <cat> = ROOT
<head pos> = AJ
;Rule 1
Word = Stem INFL
<Stem head pos> = <INFL from_pos>
<Word head> = <INFL head>
;Rule 2
Stem_1 = PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<Stem_1 head> = <PREFIX head>
;Rule 3
Stem_1 = Stem_2 SUFFIX
<Stem_2 head pos> = <SUFFIX from_pos>
<Stem_1 head> = <SUFFIX head>
;Rule 4
Stem = ROOT
<Stem head> = <ROOT head>
The first section of the
grammar file contains feature abbreviations. Feature abbreviations can
be used either in lexical entries or in grammar rules and are expanded
by "LET" statements. For example, the feature abbreviation pl is
expanded into the feature structure [head: [number: PL]].
The second section of the grammar file contains category templates. These
are feature specifications that are attached to lexical categories such as
Noun and Adjective. This greatly reduces the amount of information that
must be stored in the lexicon. For example, the statement Let N be
<head number> = SG means that all nouns are assigned singular number.
The grammar in Figure 1.3 actually contains the
statement Let N be <head number> = !SG. The exclamation point in
!SG means that this is a default value which can be overridden.
For example, the lexical entry for fox does not need to specify that
it is singular; that information is supplied by the category definition of
Noun. However, the lexical entry for mice (an irregular plural)
explicitly sets the feature number feature to PL (plural), thus
overriding the default value.
The third section of the grammar file contains the word grammar rules.
Associated with each rule are feature constraints. A feature constraint
consists of two feature structures which must unify with each other.
Feature constraints have two functions: they constrain the operation of a
rule and they pass features from one node to another up the parse tree. For
example, in rule 1 of Figure 1.3 the feature constraint
<Stem head pos> = <INFL from_pos> requires that the
pos feature of the Stem node must have the same value as the
from_pos feature of the INFL node in order for the rule to succeed,
while the feature constraint <Word head> = <INFL head>
passes the values of the head features (including pos)
from the INFL node to the
head features of the Word node.
The word enlargements is an especially good example of the power of
the word grammar because of its complex derivational structure. Its root is
the adjective large; the prefix en- forms the verb
enlarge; the suffix -ment forms the noun enlargement;
and finally the inflectional suffix s marks it plural. To
accomplish this, each root or stem has a lexical category such as Noun or
Adjective and each affix has a from_pos feature and a pos
feature. The from_pos feature specifies the lexical category of the
stems to which it can attach. The pos feature specifies the
lexical category of the resulting stem. For instance, the prefix en-
has a from_pos of Adjective and a pos of Verb, since it
attaches to an adjective such as large and produces a verb,
enlarge. Rule 2 in Figure 1.3 says that a Stem is
composed of a PREFIX plus a Stem, for instance enlarge = en+large.
The first feature constraint, <PREFIX from_pos> = <Stem_2
head pos>, requires that the from_pos feature of the PREFIX
node must have the same value as the pos feature of the Stem_2
node. Since the from_pos feature of the prefix en- and the
pos feature of the stem large are both AJ, the
rule succeeds. The second feature constraint, <Stem_1 head> =
<PREFIX head>, passes the value of the <head pos> feature
from the PREFIX node to the <head pos> feature of the Stem node; that
is, since the <head pos> of the prefix en- is V, the
<head pos> of the resulting stem enlarge is also V.
Rule 1 in Figure 1.3 accounts for the plural suffix
-s. The rule simply says that a word is composed of a stem plus an
inflectional element. The two feature constraints use the <head pos>,
<from_pos>, and <pos> features to ensure that the
plural suffix attaches only to a noun stem
and produces a noun word (that is, it does not change the part of speech of
the stem as do the derivational affixes).
PC-KIMMO version 2 can be used with Englex, a two-level description of
English morphology. Englex consists of a set of orthographic rules, a
20,000-entry lexicon of roots and affixes, and a word grammar. With Englex
and PC-KIMMO, you can morphologically parse English words and text.
Practical applications include morphologically preprocessing text for a
syntactic parser and producing morphologically tagged text (see Antworth 1993).
Englex can also be used to explore English morphological structure for purposes of linguistic analysis.
In terms of its coverage of English, Englex has these goals:
- To account for all major spelling rules of English.
- To account for all productive morphological structure (affixes,
morphotactic constraints, word class conversion, and so on). While a
20,000-entry lexicon sounds small, Englex can actually recognize many times
that number of words because it analyzes productive derivational morphology.
For example, the lexicon contains entries for re-, compute, -er, -ize,
and -ation and can thus recognize any complex word formed from those
parts.
- To establish a critical mass of lexical entries that would handle a large
percentage of non-technical, non-specialized English text.
- To provide an interface to syntactic parsing. For each input word, Englex
should return its lexical category (part-of-speech) and all syntactically
relevant inflectional categories (such as number and tense).
Englex is described in detail in chapter 3 of this guide.
Koskenniemi's original model of a two-level morphological parser,
instantiated as PC-KIMMO version 1, made it possible to decompose complex
words into their constituent morphemes, but it lacked the ability directly
to compute the part of speech of a complex word or to return inflectional
features in a form that could easily be used by a syntactic parser. By
feeding the output of the original parser into a unification-based word
grammar, version 2 of PC-KIMMO can now provide full parse trees and feature
structures for handling part of speech and inflectional categories. A
substantial description of English including rules, lexicon, and word
grammar is also available for use with PC-KIMMO version 2.
*This chapter is a revision of a paper presented at:
North Texas Natural Language Processing Workshop
May 23, 1994
University of Texas at Arlington
[ Guide contents | Next chapter: 2 Overview ]