4.7 File formats
Figure 4.1 Structure of the rules file
Figure 4.2 A sample rules file
Figure 4.3 Structure of the main lexicon file
Figure 4.4 A sample main lexicon file
Figure 4.5 Structure of a lexical entry
Figure 4.6 A sample lexical entry
Figure 4.7 Structure of the grammar file
Figure 4.8A A lexical rule example
Figure 4.8B Feature structure before application of lexical rule
Figure 4.8C Feature structure after application of lexical rule
Figure 4.9 A sample grammar file
Figure 4.10 A sample generation comparison file
Figure 4.11 A sample recognition comparison file
Figure 4.12 A sample pairs comparison file
Figure 4.12A A sample synthesis comparison file
Figure 4.13 A sample generation file
Figure 4.14 A sample recognition file
Figure 4.14A A sample synthesis file
Figure 4.15 Default file names and extensions
This section describes the formats for the files that are used as input
to PC-KIMMO. In any of the files, comments can be added to any line by
preceding the comment with the comment character. This character is
normally a semicolon (;), but can be changed with the COMMENT keyword
in the rules file. Anything following a comment character (until the
end of the line) is considered part of the comment and is ignored by
PC-KIMMO.
In the descriptions below, reference to the use of a space character
implies any whitespace character (that is, any character treated like a
space character). The following control characters when used in a file
are whitespace characters: ^I (ASCII 9, tab), ^J (ASCII 10, line feed),
^K (ASCII 11, vertical tab), ^L (ASCII 12, form feed), and ^M (ASCII
13, carriage return).
The control character ^Z (ASCII 26) cannot be used because MS-DOS
interprets it as marking the end of a file. Also the control character
^@ (ASCII 0, null) cannot be used.
Examples of each of the following file types are found on the release
diskette as part of the English description.
The general structure of the rules file is a list of keyword
declarations. Figure 4.1 shows the conventional
structure of the rules file. Note that the notation {x |
y} means either x or y (but not both). The
following specifications apply to the rules file.
Figure 4.1 Structure of the rules file
COMMENT <character>
ALPHABET <symbol list>
NULL <character>
ANY <character>
BOUNDARY <character>
SUBSET <subset name> <symbol list>
. (more subsets)
.
.
RULE <rule name> <number of states> <number of columns>
<lexical symbol list>
<surface symbol list>
<state number>{: | .} <state number list>
. (more states)
.
.
. (more rules)
.
.
END
- Extra spaces, blank lines, and comment lines are ignored.
- Comments may be placed anywhere in the file. All data following a
comment character to the end of the line is ignored. (See below on the
COMMENT declaration.)
- The set of valid keywords used to form declarations includes
COMMENT, ALPHABET, NULL, ANY, BOUNDARY, SUBSET, RULE, and END.
- These declarations are obligatory and can occur only once in a
file: ALPHABET, NULL, ANY, BOUNDARY.
- These declarations are optional and can occur one or more times in
a file: COMMENT, SUBSET, and RULE.
- The COMMENT declaration sets the comment character used in the
rules file, lexicon files, and grammar file. The COMMENT declaration
can only be used in the rules file, not in the lexicon or grammar file.
The COMMENT declaration is optional. If it is not used, the comment
character is set to ; (semicolon) as a default.
- The COMMENT declaration can be used anywhere in the rules file and
can be used more than once. That is, different parts of the rules file
can use different comment characters. The COMMENT declaration can (and
in practice usually does) occur as the first keyword in the rules file,
followed by either one or more COMMENT declarations or the ALPHABET
declaration.
- Note that if you use the COMMENT declaration to declare the
character that is already in use as the comment character, an error
will result. For instance, if semicolon is the current comment
character, the declaration COMMENT ; will result in an error.
- The comment character can no longer be set using a command line
option or with a command in the user interface, as was the case in
version 1 of PC-KIMMO.
- The ALPHABET declaration must either occur first in the file or be
preceded only by COMMENT declarations. The other declarations
can appear in any order. The COMMENT, NULL, ANY, BOUNDARY, and SUBSET
declarations can even be interspersed among the rules. However, these
declarations must appear before any rule that uses them or an error
will result.
- The ALPHABET declaration defines the set of symbols used in
either lexical or surface representations. The keyword ALPHABET is
followed by a <symbol list> of all alphabetic
symbols. Each symbol must be separated from the others by at
least one space. The list can span multiple lines, but ends with the
next valid keyword. All alphanumeric characters (such as a,
B, and 2), symbols (such as $ and +), and
punctuation characters (such as . and ?) are available as
alphabet members. The characters in the IBM extended character set
(above ASCII 127) are also available. Control characters (below ASCII
32) can also be used, with the exception of whitespace characters (see
above), ^Z (end of file), and ^@ (null). The alphabet can contain a
maximum of 255 symbols. An alphabetic symbol can also be a multigraph,
that is, a sequence of two or more characters. The individual
characters composing a multigraph do not necessarily have to also be
declared as alphabetic characters. For example, an alphabet could
include the characters s and z and the multigraph sz%,
but not include % as an alphabetic character. Note that a multigraph
cannot also be interpreted as a sequence of the individual characters that
comprise it. (A fragment illustrating a multigraph declaration follows
figure 4.2 below.)
- The keyword NULL is followed by a single <character>
that represents a null (empty, zero) element. The NULL symbol is
considered to be an alphabetic character, but cannot also be listed in
the ALPHABET declaration. The NULL symbol declared in the rules file is
also used in the lexicon file to represent a null lexical entry.
- The keyword ANY is followed by a single "wildcard"
<character> that represents a match of any character in
the alphabet. The ANY symbol is not considered to be an alphabetic
character, though it is used in the column headers of state tables. It
cannot be listed in the ALPHABET declaration. It is not used in the
lexicon file.
- The keyword BOUNDARY is followed by a single <character>
that represents an initial or final word boundary. The
BOUNDARY symbol is considered to be an alphabetic character, but cannot
also be listed in the ALPHABET declaration. When used in the column
header of a state table, it can only appear as the pair #:#
(where, for instance, # has been declared as the BOUNDARY
symbol). The BOUNDARY symbol is also used in the lexicon file in the
continuation class field of a lexical entry to indicate the end of a
word (that is, no continuation class).
- The SUBSET declaration defines a set of characters that are referred
to in the column headers of rules. The keyword SUBSET is followed by
the <subset name> and <symbol list>.
<subset name> is a single word (one or more characters)
that names the list of characters that follows it. The subset name must
be unique (that is, if it is a single character it cannot also be in
the alphabet or be any other declared symbol). It can be composed of
any characters (except space); that is, it is not limited to the
characters declared in the ALPHABET section. It must not be identical
to any keyword used in the rules file. The subset name is used in rules
to represent all members of the subset of the alphabet that it defines.
Note that SUBSET declarations can be interspersed among the rules. This
allows subsets to be placed near the rule that uses them if such a
style is desired. However, a subset must be declared before a rule that
uses it.
- The <symbol list> following a <subset
name> is a list of single symbols, each of which is separated
by at least one space. The list can span multiple lines. Each symbol
in the list must be a member of the previously defined ALPHABET, with
the exception of the NULL symbol, which can appear in a subset list but
is not included in the ALPHABET declaration. Neither the ANY symbol nor
the BOUNDARY symbol can appear in a subset symbol list.
- The keyword RULE signals that a state table immediately follows.
- <rule name> is the name or description of the rule
which the state table encodes. It functions as an annotation to the
state table and has no effect on the computational operation of the
table. It is displayed by the list rules and show rule
commands and is also displayed in traces. The rule name must be
surrounded by a pair of identical delimiter characters. Any material
can be used between the delimiters of the rule name with the exception
of the current comment character and of course the rule name delimiter
character of the rule itself. Each rule in the file can use a different
pair of delimiters. The rule name must be all on one line, but it does
not have to be on the same line as the RULE keyword.
- <number of states> is the number of states (rows in
the table) that will be defined for this table. The states must begin
at 1 and go in sequence through the number defined here (that is, gaps
in state numbers are not allowed).
- <number of columns> is the number of state
transitions (columns in the table) that will be defined for each state.
- <lexical symbol list> is a list of elements
separated by one or more spaces. Each element represents the lexical
half of a lexical:surface correspondence which, when matched, defines a
state transition. Each element in the list must be either a member of
the alphabet, a subset name, the NULL symbol, the ANY symbol, or the
BOUNDARY symbol (in which case the corresponding surface character must
also be the BOUNDARY symbol). The list can span multiple lines, but the
number of elements in the list must be equal to the number of columns
defined for the rule.
- <surface symbol list> is a list of elements
separated by one or more spaces. Each element represents the surface
half of a lexical:surface correspondence which, when matched, defines a
state transition. Each element in the list must be either a member of
the alphabet, a subset name, the NULL symbol, the ANY symbol, or the
BOUNDARY symbol (in which case the corresponding lexical character must
also be the BOUNDARY symbol). The list can span multiple lines, but the
number of elements in the list must be equal to the number of columns
defined for the rule.
- <state number> is the number of the state or row of
the table. The first state number must be 1, and subsequent state
numbers must follow in numerical sequence without any gaps.
- {: | .} is the final or nonfinal state indicator. This should be a
colon (:) if the state is a final state and a period (.) if it is a
nonfinal state. It must follow the <state number> with no
intervening space.
- <state number list> is a list of state transition
numbers for a particular state. Each number must be between 1 and the
number of states (inclusive) declared for the table. The list can span
multiple lines, but the number of elements in the list must be equal to
the number of columns declared for this rule.
- The keyword END follows all other declarations and indicates the
end of the rules file. Any material in the file thereafter is ignored
by PC-KIMMO. The END keyword is optional; the physical end of the file
also terminates the rules file.
Figure 4.2 shows a sample rules file.
Figure 4.2 A sample rules file
ALPHABET
b c d f g h j k l m n p q r s t v w x y z + ; + is morpheme boundary
a e i o u
NULL 0
ANY @
BOUNDARY #
SUBSET C b c d f g h j k l m n p q r s t v w x y z
SUBSET V a e i o u
; more subsets
RULE "Consonant defaults" 1 23
b c d f g h j k l m n p q r s t v w x y z + @
b c d f g h j k l m n p q r s t v w x y z 0 @
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
RULE "Vowel defaults" 1 6
a e i o u @
a e i o u @
1: 1 1 1 1 1 1
RULE "Voicing s:z <=> V___V" 4 4
V s s @
V z @ @
1: 2 0 1 1
2: 2 4 3 1
3: 0 0 1 1
4. 2 0 0 0
; more rules
END
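The sample file in figure 4.2 does not illustrate a multigraph. The
following fragment is a hypothetical sketch (the symbols and the rule are
invented for illustration, not taken from the English description) showing
a multigraph declared as a single alphabetic symbol and then used in the
column header of a state table:
ALPHABET
a e i o u s z sz%       ; sz% is a multigraph, a single alphabetic symbol
NULL 0
ANY @
BOUNDARY #
RULE "sz% is realized as s" 1 2
sz% @
s   @
1:  1 1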
A lexicon consists of one main lexicon file plus one or more files of
lexical entries. The general structure of the main lexicon file is a
list of keyword declarations. The set of valid keywords is ALTERNATION,
FEATURES, FIELDCODE, INCLUDE, and END. Figure 4.3
shows the conventional structure of the lexicon file. The following
specifications apply to the main lexicon file.
Figure 4.3 Structure of the main lexicon file
ALTERNATION <alternation name> <sublexicon name list>
. (more ALTERNATIONs)
.
.
FEATURES <feature abbreviation list>
FIELDCODE <lexical item code> U
FIELDCODE <sublexicon code> L
FIELDCODE <alternation code> A
FIELDCODE <features code> F
FIELDCODE <gloss code> G
INCLUDE <filespec>
. (more INCLUDEd files)
.
.
END
- Extra spaces, blank lines, and comment lines are ignored.
- The comment character declared in the rules file is operative in
the main lexicon file. Comments may be placed anywhere in the file. All
data following a comment character to the end of the line is ignored.
- The set of valid keywords used to form declarations includes
ALTERNATION, FEATURES, FIELDCODE, INCLUDE, and END.
- The declarations can appear in any order with the proviso that any
alternation name, feature name, or fieldcode used in a lexical entry
must be declared before the lexical entry is read. In practice, this
means that the INCLUDE declarations should appear last, but the
ALTERNATION, FEATURES, and FIELDCODE declarations can appear in any
order.
- The ALTERNATION declaration defines a set of sublexicon names that
serve as the continuation class of a lexical item. The ALTERNATION
keyword is followed by an <alternation name> and a
<sublexicon name list>. ALTERNATION declarations are
optional (but nearly always used in practice) and can occur as many
times as needed.
- <alternation name> is a name associated with the
following <sublexicon name list>. It is a word composed of
one or more characters, not limited to the ALPHABET characters declared
in the rules file. An alternation name can be any word other than a
keyword used in the lexicon file. The program does not check to see if
an alternation name is actually used in the lexicon file.
- <sublexicon name list> is a list of sublexicon names.
It can span multiple lines until the next valid keyword is encountered.
Each sublexicon name in the list must be used in the sublexicon field
of a lexical entry. Although it is not enforced at the time the lexicon
file is loaded, an undeclared sublexicon named in a sublexicon name
list will cause an error when the recognizer tries to use it.
- The FEATURES keyword is followed by a <feature abbreviation
list>. A <feature abbreviation list> is a list of
words, each of which is expanded into feature structures by the word
grammar.
- The FIELDCODE declaration is used to define what fieldcode will be
used to mark each type of field in a lexical entry. The FIELDCODE
keyword is followed by a <code> and one of five possible
internal codes: U, L, A, F, or G. There must be five FIELDCODE
declarations, one for each of these internal codes, where U indicates
the lexical item field, L indicates the sublexicon field, A indicates
the alternation field, F indicates the features field, and G indicates
the gloss field.
- The INCLUDE keyword is followed by a <filespec> that
names a file containing lexical entries to be loaded. An INCLUDEd file
cannot contain any declarations (such as a FIELDCODE or an INCLUDE
declaration), only lexical entries and comment lines.
- The keyword END follows all other declarations and indicates the
end of the main lexicon file. Any material in the file thereafter is
ignored by PC-KIMMO. The END keyword is optional; the physical end of
the file also terminates the main lexicon file.
Figure 4.4 shows a sample main lexicon file.
Figure 4.4 A sample main lexicon file
ALTERNATION Begin PREF
ALTERNATION Pref N AJ V AV
ALTERNATION Stem SUFFIX
FEATURES sg pl reg irreg
FIELDCODE lf U ;lexical item
FIELDCODE lx L ;sublexicon
FIELDCODE alt A ;alternation
FIELDCODE fea F ;features
FIELDCODE gl G ;gloss
INCLUDE affix.lex ;file of affixes
INCLUDE noun.lex ;file of nouns
INCLUDE verb.lex ;file of verbs
INCLUDE adjectiv.lex ;file of adjectives
INCLUDE adverb.lex ;file of adverbs
END
Figure 4.5 shows the structure of a lexical entry.
Lexical entries are encoded in "field-oriented standard format."
Standard format is an information interchange convention developed by
the Summer Institute of Linguistics. It tags the kinds of information
in ASCII text files by means of markers which begin with backslash.
Field-oriented standard format (FOSF) is a refinement of standard
format geared toward representing data which has a database-like record
and field structure. The following points provide an informal
description of the syntax of FOSF files.
Figure 4.5 Structure of a lexical entry
\<lexical item code> <lexical item>
\<sublexicon code> <sublexicon name>
\<alternation code> {<alternation name> | <BOUNDARY symbol>}
\<features code> <features list>
\<gloss code> <gloss string>
- A field-oriented standard format (FOSF) file consists of a
sequence of records.
- A record consists of a sequence of fields.
- A field consists of a field marker and a field value.
- A field marker consists of a backslash character at the
beginning of a line, followed by an alphabetic or numeric character,
followed by zero or more printable characters, and terminated by a
space, tab, or the end of a line. A field marker without its initial
backslash character is termed a field code.
- A field marker must begin in the first position of a line.
Backslash characters occurring elsewhere in the file are not
interpreted as field markers.
- The first field marker of the record is considered the record
marker, and thus the same field must occur first in every record of the
file.
- Each field marker is separated from the field value by one
or more spaces, tabs, or newlines. The field value continues up to the
next field marker.
- Any line that is empty or contains only whitespace characters is
considered a comment line and is ignored. Comment lines may occur
between or within fields.
- Fields and lines in an FOSF file can be arbitrarily long.
- There are two basic types of fields in FOSF files:
nonrepeating and repeating. Repeating fields are multiple
consecutive occurrences of fields marked by the same marker. Individual
fields within a repeating field can be called subfields.
The following specifications apply to how FOSF is implemented
in PC-KIMMO.
- Lexical entries are encoded as records in a FOSF file.
- Only those fields whose field codes are declared in the main
lexicon file are recognized (see above on the FIELDCODE declaration).
All other fields are considered to be extraneous and are ignored.
- The first field of each lexical entry must be the lexical item
field. The lexical item field code is assigned to the internal code U
by a FIELDCODE declaration in the main lexicon file.
- Only nonrepeating fields are permitted.
- The comment character declared in the rules file is operative in
included files of lexical entries. All data following a comment
character to the end of the line is ignored.
A file of lexical entries is loaded by using an INCLUDE declaration in
the main lexicon file (see above). An INCLUDEd file of lexical entries cannot
contain any declarations (such as a FIELDCODE or an INCLUDE declaration), only
lexical entries and comment lines.
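For illustration, an INCLUDEd file of lexical entries such as noun.lex
might contain records like the following. This is a hypothetical sketch;
it assumes the field codes, alternation names, and feature abbreviations
declared in the sample main lexicon file of figure 4.4, and the entries
themselves are invented:
;noun.lex -- hypothetical entries
\lf `cat
\lx N
\alt Stem
\fea reg
\gl N(`cat)

\lf `knives
\lx N
\alt Stem
\fea pl irreg
\gl N(`knife)+PL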
The following specifications apply to lexical entries.
- A lexical entry is composed of five fields: lexical item,
sublexicon, alternation, features, and gloss. The lexical item,
sublexicon, and alternation fields are obligatory; the features and
gloss fields are optional. The first field of the entry must always be
the lexical item. The other fields can appear in any order, even
differing from one entry to another.
- Although the gloss field is optional, if a lexical entry does not
include one, a warning message to that effect will be displayed when
the entry is loaded. To suppress this warning message, use the command
set warnings off (see section 4.5.6.1) before loading the
lexicon.
- If an entry has an empty gloss field (that is, the field marker
for the gloss field is present but there is no data after it), then the
contents of the lexical form field will also be used as the gloss
for that entry.
- A lexical item field consists of a <lexical item
code> and a <lexical item>.
- A <lexical item code> is a field code assigned to the
internal code U by a FIELDCODE declaration in the main lexicon file.
- A <lexical item> is one or more characters that
represent an element (typically a morpheme or word) of the lexicon.
Each character (or multigraph) must be in the alphabet defined for the language. The
lexical item uses only the lexical subset of the alphabet.
- A sublexicon field consists of a <sublexicon code>
and a <sublexicon name>.
- A <sublexicon code> is a field code assigned to the
internal code L by a FIELDCODE declaration in the main lexicon file.
- A <sublexicon name> is the name associated with a
sublexicon. It is a word composed of one or more characters, not
limited to the alphabetic characters declared in the rules file. Every
lexical item must belong to a sublexicon. Every lexicon must include a
special sublexicon named INITIAL (that is, there must be at least one
lexical entry that belongs to the INITIAL sublexicon).
- Lexical entries belonging to a sublexicon do not have to be listed
consecutively in a single file (as was the case for PC-KIMMO version
1); rather, lexical entries in a file can occur in any order,
regardless of what sublexicon they belong to. Lexical entries of a
sublexicon can even be placed in two or more separate files.
- An alternation field consists of an <alternation code>
followed by either an <alternation name> or the
<BOUNDARY symbol>.
- An <alternation name> is declared in an ALTERNATION
declaration in the main lexicon file. The <BOUNDARY
symbol> is declared in the rules file and indicates the end of
all possible continuations in the lexicon.
- A features field consists of a <features code> and a
<features list>.
- A <features code> is a field code assigned to the
internal code F by a FIELDCODE declaration in the main lexicon file.
- A <features list> is a list of feature abbreviations.
Each abbreviation is a single word consisting of alphanumeric characters
or other characters except (){}[]<>=:$! (these are used for special
purposes in the grammar file). The character \ should not be used as the
first character of an abbreviation because that is how fields are marked
in the lexicon file. Upper and lower case letters used in feature
abbreviations are considered different. For example, "PLURAL" is not the
same as "Plural" or "plural." Feature abbreviations are expanded into
full feature structures by the word grammar (see section 4.7.3).
- A gloss field consists of a <gloss code> and a
<gloss string>.
- A <gloss code> is a field code assigned to the
internal code G by a FIELDCODE declaration in the main lexicon file.
- A <gloss string> is a string of text. Any material
can be used in the gloss field with the exception of the comment
character.
Figure 4.6 shows a sample lexical entry.
Figure 4.6 A sample lexical entry
\lf `knives
\lx N
\alt Infl
\fea pl irreg
\gl N(`knife)+PL
The grammar file consists of feature templates, context-free rules, and
feature constraints. Figure 4.7 shows the
conventional structure of the grammar file.
Figure 4.7 Structure of the grammar file
LET <abbreviation | category> be <feature definition>
. (more feature templates)
.
.
DEFINE <lexical rule name> as <mappings>
. (more lexical rules)
.
.
PARAMETER <parameter name> is <parameter value>
. (more parameter settings)
.
.
RULE <rule>
<feature constraint>
. (more constraints)
.
.
(more rules)
.
.
.
END
The following subsections give the specifications for each part of the
grammar file.
Rules
The following specifications apply to rules.
A grammar rule has these parts, in the order listed:
- the keyword Rule
- an optional rule identifier enclosed in braces ({})
- the nonterminal symbol to be expanded
- an arrow (->) or equal sign (=)
- zero or more terminal or nonterminal symbols, possibly marked for
alternation or optionality
- an optional colon (:)
- zero or more feature constraints, possibly marked for
alternation
- an optional period (.)
The optional rule identifier (item 2) consists of one or more words enclosed in
braces. Its current utility is only as a special form of comment
describing the intent of the rule. (Eventually it may be used as a tag
for interactively adding and removing rules.) The only limits on the
rule identifier are that it not contain the comment character and that
it all appears on the same line in the grammar file.
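For example, a rule might carry an identifier like this (the identifier
text is arbitrary; the rule itself follows the pattern of the sample
grammar file in figure 4.9):
RULE {Inflected word}
Word = Stem INFL
<Stem head pos> = <INFL from_pos>
<Word head> = <INFL head>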
The terminal and nonterminal symbols in the rule have the following
characteristics:
- Blank lines, spaces, and tabs separate symbols
from one another, but otherwise are ignored.
- Upper and lower case letters used in symbols are considered different.
For example, STEM is not the same as Stem, and neither is
the same as stem.
- Index numbers are used to distinguish instances of a symbol that
is used more than once in a rule. They are added to the end of a
symbol following an underscore character (_). For example,
Stem_1 = Stem_2 SUFFIX
- The symbol X may be used to stand for any terminal or nonterminal
category. For example, this rule says that an N expands into an NStem
plus any category:
N = NStem X
The symbol X can be useful for capturing generalities. Care must be
taken, since it can be replaced by anything.
- The characters (){}[]<>=:/ cannot be used in terminal or
nonterminal symbols since they are used for special purposes in the
grammar file. The character _ can be used only for attaching an index
number to a symbol.
- By default, the left hand symbol of the first rule in the grammar file
is the start symbol of the grammar.
- There can be multiple rules for the same symbol, but all rules for
a symbol must be contiguous in the file.
The symbols on the right hand side of a context-free rule may also be
marked for alternation or optionality, as noted in the list of rule
parts above.
Feature structures
The grammar formalism uses a basic element called a feature
structure. A feature structure consists of a feature name and a
value. The notation used for feature structures looks like this:
[number: singular]
where number is the feature name and singular is the
value, separated by a colon. Feature names and values are single words
consisting of alphanumeric characters or other characters except (){}[]<>=:$! (these are used for special purposes in the grammar file). Upper and
lower case letters used in feature names and values are considered
different. For example, "NUMBER" is not the same as "Number" or
"number."
A structure containing more than one feature uses square
brackets around the entire structure:
[number: singular
case: nominative]
Extra spaces and line breaks are optional.
Feature structures can have
either simple values, such as the example above, or complex values,
such as this:
[agreement: [number: singular]
case: nominative]
where the value of the agreement feature is another feature
structure. Feature structures can be infinitely nested in this manner.
Features can share values. This is not the same thing as two features
having identical values. In the first example below, the features a and
b have identical values; but in the second example, they share the same
value:
[a: [p:q]
b: [p:q]]
[a: $1[p:q]
b: $1]
Shared values are indicated by coindexing them with the prefix $1, $2,
and so on.
Portions of a feature structure can be referred to using the
"path" notation. A path is a sequence of feature names (minimally one)
enclosed in angled brackets (<>). For example, consider this
feature structure:
[agreement: [number: singular
case: nominative]]
These are feature paths based on this structure:
<agreement>
<agreement number>
<agreement case>
Paths are used in feature templates and feature constraints, described below.
All lexical items used by the grammar are assigned three features: cat, lex,
and gloss. These should be treated as reserved names and not used
for other purposes.
- The value of the cat feature is the name of the sublexicon
to which the lexical item belongs, taken from the sublexicon field of
the item's lexical entry.
- The value of the lex feature is the lexical form of the
item, taken from the lexical form field of the item's lexical entry.
- The value of the gloss feature is the gloss of the item,
taken from the gloss field of the item's lexical entry.
For example, here is a lexical entry for the word fox:
\lf `fox
\lx N
\alt Stem
\gl N(fox)
When this entry is used by the grammar, it is represented as this feature structure:
[cat: N
lex: `fox
gloss: N(fox)]
Feature constraints
A rule is followed by zero or more feature constraints,
which refer to symbols used in the rule.
The following specifications apply to feature
constraints.
A feature constraint has these parts, in the order listed:
- a feature path that begins with one of the symbols from the
context-free rule
- an equal sign
- either another path or a value
A feature constraint that refers only to symbols on the right hand side
of the rule constrains their co-occurrence. In the following rule and
constraint, the value of the Stem's head pos
feature must unify with the value of the SUFFIX's from_pos feature:
Word -> Stem INFL
<Stem head pos> = <INFL from_pos>
If a feature constraint refers to a symbol on the right hand side of
the rule, and has an atomic value on its right hand side, then the
designated feature must not have a different value. In the following
rule and constraint, the head case
feature for the PRONOUN node of
the parse tree must either be originally undefined or equal to NOM:
Word -> PRONOUN
<PRONOUN head case> = NOM
(If the head case feature of the PRONOUN node was originally undefined, then, after unification succeeds, it will be equal to NOM.)
A feature constraint that refers to the symbol on the left hand side of
the rule passes information up the parse tree. In the following rule
and constraint, the value of the head
feature is passed from
the INFL node up to the Word node:
Word -> Stem INFL
<Word head> = <INFL head>
PC-KIMMO allows disjunctive feature constraints with its phrase
structure rules. Consider these two rules:
Stem_1 -> PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<PREFIX change_pos> = +
<Stem_1 head> = <PREFIX head>
Stem_1 -> PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<PREFIX change_pos> = -
<Stem_1 head> = <Stem_2 head>
These rules have the same context-free rule part. They can therefore be collapsed into this single rule, which has a disjunction in its feature constraints:
Stem_1 -> PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
{
<PREFIX change_pos> = +
<Stem_1 head> = <PREFIX head>
/
<PREFIX change_pos> = -
<Stem_1 head> = <Stem_2 head>
}
Disjunctive feature constraints may be nested up to eight levels deep.
Feature templates
The following specifications apply to feature templates.
A feature template has these parts, in the order listed:
- the keyword Let
- the template name
- the keyword be
- a feature definition
- an optional period (.)
If the template name is a terminal category (a terminal symbol in one
of the context-free rules), the template defines the default
features for that category. Otherwise the template name serves as an
abbreviation for the associated feature structure.
Templates may occur anywhere in the
file (interspersed among the rules), but a template must occur before
any rule or other template that uses the abbreviation it defines.
Template names are single words
consisting of alphanumeric characters or other characters except (){}[]<>=:$! (these are used for special purposes in the grammar file). The
character \ should not be used as the first character of a
template name because that is how fields are marked in the lexicon
file. Upper and
lower case letters used in template names are considered
different. For example, "PLURAL" is not the same as "Plural" or
"plural."
The abbreviations defined by templates are usually used in the feature
field of entries in the lexicon file. For example, the lexical entry
for the irregular plural form feet may have the abbreviation pl in its
features field. The grammar file would define this abbreviation with a
template like this:
Let pl be [number: PL]
The path notation may also be used:
Let pl be <number> = PL
More complicated feature structures may be defined in templates. For
example,
Let 3sg be [tense: PRES
agr: 3SG
finite: +
vform: S]
which is equivalent to:
Let 3sg be [<tense> = PRES
<agr> = 3SG
<finite> = +
<vform> = S]
In the following example, the abbreviation irreg is defined using
another abbreviation:
Let irreg be <reg> = -
             pl
The abbreviation pl must be defined previously in the grammar file or an
error will result. A subsequent template could also use the abbreviation
irreg in its definition. In this way, an inheritance hierarchy of
features may be constructed.
Feature templates permit disjunctive definitions. For example, the
lexical entry for the word deer may specify the feature abbreviation
sg/pl. The grammar file would define this as a disjunction of feature
structures reflecting the fact that the word can be either singular or
plural:
Let sg/pl be {[number:SG]
[number:PL]}
This has the effect of creating two entries for deer, one with
singular number and another with plural. Note that there is no limit
to the number of disjunct structures listed between the braces. Also,
there is no slash (/) between the elements of the disjunction as
there is between the elements of a disjunction in the rules.
A shorter version of the above template using the path notation looks
like this:
Let sg/pl be <number> = {SG PL}
Abbreviations can also be used in disjunctions, provided that they
have previously been defined:
Let sg be <number> = SG
Let pl be <number> = PL
Let sg/pl be {[sg] [pl]}
Note the square brackets around the abbreviations sg and pl; without
square brackets they would be interpreted as simple values instead.
Feature templates can assign default atomic feature values, indicated
by prefixing an exclamation point (!). A default value can be
overridden by an explicit feature assignment. This template says that
all members of category N have singular number as a default value:
Let N be <number> = !SG
The effect of this template is to make all nouns singular unless they
are explicitly marked as plural. For example, regular nouns such as
book do not need any feature in their lexical entries to signal that
they are singular; but an irregular noun such as feet would have a
feature abbreviation such as pl in its lexical entry. This would be
defined in the grammar as [number: PL], and would override the default
value for the feature number specified by the template above. If the N
template above used SG instead of !SG, then the word feet would fail to
parse, since its number feature would have an internal conflict between
SG and PL.
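Continuing the example above, a hypothetical sketch of this interaction
might pair the templates with lexicon entries like these (the entries
are invented; the field codes are those declared in figure 4.4):
;in the grammar file
Let N be <number> = !SG
Let pl be <number> = PL
;in a lexicon file
\lf `book
\lx N
\alt Stem
\gl N(`book)
\lf `feet
\lx N
\alt Stem
\fea pl
\gl N(`foot)+PL
Here book receives the default value SG for number, while the pl
abbreviation in the entry for feet overrides the default with PL.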
Parameters
Parameter settings are used to override various default settings assumed
in the grammar file. Parameter settings are optional. In the absence of
a parameter setting, a default value is used. A parameter setting has
these parts, in the order listed:
- the keyword Parameter
- an optional colon (:)
- one or more keywords identifying the parameter
- the keyword is
- the parameter value
- an optional period (.)
PC-KIMMO recognizes the following parameters:
- Start symbol defines the start symbol of the grammar. For example,
Parameter Start symbol is Word
declares that the parse goal of the grammar is the nonterminal category
Word. The default start symbol is the left hand symbol of the first
context-free rule in the grammar file.
- Attribute order specifies the order in which feature attributes are
displayed. For example,
Parameter Attribute order is cat head root root_pos
declares that the cat attribute should be the first one shown in any
output from PC-KIMMO and that the other attributes should be shown in
the relative order listed, with the root_pos attribute shown last among
those listed, but ahead of any attributes that are not listed.
Attributes that are not listed are ordered according to their character
code sort order. If the attribute order is not specified, then the
category feature cat is shown first, with all other attributes sorted
according to their character codes.
- Category feature defines the label for the category attribute. For
example,
Parameter Category feature is Categ
declares that Categ is the name of the category attribute. The default
name for this attribute is cat.
- Lexical feature defines the label for the lexical attribute. For
example,
Parameter Lexical feature is Lex
declares that Lex is the name of the lexical attribute. The default
name for this attribute is lex.
- Gloss feature defines the label for the gloss attribute. For example,
Parameter Gloss feature is Gloss
declares that Gloss is the name of the gloss attribute. The default
name for this attribute is gloss.
Lexical rules
Lexical rules are used to modify the feature structures of lexical
entries.
As noted in Shieber 1985, something more powerful than just
abbreviations for common feature elements is sometimes needed to
represent systematic relationships among the elements of a lexicon.
This need is met by lexical rules, which express transformations rather
than mere abbreviations.
Lexical rules are similar to feature templates, but are more powerful. While feature templates assign a feature structure to lexical items by means of unification, lexical rules map one feature structure to another, thus transforming it. The name of a lexical rule is included in the features field of lexical entries, similar to feature abbreviations.
A lexical rule has these parts, in the order listed:
- the keyword Define
- the name of the lexical rule
- the keyword as
- the rule definition
- an optional period (.)
The rule definition consists of one or more mappings. Each mapping has
three parts: an output feature path, an assignment operator, and the
value assigned, either an input feature path or an atomic value. Every
output path begins with the feature name out and every input
path begins with the feature name in. The assignment operator
is either an equal sign (=) or an equal sign followed by a
"greater than" sign (=>). (These two operators are
equivalent in PC-KIMMO, since the implementation treats each
lexical rule as an ordered list of assignments rather than using
unification for the mappings that have an equal sign operator.)
Consider the information shown in figure 4.8A.
Figure 4.8A A lexical rule example
;lexical item
\lf `mouse
\fea irreg POS_Gloss
\gl `mouse
;feature template
LET irreg be <reg> = -
;lexical rule
DEFINE POS_Gloss as
<out cat> = <in cat>
<out head> = <in head>
<out lex> = <in lex>
<out gloss> = <in head pos> .
The feature field (\fea ) of the lexical entry contains two labels: irreg is a feature abbreviation and is defined by a feature template (the LET statement), while POS_Gloss is the name of a lexical rule which is defined by the DEFINE statement.
Figure 4.8B Feature structure before application of lexical rule
[ cat: ROOT
head: [ agr: [ 3sg:- ]
number:PL
pos: N
proper:-
verbal:- ]
reg: -
lex: `mice
gloss: `mouse ]
Figure 4.8C Feature structure after application of lexical rule
[ cat: ROOT
head: [ agr: [ 3sg:- ]
number:PL
pos: N
proper:-
verbal:- ]
lex: `mice
gloss: N ]
When the lexicon entry is loaded, it is initially assigned the feature
structure shown in figure 4.8B, which is the
unification of the information given in the various fields of the
lexicon entry, including the feature abbreviation pl. After the
complete feature structure has been built, the lexical rule named POS_Gloss is
applied, producing the feature structure shown in figure 4.8C.
Note that the change in the value of the gloss feature from "`mouse" to "N" is done by direct mapping, not unification.
There are two important points about using lexical rules. First, the feature structure of a lexical item that has undergone a lexical rule is entirely determined by the mappings in the lexical rule. In the lexical rule in figure 4.8A, the first three mappings (for cat, head, and lex), though they seem redundant, are needed to carry over these feature values from the input feature structure to the output feature structure. Notice that the feature reg which is present in the input feature structure in figure 4.8B is absent from the output feature structure in figure 4.8C; this is due to the fact that the lexical rule which applied to the feature structure did not include a mapping for the reg feature.
Second, lexical rules apply sequentially in the order in which they are given in the grammar file.
Figure 4.9 shows a sample grammar file.
Figure 4.9 A sample grammar file
;FEATURE TEMPLATES (optional)
;Feature definitions
Let pl be <head number> = PL
LET v/n be <from_pos> = V
<head pos> = N
<head number> = !SG
LET v\aj be <from_pos> = AJ
<head pos> = V
;Category definitions
Let N be <cat> = ROOT
<head pos> = N
<head number> = !SG
Let V be <cat> = ROOT
<head pos> = V
Let AJ be <cat> = ROOT
<head pos> = AJ
;PARAMETER SETTINGS (optional)
PARAMETER Start symbol is Word
;RULES
RULE
Word = Stem INFL
<Stem head pos> = <INFL from_pos>
<Word head> = <INFL head>
RULE
Stem_1 = PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<Stem_1 head> = <PREFIX head>
RULE
Stem_1 = Stem_2 SUFFIX
<Stem_2 head pos> = <SUFFIX from_pos>
<Stem_1 head> = <SUFFIX head>
RULE
Stem = ROOT
<Stem head> = <ROOT head>
The generation comparison file serves as input to the compare
generate command (see section
4.5.12). It consists of groupings
of a lexical form followed by one or more surface forms that are
expected to be generated from the lexical form. The following
specifications apply to the generation comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping.
Extra blank lines are ignored.
- The first form in each grouping is the lexical form to be input to
the generator. Its gloss does not have to be included, since the
generator does not use the lexicon; however, including a gloss with the
lexical form does no harm--it is simply ignored.
- Succeeding forms in each grouping are surface forms that are the
expected output of the generator.
Figure 4.10 shows a sample generation comparison file.
Figure 4.10 A sample generation comparison file
`trace+ed
traced
`trace+able
traceable
re-+`trace
re-trace
retrace
The recognition comparison file serves as input to the compare
recognize command (see section
4.5.12). It consists of groupings
of a surface form followed by one or more lexical forms that are
expected to be recognized from the surface form. The following
specifications apply to the recognition comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping.
Extra blank lines are ignored.
- The first form in each grouping is the surface form to be input to
the recognizer.
- Succeeding forms in each grouping are lexical forms that are the
expected output of the recognizer. The gloss of a form follows it on
the same line, separated by one or more spaces. The gloss must match
exactly (including spaces) the way it is output from the recognizer.
Figure 4.11
shows a sample recognition comparison file.
Figure 4.11 A sample recognition comparison file
traced
`trace+ed [ V(trace)+PAST ]
`trace+ed [ V(trace)+PAST.PRTC ]
traceable
`trace+able [ V(trace)+ADJR ]
retrace
re-+`trace [ REP+V(trace).INF ]
The pairs comparison file serves as input to the compare pairs
command (see section 4.5.12).
It consists of pairs of lexical and surface forms; that is, a lexical
form followed by exactly one surface form. It is expected that the
surface form will be recognized from the lexical form and that the
lexical form will be generated from the surface form. Glosses do not
have to be included with lexical forms, since the generator does not
use the lexicon; however, including a gloss with the lexical form does
no harm--it is simply ignored. When recognizing a surface form, the
lexicon is used to identify the constituent morphemes and verify that
they occur in the correct order, but the gloss part of a lexical entry
is not used. The following specifications apply to the pairs comparison
file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping. Extra
blank lines are ignored.
- The first form of a pair is the lexical form, which is input to the
generator. It is the expected output on inputting the second (surface) form to
the recognizer. The gloss is not included with the lexical form.
- The second form of a pair is the surface form, which is input to the
recognizer. It is the expected output on inputting the first (lexical) form to
the generator.
Figure 4.12 shows a sample pairs comparison file.
Figure 4.12 A sample pairs comparison file
`trace+ed
traced
`trace+able
traceable
re-+`trace
re-trace
re-+`trace
retrace
The synthesis comparison file serves as input to the compare
synthesize command (see section
4.5.12). It consists of groupings
of a morphological form followed by one or more surface forms that are
expected to be synthesized from the morphological form. The following
specifications apply to the synthesis comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping.
Extra blank lines are ignored.
- The first form in each grouping is the morphological form to be input to
the synthesizer. A morphological form is a sequence of morpheme glosses separated by spaces.
- Succeeding forms in each grouping are surface forms that are the
expected output of the synthesizer.
Figure 4.12A shows a sample synthesis comparison file.
Figure 4.12A A sample synthesis comparison file
`trace +ED
traced
`trace +EN
traced
`trace +AJR25a
traceable
ORD5+ `trace
retrace
The generation file consists of a list of lexical forms. It serves as
input to the file generate command (see section 4.5.13), which returns a file (or
screen display) whose format is identical to the generation comparison
file. The following specifications apply to the generation file.
- Each form must be on a separate line.
- Extra white space, blank lines, and comment lines are ignored.
- Each form is assumed to be a lexical form. If a gloss is included, it is
ignored.
Figure 4.13 shows a sample generation file.
Figure 4.13 A sample generation file
`cat
`cat+s
`cat+'s
`cat+s+'s
`fox
`fox+s
`fox+'s
`fox+s+'s
The recognition file consists of a list of surface forms. It serves as
input to the file recognize command (see section 4.5.14), which returns a file
(or screen display) whose format is identical to the recognition
comparison file. The following specifications apply to the recognition
file.
- Each form must be on a separate line.
- Extra spaces, blank lines, and comment lines are ignored.
- Each form is assumed to be a surface form.
Figure 4.14 shows a sample recognition file.
Figure 4.14 A sample recognition file
cat
cats
cat's
cats'
fox
foxes
fox's
foxes'
The synthesis file consists of a list of morphological forms. A morphological form is a sequence of morpheme glosses separated by spaces. A synthesis file serves as
input to the file synthesize command (see section
4.5.13), which returns a file (or
screen display) whose format is identical to the synthesis comparison
file. The following specifications apply to the synthesis file.
- Each form must be on a separate line.
- Extra white space, blank lines, and comment lines are ignored.
- Each form is assumed to be a morphological form.
Figure 4.14A shows a sample synthesis file.
Figure 4.14A A sample synthesis file
`cat
`cat +PL
`cat +GEN
`cat +PL +GEN
`fox
`fox +PL
`fox +GEN
`fox +PL +GEN
Figure 4.15 summarizes the default file names and
extensions assumed by PC-KIMMO. Two entries are given for the different
kinds of files. The first is the name PC-KIMMO will assume if no file
name at all is given to a command that expects that kind of file. The
second entry (with the *) shows what extension PC-KIMMO will add if a
file name without an extension is given.
Figure 4.15 Default file names and extensions
Rules file: RULES.RUL
*.RUL
Lexicon file: LEXICON.LEX
*.LEX
Grammar file: GRAMMAR.GRM
*.GRM
Generation comparison file: DATA.GEN
*.GEN
Recognition comparison file: DATA.REC
*.REC
Pairs comparison file: DATA.PAI
*.PAI
Synthesis comparison file: DATA.SYN
*.SYN
Take file: PCKIMMO.TAK
*.TAK
Log file: PCKIMMO.LOG
*.LOG
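As an illustration of how these defaults are applied (a hypothetical
session; see the load and take commands in sections 4.5 and 4.6):
load rules               ;opens RULES.RUL
load rules english       ;no extension given, so ENGLISH.RUL is opened
load lexicon english.lex ;opened exactly as named
take mytest              ;opens MYTEST.TAK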