4.7 File formats
Figure 4.1 Structure of the rules file
Figure 4.2 A sample rules file
Figure 4.3 Structure of the main lexicon file
Figure 4.4 A sample main lexicon file
Figure 4.5 Structure of a lexical entry
Figure 4.6 A sample lexical entry
Figure 4.7 Structure of the grammar file
Figure 4.8A A lexical rule example
Figure 4.8B Feature structure before application of lexical rule
Figure 4.8C Feature structure after application of lexical rule
Figure 4.9 A sample grammar file
Figure 4.10 A sample generation comparison file
Figure 4.11 A sample recognition comparison file
Figure 4.12 A sample pairs comparison file
Figure 4.12A A sample synthesis comparison file
Figure 4.13 A sample generation file
Figure 4.14 A sample recognition file
Figure 4.14A A sample synthesis file
Figure 4.15 Default file names and extensions
This section describes the formats for the files that are used as input
to PC-KIMMO. In any of the files, comments can be added to any line by
preceding the comment with the comment character. This character is
normally a semicolon (;), but can be changed with the COMMENT keyword
in the rules file. Anything following a comment character (until the
end of the line) is considered part of the comment and is ignored by
PC-KIMMO.
In the descriptions below, reference to the use of a space character
implies any whitespace character (that is, any character treated like a
space character). The following control characters when used in a file
are whitespace characters: ^I (ASCII 9, tab), ^J (ASCII 10, line feed),
^K (ASCII 11, vertical tab), ^L (ASCII 12, form feed), and ^M (ASCII
13, carriage return).
The control character ^Z (ASCII 26) cannot be used because MS-DOS
interprets it as marking the end of a file. Also the control character
^@ (ASCII 0, null) cannot be used.
Examples of each of the following file types are found on the release
diskette as part of the English description.
The general structure of the rules file is a list of keyword
declarations. Figure 4.1 shows the conventional
structure of the rules file. Note that the notation {x |
y} means either x or y (but not both). The
following specifications apply to the rules file.
Figure 4.1 Structure of the rules file
COMMENT <character>
ALPHABET <symbol list>
NULL <character>
ANY <character>
BOUNDARY <character>
SUBSET <subset name> <symbol list>
. (more subsets)
.
.
RULE <rule name> <number of states> <number of columns>
<lexical symbol list>
<surface symbol list>
<state number>{: | .} <state number list>
. (more states)
.
.
. (more rules)
.
.
END
- Extra spaces, blank lines, and comment lines are ignored.
- Comments may be placed anywhere in the file. All data following a
comment character to the end of the line is ignored. (See below on the
COMMENT declaration.)
- The set of valid keywords used to form declarations includes
COMMENT, ALPHABET, NULL, ANY, BOUNDARY, SUBSET, RULE, and END.
- These declarations are obligatory and can occur only once in a
file: ALPHABET, NULL, ANY, BOUNDARY.
- These declarations are optional and can occur one or more times in
a file: COMMENT, SUBSET, and RULE.
- The COMMENT declaration sets the comment character used in the
rules file, lexicon files, and grammar file. The COMMENT declaration
can only be used in the rules file, not in the lexicon or grammar file.
The COMMENT declaration is optional. If it is not used, the comment
character is set to ; (semicolon) as a default.
- The COMMENT declaration can be used anywhere in the rules file and
can be used more than once. That is, different parts of the rules file
can use different comment characters. The COMMENT declaration can (and
in practice usually does) occur as the first keyword in the rules file,
followed by either one or more COMMENT declarations or the ALPHABET
declaration.
- Note that if you use the COMMENT declaration to declare the
character that is already in use as the comment character, an error
will result. For instance, if semicolon is the current comment
character, the declaration COMMENT ; will result in an error.
- The comment character can no longer be set using a command line
option or with a command in the user interface, as was the case in
version 1 of PC-KIMMO.
- The ALPHABET declaration must either occur first in the file or be
preceded only by COMMENT declarations. The other declarations
can appear in any order. The COMMENT, NULL, ANY, BOUNDARY, and SUBSET
declarations can even be interspersed among the rules. However, these
declarations must appear before any rule that uses them or an error
will result.
- The ALPHABET declaration defines the set of symbols used in
either lexical or surface representations. The keyword ALPHABET is
followed by a <symbol list> of all alphabetic
symbols. Each symbol must be separated from the others by at
least one space. The list can span multiple lines, but ends with the
next valid keyword. All alphanumeric characters (such as a,
B, and 2), symbols (such as $ and +), and
punctuation characters (such as . and ?) are available as
alphabet members. The characters in the IBM extended character set
(above ASCII 127) are also available. Control characters (below ASCII
32) can also be used, with the exception of whitespace characters (see
above), ^Z (end of file), and ^@ (null). The alphabet can contain a
maximum of 255 symbols. An alphabetic symbol can also be a multigraph,
that is, a sequence of two or more characters. The individual
characters composing a multigraph do not necessarily have to also be
declared as alphabetic characters. For example, an alphabet could
include the characters s and z and the multigraph sz%,
but not include % as an alphabetic character. Note that a multigraph
cannot also be interpreted as a sequence of the individual characters that
comprise it. (A fragment illustrating a multigraph declaration follows
figure 4.2 below.)
- The keyword NULL is followed by a single <character>
that represents a null (empty, zero) element. The NULL symbol is
considered to be an alphabetic character, but cannot also be listed in
the ALPHABET declaration. The NULL symbol declared in the rules file is
also used in the lexicon file to represent a null lexical entry.
- The keyword ANY is followed by a single "wildcard"
<character> that represents a match of any character in
the alphabet. The ANY symbol is not considered to be an alphabetic
character, though it is used in the column headers of state tables. It
cannot be listed in the ALPHABET declaration. It is not used in the
lexicon file.
- The keyword BOUNDARY is followed by a single <character>
that represents an initial or final word boundary. The
BOUNDARY symbol is considered to be an alphabetic character, but cannot
also be listed in the ALPHABET declaration. When used in the column
header of a state table, it can only appear as the pair #:#
(where, for instance, # has been declared as the BOUNDARY
symbol). The BOUNDARY symbol is also used in the lexicon file in the
continuation class field of a lexical entry to indicate the end of a
word (that is, no continuation class).
- The SUBSET declaration defines a set of characters that are referred
to in the column headers of rules. The keyword SUBSET is followed by
the <subset name> and <symbol list>.
<subset name> is a single word (one or more characters)
that names the list of characters that follows it. The subset name must
be unique (that is, if it is a single character it cannot also be in
the alphabet or be any other declared symbol). It can be composed of
any characters (except space); that is, it is not limited to the
characters declared in the ALPHABET section. It must not be identical
to any keyword used in the rules file. The subset name is used in rules
to represent all members of the subset of the alphabet that it defines.
Note that SUBSET declarations can be interspersed among the rules. This
allows subsets to be placed near the rule that uses them if such a
style is desired. However, a subset must be declared before a rule that
uses it.
- The <symbol list> following a <subset
name> is a list of single symbols, each of which is separated
by at least one space. The list can span multiple lines. Each symbol
in the list must be a member of the previously defined ALPHABET, with
the exception of the NULL symbol, which can appear in a subset list but
is not included in the ALPHABET declaration. Neither the ANY symbol nor
the BOUNDARY symbol can appear in a subset symbol list.
- The keyword RULE signals that a state table immediately follows.
- <rule name> is the name or description of the rule
which the state table encodes. It functions as an annotation to the
state table and has no effect on the computational operation of the
table. It is displayed by the list rules and show rule
commands and is also displayed in traces. The rule name must be
surrounded by a pair of identical delimiter characters. Any material
can be used between the delimiters of the rule name with the exception
of the current comment character and of course the rule name delimiter
character of the rule itself. Each rule in the file can use a different
pair of delimiters. The rule name must be all on one line, but it does
not have to be on the same line as the RULE keyword.
- <number of states> is the number of states (rows in
the table) that will be defined for this table. The states must begin
at 1 and go in sequence through the number defined here (that is, gaps
in state numbers are not allowed).
- <number of columns> is the number of state
transitions (columns in the table) that will be defined for each state.
- <lexical symbol list> is a list of elements
separated by one or more spaces. Each element represents the lexical
half of a lexical:surface correspondence which, when matched, defines a
state transition. Each element in the list must be either a member of
the alphabet, a subset name, the NULL symbol, the ANY symbol, or the
BOUNDARY symbol (in which case the corresponding surface character must
also be the BOUNDARY symbol). The list can span multiple lines, but the
number of elements in the list must be equal to the number of columns
defined for the rule.
- <surface symbol list> is a list of elements
separated by one or more spaces. Each element represents the surface
half of a lexical:surface correspondence which, when matched, defines a
state transition. Each element in the list must be either a member of
the alphabet, a subset name, the NULL symbol, the ANY symbol, or the
BOUNDARY symbol (in which case the corresponding lexical character must
also be the BOUNDARY symbol). The list can span multiple lines, but the
number of elements in the list must be equal to the number of columns
defined for the rule.
- <state number> is the number of the state or row of
the table. The first state number must be 1, and subsequent state
numbers must follow in numerical sequence without any gaps.
- {: | .} is the final or nonfinal state indicator. This should be a
colon (:) if the state is a final state and a period (.) if it is a
nonfinal state. It must follow the <state number> with no
intervening space.
- <state number list> is a list of state transition
numbers for a particular state. Each number must be between 1 and the
number of states (inclusive) declared for the table. The list can span
multiple lines, but the number of elements in the list must be equal to
the number of columns declared for this rule.
- The keyword END follows all other declarations and indicates the
end of the rules file. Any material in the file thereafter is ignored
by PC-KIMMO. The END keyword is optional; the physical end of the file
also terminates the rules file.
Figure 4.2 shows a sample rules file.
Figure 4.2 A sample rules file
ALPHABET
b c d f g h j k l m n p q r s t v w x y z + ; + is morpheme boundary
a e i o u
NULL 0
ANY @
BOUNDARY #
SUBSET C b c d f g h j k l m n p q r s t v w x y z
SUBSET V a e i o u
; more subsets
RULE "Consonant defaults" 1 23
b c d f g h j k l m n p q r s t v w x y z + @
b c d f g h j k l m n p q r s t v w x y z 0 @
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
RULE "Vowel defaults" 1 6
a e i o u @
a e i o u @
1: 1 1 1 1 1 1
RULE "Voicing s:z <=> V___V" 4 4
V s s @
V z @ @
1: 2 0 1 1
2: 2 4 3 1
3: 0 0 1 1
4. 2 0 0 0
; more rules
END
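The sample file in figure 4.2 does not illustrate a multigraph. The
following fragment is a hypothetical sketch (the symbols and the rule are
invented for illustration, not taken from the English description) showing
a multigraph declared as a single alphabetic symbol and then used in the
column header of a state table:
ALPHABET
a e i o u s z sz%       ; sz% is a multigraph, a single alphabetic symbol
NULL 0
ANY @
BOUNDARY #
RULE "sz% is realized as s" 1 2
sz% @
s   @
1:  1 1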
A lexicon consists of one main lexicon file plus one or more files of
lexical entries. The general structure of the main lexicon file is a
list of keyword declarations. The set of valid keywords is ALTERNATION,
FEATURES, FIELDCODE, INCLUDE, and END. Figure 4.3
shows the conventional structure of the lexicon file. The following
specifications apply to the main lexicon file.
Figure 4.3 Structure of the main lexicon file
ALTERNATION <alternation name> <sublexicon name list>
. (more ALTERNATIONs)
.
.
FEATURES <feature abbreviation list>
FIELDCODE <lexical item code> U
FIELDCODE <sublexicon code> L
FIELDCODE <alternation code> A
FIELDCODE <features code> F
FIELDCODE <gloss code> G
INCLUDE <filespec>
. (more INCLUDEd files)
.
.
END
- Extra spaces, blank lines, and comment lines are ignored.
- The comment character declared in the rules file is operative in
the main lexicon file. Comments may be placed anywhere in the file. All
data following a comment character to the end of the line is ignored.
- The set of valid keywords used to form declarations includes
ALTERNATION, FEATURES, FIELDCODE, INCLUDE, and END.
- The declarations can appear in any order with the proviso that any
alternation name, feature name, or fieldcode used in a lexical entry
must be declared before the lexical entry is read. In practice, this
means that the INCLUDE declarations should appear last, but the
ALTERNATION, FEATURES, and FIELDCODE declarations can appear in any
order.
- The ALTERNATION declaration defines a set of sublexicon names that
serve as the continuation class of a lexical item. The ALTERNATION
keyword is followed by an <alternation name> and a
<sublexicon name list>. ALTERNATION declarations are
optional (but nearly always used in practice) and can occur as many
times as needed.
- <alternation name> is a name associated with the
following <sublexicon name list>. It is a word composed of
one or more characters, not limited to the ALPHABET characters declared
in the rules file. An alternation name can be any word other than a
keyword used in the lexicon file. The program does not check to see if
an alternation name is actually used in the lexicon file.
- <sublexicon name list> is a list of sublexicon names.
It can span multiple lines until the next valid keyword is encountered.
Each sublexicon name in the list must be used in the sublexicon field
of a lexical entry. Although it is not enforced at the time the lexicon
file is loaded, an undeclared sublexicon named in a sublexicon name
list will cause an error when the recognizer tries to use it.
- The FEATURES keyword is followed by a <feature abbreviation
list>. A <feature abbreviation list> is a list of
words, each of which is expanded into feature structures by the word
grammar.
- The FIELDCODE declaration is used to define what fieldcode will be
used to mark each type of field in a lexical entry. The FIELDCODE
keyword is followed by a <code> and one of five possible
internal codes: U, L, A, F, or G. There must be five FIELDCODE
declarations, one for each of these internal codes, where U indicates
the lexical item field, L indicates the sublexicon field, A indicates
the alternation field, F indicates the features field, and G indicates
the gloss field.
- The INCLUDE keyword is followed by a <filespec> that
names a file containing lexical entries to be loaded. An INCLUDEd file
cannot contain any declarations (such as a FIELDCODE or an INCLUDE
declaration), only lexical entries and comment lines.
- The keyword END follows all other declarations and indicates the
end of the main lexicon file. Any material in the file thereafter is
ignored by PC-KIMMO. The END keyword is optional; the physical end of
the file also terminates the main lexicon file.
Figure 4.4 shows a sample main lexicon file.
Figure 4.4 A sample main lexicon file
ALTERNATION Begin PREF
ALTERNATION Pref N AJ V AV
ALTERNATION Stem SUFFIX
FEATURES sg pl reg irreg
FIELDCODE lf U ;lexical item
FIELDCODE lx L ;sublexicon
FIELDCODE alt A ;alternation
FIELDCODE fea F ;features
FIELDCODE gl G ;gloss
INCLUDE affix.lex ;file of affixes
INCLUDE noun.lex ;file of nouns
INCLUDE verb.lex ;file of verbs
INCLUDE adjectiv.lex ;file of adjectives
INCLUDE adverb.lex ;file of adverbs
END
Figure 4.5 shows the structure of a lexical entry.
Lexical entries are encoded in "field-oriented standard format."
Standard format is an information interchange convention developed by
the Summer Institute of Linguistics. It tags the kinds of information
in ASCII text files by means of markers which begin with backslash.
Field-oriented standard format (FOSF) is a refinement of standard
format geared toward representing data which has a database-like record
and field structure. The following points provide an informal
description of the syntax of FOSF files.
Figure 4.5 Structure of a lexical entry
\<lexical item code> <lexical item>
\<sublexicon code> <sublexicon name>
\<alternation code> {<alternation name> | <BOUNDARY symbol>}
\<features code> <features list>
\<gloss code> <gloss string>
- A field-oriented standard format (FOSF) file consists of a
sequence of records.
- A record consists of a sequence of fields.
- A field consists of a field marker and a field value.
- A field marker consists of a backslash character at the
beginning of a line, followed by an alphabetic or numeric character,
followed by zero or more printable characters, and terminated by a
space, tab, or the end of a line. A field marker without its initial
backslash character is termed a field code.
- A field marker must begin in the first position of a line.
Backslash characters occurring elsewhere in the file are not
interpreted as field markers.
- The first field marker of the record is considered the record
marker, and thus the same field must occur first in every record of the
file.
- Each field marker is separated from the field value by one
or more spaces, tabs, or newlines. The field value continues up to the
next field marker.
- Any line that is empty or contains only whitespace characters is
considered a comment line and is ignored. Comment lines may occur
between or within fields.
- Fields and lines in an FOSF file can be arbitrarily long.
- There are two basic types of fields in FOSF files:
nonrepeating and repeating. Repeating fields are multiple
consecutive occurrences of fields marked by the same marker. Individual
fields within a repeating field can be called subfields.
The following specifications apply to how FOSF is implemented
in PC-KIMMO.
- Lexical entries are encoded as records in a FOSF file.
- Only those fields whose field codes are declared in the main
lexicon file are recognized (see above on the FIELDCODE declaration).
All other fields are considered to be extraneous and are ignored.
- The first field of each lexical entry must be the lexical item
field. The lexical item field code is assigned to the internal code U
by a FIELDCODE declaration in the main lexicon file.
- Only nonrepeating fields are permitted.
- The comment character declared in the rules file is operative in
included files of lexical entries. All data following a comment
character to the end of the line is ignored.
A file of lexical entries is loaded by using an INCLUDE declaration in
the main lexicon file (see above). An INCLUDEd file of lexical entries cannot
contain any declarations (such as a FIELDCODE or an INCLUDE declaration), only
lexical entries and comment lines.
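For illustration, an INCLUDEd file of lexical entries such as noun.lex
might contain records like the following. This is a hypothetical sketch;
it assumes the field codes, alternation names, and feature abbreviations
declared in the sample main lexicon file of figure 4.4, and the entries
themselves are invented:
;noun.lex -- hypothetical entries
\lf `cat
\lx N
\alt Stem
\fea reg
\gl N(`cat)

\lf `knives
\lx N
\alt Stem
\fea pl irreg
\gl N(`knife)+PL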
The following specifications apply to lexical entries.
- A lexical entry is composed of five fields: lexical item,
sublexicon, alternation, features, and gloss. The lexical item,
sublexicon, and alternation fields are obligatory; the features and
gloss fields are optional. The first field of the entry must always be
the lexical item. The other fields can appear in any order, even
differing from one entry to another.
- Although the gloss field is optional, if a lexical entry does not
include one, a warning message to that effect will be displayed when
the entry is loaded. To suppress this warning message, use the command
set warnings off (see section 4.5.6.1) before loading the
lexicon.
- If an entry has an empty gloss field (that is, the field marker
for the gloss field is present but there is no data after it), then the
contents of the lexical form field will also be used as the gloss
for that entry.
- A lexical item field consists of a <lexical item
code> and a <lexical item>.
- A <lexical item code> is a field code assigned to the
internal code U by a FIELDCODE declaration in the main lexicon file.
- A <lexical item> is one or more characters that
represent an element (typically a morpheme or word) of the lexicon.
Each character (or multigraph) must be in the alphabet defined for the language. The
lexical item uses only the lexical subset of the alphabet.
- A sublexicon field consists of a <sublexicon code>
and a <sublexicon name>.
- A <sublexicon code> is a field code assigned to the
internal code L by a FIELDCODE declaration in the main lexicon file.
- A <sublexicon name> is the name associated with a
sublexicon. It is a word composed of one or more characters, not
limited to the alphabetic characters declared in the rules file. Every
lexical item must belong to a sublexicon. Every lexicon must include a
special sublexicon named INITIAL (that is, there must be at least one
lexical entry that belongs to the INITIAL sublexicon).
- Lexical entries belonging to a sublexicon do not have to be listed
consecutively in a single file (as was the case for PC-KIMMO version
1); rather, lexical entries in a file can occur in any order,
regardless of what sublexicon they belong to. Lexical entries of a
sublexicon can even be placed in two or more separate files.
- An alternation field consists of an <alternation code>
followed by either an <alternation name> or the
<BOUNDARY symbol>.
- An <alternation name> is declared in an ALTERNATION
declaration in the main lexicon file. The <BOUNDARY
symbol> is declared in the rules file and indicates the end of
all possible continuations in the lexicon.
- A features field consists of a <features code> and a
<features list>.
- A <features code> is a field code assigned to the
internal code F by a FIELDCODE declaration in the main lexicon file.
- A <features list> is a list of feature abbreviations.
Each abbreviation is a single word consisting of alphanumeric characters
or other characters except (){}[]<>=:$! (these are used for special
purposes in the grammar file). The character \ should not be used as the
first character of an abbreviation because that is how fields are marked
in the lexicon file. Upper and lower case letters used in feature
abbreviations are considered different. For example, "PLURAL" is not the
same as "Plural" or "plural." Feature abbreviations are expanded into
full feature structures by the word grammar (see section 4.7.3).
- A gloss field consists of a <gloss code> and a
<gloss string>.
- A <gloss code> is a field code assigned to the
internal code G by a FIELDCODE declaration in the main lexicon file.
- A <gloss string> is a string of text. Any material
can be used in the gloss field with the exception of the comment
character.
Figure 4.6 shows a sample lexical entry.
Figure 4.6 A sample lexical entry
\lf `knives
\lx N
\alt Infl
\fea pl irreg
\gl N(`knife)+PL
The grammar file consists of feature templates, context-free rules, and
feature constraints. Figure 4.7 shows the
conventional structure of the grammar file.
Figure 4.7 Structure of the grammar file
LET <abbreviation | category> be <feature definition>
. (more feature templates)
.
.
DEFINE <lexical rule name> as <mappings>
. (more lexical rules)
.
.
PARAMETER <parameter name> is <parameter value>
. (more parameter settings)
.
.
RULE <rule>
<feature constraint>
. (more constraints)
.
.
(more rules)
.
.
.
END
The following subsections give the specifications for each part of the
grammar file.
Rules
The following specifications apply to rules.
A grammar rule has these parts, in the order listed:
- the keyword Rule
- an optional rule identifier enclosed in braces ({})
- the nonterminal symbol to be expanded
- an arrow (->) or equal sign (=)
- zero or more terminal or nonterminal symbols, possibly marked for
alternation or optionality
- an optional colon (:)
- zero or more feature constraints, possibly marked for
alternation
- an optional period (.)
The optional rule identifier (item 2) consists of one or more words enclosed in
braces. Its current utility is only as a special form of comment
describing the intent of the rule. (Eventually it may be used as a tag
for interactively adding and removing rules.) The only limits on the
rule identifier are that it not contain the comment character and that
it all appears on the same line in the grammar file.
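For example, a rule might carry an identifier like this (the identifier
text is arbitrary; the rule itself follows the pattern of the sample
grammar file in figure 4.9):
RULE {Inflected word}
Word = Stem INFL
<Stem head pos> = <INFL from_pos>
<Word head> = <INFL head>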
The terminal and nonterminal symbols in the rule have the following
characteristics:
- Blank lines, spaces, and tabs separate symbols
from one another, but otherwise are ignored.
- Upper and lower case letters used in symbols are considered different.
For example, STEM is not the same as Stem, and neither is
the same as stem.
- Index numbers are used to distinguish instances of a symbol that
is used more than once in a rule. They are added to the end of a
symbol following an underscore character (_). For example,
Stem_1 = Stem_2 SUFFIX
- The symbol X may be used to stand for any terminal or nonterminal
category. For example, this rule says that an N expands into an NStem
plus any category:
N = NStem X
The symbol X can be useful for capturing generalities. Care must be
taken, since it can be replaced by anything.
- The characters (){}[]<>=:/ cannot be used in terminal or
nonterminal symbols since they are used for special purposes in the
grammar file. The character _ can be used only for attaching an index
number to a symbol.
- By default, the left hand symbol of the first rule in the grammar file
is the start symbol of the grammar.
- There can be multiple rules for the same symbol, but all rules for
a symbol must be contiguous in the file.
The symbols on the right hand side of a context-free rule may also be
marked for alternation or optionality, as noted in the list of rule
parts above.
Feature structures
The grammar formalism uses a basic element called a feature
structure. A feature structure consists of a feature name and a
value. The notation used for feature structures looks like this:
[number: singular]
where number is the feature name and singular is the
value, separated by a colon. Feature names and values are single words
consisting of alphanumeric characters or other characters except (){}[]<>=:$! (these are used for special purposes in the grammar file). Upper and
lower case letters used in feature names and values are considered
different. For example, "NUMBER" is not the same as "Number" or
"number."
A structure containing more than one feature uses square
brackets around the entire structure:
[number: singular
case: nominative]
Extra spaces and line breaks are optional.
Feature structures can have
either simple values, such as the example above, or complex values,
such as this:
[agreement: [number: singular]
case: nominative]
where the value of the agreement feature is another feature
structure. Feature structures can be infinitely nested in this manner.
Features can share values. This is not the same thing as two features
having identical values. In the first example below, the features a and
b have identical values; but in the second example, they share the same
value:
[a: [p:q]
b: [p:q]]
[a: $1[p:q]
b: $1]
Shared values are indicated by coindexing them with the prefix $1, $2,
and so on.
Portions of a feature structure can be referred to using the
"path" notation. A path is a sequence of feature names (minimally one)
enclosed in angled brackets (<>). For example, consider this
feature structure:
[agreement: [number: singular
case: nominative]]
These are feature paths based on this structure:
<agreement>
<agreement number>
<agreement case>
Paths are used in feature templates and feature constraints, described below.
All lexical items used by the grammar are assigned three features: cat, lex,
and gloss. These should be treated as reserved names and not used
for other purposes.
- The value of the cat feature is the name of the sublexicon
to which the lexical item belongs, taken from the sublexicon field of
the item's lexical entry.
- The value of the lex feature is the lexical form of the
item, taken from the lexical form field of the item's lexical entry.
- The value of the gloss feature is the gloss of the item,
taken from the gloss field of the item's lexical entry.
For example, here is a lexical entry for the word fox:
\lf `fox
\lx N
\alt Stem
\gl N(fox)
When this entry is used by the grammar, it is represented as this feature structure:
[cat: N
lex: `fox
gloss: N(fox)]
Feature constraints
A rule is followed by zero or more feature constraints,
which refer to symbols used in the rule.
The following specifications apply to feature
constraints.
A feature constraint has these parts, in the order listed:
- a feature path that begins with one of the symbols from the
context-free rule
- an equal sign
- either another path or a value
A feature constraint that refers only to symbols on the right hand side
of the rule constrains their co-occurrence. In the following rule and
constraint, the value of the Stem's head pos
feature must unify with the value of the SUFFIX's from_pos feature:
Word -> Stem INFL
<Stem head pos> = <INFL from_pos>
If a feature constraint refers to a symbol on the right hand side of
the rule, and has an atomic value on its right hand side, then the
designated feature must not have a different value. In the following
rule and constraint, the head case
feature for the PRONOUN node of
the parse tree must either be originally undefined or equal to NOM:
Word -> PRONOUN
<PRONOUN head case> = NOM
(If the head case feature of the PRONOUN node was originally undefined, then, after unification succeeds, it will be equal to NOM.)
A feature constraint that refers to the symbol on the left hand side of
the rule passes information up the parse tree. In the following rule
and constraint, the value of the head
feature is passed from
the INFL node up to the Word node:
Word -> Stem INFL
<Word head> = <INFL head>
PC-KIMMO allows disjunctive feature constraints with its phrase
structure rules. Consider these two rules:
Stem_1 -> PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<PREFIX change_pos> = +
<Stem_1 head> = <PREFIX head>
Stem_1 -> PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<PREFIX change_pos> = -
<Stem_1 head> = <Stem_2 head>
These rules have the same context-free rule part. They can therefore be collapsed into this single rule, which has a disjunction in its feature constraints:
Stem_1 -> PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
{
<PREFIX change_pos> = +
<Stem_1 head> = <PREFIX head>
/
<PREFIX change_pos> = -
<Stem_1 head> = <Stem_2 head>
}
Disjunctive feature constraints may be nested up to eight levels deep.
Feature templates
The following specifications apply to feature templates.
A feature template has these parts, in the order listed:
- the keyword Let
- the template name
- the keyword be
- a feature definition
- an optional period (.)
If the template name is a terminal category (a terminal symbol in one
of the context-free rules), the template defines the default
features for that category. Otherwise the template name serves as an
abbreviation for the associated feature structure.
Templates may occur anywhere in the
file (interspersed among the rules), but a template must occur before
any rule or other template that uses the abbreviation it defines.
Template names are single words
consisting of alphanumeric characters or other characters except (){}[]<>=:$! (these are used for special purposes in the grammar file). The
character \ should not be used as the first character of a
template name because that is how fields are marked in the lexicon
file. Upper and
lower case letters used in template names are considered
different. For example, "PLURAL" is not the same as "Plural" or
"plural."
The abbreviations defined by templates are usually used in the feature
field of entries in the lexicon file. For example, the lexical entry
for the irregular plural form feet may have the abbreviation pl in its
features field. The grammar file would define this abbreviation with a
template like this:
Let pl be [number: PL]
The path notation may also be used:
Let pl be <number> = PL
More complicated feature structures may be defined in templates. For
example,
Let 3sg be [tense: PRES
agr: 3SG
finite: +
vform: S]
which is equivalent to:
Let 3sg be [<tense> = PRES
<agr> = 3SG
<finite> = +
<vform> = S]
In the following example, the abbreviation irreg is defined using
another abbreviation:
Let irreg be <reg> = -
             pl
The abbreviation pl must be defined previously in the grammar file or an
error will result. A subsequent template could also use the abbreviation
irreg in its definition. In this way, an inheritance hierarchy of
features may be constructed.
Feature templates permit disjunctive definitions. For example, the
lexical entry for the word deer may specify the feature abbreviation
sg/pl. The grammar file would define this as a disjunction of feature
structures reflecting the fact that the word can be either singular or
plural:
Let sg/pl be {[number:SG]
[number:PL]}
This has the effect of creating two entries for deer, one with
singular number and another with plural. Note that there is no limit
to the number of disjunct structures listed between the braces. Also,
there is no slash (/) between the elements of the disjunction as
there is between the elements of a disjunction in the rules.
A shorter version of the above template using the path notation looks
like this:
Let sg/pl be <number> = {SG PL}
Abbreviations can also be used in disjunctions, provided that they
have previously been defined:
Let sg be <number> = SG
Let pl be <number> = PL
Let sg/pl be {[sg] [pl]}
Note the square brackets around the abbreviations sg and pl; without
square brackets they would be interpreted as simple values instead.
Feature templates can assign default atomic feature values, indicated
by prefixing an exclamation point (!). A default value can be
overridden by an explicit feature assignment. This template says that
all members of category N have singular number as a default value:
Let N be <number> = !SG
The effect of this template is to make all nouns singular unless they
are explicitly marked as plural. For example, regular nouns such as
book do not need any feature in their lexical entries to signal that
they are singular; but an irregular noun such as feet would have a
feature abbreviation such as pl in its lexical entry. This would be
defined in the grammar as [number: PL], and would override the default
value for the feature number specified by the template above. If the N
template above used SG instead of !SG, then the word feet would fail to
parse, since its number feature would have an internal conflict between
SG and PL.
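Continuing the example above, a hypothetical sketch of this interaction
might pair the templates with lexicon entries like these (the entries
are invented; the field codes are those declared in figure 4.4):
;in the grammar file
Let N be <number> = !SG
Let pl be <number> = PL
;in a lexicon file
\lf `book
\lx N
\alt Stem
\gl N(`book)
\lf `feet
\lx N
\alt Stem
\fea pl
\gl N(`foot)+PL
Here book receives the default value SG for number, while the pl
abbreviation in the entry for feet overrides the default with PL.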
Parameters
Parameter settings are used to override various default settings assumed
in the grammar file. Parameter settings are optional. In the absence of
a parameter setting, a default value is used. A parameter setting has
these parts, in the order listed:
- the keyword Parameter
- an optional colon (:)
- one or more keywords identifying the parameter
- the keyword is
- the parameter value
- an optional period (.)
PC-KIMMO recognizes the following parameters:
- Start symbol defines the start symbol of the grammar. For example,
Parameter Start symbol is Word
declares that the parse goal of the grammar is the nonterminal category
Word. The default start symbol is the left hand symbol of the first
context-free rule in the grammar file.
- Attribute order specifies the order in which feature attributes are
displayed. For example,
Parameter Attribute order is cat head root root_pos
declares that the cat attribute should be the first one shown in any
output from PC-KIMMO and that the other attributes should be shown in
the relative order listed, with the root_pos attribute shown last among
those listed, but ahead of any attributes that are not listed.
Attributes that are not listed are ordered according to their character
code sort order. If the attribute order is not specified, then the
category feature cat is shown first, with all other attributes sorted
according to their character codes.
- Category feature defines the label for the category attribute. For
example,
Parameter Category feature is Categ
declares that Categ is the name of the category attribute. The default
name for this attribute is cat.
- Lexical feature defines the label for the lexical attribute. For
example,
Parameter Lexical feature is Lex
declares that Lex is the name of the lexical attribute. The default
name for this attribute is lex.
- Gloss feature defines the label for the gloss attribute. For example,
Parameter Gloss feature is Gloss
declares that Gloss is the name of the gloss attribute. The default
name for this attribute is gloss.
Lexical rules
Lexical rules are used to modify the feature structures of lexical
entries.
As noted in Shieber 1985, something more powerful than just
abbreviations for common feature elements is sometimes needed to
represent systematic relationships among the elements of a lexicon.
This need is met by lexical rules, which express transformations rather
than mere abbreviations.
Lexical rules are similar to feature templates, but are more powerful. While feature templates assign a feature structure to lexical items by means of unification, lexical rules map one feature structure to another, thus transforming it. The name of a lexical rule is included in the features field of lexical entries, similar to feature abbreviations.
A lexical rule has these parts, in the order listed:
- the keyword Define
- the name of the lexical rule
- the keyword as
- the rule definition
- an optional period (.)
The rule definition consists of one or more mappings. Each mapping has
three parts: an output feature path, an assignment operator, and the
value assigned, either an input feature path or an atomic value. Every
output path begins with the feature name out and every input
path begins with the feature name in. The assignment operator
is either an equal sign (=) or an equal sign followed by a
"greater than" sign (=>). (These two operators are
equivalent in PC-KIMMO, since the implementation treats each
lexical rule as an ordered list of assignments rather than using
unification for the mappings that have an equal sign operator.)
Consider the information shown in figure 4.8A.
Figure 4.8A A lexical rule example
;lexical item
\lf `mouse
\fea irreg POS_Gloss
\gl `mouse
;feature template
LET irreg be <reg> = -
;lexical rule
DEFINE POS_Gloss as
<out cat> = <in cat>
<out head> = <in head>
<out lex> = <in lex>
<out gloss> = <in head pos> .
The feature field (\fea ) of the lexical entry contains two labels: irreg is a feature abbreviation and is defined by a feature template (the LET statement), while POS_Gloss is the name of a lexical rule which is defined by the DEFINE statement.
Figure 4.8B Feature structure before application of lexical rule
[ cat: ROOT
head: [ agr: [ 3sg:- ]
number:PL
pos: N
proper:-
verbal:- ]
reg: -
lex: `mice
gloss: `mouse ]
Figure 4.8C Feature structure after application of lexical rule
[ cat: ROOT
head: [ agr: [ 3sg:- ]
number:PL
pos: N
proper:-
verbal:- ]
lex: `mice
gloss: N ]
When the lexicon entry is loaded, it is initially assigned the feature
structure shown in figure 4.8B, which is the
unification of the information given in the various fields of the
lexicon entry, including the feature abbreviation pl. After the
complete feature structure has been built, the lexical rule named POS_Gloss is
applied, producing the feature structure shown in figure 4.8C.
Note that the change in the value of the gloss feature from "`mouse" to "N" is done by direct mapping, not unification.
There are two important points about using lexical rules. First, the feature structure of a lexical item that has undergone a lexical rule is entirely determined by the mappings in the lexical rule. In the lexical rule in figure 4.8A, the first three mappings (for cat, head, and lex), though they seem redundant, are needed to carry over these feature values from the input feature structure to the output feature structure. Notice that the feature reg which is present in the input feature structure in figure 4.8B is absent from the output feature structure in figure 4.8C; this is due to the fact that the lexical rule which applied to the feature structure did not include a mapping for the reg feature.
Second, lexical rules apply sequentially in the order in which they are given in the grammar file.
Figure 4.9 shows a sample grammar file.
Figure 4.9 A sample grammar file
;FEATURE TEMPLATES (optional)
;Feature definitions
Let pl be <head number> = PL
LET v/n be <from_pos> = V
<head pos> = N
<head number> = !SG
LET v\aj be <from_pos> = AJ
<head pos> = V
;Category definitions
Let N be <cat> = ROOT
<head pos> = N
<head number> = !SG
Let V be <cat> = ROOT
<head pos> = V
Let AJ be <cat> = ROOT
<head pos> = AJ
;PARAMETER SETTINGS (optional)
PARAMETER Start symbol is Word
;RULES
RULE
Word = Stem INFL
<Stem head pos> = <INFL from_pos>
<Word head> = <INFL head>
RULE
Stem_1 = PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<Stem_1 head> = <PREFIX head>
RULE
Stem_1 = Stem_2 SUFFIX
<Stem_2 head pos> = <SUFFIX from_pos>
<Stem_1 head> = <SUFFIX head>
RULE
Stem = ROOT
<Stem head> = <ROOT head>
The generation comparison file serves as input to the compare
generate command (see section
4.5.12). It consists of groupings
of a lexical form followed by one or more surface forms that are
expected to be generated from the lexical form. The following
specifications apply to the generation comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping.
Extra blank lines are ignored.
- The first form in each grouping is the lexical form to be input to
the generator. Its gloss does not have to be included, since the
generator does not use the lexicon; however, including a gloss with the
lexical form does no harm--it is simply ignored.
- Succeeding forms in each grouping are surface forms that are the
expected output of the generator.
Figure 4.10 shows a sample generation comparison file.
Figure 4.10 A sample generation comparison file
`trace+ed
traced
`trace+able
traceable
re-+`trace
re-trace
retrace
The recognition comparison file serves as input to the compare
recognize command (see section
4.5.12). It consists of groupings
of a surface form followed by one or more lexical forms that are
expected to be recognized from the surface form. The following
specifications apply to the recognition comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping.
Extra blank lines are ignored.
- The first form in each grouping is the surface form to be input to
the recognizer.
- Succeeding forms in each grouping are lexical forms that are the
expected output of the recognizer. The gloss of a form follows it on
the same line, separated by one or more spaces. The gloss must match
exactly (including spaces) the way it is output from the recognizer.
Figure 4.11
shows a sample recognition comparison file.
Figure 4.11 A sample recognition comparison file
traced
`trace+ed [ V(trace)+PAST ]
`trace+ed [ V(trace)+PAST.PRTC ]
traceable
`trace+able [ V(trace)+ADJR ]
retrace
re-+`trace [ REP+V(trace).INF ]
The pairs comparison file serves as input to the compare pairs
command (see section 4.5.12).
It consists of pairs of lexical and surface forms; that is, a lexical
form followed by exactly one surface form. It is expected that the
surface form will be recognized from the lexical form and that the
lexical form will be generated from the surface form. Glosses do not
have to be included with lexical forms, since the generator does not
use the lexicon; however, including a gloss with the lexical form does
no harm--it is simply ignored. When recognizing a surface form, the
lexicon is used to identify the constituent morphemes and verify that
they occur in the correct order, but the gloss part of a lexical entry
is not used. The following specifications apply to the pairs comparison
file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping. Extra
blank lines are ignored.
- The first form of a pair is the lexical form, which is input to the
generator. It is the expected output on inputting the second (surface) form to
the recognizer. The gloss is not included with the lexical form.
- The second form of a pair is the surface form, which is input to the
recognizer. It is the expected output on inputting the first (lexical) form to
the generator.
Figure 4.12 shows a sample pairs comparison file.
Figure 4.12 A sample pairs comparison file
`trace+ed
traced
`trace+able
traceable
re-+`trace
re-trace
re-+`trace
retrace
The synthesis comparison file serves as input to the compare
synthesize command (see section
4.5.12). It consists of groupings
of a morphological form followed by one or more surface forms that are
expected to be synthesized from the morphological form. The following
specifications apply to the synthesis comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping.
Extra blank lines are ignored.
- The first form in each grouping is the morphological form to be input to
the synthesizer. A morphological form is a sequence of morpheme glosses separated by spaces.
- Succeeding forms in each grouping are surface forms that are the
expected output of the synthesizer.
Figure 4.12A shows a sample synthesis comparison file.
Figure 4.12A A sample synthesis comparison file
`trace +ED
traced
`trace +EN
traced
`trace +AJR25a
traceable
ORD5+ `trace
retrace
The generation file consists of a list of lexical forms. It serves as
input to the file generate command (see section 4.5.13), which returns a file (or
screen display) whose format is identical to the generation comparison
file. The following specifications apply to the generation file.
- Each form must be on a separate line.
- Extra white space, blank lines, and comment lines are ignored.
- Each form is assumed to be a lexical form. If a gloss is included, it is
ignored.
Figure 4.13 shows a sample generation file.
Figure 4.13 A sample generation file
`cat
`cat+s
`cat+'s
`cat+s+'s
`fox
`fox+s
`fox+'s
`fox+s+'s
The recognition file consists of a list of surface forms. It serves as
input to the file recognize command (see section 4.5.14), which returns a file
(or screen display) whose format is identical to the recognition
comparison file. The following specifications apply to the recognition
file.
- Each form must be on a separate line.
- Extra spaces, blank lines, and comment lines are ignored.
- Each form is assumed to be a surface form.
Figure 4.14 shows a sample recognition file.
Figure 4.14 A sample recognition file
cat
cats
cat's
cats'
fox
foxes
fox's
foxes'
The synthesis file consists of a list of morphological forms. A morphological form is a sequence of morpheme glosses separated by spaces. A synthesis file serves as
input to the file synthesize command (see section
4.5.13), which returns a file (or
screen display) whose format is identical to the synthesis comparison
file. The following specifications apply to the synthesis file.
- Each form must be on a separate line.
- Extra white space, blank lines, and comment lines are ignored.
- Each form is assumed to be a morphological form.
Figure 4.14A shows a sample synthesis file.
Figure 4.14A A sample synthesis file
`cat
`cat +PL
`cat +GEN
`cat +PL +GEN
`fox
`fox +PL
`fox +GEN
`fox +PL +GEN
Figure 4.15 summarizes the default file names and
extensions assumed by PC-KIMMO. Two entries are given for the different
kinds of files. The first is the name PC-KIMMO will assume if no file
name at all is given to a command that expects that kind of file. The
second entry (with the *) shows what extension PC-KIMMO will add if a
file name without an extension is given.
Figure 4.15 Default file names and extensions
Rules file: RULES.RUL
*.RUL
Lexicon file: LEXICON.LEX
*.LEX
Grammar file: GRAMMAR.GRM
*.GRM
Generation comparison file: DATA.GEN
*.GEN
Recognition comparison file: DATA.REC
*.REC
Pairs comparison file: DATA.PAI
*.PAI
Synthesis comparison file: DATA.SYN
*.SYN
Take file: PCKIMMO.TAK
*.TAK
Log file: PCKIMMO.LOG
*.LOG
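As an illustration of how these defaults are applied (a hypothetical
session; see the load and take commands in sections 4.5 and 4.6):
load rules               ;opens RULES.RUL
load rules english       ;no extension given, so ENGLISH.RUL is opened
load lexicon english.lex ;opened exactly as named
take mytest              ;opens MYTEST.TAK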