A4 Writing the rules file

This section contains instructions on how to write the rules file for the PC-KIMMO program (a more detailed specification of the rules file is found in section 4.7.1). We will develop a sample rules file for a set of hypothetical data.

The general structure of the rules file is a list of declarations, each composed of a keyword followed by data. The valid keywords in a rules file are COMMENT, ALPHABET, NULL, ANY, BOUNDARY, SUBSET, RULE, and END. The COMMENT, SUBSET, and RULE declarations are optional and can be used more than once in a rules file. The END declaration is also optional, but can be used only once.

The COMMENT declaration (new in PC-KIMMO version 2) sets the comment character used in the rules file, lexicon files, and grammar file. It can be used only in the rules file, not in the lexicon or grammar file. It is optional; if it is omitted, the comment character defaults to ; (semicolon).

The ALPHABET declaration must either occur first in the file or follow one or more COMMENT declarations only. The other declarations can appear in any order. The COMMENT, NULL, ANY, BOUNDARY, and SUBSET declarations can even be interspersed among the rules. However, these declarations must appear before any rule that uses them or an error will result.
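
The constraint on the position of ALPHABET can be checked mechanically. The following Python sketch is illustrative only and is not part of PC-KIMMO. The function name alphabet_position_ok is hypothetical, and the sketch assumes that the keywords have already been extracted from the file in order of appearance.

```python
def alphabet_position_ok(keywords):
    """Return True if ALPHABET is preceded only by COMMENT declarations."""
    for keyword in keywords:
        if keyword == "ALPHABET":
            return True
        if keyword != "COMMENT":
            # Some other declaration appears before ALPHABET.
            return False
    # No ALPHABET declaration was found at all.
    return False


good = ["COMMENT", "ALPHABET", "NULL", "ANY", "BOUNDARY", "SUBSET", "RULE", "END"]
bad = ["NULL", "ALPHABET", "ANY", "BOUNDARY", "RULE"]
print(alphabet_position_ok(good), alphabet_position_ok(bad))
```

The sketch does not check the second constraint, that NULL, ANY, BOUNDARY, and SUBSET declarations precede any rule that uses them.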

To begin creating a rules file, use your text editor or word processing program to create a file with the extension .RUL (for example, SAMPLE.RUL). When you save the file to disk, be sure to save it as plain text (ASCII). We also recommend that you use an editor that handles column blocks; this makes manipulating state tables much easier. Type the basic skeleton of a PC-KIMMO rules file as shown in figure A8 (a template of a rules file is also available in the file RULES.RUL on the PC-KIMMO release diskette):

Figure A8 The skeleton of a PC-KIMMO rules file

    COMMENT
    ALPHABET
    NULL
    ANY
    BOUNDARY
    SUBSET
    RULE
    END

Comments, which PC-KIMMO ignores, can be added to the rules file. The default comment character is the semicolon (;), but it can be changed with the COMMENT declaration. Anything on a line following the comment character is ignored, as are extra spaces and blank lines.
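
As a rough illustration of how comments and blank lines drop out of a rules file, here is a minimal Python sketch. It is not PC-KIMMO's own reader. The function names are hypothetical, and the sketch makes two simplifying assumptions: the comment character is passed in by the caller rather than read from a COMMENT declaration, and the comment character never appears inside a rule name.

```python
def strip_comment(line, comment_char=";"):
    """Remove a trailing comment and surrounding whitespace from one line."""
    return line.split(comment_char, 1)[0].strip()


def significant_lines(text, comment_char=";"):
    """Return the lines that remain after comments and blank lines are dropped."""
    result = []
    for raw in text.splitlines():
        line = strip_comment(raw, comment_char)
        if line:
            result.append(line)
    return result


sample = """
; consonants only
ALPHABET
  p t k   ; voiceless stops

NULL 0
"""
lines = significant_lines(sample)
print(lines)
```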

A4.1 The ALPHABET

The rules file must first declare the alphabet. This is the entire set of symbols (characters), both lexical and surface, used by the rules and lexicon. The ALPHABET declaration must either occur first in the file or follow one or more COMMENT declarations only. It is followed by any number of lines of symbols, each separated by at least one space. For example,

    ALPHABET
      p t k b d g m n ng ç j s S z Z h l r w y
      i e a o u ï ë ä ö ü
      + '

The alphabet can consist of any alphanumeric characters, including those available in the extended character set on IBM PC-compatible computers. Uppercase and lowercase are considered distinct characters. Nonalphabetic characters such as $, &, !, ', #, and + may also be used. In the above alphabet, + indicates a morpheme boundary and ' indicates stress. In this section, the examples printed in typewriter style use only those characters available on IBM PC-compatible computers.

An alphabetic symbol can also be a multigraph, that is, a sequence of two or more characters. The individual characters composing a multigraph do not necessarily have to also be declared as alphabetic characters. For example, an alphabet could include the characters s and z and the multigraph sz%, but not include % as an alphabetic character. Note that a multigraph cannot also be interpreted as a sequence of the individual characters that comprise it. For example, if you declare t, h, and th as alphabetic symbols, then the th in a word such as rathole will match only the digraph th, not the sequence t plus h.
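
The behavior of multigraphs can be pictured as a tokenizer that always prefers the longest alphabet symbol. The Python sketch below is an illustration under that assumption (greedy, left-to-right, longest match); the guide states only that a multigraph is not reinterpreted as its component characters, so the exact strategy PC-KIMMO uses internally may differ. The function name tokenize is hypothetical.

```python
def tokenize(word, alphabet):
    """Split a word into alphabet symbols, preferring longer symbols."""
    # Try multigraphs before the single characters they contain.
    symbols = sorted(alphabet, key=len, reverse=True)
    result = []
    i = 0
    while i < len(word):
        for sym in symbols:
            if word.startswith(sym, i):
                result.append(sym)
                i += len(sym)
                break
        else:
            raise ValueError(f"no alphabet symbol matches at position {i} of {word!r}")
    return result


alphabet = {"r", "a", "t", "h", "th", "o", "l", "e"}
tokens = tokenize("rathole", alphabet)
print(tokens)
```

With this alphabet, the th in rathole comes out as the single symbol th, never as t followed by h.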

A4.2 NULL, ANY, and BOUNDARY symbols

Next, the NULL (empty or zero) symbol is declared. Any character not already in the alphabet can be chosen, but for obvious reasons 0 (zero) is typically used. The NULL symbol is used for deletions, for instance h:0, and insertions, for instance 0:h. The NULL symbol is declared by including this line:

    NULL 0
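
To make the role of the NULL symbol concrete, the following Python sketch treats a word as a sequence of lexical:surface pairs and recovers the two forms by dropping NULL symbols. It is an illustration only; the function name forms and the two-pair example sequences are hypothetical and do not come from the sample data.

```python
NULL = "0"


def forms(pairs):
    """Return (lexical form, surface form) for a sequence of lexical:surface pairs."""
    lexical = "".join(lex for lex, _ in pairs if lex != NULL)
    surface = "".join(srf for _, srf in pairs if srf != NULL)
    return lexical, surface


deletion = [("a", "a"), ("h", "0")]   # h:0, lexical h has no surface realization
insertion = [("a", "a"), ("0", "h")]  # 0:h, surface h has no lexical source
print(forms(deletion), forms(insertion))
```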

Next, the ANY ("wildcard") symbol is declared. Again, any character not already in the alphabet can be chosen; in this book we use @ ("at" sign). The ANY symbol is declared by including this line:

    ANY @

Next, the BOUNDARY (word boundary) symbol is declared. Again, any character not already in the alphabet can be chosen; in this book we use # (crosshatch or pound sign). The BOUNDARY symbol is declared by including this line:

    BOUNDARY #

A4.3 Subsets

Next in the rules file the subsets, if any, are declared. A subset declaration is composed of the keyword SUBSET followed by a subset name followed by a list of subset characters. A subset name can be any alphanumeric string (one or more characters, no spaces) so long as it is unique; that is, it cannot be identical to a symbol already declared in the alphabet. Uppercase characters are useful for subset names because they are usually distinct from their lowercase equivalents. All characters defined as belonging to a subset must also be in the complete alphabet. Subsets are declared by including lines such as these:

   SUBSET C       p t k b d g m n ng ç j s S z Z h l r w y
   SUBSET V       i e a o u
   SUBSET Vlng    ï ë ä ö ü
   SUBSET Cvd     b d g m n ng z l r w y
   SUBSET Oalv    t d s z
   SUBSET Opal    ç j S Z
   SUBSET Ovd     b d g z
   SUBSET Ovl     p t k s
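
The two constraints on subsets (the name must not clash with an alphabet symbol, and every member must be in the alphabet) can be expressed as a small check. This Python sketch is illustrative and not part of PC-KIMMO; the function name check_subsets and the deliberately faulty subsets in the example are hypothetical.

```python
def check_subsets(alphabet, subsets):
    """Return a list of problems found in the subset declarations."""
    problems = []
    for name, members in subsets.items():
        if name in alphabet:
            problems.append(f"subset name {name!r} is already an alphabet symbol")
        for sym in members:
            if sym not in alphabet:
                problems.append(f"subset {name!r}: {sym!r} is not in the alphabet")
    return problems


alphabet = set(
    "p t k b d g m n ng ç j s S z Z h l r w y "
    "i e a o u ï ë ä ö ü + '".split()
)
good = {"Ovd": ["b", "d", "g", "z"], "Ovl": ["p", "t", "k", "s"]}
faulty = {"s": ["t", "d"], "X": ["q"]}  # hypothetical errors for illustration
print(check_subsets(alphabet, good))
print(check_subsets(alphabet, faulty))
```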

A4.4 Rules

The rest of the rules file consists of the rules. A rule declaration is composed of the keyword RULE followed by the rule name, number of states, number of columns, and the state table itself. The rule name is enclosed in a pair of identical delimiter characters such as double quotes. The rule name has no effect on the operation of the table. It actually can contain any information, but by convention we use it for the name and the two-level notation of the rule. It is also useful to include a sequence number for each rule, as rules are referred to by number in some of the diagnostic displays for rule debugging. Notice that the horizontal and vertical lines printed in the tables shown in this chapter are not present in an actual rules file.

By common convention, the first rules listed are the tables of default correspondences, though these correspondences can be listed anywhere in the file. For the sake of consistency, it is best to place all the default correspondences in these tables even if they also occur in other tables. The possible redundancy has no effect on the operation of the tables. Tables of default correspondences for the alphabet given in section A4.1 look like this:

RULE "1 Consonant defaults" 1 17
           p t k b d g m n ng s z h l r w y @
           p t k b d g m n ng s z h l r w y @
        1: 1 1 1 1 1 1 1 1 1  1 1 1 1 1 1 1 1

RULE "2 Vowels and other defaults" 1 8
           i e a o u ' + @
           i e a o u ' 0 @
        1: 1 1 1 1 1 1 1 1

The two tables could be combined into one; consonants and vowels have been separated here for increased readability. Notice that the morpheme boundary symbol (+) is deleted by default; that is, it has no surface realization other than 0.
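
Because the rule header states the number of states and columns, a table can be checked against its own header. The Python sketch below shows one way to do this; it is not PC-KIMMO's parser. The function name parse_rule is hypothetical, the sketch assumes the rule name is delimited by double quotes, and it reads the state label (such as 1:) without interpreting it.

```python
import re


def parse_rule(text):
    """Parse one RULE declaration and verify it against its declared size."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = re.match(r'\s*RULE\s+"([^"]*)"\s+(\d+)\s+(\d+)\s*$', lines[0])
    if header is None:
        raise ValueError("malformed RULE header")
    name = header.group(1)
    n_states, n_cols = int(header.group(2)), int(header.group(3))
    lexical, surface = lines[1].split(), lines[2].split()
    if len(lexical) != n_cols or len(surface) != n_cols:
        raise ValueError(f"{name}: expected {n_cols} column headers")
    rows = lines[3:]
    if len(rows) != n_states:
        raise ValueError(f"{name}: expected {n_states} state rows, found {len(rows)}")
    table = []
    for row in rows:
        label, *cells = row.split()  # label is the state number plus : or .
        if len(cells) != n_cols:
            raise ValueError(f"{name}: state {label} has {len(cells)} cells")
        table.append([int(cell) for cell in cells])
    return name, list(zip(lexical, surface)), table


rule2 = """
RULE "2 Vowels and other defaults" 1 8
           i e a o u ' + @
           i e a o u ' 0 @
        1: 1 1 1 1 1 1 1 1
"""
name, columns, table = parse_rule(rule2)
print(name, columns[6], table)
```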

After the tables of default correspondences come the rules for special correspondences. The rules developed below account for examples such as these:

LR:   s'ati   s'adi   bab'at   bab'ad
SR:   s'açi   s'äji   bab'at   bab'ät

These rules account for Palatalization:

RULE "3 Palatalization correspondences" 1 5
           t d s z @
           ç j S Z @
        1: 1 1 1 1 1

RULE "4 Palatalization, Oalv:Opal <=> ___i" 3 4
           Oalv Oalv i @
           Opal  @   i @
        1:  3    2   1 1
        2:  3    2   0 1
        3.  0    0   1 0

Rule 4 is a palatalization rule that states that the alveolar consonants are realized as palatalized consonants before i. Because rule 4 uses subsets, the feasible pairs represented by the correspondence Oalv:Opal must be explicitly declared. Rule 3 contains these correspondences which are relevant only to rule 4. The special correspondences from all the rules in the description could be combined into one table (or they could even be combined with the tables of default correspondences). However, for readability and to make it easier to modify and debug the rules file, we recommend that a separate table of special correspondences be kept with each rule that uses subsets.
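
To see how the table for rule 4 accepts or rejects a correspondence, the following Python sketch steps through it by hand. It is a simplified illustration, not PC-KIMMO's recognizer, and it rests on several assumptions: state 1 is the initial state, 0 means failure, states written with a colon (1 and 2) are final while the state written with a period (3) is not, and when several columns match a pair the most specific one is chosen. Specificity is approximated here by the size of the symbol classes in the column header, which is a stand-in for PC-KIMMO's own calculation. The sketch runs rule 4 in isolation and does not check that a pair is feasible. All function names are hypothetical.

```python
ANY = "@"
SUBSETS = {"Oalv": ["t", "d", "s", "z"], "Opal": ["ç", "j", "S", "Z"]}

# Column headers (lexical, surface) and state rows of rule 4.
COLUMNS = [("Oalv", "Opal"), ("Oalv", ANY), ("i", "i"), (ANY, ANY)]
TABLE = {1: [3, 2, 1, 1], 2: [3, 2, 0, 1], 3: [0, 0, 1, 0]}
FINAL = {1, 2}


def matches(header, symbol):
    """Does a column header element cover this symbol?"""
    if header == ANY:
        return True
    if header in SUBSETS:
        return symbol in SUBSETS[header]
    return header == symbol


def class_size(header):
    """Approximate size of the class a header stands for (assumption)."""
    if header == ANY:
        return 1000  # stand-in for "everything"
    return len(SUBSETS.get(header, [header]))


def accepts(pairs):
    """Run the rule 4 table over a sequence of lexical:surface pairs."""
    state = 1
    for lex, srf in pairs:
        candidates = [
            col for col, (hl, hs) in enumerate(COLUMNS)
            if matches(hl, lex) and matches(hs, srf)
        ]
        # Prefer the column covering the fewest pairs (most specific).
        col = min(candidates, key=lambda c: class_size(COLUMNS[c][0]) * class_size(COLUMNS[c][1]))
        state = TABLE[state][col]
        if state == 0:
            return False
    return state in FINAL


palatalized = accepts(list(zip("s'ati", "s'açi")))      # t:ç before i
unpalatalized = accepts(list(zip("s'ati", "s'ati")))    # t:t before i
wrong_context = accepts(list(zip("bab'at", "bab'aç")))  # t:ç with no following i
print(palatalized, unpalatalized, wrong_context)
```

Only the first correspondence is accepted: the second fails in state 2 when i follows an unpalatalized alveolar, and the third ends in the nonfinal state 3.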

Rules 5 and 6 state that vowels are lengthened when they are stressed (that is, follow ') and precede a lexical voiced consonant (that is, a member of the subset Cvd).

RULE "5 Lengthening correspondences" 1 6
           a e i o u @
           ä ë ï ö ü @
        1: 1 1 1 1 1 1

RULE "6 Vowel Lengthening, V:Vlng <=> '___Cvd:" 4 5
           ' V    V Cvd @
           ' Vlng @  @  @
        1: 2 0    1  1  1
        2: 2 4    3  1  1
        3: 2 1    1  0  1
        4. 0 0    0  1  0

The environment of rule 6 contains the correspondence Cvd:@ rather than Cvd:Cvd because of rules 7 and 8, which devoice obstruents word finally.

RULE "7 Devoicing correspondences" 1 5
           b d g z @
           p t k s @
        1: 1 1 1 1 1

RULE "8 Final Devoicing, Ovd:Ovl <=> ___#" 3 4
           Ovd Ovd # @
           Ovl  @  # @
        1:  3   2  1 1
        2:  3   2  0 1
        3.  0   0  1 0

The rules file optionally ends with a line containing only the word END. Any material in the file after this line is ignored by PC-KIMMO.

END
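
The effect of END can be sketched as a simple cutoff: everything from the END line onward is discarded. The Python sketch below is illustrative only; the function name truncate_at_end is hypothetical, and it assumes comments have already been removed so that END stands alone on its line.

```python
def truncate_at_end(lines):
    """Keep only the lines that precede the first END line."""
    kept = []
    for line in lines:
        if line.strip() == "END":
            break
        kept.append(line)
    return kept


lines = ["NULL 0", "ANY @", "END", "RULE placed after END"]
kept = truncate_at_end(lines)
print(kept)
```

One practical consequence is that an END line placed before the rules would cause every rule after it to be ignored.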

A4.5 Example of a rules file

As a ready model for the format of a rules file, the example developed above is repeated in its entirety in figure A9. This file is found on the PC-KIMMO release diskette in the SAMPLE subdirectory.

Figure A9 Sample rules file

ALPHABET
  p t k b d g m n ng ç j s S z Z h l r w y
  i e a o u ï ë ä ö ü
  + '
NULL 0
ANY @
BOUNDARY #
SUBSET C       p t k b d g m n ng ç j s S z Z h l r w y
SUBSET V       i e a o u
SUBSET Vlng    ï ë ä ö ü
SUBSET Cvd     b d g m n ng z l r w y
SUBSET Oalv    t d s z
SUBSET Opal    ç j S Z
SUBSET Ovd     b d g z
SUBSET Ovl     p t k s

RULE "1 Consonant defaults" 1 17
           p t k b d g m n ng s z h l r w y @
           p t k b d g m n ng s z h l r w y @
        1: 1 1 1 1 1 1 1 1 1  1 1 1 1 1 1 1 1

RULE "2 Vowels and other defaults" 1 8
           i e a o u ' + @
           i e a o u ' 0 @
        1: 1 1 1 1 1 1 1 1

RULE "3 Palatalization correspondences" 1 5
           t d s z @
           ç j S Z @
        1: 1 1 1 1 1

RULE "4 Palatalization, Oalv:Opal <=> ___i" 3 4
           Oalv Oalv i @
           Opal  @   i @
        1:  3    2   1 1
        2:  3    2   0 1
        3.  0    0   1 0

RULE "5 Lengthening correspondences" 1 6
           a e i o u @
           ä ë ï ö ü @
        1: 1 1 1 1 1 1

RULE "6 Vowel Lengthening, V:Vlng <=> '___Cvd:" 4 5
           ' V    V Cvd @
           ' Vlng @  @  @
        1: 2 0    1  1  1
        2: 2 4    3  1  1
        3: 2 1    1  0  1
        4. 0 0    0  1  0

RULE "7 Devoicing correspondences" 1 5
           b d g z @
           p t k s @
        1: 1 1 1 1 1

RULE "8 Final Devoicing, Ovd:Ovl <=> ___#" 3 4
           Ovd Ovd # @
           Ovl  @  # @
        1:  3   2  1 1
        2:  3   2  0 1
        3.  0   0  1 0

END
