This section contains instructions on how to write the rules file for the PC-KIMMO program (a more detailed specification of the rules file is found in section 4.7.1). We will develop a sample rules file for a set of hypothetical data.
The general structure of the rules file is a list of declarations composed of a keyword followed by data. The set of valid keywords in a rules file includes COMMENT, ALPHABET, NULL, ANY, BOUNDARY, SUBSET, RULE, and END. The COMMENT, SUBSET and RULE declarations are optional and also can be used more than once in a rules file. The END declaration is also optional, but can only be used once.
The COMMENT declaration (new in PC-KIMMO version 2) sets the comment character used in the rules file, lexicon files, and grammar file. It can be used only in the rules file, not in the lexicon or grammar files. If it is omitted, the comment character defaults to ; (semicolon).
The ALPHABET declaration must either occur first in the file or follow one or more COMMENT declarations only. The other declarations can appear in any order. The COMMENT, NULL, ANY, BOUNDARY, and SUBSET declarations can even be interspersed among the rules. However, these declarations must appear before any rule that uses them or an error will result.
To begin creating a rules file, use your text editor or word processing program to create a file with the extension .RUL (for example, SAMPLE.RUL). When you save the file to disk, be sure to save it as plain text (ASCII). We also recommend that you use an editor that handles column blocks; this makes manipulating state tables much easier. Type the basic skeleton of a PC-KIMMO rules file as shown in figure A8 (a template of a rules file is also available in the file RULES.RUL on the PC-KIMMO release diskette):
Figure A8 The skeleton of a PC-KIMMO rules file
COMMENT
ALPHABET
NULL
ANY
BOUNDARY
SUBSET
RULE
END

Comments that PC-KIMMO ignores can be added to the rules file. The default comment delimiter character is the semicolon (;), but it can be changed by using the COMMENT declaration. Anything on a line following the comment character is considered a comment and is ignored. Extra spaces and blank lines are also ignored.
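The comment handling described here can be sketched in a few lines of Python. This is only an illustration, not PC-KIMMO's own reader: the function name `strip_comments` is hypothetical, the sketch assumes a COMMENT declaration takes effect on the lines that follow it, and it does not protect a comment character that appears inside a quoted rule name.

```python
def strip_comments(lines, comment_char=";"):
    """Yield the meaningful part of each line, dropping comments and blanks."""
    for line in lines:
        words = line.split()
        # Assumption: "COMMENT x" switches the comment character to x
        # for all following lines.
        if len(words) >= 2 and words[0] == "COMMENT":
            comment_char = words[1][0]
            continue
        # Keep only the text before the comment character.
        content = line.split(comment_char, 1)[0]
        # Collapse extra spaces (simplification: this also collapses
        # repeated spaces inside a quoted rule name).
        content = " ".join(content.split())
        if content:  # skip blank and comment-only lines
            yield content


sample = [
    "; a sample rules file",
    "",
    "ALPHABET  a b c   ; three symbols",
    "COMMENT !",
    "NULL 0 ! the null symbol",
]
for cleaned in strip_comments(sample):
    print(cleaned)
```

After the COMMENT line in the sample, a semicolon is no longer treated as a comment delimiter.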
The alphabet is declared by including a line such as this:

ALPHABET p t k b d g m n ng ç j s S z Z h l r w y i e a o u ï ë ä ö ü + '

The alphabet can consist of any alphanumeric characters, including those available in the extended character set on IBM PC-compatible computers. Uppercase and lowercase are considered distinct characters. Nonalphabetic characters such as $, &, !, ', #, and + may also be used. In the above alphabet, + indicates a morpheme boundary and ' indicates stress. In this section, the examples printed in typewriter style use only those characters available on IBM PC-compatible computers.
An alphabetic symbol can also be a multigraph, that is, a sequence of two or more characters. The individual characters composing a multigraph do not necessarily have to also be declared as alphabetic characters. For example, an alphabet could include the characters s and z and the multigraph sz%, but not include % as an alphabetic character. Note that a multigraph cannot also be interpreted as a sequence of the individual characters that comprise it. For example, if you declare t, h, and th as alphabetic symbols, then the th in a word such as rathole will match only the digraph th, not the sequence t plus h.
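The rathole example can be reproduced with a small segmenter. The sketch below assumes that a word is split into alphabetic symbols greedily, always preferring the longest symbol that matches at the current position; this is consistent with the behavior described above but is not taken from the PC-KIMMO source. The function name `segment` is hypothetical.

```python
def segment(word, alphabet):
    """Split word into alphabet symbols, preferring the longest match."""
    longest = max(len(symbol) for symbol in alphabet)
    symbols = []
    pos = 0
    while pos < len(word):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(longest, len(word) - pos), 0, -1):
            candidate = word[pos:pos + size]
            if candidate in alphabet:
                symbols.append(candidate)
                pos += size
                break
        else:
            # No symbol of any length matched at this position.
            raise ValueError(f"no alphabet symbol matches {word[pos:]!r}")
    return symbols


alphabet = {"r", "a", "t", "h", "o", "l", "e", "th"}
print(segment("rathole", alphabet))  # ['r', 'a', 'th', 'o', 'l', 'e']
```

With the alphabet s, z, and sz%, the same function accepts sz% as a single symbol but rejects a stray %, since % was never declared on its own.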
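The NULL symbol can be any character not already in the alphabet; the sample file uses 0 (zero). It is declared by including this line:

NULL 0

Next, the ANY ("wildcard") symbol is declared. Again, any character not already in the alphabet can be chosen; in this book we use @ ("at" sign). The ANY symbol is declared by including this line: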
ANY @

Next, the BOUNDARY (word boundary) symbol is declared. Again, any character not already in the alphabet can be chosen; in this book we use # (crosshatch or pound sign). The BOUNDARY symbol is declared by including this line:
BOUNDARY #
SUBSET C    p t k b d g m n ng ç j s S z Z h l r w y
SUBSET V    i e a o u
SUBSET Vlng ï ë ä ö ü
SUBSET Cvd  b d g m n ng z l r w y
SUBSET Oalv t d s z
SUBSET Opal ç j S Z
SUBSET Ovd  b d g z
SUBSET Ovl  p t k s
By common convention, the first rules listed are the tables of default correspondences, though these correspondences can be listed anywhere in the file. For the sake of consistency, it is best to place all the default correspondences in these tables even if they also occur in other tables. The possible redundancy has no effect on the operation of the tables. Tables of default correspondences for the alphabet given in section A4.1 look like this:
RULE "1 Consonant defaults" 1 17
   p t k b d g m n ng s z h l r w y @
   p t k b d g m n ng s z h l r w y @
1: 1 1 1 1 1 1 1 1 1  1 1 1 1 1 1 1 1

RULE "2 Vowels and other defaults" 1 8
   i e a o u ' + @
   i e a o u ' 0 @
1: 1 1 1 1 1 1 1 1

The two tables could be combined into one; consonants and vowels have been separated here for increased readability. Notice that the morpheme boundary symbol (+) is deleted by default; that is, it has no surface realization other than 0.
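Because a table of default correspondences has a single state that loops back to itself on every column, it is regular enough to generate mechanically. The following Python sketch prints such a table from two lists of symbols. The helper name `default_table` is hypothetical, and the reading of the two numbers after the rule name as the number of states and the number of columns is inferred from the examples above (1 17 and 1 8); consult the rules file specification before relying on it.

```python
def default_table(name, lexical, surface):
    """Return the text of a one-state table pairing lexical[i] with surface[i]."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface rows must have the same length")
    # Pad every column to its widest entry so multigraphs such as ng line up.
    widths = [max(len(lex), len(srf)) for lex, srf in zip(lexical, surface)]

    def row(cells):
        return " ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()

    indent = " " * 3  # lines the symbol rows up with the "1: " state prefix
    return "\n".join([
        f'RULE "{name}" 1 {len(lexical)}',     # 1 state, one column per pair
        indent + row(lexical),                 # lexical characters
        indent + row(surface),                 # surface characters
        "1: " + row(["1"] * len(lexical)),     # every pair stays in state 1
    ])


print(default_table("2 Vowels and other defaults",
                    "i e a o u ' + @".split(),
                    "i e a o u ' 0 @".split()))
```

The caller supplies the @ column explicitly, as the sample tables do. Generating the table from lists also keeps the column count in the RULE line from drifting out of step with the rows when symbols are added.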
After the tables of default correspondences come the rules for special correspondences. The rules developed below account for examples such as these:
LR: s'ati  s'adi  bab'at  bab'ad
SR: s'açi  s'äji  bab'at  bab'ät

These rules account for Palatalization:
RULE "3 Palatalization correspondences" 1 5
   t d s z @
   ç j S Z @
1: 1 1 1 1 1

RULE "4 Palatalization, Oalv:Opal <=> ___i" 3 4
   Oalv Oalv i @
   Opal @    i @
1: 3    2    1 1
2: 3    2    0 1
3. 0    0    1 0

Rule 4 is a palatalization rule that states that the alveolar consonants are realized as palatalized consonants before i. Because rule 4 uses subsets, the feasible pairs represented by the correspondence Oalv:Opal must be explicitly declared. Rule 3 contains these correspondences, which are relevant only to rule 4. The special correspondences from all the rules in the description could be combined into one table (or they could even be combined with the tables of default correspondences). However, for readability and to make it easier to modify and debug the rules file, we recommend that a separate table of special correspondences be kept with each rule that uses subsets.
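To see how the state table for rule 4 accepts or rejects a lexical:surface correspondence, it can be stepped through by hand or with a short simulation. The Python sketch below is an illustration under stated assumptions, not PC-KIMMO's implementation: state 1 is taken as the start state, states marked with a colon are final and the state marked with a period is nonfinal, 0 means failure, and an input pair is assigned to the matching column that represents the fewest feasible pairs (the most specific one). The feasible pairs are those of rules 1 to 3 only, and all function and variable names are hypothetical.

```python
ANY = "@"
SUBSETS = {"Oalv": {"t", "d", "s", "z"}, "Opal": {"ç", "j", "S", "Z"}}

# Feasible pairs from rules 1-3: the defaults plus the palatalization pairs.
DEFAULTS = "p t k b d g m n ng s z h l r w y i e a o u '".split()
FEASIBLE = ({(s, s) for s in DEFAULTS} | {("+", "0")}
            | set(zip("t d s z".split(), "ç j S Z".split())))

# Rule 4: column headers as (lexical, surface) and one row per state.
RULE4_COLUMNS = [("Oalv", "Opal"), ("Oalv", ANY), ("i", "i"), (ANY, ANY)]
RULE4_TABLE = {1: [3, 2, 1, 1], 2: [3, 2, 0, 1], 3: [0, 0, 1, 0]}
RULE4_FINAL = {1, 2}  # states written "1:" and "2:"; "3." is nonfinal


def matches(symbol, spec):
    """True if symbol is covered by spec (the ANY symbol, a subset, or a literal)."""
    if spec == ANY:
        return True
    if spec in SUBSETS:
        return symbol in SUBSETS[spec]
    return symbol == spec


def covers(column, pair):
    return matches(pair[0], column[0]) and matches(pair[1], column[1])


def column_for(pair, columns):
    """Pick the matching column that represents the fewest feasible pairs."""
    candidates = [i for i, col in enumerate(columns) if covers(col, pair)]
    return min(candidates,
               key=lambda i: sum(covers(columns[i], p) for p in FEASIBLE))


def accepts(lexical, surface, columns, table, final):
    """Run space-separated lexical and surface symbols through the table."""
    state = 1
    for pair in zip(lexical.split(), surface.split()):
        state = table[state][column_for(pair, columns)]
        if state == 0:  # the table blocks this correspondence
            return False
    return state in final


rule4 = (RULE4_COLUMNS, RULE4_TABLE, RULE4_FINAL)
print(accepts("s ' a t i", "s ' a ç i", *rule4))  # True: t is palatalized before i
print(accepts("s ' a t i", "s ' a t i", *rule4))  # False: t:t is followed by i
```

The symbols are written space-separated so that multigraphs such as ng remain single symbols. The sketch checks one rule in isolation; in a full description every rule must accept the same sequence of pairs.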
Rules 5 and 6 state that vowels are lengthened when they are stressed (that is, follow ') and precede a lexical voiced consonant (that is, a member of the subset Cvd).
RULE "5 Lengthening correspondences" 1 6
   a e i o u @
   ä ë ï ö ü @
1: 1 1 1 1 1 1

RULE "6 Vowel Lengthening, V:Vlng <=> '___Cvd:" 4 5
   ' V    V Cvd @
   ' Vlng @ @   @
1: 2 0    1 1   1
2: 2 4    3 1   1
3: 2 1    1 0   1
4. 0 0    0 1   0

The environment of rule 6 contains the correspondence Cvd:@ rather than Cvd:Cvd because of rules 7 and 8, which devoice obstruents word finally.
RULE "7 Devoicing correspondences" 1 5
   b d g z @
   p t k s @
1: 1 1 1 1 1

RULE "8 Final Devoicing, Ovd:Ovl <=> ___#" 3 4
   Ovd Ovd # @
   Ovl @   # @
1: 3   2   1 1
2: 3   2   0 1
3. 0   0   1 0

The rules file optionally ends with a line containing only the word END. Any material in the file after this line is ignored by PC-KIMMO.
END
Assembled, the complete sample rules file looks like this:

ALPHABET p t k b d g m n ng ç j s S z Z h l r w y i e a o u ï ë ä ö ü + '
NULL 0
ANY @
BOUNDARY #
SUBSET C    p t k b d g m n ng ç j s S z Z h l r w y
SUBSET V    i e a o u
SUBSET Vlng ï ë ä ö ü
SUBSET Cvd  b d g m n ng z l r w y
SUBSET Oalv t d s z
SUBSET Opal ç j S Z
SUBSET Ovd  b d g z
SUBSET Ovl  p t k s

RULE "1 Consonant defaults" 1 17
   p t k b d g m n ng s z h l r w y @
   p t k b d g m n ng s z h l r w y @
1: 1 1 1 1 1 1 1 1 1  1 1 1 1 1 1 1 1

RULE "2 Vowels and other defaults" 1 8
   i e a o u ' + @
   i e a o u ' 0 @
1: 1 1 1 1 1 1 1 1

RULE "3 Palatalization correspondences" 1 5
   t d s z @
   ç j S Z @
1: 1 1 1 1 1

RULE "4 Palatalization, Oalv:Opal <=> ___i" 3 4
   Oalv Oalv i @
   Opal @    i @
1: 3    2    1 1
2: 3    2    0 1
3. 0    0    1 0

RULE "5 Lengthening correspondences" 1 6
   a e i o u @
   ä ë ï ö ü @
1: 1 1 1 1 1 1

RULE "6 Vowel Lengthening, V:Vlng <=> '___Cvd:" 4 5
   ' V    V Cvd @
   ' Vlng @ @   @
1: 2 0    1 1   1
2: 2 4    3 1   1
3: 2 1    1 0   1
4. 0 0    0 1   0

RULE "7 Devoicing correspondences" 1 5
   b d g z @
   p t k s @
1: 1 1 1 1 1

RULE "8 Final Devoicing, Ovd:Ovl <=> ___#" 3 4
   Ovd Ovd # @
   Ovl @   # @
1: 3   2   1 1
2: 3   2   0 1
3. 0   0   1 0

END