LingDoc Transform

 

A tool for document and language engineering

 

 

 

Reference Manual

 

Version 3.2

 

 

 

 

 

 

 

Palstar

Uffelte

The Netherlands

www.lingdoc.eu

www.palstar.nl

 

 

 


1........... Introduction. 6

1.1      Hands‑on. 6

1.1.1      Example on Dutch license plates. 6

1.2      From grammar to program.. 7

1.2.1      Preparation phase. 7

1.2.2      Compilation phase. 8

1.2.3      Execution phase. 8

1.3      Grammars 8

1.3.1      Meta language. 9

1.3.2      Types of rules and types of grammars 10

1.4      The compiler and the run time system.. 11

1.4.1      Finite state automata. 11

1.4.2      Technical remarks 12

1.5      Miscellaneous concepts 12

1.5.1      Recursion. 12

1.5.2      Ambiguity. 13

1.5.3      Unit rules and empty rules 13

1.6      Contents of this manual 13

2........... Basic Syntax. 15

2.1      Meta symbols 15

2.2      Terminal symbols 16

2.2.1      Characters and strings 17

2.2.2      Ranges 18

2.2.3      Wildcards 19

2.3      Non-terminal symbols 19

2.4      Regular expressions 20

3........... Extended Features. 22

3.1      Actions 22

3.1.1      Assignment and tests 22

3.1.1.1    Variables, literals and expressions 22

3.1.1.2    Assignment 23

3.1.1.3    Types of tests 24

3.1.1.4    Combinations of tests and/or assignments 25

3.1.2      Messages 26

3.1.2.1    Simple messages 26

3.1.2.2    Messages with expressions 27

3.1.3      Places of actions in rules 28

3.1.3.1    Actions following symbols 28

3.1.3.2    Actions following regular expressions 29

3.1.3.3    Actions as part of the repetition of regular expressions 29

3.1.3.4    Actions in empty units 30

3.2      Symbol type extensions 31

3.2.1      Non-terminals with parameters 31

3.2.2      Lexicon symbols 33

3.2.3      Cover non-terminals 34

3.2.4      Intermediate symbols 35

3.3      Negation. 36

3.3.1      Negation of terminal characters, ranges and intermediates 37

3.3.2      Negation of non-terminal symbols 37

3.3.3      Negation of regular expressions 37

3.3.4      Negation of a string of terminal symbols 38

3.4      Error recovery. 38

4........... Context Sensitive Grammars. 40

4.1      Description of context sensitive rules 40

4.2      Context sensitive grammars for recognition or parsing. 40

4.2.1      Example of a context sensitive grammar used for recognition. 40

4.2.2      (Cover) non-terminals in context sensitive grammars 41

4.2.3      Intermediates in context sensitive grammars 42

4.3      Transduction grammars 43

4.3.1      Description of transduction rules 43

4.3.2      Non-terminals in transduction rules 45

4.3.3      Intermediates in transduction rules 45

4.3.4      Ambiguity and "looping" problems in transduction grammars 46

4.3.5      Compiling and running a transduction grammar 47

4.3.6      Cascaded grammars 47

5........... Run Time Input and Output files. 48

5.1      Input files 48

5.2      Appearance of symbols in output files 49

5.3      Parse tree output 50

5.4      Report output 51

5.4.1      Report of messages 51

5.4.2      Report of output parameters 52

5.5      Error output 52

6........... Switches (Directives) 52

6.1      Compiler Directives 53

6.1.1      GUI: Advanced Mode, Standard fields: Compiler 53

6.1.1.1    Switch: FINITE. 54

6.1.1.2    Switch: ONE_CS_REDUCE. 54

6.1.1.3    Switch: ONE_ALTERNATIVE. 54

6.1.1.4    Switch: RECURSION_MESSAGE. 55

6.1.1.5    Switch: SHARED_UNIVERSE. 55

6.1.1.6    Switch: UNIVERSE. 55

6.1.1.7    Switch: TRANSDUCE. 55

6.1.1.8    Switch: PREPARE_ PARSETREE. 56

6.1.1.9    Switch: INTERM_INPUT. 56

6.1.1.10  Switch: CHECK_VARIABLES. 56

6.1.1.11  Switch: COMP_BASE. 56

6.2      Runtime Directives 56

6.2.1      GUI: Advanced Mode, Standard fields: Parser 56

6.2.1.1    Switch: ECHO.. 57

6.2.1.2    Switch: IGNORE_CR. 57

6.2.1.3    Switch: SPLIT_OUTPUT. 57

6.2.1.4    Switch: BUILD_PARSETREE. 57

6.2.1.5    Switch: SKIP_CHARS. 58

6.2.1.6    Switch: SKIP_ON_ERROR. 58

6.2.1.7    Switch: OUTPUT_PARAMETERS. 58

6.2.1.8    Switch: LEXICON.. 58

6.2.1.9    Switch: LEX_CHARS. 58

6.2.1.10  Switch: SINGLE_LINES. 58

6.2.1.11  Switch: IGNORE_BLANKS. 59

6.2.1.12  Switch: FLAT_OUTPUT. 59

6.2.1.13  Switch: CASE_CONVERT. 59

6.2.1.14  Switch: FIND_ON_ERROR. 59

6.2.1.15  Switch: PARSE_BASE. 59

6.2.1.16  Switch: LEX_INCREMENT. 60

6.3      Switches for error recovery. 60

6.3.1.1    Switch: CPU_LOG.. 60

7........... Operation. 60

7.1      Compiler 60

7.1.1      Commands for compilation in batch. 61

7.1.2      Screen output in the compilation phase. 61

7.1.3      Files related to compilation. 62

7.1.4      Error handling during compilation. 62

7.2      Run time system.. 63

7.2.1      Commands for the run time system in batch. 63

7.2.2      Screen output and keyboard input during the execution phase. 64

7.2.3      Files related to the run time system.. 64

7.2.4      Error handling during run time. 65

8........... Problems and current limitations. 66

8.1      Current limitations 66

8.1.1      Upper limits for the size of a grammar 66

8.1.2      Finite versus non-finite. 66

8.1.3      Limitations concerning context sensitive rules 67

8.1.4      Limitations concerning actions 67

8.1.5      Miscellaneous limitations 67

8.1.6      Known effects of "bugs" 68

8.2      Debugging Grammars 68

9........... Tools. 69

9.1      Lexicon. 70

9.1.1      GUI: Advanced Mode: Lexicon-Compiler 71

9.2      Drawtree. 71

9.3      Drawgraph. 72

9.4      Plink. 72

9.4.1      GUI: Advanced Mode: Cascade Linker 73

9.5      Xgram.. 73

9.6      Disassembler 74

9.6.1      GUI: Advanced Mode: Disassembler 74

10.......... Overview. 75

10.1     Meta language of Transform grammars 75

10.2     Rules for the construction of user defined symbol names 76

10.3     Checklist for switches 77

10.4     Batch-Commands 78

11.......... APPENDIX A: SWITCHES FOR PROGRAMMERS. 79

11.1     Compiler programmer switches 79

11.1.1     GUI: Advanced Mode: Compiler, Advanced Fields 79

11.1.1.1  Switch: TEST_COMP_SETS. 79

11.1.1.2  Switch: TEST_LABELTREES (obsolete) 79

11.1.1.3  Switch: WRITE_PGFILE. 79

11.1.1.4  Switch: OPTIMIZE. 79

11.1.1.5  Switch: TRACE_COMP. 80

11.1.1.6  Switch: REDUCE_SHIFT. 80

11.1.1.7  Switch: PRINT_SETS. 80

11.1.1.8  Switch: ADD_CS_RULES. 80

11.1.1.9  Switch: SKIP_TREES (obsolete) 80

11.1.1.10 Switch: UPDATE_LABFILE. 80

11.1.1.11 Switch: PRINT_STRUCTURE. 81

11.1.1.12 Switch: DEBUG_CONTROL. 81

11.1.1.13 Switch: TEST_TREES (obsolete) 81

11.1.1.14 Switch: SKIP_LABELTREES (obsolete) 81

11.1.1.15 Switch: WRITE_SYMBOLICS. 81

11.1.1.16 Switch: PRINT_RESULT. 81

11.1.1.17 Switch: DEALLOCATION.. 81

11.1.1.18 Switch: WATCH.. 81

11.1.1.19 Switch: COMP_STATISTICS. 82

11.1.1.20 Switch: INTERACTIVE. 82

11.1.1.21 Switch: PRINT_STATES. 82

11.1.1.22 Switch: BOOKKEEPING_ITEMS. 82

11.2     Run time programmer switches 83

11.2.1     GUI: Advanced Mode: Parser, Advanced Fields 83

11.2.1.1  Switch: TREE_INPUT (obsolete) 83

11.2.1.2  Switch: TRACE_USECOUNT. 83

11.2.1.3  Switch: PRINT_PARSETREE. 83

11.2.1.4  Switch: COUNT_CONNECTORS. 83

11.2.1.5  Switch: TRACE_LEVEL. 84

11.2.1.6  Switch: TRACE_ERROR. 84

11.2.1.7  Switch: ONLINE_MESSAGES. 84

11.2.1.8  Switch: PARSE_STATISTICS. 84

11.2.1.9  Switch: COUNT_PARSETREES. 84

11.2.1.10 Switch: BATCH.. 85

11.2.1.11 Switch: TRACE_PARSE. 85

11.2.2     Switches for interactive use. 85

12.......... APPENDIX B: ASCII TABLE. 86

13.......... APPENDIX C: COMPILER ERROR MESSAGES. 87

13.1     Errors in values of switches 87

13.2     Insufficient constants 87

13.3     Not yet implemented features 87

13.4     Syntax errors in the grammar, not detected by Bootstrap. 88

13.5     Errors in the .IM file. 88

13.6     Debug facilities 89

13.7     Implementation errors 89

 

 


1     Introduction

This reference manual presents a general overview of LingDoc Transform, together with a detailed description of the Transform formalism, its operation and its auxiliary tools.

 

LingDoc Transform (short Transform) is introduced in a Management Overview and in Five Position Papers. For a formal description of the underlying ideas and principles we refer to the following book:

 

              G.J. van der Steen. A program generator for recognition, parsing and transduction with syntactic patterns. CWI Tract 55, Center for Mathematics and Computer Science. P.O. Box 4079, 1009 AB Amsterdam, The Netherlands, 1988, 284 pp., ISBN 90 6196 361 3.

 

An ordinary user who wants to make use of the possibilities of the system has to understand some of the underlying concepts and must be familiar with some of the terminology. In this introductory chapter only those terms will be explained which are indispensable for the understanding of the remaining chapters of this manual. Where necessary the text is illustrated with examples.

 

1.1     Hands‑on

 

A hands-on experience is described in the User Manual of Transform, together with an introduction to the formalism. In this Reference Manual we go into more detail and describe the Transform system from a general point of view.

 

1.1.1        Example on Dutch license plates.

 

In addition to the examples within the User Manual we will use, throughout this Reference Manual, an example on Dutch license plates.

 

Dutch license plates consist of a combination of pairs of letters and pairs of digits. The pairs are separated by hyphens. Back in ancient history, the plates started with letters, followed by two pairs of digits. Nowadays, plates start with a pair of letters, followed by two digits and another pair of letters. Here is the history:

 

      AB‑12‑34

      12‑34‑AB

      12‑AB‑34

      AB‑12‑CD

 

Now, enter the following grammar using an editor that produces plain text files. Name the file LICENCE.GRM.. (This file is already contained within the directory Examples_Reference_Manual). The .GRM extension indicates that the file contains an Transform grammar.

 

      /* LICENCE.GRM */

 

      LicensePlates::=  ( LicensePlate , CR )+ .

      LicensePlates::=  Letters , Hyphen , Digits , Hyphen , Digits {S:2} |

                        Digits  , Hyphen , Digits , Hyphen , Letters {S:3} |

                        Digits  , Hyphen , Letters , Hyphen , Digits {S:4} |

                        Letters , Hyphen , Digits , Hyphen , Letters {S:5} .

      Letters     ::=   [a-z] {R:1} , [A-Z] {R:1} .

      Digits      ::=   [0-9] {R:1} , [0-9] {R:1} .

      Hyphen      ::=   '-' {R:1} .

 

 

The texts between the curly brackets are not part of the actual grammar; they are only there to ensure that the grammar produces some interesting output, which does not mean that they may be omitted. The heading, the grammar name between ‘/*’ and ‘*/’, is just a comment.

 

Now, create a file called LICENCE.SW. Its contents should be a single line saying FINITE (lowercase is allowed as well). Alternatively, use the GUI in the “Advanced mode” to switch FINITE on. This will create the file LICENCE.SW automatically.

 

This LICENCE.SW file is consulted by the Transform compiler. It contains so‑called switches: instructions to the compiler to modify its behavior to some extent. In this case the compiler is instructed to generate a so‑called finite state machine. The file modifies the default settings of the Transform compiler. If no .SW file exists, the compiler uses its default settings.

 

After setting the switches, select the grammar in the GUI by pressing the button “Select grammar”. Then, invoke the Transform compiler by pressing the button “Compile grammar”.

 

Before the compiler starts, the syntax of the grammar is checked. This is done by means of the grammar of Transform Grammars, called BOOTSTRAP.GRM.

 

When the compilation is complete, you can look in the directory which contains the grammar. You will find that some other files have been created.

 

At this stage, you can invoke the program that actually runs the grammar. This run time system requires an input file. Therefore, create an input file containing correct license plates. Name the file PLATES.TXT. (This file is already present). Its contents should be:

 

      12‑34‑AB

      12‑AB‑34

      AB‑12‑34

      AB‑12‑CD

 

In the GUI, select the input file by pressing the button “Run grammar”.

 

If all goes well (and let's hope it does), the Transform Process Monitor says "Input correct". This means that the Transform system did not find an incorrect license plate in your input. Try for yourself what happens if you make errors in the syntax of the license plates.

 

Apart from saying "Input correct", the parsing process has a side effect. There now is a file called PLATES.REP. You can view this file. Its contents list the original input (preceded by "1:"), followed by lines like "2:", "3:" and so on.

This .REP file (usually called report file) is produced because the grammar contained so-called messages (the text between the curly brackets), which are instructions to write part of the input to the output. The "{R:1}" strings in your grammar copy the input to the report file, preceded by "1:". The "{S:2}" string only writes "2:" to the report file. Note that the digits 2 through 5 in this grammar indicate from which period in the history the license plate stems.

 

1.2     From grammar to program

 

The Transform system consists of two main programs, a compiler and a run time system. The process in which these programs are used can be described in a cycle of three steps:

 

1.2.1        Preparation phase

 

              The first step is to write a grammar in the formalism prescribed by the system. This can be done by creating an ordinary text file with an editor. At this point the grammar writer can influence the outcome of the next step by describing the goals of his grammar (for example "description versus change" or "recognizing versus parsing") or by describing some peculiarities of the grammar. This can be done by creating a so-called switches file, which is also a text file.

 

1.2.2        Compilation phase

 

              In the next step the grammar must be compiled by the Transform compiler. In fact the compilation process consists of two stages. In the first stage the grammar is checked for grammatical errors; the second stage is the generation phase in which the machine code is generated that constitutes the program resulting from that particular grammar. If an error is detected in either stage, the grammar writer must return to the first phase and correct his grammar; otherwise he goes to the next phase.

 

1.2.3        Execution phase

 

              The compiled grammar is now transformed into a kind of program (also called automaton or machine) and can run over one or more input files, text files for which the grammar has been designed. This can be done with the help of the second main Transform program: the run time system. Again some options can be specified in the so-called switches file. For example, one can specify the name of a lexicon that must be used or one can instruct the run time system to neglect the difference between lowercase and uppercase letters. When errors are detected during this phase one can decide whether to correct the input file (for example when the errors were due to typing errors in the input file) or to start the cycle again by correcting the grammar. The different kinds of output (transduction output, parse trees, etc.) will be written to separate files which the user can inspect or use in other programs.

 

In the remaining sections of this chapter a number of concepts belonging to the automaton theory will be used. This manual is, however, not the place to give an introduction in this theory. For the reader who wants to know more about this subject we refer to the literature.

 

1.3     Grammars

 

Formal grammars consist of a set of interdependent rewriting rules. With these rules one describes syntactic regularities in a specific domain. For instance, a grammar for Simplified English gives a description of all the possible sentences that are grammatical in this particular domain. The combined set of all these sentences, however, is only a subset of all English sentences. And as another example, a grammar for the transformation of graphemes into phonemes in Dutch describes the rules by which pronunciation takes place.

 

A grammar defines a language, consisting of all sentences (ranges of characters or words) that are correct according to this grammar. The program resulting from a compiled grammar will mark every input sentence as correct or incorrect according to the grammar. When the terms "language" and "sentence" are used in this manual, they refer to this particular definition, unless stated otherwise.

 

The automatic processing of texts by means of a grammar can serve different purposes. Three goals can be distinguished:

 

1.           Recognition. In this case a grammar is compiled to a program which will take a string as input and either accept or reject it depending on whether the string is a sentence which may be produced by the syntactic description. Such a program is called a recognizer.

 

2.           Interpretation. The grammar is not only used to analyze an input string in syntactic terms, but also to give a structural description of the sentence in the form of a so-called parse tree. This kind of activity is known as parsing and the program resulting from such a grammar is called a parser.

 

3.           Change. The input material (or parts of it) has to be replaced by other characters and/or words, has to be omitted and/or has to be placed in a restructured order. The grammar contains rules by which this transformation must take place. This process is called transduction and the compiled grammar a transducer.

 

The difference between recognizing and interpreting programs is only a gradual one; programs of the latter type will be used as a combined interpreter/recognizer. They will decide whether or not an input string is correct according to the syntactic description of the grammar and at the same time a parse tree (or alternative parse trees) will be built. In the Transform system the three goals mentioned were unified in one single concept. At the same time a number of grammatical formalisms were unified [1]. Facilities for pattern matching are also included. See for a short overview the Position Paper “LingDoc and Grammar Formalisms”.

 

 

1.3.1        Meta language

 

Grammars must be written in a kind of formal language, the meta language. This language defines its objects and prescribes the appearance of grammars.

 

The syntax for Transform grammars is fully described in Bootstrap, a "grammar for Transform  grammars". This grammar describes all correct grammars that can be written in the Transform formalism. It is used in the first phase of the compilation process when the input, a grammar written by a user, is checked for syntax errors. The outcome can be:

 

             the user grammar does not contain any syntactic errors and will be recognized as a Transform grammar, or

             the user grammar is syntactically incorrect according to the Bootstrap grammar and the errors will be marked.

 

In the Transform system, a grammar consists of an unordered set of rules. Each rule is basically built out of three groups of components:

 

1.           Terminal symbols. These are characters and/or words literally found in the input that must be recognized, parsed or transduced by the compiled grammar.

 

2.           Non-terminal symbols. The user can define auxiliary terms as names or abbreviations for sequences of terminal symbols and/or non-terminal symbols. A non-terminal can be used for two different purposes:

                       For convenience, for example as an abbreviation for a long series of symbols, repeated several times in the grammar.

                       In a structural description as a name for an entity that corresponds with a part of the object. For example one can use the name "NounPhrase" for a specific part of a natural language sentence.

 

3.           Meta symbols or reserved symbols (mostly punctuation marks). They make it possible to describe a complex structure in a compact notation.

 

A grammar rule consists of a left-hand side and a right-hand side, eventually separated by a rewrite sign (in the Transform formalism: a double colon "::"). The end of the rule is marked by a period ("."). The left-hand side of a rule may consist of one or more symbols separated by the concatenation sign (comma ","). The right-hand side of a rule (also called the input side) may consist of a, possibly empty, sequence of symbols also eventually separated by the concatenation sign.

 

The meaning of a rule can be stated as:

              "the left-hand side may be rewritten in the right-hand side" or

              "the right-hand side is the production of the symbol at the left-hand side" or

              "the right-hand side is the definition of the left-hand side".

 

As soon as an input fits the description of the right-hand side of a rule, we can state:

              "the symbol on the left-hand has been recognized".

Another terminology designating this process is:

              "the right-hand side reduces the symbol on the left-hand side".

 

 

1.3.2        Types of rules and types of grammars

 

The number of symbols on either side of the rewrite sign of a grammar rule determines the type of the rule:

 

             If the left-hand side consists of only one symbol, the rule is said to be context free. The symbol on the left-hand side of a context free rule is always a non-terminal. The right-hand side may be empty.

 

             If the left-hand side consists of more than one symbol, the rule is said to be context sensitive. Within the Chomsky hierarchy of rules and grammars, a distinction is made between truly context sensitive rules and type‑0 rules. In truly context sensitive rules, the right-hand side is at least as long as the left-hand side. Type‑0 rules are shortening rules: the right-hand side is shorter than the left-hand side. They are also known as unrestricted rewriting rules. [2]

 

In connection with the different types of rules, different types of grammars are distinguished.

 

Context free grammars are grammars which consist of only context free rules. Each rule of a context free grammar rewrites one symbol in a sequence of zero or more symbols.

 

Context sensitive grammars are grammars which contain one or more context sensitive rule. Type‑0 grammars (or unrestricted rewriting grammars) are grammars which contain at least one shortening context sensitive rule. Type‑0 grammars are undecidable, which means that the parsing process may not terminate.

 

Within the Transform system, grammars that are used for recognition and for parsing can be context free or context sensitive, but type‑0 rules with an empty right-hand side are not allowed. Each grammar has a starting symbol, defined as the first symbol on the left-hand side of the first rule in the grammar.

 

Apart from context free rules, a grammar used for recognition or parsing can also contain context sensitive rules (making the grammar a context sensitive grammar). The class of context free languages is a real subset of the class of context sensitive languages. As a consequence, a grammar using context sensitive rules can recognize input that would be unrecognizable if the context sensitive rules were left out.

 

As an example, one may think of a grammar describing the plural of English nouns as "singular noun followed by an 's'". By means of a context sensitive rule stating that any "ies" at the end of a word may also be treated as if it were transformed into "ys" one can guarantee that forms like "spies" (derived from "spy" + "s") can also be recognized. This example will be explained in detail in section 4.2.1 (Example of a context sensitive grammar used for recognition).

 

Transduction grammars contain at least one context sensitive rule, possibly supplemented by one or more context free rule. The context sensitive rules in this type of grammar are called transduction rules. Transduction grammars do not have a starting symbol. Transduction rules can be shortening or lengthening, however, their right-hand side may never be empty. The goal of this type of grammar is to replace input conforming to the pattern described on the right-hand side of transduction rules with the characters underlying the pattern on the left-hand side.

 

 

1.4     The compiler and the run time system

 

The compiler transforms a grammar into a kind of program (or automaton): it generates code that can be interpreted by the run time system as instructions for the treatment of input for which the grammar was designed. For example, it can contain instructions for the transformation of recognized patterns, for building parse trees, for the consultation of an external lexicon, for emitting messages and for performing tests.

 

In generating the code the compiler describes every possible so-called state of the run time system during the treatment of the input. This state is only dependent on the section of the input already treated.

 

Dealing with natural language means dealing with ambiguities (see also section 1.5.2 Ambiguity). In the design of the Transform system, special care has been paid to handle ambiguity in an efficient way. The developed run time system is an interpreter for so-called Parallel Transduction Automata (PTAs). The run time system tries to follow all possible paths in parallel (breadth first). If there is more then one path, common parts of the resulting parse trees will be shared as much as possible [3]. As a consequence, recognition will take place in polynomial time, when this is theoretically possible. In case of an interpreting program the resulting output contains all remaining parse trees.

 

 

1.4.1        Finite state automata

 

Some types of grammar (namely non-nested context free grammars, see section 1.5.1 Recursion) can be compiled in so-called finite state automata. A finite state automaton can be modeled by a finite number of states, an input (tape) and a set of transitions which specify, given a state and an input, the resulting state. A finite state automaton is thus modeled as a computer without any memory.

 

Other types of grammar cannot be compiled into finite state automata, because the languages they describe are too complex to use an automaton without any memory. The number of states that is needed will become infinite unless memory is added to the automaton. Therefore a stack is used as a storage device; the run time system must, for some time, postpone some decisions and must therefore keep track of past treatments.

 

For example, grammars containing nested recursion (see section 1.5.1 Recursion) can never be compiled into a finite state machine. Some extended features in the Transform formalism also make use of a stack in run time. Grammars making use of these features must always be compiled to a non-finite automaton, although the type of grammar may not demand it. Other features (for example negation and transduction) can only be used in grammars to be compiled into a finite state machine.

 

The choice between generation of a finite state machine and a non-finite state machine is (still) left to the user. One can add the switch FINITE or NOFINITE to the switches file. At appropriate places in the presentation of the Transform formalism one will find remarks on the possible combinations.

 

In spite of these limitations there remains a group of grammars for which one has a choice between the two possibilities. In these cases one can consider that a grammar compiled to a finite state machine will run faster (because the run time system does not need to maintain a stack) but the compilation needs more time and memory and will produce more program code (because more states will be described).

 

 

1.4.2        Technical remarks

 

The main programs of the Transform system as well as the auxiliary tools are written in standard Pascal. The system has been ported to a variety of hardware platforms. Current implementations include VAX/VMS, Sun (UNIX) and LINUX. The most recent version runs under Windows XP.

 

The compiler and the run time system can run on different machines because the generated program of a grammar is written in pseudo-code.

 

 

1.5     Miscellaneous concepts

 

In this section some terms that will be used in the rest of this manual are introduced: recursion, ambiguity and two special types of rules, the unit rules and the empty rules.

 

 

1.5.1        Recursion

 

Grammars may contain recursion caused by the (in)direct recurrence of the left-hand side non-terminal on the right-hand side of the same rule. A distinction is made between nested and non-nested recursion. Non-nested recursion, i.e. left or right recursion, is caused by the (in)direct recurrence of the non-terminal at the extreme sides of the right-hand side of a grammar rule. In nested recursion the non-terminal reoccurs somewhere in the middle of the rule. Examples:

 

                   Premodifier        ::=           Premodifier , Modifier .                                        /* non-nested, left recursion */

                   Postmodifier      ::=           Modifier , Postmodifier .                                    /* non-nested, right recursion */

                   Brackets              ::=           LeftBracket , Brackets , RightBracket .                             /* nested recursion */

 

Grammars with nested recursion theoretically cannot be compiled into a finite state automaton, because a stack is needed in run time[4]. Non-nested recursion is automatically handled by the compiler.

 

When a grammar is compiled into a finite state automaton, a check for nested recursion is always performed by the program. If one wants the same check performed in the case of FINITE=FALSE, one can add the switch RECURSION_MESSAGE to the switches file.

 

 

1.5.2        Ambiguity

 

During recognition, parsing and transduction, ambiguities may arise. The causes may be:

 

             inherent ambiguity: according to the syntactic rules there is more than one structural description of the input

 

             temporal ambiguity: according to the syntactic rules there is more than one way to proceed in order to reach one single final structural description.

 

In designing grammars related to natural languages, ambiguities are no exception. Often sentences can in fact only be described in several ways and as a consequence the user will expect more than one parse tree.

 

Ambiguities may also arise, however, as a consequence of badly designed grammars. This is partly a question of experience. Some advice however can be given; it will appear throughout the text of this manual in the form of style recommendations. The formalism of Transform offers a lot of possibilities to give a grammar text a transparent lay-out; also comments added to the text can be helpful to avoid unwanted ambiguity.

 

 

1.5.3        Unit rules and empty rules

 

By unit rules we will refer to a special type of rule occurring in grammars. If the right-hand side of a rule contains only one symbol (or a series of alternatives, each containing only one symbol), the symbol on the left-hand side will be reduced immediately as soon as one symbol in the input matches the right-hand side. Without special instructions the compiler will "eliminate" the symbols defined in this manner and as a consequence the run time system will run faster. However, because the run time system will have no knowledge about these symbols, they will not appear in the resulting parse trees. One can instruct the compiler to generate instructions for all auxiliary symbols in the grammar by adding PREPARE_PARSETREE to the switches file.

 

In the Transform formalism, it is permitted to add context free rules with an empty right-hand side to the grammar, so-called empty rules. The compiler will also eliminate these empty rules from a grammar. At first sight such rules do not make any sense. However, they can play an important rule in combination with the tests that can be performed (see section 3.1.3.4 Actions in empty units). The non-terminal on the left-hand side of an empty rule will never appear in parse trees.

 

1.6     Contents of this manual

 

The Transform formalism will be presented in chapter 2 Basic Syntax and chapter 3 Extended Features. Because context sensitive grammars (used for recognition or transduction) have special characteristics, this grammar type will be discussed in chapter 4 Context Sensitive Grammars. Some remarks concerning the output of the run time system can be found in chapter 5 Run Time Input and Output files. Chapter 6 Switches (Directives) contains an overview of all possible options that can be added to the switches file. The commands for activating the compiler and the run time system are subject of chapter 7 Operation. The current limitations of the system will be listed in chapter 8 Problems and current limitations, together with some advice regarding the debugging of grammars. A description of the auxiliary tools can be found in chapter 9 Tools. The concluding chapter 10 Overview in which an overview is given of all the different aspects of the system can be used as a quick reference.


2     Basic Syntax

 

In this chapter, the three basic components of Transform grammars: meta symbols, terminal symbols and non-terminal symbols will be explained. The main power of the Transform formalism, however, lies in the extensions of the simple concepts. The advanced features will be the subject of the next chapter.

 

This reference manual, together with the user manual, uses the syntax of the W3C. The dissertation uses a syntax which we call “classic”. Both syntaxes may be used by calling a different metagrammar.

 

2.1     Meta symbols

 

Meta symbols in the Transform formalism are mainly punctuation marks. In section 1.3.1 (Meta language) we already introduced some of the meta symbols. They are listed below, together with some new ones which will be introduced in this section:

 

────────────────────────────────────────────────────

      meta symbol            function

────────────────────────────────────────────────────

               ::=                  standard rewrite sign between the left-hand side and right-hand

                                      side in context free or context sensitive rules

               <::                  rewrite sign in context sensitive rules

               ,                    concatenation

               |                    alternative

               .                    end of rule

               /*                  start of comment

               */                  end of comment

────────────────────────────────────────────────────

 

Context free rules with the standard rewrite sign are rewriting, or defining, rules. Context sensitive rules may be written with the standard rewrite sign ("::=") or with the context sensitive rewrite sign ("<::"). When a rule with a context free appearance (only one symbol on the left-hand side) is meant to be a context sensitive rule, the rule must be marked as such by the context sensitive rewrite sign ("<::").

 

We repeat here that the right-hand side of context sensitive rules may not be empty.

 

Restriction: Recognizing grammars with context sensitive rules must be compiled with the switch FINITE=FALSE.

 

The right-hand side of a context free rule contains the definition or production or realization of the left-hand side. There may be more than one realization. In this case, the grammar will contain several rules with the same left-hand side. In the Transform formalism, it is possible to collapse rules with identical left-hand sides; alternative right-hand sides may be linked by the alternative sign (vertical bar "|").

 

For example, the two rewrite signs ("::=" and "<::"), can be defined by two rules:

 

            RewriteSign      ::=         '::=' .

            RewriteSign      ::=         '<::' .

 

These rules can be collapsed to only one rule:

 

            RewriteSign      ::=         '::=' | '<::' .

 

            Remark: The concatenation sign (",") has precedence over the alternative sign ("|"). So the rule:

 

                        Symbols            ::=         Letter , Digit | Letter,  Letter .

 

            is equivalent with the two following rules:

 

                        Symbols            ::=         Letter,  Digit.

                        Symbols            ::=         Letter,  Letter .

 

            Remark: The use of the concatenation sign (",") is optional. It may be used for readability, e.g. when a grammar rule is written on one line. It may be left out, e.g. when each notion within a rule is written on a separate line.

 

A grammar may be supplied with comments.

A comment starts with a slash and a star ("/*") and ends with a star and a slash ("*/"). One can write any character between these begin and end markings, except the end combination star‑slash itself.

 

The main purpose of using comment in a grammar is documentation. In the development phase, for example while debugging a grammar with erroneous results, one can switch off one or more rules by enclosing them within comment markers.

 

Comments may be included at the beginning and at the end of the grammar, between rules and within rules (before and after each symbol or meta symbol).

 

Spaces, returns and form feeds between symbols and rules are also ignored, just like comments.

 

The only basic meta symbols not present in the list presented at the start of this section are the ones related to regular expressions. Regular expressions provide a compact notation of repetition in rules.

 

Style recommendations: Large grammars can become complex and difficult to expand. Although a user has to develop a clear style for writing grammars by himself, we can give some recommendations:

1.         Use comments to document the grammar.

2.         The structure of a grammar becomes more clear if blank lines between rules are added and tabs are used for indentation.

3.         Write every alternative on a separate line (unless the alternatives consist of a simple enumeration).

4.         Write all context sensitive rules with the "<::" sign.

 

As an example, the basic syntax of grammars is presented here as (an outline of) a context free grammar:

 

            Grammar                      ::=         Rule,  RestGrammar .

            RestGrammar               ::=         /* nothing */ |

                                                            Rule  RestGrammar .

            Rule                             ::=         LefthandSide,  RewriteSign,  RighthandSide,  Period .

            LefthandSide                 ::=         Symbol,  RestLefthandSide .

            RestLefthandSide          ::=         /* nothing */ |

                                                            ( Comma)?,  Symbol,  RestLefthandSide .

            RighthandSide               ::=         /* nothing */ |

                                                            Alternative,  RestRighthandSide .

            Alternative                    ::=         Symbol,  RestAlternative .

            RestAlternative             ::=         /* nothing */ |

                                                            (Comma)?,  Symbol,  RestAlternative .

            RestRighthandSide        ::=         /* nothing */ |

                                                            VerticalBar,  Alternative,  RestRighthandSide .

 

Remark: Note that the comments consisting of "/* nothing */" mark an empty alternative. For example, "RestGrammar" is defined as empty or as a rule again followed by "RestGrammar". In section 2.4 (Regular expressions) the combination of such alternatives to only one alternative by means of a regular expression will be explained.

 

2.2     Terminal symbols

 

The grammar given at the end of section 2.1 (Meta symbols) is not complete, because it does not contain terminal symbols. Terminal symbols are symbols that are not rewritten; they represent the characters and/or strings that must be found in the input. Therefore they never appear on the left-hand side of a context free rule.

 

In this section three different appearances of terminals will be discussed: single characters and strings, ranges and wildcards.

 

 

2.2.1        Characters and strings

 

All characters (including the characters that have a meta function), will be recognized by the system as terminals if they are surrounded by single quotes, for example:

 

            Vowel  ::=         'a' | 'e' | 'i' | 'o' | 'u' .

            Quote   ::=         '''' | '"'.

 

Note that the single quote itself must be doubled.

 

As an obsolete feature, single "name characters" may be represented literally (without single quotes), where "name characters" are defined as the alphanumeric characters and the underscore. So as an alternative of the first rule in the example given above, one can state:

 

            Vowel  ::=         a | e | i | o | u .

 

All other characters may not be represented literally; they must be surrounded by single quotes. Alternatively, as an obsolete feature, they may be preceded by the escape character (backslash "\"). The only exception to these representations is the character that denotes the end of line (also called the carriage return character or the newline character). There is however a reserved symbol which matches an end of line character (the actual value of which is operating system dependent): "cr" or "CR" (not "Cr"). This reserved terminal symbol (like all terminal symbols) may never occur on the left-hand side of a context free rule; this implies that one cannot redefine this reserved symbol.

 

All terminal symbols without exception may also be represented by the number representing their underlying code. A value in the (decimal) range 0’ to ‘255’ represents a character from the extended ASCII set, ISO 8859/1, as used under Windows and UNIX. (In the future, Transform will support wide-range character sets.)

The values have to be written in hexadecimal notation, like ‘#x61’. In Appendix B the characters and their ASCII code in three bases (octal, decimal and hexadecimal) are listed.

 

In summary, there are alternative ways to represent a single character in a grammar:

 

                                   

                                                           

"name characters" like "a"

"single quote" character

"end of line" character

other characters like "."

'a'

''''

CR

'.'

#x61

#x27

#x0D

#x2E

 

Sequences of terminal symbols may be given as a string between single quotes. For example:

 

            abcString          ::=         'abc' .

 

is another notation for:

 

            abcString          ::=         'a',  'b' , 'c' .

 

The matching of the characters in the grammar with characters in the input is case sensitive, so an "a" in the grammar matches only an "a" in the input and not an "A". If there is no need to distinguish between upper and lowercase in the input, one can add the switch CASE_CONVERT to the switches file.

 

There are text files with end of line characters that are not important for the structure of the text. The starting symbol of a grammar designed for a description of the structure of such a text covers input that can contain end of line symbols, but does not necessarily do so. One does not know in advance at which position an end of line symbol or a form feed can be expected. In such a situation the run time system can be instructed to skip the end of line symbols and form feeds by adding the switch IGNORE_CR to the switches file; a corresponding switch for blanks (spaces and tabs) is IGNORE_BLANKS.

 

It is possible to instruct the run time system to ignore other characters as well. For example, if you don't want to mention certain punctuation marks in your grammar, you can add the switch:

 

            SKIP_CHARS = '.' | ',' | '!' | '?' .

 

2.2.2        Ranges

 

With a range one can define a set of alternative terminals. A range consists of two terminal symbols separated by a hyphen. The range includes the two mentioned terminals plus those in between, defined by the order of the underlying character set (the ASCII set in the current implementation).

 

Examples:

 

            NameCharacter                        ::=         [a-z] | [A-Z] | [0-9] | '_' .

 

            OctalDigit                     ::=         [0-7] .

 

            Consonant                     ::=         [b-d] | [f-h] | [j-n] | [p-t] | [v-x] | 'z' .

 

Remark: Note the absence of single quotes around the name characters (letters, digits and the underscore) that mark the ranges.

 

A range may consist of one terminal symbol, for example:

 

            Underscore                   ::=         [_-_].

 

Ranges, just as ordinary terminals, may be defined in terms of their ASCII code, for example:

 

            Ascii                             ::=         [#x00-#xFF] .

            Brackets                       ::=         [#x28-#x29] .    /* "(" and ")" */

 

            Remark: The underlying numerical value of the first terminal must be less than or equal to the value of the second one. If this is not the case an empty range is defined (which is most probably not intended).

 

            Restriction: Ranges can not be used on the left-hand side of context sensitive rules.

 

Note: In other syntaxes, like the W3C ENBF syntax for XML, it is possible to define regular expressions of characters. We support the following forms:

- [a] (one constant) and [a-z] (range of constants);

- [#xN] (one constant) and [#xN-#xN] (range of constants);

- also, the negated variants are supported: [^a], [^a-z], [^#xN] and [^#xN-#xN].

Other variants can easily be dealt with, e.g. [0-9a-z] can be written as [0-9] | [a-z], because of our notation for regular expressions. In our opinion, this notation is more readable.

If required, a future version of Transform will support the syntax of [0-9a-z].

Furthermore, we do not deal with the special treatment of '-' and ']' in the W3C ENBF syntax when used as the first character following the '['.

 

2.2.3        Wildcards

 

Wildcards are meta symbols which match arbitrary characters (including carriage return, form feed etc.). There are three different wildcards, with different functions: a dontcare, a line and an arb symbol. They are listed below, together with their functions.

 

symbol

name

function

*

dontcare

matches exactly one arbitrary character

=

arb

matches zero or more arbitrary characters, not including the character following the arb symbol in the grammar

line

matches any sequence of zero or more arbitrary characters, including occurences of the character following the line symbol in the grammar

 

For example:

 

            Password          ::=         *,  *,  *, *,  *,  * .

 

defines a password as a sequence of six arbitrary characters. And in:

 

            Comment          ::=         '/*',  =,  '*/' .

 

a sequence of characters which starts with ‘/*’ and ends with ‘*/’ is recognized as a comment. The characters which are matched by the arb symbol will not contain a ‘*’. An alternative definition of comment:

 

            Comment          ::=         '/*',  ‑,  '*/' .

 

defines the same sequence of characters, but the characters which are matched by the line symbol may contain one or more ‘*/’. For example:

 

            /* a comment is marked by a "*/" at the end */

 

will be matched twice by the rule with a line symbol. It will only be matched once by the rule with an arb  symbol.

 

            Remark: The arb symbol matches any sequence of characters, not including the character following it in the grammar rule. If the arb symbol is followed by a string of terminal characters, it is only the first character of this string which defines the operation of the arb symbol. For example:

 

            Restriction 1: When the compiler option FINITE=FALSE is used an arb symbol may only be followed by a terminal symbol or by an intermediate symbol (see section 3.2.4 Intermediate symbols). It may not be followed by a non-terminal symbol or a lexicon symbol (see section 3.2.2 Lexicon symbols) and it may not occur at the end of a rule. If FINITE=TRUE, there are no restrictions at all and an arb symbol may even be the last symbol in a grammar rule.

 

            Restriction 2: Wildcards can not be used on the left-hand side of context sensitive rules.

 

2.3     Non-terminal symbols

 

Non-terminal symbols are symbols that occur at least once on the left-hand side of a context free rule. Non-terminal symbols function as names, or abbreviations, for sequences of symbols. Their names consist of a sequence of one or more alphabetic characters (uppercase or lowercase), the digits 0 to 9 and the underscore character ("_"). The maximum length of a non-terminal name is 32 characters.

 

            Remark 1: The names of non-terminal symbols are case sensitive. For example, a symbol defined as "comma" does not match any other spelling of the name, such as "Comma" or "COMMA".

 

            Remark 2: A symbol name consisting of only one character is handled as a terminal unless this name occurs at least once on the left-hand side of a context free rule (then it is considered a non-terminal). Therefore, beware of non-terminals named with only one character.

 

            Style recommendation: To avoid possible confusion with terminals, it is recommended that the names of non-terminals contain at least two characters.

 

Non-terminals defined by one or more unit rules will not automatically appear in parse trees when the run time switch BUILD_PARSETREE has been set to true. Only after setting the compiler switch PREPARE_PARSETREE to the value TRUE will the resulting parse trees contain every appropriate non-terminal. Examples of unit defined non-terminals can be found in the next rules:

 

            AlfaNum          ::=         Alfa | Number .

            Alfa                  ::=         [a-z] | [A-Z] .

            Number                        ::=         ( [0-9] )+ .

 

If the switch PREPARE_ PARSETREE is set to FALSE, "Alfa" and even "AlfaNum" will not appear in any parse tree at all; "Number" will only be present if it matches input consisting of more then one digit.

 

A built-in check on typing errors in non-terminal names is present to some extent. The compiler considers a name consisting of more than one character as an "intermediate symbol" (this symbol type will be introduced in section 3.2.4  Intermediate symbols) unless it appears at least once on the left-hand side of a context free rule. If intermediate symbols are present, their names appear on the screen and in the error file; misspelled names of non-terminals can appear amongst them.

 

Grammars for recognition and/or parsing contain at least one non-terminal: the symbol on the left-hand side of the first rule in the grammar. This special non-terminal is also called the starting symbol.

 

Transduction grammars can be written without any non-terminals, because they do not contain a starting symbol. For example, the rule:

 

            'terminal symbol' <:: 'terminal' .

 

can be compiled as a grammar but contains only terminals. When applied, it converts all the words "terminal" in a document (for instance this manual) to the words "terminal symbol".

 

 

2.4     Regular expressions

 

Regular expressions provide a compact notation of repetition and embedded alternation:

 

─────────────────────────────────────────────────────

expression      function

─────────────────────────────────────────────────────

x                    matches exactly one occurrence of "x"

( x )?             matches zero or one occurrence of "x"

( x )*             matches zero or more occurrences of "x"

( x )+             matches one or more occurrences of "x"

( x | y )?         matches zero or one occurrence of either "x" or "y"

( x | y )           matches exactly one occurrence of either "x" or "y"

( x | y )*         matches zero or more occurrences of "x" and/or "y"

( x | y )+         matches one or more occurrences of "x" and/or "y"

─────────────────────────────────────────────────────

 

In these examples of regular expressions, "x" and "y" may be replaced by a terminal or non-terminal symbol, by a sequence of terminal and/or non-terminal symbols or by another regular expression.

 

The grouping notation (...)  is especially useful if one wishes to overrule the precedence of the concatenation sign over the alternative sign. For example, if the definition of the non-terminal "Symbols" in the example on page 15, was meant to be:

 

            "a string of three characters, the first and the last a letter and between them a letter or a digit",

 

this can be written as:

 

            Symbols            ::=         Letter,  ( Digit | Letter ) , Letter .

 

As an example, the grammar for Transform grammars (earlier presented in section 2.1 Meta symbols), is given here in a more compact notation:

 

            Grammar          ::=         ( Rule )+ .

            Rule                 ::=         LefthandSide,  RewriteSign,  RighthandSide, Period .

            LefthandSide     ::=         Symbol,  ( Comma,  Symbol )* .

            RighthandSide   ::=         Alternative,  ( VerticalBar,  Alternative )* .

            Alternative        ::=         Symbol,  ( Comma,  Symbol )* .

 

            Restriction: Regular expressions can not be used on the left-hand side of context sensitive rules.


3     Extended Features

3.1     Actions

 

Actions relate to extensions of the formalism. There are two kinds of actions:

 

(1)        assignments and tests (section 3.1.1 Assignments and tests)

(2)        messages (section 3.1.2 Messages )

 

The general rules for actions are:

 

      Actions are written between braces "{ ...}".

      Apart from assignments and tests, an action starts with a keyword determining the type of the action. These keywords are "r:" and "s:" and may be written in upper or lowercase without difference in meaning.

      Actions may be conjoined by the conjunction operator (semicolon ";").

      Actions may only be used on the right-hand side of a context free rule. They may occur:

    after each symbol or regular expression

    between the ")" and the " * " or "+" of a regular expression

    as the first (or only) element of an alternative.

 

For some types of actions the effects depend on their position in the rule. In section 3.1.3 (Places of actions in rules) the different possibilities are discussed together with a description of the effects.

 

 

3.1.1        Assignment and tests

 

A powerful mechanism is the application of tests: the recognition of input can be made dependent of a testable criterion. This feature can be used successfully in grammars that describe ambiguous phenomena and can limit the number of alternative parse trees. They are very useful in combination with the symbol type "non-terminals with parameters", to be introduced in section 3.2.1 (Non-terminals with parameters).

 

Tests are performed on variables and literals, which can be combined together into expressions (section 3.1.1.1 Variables, literals and expressions). Before a test can be performed, variables must be given a value by means of an assignment (section 3.1.1.2 Assignment). In section 3.1.1.3 (Types of tests) different types of tests are discussed and section 3.1.1.4 (Combinations of tests and/or assignments) concerns possible combinations of tests and/or assignments.

 

 

3.1.1.1    Variables, literals and expressions

 

In Transform grammars, two kinds of variables can be used: local variables and parameters. The main difference between the two types lies in their scope; the scope of a local variable is the rule in which it occurs, a parameter can be used to transmit information between different rules. In this section, local variables are introduced and although parameters will be explained later on (see section 3.2.1 Non-terminals with parameters), all operations that can be performed on local variables are applicable to variables introduced as parameters as well.

 

Every rule, except context sensitive rules, may be extended with local variables. Local variables are devices which can be used for administrative purposes; information (in the form of a string) available at some point in the rule can be stored in a variable so that it can be used at another point in the same rule. A variable has a name (consisting of name characters, see the first example in section 2.2.2 Ranges) and a value (in the Transform system always a string).

 

A local variable need not be declared; the variable will be allocated at the moment of its first reference in the rule. At the moment of its allocation, it contains the empty string and this remains its content up to the moment that another value is assigned to it.

 

There is a reserved variable name (percent "%"). Its value is always the character that has most recently been read.

 

Besides variables, literal string values (also called literals) can be used in actions. They also store string information, but their value can never be changed. Between the braces ("{-}") of actions, literals are strings of characters between single quotes. The empty string can be represented by merely two quotes. To represent a single quote as part of a literal, it should be written twice.

 

Expressions can be built from variables and literals. This is done with the concatenation operator "||". The value of an expression is the concatenated value of all its components.

 

Remark 1: In the next sections the term "expression" refers also to its most simple form: a simple variable or a simple literal.

 

            Remark 2: In a grammar with many variables, a typing error in the name of a variable is easily made. Because the compiler treats every new name as an occurrence of a new variable, an error in a name can cause incorrect behavior in the compiled grammar. If the switch CHECK_VARIABLES is added to the switches file, all separate variables found in a rule will be listed in the compiler error file.

 

As an example, we give the value of some expressions that can be composed from the following components:

 

            var1 with the value "aa",

            var2 with the value "bb",

            % with the value 'c' and

            the literal 'dd':

 

expression

resulting value

var1 || var2

"aabb"

var1 || %

"aac"

'dd' || var2

"ddbb"

var1 || % || var1

"aacaa"

 

In Transform notation the definitions of the concepts, introduced in this section, are:

 

            /* Definitions for variables, literals and expressions */

 

            Variable                        ::=         VariableName | '%' .

            Literal                           ::=         ''''  ( Token )*  '''' .

            VariableExpression        ::=         VariableOrLiteral,  ( '||',  VariableOrLiteral )* .

 

            VariableOrLiteral          ::=         Variable | Literal .

            VariableName               ::=         ( NameCharacter )+ .

            Token                           ::=         [#x00-#x26] | [#x28-#xFF] | [#x27-#x27].

            /* #x27 is the value of a single quote, which must be doubled */

 

3.1.1.2    Assignment

 

A variable may be assigned a value by means of the assignment operator ("="); in general:

 

            { <variable> = <expression> }

 

which means that after evaluation the variable has the value specified by the  expression.

 

In the Transform notation, the definition of an assignment is:

 

            Assignment       ::=         Variable,  '=',  VariableExpression .

 

            Remark: Explicit assignment to the reserved "%" is not possible; this variable is assigned its value, the last read character, in run time.

 

Following the meaning of the concept expression (see section 3.1.1.1 Variables, literals and expressions), the value of a variable can be set in different ways, for example:

 

            {var1=var2} or {var1=%}         (assignment of a variable)

            {var1='abc'}                             (assignment of a literal)

            {var1=var2||'abc'}                     (assignment of a variable expression)

 

The value of a variable may be empty at any time. Remember that a variable has an empty value at the moment of its first appearance in a rule; its value can become the empty string again by assigning the literal representing the empty string to it. Therefore:

 

            {var=''}

 

removes the value of the variable "var".

 

 

3.1.1.3    Types of tests

 

Tests concern the truth value of a comparison between two expressions. If this value is false, the current recognition path stops. There are two test operators, one for equality and one for inequality.

 

The equality test:

 

            { <expression1> == <expression2> }

 

returns the value true if "expression1" has the same value as "expression2", while the inequality test:

 

            { <expression1> != <expression2> }

 

returns the value true if "expression1" refers to a value other than that specified by "expression2".

 

The definition of tests in Transform notation becomes:

 

            Test                  ::=         EqualityTest   |

                                                InequalityTest .

            EqualityTest      ::=         VariableExpression,  '=',   VariableExpression .

            InequalityTest   ::=         VariableExpression,  '!=',  VariableExpression .

 

Disjunct tests may be stated by means of the "or"-operator:

 

            { <test_1> | <test_2> | ... | <test_n> }

 

is true when the evaluation of at least one of the tests leads to the value true. For example:

 

            { var1== '' | var2==% }

 

is true if the value of the variable "var1" is empty or if "var2" has the value of the last read character in the input.

 

If the combined test concerns the same variable on the left-hand side, a shorthand notation is possible. For example:

 

            {var1==expr1 | var1==expr2}

 

can also be written as:

 

            {var1==expr1 | ==expr2} .

 

            Remark 1: A series of disjunct tests must be considered as a combined test with one value; the value is true if and only if one of its components is true.

 

            Remark 2: The "or"-operator can only be used in combination with tests; a compiler error message will be generated if this operator is used between other actions like assignments and/or messages.

 

 

3.1.1.4    Combinations of tests and/or assignments

 

Multiple assignments may be conjoined by the conjunction operator (";" or "&"):

 

            {var1=expr1;  var2=expr2}        or         {var1=expr1 & var2=expr2}

 

Multiple tests may also be conjoined by the conjunction operator:

 

            {var1==expr1 & var2!=expr2}   or         {var1==expr1;  var2!=expr2}

 

is true when all of the tests are true.

 

            Remark: Because a series of disjunct tests has to be considered as one test with one resulting value, in a combination of the "or"-operator with the conjunction operator the former has precedence over the latter. Therefore, the compound test:

 

                        {var1=='1' | var2=='2' & var3=='3'}

 

            only succeeds when var3 has the value '3' and either var1 has the value '1' or var2 has the value '2' (or both).

 

A mixture of tests and assignments, just as other actions, may also be conjoined by means of the conjunction operator:

 

            {var1==expr1;  var2=expr2}      or        {var1==expr1 & var2=expr2}

 

            Style recommendation: It is recommended that the "&" variant is used for the conjunction of two tests and the ";" variant for the conjunction of two actions where at least one of the actions is not a test. For example:

 

                        {var1==expr1 & var2==expr2;  var3=var1||var2}

 

In this manner the Boolean outcome of the conjunction of two tests is emphasized by means of the sign "&" since the ampersand is often used in mathematics to stand for the Boolean "and"-operator.

 

            Remark: The combination of a test and an assignment has the appearance of an "if‑then" statement in a programming language, but note that if the test fails, the parse fails as well.

 

            You can solve this situation by adding an identical alternative to the rule in which you change the test to the opposite meaning (i.e. "==" becomes "!=" and "!=" becomes "=="). For example, consider the following sketch of a rule:

 

                        aa         ::= ...  ( bb | cc {p='1'} ) {p=='1'; q='2'}  ... .

 

            If you intend this rule to mean "if variable 'p' has obtained value '1', the variable 'q' becomes '2', else 'q' remains unchanged", you should change it as follows:

 

                        aa         ::= ...  ( bb | cc {p='1'} ) {p=='1'; q='2'}  ... |

                                          ...  ( bb | cc {p='1'} ) {p!='1'}           ... .

 

            This problem will become more obvious when non-terminals with parameters are introduced.

 

As a final example, we present a grammar for palindromes (lines that are the same when read from right to left). Such a grammar could be written using local variables:

 

            Palindrome        ::=         ( Char {head = head || %} )+  ( Char )? ,

                                                            ( Char {tail = % || tail} )+ {head == tail} .

 

 

3.1.2        Messages

 

Messages can be used to mark the relevant points in the recognition process by which other (external) processes can be triggered. Messages provide for event-based processing in general purpose programming languages.

When messages are used in a grammar, an output file with the extension .REP (also called "report file") is created in which the messages are written. There are two variants: simple messages (section 3.1.2.1 Simple messages) and messages with variables (section 3.1.2.2 Messages with variables).

 

 

3.1.2.1    Simple messages

 

There are two message functions: signal and report. A message is characterized by a function indication ("s" for signals and "r" for reports) followed by a colon and a positive, non-zero number:

 

            { s: <number> }            or         { r: <number> }.

 

Also, in Transform notation:

 

            Message           ::=         Signal | Report .

            Signal               ::=         'S_or_s', ':' , Number .

            Report              ::=         'R_or_r',  ':',  Number .

            Number            ::=         [1-9]  ( [0-9] )* .

            R_ or_ r                       ::=         'R' | 'r' .

            S_ or_ s                        ::=         'S' | 's' .

 

            Remark: The number "0" cannot be used as a signal or report.

 

Signals or s‑messages only report the number and a colon (":"); reports or r‑messages report the number followed by a colon plus the most recently matched terminal symbol. In the case of subsequent r‑messages with the same report number, the report number is given once and the reported symbols are concatenated (even if they were not contiguous in the input).

 

R‑messages always output the character that has most recently been read. Thus, if a report follows a symbol or regular expression that covers no input (as may be the case in the "( )" and "( )* " expressions), the character that was read before the regular expression will be output.

 

In the grammar:

 

            String                ::=         ( Word {S:2},  Blank )+ .

            Word                ::=         ( Char )+ .

            Char                 ::=         [a-z] {R:1} .

            Blank                ::=         #x20 .

 

any sequence of lowercase alphabetic characters will be recognized as a string and will return as output a report file with two lines. The first line will consist of the report number "1" followed by a colon and all recognized characters, since subsequently reported terminal symbols with the same number are concatenated. The second line will consist of the message number "2" followed by a colon (and nothing else). For example, for the string "abc def " the output will be:

 

            1:abc

            2:

            1:def

            2:

 

If, in the same grammar, the signal "S:2" in the first rule is omitted, the output of the same input will be:

 

            1:abcdef

 

The characters written to the report file by means of r‑messages will not be converted to a readable format. This only has consequences if the input file contains control characters (characters representing the decimal ASCII codes smaller than 32 or greater than 126) and these characters are emitted by r‑messages. A character in the input matching a terminal symbol specified in the grammar as:

 

 ... #x04 {R:4} ....

 

will result in a non-printable character in the report file after the number "4:". To avoid non-printable report files, a signal can be emitted instead of a report for those control characters (for example with a number equal to its ASCII code):

 

 ... #x04 {S:4} ....

 

So the general rule is: characters emitted by r‑messages are equal to the corresponding characters in the input. The only exception occurs when the switch CASE_ CONVERT has either the value LOWER_ TO_ UPPER or the value UPPER_ TO_ LOWER. In those cases letters will not be emitted in their original format but in their converted one.

 

            Remark 1: Although the carriage return character can be represented in the grammar as "cr", the emission of this character to the report file causes a carriage return.

 

            Remark 2: If a parse is ambiguous, messages can be produced in a unpredictable order (see also section 5.4.1 Report of messages).

 

More examples of messages in a grammar rule will be given in section 3.1.3 (Places of actions in rules).

 

 

3.1.2.2    Messages with expressions

 

A variant of the simple r‑messages or s‑messages can be used to communicate expressions. In this use, the report or signal number is followed again by a colon followed by a variable expression:

 

            {r:<number>:<expression>}       or         {s:<number>:<expression>}.

 

For more details about the use of variables and expressions, see section  3.1.1.1 (Variables, literals and expressions).

 

The effect of this action is that as soon as the input is recognized up to the point of the action, the current value of the expression will be written to the report file (preceded by the number and a colon). The same difference in effect that applies for simple reports and signals is found for signals and reports with a variable expression; the resulting expressions of an r‑message are concatenated as long as the same report number is emitted (the report number itself appears only once in the output). For this reason the signals with expressions are of more practical use; each separate emission appears on a separate line in the report file.

 

Examples:

 

            {r:1:var1}

            {r:2:'abc'}

            {r:3:var1||'abc'}

 

The report files for the examples above will contain respectively:

 

            1:...                   (the value of the variable with the name "var1")

            2:abc                (the literal string "abc")

            3:...abc             (where “…" stands for the value of the variable with the name "var1")

 

            Remark: Due to an error in the run time system, messages with expressions sometimes yield incorrect output when used inside a repeated regular expression.

 

 

3.1.3        Places of actions in rules

 

Actions can be written after symbols (terminals, non-terminals and the symbol types that will be introduced later on), after regular expressions and between the closing bracket of a regular expression and the repetition indicators star ("*") or plus ("+"). There is also a possibility for an alternative consisting of only an action; this alternative must be the first one. These possibilities will be discussed in the sections below.

 

 

3.1.3.1    Actions following symbols

 

Actions written after a symbol (terminal, non-terminal or one of the symbols to be introduced in section 3.2 Symbol type extensions), will be performed as soon as the symbol has matched the input. For example, in the rules:

 

            Input     ::=         ( Word {var1=%};  Blank {R:1:var1} )+ .

            Word    ::=         ( [a-z] )+ .

            Blank    ::=         #x20 | #x0B .  /* space or tab */

 

upon recognition of the input "horse", the variable "var1" will be assigned its final value ("e"); this value will be emitted to the report file after recognition of the space.

 

Actions will also be executed when the symbol itself has covered no input. For example in the rules:

 

            Number            ::=         Digit,  Mantisse {R:1} .

            Digit                 ::=         [0-9] .

            Mantisse           ::=         ( '.'  Digit )? .

 

the output in the report file of the input "2" will be "1:2", because the last read input character was a "2".

 

However, if the action is placed after a symbol inside a regular expression that can cover no input the action is not performed. Therefore in:

 

            Input                 ::=         ( ( Determiner {S:1},  Blank )?,  Noun,  Blank )+.

            Determiner       ::=         'a' | 'the' .

            Noun                ::=         'text' | 'book' .

 

upon the input "a text book " the report file will consist of only one "1:".

 

As a special case, we consider actions after wildcards. Actions placed after a wildcard that can cover a whole sequence of characters will be repeated as many times as the wildcard matches the input. For example in:

 

            Comment          ::=         '/*', = {R:1}, '*/'.

 

the report file will contain the number "1", followed by a colon and all terminal symbols matched by the "arb" are reported. The same effect is achieved when a message follows a "line" ("‑"). The dontcare symbol matches only one character, so an r‑message placed behind the dontcare symbol will also report only one character.

 

            Remark: If a wildcard ("arb" or "line") matches no terminal symbols, the "last read character" will contain the character that was matched before the wildcard. An illustration in the case of r‑messages can be found in the preceding example (a ‘*’ will be emitted). The same can be said about the reserved variable "%", occurring in actions after a wildcard.

 

 

3.1.3.2    Actions following regular expressions

 

Actions following a regular expression are executed only when the regular expression has ceased matching input. Thus, in the case of a r‑message, the most recently matched character is reported. It is not the string which has been matched by the regular expression.

 

For example, in the grammar:

 

            String                ::=         ( Word,  Comma )+ {R:1} .

            Word                ::=         ( [a-z] )+ .

            Comma             ::=         ',' .

 

the input "abc,pqr," will result in the following report file:

 

            1:,

 

A special case is formed by messages, assignments and tests containing the reserved variable "%" following a regular expression that does not cover any input. The same which has been said before about actions after symbols that cover no input, also holds for the regular expressions "(…)?" and "(…)* ". In the next grammar with messages, three different input strings are considered to illustrate the effects:

 

            Input     ::=         Bs,  ( 'a' )* {R:1} .

            Bs        ::=         ( 'b' )* {S:2} .

 

            input                 output (in the report file)

            ─────────────────────────────

            aa                                 2:

                                                1:a

 

            bb                                 2:

                                                1:b

 

            bbaa                             2:

                                                1:a

 

 

3.1.3.3    Actions as part of the repetition of regular expressions

 

An action between the ")" and the " * " or "+" differs from the situation in which the action is placed after the " * " or the "+" that concludes the regular expression.

 

For example, in the grammar:

 

            String    ::=         ( Char ){R:1}+ .

            Char     ::=         [#x00-#xFF] .

 

the report is given when the expression is repeated or left. Thus, the report file will consist of the number "1" followed by a colon and the recognized characters, which are concatenated.

As an illustration of the difference between a message action placed before and a message placed after the " * " of a regular expression, we consider the effects of two slightly different grammars with a r‑message:

 

                                    Example 1                                             Example 2

 

                                    Ss         ::= Digit, ( Char ){R:1} * .          Ss         ::= Digit, ( Char )* {R:1} .

                                    char      ::= 'a' | 'b' .                                Char     ::= 'a' | 'b' .

                                    Digit     ::= [0-9] .                                  Digit     ::= [0-9] .

 

            input                 output example 1                                   output example 2

            ─────────────────────────────────────────────────

            1                      (no output created)                                1:1

            1a                     1:a                                                        1:a

            1ab                   1:ab                                                      1:b

 

 

3.1.3.4    Actions in empty units

 

As explained above, variables placed in regular expressions that can be skipped remain unchanged. With the help of an action in an otherwise empty unit, one can force a value to be given to a variable, for example:

 

            QuotedString     ::=         {q='"'}   ( [a-z] {q=q||%} )+ {R:1:q||'"'} .

 

            Remark 1: Only one empty unit may be added to a rule or a regular expression and that must be the first one. The compiler halts with an error when a rule contains more than one empty unit or when the empty unit is written in an illegal position.

 

            Remark 2: In order to simulate an "if‑then‑else" construction for tests (see also section 3.1.1.4 Combinations of tests and/or assignments), one can make use of so-called empty rules. For example, as an alternative for the example in that section, one could write:

 

                        Aa        ::= ...  ( Bb        |

                                                Cc {p = '1'}

                                                ) ,

                                                            ( Empty {p ='1'; q = '2'} |

                                                              Empty {p!='1'}

                                                            ) .

                        Empty   ::=         .

 

PMSUB2.WP bevat: - chapter 2: BASIC SYNTAX


3.2     Symbol type extensions

 

In the basic syntax, only two symbol types were distinguished: terminals and non-terminals. In this section, other types of symbols are introduced; some are entirely different from the two basic ones, other are derivates. In section 3.2.1 non-terminals with parameters  will be introduced. The use of parameters or features transforms a grammar into an attribute grammar. Related to the use of parameters is the introduction of a new symbol type: the lexicon symbols. A description can be found in section 3.2.2 Lexicon symbols. Another extension of the non-terminals, the cover non-terminals, is especially useful in context sensitive rules. It will be discussed in section 3.2.3 Cover non-terminals. The new symbol type introduced in section 3.2.4 Cover non-terminals, the intermediate symbols, is also important for context sensitive grammars.

 

3.2.1        Non-terminals with parameters

 

A non-terminal symbol may be supplied with parameters, which can be viewed as attributes of the symbol. They are declared together in a so-called formal parameter list of a non-terminal symbol on the left-hand side of a context free rule. There are three types of parameters:

 

i (input)                         : taking (inheriting)

o (output)                      : giving (synthesizing)

io (input/output) : taking and giving

 

Once declared, the type of the parameters is fixed. The type of the parameter must be specified in the formal parameter list, followed by a colon, followed by one or more names of parameters, separated by a comma. A parameter list can contain parameters of one, two or three types. There is no prescribed order if more than one type is used. Each type indicator is separated from the preceding list of names by a comma.

 

The syntax of a formal parameter list can be defined in Transform notation as:

 

            FormalParameterList     ::=         '(',  ParameterDeclarations ,

                                                                        ( ',',  ParameterDeclarations )* ,

                                                            ')' .

            ParameterDeclarations  ::=         TypeIndicator,  ':',  ParameterName ,

                                                                        ( ',',  ParameterName )* .

            TypeIndicator                ::=         I_or_I,  ( O_or_o )? |

                                                            O_or_o .

 

            Remark: Again, the keywords "i", "o" and "io" are not case sensitive.

 

Examples of formal parameter lists:

 

            Verb(I:MainVerb)         ::= ...

            Symbol "Verb" has one parameter of which the value is determined in the calling rule; the value of "MainVerb" is not affected after application of the present rule.

 

            VsMod(O:type) ::= ...

            Symbol "VsMod" has one parameter of which the value is initially an empty string; the value of "type" is determined during the application of the present rule.

 

            Noun(IO:cat)    ::= ...

            Symbol "Noun" has one parameter of which the value is initially determined by the calling rule; the value of "cat" may be changed after application of the present rule.

 

            RelClause(I:Antecedent,O:type,IO:cat)   ::= ...

            Symbol "RelClause" has three parameters, one of each type.

 

            NP(O:cat,number,pers)  ::= ...

            Symbol "NP" has three parameters, all of the same type.

 

When the non-terminal is used on the right-hand side of one or more rules, the parameters must be repeated in a so-called actual parameter list. The names of the actual parameters can of course be different from the names used in the formal parameter list. In the actual parameter list, it is allowed to repeat the type but it is not necessary. If the type is indicated on the right-hand side, it must be the same as in the declaration. So, one can choose between, for instance:

 

 ...  NP(cat, number, pers>  ...                and ...  NP(O:cat, number, pers>  ...

 

            Remark: A non-terminal and its parameter list must be considered as a whole; therefore an action can not be placed between the name of the non-terminal and its parameter list, but only after the parameter list.

 

            Restriction: If a non-terminal has a left recursive definition, it can only have output parameters. An example of illegal use of the input parameter is:

 

                        Adjectives(I:gender)      ::=         ( Adjectives(gender) )?,  Adjective(gender) .

 

The use of parameters enables the exchange of information between rules. For example, number agreement of subject and verb can be handled with the use of parameters in combination with a test action (see section 3.1.1.3 Types of tests):

 

            Sentence                       ::=         NP(number1),  VP(number2) {number1==number2} .

            NP(O:number)              ::=         ... .

            VP(O:number)              ::=         ... .

 

where both parameters are declared as output parameters. Another grammar, making use of an input parameter, would look like:

 

            Sentence                       ::=         NP(number), VP(number) .

            NP(O:number)              ::=         ... .

            VP(I:number)                ::=         ... {number==…} ... .

 

            Restriction: Parameters can only be used if FINITE=FALSE.

 

The starting symbol can also have output parameters (but no input or input/output ones). The content of these top level parameters can be written to the report file and/or the screen with the run time switch: OUTPUT_ PARAMETERS.

 

            Remark 1: It is not allowed to compare parameters and/or variables that stem from a different path within the same rule. Thus, the following grammar is illegal:

 

                        Sentence           ::= ( -,  Subj(Sub),  - | -,  Obj(Ob),  - ) {Sub==Ob & Sub!=''} .

                        Subj(O:Sub)      ::= ... {Sub = 'yes'} .

                        Obj(O:Ob)        ::= ... {Ob  = 'yes'} .

 

            Nor is it allowed to rewrite this grammar as follows:

 

                        Sentence                       ::= SubjObj(Sub, Ob) {Sub==Ob & Sub!=''} .

                        SubjObj(O:Sub,Ob)        ::= -,  Subj(Sub),  - | -,  Obj(Ob) , - .

                        Subj(O:Sub)                  ::= ... {Sub = 'yes'} .

                        Obj(O:Ob)                    ::= ... {Ob  = 'yes'} .

 

            Remark 2: As an alternative for the rules in the example given in the second remark on page 30 one would like to write:

 

                        Aa                                ::= ...  ( Bb        |

                                                                        Cc {p = '1'}

                                                                        ) ,

                                                                        Empty(p,q) .

                        Empty(I:p, O:q)             ::=         EE {p =='1'; q = '2'} |

                                                                        EE {p!='1'}           .

                        EE                                ::=         .

 

However, because the compiler is not yet equipped to handle alternative empty paths in combination with in- and output parameters, an error will be generated in cases like this.

 

The same operations that are allowed on variables like assignments and tests (see section 3.1.1 Assignment and tests) are allowed on parameters. In a rule with a formal parameter list, local variables can be introduced as well. Input parameters (i‑ and io‑parameters) are usually used for testing; output parameters (o‑ and io‑parameters) usually get a value (by assignment).

 

            Recommendation: The compiler distinguishes parameters from local variables by examining the formal parameter list. Thus, when a parameter name is misspelled (due to a typing error), it is silently considered a local variable. Therefore, it is advisable to set the compiler switch CHECK_VARIABLES to TRUE especially in early stages of the development of a grammar. Then, for every rule in the grammar, the local variables will be written to the compiler error file.

 

 

3.2.2        Lexicon symbols

 

Grammars that deal with the syntax of natural languages often contain rules, referring to lexical categories. In order to parse sentences one can add rules to the grammar like:

 

            Noun    ::=         'man' | 'child' | 'cat' .

            Verb     ::=         'see' | 'walk' .

 

However, if the applicability of the grammar has to be extended to sentences containing words not yet mentioned, new words must be added to the grammar and the grammar must be recompiled.

 

To avoid this, a new type of symbol was introduced that can be used in Transform grammars, the lexicon symbols. Lexicon symbols are a kind of non-terminal symbols which refer to entries in an external lexicon. A lexicon symbol name consists of a "$" followed by one or more "name characters" (see the first example in section 2.2.2 Ranges), followed again by a (possibly empty) parameter list. Lexicon symbols refer to categories in the lexicon. The following symbols illustrate the appearance of a lexicon symbol:

 

            $Noun(Word, Type)      and  $Verb()".

 

            Remark: A lexicon symbol may be used without parameters, but the parentheses for the parameter list are compulsory.

 

A lexicon consists of a list of entries. Each entry specifies the grammatical category of a word and possibly one or more features of the word, such as syntactic, morphological and/or semantic ones. For the construction of a lexicon, see 9.1 Lexicon.

 

Lexicon entries are not restricted to words; other uses include word stems, word endings and punctuation marks.

 

The match between entries in the lexicon and the words found in the input is case sensitive; using the switch CASE_CONVERT one can force the run time system to treat all capitals in the input in their lowercase form (value: UPPER_TO_LOWER) or to treat all lowercase letters in their uppercase forms (value: LOWER_TO_UPPER).

 

Lexicon symbols are defined as non-terminal symbols with implicit O‑parameters. A lexicon symbol matches any word that occurs in the lexicon with this category. When a lexicon symbol is used with one parameter, this parameter contains the word that matched the symbol. When it is used with more than one parameter, the first parameter contains the word that was matched and further parameters contain the feature values specified in the lexicon, in left‑to‑right order. The number of parameters specified in the grammar may be less than the number of parameters specified in the lexicon, but it may not be more; the number of features specified in the lexicon must be sufficient

 

            Restriction: Lexicon symbols can only be used if FINITE=FALSE, and as a consequence they can not occur in transduction grammars. They can, however, occur at the left hand side of context sensitive rules in a grammar used for recognition, but only without parameters.

 

There are two ways in which lexicon symbols can be used, regulated by the run time switch LEX_ INCREMENT:

 

1.         The run time system scans the input character by character. Lexicon symbols can match any string, provided it is included in the lexicon. The lexicon symbols can be mixed freely with other terminal symbols. Ambiguity with terminal symbols in the grammar is allowed. This option will be chosen if LEX_INCREMENT=TRUE, the default value for this switch.

 

            For instance, suppose a grammar has rules like:

 

                        Noun(O: N)      ::=         Singular(I:N) | Plural(I:N).

                        Singular(O: N)   ::=         $Noun(N).

                        Plural(O: N)      ::=         $Noun(N), 's'.

 

            If the lexicon contains the entries:

 

                        $Noun("sale")

                        $Noun("sales")

 

            then the word "sales" in the input can be recognized as both a Singular and a Plural noun.

 

2.         The run time system scans the input word by word. This option is somewhat more efficient than the first one, but provides less flexibility. It can be selected by setting the run time switch LEX_INCREMENT to FALSE. Each word is looked up in the lexicon. If it is found, its corresponding lexicon symbol(s) are returned and the parse continues.

 

            A word to be looked up in the lexicon is either a sequence of one or more lexicon word characters, delimited by one non-lexicon word character or it consists of one non-blank, non-lexicon word character. The user can specify the lexicon word characters by adding the switch LEX_CHARS to the switches file. If no such switch is provided by the user, the run time system assumes the default value:

 

                        LEX_CHARS ::=        [a-z] | [A-Z] | [0-9] | '''' | '_' .

            Any blank characters (spaces, tabs, end of lines, form feeds) are skipped. Examples of words (the surrounding quotes are not part of the words):

                        "man", "123".

            The following strings are not examples of words:

                        "central heating", "‑123", "…".

            In these examples the default value of the switch LEX_CHARS is assumed.

 

            Note that if this feature is used, any terminal symbols in the grammar (apart from lexicon symbols) will never be recognized.

 

 

3.2.3        Cover non-terminals

 

Cover non-terminals consist of a non-terminal name followed by a cover symbol, a caret ("^"). The non-terminal with this name (without the caret) must be defined in the grammar, i.e. must occur at least once on the left-hand side of a context free rule.

 

Cover non-terminals can only be used on the left-hand side of context sensitive rules. Therefore, their use will be explained extensively in chapter 4 where all kinds of context sensitive grammars are discussed.

 

Upon recognition of a rule in which one or more cover non-terminals occur on the left-hand side, the run time system will replace these cover non-terminals with the terminal symbols they have covered on the right-hand side. For example, if a grammar contains the rules:

 

            aa^,  bb^                       ::=         aa,  cc,  bb .

            aa                                 ::=         ( 'a' )+ .

            bb                                 ::=         ( 'b' )+ .

            cc                                 ::=         ( 'c' )+ .

 

the input "aaacccbbb" will be replaced by "aaabbb" and the input "acb" by "ab".

 

            Remark 1: If the cover non-terminal does not reoccur at the right-hand side of the same rule, the run time system will replace the non-terminal by its defining terminals. For example, if the grammar contains the rules:

 

                        aa^,  bb^           ::=         aa,  cc .

                        aa                     ::=         ( 'a' )+ .

                        bb                     ::=         'bb',  dd .

                        dd                     ::=         'dd' .

                        cc                     ::=         ( 'c' )+ .

 

            the recognized input "aaaccc" will be replaced by "aaabbdd". Note that such a cover non-terminal may not contain ambiguity in its defining rule. If the definition contains ambiguity the compiler will generate an error.

 

            Remark 2: If the non-terminal with a cover sign reoccurs more than once on the left-hand side, the compiler will generate an error. Example of illegal use:

 

                        Aa,  bb^            ::=         aa,  bb,  bb .      /* bb occurs twice. */

 

 

On the other hand, it is allowed to use the same cover non-terminal more than once on the left-hand side.

 

            Remark 3: In transduction grammars (see section 4.3 Transduction grammars), all non-terminals on the left-hand side of rules are cover symbols by default. Thus, it is unnecessary to use the cover sign in transduction grammars.

 

 

3.2.4        Intermediate symbols

 

Intermediate symbols are used in grammars with context sensitive rules or in a context free grammar that is the last one of a cascaded grammar (see section 4.3.6 Cascaded grammars). These symbols are called "intermediate" because they are not terminal and not non-terminal, but something in between. They differ from terminal symbols in that they do not occur in the original input: they are introduced in the left-hand side of a rule. They differ from non-terminal symbols in that they do not occur on the left-hand side of a defining (context free) rule in the grammar.

 

Once introduced on the left-hand side of a context sensitive rule, intermediate symbols may serve as input on the right-hand side of other rules. If the grammar is part of a cascaded grammar (see section 4.3.6 Cascaded grammars), its intermediates may also serve as input for subsequent grammars.

 

The names of intermediates are strings consisting of at least two "name characters".

 

Example:

 

            /*1*/     N1,  hundred,  N2,  ty,  N3,  Blank          <::         N1,  N2,  N3,  Blank .

            /*2*/     Blank                ::=         #x20 .

            /*3*/     N1                    ::=         [1-9] .

            /*4*/     N2                    ::=         [2-9] .

            /*5*/     N3                    ::=         N1 .

 

In this transduction grammar, "[1-9]", "[2-9]" and "#x20" are terminal symbols, "Blank", "N1", "N2" and "N3" are non-terminal symbols and "hundred" and "ty" are intermediate symbols. These intermediates are introduced on the left-hand side of rule 1 and they can serve as input for other rules, such as:

 

            /*6*/     thir,  ty              <::         3,  ty .

            /*7*/     three,  No_ty     <::         3,  No_ty .

            /*8*/     No_ty               ::=         0 | N1 | hundred | Blank .

 

According to this grammar, the transduction of the input "333 " will be "threehundredthirtythree ", as an illustration of the fact that intermediate symbols, like terminal symbols, may be found literally in the printed output.

 

Although intermediates look like terminal strings (however, the single quotes are missing), they can not be split into parts. So, the next rule can not serve as an alternative for rule 6 in the example above:

 

            thir,  't'              <:: 3,  't' .

 

For the use of intermediates in context sensitive grammars for recognition and/or parsing, we refer the reader to section 4.2.3 Intermediates in context sensitive grammars.

 

 

3.3     Negation

 

A useful extension, especially in transduction grammars, is negation. Sometimes it is a rather laborious task to describe a pattern in positive terms. For example, the pattern "every word, except the ones starting with a 'ch'" can only be described in positive terms as "every word starting with a letter not equal to a 'c', or a word that starts with a 'c' but its second letter is not equal to an 'h'". As a Transform rule:

 

            WordNo_ch_Start         ::=         No_c, ( Letter )* |

                                                            'c',  No_h,  ( Letter )* .

            No_c                            ::=         [a-b] | [d-z] .

            No_h                            ::=         [a-g] | [i-z] .

            Letter                           ::=         [a-z] .

 

With the help of the negation sign, the back quote ("`"), this can be written in a simpler way:

 

            WordNo_ch_Start         ::=         Letter |

                                                            `Ch,  ( Letter )* .

            Ch                                ::=         'ch' .

            Letter                           ::=         [a-z] .

 

The negation sign can be used in combination with terminals like characters, strings and ranges, with non-terminals and intermediates, but also in combination with regular expressions. In the next sections all combinations will be explained.

 

The use of the negations sign is subject to the following rules:

 

            Restriction 1: Negations can only be used if FINITE=TRUE, unless the negation concerns one input character.

 

            Restriction 2: Negations can not be used on the left-hand side of context sensitive rules.

 

            Restriction 3: Negations can not be used in combination with lexicon symbols.

 

In section 3.3.3 Negation of regular expressions some restrictions concerning the use of negated regular expressions will be added to this list.

 

 

3.3.1        Negation of terminal characters, ranges and intermediates

 

The negation of terminal characters, ranges or an intermediate symbol is expressed by the negation sign, a back quote (`), before the symbol. It is defined as its complement in the set of which it is a member.

 

For the terminals, this set is by default the ASCII character set #x00-#xFF), but it can be redefined once for a particular grammar, using the compiler switch UNIVERSE. For example, if the switches file contains the line:

 

UNIVERSE = [a-z | A-Z].

"`a" matches an arbitrary letter, but not "a", while in a grammar without the UNIVERSE switch "`a" matches any arbitrary character but not "a".

 

For intermediates the complement is defined in the total set of all intermediates used in this particular grammar. So the negation of an intermediate symbol will be interpreted as "any intermediate symbol, but not the symbol itself".

 

It is possible, however, to treat the terminals and the intermediates as one set of symbols; ths is done by adding the switch SHARED_UNIVERSE to the switches file. If this is done, the negation of a terminal and an intermediate will be interpreted as "any input symbol, but not the symbol itself".

 

 

3.3.2        Negation of non-terminal symbols

 

The negation of a non-terminal symbol is expressed by the negation sign before the symbol. It is defined as the set of strings with the same length as that of the string matched by the symbol, but not equal to it. When there are alternatives, the complement is defined as the set of strings with a length between the length of the shortest and the largest string that is matched by the symbol, but not equal to any string that can be matched by the symbol.

 

For example, if the non-terminal "Input" is defined as:

 

            Input                 ::=         'a',  ( 'b' )?,  'c' .

 

then

           only the strings "abc" and "ac" are recognized as the non-terminal "Input" and

           all strings consisting of two or three characters not equal to "abc" and "ac" are recognized as the negation of the non-terminal "Input" (`Input).

 

 

3.3.3        Negation of regular expressions

 

The negation of regular expressions is expressed as the negation sign before the open bracket. It is defined as the negation of a non-terminal symbol which is rewritten as the regular expression.

 

For example:

 

            `( 'p' ,'q' )*

 

must be interpreted as: every string of an even number of characters (and at least two) that does not contain the combination "pq" starting at an odd position.

 

            Restriction 1: Only regular expressions that enclose the total right-hand side of a rule can be negated.

 

            Restriction 2: Therefore, nesting of negated regular expressions is forbidden.

 

By defining extra non-terminals, it is always possible to overcome these restrictions. For example, the ungrammatical rule:

 

            WordNo_sch_Ending    ::=         ( Letter )+,  `( 's',  'c',  'h') .

 

can be rewritten to the grammatical ones:

 

            WordNo_sch_Ending    ::=         ( Letter )+,  `Sch .

            Sch                               ::=         'sch' .

 

            Remark 1: The interpretation of the negation of the "zero or one" and the "zero or more" regular expressions needs some explanation. The negation of "zero or one times 'something'" is "at least one time anything else but 'something'". So, in fact, the negation of a (…)? regular expression is identical to the negation of a (…) regular expression and the negation of a (…)* regular expression is identical to the negation of a (…)+ regular expression. For example, there is no difference in interpretation between:

 

                        `( (Letter)? )     and       `( Letter )

 

            and between:

 

                        `( Letter )*        and       `( Letter )+ .

 

            Remark 2: In the interpretation of the negation of a regular expression with alternatives (the Boolean "or"), these operators switch their meaning according to the mathematical "laws of De Morgan". For example:

 

                        `( Letter | Number )+

 

            must be interpreted as: one or more characters, not containing any letters and not containing any numbers.

 

 

3.3.4        Negation of a string of terminal symbols

 

Strings of terminal symbols can only be negated indirectly: to negate a string, make it a negated regular expression. For example, the negation of the string "something" can be written as

 

            `( 'something' )

 

It is defined as the set of strings with the same length as the mentioned string, but not equal to it.

 

Examples:

            `( 'abc' )                        matches a sequence of three characters, which is not "abc"

            `( 'ab' | 'bc' )      matches a sequence of two characters, which is not "ab" or "bc"

            `( ‑  [a-c] )        matches any sequence of characters, which does not end in "a", "b" or "c"

 

 

3.4     Error recovery

 

Usually, while running a compiled grammar, the run time system will stop if an error is detected in the input (i.e. when the input does not conform to the grammar). It is, however, also possible to have the run time system resume its operation at a certain position in the output, thus producing an overview of all errors.

 

In order to achieve this goal, the run time switch FIND_ON_ERROR can be added to the switches file. The value of this switch should consist of a set of terminal symbols. Upon an error, the run time system will skip forward to the first terminal in the input that is in the set and still complies with the state of the parse at the time of the error. The parse is resumed starting at the first character found.

 

The terminal characters in the definition of FIND_ON_ERROR should be relatively unique. Often, "cr" is a good choice or, when the grammar deals with natural language sentences, a period (".").

 

There are times when one needs to define the context of the character used for the recovery. For instance, in the Bootstrap grammar (the grammar defining the Transform formalism), skipping to a period only make sense if this period is not part of a comment (otherwise, the entire rest of the input might be recognized as a comment!). For situations like this, the FIND_ON_ERROR switch has its counterpart in a switch with the name SKIP_ON_ERROR; this lists all terminals that can introduce some unwanted domain. In Bootstrap this is the ‘/’.

 

Example:

 

            FIND_ON_ERROR = '.' .

            SKIP_ON_ERROR = '/' .

 

The two switches given in the example above are actually added to the switches file of the Bootstrap grammar. So, when a newly written grammar is checked for syntax errors (see section 7.1 Compiler), a list of all errors that occurred will be made.

 

            Remark: Error recovery is not yet possible if the grammar contains context sensitive rules.

PMSUB3a.WP bevat - chapter 3: EXTENDED FEATURES - 3.1 actions

PMSUB3b.WP bevat - chapter 3: EXTENDED FEATURES - 3.2 symbol type extens.

                                                - 3.3 negation


4     Context Sensitive Grammars

4.1     Description of context sensitive rules

 

A context sensitive rule rewrites the input that fits the right-hand side to the pattern described by the left-hand side. By preference, the rewrite sign between the two sides is "<::". The left-hand side of such a rule may never be empty. In context sensitive rules ordinary terminals, like 'a' or <032> or '.', can be used freely. They can also contain non-terminals and intermediates; their use will be explained in the section for context sensitive recognizing grammars as well as in the section for transduction grammars.

 

There are, however, some restrictions on the use of other symbol types and meta symbols on the left-hand side of the rule. On the left-hand side of a context sensitive rule, the following are not allowed:

 

            1. Regular expressions

            2. The meta symbol "|" (alternative sign)

            3. Ranges and wildcards

            4. Non-terminals with parameters

            5. Negations of any kind

            6. Lexicon symbols with parameters

            7. Actions

 

Some of these restrictions are easy to overcome by the introduction of extra non-terminals. As an illustration, a small (transduction) grammar [5] for the transformation of the letter "c" in one of the possible phonemes "GPhon", "KPhon" and "SPhon" (intermediates) is given. In this example a regular expression and a negation are both replaced by non-terminals:

 

            GPhon                          <::         'ch' .

            SPhon,  e_or_i              <::         'c',  e_or_i .

            KPhon,  Not_e_h_i       <::         'c',  Not_e_h_i .

            e_or_i                           ::=         'e' | 'i' .

            Not_e_h_i                    ::=         `( 'e' | 'h' | 'i' ) .

 

The switches file contains:

 

            UNIVERSE = [a-z] .

 

Intermediates like "GPhon", "SPhon" and "KPhon" will be emitted to the output, resulting in a phonetic transcription of the words.

 

In run time, if a context sensitive rule comes to an end (reduces), the input will be replaced by the pattern described by the left-hand side of the rule; the run time system resumes the parse at the first symbol of that emitted output.

 

 

4.2     Context sensitive grammars for recognition or parsing

 

When context sensitive grammars are used for recognition and/or parsing, the context sensitive rules transform the input in such a way that other rules (either context free or context sensitive) can operate on the result. We will discuss an example (section 4.2.1 Example of a context sensitive grammar used for recognition), and make some remarks about the special role of non-terminals (section 4.2.2 (Cover) non-terminals in context sensitive grammars) and intermediates (section 4.2.3 Intermediates in context sensitive grammars) in this type of grammar.

 

 

4.2.1        Example of a context sensitive grammar used for recognition

 

As an example, consider the following grammar that describes the plural of some English nouns:

 

            /* Grammar for the plural of English nouns */

 

            Input                             ::=         ( PluralNoun,  CR )+ .

            PluralNoun                    ::=         SingularNoun,  's' .

            SingularNoun                ::=         'book' | 'house' | 'spy' .

            'spys',  CR                    <::         'spies',  CR .

 

If the input is a line containing "spies", the third rule will match up to the "sp" in "spy". Since an "i" follows instead of a "y", the only rule that still matches is the context sensitive one. Upon recognition of this rule, the parse is (as it were) resumed at the input position where the context sensitive rule started, but now the input has been transformed according to the matching rules. Thus, starting again after "sp", the third rule will now match the "y", leading eventually to accepting the input. Note the over recognition in this grammar; the incorrect English word "spys" is also accepted as input.

 

The next grammar with lexicon symbols recognizes the same input. Lexical symbols can be viewed as special non-terminals defined as one or more times a dontcare symbol ("(*)+"). So, the word "spies" will be transformed by the context sensitive rule into "spys" and, by the second rule, be recognized as a $Noun, followed by a "s".

 

            Input                             ::=         ( PluralNoun,  CR )+ .

            PluralNoun                    ::=         $Noun(),  's' .

            'ys',  CR                        <::         'ies',  CR .

 

                        Lexicon:

                        $Noun('spy')

 

            Remark: the use of context sensitive rules in grammars used for recognition slows down the run time system.

 

 

4.2.2        (Cover) non-terminals in context sensitive grammars

 

Non-terminal symbols are a special case. When used on the left-hand side of a rule, the result of the transformation will include the symbol as such, not the input that was matched by it. There are cases in which this is exactly what you want; if the rules that will eventually recognize the result of the transformation mention the same non-terminal, they will match at once, without the need of matching the same input once again (thus providing efficiency).

 

Example:

 

            Sentence                                   ::=         Prep,  Subj,  Verb,  Obj .

            Prep                                         ::=         'in the garden ' .

            Subj                                          ::=         'the man ' .

            Verb                                         ::=         'sees ' .

            Obj                                           ::=         'a bird ' .

            Prep,  Subj, Verb, Obj                <::         Subj,  Verb,  Obj,  Prep .

 

In this example, the input "in the garden the man sees a bird" as well as "the man sees a bird in the garden" will be recognized as the starting symbol "Sentence".

 

There are, however, also situations in which you want the non-terminal on the left-hand side to stand for the input it has matched. Such situations include:

 

           the rules  operating on the result of the transformation do not mention the non-terminal, but they do describe the same production.

 

           the grammar is compiled with the compiler switch FINITE=TRUE. In this case, the compiler replaces all non-terminals with their productions. The generated code will therefore only know terminal symbols, and no non-terminals.

 

            Remark: A fatal error occurs in the case of FINITE=TRUE (and TRANSDUCE=FALSE) if a context sensitive rule contains non-terminals without a caret at the left-hand side.

 

Using the cover sign (the caret, "^"), you can mark a non-terminal on the left-hand side of a context sensitive rule as being a cover non-terminal. Upon recognition of the rule, the run time system will replace cover non-terminals with the terminal symbols they have covered on the right-hand side. Note that it is an error to use a cover sign after a non-terminal that does not occur exactly once on the right-hand side of the rule. On the other hand, it is allowed to use the same cover non-terminal more than once on the left-hand side.

 

In run time, cover non-terminals behave as if they were substituted by the sequence of terminals covered by the same symbol on the right-hand side. The following grammar provides an example. It describes perfect forms of Latin verbs of the type LAVDO. Assume a lexicon containing an entry:

 

            $LAVDO(‘navig').

 

The aim of the grammar is to accept the Latin form variant "navigasti" as the same form as "navigavisti":

 

            /* Grammar for perfect forms of Latin verbs of type "LAVDO" */

 

            /*1*/     Perfectum                     ::=         $LAVDO(),  'av',  PerfectumEnding .

            /*2*/     PerfectumEnding           ::=         'i' | 'isti' | 'it’ | 'imvs' | 'istis' | 'ervnt' .

 

            /* Context sensitive rules */

 

            /*3*/     'avi', S_Ending^             <::         'a',  S_Ending.

            /*4*/     'ave',  R_Ending^           <::         'a',  R_Ending .

 

            /* Definitions */

 

            /*5*/     S_Ending                      ::=         'sti' | 'stis' .

            /*6*/     R_Ending                      ::=         'rvnt' .

 

In the case of the input "navigasti", symbol "S_Ending" will contain the string "sti". Application of context sensitive rule 3 will transform the input to "avi" plus the contents of "S_Ending", thus to "avisti". Now the parsing process will restart in the situation that existed when rule 3 started. Whereas rule 1 originally failed (because the input did not contain "av"), it now succeeds.

 

If the cover sign were left out in rule 3, the parse would fail, since rule 1 does not match a symbol "S_Ending".

 

In order to make the grammar work without the cover signs in rule 3 and 4, we would have to change rule 2 into:

 

            /*2'*/    PerfectumEnding           ::=         'i' | 'isti' | 'it’ | 'imvs' | 'istis' | 'ervnt' |

                                                                        S_Ending | R_Ending .

 

 

4.2.3        Intermediates in context sensitive grammars

 

The following grammar[6] shows that it is also possible to use intermediates in recognition grammars. It is a modified version of the grammar given above (see section 4.2.2 (Cover) non-terminals in context sensitive grammars). The purpose of the intermediates VTRACE, ITRACE and ETRACE is to ensure that output parameter "Word" of rule 1 always reflects the original input (note that in case of an input "navigasti" the context sensitive rules transform the input into "navigavisti").

 

            /* Modified grammar for perfect forms of Latin verbs of type "LAVDO" */

 

            /*1*/     Perfectum(O:Word) ::=

                                    $LAVDO(Stem),  'a',  V_or_VTRACE(Ending1) ,

                                                PerfectumEnding(Ending2) {Word = Stem||'a'||Ending1||Ending2} .

 

            /*2*/     PerfectumEnding(O:Ending) ::=

                                    'i' {Ending = 'i'}                         |

                                    'it' {Ending = 'it'}                       |

                                    'imvs' {Ending = 'imvs'}            |

                                    I_ or_ ITRACE(Ending),

                                                'sti' {Ending = Ending||'sti'},

                                                ( 's' {Ending = Ending||'s'} )? |

                                    E_ or_ ETRACE(ending),

                                                'rvnt' {Ending = Ending||'rvnt'}  .

 

            /* Context sensitive rules */

 

            /*3*/     'a',  VTRACE,  ITRACE, S_ Ending^     <::         'a',  S_ending.

            /*4*/     'a',  VTRACE,  ETRACE,  R_ Ending^  <::         'a',  R_ending .

 

            /* Definitions */

 

            /*5*/     S_ending                                   ::=         'sti' | 'stis' .

            /*6*/     R_ending                                  ::=         'rvnt' .

 

            /*7*/     V_or_VTRACE(O:Ending)       ::=         'v'                     {Ending = 'v'} |

                                                                                    VTRACE         {Ending = ''}.

            /*8*/     I_or_ITRACE(O:Ending)          ::=         'i'                      {Ending = 'i'}  |

                                                                                    ITRACE           {Ending = ''}.

            /*9*/     E_ or_ ETRACE(O:Ending)      ::=         'e'                     {Ending = 'e'} |

                                                                                    ETRACE          {Ending = ''}.

 

 

4.3     Transduction grammars

 

Transduction grammars differ in many aspects of grammars used for recognition and/or parsing. Much of what has been said in the preceding chapters about transduction will be repeated here, together with some new information. A description of transduction rules is found in section 4.3.1 Description of transduction rules. In the sections 4.3.2 Non-terminals in transduction rules and 4.3.3 Intermediates in transduction rules, one can find examples of the use of non-terminal symbols and intermediates in this grammar type. In section 4.3.4 Ambiguity and "looping" problems in transduction grammars some problems in connection with transduction will be discussed. Recommendations for the compilation and execution can be found in section section 4.3.5 Compiling and running a transduction grammar. The subject of section section 4.3.6 Cascaded grammars is an extended feature, the cascaded grammars, by which several transduction grammars can be combined.

 

 

4.3.1        Description of transduction rules

 

A transduction grammar is a grammar by means of which an input stream can be transformed into a (different) output stream. The transduction is described by rules, which have the appearance of context sensitive rules.

 

As was mentioned before, transduction grammars do not have a first or starting symbol. They consist of at least one transduction rule and zero or more context free rules; in these rules non-terminals can be defined which are used in the context sensitive rules and/or other context free rules. Each transduction rule consists of a left-hand side and a right-hand side and can be interpreted as "the right-hand side (the input side), when recognized, will be replaced by the left-hand side (the output side)".

 

For the rewrite sign between the left-hand side and the right-hand side in a transduction rule, one can choose between "<::" and "::=", where the former is recommended. If the left-hand side consists of only one symbol, the rule must be marked as a context sensitive rule with "<::".

 

The left-hand side of a context sensitive rule can be empty (a pattern in the input matched by the right-hand side will be deleted); the right-hand side on the other hand must contain at least one symbol.

 

Besides the absence of a starting symbol, in a transduction grammar there is no need to describe the whole input as is the case in a grammar used for recognition and parsing. Only parts of the input may be covered by the rules. Input not described by any transduction rule will be simply emitted to the output.

 

            Remark: If the input contains control characters not described by any rule, those control characters will appear again in the transduction output file (see also section 5.2 Appearance of symbols in output files). Control characters can also be emitted explicitly by the left-hand side transduction rule. These characters will not be transformed by any built-in mechanism. For example, the rule:

 

                        #x16 <:: '16'.

 

            will transform any occurrence of the number "16" in the input file into the control character with ASCII code 16.

 

It is important to be aware of the fact that during run time, as soon as a transduction rule reduces, the input covered by the right-hand side will be replaced by the symbols mentioned in the left-hand side. The run time system resumes scanning the input with the first symbol of the newly inserted ones, thus treating the replaced part as new input.

 

Examples:

 

            /*1*/ 'a'             <::         'a',  'a' . : each "aa" in the input will be replaced by "a".

            /*2*/ 'b'  'a'       <::         'a',  'b' . : each "ab" will be replaced by "ba".

            /*3*/                 <::         'd',  'd' . : each "dd" will be omitted.

 

A grammar consisting of only rule 1 will transform a large number of "a"'s into only one "a". The first "aa" will be replaced by one "a", which together with the next "a" forms a new pair, which will again be replaced by one "a", until finally the last "a" remains untouched. By the same mechanism, a grammar consisting of only rule 2 rewrites an input "aaabbb" to the output "bbbaaa" and an input "cabc" to the output "cbac". Finally, a grammar consisting of only rule 3 transforms an input "cddcd" into "ccd".

 

Transduction grammars can only be compiled to a finite state machine  (with instructions for storage) (if the switch TRANSDUCE is set to TRUE then the default value for the switch FINITE is also TRUE). This implies that no lexicon symbols (see section 3.2.2 Lexicon symbols) can be used in transduction grammars and also that no use can be made of non-terminals with parameters (see section 3.2.1 Non-terminals with parameters). On the other hand, negations (see section 3.3 Negation) can be used freely.

 

            Warning: Some transduction grammars can result in very large programs and need a long compilation time. In developing a large transduction grammar one can reach a point where adding one transduction rule to the grammar causes the compilation to break off with the message that more memory is needed than the computer can offer. There is no simple formula to describe the relationship between the number of transduction rules in the grammar and the number of states to be generated by the compiler. When compilation is no longer possible or even when this process needs too much time, one could consider the possibility of splitting the grammar into parts and building a cascaded grammar (see section 4.3.6 Cascaded grammars).

 

 

4.3.2        Non-terminals in transduction rules

 

Non-terminals used in context sensitive rules in transduction grammars are by default cover non-terminals. There is therefore no need to mark the non-terminals with a cover sign.

 

All non-terminals that are defined in the grammar can be used on the right-hand side of a transduction rule, but on the left-hand side only non-terminals that also occur on the right-hand side of the same rule can be used. Moreover each appearance of a non-terminal on the left-hand side must be unambiguously interpretable in terms of the non-terminals on the right-hand side. If a non-terminal on the left-hand side does not reoccur on the right-hand side or reoccurs more than once on that side, an error message will be generated by the compiler.

 

Thus the rule:

 

            Digit                 <::         Digit,  Digit .

            Digit                 ::=         [0-9] .

 

is not a good example, because it is not clear whether the digit on the left-hand side is the same as the first one or the second one on the right-hand side.

 

A solution for problems like this is often to double the definition for the non-terminals; we can rewrite this grammar as follows:

 

            Digit                 <::         Digit,  DigitScnd .

            Digit                 ::=         [0-9] .

            DigitScnd          ::=         Digit .

 

Note that this problem does not exist in the case of terminals and intermediates (see below).

 

 

4.3.3        Intermediates in transduction rules

 

An intermediate (see section 3.2.4 Intermediate symbols) can only be introduced on the left-hand side of a transduction rule. As soon as it is introduced, the same intermediate can play a role on the right-hand side of other rules (either context sensitive or context free).

 

Example:

 

            /* Part of a grammar by which numbers are translated in Dutch words. */

 

            Digit1, en,  Digit2,  tig    <::         Digit2,  Digit1 .

            Digit1                           ::=         [1-9] .

            Digit2                           ::=         [2-9] .

            twin, tig             <::         '2',  tig .

            twee, en                        <::         '2',  en .

            der,  tig                         <::         '3',  tig .

            drie,  en                                    <::         '3',  en .

 

In the first rule, the intermediates "en" and "tig" are introduced. They are also used in the four last rules. Because there is no physical ordering of the rules in a transduction grammar, it is not necessary to write the last four rules after the first one; the grammar has the same effect if these rules precede the first one. There is, however, an implicit ordering of the rules, determined by the order of applicability of the rules. At the starting moment of the treatment of "32", only the first rule is applicable, and the input is replaced by "2en3tig". At this point the right-hand side of two other rules cover the newly formed input string one after the other and the final output will be "tweeendertig". This string consists of four intermediates: "twee", "en", "der" and "tig".

 

            Remark 1: Do not confuse intermediates with string terminals. In a string terminal like 'this is a string', every character is treated as an entity. An intermediate is always treated as a whole. So, in the example above, the rule:

 

                        twin,  tig            <::         '2',  tig .

 

            can not be replaced by the rule:

 

                        twin,  'tig'          <::         '2',  'tig' .

 

            because the string 'tig' (in the last rule) can not refer to the intermediate with the name "tig" (in the first rule).

 

            Remark 2: If a transduction grammar contains intermediate symbols and at the same time use is made of negations, one has to decide whether or not to use the compiler switch SHARED_ UNIVERSE.

 

 

4.3.4        Ambiguity and "looping" problems in transduction grammars

 

Special attention must be paid to ambiguous transduction. This occurs if two (or more) context sensitive rules describe (part of) the same input and will reduce at the same moment in run time. For example:

 

            /*1*/ 'a',  'c',  'b' <::         'a',  'b' .

            /*2*/ 'd',  'b'       <::         'b' .

 

If (part of) the input is "ab" then as soon as the run time system reads the "b", both rules reduce. Rule 1 rewrites this input to "acb" and rule 2 to "adb" and the grammar creates ambiguous output. By default the switch ONE_ CS_ REDUCE will be set to TRUE for transduction grammars. The effect of this value of the switch is:

 

           if two or more context sensitive rules are applicable at the same moment of reduction, the run time system chooses the one with the longest right-hand side (the number of terminals that have been covered by the rule) and

           if two rules have the same length on the right-hand side, the one that is mentioned first in the grammar text will be chosen.

 

If ambiguous output is really wanted, the compiler switch ONE_ CS_ REDUCE=FALSE can be added to the switches file. In each case, the compiler generates a warning if ambiguous rewritings are to be expected.

 

In designing a transduction grammar, care must be taken not to write transduction rules that will bring the run time system in an endless loop. For example, the following innocent looking rule will cause such a looping problem:

 

            'a',  'a'   <::         'a' .

 

The input of only one "a" will be replaced by two "a"'s, each of which in turn will be replaced by two "a"'s, etc., until the program crashes due to a shortage of memory.

 

Sometimes two different rules can have the same effect:

 

            'a'                     <::         'b' .

            'b'                     <::         'a' .

 

Again, the run time system will never stop to alter each "a" into a "b" and replace the "b" again by an "a".

 

Problems like this can sometimes be solved by introducing a more or less superfluous intermediate, and/or by anchoring the input to an extra "start" and/or "end" symbol. For example, the "a‑problem" can be solved by means of the next rules:

 

            'a',  treated,  'a',  treated,  End    <::         'a',  End .

            End                                          ::=         'a' | ' ' .

 

The output "atreatedatreated " of the input "a " can not be rewritten again, because no part of the right-hand side of the transduction rule covers the newly formed output. By doing this, we can prevent a looping problem, but at the same time we create a new problem: this output is not what we would like. Adding the rule:

 

            'a'                     <::         'a',  treated .

 

will bring us in a loop again. However, there is a solution in the form of cascaded grammars (see section 4.3.6 Cascaded grammars).

 

 

4.3.5        Compiling and running a transduction grammar

 

If the grammar starts with a transduction rule, the compiler automatically sets the switch TRANSDUCE to the value TRUE. If, however, the grammar is meant to be a transduction grammar, but starts for some reason with a defining (context free) rule, one needs a switches file; this file must contain at least the switch TRANSDUCE=TRUE. By default the switch FINITE will be set to TRUE. The default value of the switch ONE_ CS_ REDUCE is also TRUE for transduction grammars, so in the default case no ambiguous rewriting is to be expected.

 

Adding run time switches to the switches file is usually not necessary. While running the grammar, the run time system will write the output to the .TDO file (transduce output file). If the grammar contains no ambiguous rewritings, this file consists of the rewritten output. If, however, during compilation time a warning has been generated that ambiguous rewritings could be expected, the .TDO file is not directly readable and must be transformed by means of the program Drawgraph (see section 9.3 Drawgraph).

 

 

4.3.6        Cascaded grammars

 

It is possible to split a grammar into two (or more) grammars, and to link these grammars together into a so-called cascaded grammar. The run time system will treat the input initially with the first grammar and will use the resulting output as input for the second grammar, and so on. The linking can be performed by the program Plink (see section 9.4 Plink). All grammars have to be transduction grammars, except the last one. A cascaded grammar can consist of a maximum of 25 individually compiled grammars.

 

The a‑problem in the example on page 46 can be solved in the following way:

 

Grammar1:

 

            'a', treated, 'a', treated,  End       <::         'a', End .

            End                                          ::=         'a' | cr .

 

Grammar2:

 

            <::         treated .            /* This transduction rule has an empty left-hand side */

 

After linking to a cascaded grammar, the run time system will transform the input without the superfluous intermediate "treated".

 

            Remark: By linking to the cascaded grammar, each one of the composing grammars knows the intermediates of other grammars. However, they do not know each other's non-terminal symbols.

 

Cascaded grammars can also be helpful in other cases:

 

1.         In developing a very large transduction grammar (with a long compilation time), adding new rules is sometimes difficult because one has to wait a long time to test the effect. In cases like this one can write a small grammar, consisting of one or more new transduction rules, and link the two grammars together. Later on, the new grammar can be added to the existing one, and recompiled to one grammar.

 

2.         Some transduction grammars are so large that they need more memory during the compilation than the computer can offer. In this case you also can split the grammar into two or more smaller parts, compile them independently, and link them together.

 

If a grammar does not contain any intermediates but will be part of a cascaded grammar in which the preceding grammar outputs this kind of symbol, it must be compiled with the switch INTERM_ INPUT  set to TRUE.

 

            Remark: If a transduction grammar can produce ambiguous rewriting it can only be part of a cascaded grammar if it is the last one. Otherwise the program Plink will generate an error message.

 

The last grammar of a cascaded grammar can be a transducer or a recognizer/parser. In the latter case, one can transform the input by means of the transduction grammars followed by a check on the rewriting. However, such a configuration does not result in a file with the rewritten input (the file with extension .TDO) but an output as requested by the last grammar; rejecting/accepting the transformed input in case of a recognizer, a parse tree in case of a parser and eventual output written by output instructions.

 

 

5     Run Time Input and Output files

 

When the run time system performs the instructions of a compiled grammar on an input file, it always creates some kind of output. This output varies from a simple message on the screen like "input correct" to large new files. The form of the output is partly dependent on the kind of grammar and the features used in it and partly on the values of the run time switches. In this chapter some remarks will be made about output related to parse trees (section 5.3 Parse tree output), reports (section 5.4 Report output) and errors (section 5.5 Error output). These sections will be preceded by section 5.1 Input files in which the input files are discussed and by section 5.2 Appearance of symbols in output files with some general remarks concerning the appearance of different types of symbols in output files. Details of the output of transduction grammars can be found in section section 4.3.5 Compiling and running a transduction grammar.

 

For a description of the remaining messages on the screen during run time, the reader should refer to section 7.2.2 Screen output and keyboard input during the execution phase.

 

 

5.1     Input files

 

Every file can be analyzed or transformed by means of a suitable compiled Transform grammar as long as it is a file of characters. There are no constraints on its content; it can consist of all characters represented by an ASCII code from 0 to 255 (decimal).

 

It is also possible to use the run time system interactively (see section 7.2.1 Commands for the run time system in batch). In that case the input will be read from the standard input file, the keyboard.

 

Recognizers and parsers can define different scopes of the input by means of the definition of their starting symbol:

 

           Some grammars give a description of a whole file. This is the case, for example, if one wishes to analyze documents consisting of subdocuments, paragraphs and articles with titles, headings etc. Typically, the first rule in such a grammar is a rule like:

 

                        File                   ::=         ( FilePart )+ .

 

           The main purpose of other grammars is to analyze smaller units of a file. For example, with a grammar for the syntactical analysis of English sentences, one is primarily interested in the parse tree of individual sentences. A typical first rule could be:

 

                        Sentence           ::= ... .   /* syntactic description of a sentence */

 

            Of course, one can widen the scope of such a grammar by first adding another rule (making the original first rule the second one):

 

                        File                   ::=         ( Sentence,  cr )+ .

 

            When parse trees have to be built, adding such a rule slows down the speed of the parsing process dramatically [7]. A better solution is to use the switch SINGLE_ LINES, by which one can force the run time system to treat every single line in the input as a separate unit for the parsing process. In such a situation there are two limitations:

 

            1.         The grammar must not contain the special terminal "cr" or the terminal "<013>".

 

2.         Every sentence in the input file must be written on a separate line, terminated by an end of line symbol; the sentence itself may not contain any other end of line symbol.

 

For the scope of transducers we refer to what has been said on page 44.

 

One can also adjust the run time system to skip and/or convert some of the characters in the input with the switches IGNORE_ BLANK, IGNORE_ CR and CASE_ CONVERT.

 

 

5.2     Appearance of symbols in output files

 

Depending on the grammar type and the setting of some relevant switches, the run time system emits characters originating from the input file to one or more output files; terminal characters appear in parse trees, in reports, in messages with variables, in reported output variables of the starting symbol, in transduction output and in error messages. In three cases, an emitted character is not equal to the corresponding input character:

 

           If the conversion switch CASE_ CONVERT is set to one of the values UPPER_ TO_ LOWER and LOWER_ TO_ UPPER, the run time system emits letter characters only in their converted format. This affects not only the appearance of terminal characters and strings in parse trees for example, but also corresponding reports and variables, written to the report file.

 

           If one of the switches IGNORE_ CR or IGNORE_ BLANKS has the value TRUE, end of line symbols or spaces and tabs will not appear in any output.

 

           If the input file contains control characters (characters with a decimal ASCII code smaller than 32 or greater than 126), those characters will eventually be emitted as control characters only to the report file or the transduction output file. Control characters will be written to all other relevant files (general output file, run time error file and the screen or batch output file[8] as a number between angle brackets, the number being the ASCII code of the character[9]. The user can influence the overall base of the number format of the control characters for all those files at the same time. This base is equal to the value of the run time switch RUN_ BASE  or, if this switch has not been added to the switches file, equal to the value of the switch PARSE_ BASE during the compilation of the grammar [10]. The default value for both switches is HEX.

 

The possible differences between input and output format applies only to terminal characters and strings. This also includes words that match lexical entries.

 

Because non-terminals and intermediates find their origin in the grammar, if they appear in parse trees their names will be written exactly as they were defined by the grammar writer.

 

5.3     Parse tree output

 

Parse trees can only result from grammars with a starting symbol. While treating the input, the run time system is able to build a tree-like structure in which the leaves carry the input characters and the nodes are labeled with the names of the encountered non-terminal symbols. The root of the tree always consists of the starting symbol.

 

The resulting parse trees will be written to the general output file. If the run time switch CPU_ LOG has one of the values TO_ FILE  or TO_FILE_SCREEN, each parse tree in this file will be preceded by a line containing the CPU time used for the parse.

 

By default, the run time system only builds (parts of) parse trees if it needs to store information concerning treated input necessary for the recognition of the remaining part of the input. This need exists, for example, in cases of (temporary) ambiguity of the parse. The building of the internal structures that correspond to those parse trees costs memory. Therefore, as soon as the structure is no longer needed, it will be destroyed.

 

However, by adding the switch BUILD_ PARSETREE=TRUE to the switches file, the run time system can be instructed to build the total parse tree, to store it and to emit the result for inspection. A file with the extension .OUT (also called the general output file) will be created for this purpose.

 

If no precautions are taken during compilation time, parse trees can be incomplete; unless the compiler switch PREPARE_ PARSETREE has been set to TRUE non-terminals that are defined by means of unit rules will not occur in the parse tree.

 

If the grammar contains messages, the report and signal numbers will also occur in the emitted trees. At the same time, a separate report file is created (see section 5.4 Report output).

 

The number of parse trees that will be written to the file depends on the outcome of the parsing process:

 

           There will be no resulting parse tree if an error is found in the input during the parse. In this case, the run time system will have created an error file in which the exact position of the error can be found and the surrounding input.

           If the input can be described in more than one way in terms of the grammar, the output will consist of two (or more) alternative parse trees. In the output file these parse trees are marked with a "#" followed by a number.

           In the case of one possible description, there will be one parse tree.

 

If an input file consists of lines that must be parsed separately by the run time system and consequently the switch SINGLE_ LINES=TRUE has been added to the switches file, the parse tree output of each line will be appended to the output of the preceding parse. For some applications it is useful to separate different kinds of parse tree output. If the run time switch SPLIT_ OUTPUT is set to TRUE, the run time system creates three output files instead of one: non-ambiguous parse trees will still appear in the file with extension .OUT, but the ambiguous parse trees will be written to a file with extension .AMB and a file with extension .NOP (no parse tree) will contain the original lines in which an error occurred during the parse.

 

The resulting parse trees have a very simple lay-out. For example, an input file PLATES.TXT containing:

 

            12‑34‑AB

            AB‑12‑34

 

parsed by the grammar LICENCE.GRM (mentioned in section 1.1 Hands-on and for this purpose compiled with the switch PREPARE_ PARSETREE=TRUE) resulted in the following parse tree (in the file PLATES.OUT):

 

            < LicencePlates:(LicencePlate:(Digits:(1 2)

                                           Hyphen:(‑)

                                           Digits:(3 4)

                                           Hyphen:(‑)

                                           Letters:(A B)

                                          )

                             cr:<13>)

                             LicencePlate:(Letters:(A B)

                                           Hyphen:(‑)

                                           Digits:(1 2)

                                           Hyphen:(‑)

                                           Digits:(3 4)

                                          )

                             cr:<13>)

                            )

            >

 

Using the program Drawtree (see section 9.2 Drawtree), one can transform a file with parse trees into another format.

 

 

5.4     Report output

 

The run time system creates a report file (a file with the extension .REP) in two cases: if the grammar contains messages or if the starting symbol of the grammar has output parameters. Therefore, output in a report file can consist of:

 

            ‑ signals

            ‑ reports with (a concatenation of) input characters

            ‑ signals and reports with expressions

            ‑ output parameters of the starting symbol.

 

The variables and parameters in turn are built from literals (mentioned in actions) and/or characters found in the input file. In general, the characters resulting from the input will be emitted unchanged, so a report file can contain non-printable characters. If the switch CASE_ CONVERT has a value other than NONE, letters will appear in the report file in their converted form (which may be different from their appearance in the input file).

 

 

5.4.1        Report of messages

 

If a grammar contains messages (reports and/or signals, with or without variables, see sections section 3.1.2 Messages

 and section 3.1.2.2 Messages with variables), every line of the resulting report file starts with a number, followed by a colon (":"). In the case of simple reports, the colon is followed by the concatenation of all the input characters subsequently emitted with the same report number. In the case of messages with variables, the colon will be followed by the string value of the expression occurring in the grammar after the report number. In the sections mentioned above and in section 3.1.3 Places of actions in rules, the reader will find various examples.

 

The resulting report file can be used as input file for other programs. For example, if parts of the input file are transferred to the report file by means of r‑messages, a simple program can be used to reconstruct an extract of the original file.

 

By default, the run time system will emit the messages online as long as the parse is not ambiguous. In case of temporary ambiguity, the run time system will stop the emission and stores the messages in parse trees. As soon as the parse becomes unambiguous again (and only one parse tree is left), the gathered messages will be written to the report file, and the superfluous parse tree will be destroyed (unless the switch BUILD_ PARSETREE has the value TRUE). There are two situations in which problems will arise; in both ambiguity of the grammar plays the crucial role:

 

           The input file is very large and for a substantial part the parse is temporarily ambiguous. In this situation the intermediate parse trees can not be broken down and must be constantly extended. This results in a slowing down of the parse and eventually to a crash of the run time system because of shortage of memory. In such a case the user must make his grammar as disambiguous as possible.

 

           The grammar contains inherent ambiguities, but this fact will be detected by the run time system after the emission of the first message output. As a consequence of the ambiguity, two or more parse trees will be built, but when the parse comes to an end and the stored message information has to be emitted, the run time system has no remaining knowledge of the already emitted messages. The messages belonging to all but the first parse tree will be written without the leading ones. In this case, the run time system will generate a warning. The resulting report file will be useless for most subsequent applications.

 

            At this moment the user can only try to overcome this problem by altering his grammar. In the near future however, the run time system will be adapted in order to cope with situations like this in a correct manner.

 

 

5.4.2        Report of output parameters

 

If the starting symbol of the grammar has one or more output parameters (see section 3.2.1 Non-terminals with parameters), the content of these resulting parameters will only be written to the report file if the switches file contains the switch OUTPUT_ PARAMETERS set to the value TO_FILE or TO_FILE_SCREEN.

 

Upon recognizing the starting symbol, the run time system writes the contents of these variables to the report file. The value of the first parameter is prefixed by "1:", the second by "2:" and so on. If the parse was ambiguous, the values are repeated for every possible analysis. The last parameter of the last analysis is followed by a line containing "0:". This "0:" line can be used as a delimiter in the case of SINGLE_LINES=TRUE.

 

This feature is applied in a grammar for the morphological analysis of Latin words (a small part of this grammar was presented as an example in section 4.2.3 Intermediates in context sensitive grammars).

 

 

5.5     Error output

 

The run time system will only create an error file (a file with the extension .ERR) if errors are found in the input file. Therefore, such a file can never be the result of a transducer, but only of a recognizer/parser.

 

If no FIND_ON_ERROR switch (see section 3.4 Error recovery) has been added to the switches file while running the grammar, the run time system will halt as soon as an error has been detected in the input. In the run time error file the user can read the exact position of the error in the input file (line number and position in the line), together with the context and a list of the expected terminals at the moment the error was found.

 

If the switches file contains FIND_ON_ERROR, the run time system will list all detected errors in the manner described above, supplemented by the string of characters that has been skipped and the character where the parse was resumed.

6     Switches (Directives)

 

Both the compiler and the run time system can be adapted to the wishes of the user; some options can be specified to modify the behavior of the programs to some extent. These options (or switches) are specified in a so-called switches file for the compiler and the run time system. The switches file has the extension .SW. Each switch has a default value; if a switch value is not specified by the user, this default value is assumed. The default values of the switches will be specified below; they can also be found in the USER.SW file.

 

The names of the switches (and of their values) are not case sensitive: uppercase and lowercase characters may be used. Most of the switches are Boolean: they are either TRUE or FALSE. Such Boolean switches may be specified as TRUE by merely stating the name of the switch and as FALSE by NO plus the name of the switch, for example:

 

            FINITE=TRUE             or         FINITE

            FINITE=FALSE           or         NOFINITE

 

Other switches require an integer value (default value 0, other possible values depending on the switch) or a string value. For example:

 

            TRACE_ ERROR = 1

            LEXICON = MY_LEXICON

            CASE_CONVERT = UPPER_TO_LOWER

 

There are also switches that resemble grammar rules. The value of those switches consists of an enumeration of terminal symbols. For example:

 

            FIND_ON_ERROR = '.' | CR .

            UNIVERSE = [a-z] | [A-Z] | ….

 

It is possible to add blank lines and/or comments to a switches file.

 

In section 6.1 Compiler Directives the switches for the compiler are described. Section 6.2 Runtime Directives covers the switches for the run time system. In both sections the switches are listed with their domain of values (if no value domain is given the switch is a Boolean one with the values TRUE and FALSE), the default value and an explanation of their meaning and operation.

 

Not all switches are relevant for ordinary users; those which are, are listed here. Other switches are only relevant for the programmers of the Transform system; these switches can be found in Appendix A.

 

For a quick overview of the related switches while preparing a grammar for compilation or execution, one can refer to the checklist presented in section  10.3 Checklist for switches.

 

6.1     Compiler Directives

 

The switches are grouped together according to the different functions they serve.

 

6.1.1        GUI: Advanced Mode, Standard fields: Compiler

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6.1.1.1    Switch: FINITE

related to:          grammar types

default value:     FALSE if TRANSDUCE=FALSE

                        TRUE  if TRANSDUCE=TRUE

callout:              "Generate a finite state automaton"

explanation:

            If FINITE=TRUE the grammar will be compiled into a finite state automaton (see section 1.4.1 Finite state automata). If the run time system will need a stack for running the compiled grammar the switch must be set to FALSE.

 

            Transduction grammars can only be compiled to finite state automata (with instructions for operations on a data structure); the default for FINITE is therefore TRUE if the switch TRANSDUCE has been added to the switches file. For other types of grammars the default is FALSE.

 

            The value FALSE is the correct value in case of recognizing grammars containing:

            ‑ nested recursion (see also the switch RECURSION_ MESSAGE in section Switches for grammar debugging)

            ‑ non-terminals with parameters

            ‑ lexicon symbols

            ‑ actions with variables (assignments, tests and reports with variables).

 

            However, one must add FINITE=TRUE for recognizing grammars containing negations.

            FINITE can be FALSE when negations only appear in unit rules.

 

            The interaction of the above restrictions may give rise to conflicting values of FINITE. Therefore, in grammars with variables and in recursive grammars, negations may only occur in unit rules without actions.

 

6.1.1.2    Switch: ONE_CS_REDUCE

related to:          grammar types

default value:     FALSE if TRANSDUCE=FALSE

                        TRUE  if TRANSDUCE=TRUE

callout:              "Solve conflicts between CS rules"

explanation:

            This switch only has effect if there are two or more context sensitive rules in the grammar. Moreover, the right-hand side of one of those rules must cover input that is also totally covered by the right-hand side of another rule or is also covered by an ending part of the right-hand side of another rule. If ONE_ CS_REDUCE is FALSE, the existence of such (partly) overlapping rules will result in ambiguous parse trees or ambiguous rewritings. If, however, ONE_CS_REDUCE is set to TRUE, no ambiguity will occur, because some precautions will be taken:

           When there are two or more active transduction rules which can reduce at the same time, only the rule with the longest right-hand side (in terms of the number of covered terminals) will be reduced.

           When there are two or more reducable rules with the same length, the rule which is first mentioned in the grammar is chosen.

 

6.1.1.3    Switch: ONE_ALTERNATIVE

related to:          parse trees

default value:     FALSE if TRANSDUCE=FALSE

                        TRUE  if TRANSDUCE=TRUE

callout:              "Solve ambiguity for negation/coordination"

explanation:

When the grammar contains negated regular expressions and/or negated non-terminal symbols, different parse trees will be built during run time, even if the grammar is not inherently ambiguous. In the case of transduction grammars ambiguous transduction output will be generated. In order to avoid this, ONE_ALTERNATIVE can be set to TRUE. The user has no influence as to which of the alternative rules will be selected for the parse tree.

 

6.1.1.4    Switch: RECURSION_MESSAGE

related to:          grammar debugging

default value:     FALSE if FINITE=FALSE

                        TRUE  if FINITE=TRUE

callout:              "Suppress recursion warning"

explanation:

            When FINITE=TRUE, an error message is generated when the grammar contains nested recursion. When FINITE=FALSE and RECURSION_MESSAGE is TRUE, this message is also generated, but as a warning message.

 

6.1.1.5    Switch: SHARED_UNIVERSE

related to:          symbol types

default value:     FALSE .

callout:              "Chars and intermediates share one universe"

explanation:

            The use of the switch SHARED_UNIVERSE is reserved for grammars with negations and a mixture of terminal and intermediate symbols.

            When SHARED_UNIVERSE is FALSE, terminal symbols and intermediate symbols have separate universes for negation. The complement of a terminal symbol is a set of terminal symbols; the complement of an intermediate symbol is a set of intermediate symbols. When SHARED_UNIVERSE is TRUE, terminal and intermediate symbols have a shared universe, the union of the two universes.

            The universe of terminal symbols is defined by default as the character set [<000>-<255>]; it can be defined otherwise by the switch UNIVERSE. The universe of intermediate symbols is always the set of all intermediates that are found in the grammar.

 

6.1.1.6    Switch: UNIVERSE

related to:          symbol types

value domain:    enumeration of terminals

default value:     #x00-#xFF      .

callout:              "Numeric base for character constants"

explanation:

            If the grammar contains negations, the negation of a terminal will be the complement of that terminal with respect to the universe. For example, if the universe is defined as "[a-z]", "`a" (not "a") covers all terminals inside the range [b-z]; if, however, the universe is defined as "[a-z] | [A-Z]", "`a" covers all terminals inside the range "[b-z]" and inside the range "[A-Z]" .

 

6.1.1.7    Switch: TRANSDUCE

related to:          grammar types

default value:     FALSE if the first rule in the grammar contains the rewrite sign "::="

                        TRUE  if the first rule in the grammar contains the rewrite sign "<".

callout:              "Transduction grammar"

explanation:

            When TRANSDUCE=TRUE, the grammar is compiled as a transduction grammar, otherwise as a recognizing and/or interpreting grammar with a starting symbol. The user has to add this switch to the switches file only in the case of a transduction grammar, that, for some reason, starts with a defining (context free) rule.

 

6.1.1.8    Switch: PREPARE_ PARSETREE

related to:          parse trees

default value:     FALSE .

callout:              "Never optimize away nonterminals"

explanation:

            The switch PREPARE_PARSETREE concerns the generation of program code for non-terminals defined with unit rules (reduction in one step).

            When PREPARE_PARSETREE is FALSE, all these non-terminals were removed from the grammar. As a result, the non-terminals will not appear in the parse trees. One only need use this switch, if one also sets the run time switch BUILD_PARSETREE to TRUE.

 

6.1.1.9    Switch: INTERM_INPUT

default value:     FALSE .

callout:              "Grammar in cascade with intermediate symbols"

explanation:

            If a grammar which does not contain intermediates itself will be part of a cascaded grammar in which the preceding grammar will output this type of symbols, the switch INTERM_INPUT should be set to true.

 

6.1.1.10 Switch: CHECK_VARIABLES

related to:          grammar debugging

default value:     FALSE .

callout:              "Write local vars to error file"

explanation:

            When there are local variables in the grammar, they are listed in the compiler error file. The user can use this switch to check typing errors in variable names (if the user intends to use a parameter, but mistypes its name, Transform assumes a local variable).

 

6.1.1.11 Switch: COMP_BASE

related to:          symbol types

value domain:    DEC, OCT, HEX          .

default value:     HEX     .

callout:              "Character universe for negation"

explanation:

            For older versions of Transform: if the grammar contains terminals in ASCII code representation, the numbers will be interpreted according to the decimal, octal or hexadecimal base, depending on the value of this switch. In the current implementation of Transform ASCII-values always have a hexadecimal notation. Currently, it is not necessary to adjust this switch.

 

6.2     Runtime Directives

 

The switches that are of interest to programmers can be found in Appendix A. The switches of interest to ordinary users are listed below.

6.2.1        GUI: Advanced Mode, Standard fields: Parser

 

 

 

 

 

6.2.1.1    Switch: ECHO

related to:          input and output files

default value:     FALSE if the program is called using the PARSEER command

                        TRUE  if the program is called using the IPARSEER command

                        (see section 7.2.1 Commands for the run time system in batch for these commands)

Callout:            

explanation:

            The input is echoed in the output file (the screen or the log file of a batch job). Note that echoing slows down the run time.

 

6.2.1.2    Switch: IGNORE_CR

related to:          input characters

default value:     FALSE .

callout:              "Ignore line breaks"

explanation:

            When IGNORE_CR is FALSE, an end of line character is considered as input (to be matched with the symbol "cr" in the grammar) an so does a form feed. When IGNORE_CR is TRUE, end of line characters and form feed characters are ignored. When the switch SINGLE_LINES is set to TRUE, IGNORE_CR is automatically set to FALSE. The same effect can be reached by adding the switch:

            SKIP_CHARS = <012> | <013> .

 

6.2.1.3    Switch: SPLIT_OUTPUT

related to:          input and output files

default value:     FALSE .

callout:              "Output according to ambiguity and success"

explanation:

            When SPLIT_OUTPUT is TRUE (and BUILD_PARSETREE is TRUE), the unambiguous parse trees are written to the .OUT file; the ambiguous parse trees to the .AMB file and the input in which an error occurs during the parse is echoed to the .NOP file (no proceeding).

 

6.2.1.4    Switch: BUILD_PARSETREE

related to:          output of the results of the parse

default value:     FALSE .

callout:              "Output a parsetree"

explanation:

            When BUILD_PARSETREE is TRUE, a parse tree will be constructed for each successful parse and written to the general output file (a file with the extension .OUT). The parse tree is only complete if the grammar was compiled with default value TRUE of the switch PREPARE_PARSETREE. If the grammar was compiled with this switch set to FALSE, non-terminals defined by unit rules (see section 1.5.3 Unit rules and empty rules) will not occur in the parse tree.

 

6.2.1.5    Switch: SKIP_CHARS

related to:          input characters

value domain:    enumeration of terminals

default value:     the empty set

callout:              "Characters that should be skipped"

explanation:

            The enumerated terminals will be ignored by the run time system.

 

6.2.1.6    Switch: SKIP_ON_ERROR

related to:          error recovery

value domain:    enumeration of terminals

default value:     the empty set

callout:              "Characters to be skipped during error recovery"

explanation:

            This switch can only be used in combination with the switch FIND_ON_ERROR. It lists the terminals that can introduce some unwanted domain for the terminals mentioned in FIND_ON_ERROR.

 

6.2.1.7    Switch: OUTPUT_PARAMETERS

related to:          output of the results of the parse

value domain:    NO_OUTPUT, TO_FILE, TO_SCREEN, TO_FILE_SCREEN .

default value:     NO_OUTPUT  .

callout:              "Output parameters of start symbol"

explanation:

            With the switch OUTPUT_PARAMETERS, one can regulate whether or not the values of the output parameters of the starting symbol of the grammar will be written to the report file and/or the screen. When the value is TO_FILE, the values are listed in the report file; when the value is TO_SCREEN, they will appear on the screen; when the value is TO_FILE_SCREEN, the output will be directed to the report file as well as to the screen and with the default value no output will be emitted at all.

            The value of the first parameter is preceded by "1:", the second by "2:" and so on. If the parse was ambiguous, the values are repeated for every possible analysis. The last parameter of the last analysis is followed by a line containing "0:". This "0:" line can be used as a delimiter in case of SINGLE_LINES=TRUE.

 

6.2.1.8    Switch: LEXICON

value domain:    a string, the (path) name of a lexicon, either between simple or double quotes or no quotes at all.

default value:     the empty string

callout:              "Pathname of the compiled lexicon"

explanation:

            If the grammar contains lexical symbols, the name of the lexicon must be supplied to the run time system. Specify the pathname of the lexicon (without any extension), for example: LEXICON=D:MY_LINGWARE\MY_LEXICON.

 

6.2.1.9    Switch: LEX_CHARS

value domain:    enumeration of terminals

default value:     [a-z] | [A-Z] | [0-9] | '''' | '_'

callout:              "Characters for lexicon tokenization"

explanation:                  

            A specification of the characters, that define a word in the lexicon. This switch has only effect if the switch LEX_INCREMENT has the value FALSE.

 

6.2.1.10 Switch: SINGLE_LINES

related to:          input and output files

default value:     FALSE if the program is called using the PARSEER command

                        TRUE  if the program is called using the IPARSEER command

callout:              "Treat end-of-line as end-of-file"

explanation:

            When SINGLE_LINES is TRUE, each line in the input is treated as if it were a separate file. Parsing is restarted at every line. All output (such as messages, output variables of the starting symbol, parse trees) is appended to the output of the preceding parse. The combination SINGLE_LINES=TRUE and IGNORE_CR=TRUE is not possible. In that case, the run time system will change the latter value into FALSE.

 

6.2.1.11 Switch: IGNORE_BLANKS

related to:          input characters

default value:     FALSE .

callout:              "Ignore spaces and tabs"

 

            When IGNORE_BLANKS is FALSE, spaces and tabs are treated as input symbols, which must be handled by the grammar. When IGNORE_BLANKS is TRUE, spaces and tabs are ignored. The same effect can be reached by adding the switch:

            SKIP_CHARS = <008> | <032> .

 

6.2.1.12 Switch: FLAT_OUTPUT

related to:          debug tools

default value:     FALSE if the grammar is not a transduction grammar

                        TRUE  if a transduction grammar contains no ambiguous rewriting and FALSE if it does.

callout:              "flatten unambiguous transduction result"

explanation:

            This switch is only significant for transduction grammars. Normally, this switch is transferred from the compiler to the run time system. If FLAT_OUTPUT=FALSE has been added to the switches file of a non-ambiguous transduction grammar, the .TDO file is an intermediate file and has to be further processed by Drawgraph (see section 9.3 Drawgraph). The value TRUE in case of ambiguous transduction output makes no sense. If the program Plink detects a grammar with possible ambiguous rewriting in another position than the last one, a (fatal) error will be generated.

 

6.2.1.13 Switch: CASE_CONVERT

related to:          input characters

value domain:    NONE, UPPER_TO_LOWER, LOWER_TO_UPPER  .

default value:     NONE  .

callout:              "Conversion from/to lowercase"

explanation:

            When CASE_CONVERT=UPPER_TO_LOWER is added to the switches file, all capitals in the input are treated in their lowercase form. This is especially recommended if use is made of a lexicon in which the entries are usually written in lowercase letters. The value LOWER_TO_UPPER has the opposite result.

 

6.2.1.14 Switch: FIND_ON_ERROR

related to:          error recovery

value domain:    enumeration of terminals

default value:     the empty set

callout:              "Find characters on error"

explanation:

            If the run time system detects an error in the input, it will skip forward to one of the mentioned terminals and resumes parsing starting from that point.

 

6.2.1.15 Switch: PARSE_BASE

related to:          output of the results of the parse

value domain:    DEC, OCT, HEX

default value:     HEX     .

callout:              "Numeric base for character constants"

explanation:

            Printable terminals appear in the parse trees (and the error and trace file) in their character representation. The switch RUN_BASE regulates the appearance of the control characters in these files. By default, these characters will be written as a decimal number, enclosed by angle brackets. The number represents the underlying ASCII code. If the switch has the value OCT, the ASCII code is represented by an octal number, and in the case of the value HEX, by a hexadecimal number. For example:

 

                        character                      decimal             octal                 hexadecimal

                        carriage return              <013>               <015>               <00F>

                        blank                            <032>               <040>               <020>

 

                        Remark: This switch does not affect the output in the report file or the file with transduction output.

 

6.2.1.16 Switch: LEX_INCREMENT

default value:     TRUE  .

callout:              "No lexical tokenization"

explanation:

            When the switch LEXICON has been supplied and LEX_INCREMENT is FALSE, the lexicon entries must match words in the input, i.e. strings of alphanumeric characters enclosed by non-alphanumeric characters (see section section 3.2.2 Lexicon symbols). There may also be entries for punctuation marks; these entries match one non-alphanumeric character in the input. When LEXICON has a non-empty value and LEX_INCREMENT is TRUE, the lexicon entries may match words or prefixes of words in the input.

6.3     Switches for error recovery

6.3.1.1    Switch: CPU_LOG

related to:          output of the results of the parse

value domain:    NO_OUTPUT, TO_FILE, TO_SCREEN, TO_FILE_SCREEN

default value:    NO_OUTPUT if SINGLE_LINES=TRUE

                        else TO_SCREEN

explanation:

            The switch CPU_LOG deals with the CPU time used for each successful parse. The meaning of the values is the same as in the case of the switch OUTPUT_PARAMETERS, except that in the case of the values TO_FILE and TO_FILE_SCREEN the target file is not the report file but the general output file (a file with the extension .OUT). If parse trees are also written to the general output file, the CPU time is reported after the parse tree(s).

PMSUB6.WP bevat - chapter 6: SWITCHES

7     Operation

 

The setup and the GUI of the Transform system is already dealt with by the Transform User Manual. In this chapter we explain the operation of the Transform system. The system can operate in batch as well as in the GUI. The description is taken from the point of view of batch operations.

 

7.1     Compiler

 

A newly written grammar must be compiled before it can be executed and it must be recompiled every time changes are made in the original grammar. Actually, the parser generation is achieved in two steps. In the first step, the grammar is parsed with Bootstrap, the meta grammar for grammars, and an intermediate (report) file is created with the extension .IM.

The metagrammer describes itself and is preprocessed by Transform as a normal grammar (hence the name Bootstrap). Therefore, parsing of a grammar is essentially the same process as parsing an input file.

The intermediate file is a coded representation of the grammar which serves as input for the program generation process. In the second step, the program code is generated on the basis of the intermediate file. The first process is only executed when necessary; if there is no intermediate file for the grammar or if there is an intermediate file which is older than the present grammar file [11].

 

7.1.1        Commands for compilation in batch

 

The command name for the compiler is GENPARS (for generate parser):

 

            GENPARS  <GRAMMAR>

 

When the compiler switches are defined in a switches file named <GRAMMAR>.SW, this file will automatically be taken into account.

 

When the compiler switches are defined in a switches file with another name, the command is:

 

            GENPARS  <GRAMMAR>  <SWITCHES>

 

Examples:

 

To compile a grammar named MY_GRAMMAR.GRM with switches values contained in the file MY_GRAMMAR.SW, the command is:

 

            GENPARS  MY_GRAMMAR

 

If the switches values are in a file other than MY_GRAMMAR.SW (for example in a file named MY_SWITCHES.SW), the command is:

 

            GENPARS  MY_GRAMMAR  MY_SWITCHES.SW

 

 

7.1.2        Screen output in the compilation phase

 

When the first parse begins, the screen shows the message:

 

            Syntax checking of the grammar ...

 

If the parsing of the grammar fails at some point, a message appears together with the part of the grammar which does not conform to Bootstrap. These errors will also be listed in the compiler error file. If no errors are found the message:

 

            Input correct.

 

appears on the screen.

 

When the actual compilation starts, the revision number of the compiler appears on the screen, together with a listing of the actual compiler switches (values may have been set or may be the default values) and the message:

 

            Compilation has started, please wait ....

 

If an error occurs, an error message will be written to the compiler error file.

 

Sometimes warnings appear on the screen. They deal with peculiarities of the grammar of which the compiler can not decide if they are typing errors. For example, context free rules with a non-terminal on the left-hand side that never occurs in one of the other rules are shown. Also, all intermediates that have been found in the grammar will be written to the screen; sometimes mistyped non-terminals appear among them. If the user is sure he made no mistakes, he can ignore these warnings.

 

When the compiler is ready, some statistics about the compilation process will be written to the output (compilation time and the amount of program code).

7.1.3        Files related to compilation

 

The input files for the compiler [12] are:

 

grammar file

<GRAMMAR>.GRM

switches file

<GRAMMAR>.SW or <FILENAME>

 

The output files of the compiler inherit their name from the name of the grammar with a specific extension:

 

descriptive name

extension

function

intermediate file

.IM

result of the first parse of the grammar

program code file

.PG

contains the program instructions

label index file

.LB

index for the former file

symbol table file

.ST

contains the names of the symbols and literals used in the grammar which are necessary for the output of the run time system

symbolic file

.SM

contains symbolic output, to be used in the future [13]

parsout file

.POU

contains information about the value of the switches, some statistics concerning the compilation and the output resulting from PRINT_... switches (only useful for programmers)

compiler error file

.ERR

contains error messages and warnings

 

 

 

trace file

.TRA

contains tracing information, only useful for programmers

statistics file

.STA

contains statistical information, only useful for programmers

 

All listed files are printable with exception of the program code and the label index files which are binary files.

 

7.1.4        Error handling during compilation

 

During the first parse of the grammar, the rules are checked for errors against the syntax. The exact position of all the detected errors will be written to the output and to the compiler error file. The actual compilation will not start after discovery of errors.

 

            Remark 1: As explained in section 3.4 Error recovery, upon finding an error some input will be skipped and the parsing will be resumed at the first period. Therefore, no error checking takes place in the skipped part. It is possible that new errors will emerge after correction of the grammar.

 

            Remark 2: In some rare cases, the error is so severe that there is no possibility to recover the error during the first parse. A long list of rather puzzling error messages can result. A good strategy is then to concentrate upon the first mentioned error, correct the grammar at this position and try again.

 

During the actual compilation, errors may be discovered again. Some of them will stop the compilation process, others will only result in a warning. Errors and warnings will be written to the compiler error file, some warnings also to the output (screen or batch log file). If an compiler error file has been created, this will be reported to the screen or log file.

 

All errors and warnings have a three digit code. Apart from this code the meaning of the error and, in some cases, advice on how to modify the grammar will be given.

 

There are six types of possible errors and warnings. The error numbers of each type start with a specific digit:

 

1.         The error numbers starting with a "1" concern errors due to conflicting switches, or switches that do not match the contents of the grammar. In this case, the user must alter the switches file.

 

2.         If a limit in one of the specified numbers in the compiler is reached an error will also be given, the number of which starts with a "2", for example, if the number of rules exceeds the prespecified maximum. The user can not handle this problem alone, but must contact the programmer of Transform. The programmer has to alter the maximum, and recompile the modules of Transform.

 

3.         Some features are not yet fully implemented in Transform; in rare circumstances the continuation of the compilation would produce erroneous code. In these cases an error number (or a warning) starting with a "3" will be given. In some cases advice for the modification of the grammar will be given; if the user has difficulties he can ask the programmer(s) for help.

 

4.         Some errors against the syntax are not checked during the first parse, but will be discovered during the compilation process, for example, a regular expression used on the left-hand side of a transduction rule. If an error of this type (starting with a "4"), is discovered, the user has to modify his grammar.

 

5.         Errors detected while the compiler is reading the intermediate file with extension .IM start with a "5".

 

6.         Some compiler switches can be used for debugging grammars (see section 8.2 Debugging Grammars). The result of a check, induced by these switches, will appear as a warning; the number starts with a "6".

 

7.         In order to prevent a program trap during the compilation, some unexpected circumstances (out of scope of the user) cause an error (the error number starts with a "9") and the compilation stops. In this case one must contact the programmer; please save a copy of the grammar (and the switches file) that cause the error.

 

A complete list of all possible error messages during compilation time is listed in Appendix C, with the exception of the errors in the last mentioned category.

 

 

7.2     Run time system

 

The run time system executes a grammar which is compiled by Transform over an input file specified by the user.

 

 

7.2.1        Commands for the run time system in batch

 

The command name for the run time system is PARSEER:

 

            PARSEER  <GRAMMAR>  <INPUTFILE>

 

For interactive use, the command is:

 

            IPARSEER  <GRAMMAR>

 

In this case, the input will be read from the keyboard.

 

Again, when the switches are defined in a switches file named <GRAMMAR>.SW, this file is automatically taken into consideration.

 

When the switches are defined in a switches file with another name, the commands are:

 

            PARSEER  <GRAMMAR>  <INPUT>  <SWITCHES>

 

            IPARSEER  <GRAMMAR>  <SWITCHES>

 

Examples:

 

To parse a file named MY_FILE.TXT with a grammar named MY_GRAMMAR.GRM and run time switches values contained in the file MY_GRAMMAR.SW, the command is:

 

            PARSEER  MY_GRAMMAR  MY_FILE.TXT

 

If the values of the run time switches are in a file other than MY_GRAMMAR.SW (say in a file with the name MY_RUNSWITCHES.SW), the command is:

 

            PARSEER  MY_GRAMMAR  MY_FILE.TXT  MY_RUNSWITCHES.SW

 

 

7.2.2        Screen output and keyboard input during the execution phase

 

When the run time system starts with parsing an input, the revision number of the run time system and the actual run time switches (a combination of the switches set by the user and the default switches) are listed. If the parse fails at some point, a message appears together with an indication of the part of the input which does not match the grammar.

 

If the run time switch ECHO is set to TRUE, the run time system echoes every character (or word) that has been treated to the output.

 

The PARSEER command assumes that the input comes from a file while the IPARSEER command assumes input from the keyboard. [14]

 

When the run time system is called by the command IPARSEER, the user can input to the run time system by typing his text followed by a ctrl‑z. The parsing starts at the moment of the ctrl‑z, so one can edit the text beforehand. If the switch SINGLE_LINES is set to TRUE as well, after parsing of a sentence the user will be asked whether to continue or to quit.

 

            Remark: In case of the IPARSEER command: if the parse produces output, it will be sent to the file <GRAMMAR>_IN.<EXTENSION>. For instance, reports from MYGRAM.GRM go to MYGRAM_IN.REP.

 

If the grammar is a cascaded grammar, the output of all grammars will be echoed to the standard output file.

 

 

7.2.3        Files related to the run time system

 

The user must provide for two input files for the run time system:

 

input file

<INPUTFILE>

switches file

<GRAMMAR>.SW or <FILENAME>

 

However, if the run time system is used interactively (IPARSEER), the input file will automatically be replaced by the keyboard.

 

Furthermore, the run time system needs the three main output files of the compiler:

 

program code file

.PG

contains the program instructions

label index file

.LB

index for the former file

symbol table file

.ST

contains the names of the symbols and literals used in the grammar which are necessary for the output of the run time system

 

In the case of calling the run time system by the PARSEER command, all output files inherit the name of the input file with a specific extension. With the IPARSEER command, where the input comes from the keyboard, the names of the output files consist of the name of the grammar followed by the string "IN" and the specific extension. For instance, a report file created while running a grammar called MY_GRAMMAR with an input file called MY_TEXT is named:

 

            MY_TEXT.REP                                   with PARSEER

            MY_GRAMMAR_IN.REP                   with IPARSEER

 

The creation of the different output files depends in some cases on the content of the grammar and/or the settings of the switches:

 

descriptive name

extension

function

general output file

.OUT

contains information about the parsing dependent on the switch values of BUILD_PARSETREE and CPU_LOG.

ambiguous output file

.AMB

contains the ambiguous parse trees (only created if SPLIT_OUTPUT=TRUE)

no proceeding file

.NOP

contains the echoed input in which an error was found (only created if SPLIT_OUTPUT=TRUE)

report file

.REP

contains messages or the values of the output variables of the starting symbol, so this file will be only created if

-          the grammar contains messages or

-          the starting symbol has output parameters and the switch OUTPUT_PARAMETERS has the value TO_FILE or TO_FILE_SCREEN

transduce output file

.TDO

contains the output of transduction (only created in the case of a transduction grammar)

run time error file

.ERR

contains run time errors (only created if an error has been found)

trace file

.TRA

contains trace information (used by programmers, only created if TRACE_LEVEL<>0 and/or TRACE_USECOUNT=TRUE)

statistics file

.STA

contains statistics in connection with the parsing process (used by programmers, only created if RUN_STATISTICS=TRUE and/or one of the COUNT_... switches is TRUE)

 

All output files created by the run time system are printable; depending on the grammar and the input, the .TDO file and the .REP file can contain control characters.

 

7.2.4        Error handling during run time

 

In grammars used for recognition and/or parsing a FIND_ON_ERROR switch can be specified (see section 3.4 Error recovery).

 

If no such switch is present and an error is discovered in the input, the run time system will stop the parsing process and give a message to the run time error file and the output. The message includes the context of the location of the discovered error; the exact spot is marked by a repeated arrow ("<<<<<").

 

The user can modify his grammar and/or his input file. When the input file consists of different items and the switch SINGLE_LINES has been set to TRUE, the run time system will continue with the next item after the discovery of an error.

 

The situation is different if FIND_ON_ERROR has been specified. If an error is found the run time system tries to recover from the error state and skips to the first occurrence of the terminal or of one of the terminals that are mentioned in the FIND_ON_ERROR switch. In most cases the run time system will continue parsing. The position where the error occurred and the position where the parsing is resumed are included in a message to the output and to the run time error file. The terminals that were expected at the moment that the error occurred will also be written to the run time error file.

 

In transduction grammars, no errors will ever be detected; when the input file is not correct, the output will not be as expected. The user must detect this by himself.

PMSUB7.WP bevat - chapter 7: OPERATION

8     Problems and current limitations

 

Sooner or later every grammar writer reaches a stage not mentioned in section 1.2 From grammar to program: a problem phase. This occurs when after a run of a compiled grammar the output seems different from what has been expected. Some groups of problems can be distinguished:

 

           A recognizer finds errors in an input while the input has to be considered as  correct or, alternatively, incorrect input is judged as "correct".

 

           A parser produces "wrong" parse trees.

 

           A recognizer or parser finds unexpected ambiguities in the input.

 

           The output of a transducer is different from what was intended, or the transducer produces ambiguous rewriting.

 

Although it is difficult, or even impossible, to give solutions for all the problems that could arise, some general advice will be given in this chapter. The chapter starts with section 8.1 Current limitations on the known limitations (and some effects of bugs) of the Transform system; if a problem resembles the ones presented there, one can try to get round the imperfection due to the system by rewriting or restructuring (parts of) the grammar, for example, by making use of alternative ways to reach the intended effect.

 

The second part (section 8.2 Debugging Grammars) contains some practical advice for "debugging" grammars.

 

 

8.1     Current limitations

 

Some of the current limitations will be discussed here. In addition, some known effects of imperfect implementation ("bugs" in the programs) will be mentioned.

 

 

8.1.1        Upper limits for the size of a grammar

 

It is possible to create grammars that need more memory in the compilation phase than the computer can offer. There is however no clear connection between the physical size of the grammar text file (or even the number of rules the grammar contains) and the number of states that will be generated; therefore no general rule can be given for defining beforehand whether a grammar will be too big for compilation. In the programs of Transform, there are still possibilities to optimize the process of state generation.

 

 

8.1.2        Finite versus non-finite

 

The main limitation lies in the incompatibility of some of the extended features with the different grammar types in which the switch FINITE plays a crucial role. Only grammars with a nested recursion can never be compiled into a finite state machine; for grammars not containing this type of recursion a choice between the two types of resulting machines could be possible, in principle. Practically however, the use of some features has the effect of a fixation of the resulting automaton type. The following table contains an overview of the restrictions concerning the use of the switch FINITE:

 

─────────────────────────────────────────────────────

Item                                             Can be used in combination with

                                                 FINITE=FALSE  FINITE=TRUE

─────────────────────────────────────────────────────

Transduction rules                                                        +

"Recognizing" context sensitive rules   +                        

Negation                                                                      +

Negation in unit rules                          +                         +

Parameters                                        +                        

Actions with variables                        +                        

Lexicon symbols                                +                        

─────────────────────────────────────────────────────

 

 

8.1.3        Limitations concerning context sensitive rules

 

The restrictions for context sensitive rules apply to rules in recognizing grammars as well as to the transduction rules.

 

The left-hand side of a context sensitive rule can never contain:

            ‑ regular expressions

            ‑ the alternative sign

            ‑ ranges or wildcards

            ‑ non-terminals with parameters

            ‑ lexicon symbols with parameters

            ‑ negations

            ‑ actions

In section section 4.3.1 Description of transduction rules, solutions were been given to overcome these restrictions.

 

The right-hand side of context sensitive rules may never be empty.

A transduction grammar that outputs ambiguous rewritings can only be the last one in a cascaded grammar.

 

 

8.1.4        Limitations concerning actions

 

1.         The "or"-operator can only be used between tests.

2.         In combination with tests, the "or"-operator has precedence over the conjunction operator; there are (still) no means to reverse this order of performance. As a consequence there is no equivalence for an "if‑then‑else" construction. A method to overcome this limitation is described in section section 3.1.3.4 Actions in empty units.

3.         Actions can not be placed at the start of the right-hand side of a rule.

 

 

8.1.5        Miscellaneous limitations

 

1.         Non-terminals with input parameters can not be used in a left recursive position in a grammar rule.

2.         In a grammar to be compiled with FINITE=FALSE, the successor of an "arb" can only consist of a terminal. In addition, in such grammars an "arb" can never be the last symbol in a rule.

3.         A non-terminal with output parameters can not be defined by two or more empty paths with actions.

 

 

8.1.6        Known effects of "bugs"

 

1.         Signals with variables used inside regular expressions sometimes yield incorrect output.

2.         Chains of empty rewritings with actions can produce the wrong effects.

 

 

8.2     Debugging Grammars

 

Debugging grammars can be a hard and time consuming job. It demands a lot of concentration, hard thinking, inventivity and possibly also some experience. From our own experience, we can only present some suggestions here ... and wish you inspiration.

 

Advice 1: Make use of the warnings the compiler can produce and the built-in debug facilities of the run time system:

 

           Take notice of the information in the run time error file if an error occurred. Sometimes the reported expected terminals can give a clue.

 

                        Warning: The run time system reports the last read character before the error occurs. However, if the grammar contains tests, it is possible that the problem is caused in a much earlier stage.

 

            One can also set the switch TRACE_ERROR to TRUE (see Appendix A) to get even more information about the rules that are applicable at the moment of the error. However, in the case of substantial grammars, this information is too extensive to be of any help.

 

           If the run time system signals a possible error in the consulted lexicon, check and, if necessary, correct the entries before trying again.

 

                                    Remark: Repetition of identical entries (with possibly different features) in the lexicon can cause ambiguity.

 

           Look at the screen (or batch output) or in the compiler error file for the report of:

            ‑ unused rules.

                                    If the grammar contains a context free rule of which the left-hand non-terminal does not reoccur in at least one of the right-hand sides of other rules, the compiler reports such a rule as unused. The cause can be a typing error in the name of the non-terminal.

            ‑ a list of intermediates.

                                    The compiler treats every name of a symbol that does not reoccur on the left-hand side of a context free rule as an intermediate (see section section 3.2.4 Intermediate symbols). A mistyping in the name of a non-terminal causes that name to appear in the reported list of intermediates. Also errors in the names of intermediates themselves can be traced in this list.

 

           In the case where the grammar contains variables and/or parameters: recompile the grammar with the switch CHECK_VARIABLES set to TRUE. The compiler makes a list of all the local variables used in a rule in the compiler error file; maybe there is a typing error in the name of a parameter or local variable.

 

           Check the contents of the switches file (the compiler makes a list in the .POU file) and compare these switches with the checklist given in section  10.3 Checklist for switches of this manual.

 

           If the grammar is compiled with NOFINITE, recompile it with the switch RECURSION_MESSAGE; maybe there is an unintended (possibly implicit) nested recursion.

 

Advice 2: Take a closer look at your grammar.

 

           Check the parameters (is their input/output type correct?) and the variables. Check the assignments and tests.

 

           Check, if used, the ASCII code representation of characters and control the base of the numbers.

 

           Turn yourself into a kind of "machine"; try to follow the problem input through the course of the grammar. Make note of the reductions and write down the contents of the variables.

 

Advice 3: Try to locate the problem in a more specific way. Two methods can be used:

            1. systematically varying the input

            2. systematically adapting the grammar

One can choose to concentrate on one of the methods or apply them both at the same time.

 

            Warning: Before altering the grammar, save a copy of the original and also of the original switches file and the "problem" input file. It occurs too often that a problem can no longer be reproduced.

 

Advice 3a: Vary the input. Figure out what (kind of) input is treated correctly and what not. Try to do this in a systematic manner.

 

           Run the program several times, starting with a simple input and make it gradually more complex. Or, alternatively, start with a complex input and simplify it stepwise. A closer look at the transition point from "correct treatment' to "incorrect treatment" can contribute to the solution.

 

           Choose different inputs that will be processed by different parts of the grammar. By this one could locate the problem section in the grammar in more detail.

 

Advice 3b: Adapt the grammar. Try to locate the problem to a restricted area of the grammar text.

 

           Often a problem arises "suddenly". While developing and expanding a grammar an input string treated correctly in the past causes a problem after a certain change in the grammar. Inspection of the last change can give some clues.

 

                        Note: This illustrates the need for documentation of a grammar; it is recommended to keep note of the development of a grammar. This can be done by saving copies of earlier versions, by writing a "log file" and/or by documenting the grammar itself using the comment feature.

 

           Switch off one or more parts of the grammar by enclosing rules using comments and/or replace them with dummy rules. Of course, after every alteration, the grammar has to be recompiled. Run the grammar every time with the input that caused the problem and try to locate the transition point of "problem‑no problem".

 

           Make use of the message-feature; at certain points of the rules one can write reports or signals. Inspection of the report file can tell whether or not certain points of the grammar were reached by the input.

 

           If the grammar is of a suitable type, make a combination of variables (to "store" parts of the input) and output parameters for the starting symbol. If the switches file contains OUTPUT_PARAMETERS, one also has a check on the rules that were reached.

 

           If the problem grammar is a transducing one, remember that small parts of such a grammar can be compiled to separate transducers, that can be linked together to one cascaded grammar (see section section 4.3.6 Cascaded grammars). To some extent, the parts are independently testable.

 

9     Tools

 

Besides the compiler and the run time system, the Transform system contains some auxiliary programs, used in combination with the two main ones.

 

When lexicon symbols are used in a grammar, the run time system must use a lexicon; the steps to build one will be described in section 9.1 Lexicon. A tool for transforming the resulting parse tree into a better lay-out is the program Drawtree, described in section 9.2 Drawtree. Drawgraph, found in section section 9.3 Drawgraph, is useful in cases where the transduction output has been produced in an intermediate, not directly interpretable form. In section 9.4 Plink the program for building a cascaded grammar will be described. Section 9.6 Disassembler with the description of a disassembler program, mainly to be used by programmers, concludes this chapter.

 

All tools can be run in a command shell. Many tools run also in the gui.

 

 

9.1     Lexicon

 

When a lexicon is used, this lexicon must first be built, using a separate program called Parserlex. Parserlex creates an ltree lexicon for use in cooperation with a Transform grammar. The lexicon is built on the basis of a text file which consists of entries in a specified format.

 

The format of the entries is illustrated by the following examples:

 

            $prep('in')

            $noun('program', 'sg')

 

where "in" is classified as a "prep" (preposition) and "program" is classified as a "noun" with the characteristic "sg" (singular).

 

The single quotes may be replaced by double quotes. To include a quote in a string, it must be written twice when it is the same quote as the quotes which bound the string. Thus:

 

            "This is a double quote: ""."

            'This is a single quote: ''.'

            'This is a double quote: ".'

            "This is a single quote: '."

 

The lexical category may also be enclosed between quotes. The name may contain non-alphanumeric characters, but it must still start with a dollar sign. For example:

 

            '$NO(x)'('abc', 'xyz')

 

Blanks (tabs, spaces, end of line characters) and comments may be added to the text file.

 

The syntax of the Parserlex command is in batch:

 

            PARSERLEX  <LEXICON_NAME>  <TEXTFILE>

 

The name of the text file should include the extension, for instance .TXT. The program Parserlex creates three files, all with the name of the original text file with the extensions .LEX, .ADM and .INF.

 

A lexicon may be composed of more than one input file by repeated calls of Parserlex with the same lexicon name. It is recommended to make a backup copy of the lexicon before a new file is added to it.

 

            Remark: Identical entries will be ignored but a warning will be displayed.

 

There is a demonstration program for the Ltree lexicon software, which can also be used for the examination of an existing lexicon. The program is called without arguments:

 

            LEXDEMO

 

Another program shows the specifications of an existing lexicon. This program also enables the user to lower the "use count" of a lexicon. The "use count" registers the number of processes that simultaneously consult the lexicon. When the run time system crashes because of a console interrupt or a system failure, the use count is usually not decremented. Since a use count of zero is needed for modification of a lexicon, the use count should be corrected by the program Lexspecs, which is also called without arguments.

 

            LEXSPECS

 

Using the program Unparserlex, it is possible to check the contents of a specific lexicon. With the same program a lexicon can be translated to a text file. The command for running this program is:

 

            UNPARSERLEX

 

9.1.1        GUI: Advanced Mode: Lexicon-Compiler

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

9.2     Drawtree

 

As mentioned before, the lay-out of the resulting parse trees is a very simple one. With the program Drawtree one can transform the trees to another lay-out. For example, the parse tree given in section 5.3 has been treated by this program and resulted in:

 

              LicencePlates

                      (‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑*‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑)

                      |                       |                       |                      |

                      |                       |                       |                      |

                LicencePlate                 cr                 LicencePlate                cr

      (‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑*‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑)       |      (‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑*‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑)      |

      |       |       |       |       |       |      |        |       |       |       |      |

      |       |       |       |       |       |      |        |       |       |       |      |

   Digits  Hyphen  Digits  Hyphen  Letters    |   Letters  Hyphen  Digits  Hyphen  Digits    |

      |       |       |       |       |       |      |        |       |       |       |      |

      |       |       |       |       |       |      |        |       |       |       |      |

     12             34             AB     <13>    AB              12             34    <13>

 

The user can partly influence the lay-out produced by Drawtree by means of a switches file.

 

The command for the program is:

 

            DRAWTREE  <INPUTFLE>  <OUTPUTFILE>  <DRAWTREE_SWITCHES_FILE>

 

For example, after running a grammar over the input file MY_INPUT the resulting general output file MY_INPUT.OUT can be modified to a file called MY_INPUT.TREE by means of the call:

 

            DRAWTREE  MY_INPUT.OUT  MY_INPUT.TREE

 

or, if a Drawtree switches file with the name MY_SWITCHES.DSW is used:

 

            DRAWTREE  MY_INPUT.OUT  MY_INPUT.TREE  MY_SWITCHES.DSW

 

The switches that can be added to a Drawtree switches file are listed below:

 

Switch:             DRAW_SKIPBLANKS

default value:     TRUE  .

explanation:

            In the resulting parse trees, the leaves that consist of a blank or a tab are omitted.

 

Switch:             DRAW_MESSAGES

default value:     FALSE .

explanation:

            If DRAW_MESSAGES is FALSE, no messages are represented in the resulting tree(s).

 

Switch:             DRAW_SKIPCHARS

possible values:  enumeration of terminals

default value:     the empty set

explanation:

In writing a grammar, some non-terminals are only introduced in order to write the rules in an efficient way. If these non-terminals are superfluous in the parse tree(s), one can choose to name them with strings that all start with the same unique character, for instance the underscore ("_"), and to add to the Drawtree switches file: DRAW_SKIPCHARS = '_'. Another application is to omit names of lexicon symbols by adding 

DRAW_SKIPCHARS = '$' to this file. You can do both at the same time by adding the switch

DRAW_SKIPCHARS = '_' | '$'.

 

Switch:             DRAW_UPSIDEDOWN

default value:     FALSE .

explanation:

            DRAW_UPSIDEDOWN=FALSE results in a tree with the root at the top and the leaves at the bottom, the value TRUE results in the opposite.

 

 

9.3     Drawgraph

 

The program Drawgraph is only of interest for output of transduction grammars. If such a grammar produced ambiguous output the transduction output file (.TDO) is not immediately readable. This output has to be converted by means of the program Drawgraph. The calling format for this program is:

 

            DRAWGRAPH  <INPUTFILE>  <OUTPUTFILE>

 

For example:

 

            DRAWGRAPH  MY_INPUT.TDO  MY_INPUT.TRANS

 

 

9.4     Plink

 

The linking of two or more compiled grammars into one cascaded grammar (see section section 4.3.6 Cascaded grammars) can be performed by the next command:

 

            PLINK  <RESULTGRAMMAR>  <GRAMMAR1>,<GRAMMAR2>,…,<GRAMMARN>

 

The name of the cascaded grammar which can be run will be <RESULTGRAMMAR>.

 

The linking of grammars can be performed in steps: after linking two (or more) grammars to a resulting grammar "result1" and linking two other grammars to "result2", the two resulting grammars can again be linked together.

 

The maximum number of grammars that can be linked is 25.

 

Example:

 

            PLINK  LARGEGRAM  SMALL1,SMALL2,SMALL3

 

As can be expected, the program Plink constructs files with extensions .PG, .LB, .ST and  .SM. It also creates a file <RESULTGRAMMAR>.SAM, which lists the grammars that were used to build the resulting grammar. Its only use is as documentation; it can safely be deleted.

 

            Remark: If at least one of the grammars that will be part of a cascaded grammar contains intermediates, the other related grammars that do not make use of intermediates, must be compiled with the switch INTERM_INPUT set to TRUE.

 

The program Plink does not create a switches file; the user has to create one if necessary. The run time switches will be applied to all the composing grammars.

 

9.4.1        GUI: Advanced Mode: Cascade Linker

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

9.5     Xgram

 

The utility Xgram links the program code file together with the run time system into one executable program file. This leads to a faster start-up time of the run time system. It also makes the compiled grammar independent of changes in the run time system, which may lead to code incompatibility. The files <grammar>.EXE and <grammar>.COM are created by the following command:

 

            XGRAM  <GRAMMAR>

 

or, if there are run time switches in a file with a name not equal to <GRAMMAR>.SW:

 

            XGRAM  <GRAMMAR> <MY_SWITCHES>

 

It is recommended to supply a full path name for <grammar>, since in this case the generated command file becomes independent of the current default directory. Better still, precede <grammar> with a logical name referring to the directory.

 

The grammar transformed into an executable program can be called by the following command:

 

            <GRAMMAR>  <INPUTFILE>

 

or, for interactive use (input from the keyboard), simply:

 

            <GRAMMAR>

 

Remark: Currently, the utility XGRAM is not functioning. Some maintenance is needed in order to be operational again.

 

9.6     Disassembler

 

The utility Disassembler is mainly of interest for programmers. It operates on the program code file and the symbol table file. It produces a copy of the program code in a symbolic and annotated form, which can be useful in debugging the grammar (or modules of the Transform system itself). The program is called as follows:

 

            DASS  <GRAMMAR>

 

Its output is a file with name <GRAMMAR>.DA.

 

            Remark: The program code file can become very large in the case of substantial grammars.

PMSUB9.WP bevat: -  9. TOOLS

9.6.1        GUI: Advanced Mode: Disassembler

10              Overview

10.1 Meta language of Transform grammars

All meta symbols used in Transform grammars together with their names and a short description of their functions are listed below:

 

Delimiters and operators used in rules:

 

name

meta symbol

function

rewrite sign

::=

rewrites a left-hand side into a right-hand side

transduction sign

<::

transduction: written between left- and right-hand side

concatenation sign

,

concatenation

alternative sign

|

between alternatives

end of rule sign

.

end of a rule

comment signs

/ * and * /

start and end comment

action brackets

{ and }

between these brackets actions are notated

                       

Reserved symbols for regular expressions:

 

name

meta symbol

function

regular expression open

(

open bracket of a regular expression

negated reg. expr. open

`(

open bracket for negated regular expression

r.e. close zero or one

)?

zero or one times the enclosed part

r.e. close one

)

one times the enclosed part

r.e. close zero or more

)*

zero or more times the enclosed part

r.e. close one or more

)+

one or more times the enclosed part

                       

Reserved non-terminals and terminals:

 

name

meta symbol

function

dontcare symbol

*

one arbitrary character

line symbol

zero or more arbitrary characters

arb

=

zero or more arbitrary characters, not including the terminal that follows

carriage return

cr or CR

end of line character

 

                        Operands and indicators for symbol names:

 

name

meta symbol

function

string sign

'

single character terminals or strings are denoted between single quotes [15]

escape sign

\

terminal indication for a non-name character

octal brackets

< and >

between the angle brackets a number representing  the ASCII code of a character

base indicator

#X or #x

the next number has a hexadecimal notation

ranges sign

[..-..]

defines range of characters

negation sign

`

before the symbol: the negation of that symbol

lexicon sign

$

start of the name of a lexicon symbol

cover sign

^

end of the name of a cover non-terminal

parameter brackets

( and )

between these brackets formal or actual parameters are denoted

 

                        Indicators [16] in formal (and actual) parameter lists:

 

name

meta symbol

function

input indicator

i:

type indicator for input parameter

output indicator

o:

type indicator for output parameter

input/output indicator

io:

type indicator for input/output parameter

 

                        Indicators 16 and operators in actions:

 

name

meta symbol

function

report indicator

r:

report message

signal indicator

s:

signal message

action conjunction

; or &