LingDoc Transform
A tool for document and language
engineering
Reference Manual
Version 3.2
Palstar
Uffelte
The
www.lingdoc.eu
www.palstar.nl
1.1.1 Example on Dutch license plates.
1.3.2 Types of rules and types of grammars
1.4 The compiler and the run time system
1.5.3 Unit rules and empty rules
3........... Extended Features
3.1.1.1 Variables, literals and expressions
3.1.1.4 Combinations of tests and/or assignments
3.1.2.2 Messages with expressions
3.1.3 Places of actions in rules
3.1.3.1 Actions following symbols
3.1.3.2 Actions following regular expressions
3.1.3.3 Actions as part of the repetition of
regular expressions
3.1.3.4 Actions in empty units
3.2.1 Non-terminals with parameters
3.3.1 Negation of terminal characters, ranges and
intermediates
3.3.2 Negation of non-terminal symbols
3.3.3 Negation of regular expressions
3.3.4 Negation of a string of terminal symbols
4........... Context Sensitive Grammars
4.1 Description of context sensitive rules
4.2 Context sensitive grammars for recognition
or parsing
4.2.1 Example of a context sensitive grammar used
for recognition
4.2.2 (Cover) non-terminals in context sensitive
grammars
4.2.3 Intermediates in context sensitive grammars
4.3.1 Description of transduction rules
4.3.2 Non-terminals in transduction rules
4.3.3 Intermediates in transduction rules
4.3.4 Ambiguity and "looping" problems
in transduction grammars
4.3.5 Compiling and running a transduction
grammar
5........... Run Time Input and Output files
5.2 Appearance of symbols in output files
5.4.2 Report of output parameters
6........... Switches (Directives)
6.1.1 GUI: Advanced Mode, Standard fields:
Compiler
6.1.1.3 Switch: ONE_ALTERNATIVE
6.1.1.4 Switch: RECURSION_MESSAGE
6.1.1.5 Switch: SHARED_UNIVERSE
6.1.1.8 Switch: PREPARE_ PARSETREE
6.1.1.10 Switch: CHECK_VARIABLES
6.2.1 GUI: Advanced Mode, Standard fields: Parser
6.2.1.4 Switch: BUILD_PARSETREE
6.2.1.7 Switch: OUTPUT_PARAMETERS
6.2.1.11 Switch: IGNORE_BLANKS
6.2.1.14 Switch: FIND_ON_ERROR
6.2.1.16 Switch: LEX_INCREMENT
6.3 Switches for error recovery
7.1.1 Commands for compilation in batch
7.1.2 Screen output in the compilation phase
7.1.3 Files related to compilation
7.1.4 Error handling during compilation
7.2.1 Commands for the run time system in batch
7.2.2 Screen output and keyboard input during the
execution phase
7.2.3 Files related to the run time system
7.2.4 Error handling during run time
8........... Problems and current limitations
8.1.1 Upper limits for the size of a grammar
8.1.2 Finite versus non-finite
8.1.3 Limitations concerning context sensitive rules
8.1.4 Limitations concerning actions
8.1.5 Miscellaneous limitations
9.1.1 GUI: Advanced Mode: Lexicon-Compiler
9.4.1 GUI: Advanced Mode: Cascade Linker
9.6.1 GUI: Advanced Mode: Disassembler
10.1 Meta language of Transform grammars
10.2 Rules for the construction of user defined
symbol names
11.......... APPENDIX A: SWITCHES FOR PROGRAMMERS
11.1 Compiler programmer switches
11.1.1 GUI: Advanced Mode: Compiler, Advanced
Fields
11.1.1.1 Switch: TEST_COMP_SETS
11.1.1.2 Switch: TEST_LABELTREES (obsolete)
11.1.1.9 Switch: SKIP_TREES (obsolete)
11.1.1.10 Switch: UPDATE_LABFILE
11.1.1.11 Switch: PRINT_STRUCTURE
11.1.1.12 Switch: DEBUG_CONTROL
11.1.1.13 Switch: TEST_TREES (obsolete)
11.1.1.14 Switch: SKIP_LABELTREES (obsolete)
11.1.1.15 Switch: WRITE_SYMBOLICS
11.1.1.16 Switch: PRINT_RESULT
11.1.1.17 Switch: DEALLOCATION
11.1.1.19 Switch: COMP_STATISTICS
11.1.1.21 Switch: PRINT_STATES
11.1.1.22 Switch: BOOKKEEPING_ITEMS
11.2 Run time programmer switches
11.2.1 GUI: Advanced Mode: Parser, Advanced Fields
11.2.1.1 Switch: TREE_INPUT (obsolete)
11.2.1.2 Switch: TRACE_USECOUNT
11.2.1.3 Switch: PRINT_PARSETREE
11.2.1.4 Switch: COUNT_CONNECTORS
11.2.1.7 Switch: ONLINE_MESSAGES
11.2.1.8 Switch: PARSE_STATISTICS
11.2.1.9 Switch: COUNT_PARSETREES
11.2.2 Switches for interactive use
12.......... APPENDIX B: ASCII TABLE
13.......... APPENDIX C: COMPILER ERROR MESSAGES
13.1 Errors in values of switches
13.3 Not yet implemented features
13.4 Syntax errors in the grammar, not detected
by Bootstrap
This reference manual presents a
general overview of LingDoc Transform, together with a detailed description of
the Transform formalism, its operation and its auxiliary tools.
LingDoc Transform (short Transform) is
introduced in a Management
Overview and in Five Position
Papers. For a formal description of the underlying ideas and principles we
refer to the following book:
G.J.
van der
An ordinary user who wants to make use
of the possibilities of the system has to understand some of the underlying
concepts and must be familiar with some of the terminology. In this
introductory chapter only those terms will be explained which are indispensable
for the understanding of the remaining chapters of this manual. Where necessary
the text is illustrated with examples.
A hands-on
experience is described in the User Manual of Transform, together with an
introduction to the formalism. In this Reference Manual we go into more detail
and describe the Transform system from a general point of view.
In addition to the examples within the
User Manual we will use, throughout this Reference Manual, an example on Dutch
license plates.
Dutch license plates consist of a
combination of pairs of letters and pairs of digits. The pairs are separated by
hyphens. Back in ancient history, the plates started with letters, followed by
two pairs of digits. Nowadays, plates start with a pair of letters, followed by
two digits and another pair of letters. Here is the history:
12‑34‑AB
12‑AB‑34
AB‑12‑CD
Now, enter the following grammar using
an editor that produces plain text files. Name the file
LICENCE.GRM.. (This file is already
contained within the directory Examples_Reference_Manual). The
.GRM extension indicates that the file contains an Transform grammar.
/*
LICENCE.GRM */
LicensePlates::= ( LicensePlate , CR )+ .
LicensePlates::= Letters , Hyphen , Digits , Hyphen , Digits
{S:2} |
Digits , Hyphen , Digits , Hyphen , Letters {S:3} |
Digits , Hyphen , Letters , Hyphen , Digits {S:4} |
Letters
, Hyphen , Digits , Hyphen , Letters {S:5} .
Letters ::= [a-z]
{R:1} , [A-Z] {R:1} .
Digits ::= [0-9]
{R:1} , [0-9] {R:1} .
Hyphen ::= '-'
{R:1} .
The texts between the curly brackets
are not part of the actual grammar; they are only there to ensure that the
grammar produces some interesting output, which does not mean that they may be
omitted. The heading, the grammar name between ‘/*’ and ‘*/’, is just a
comment.
Now, create a file called
LICENCE.SW. Its contents should be a single line saying FINITE (lowercase is allowed
as well). Alternatively, use the GUI in the “Advanced mode” to
switch FINITE on. This will create the file LICENCE.SW automatically.
This LICENCE.SW file is consulted
by the Transform compiler. It contains so‑called switches: instructions
to the compiler to modify its behavior to some extent. In this case the
compiler is instructed to generate a so‑called finite state machine. The
file modifies the default settings of the Transform compiler. If no
.SW file exists, the compiler uses its default settings.
Before the compiler starts, the syntax
of the grammar is checked. This is done by means of the grammar of Transform
Grammars, called BOOTSTRAP.GRM.
When the compilation is complete, you
can look in the directory which contains the grammar.
You will find that some other files have been created.
At this stage, you can invoke the
program that actually runs the grammar. This run time system requires an input
file. Therefore, create an input file containing correct license plates. Name
the file PLATES.TXT. (This file is already
present). Its contents should be:
12‑34‑AB
12‑AB‑34
AB‑12‑34
AB‑12‑CD
In the GUI,
select the input file by pressing the button “Run grammar”.
If all goes well (and let's hope it
does), the Transform Process Monitor says "Input correct". This means
that the Transform system did not find an incorrect license plate in your
input. Try for yourself what happens if you make errors in the syntax of the
license plates.
Apart from saying "Input
correct", the parsing process has a side effect. There now is a file
called PLATES.REP. You can view this file. Its contents
list the original input (preceded by "1:"), followed by lines like
"2:", "3:" and so on.
This .REP
file (usually called report file) is produced because the grammar contained
so-called messages (the text between the curly brackets), which are instructions
to write part of the input to the output. The "
The Transform
system consists of two main programs, a compiler and a run time system. The process in which these programs are used can be
described in a cycle of three steps:
The
first step is to write a grammar in the formalism
prescribed by the system. This can be done by creating an ordinary text file
with an editor. At this point the grammar writer can influence the outcome of
the next step by describing the goals of his grammar (for example
"description versus change" or "recognizing versus
parsing") or by describing some peculiarities of the grammar. This can be
done by creating a so-called switches file, which is also a text file.
In
the next step the grammar must be compiled by the Transform compiler. In fact
the compilation process consists of two stages. In the first stage the grammar
is checked for grammatical errors; the second stage is the generation phase in
which the machine code is generated that constitutes the program resulting from
that particular grammar. If an error is detected in either stage, the grammar
writer must return to the first phase and correct his grammar; otherwise he
goes to the next phase.
The
compiled grammar is now transformed into a kind of program (also called
automaton or machine) and can run over one or more input files, text files for which the grammar has been designed.
This can be done with the help of the second main Transform program: the run
time system. Again some options can be specified in the so-called switches file. For example, one can
specify the name of a lexicon that must be used or one can instruct the run
time system to neglect the difference between lowercase and uppercase letters.
When errors are detected during this phase one can decide whether to correct
the input file (for example when the errors were due to typing errors in the
input file) or to start the cycle again by correcting the grammar. The
different kinds of output (transduction output, parse trees, etc.) will be
written to separate files which the user can inspect or use in other programs.
In the remaining sections of this
chapter a number of concepts belonging to the automaton theory will be used. This manual is, however, not the place to
give an introduction in this theory. For the reader who wants to know more
about this subject we refer to the literature.
Formal grammars consist of a set of
interdependent rewriting rules. With these rules one describes syntactic
regularities in a specific domain. For instance, a grammar for Simplified
English gives a description of all the possible sentences that are grammatical
in this particular domain. The combined set of all these sentences, however, is
only a subset of all English sentences. And as another example, a grammar for
the transformation of graphemes into phonemes in Dutch describes the rules by
which pronunciation takes place.
A grammar defines a language, consisting of all sentences (ranges of characters or
words) that are correct according to this grammar. The program resulting from a
compiled grammar will mark every input sentence as correct or incorrect according to the grammar.
When the terms "language" and "sentence" are used in this
manual, they refer to this particular definition, unless stated otherwise.
The automatic processing of texts by
means of a grammar can serve different purposes. Three goals can be
distinguished:
1. Recognition. In this case a
grammar is compiled to a program which will take a string as input and either
accept or reject it depending on whether the string is a sentence which may be
produced by the syntactic description. Such a program is called a recognizer.
2. Interpretation. The grammar is
not only used to analyze an input string in syntactic terms, but also to give a
structural description of the sentence in the form of a so-called parse tree. This kind of activity is
known as parsing and the program
resulting from such a grammar is called a parser.
3. Change. The input
material (or parts of it) has to be replaced by other characters and/or words,
has to be omitted and/or has to be placed in a restructured order. The grammar
contains rules by which this transformation must take place. This process is
called transduction and the compiled
grammar a transducer.
The difference between recognizing and
interpreting programs is only a gradual one; programs of the latter type will
be used as a combined interpreter/recognizer. They will decide whether or not
an input string is correct according to the syntactic description of the grammar
and at the same time a parse tree (or alternative parse trees) will be built.
In the Transform system the three goals mentioned were unified in one single
concept. At the same time a number of grammatical formalisms were unified [1]. Facilities for pattern
matching are also included. See for a short overview the Position Paper
“LingDoc and Grammar Formalisms”.
Grammars
must be written in a kind of formal language, the meta language. This language defines its objects and prescribes the
appearance of grammars.
The syntax for Transform grammars is
fully described in Bootstrap, a "grammar for Transform grammars". This grammar describes all
correct grammars that can be written in the Transform formalism. It is used in
the first phase of the compilation process when the input, a grammar written by
a user, is checked for syntax errors. The outcome can be:
‑ the user grammar does not contain any syntactic errors
and will be recognized as a Transform grammar, or
‑ the user grammar is syntactically incorrect according to
the Bootstrap grammar and the errors will be marked.
In the Transform system, a grammar
consists of an unordered set of rules. Each
rule is basically built out of three groups of components:
1. Terminal symbols. These are
characters and/or words literally found in the input that must be recognized,
parsed or transduced by the compiled grammar.
2. Non-terminal symbols. The user can
define auxiliary terms as names or abbreviations for sequences of terminal
symbols and/or non-terminal symbols. A non-terminal can be used for two
different purposes:
‑ For convenience, for example as an abbreviation for a long series of symbols, repeated several times in the grammar.
‑ In a structural description as a name
for an entity that corresponds with a part of the object. For example one can
use the name "NounPhrase" for a specific part of a natural language
sentence.
3.
A grammar rule consists of a left-hand side and a right-hand side, eventually separated by
a rewrite sign (in the Transform
formalism: a double colon "::"). The end of the rule is marked by a period ("."). The left-hand
side of a rule may consist of one or more symbols separated by the concatenation sign (comma ",").
The right-hand side of a rule (also called the input side) may consist of a, possibly empty, sequence of symbols
also eventually separated by the concatenation sign.
The meaning of a rule can be stated
as:
"the
left-hand side may be rewritten in the right-hand
side" or
"the
right-hand side is the production of the symbol at
the left-hand side" or
"the
right-hand side is the definition of the left-hand
side".
As soon as an input fits the
description of the right-hand side of a rule, we can state:
"the
symbol on the left-hand has been recognized".
Another terminology designating this process is:
"the
right-hand side reduces the symbol on the
left-hand side".
The number of symbols on either side
of the rewrite sign of a grammar rule determines the type of the rule:
‑ If the left-hand side consists of only one symbol, the
rule is said to be context free. The symbol on the
left-hand side of a context free rule is always a non-terminal. The right-hand side may be empty.
‑ If the left-hand side consists of more than one symbol,
the rule is said to be context
sensitive. Within the Chomsky hierarchy of rules and grammars, a distinction is
made between truly context sensitive rules and type‑0 rules. In truly
context sensitive rules, the right-hand side is at least as long as the
left-hand side. Type‑0 rules are
shortening rules: the right-hand side is shorter than the left-hand side. They are
also known as unrestricted rewriting rules. [2]
In connection with the different types of rules, different types of grammars are distinguished.
Context free grammars are grammars which
consist of only context free rules. Each rule of a context free grammar
rewrites one symbol in a sequence of zero or more symbols.
Context sensitive grammars are grammars which
contain one or more context sensitive rule. Type‑0 grammars (or unrestricted
rewriting grammars) are grammars which contain at least one shortening context
sensitive rule. Type‑0 grammars are undecidable, which means that the
parsing process may not terminate.
Within the Transform system, grammars that
are used for recognition and for parsing can be context free or context
sensitive, but type‑0 rules with an empty right-hand side are not
allowed. Each grammar has a starting symbol, defined as the
first symbol on the left-hand side of the first rule in the grammar.
Apart from context free rules, a grammar used for recognition or parsing can also contain context sensitive rules (making the grammar a context sensitive grammar). The class of context free languages is a real subset of the class of context sensitive languages. As a consequence, a grammar using context sensitive rules can recognize input that would be unrecognizable if the context sensitive rules were left out.
As an example, one may think of a grammar describing the plural of English nouns as "singular noun followed by an 's'". By means of a context sensitive rule stating that any "ies" at the end of a word may also be treated as if it were transformed into "ys" one can guarantee that forms like "spies" (derived from "spy" + "s") can also be recognized. This example will be explained in detail in section 4.2.1 (Example of a context sensitive grammar used for recognition).
Transduction grammars contain at least one
context sensitive rule, possibly supplemented by one or more context free rule.
The context sensitive rules in this type of grammar are called transduction rules. Transduction grammars do not have a starting symbol.
Transduction rules can be shortening or lengthening, however, their right-hand
side may never be empty. The goal
of this type of grammar is to replace input conforming to the pattern described
on the right-hand side of transduction rules with the characters underlying the
pattern on the left-hand side.
The compiler transforms a grammar into
a kind of program (or automaton): it generates code that can be interpreted by
the run time system as instructions for the treatment of input for which the
grammar was designed. For example, it can contain instructions for the
transformation of recognized patterns, for building parse trees, for the
consultation of an external lexicon, for emitting messages and for performing
tests.
In generating the code the compiler
describes every possible so-called state of the run time
system during the treatment of the input. This state is only dependent on the
section of the input already treated.
Dealing with natural language means
dealing with ambiguities (see also section 1.5.2 Ambiguity). In the design
of the Transform system, special care has been paid to handle ambiguity in an
efficient way. The developed run time system is an interpreter for so-called Parallel Transduction Automata (PTAs). The run
time system tries to follow all possible paths in parallel (breadth first). If there is more then
one path, common parts of the resulting parse trees will be shared as much as
possible [3]. As a consequence,
recognition will take place in polynomial time, when this is theoretically
possible. In case of an interpreting program the resulting output contains all
remaining parse trees.
Some
types of grammar (namely non-nested context free grammars, see section 1.5.1 Recursion) can be compiled
in so-called finite state automata. A finite state
automaton can be modeled by a finite number of states, an input (tape) and a
set of transitions which specify, given a state and an input, the resulting
state. A finite state automaton is thus modeled as a computer without any
memory.
Other types of grammar cannot be
compiled into finite state automata, because the languages they describe are
too complex to use an automaton without any memory. The number of states that
is needed will become infinite unless memory is added to the automaton.
Therefore a stack is used as a storage device; the run time system must, for
some time, postpone some decisions and must therefore keep track of past
treatments.
For example, grammars containing nested recursion (see section 1.5.1 Recursion)
can never be compiled into a finite state machine. Some extended features in
the Transform formalism also make use of a stack in run time. Grammars making
use of these features must always be compiled to a non-finite automaton,
although the type of grammar may not demand it. Other features (for example
negation and transduction) can only be used in grammars to be compiled into a
finite state machine.
The choice between generation of a
finite state machine and a non-finite state machine is (still) left to the
user. One can add the switch FINITE or NOFINITE
to the switches file. At appropriate places in the presentation of the
Transform formalism one will find remarks on the possible combinations.
In spite of these limitations there
remains a group of grammars for which one has a choice between the two
possibilities. In these cases one can consider that a grammar compiled to a
finite state machine will run faster (because the run time system does not need
to maintain a stack) but the compilation needs more time and memory and will
produce more program code (because more states will be described).
The main programs of the
Transform system as well as the auxiliary tools are written in standard Pascal.
The system has been ported to a variety of hardware platforms. Current
implementations include VAX/VMS, Sun (UNIX) and LINUX. The most recent version
runs under Windows XP.
The compiler and the run time
system can run on different machines because the generated program of a grammar
is written in pseudo-code.
In this section some terms that will
be used in the rest of this manual are introduced: recursion, ambiguity and two special
types of rules, the unit rules and the empty rules.
Grammars may
contain recursion caused by the
(in)direct recurrence of the left-hand side non-terminal on the right-hand side
of the same rule. A distinction is made between nested and non-nested recursion.
Non-nested recursion, i.e. left or right recursion, is caused by the (in)direct
recurrence of the non-terminal at the extreme sides of the right-hand side of a
grammar rule. In nested recursion the non-terminal reoccurs somewhere in the
middle of the rule. Examples:
Premodifier ::= Premodifier
, Modifier . /*
non-nested, left recursion */
Postmodifier ::= Modifier
, Postmodifier . /*
non-nested, right recursion */
Brackets ::= LeftBracket , Brackets , RightBracket . /* nested recursion
*/
Grammars with nested recursion
theoretically cannot be compiled into a finite state automaton, because a stack
is needed in run time[4]. Non-nested recursion is
automatically handled by the compiler.
When a grammar is compiled into a
finite state automaton, a check for nested recursion is always performed by the
program. If one wants the same check performed in the case of
FINITE=FALSE, one can add the switch RECURSION_MESSAGE
to the switches file.
During
recognition, parsing and transduction, ambiguities may arise. The causes may
be:
‑ inherent ambiguity:
according to the syntactic rules there is more than one structural description
of the input
‑ temporal ambiguity:
according to the syntactic rules there is more than one way to proceed in order
to reach one single final structural description.
In designing grammars related to
natural languages, ambiguities are no exception. Often sentences can in fact
only be described in several ways and as a consequence the user will expect
more than one parse tree.
Ambiguities may also arise, however,
as a consequence of badly designed grammars. This is partly a question of
experience. Some advice however can be given; it will appear throughout the
text of this manual in the form of style recommendations. The formalism of
Transform offers a lot of possibilities to give a grammar text a transparent
lay-out; also comments added to the text can be helpful to avoid unwanted
ambiguity.
By unit rules we will refer to a
special type of rule occurring in grammars. If the right-hand side of a rule
contains only one symbol (or a series of alternatives, each containing only one
symbol), the symbol on the left-hand side will be reduced immediately as soon
as one symbol in the input matches the right-hand side. Without special
instructions the compiler will "eliminate" the symbols defined in
this manner and as a consequence the run time system will run faster. However,
because the run time system will have no knowledge about these symbols, they
will not appear in the resulting parse trees. One can instruct the compiler to
generate instructions for all auxiliary symbols in the grammar by adding PREPARE_PARSETREE
to the switches file.
In the Transform formalism, it is
permitted to add context free rules with an empty right-hand side to the
grammar, so-called empty rules. The compiler will
also eliminate these empty rules from a grammar. At first sight such rules do
not make any sense. However, they can play an important rule in combination
with the tests that can be performed (see section 3.1.3.4 Actions in empty units).
The non-terminal on the left-hand side of an empty rule will never appear in
parse trees.
The Transform formalism will be presented in chapter 2 Basic Syntax and chapter 3 Extended Features. Because context sensitive grammars (used for recognition or transduction) have special characteristics, this grammar type will be discussed in chapter 4 Context Sensitive Grammars. Some remarks concerning the output of the run time system can be found in chapter 5 Run Time Input and Output files. Chapter 6 Switches (Directives) contains an overview of all possible options that can be added to the switches file. The commands for activating the compiler and the run time system are subject of chapter 7 Operation. The current limitations of the system will be listed in chapter 8 Problems and current limitations, together with some advice regarding the debugging of grammars. A description of the auxiliary tools can be found in chapter 9 Tools. The concluding chapter 10 Overview in which an overview is given of all the different aspects of the system can be used as a quick reference.
In
this chapter, the three basic components of Transform grammars: meta symbols,
terminal symbols and non-terminal symbols will be explained. The main power of
the Transform formalism, however, lies in the extensions of the simple concepts.
The advanced features will be the subject of the next chapter.
This reference manual, together with the user
manual, uses the syntax of the W3C. The dissertation uses a syntax which we
call “classic”. Both syntaxes may be used by calling a different metagrammar.
────────────────────────────────────────────────────
meta symbol function
────────────────────────────────────────────────────
::= standard rewrite sign between
the left-hand side and right-hand
side in context free or context sensitive rules
<:: rewrite sign in context
sensitive rules
, concatenation
| alternative
. end of rule
/* start of comment
*/ end of comment
────────────────────────────────────────────────────
Context free rules with the
standard rewrite sign are rewriting, or defining, rules. Context sensitive
rules may be written with the standard rewrite sign ("::=") or with
the context sensitive rewrite sign ("<::"). When a rule with a
context free appearance (only one symbol on the left-hand side) is meant to be
a context sensitive rule, the rule must be
marked as such by the context sensitive rewrite sign ("<::").
We repeat here that the
right-hand side of context sensitive rules may not be empty.
Restriction:
Recognizing grammars with context sensitive rules must be compiled with the
switch
FINITE=FALSE.
The right-hand side of a
context free rule contains the definition or production or realization of the left-hand side. There may be more than one
realization. In this case, the grammar will contain several rules with the same
left-hand side. In the Transform formalism, it is possible to collapse rules
with identical left-hand sides; alternative right-hand sides may be linked by
the alternative sign (vertical bar "|").
For example, the two rewrite
signs ("::=" and "<::"), can be defined by two rules:
RewriteSign ::= '::=' .
RewriteSign ::= '<::' .
These rules can be collapsed
to only one rule:
RewriteSign ::= '::=' | '<::' .
Remark: The concatenation sign (",") has
precedence over the alternative sign ("|"). So the rule:
Symbols ::= Letter
, Digit | Letter, Letter .
is equivalent with the two following rules:
Symbols ::= Letter,
Digit.
Symbols ::= Letter,
Letter .
Remark:
The use of the concatenation sign (",") is optional. It may be used
for readability, e.g. when a grammar rule is written on one line. It may be
left out, e.g. when each notion within a rule is written on a separate line.
A grammar may be supplied with comments.
A comment starts with a slash
and a star ("/*") and ends with a star and a slash ("*/"). One can write any character between these begin and end
markings, except the end combination star‑slash itself.
The main purpose of using
comment in a grammar is documentation. In the development phase, for example while debugging a grammar with
erroneous results, one can switch off one or more rules by enclosing them
within comment markers.
Comments may be included at
the beginning and at the end of the grammar, between rules and within rules
(before and after each symbol or meta symbol).
Spaces, returns and form feeds
between symbols and rules are also ignored, just like comments.
The only basic meta symbols
not present in the list presented at the start of this section are the ones
related to regular expressions. Regular expressions provide a compact notation
of repetition in rules.
Style recommendations: Large grammars can become complex and difficult to expand. Although a
user has to develop a clear style for writing grammars by himself, we can give
some recommendations:
1. Use
comments to document the grammar.
2. The
structure of a grammar becomes more clear if blank lines between rules are
added and tabs are used for indentation.
3. Write
every alternative on a separate line (unless the alternatives consist of a
simple enumeration).
4. Write
all context sensitive rules with the "<::" sign.
As an example, the basic
syntax of grammars is presented here as (an outline of) a context free grammar:
Grammar ::= Rule, RestGrammar .
RestGrammar ::= /* nothing */ |
Rule RestGrammar .
Rule ::= LefthandSide, RewriteSign,
RighthandSide, Period .
LefthandSide ::= Symbol,
RestLefthandSide .
RestLefthandSide ::= /* nothing */ |
(
Comma)?, Symbol, RestLefthandSide .
RighthandSide ::= /* nothing */ |
Alternative, RestRighthandSide .
Alternative ::= Symbol,
RestAlternative .
RestAlternative ::= /* nothing */ |
(Comma)?, Symbol,
RestAlternative .
RestRighthandSide ::= /* nothing */ |
VerticalBar, Alternative,
RestRighthandSide .
Remark: Note
that the comments consisting of "/* nothing */" mark an empty
alternative. For example, "RestGrammar" is defined as empty or as a
rule again followed by "RestGrammar". In section 2.4 (Regular expressions)
the combination of such alternatives to only one alternative by means of a
regular expression will be explained.
The grammar given at the end of section 2.1 (Meta symbols) is not
complete, because it does not contain terminal symbols. Terminal symbols are
symbols that are not rewritten; they represent the characters and/or strings
that must be found in the input. Therefore they never appear on the left-hand
side of a context free rule.
In this section three
different appearances of terminals will be discussed: single characters and
strings, ranges and wildcards.
All characters (including the
characters that have a meta function), will be recognized by the system as
terminals if they are surrounded by single quotes, for example:
Vowel ::= 'a' | 'e' | 'i' | 'o' | 'u' .
Quote ::= '''' | '"'.
Note that the single quote
itself must be doubled.
As an obsolete feature, single
"name characters" may be represented literally (without single
quotes), where "name characters" are defined as the alphanumeric
characters and the underscore. So as an alternative of the first rule in the
example given above, one can state:
Vowel ::= a | e | i | o | u .
All other characters may not
be represented literally; they must be surrounded by single quotes.
Alternatively, as an obsolete feature, they may be preceded by the escape
character (backslash "\"). The only exception to these
representations is the character that denotes the end of line (also called the carriage return character or the newline character). There is however a
reserved symbol which matches an end of line character (the actual value of
which is operating system dependent): "cr" or "CR" (not
"Cr"). This reserved terminal symbol (like all terminal symbols) may
never occur on the left-hand side of a context free rule; this implies that one
cannot redefine this reserved symbol.
All terminal symbols without
exception may also be represented by the number representing their underlying
code. A value in the (decimal) range
The values have to be written
in hexadecimal notation, like ‘#x61’. In Appendix B the characters and their
ASCII code in three bases (octal, decimal and hexadecimal) are listed.
In summary, there are
alternative ways to represent a single character in a grammar:
|
"name characters" like "a" |
"single quote" character |
"end of line" character |
other characters like "." |
|
'a' |
'''' |
CR |
'.' |
|
#x61 |
#x27 |
#x0D |
#x2E |
Sequences of terminal symbols
may be given as a string between single quotes. For example:
abcString ::= 'abc' .
is another notation for:
abcString ::= 'a',
'b' , 'c' .
The matching of the characters
in the grammar with characters in the input is case sensitive, so an
"a" in the grammar matches only an "a" in the input and not an "A". If there is no need to distinguish
between upper and lowercase in the input, one can add the switch CASE_CONVERT to the switches file.
There are text files with end
of line characters that are not important for the structure of the text. The
starting symbol of a grammar designed for a description of the structure of
such a text covers input that can
contain end of line symbols, but does not necessarily do so. One does not know
in advance at which position an end of line symbol or a form feed can be
expected. In such a situation the run time system can be instructed to skip the
end of line symbols and form feeds by adding the switch IGNORE_CR to the switches file; a corresponding switch for blanks (spaces and
tabs) is
IGNORE_BLANKS.
It is possible to instruct the
run time system to ignore other characters as well. For example, if you don't
want to mention certain punctuation marks in your grammar, you can add the
switch:
SKIP_CHARS = '.' | ',' | '!' | '?' .
With a range one can define a
set of alternative terminals. A range consists of two terminal symbols
separated by a hyphen. The range includes the two mentioned terminals plus
those in between, defined by the order of the underlying character set (the
ASCII set in the current implementation).
Examples:
NameCharacter ::= [a-z] | [A-Z] | [0-9] | '_' .
OctalDigit ::= [0-7] .
Consonant ::= [b-d] | [f-h] | [j-n] | [p-t] | [v-x] |
'z' .
Remark: Note
the absence of single quotes around the name characters (letters, digits and
the underscore) that mark the ranges.
A range may consist of one
terminal symbol, for example:
Underscore ::= [_-_].
Ranges, just as ordinary
terminals, may be defined in terms of their ASCII code, for example:
Ascii ::= [#x00-#xFF] .
Brackets ::= [#x28-#x29] . /* "(" and ")" */
Remark:
The underlying numerical value of the first terminal must be less than or equal
to the value of the second one. If this is not the case an empty range is
defined (which is most probably not intended).
Restriction:
Ranges can not be used on the left-hand side of context sensitive rules.
Note: In
other syntaxes, like the W3C ENBF syntax for XML, it is possible to define
regular expressions of characters. We support the following forms:
- [a] (one constant) and [a-z] (range of constants);
- [#xN] (one constant) and [#xN-#xN] (range of constants);
- also, the negated variants are supported: [^a], [^a-z], [^#xN] and
[^#xN-#xN].
Other variants can easily be dealt with, e.g. [0-9a-z] can be written
as [0-9] | [a-z], because of our notation for regular expressions. In our
opinion, this notation is more readable.
If required, a future version of Transform will support the syntax of
[0-9a-z].
Furthermore, we do not deal with the special treatment of '-' and ']'
in the W3C ENBF syntax when used as the first character following the '['.
Wildcards are meta symbols which match arbitrary characters (including
carriage return, form feed etc.). There are three different wildcards, with
different functions: a dontcare, a line and
an
arb symbol. They are listed
below, together with their functions.
|
symbol |
name |
function |
|
* |
dontcare |
matches exactly one
arbitrary character |
|
= |
arb |
matches zero or more
arbitrary characters, not including the character following the arb symbol in
the grammar |
|
‑ |
line |
matches any sequence of zero
or more arbitrary characters, including occurences of the character following
the line symbol in the grammar |
For example:
Password ::= *, *, *, *, *, * .
defines a password as a
sequence of six arbitrary characters. And in:
Comment ::= '/*',
=, '*/' .
a sequence of characters which
starts with ‘/*’ and ends with ‘*/’ is recognized as a comment. The characters
which are matched by the arb symbol will not contain a ‘*’. An alternative definition
of comment:
Comment ::= '/*',
‑, '*/' .
defines the same sequence of
characters, but the characters which are matched by the line symbol may contain
one or more ‘*/’. For example:
/* a comment is marked by a "*/" at the end */
will be matched twice by the
rule with a line symbol. It will only be matched once by the rule with an
arb symbol.
Remark:
The arb symbol matches any sequence of characters, not including the character following it in the grammar rule. If the arb symbol is
followed by a string of terminal characters, it is only the first character of
this string which defines the operation of the arb symbol. For example:
Restriction
1: When the compiler option FINITE=FALSE is used an arb symbol may only be followed by a terminal symbol or by
an intermediate symbol (see section 3.2.4 Intermediate symbols).
It may not be followed by a non-terminal symbol or a lexicon symbol (see
section 3.2.2 Lexicon symbols) and
it may not occur at the end of a rule. If FINITE=TRUE, there are no restrictions at all and an arb symbol may even be the
last symbol in a grammar rule.
Restriction 2: Wildcards can not be used on the
left-hand side of context sensitive rules.
Non-terminal symbols are
symbols that occur at least once on the left-hand side of a context free rule.
Non-terminal symbols function as names, or abbreviations, for sequences of
symbols. Their names consist of a sequence of one or more alphabetic characters
(uppercase or lowercase), the digits 0 to 9 and the underscore character
("_"). The maximum length of a non-terminal name is 32
characters.
Remark
1: The names of non-terminal symbols are case sensitive. For example, a symbol
defined as "comma" does not match any other spelling of the name,
such as "Comma" or "COMMA".
Remark
2: A symbol name consisting of only one character is handled as a terminal
unless this name occurs at least once on the left-hand side of a context free
rule (then it is considered a non-terminal). Therefore, beware of non-terminals
named with only one character.
Style recommendation:
To avoid possible confusion with terminals, it is recommended that the names of
non-terminals contain at least two characters.
Non-terminals defined by one
or more unit rules will not automatically appear in parse trees when the run
time switch
BUILD_PARSETREE has been set to true. Only after setting the compiler
switch
PREPARE_PARSETREE to
the value
TRUE will the resulting parse
trees contain every appropriate non-terminal. Examples of unit defined
non-terminals can be found in the next rules:
AlfaNum ::= Alfa | Number .
Alfa ::= [a-z] | [A-Z] .
Number ::= ( [0-9] )+ .
If the switch PREPARE_ PARSETREE is set to FALSE,
"Alfa" and even "AlfaNum" will not appear in any parse tree
at all; "Number" will only be present if it matches input consisting
of more then one digit.
A built-in check on typing
errors in non-terminal names is present to some extent. The compiler considers
a name consisting of more than one character as an "intermediate
symbol" (this symbol type will be introduced in section 3.2.4 Intermediate symbols) unless
it appears at least once on the left-hand side of a context free rule. If
intermediate symbols are present, their names appear on the screen and in the
error file; misspelled names of non-terminals can appear amongst them.
Grammars for recognition
and/or parsing contain at least one non-terminal: the symbol on the left-hand
side of the first rule in the grammar. This special non-terminal is also called
the
starting symbol.
Transduction grammars can be
written without any non-terminals, because they do not contain a starting
symbol. For example, the rule:
'terminal symbol' <:: 'terminal' .
can be compiled as a grammar
but contains only terminals. When applied, it converts all the words
"terminal" in a document (for instance this manual) to the words
"terminal symbol".
Regular expressions provide a compact notation of repetition and
embedded alternation:
─────────────────────────────────────────────────────
expression function
─────────────────────────────────────────────────────
x matches
exactly one occurrence of "x"
( x )? matches
zero or one occurrence of "x"
( x )* matches
zero or more occurrences of "x"
( x )+ matches
one or more occurrences of "x"
( x | y )? matches
zero or one occurrence of either "x" or "y"
( x | y ) matches
exactly one occurrence of either "x" or "y"
( x | y )* matches
zero or more occurrences of "x" and/or "y"
( x | y )+ matches
one or more occurrences of "x" and/or "y"
─────────────────────────────────────────────────────
In these examples of regular
expressions, "x" and "y" may be replaced by a terminal or
non-terminal symbol, by a sequence of terminal and/or non-terminal symbols or
by another regular expression.
The grouping notation
(...) is especially useful if one wishes
to overrule the precedence of the concatenation sign over the alternative sign.
For example, if the definition of the non-terminal "Symbols" in the
example on page 15, was meant to be:
"a string of three characters, the first and the
last a letter and between them a letter or a digit",
this can be written as:
Symbols ::= Letter,
( Digit | Letter ) , Letter .
As an example, the grammar for
Transform grammars (earlier presented in section 2.1 Meta symbols), is given here in a more compact
notation:
Grammar ::= ( Rule )+ .
Rule ::= LefthandSide, RewriteSign,
RighthandSide, Period .
LefthandSide ::= Symbol,
( Comma, Symbol )* .
RighthandSide ::= Alternative, ( VerticalBar, Alternative )* .
Alternative ::= Symbol,
( Comma, Symbol )* .
Restriction: Regular expressions can not be used on the left-hand side of context sensitive rules.
Actions relate to extensions
of the formalism. There are two kinds of actions:
(1) assignments
and tests (section 3.1.1 Assignments and
tests)
(2) messages
(section 3.1.2 Messages )
The general rules for actions are:
‑ Actions
are written between braces "
‑ Apart
from assignments and tests, an action starts with a keyword determining the
type of the action. These keywords are "r:" and "s:" and
may be written in upper or lowercase without difference in meaning.
‑ Actions
may be conjoined by the conjunction operator (semicolon ";").
‑ Actions
may only be used on the right-hand side of a context free rule. They may occur:
‑ after each symbol or
regular expression
‑ between the
")" and the " * " or "+" of
a regular expression
‑ as the first (or only) element of an alternative.
For some types of actions the effects depend on
their position in the rule. In section 3.1.3 (Places of actions in
rules) the different possibilities are discussed together with a
description of the effects.
A
powerful mechanism is the application of tests: the recognition of input can be
made dependent of a testable criterion. This feature can be used successfully
in grammars that describe ambiguous phenomena and can limit the number of
alternative parse trees. They are very useful in combination with the symbol
type "non-terminals with parameters", to be introduced in section 3.2.1 (Non-terminals
with parameters).
Tests are performed on variables and literals,
which can be combined together into expressions (section 3.1.1.1 Variables,
literals and expressions). Before a test can be performed, variables must
be given a value by means of an assignment (section 3.1.1.2 Assignment). In section 3.1.1.3 (Types of tests)
different types of tests are discussed and section 3.1.1.4 (Combinations of
tests and/or assignments) concerns possible combinations of tests and/or
assignments.
In Transform grammars,
two kinds of variables can be used: local variables and
parameters. The main difference between the two types lies in their scope; the
scope of a local variable is the rule in which it occurs, a parameter can be
used to transmit information between different rules. In this section, local
variables are introduced and although parameters will be explained later on
(see section 3.2.1 Non-terminals with parameters),
all operations that can be performed on local variables are applicable to
variables introduced as parameters as well.
Every rule, except context sensitive rules, may
be extended with local variables. Local variables are devices which can be used
for administrative purposes; information (in the form of a string) available at
some point in the rule can be stored in a variable so that it can be used at
another point in the same rule. A variable has a name (consisting of name characters,
see the first example in section 2.2.2 Ranges) and a value (in the Transform system always
a string).
A local variable need not be declared; the
variable will be allocated at the moment of its first reference in the rule. At
the moment of its allocation, it contains the empty string and this remains its
content up to the moment that another value is assigned to it.
There is a reserved variable name (percent
"%"). Its value is always the character that has most recently been
read.
Besides variables, literal string values (also
called literals) can be used in actions. They also store string
information, but their value can never be changed. Between the braces ("
Expressions can be
built from variables and literals. This is done with the concatenation operator
"||". The value of an expression is the concatenated value of all its
components.
Remark
1: In the next sections the term "expression"
refers also to its most simple form: a simple variable or a simple literal.
Remark 2: In a grammar with
many variables, a typing error in the name of a variable is easily made.
Because the compiler treats every new name as an occurrence of a new variable,
an error in a name can cause incorrect behavior in the compiled grammar. If the
switch CHECK_VARIABLES is
added to the switches file, all separate variables found in a rule will be
listed in the compiler error file.
As an example, we give the value of some
expressions that can be composed from the following components:
var1
with the value "aa",
var2
with the value "bb",
%
with the value 'c' and
the
literal 'dd':
|
expression |
resulting value |
|
var1 || var2 |
"aabb" |
|
var1 || % |
"aac" |
|
'dd' || var2 |
"ddbb" |
|
var1 || % || var1 |
"aacaa" |
In Transform notation the definitions of the
concepts, introduced in this section, are:
/*
Definitions for variables, literals and expressions */
Variable ::= VariableName | '%' .
Literal ::= ''''
( Token )* '''' .
VariableExpression ::= VariableOrLiteral, ( '||',
VariableOrLiteral )* .
VariableOrLiteral ::= Variable
| Literal .
VariableName ::= ( NameCharacter )+ .
Token ::= [#x00-#x26] | [#x28-#xFF] |
[#x27-#x27].
/*
#x27 is the value of a single quote, which must be doubled */
A variable may be
assigned a value by means of the assignment operator ("="); in
general:
{
<variable> = <expression> }
which means that after evaluation the variable
has the value specified by the
expression.
In the Transform notation, the definition of an
assignment is:
Assignment ::= Variable, '=',
VariableExpression .
Remark: Explicit assignment
to the reserved "%" is not possible; this variable is assigned its
value, the last read character, in run time.
Following the meaning of the concept expression
(see section 3.1.1.1 Variables,
literals and expressions), the value of a variable can be set in different
ways, for example:
{var1=var2}
or {var1=%} (assignment of a
variable)
{var1='abc'} (assignment of a
literal)
{var1=var2||'abc'} (assignment of a variable
expression)
The value of a variable may be empty at any time.
Remember that a variable has an empty value at the moment of its first
appearance in a rule; its value can become the empty string again by assigning
the literal representing the empty string to it. Therefore:
{var=''}
removes the value of the variable
"var".
Tests concern the truth value
of a comparison between two expressions. If this value is false, the current
recognition path stops. There are two test operators, one for equality and one
for inequality.
The equality test:
{
<expression1> == <expression2> }
returns the value true if "expression1"
has the same value as "expression2", while the inequality
test:
{
<expression1> != <expression2> }
returns the value true if "expression1"
refers to a value other than that specified by "expression2".
The definition of tests in Transform notation
becomes:
Test ::= EqualityTest |
InequalityTest
.
EqualityTest ::= VariableExpression, '=',
VariableExpression .
InequalityTest ::= VariableExpression, '!=',
VariableExpression .
Disjunct
tests may be stated by means of the
"or"-operator:
is true when the evaluation of at least one of
the tests leads to the value true. For example:
{
var1== '' | var2==% }
is true if the value of the variable
"var1" is empty or if "var2" has the value of the last read
character in the input.
If the combined test concerns the same variable
on the left-hand side, a shorthand notation is possible. For example:
{var1==expr1
| var1==expr2}
can also be written as:
{var1==expr1
| ==expr2} .
Remark 1: A series of
disjunct tests must be considered as a combined test with one value; the value
is true if and only if one of its components is true.
Remark 2: The
"or"-operator can only be used in combination with
tests; a
compiler error message will be generated if this operator is used between other
actions like assignments and/or messages.
Multiple assignments may be
conjoined by the conjunction operator (";" or "&"):
{var1=expr1; var2=expr2} or {var1=expr1 & var2=expr2}
Multiple
tests may also be conjoined by the
conjunction operator:
{var1==expr1
& var2!=expr2} or {var1==expr1; var2!=expr2}
is true when all of the tests are true.
Remark: Because a series of
disjunct tests has to be considered as one test with
one
resulting value, in a combination of the "or"-operator with the
conjunction operator the former has precedence over the latter. Therefore, the
compound test:
{var1=='1'
| var2=='2' & var3=='3'}
only
succeeds when var3 has the value '3' and either var1 has the value '1' or var2 has the
value '2' (or both).
A mixture of tests and assignments, just
as other actions, may also be conjoined by means of the conjunction operator:
{var1==expr1; var2=expr2}
or {var1==expr1 & var2=expr2}
Style recommendation: It is
recommended that the "&" variant is used for the conjunction of
two tests and the ";" variant for the conjunction of two actions
where at least one of the actions is not a test. For example:
{var1==expr1
& var2==expr2; var3=var1||var2}
In this manner the Boolean outcome of the
conjunction of two tests is emphasized by means of the sign "&"
since the ampersand is often used in mathematics to stand for the Boolean
"and"-operator.
Remark: The combination of a
test and an assignment has the appearance of an "if‑then" statement in a programming language, but note that if the
test fails, the parse fails as well.
You can solve this
situation by adding an identical alternative to the rule in which you change
the test to the opposite meaning (i.e. "==" becomes "!="
and "!=" becomes "=="). For example, consider the following
sketch of a rule:
aa ::= ...
( bb | cc {p='1'} ) {p=='1'; q='2'}
... .
If you intend this
rule to mean "if variable 'p' has obtained value '1', the variable 'q'
becomes '2', else 'q' remains unchanged", you should change it as follows:
aa ::= ...
( bb | cc {p='1'} ) {p=='1'; q='2'}
... |
...
( bb | cc {p='1'} ) {p!='1'}
... .
This
problem will become more obvious when non-terminals with parameters are
introduced.
As a final example, we present a grammar for
palindromes (lines that are the same when read from right to left). Such a
grammar could be written using local variables:
Palindrome ::= (
Char {head = head || %} )+ ( Char )? ,
(
Char {tail = % || tail} )+ {head == tail} .
Messages can be used to mark
the relevant points in the recognition process by which other (external)
processes can be triggered. Messages provide for event-based processing in
general purpose programming languages.
When messages are used in a grammar, an output
file with the extension .REP (also called "report file") is created in which the messages
are written. There are two variants: simple messages (section 3.1.2.1 Simple messages) and messages with variables (section 3.1.2.2 Messages with
variables).
There are two message functions:
signal
and report. A message is characterized by a function
indication ("s" for signals and "r" for reports) followed
by a colon and a positive, non-zero number:
Also, in Transform notation:
Message ::= Signal
| Report .
Signal ::= 'S_or_s', ':' , Number .
Report ::= 'R_or_r', ':', Number .
Number ::= [1-9] ( [0-9] )* .
R_ or_ r ::= 'R' | 'r' .
S_ or_ s ::= 'S' | 's' .
Remark:
The number "0" cannot be used as a signal or report.
Signals or s‑messages only report the number and a
colon (":"); reports or r‑messages report the number followed by a
colon plus the most recently matched terminal symbol. In the case of subsequent
r‑messages with the same report number, the report number is given once
and the reported symbols are concatenated (even if they were not contiguous in
the input).
R‑messages always output the character that
has most recently been read. Thus, if a report follows a symbol or regular
expression that covers no input (as may be the case in the "( )" and
"( )* " expressions), the character that was
read before the regular expression will be output.
In the grammar:
String ::= ( Word
Word ::= ( Char )+ .
Char ::= [a-z]
Blank ::= #x20 .
any sequence of lowercase alphabetic characters
will be recognized as a string and will return as output a report file with two
lines. The first line will consist of the report number "1" followed
by a colon and all recognized characters, since subsequently reported terminal
symbols with the same number are concatenated. The second line will consist of
the message number "2" followed by a colon (and nothing else). For
example, for the string "abc def " the output will be:
1:abc
2:
1:def
2:
If, in the same grammar, the signal
"S:2" in the first rule is omitted, the output of the same input will
be:
1:abcdef
The characters written to the report file by
means of r‑messages will not be converted to a readable format. This only
has consequences if the input file contains control characters (characters
representing the decimal ASCII codes smaller than 32 or greater than 126)
and
these characters are emitted by r‑messages. A character in the input
matching a terminal symbol specified in the grammar as:
... #x04
will result in a non-printable character in the
report file after the number "4:". To avoid non-printable report
files, a signal can be emitted instead of a report for those control characters
(for example with a number equal to its ASCII code):
... #x04
So the general rule is: characters emitted by r‑messages
are equal to the corresponding characters in the input. The only exception
occurs when the switch CASE_
CONVERT has either the value LOWER_ TO_ UPPER
or the value UPPER_ TO_ LOWER. In those cases letters will not be emitted in
their original format but in their converted one.
Remark 1: Although the
carriage return character can be represented in the grammar as "cr",
the emission of this character to the report file causes a carriage return.
Remark 2: If a parse is
ambiguous, messages can be produced in a unpredictable order (see also section 5.4.1 Report of messages).
More examples of messages in a grammar rule will
be given in section 3.1.3 (Places of actions in
rules).
A
variant of the simple r‑messages or s‑messages can be used to
communicate expressions. In this use, the report or signal number is followed
again by a colon followed by a variable expression:
For more details about the use of variables and
expressions, see section 3.1.1.1 (Variables,
literals and expressions).
The effect of this action is that as soon as the
input is recognized up to the point of the action, the current value of the
expression will be written to the report file (preceded by the number and a
colon). The same difference in effect that applies for simple reports and
signals is found for signals and reports with a variable expression; the
resulting expressions of an r‑message are concatenated as long as the
same report number is emitted (the report number itself appears only once in
the output). For this reason the signals with expressions are of more practical
use; each separate emission appears on a separate line in the report file.
Examples:
The report files for the examples above will
contain respectively:
1:... (the value of the variable
with the name "var1")
2:abc (the literal string
"abc")
3:...abc (where “…" stands for the
value of the variable with the name "var1")
Remark: Due to an error in
the run time system, messages with expressions sometimes yield incorrect output
when used inside a repeated regular expression.
Actions
can be written after symbols (terminals, non-terminals and the symbol types
that will be introduced later on), after regular expressions and between the
closing bracket of a regular expression and the repetition indicators star
("*") or plus ("+"). There is also a
possibility for an alternative consisting of only an action; this alternative
must be the first one. These possibilities will be discussed in the sections
below.
Actions written after a symbol (terminal,
non-terminal or one of the symbols to be introduced in section 3.2 Symbol type
extensions), will be performed as soon as the symbol has matched the input.
For example, in the rules:
Input ::= (
Word {var1=%}; Blank
Word ::= (
[a-z] )+ .
Blank ::= #x20
| #x0B . /* space or tab */
upon recognition of the input "horse",
the variable "var1" will be assigned its final value ("e");
this value will be emitted to the report file after recognition of the space.
Actions will also be executed when the symbol
itself has covered no input. For example in the rules:
Number ::= Digit, Mantisse
Digit ::= [0-9] .
Mantisse ::= (
'.' Digit )? .
the output in the report file of the input
"2" will be "1:2", because the last read input character
was a "2".
However, if the action is placed after a symbol
inside a
regular expression that can cover no input the action is
not
performed. Therefore in:
Input ::= ( ( Determiner
Determiner ::= 'a'
| 'the' .
Noun ::= 'text' | 'book' .
upon the input "a text book " the
report file will consist of only one "1:".
As a special case, we consider actions after
wildcards. Actions placed after a wildcard that can cover a whole sequence of
characters will be repeated as many times as the wildcard matches the input.
For example in:
Comment ::= '/*',
=
the report file will contain the number
"1", followed by a colon and all terminal symbols matched by the
"arb" are reported. The same effect is achieved when a message
follows a "line" ("‑"). The dontcare symbol matches
only one character, so an r‑message placed behind the dontcare symbol
will also report only one character.
Remark: If a wildcard
("arb" or "line") matches no terminal symbols, the
"last read character" will contain the character that was matched
before the wildcard. An illustration in the case of r‑messages can be
found in the preceding example (a ‘*’ will be emitted). The same can be said
about the reserved variable "%", occurring in actions after a
wildcard.
Actions following a regular expression are
executed only when the regular expression has ceased matching input. Thus, in
the case of a r‑message, the most recently matched character is reported.
It is not the string which has been matched by the regular
expression.
For example, in the grammar:
String ::= ( Word, Comma )+
Word ::= ( [a-z] )+ .
Comma ::= ','
.
the input "abc,pqr," will result in the
following report file:
1:,
A special case is formed by messages, assignments
and tests containing the reserved variable "%" following a regular
expression that does not cover any input. The same which has been said before
about actions after symbols that cover no input, also holds for the regular
expressions "(…)?" and "(…)* ". In
the next grammar with messages, three different input strings are considered to
illustrate the effects:
Input ::= Bs, ( 'a' )*
Bs ::= (
'b' )*
input output (in the report file)
─────────────────────────────
aa 2:
1:a
bb 2:
1:b
bbaa 2:
1:a
An action between the ")" and the
" * " or "+" differs from the
situation in which the action is placed after the " * " or the
"+" that concludes the regular expression.
For example, in the grammar:
String ::= (
Char )
Char ::= [#x00-#xFF]
.
the report is given when the expression is
repeated or left. Thus, the report file will consist of the number "1"
followed by a colon and the recognized characters, which are concatenated.
As an illustration of the difference between a
message action placed before and a message placed after the " * " of a regular expression, we consider the effects of
two slightly different grammars with a r‑message:
Example
1 Example
2
Ss ::= Digit, ( Char )
char ::= 'a' | 'b' . Char ::=
'a' | 'b' .
Digit ::= [0-9] . Digit ::= [0-9] .
input output example 1 output example
2
─────────────────────────────────────────────────
1 (no output created) 1:1
1a 1:a 1:a
1ab 1:ab 1:b
As
explained above, variables placed in regular expressions that can be skipped
remain unchanged. With the help of an action in an otherwise empty unit, one
can force a value to be given to a variable, for example:
QuotedString ::= {q='"'} ( [a-z] {q=q||%} )+
Remark 1: Only
one
empty unit may be added to a rule or a regular expression and that must be the
first
one. The compiler halts with an error when a rule contains more than one empty
unit or when the empty unit is written in an illegal position.
Remark 2: In order to
simulate an "if‑then‑else" construction for tests (see
also section 3.1.1.4 Combinations of
tests and/or assignments), one can make use of so-called empty rules. For
example, as an alternative for the example in that section, one could write:
Cc
{p = '1'}
)
,
(
Empty {p ='1'; q = '2'} |
Empty {p!='1'}
)
.
Empty ::= .
In the
basic syntax, only two symbol types were distinguished: terminals and
non-terminals. In this section, other types of symbols are introduced; some are
entirely different from the two basic ones, other are derivates. In section 3.2.1 non-terminals with parameters will be introduced. The use of parameters or
features transforms a grammar into an attribute grammar. Related to the use of
parameters is the introduction of a new symbol type: the
lexicon symbols. A description can be found in section 3.2.2 Lexicon symbols. Another
extension of the non-terminals, the cover non-terminals, is
especially useful in context sensitive rules. It will be discussed in section 3.2.3 Cover non-terminals. The new symbol type introduced
in section 3.2.4 Cover non-terminals, the
intermediate symbols, is also important for context sensitive grammars.
A non-terminal symbol
may be supplied with parameters, which can be viewed as attributes of the
symbol. They are declared together in a so-called formal
parameter list of a non-terminal symbol on the left-hand side of a context free rule.
There are three types of parameters:
i (input) :
taking (inheriting)
o (output) :
giving (synthesizing)
io (input/output) :
taking and giving
Once declared, the type of the parameters is
fixed. The type of the parameter must be specified in the formal parameter
list, followed by a colon, followed by one or more names of parameters,
separated by a comma. A parameter list can contain parameters of one, two or
three types. There is no prescribed order if more than one type is used. Each
type indicator is separated from the preceding list of names by a comma.
The syntax of a formal parameter list can be
defined in Transform notation as:
FormalParameterList ::= '(', ParameterDeclarations ,
(
',', ParameterDeclarations )* ,
')'
.
ParameterDeclarations ::= TypeIndicator, ':',
ParameterName ,
(
',', ParameterName )* .
TypeIndicator ::= I_or_I, ( O_or_o )? |
O_or_o .
Remark:
Again, the keywords "i", "o" and "io" are not
case sensitive.
Examples of formal parameter lists:
Verb(I:MainVerb) ::= ...
Symbol "Verb" has one
parameter of which the value is determined in the calling rule; the value of
"MainVerb" is not affected after application of the present rule.
VsMod(O:type) ::= ...
Symbol "VsMod" has one
parameter of which the value is initially an empty string; the value of
"type" is determined during the application of the present rule.
Noun(IO:cat) ::= ...
Symbol "Noun" has one
parameter of which the value is initially determined by the calling rule; the
value of "cat" may be changed after application of the present rule.
RelClause(I:Antecedent,O:type,IO:cat) ::= ...
Symbol "RelClause" has
three parameters, one of each type.
NP(O:cat,number,pers) ::= ...
Symbol "NP" has three
parameters, all of the same type.
When the non-terminal is used on the right-hand
side of one or more rules, the parameters must be repeated in a so-called
actual parameter list. The names of the actual parameters can of course be different from
the names used in the formal parameter list. In the actual parameter list, it
is allowed to repeat the type but it is not necessary. If the type is indicated
on the right-hand side, it must be the same as in the declaration. So, one can
choose between, for instance:
... NP(cat, number, pers> ... and
... NP(O:cat, number, pers> ...
Remark: A non-terminal and
its parameter list must be considered as a whole; therefore an action can not
be placed between the name of the non-terminal and its parameter list, but only
after
the parameter list.
Restriction: If a
non-terminal has a left recursive definition, it can only have output
parameters. An example of illegal use of the input parameter is:
Adjectives(I:gender) ::= (
Adjectives(gender) )?, Adjective(gender)
.
The use of parameters enables the exchange of
information between rules. For example, number agreement of subject and verb
can be handled with the use of parameters in combination with a test action
(see section 3.1.1.3 Types of tests):
Sentence ::= NP(number1), VP(number2) {number1==number2} .
NP(O:number) ::= ... .
VP(O:number) ::= ... .
where both parameters are declared as output
parameters. Another grammar, making use of an input parameter, would look like:
Sentence ::= NP(number), VP(number) .
NP(O:number) ::= ... .
VP(I:number) ::= ... {number==…} ... .
Restriction:
Parameters can only be used if FINITE=FALSE.
The starting symbol can also have output
parameters (but no input or input/output ones). The content of these top level
parameters can be written to the report file and/or the screen with the run
time switch: OUTPUT_ PARAMETERS.
Remark 1: It is not allowed to compare
parameters and/or variables that stem from a different path within the same
rule. Thus, the following grammar is illegal:
Sentence ::= ( -, Subj(Sub),
- | -, Obj(Ob), - ) {Sub==Ob & Sub!=''} .
Subj(O:Sub) ::= ... {Sub = 'yes'} .
Obj(O:Ob) ::= ... {
Nor
is it allowed to rewrite this grammar as follows:
Sentence ::= SubjObj(Sub,
SubjObj(O:Sub,Ob) ::= -,
Subj(Sub), - | -, Obj(Ob) , - .
Subj(O:Sub) ::= ... {Sub = 'yes'} .
Obj(O:Ob) ::= ... {
Remark 2: As an alternative
for the rules in the example given in the second remark on page 30 one would like to write:
Aa ::= ... ( Bb |
Cc
{p = '1'}
)
,
Empty(p,q)
.
Empty(I:p,
O:q) ::= EE {p =='1'; q = '2'} |
EE
{p!='1'} .
EE ::= .
However,
because the compiler is not yet equipped to handle alternative empty paths in
combination with in- and output parameters, an error will be generated in cases
like this.
The same operations that are allowed on variables
like assignments and tests (see section 3.1.1 Assignment and tests)
are allowed on parameters. In a rule with a formal parameter list, local
variables can be introduced as well. Input parameters (i‑ and io‑parameters)
are usually used for testing; output parameters (o‑ and io‑parameters)
usually get a value (by assignment).
Recommendation: The compiler
distinguishes parameters from local variables by examining the formal parameter
list. Thus, when a parameter name is misspelled (due to a typing error), it is
silently considered a local variable. Therefore, it is advisable to set the
compiler switch
CHECK_VARIABLES to TRUE especially in early stages of the development of a grammar.
Then, for every rule in the grammar, the local variables will be written to the
compiler error file.
Grammars
that deal with the syntax of natural languages often contain rules, referring
to lexical categories. In order to parse sentences one can add rules to the
grammar like:
Noun ::= 'man'
| 'child' | 'cat' .
Verb ::= 'see'
| 'walk' .
However, if the applicability of the grammar has
to be extended to sentences containing words not yet mentioned, new words must be
added to the grammar and the grammar must be recompiled.
To avoid this, a new type of symbol was
introduced that can be used in Transform grammars, the lexicon
symbols.
Lexicon symbols are a kind of non-terminal symbols which refer to entries in an
external lexicon. A lexicon symbol name consists of a "$" followed by
one or more "name characters" (see the first example in section
2.2.2 Ranges), followed again by a (possibly empty) parameter
list. Lexicon symbols refer to categories in the lexicon. The following symbols
illustrate the appearance of a lexicon symbol:
$Noun(Word,
Type) and $Verb()".
Remark: A lexicon symbol may
be used without parameters, but the parentheses for the parameter list are
compulsory.
A lexicon consists of a list of entries. Each
entry specifies the grammatical category of a word and possibly one or more
features of the word, such as syntactic, morphological and/or semantic ones.
For the construction of a lexicon, see 9.1 Lexicon.
Lexicon entries are not restricted to words;
other uses include word stems, word endings and punctuation marks.
The match between entries in the lexicon and the
words found in the input is case sensitive; using the switch CASE_CONVERT one
can force the run time system to treat all capitals in the input in their
lowercase form (value: UPPER_TO_LOWER) or to treat all lowercase letters in
their uppercase forms (value: LOWER_TO_UPPER).
Lexicon symbols are defined as non-terminal
symbols with implicit O‑parameters. A lexicon symbol matches any word
that occurs in the lexicon with this category. When a lexicon symbol is used
with one parameter, this parameter contains the word that matched the symbol.
When it is used with more than one parameter, the first parameter contains the
word that was matched and further parameters contain the feature values
specified in the lexicon, in left‑to‑right order. The number of
parameters specified in the grammar may be less than the number of parameters
specified in the lexicon, but it may not be more; the number of features
specified in the lexicon must be sufficient
Restriction: Lexicon symbols
can only be used if
FINITE=FALSE, and as a consequence they can
not occur in transduction grammars. They can, however, occur at the left hand
side of context sensitive rules in a grammar used for recognition, but only
without parameters.
There are two ways in which lexicon symbols can
be used, regulated by the run time switch LEX_
INCREMENT:
1. The
run time system scans the input character by character. Lexicon symbols can match any
string, provided it is included in the lexicon. The lexicon symbols can be
mixed freely with other terminal symbols. Ambiguity with terminal symbols in
the grammar is allowed. This option will be chosen if LEX_INCREMENT=TRUE, the
default value for this switch.
For
instance, suppose a grammar has rules like:
Noun(O:
N) ::= Singular(I:N) | Plural(I:N).
Singular(O:
N) ::= $Noun(N).
Plural(O:
N) ::= $Noun(N), 's'.
If
the lexicon contains the entries:
$Noun("sale")
$Noun("sales")
then
the word "sales" in the input can be recognized as both a Singular
and a Plural noun.
2. The run time system scans the input word by word. This option is somewhat more
efficient than the first one, but provides less flexibility. It can be selected
by setting the run time switch LEX_INCREMENT to FALSE. Each word is looked up in the lexicon. If it is found, its
corresponding lexicon symbol(s) are returned and the parse continues.
A
word to be looked up in the lexicon is either a sequence of one or more
lexicon word characters, delimited by one non-lexicon word character or it consists of one
non-blank, non-lexicon word character. The user can specify the lexicon word
characters by adding the switch LEX_CHARS to the switches file. If no such switch is provided by the user, the
run time system assumes the default value:
LEX_CHARS
::=
[a-z] | [A-Z] | [0-9] | '''' | '_'
.
Any blank characters (spaces, tabs,
end of lines, form feeds) are skipped. Examples of words (the surrounding
quotes are not part of
the words):
"man",
"123".
The following strings are not examples of words:
"central
heating", "‑123", "…".
In these examples the default value
of the switch LEX_CHARS is
assumed.
Note that if this feature is used,
any terminal symbols in the grammar (apart from lexicon symbols) will never be
recognized.
Cover non-terminals consist of
a non-terminal name followed by a cover symbol, a caret ("^"). The
non-terminal with this name (without the caret) must be defined in the grammar,
i.e. must occur at least once on the left-hand side of a context free rule.
Cover non-terminals can only be used on the
left-hand side of context sensitive rules. Therefore, their use will be
explained extensively in chapter 4 where all kinds of context sensitive grammars are
discussed.
Upon recognition of a rule in which one or more
cover non-terminals occur on the left-hand side, the run time system will
replace these cover non-terminals with the terminal symbols they have covered
on the right-hand side. For example, if a grammar contains the rules:
aa^, bb^ ::= aa,
cc, bb .
aa ::= ( 'a' )+ .
bb ::= ( 'b' )+ .
cc ::= ( 'c' )+ .
the input "aaacccbbb" will be replaced
by "aaabbb" and the input "acb" by "ab".
Remark 1: If the cover
non-terminal does not reoccur at the right-hand side of the same rule, the run
time system will replace the non-terminal by its defining terminals. For
example, if the grammar contains the rules:
aa^, bb^ ::= aa,
cc .
aa ::= ( 'a' )+ .
bb ::= 'bb', dd .
dd ::= 'dd' .
cc ::= ( 'c' )+ .
the recognized input
"aaaccc" will be replaced by "aaabbdd". Note that such a
cover non-terminal may not contain ambiguity in its defining rule. If the
definition contains ambiguity the compiler will generate an error.
Remark 2: If the non-terminal
with a cover sign reoccurs more than once on the left-hand side, the compiler
will generate an error. Example of illegal use:
Aa, bb^ ::= aa,
bb, bb . /* bb occurs twice. */
On the
other hand, it is allowed to use the same cover non-terminal more than once on
the left-hand side.
Remark 3: In transduction
grammars (see section 4.3 Transduction grammars), all non-terminals on the left-hand side of rules are cover symbols
by default. Thus, it is unnecessary to use the cover sign in transduction grammars.
Intermediate symbols
are used in grammars with context sensitive rules or in a context free grammar
that is the last one of a cascaded grammar (see section 4.3.6 Cascaded grammars).
These symbols are called "intermediate" because they are not terminal
and not non-terminal, but something in between. They differ from
terminal
symbols in that they do not occur in the original input: they are introduced in
the left-hand side of a rule. They differ from
non-terminal symbols in that they do not occur on the left-hand side of a defining
(context free) rule in the grammar.
Once introduced on the left-hand side of a
context sensitive rule, intermediate symbols may serve as input on the
right-hand side of other rules. If the grammar is part of a cascaded grammar
(see section 4.3.6 Cascaded grammars),
its intermediates may also serve as input for subsequent grammars.
The names of intermediates are strings consisting
of at least two "name characters".
Example:
/*1*/ N1,
hundred, N2, ty,
N3, Blank <:: N1, N2,
N3, Blank .
/*2*/ Blank ::= #x20 .
/*3*/ N1 ::= [1-9] .
/*4*/ N2 ::= [2-9] .
/*5*/ N3 ::= N1 .
In this transduction grammar, "[1-9]",
"[2-9]" and "#x20" are terminal symbols, "Blank",
"N1", "N2" and "N3" are non-terminal symbols and
"hundred" and "ty" are intermediate symbols. These
intermediates are introduced on the left-hand side of rule 1 and they can serve
as input for other rules, such as:
/*6*/ thir,
ty <:: 3,
ty .
/*7*/ three,
No_ty <:: 3, No_ty .
/*8*/ No_ty ::= 0 | N1 | hundred | Blank .
According to this grammar, the transduction of
the input "333 " will be "threehundredthirtythree ", as an
illustration of the fact that intermediate symbols, like terminal symbols, may
be found literally in the printed output.
Although intermediates look like terminal strings
(however, the single quotes are missing), they can not be split into parts. So,
the next rule can not serve as an alternative for rule
thir, 't' <::
3, 't' .
For the use of intermediates in context sensitive
grammars for recognition and/or parsing, we refer the reader to section 4.2.3 Intermediates in
context sensitive grammars.
A useful extension, especially
in transduction grammars, is negation. Sometimes it is a rather laborious task
to describe a pattern in positive terms. For example, the pattern "every
word, except the ones starting with a 'ch'" can only be described in
positive terms as "every word starting with a letter not equal to a 'c',
or a word that starts with a 'c' but its second letter is not equal to an
'h'". As a Transform rule:
WordNo_ch_Start ::= No_c, ( Letter )* |
'c', No_h, ( Letter )* .
No_c ::= [a-b] | [d-z] .
No_h ::= [a-g] | [i-z] .
Letter ::= [a-z] .
With the help of the negation sign, the back quote
("`"), this can be written in a simpler way:
WordNo_ch_Start ::= Letter |
`Ch, ( Letter )* .
Ch ::= 'ch' .
Letter ::= [a-z] .
The negation sign can be used in combination with
terminals like characters, strings and ranges, with non-terminals and
intermediates, but also in combination with regular expressions. In the next
sections all combinations will be explained.
The use of the negations sign is subject to the
following rules:
Restriction
1: Negations can only be used if FINITE=TRUE, unless the negation concerns one input character.
Restriction
2: Negations can not be used on the left-hand side of context sensitive
rules.
Restriction
3: Negations can not be used in combination with lexicon symbols.
In section 3.3.3 Negation
of regular expressions some restrictions concerning
the use of negated regular expressions will be added to this list.
The negation of terminal characters, ranges or an
intermediate symbol is expressed by the negation sign, a back quote (`), before
the symbol. It is defined as its complement in the set of which it is a
member.
For the terminals, this set is by default the
ASCII character set #x00-#xFF), but it can be redefined once for a particular
grammar, using the compiler switch UNIVERSE. For example, if the switches file contains the line:
UNIVERSE = [a-z | A-Z].
"`a" matches an arbitrary
letter,
but not "a", while in a grammar without the UNIVERSE switch "`a" matches any arbitrary character but not "a".
For intermediates the complement is defined in
the total set of all intermediates used in this particular grammar. So the
negation of an intermediate symbol will be interpreted as "any
intermediate symbol, but not the symbol itself".
It is possible, however, to treat the terminals
and the intermediates as one set of symbols; ths is done by adding the switch SHARED_UNIVERSE to
the switches file. If this is done, the negation of a terminal and an
intermediate will be interpreted as "any input symbol, but not the symbol
itself".
The negation of a non-terminal symbol is
expressed by the negation sign before the symbol. It is defined as the set of
strings with the same length as that of the string matched by the symbol, but
not equal to it. When there are alternatives, the complement is defined as the
set of strings with a length between the length of the shortest and the largest
string that is matched by the symbol, but not equal to any string that can be
matched by the symbol.
For example, if the non-terminal
"Input" is defined as:
Input ::= 'a', ( 'b' )?, 'c' .
then
‑ only
the strings "abc" and "ac" are recognized as the
non-terminal "Input" and
‑ all strings
consisting of two or three characters not equal to "abc" and
"ac" are recognized as the negation of the non-terminal
"Input" (`Input).
The negation of regular expressions is expressed
as the negation sign before the open bracket. It is defined as the negation of
a non-terminal symbol which is rewritten as the regular expression.
For example:
`(
'p' ,'q' )*
must be interpreted as: every string of an even
number of characters (and at least two) that does not contain the combination
"pq" starting at an odd position.
Restriction
1: Only regular expressions that enclose the total right-hand side of a
rule can be negated.
Restriction
2: Therefore, nesting of negated regular expressions is forbidden.
By defining extra non-terminals, it is always
possible to overcome these restrictions. For example, the ungrammatical rule:
WordNo_sch_Ending ::= ( Letter )+, `( 's',
'c', 'h') .
can be rewritten to the grammatical ones:
WordNo_sch_Ending ::= ( Letter )+, `Sch .
Sch ::= 'sch' .
Remark 1: The interpretation
of the negation of the "zero or one" and the "zero or more"
regular expressions needs some explanation. The negation of "zero or one
times 'something'" is "at least one time anything else but
'something'". So, in fact, the negation of a (…)? regular expression is
identical to the negation of a (…) regular expression and the negation of a (…)* regular expression is identical to the negation of a (…)+ regular expression. For example, there is no difference in
interpretation between:
`(
(Letter)? ) and `( Letter )
and
between:
`(
Letter )* and `(
Letter )+ .
Remark 2: In
the interpretation of the negation of a regular expression with alternatives
(the Boolean "or"), these operators switch their meaning according to
the mathematical "laws of De Morgan". For example:
`(
Letter | Number )+
must
be interpreted as: one or more characters, not containing any letters
and
not containing any numbers.
Strings of terminal symbols can only be negated
indirectly: to negate a string, make it a negated regular expression. For
example, the negation of the string "something" can be written as
`(
'something' )
It is defined as the set of strings with the same
length as the mentioned string, but not equal to it.
Examples:
`(
'abc' ) matches a
sequence of three characters, which is not "abc"
`(
'ab' | 'bc' ) matches a sequence of
two characters, which is not "ab" or "bc"
`(
‑ [a-c] ) matches any sequence of characters, which does not end in
"a", "b" or "c"
Usually,
while running a compiled grammar, the run time system will stop if an error is
detected in the input (i.e. when the input does not conform to the grammar). It
is, however, also possible to have the run time system resume its operation at
a certain position in the output, thus producing an overview of all errors.
In order to achieve this goal, the run time
switch
FIND_ON_ERROR can be added to the switches file. The value of
this switch should consist of a set of terminal
symbols.
Upon an error, the run time system will skip forward to the first terminal in
the input that is in the set and still complies with the state of the parse at
the time of the error. The parse is resumed starting at the first character
found.
The terminal characters in the definition of FIND_ON_ERROR should be relatively unique. Often, "cr" is a good choice
or, when the grammar deals with natural language sentences, a period
(".").
There are times when one needs to define the
context of the character used for the recovery. For instance, in the Bootstrap
grammar (the grammar defining the Transform formalism), skipping to a period
only make sense if this period is not part of a comment (otherwise, the entire
rest of the input might be recognized as a comment!). For situations like this,
the FIND_ON_ERROR switch has its counterpart in a switch with the name SKIP_ON_ERROR;
this lists all terminals that can introduce some unwanted domain. In Bootstrap
this is the ‘/’.
Example:
FIND_ON_ERROR = '.' .
SKIP_ON_ERROR = '/' .
The two switches given in the example above are
actually added to the switches file of the Bootstrap grammar. So, when a newly
written grammar is checked for syntax errors (see section 7.1 Compiler), a list of all errors that occurred will be made.
Remark:
Error recovery is not yet possible if the grammar contains context sensitive
rules.
A context sensitive rule rewrites the input that
fits the right-hand side to the pattern described by the left-hand side. By
preference, the rewrite sign between the two sides is "<::". The
left-hand side of such a rule may never be empty. In context sensitive rules
ordinary terminals, like 'a' or <032> or '.', can be used freely. They
can also contain non-terminals and intermediates; their use will be explained
in the section for context sensitive recognizing grammars as well as in the
section for transduction grammars.
There are, however, some restrictions on the use
of other symbol types and meta symbols on the left-hand side of the rule. On
the left-hand side of a context sensitive rule, the following are
not
allowed:
1.
Regular expressions
2.
The meta symbol "|" (alternative sign)
3.
Ranges and wildcards
4.
Non-terminals with parameters
5.
Negations of any kind
6.
Lexicon symbols with parameters
7.
Actions
Some of these restrictions are easy to overcome
by the introduction of extra non-terminals. As an illustration, a small
(transduction) grammar [5]
for the transformation of the letter "c" in one of the possible
phonemes "GPhon", "KPhon" and "SPhon"
(intermediates) is given. In this example a regular expression and a negation
are both replaced by non-terminals:
GPhon <:: 'ch' .
SPhon, e_or_i <:: 'c',
e_or_i .
KPhon, Not_e_h_i <:: 'c',
Not_e_h_i .
e_or_i ::= 'e' | 'i' .
Not_e_h_i ::= `( 'e' | 'h' | 'i' ) .
The switches file contains:
UNIVERSE = [a-z] .
Intermediates like "GPhon",
"SPhon" and "KPhon" will be emitted to the output,
resulting in a phonetic transcription of the words.
In run time, if a context sensitive rule comes to
an end (reduces), the input will be replaced by the pattern described by the
left-hand side of the rule; the run time system resumes the parse at the first
symbol of that emitted output.
When context sensitive grammars are used for
recognition and/or parsing, the context sensitive rules transform the input in
such a way that other rules (either context free or context sensitive) can
operate on the result. We will discuss an example (section 4.2.1 Example of a context
sensitive grammar used for recognition), and
make some remarks about the special role of non-terminals (section 4.2.2 (Cover)
non-terminals in context sensitive grammars) and
intermediates (section 4.2.3 Intermediates in
context sensitive grammars) in this type of grammar.
As an example, consider
the following grammar that describes the plural of some English nouns:
/* Grammar for the plural of English nouns */
Input ::= ( PluralNoun, CR )+ .
PluralNoun ::= SingularNoun, 's' .
SingularNoun ::= 'book' | 'house' | 'spy' .
'spys', CR <:: 'spies', CR .
If the input is a line containing
"spies", the third rule will match up to the "sp" in
"spy". Since an "i" follows instead of a "y", the
only rule that still matches is the context sensitive one. Upon recognition of
this rule, the parse is (as it were) resumed at the input position where the
context sensitive rule started, but now the input has been transformed
according to the matching rules. Thus, starting again after "sp", the
third rule will now match the "y", leading eventually to accepting
the input. Note the over recognition in this grammar; the incorrect English
word "spys" is also accepted as input.
The next grammar with lexicon symbols recognizes
the same input. Lexical symbols can be viewed as special non-terminals defined
as one or more times a dontcare symbol ("(*)+"). So, the word
"spies" will be transformed by the context sensitive rule into
"spys" and, by the second rule, be recognized as a $Noun, followed by
a "s".
Input ::= ( PluralNoun, CR )+ .
PluralNoun ::= $Noun(), 's' .
'ys', CR <:: 'ies',
CR .
Lexicon:
$Noun('spy')
Remark:
the use of context sensitive rules in grammars used for recognition slows down
the run time system.
Non-terminal
symbols are a special case. When used on the left-hand side of a rule, the
result of the transformation will include the symbol as such, not the input that was
matched by it. There are cases in which this is exactly what you want; if the
rules that will eventually recognize the result of the transformation mention
the same non-terminal, they will match at once, without the need of matching
the same input once again (thus providing efficiency).
Example:
Sentence ::= Prep,
Subj, Verb, Obj .
Prep ::= 'in the garden ' .
Subj ::= 'the man ' .
Verb ::= 'sees ' .
Obj ::= 'a bird ' .
Prep, Subj, Verb, Obj <:: Subj, Verb,
Obj, Prep .
In this example, the input "in the garden
the man sees a bird" as well as "the man sees a bird in the
garden" will be recognized as the starting symbol "Sentence".
There are, however, also situations in which you
want the non-terminal on the left-hand side to stand for the
input it
has matched. Such situations include:
‑ the rules operating on the result of the transformation
do not mention the non-terminal, but they do describe the same production.
‑ the grammar is
compiled with the compiler switch FINITE=TRUE. In this case, the compiler
replaces all non-terminals with their productions. The generated code will
therefore only know terminal symbols, and no non-terminals.
Remark: A fatal
error occurs in the case of FINITE=TRUE (and
TRANSDUCE=FALSE) if a context sensitive rule
contains non-terminals without a caret at the left-hand side.
Using the cover sign (the caret, "^"),
you can mark a non-terminal on the left-hand side of a context sensitive rule
as being a cover non-terminal. Upon recognition of the rule, the run time
system will replace cover non-terminals with the terminal symbols they have
covered on the right-hand side. Note that it is an error to use a cover sign
after a non-terminal that does not occur exactly once on the right-hand side of
the rule. On the other hand, it is allowed to use the same cover non-terminal
more than once on the left-hand side.
In run time, cover non-terminals behave as if
they were substituted by the sequence of terminals covered by the same symbol
on the right-hand side. The following grammar provides an example. It describes
perfect forms of Latin verbs of the type LAVDO. Assume a lexicon containing an entry:
$LAVDO(‘navig').
The aim of the grammar is to accept the Latin
form variant "navigasti" as the same form as "navigavisti":
/* Grammar for perfect forms of Latin verbs of type
"LAVDO" */
/*1*/ Perfectum ::= $LAVDO(), 'av',
PerfectumEnding .
/*2*/ PerfectumEnding ::= 'i' |
'isti' | 'it’ | 'imvs' | 'istis' | 'ervnt' .
/*
Context sensitive rules */
/*3*/ 'avi', S_Ending^ <:: 'a',
S_Ending.
/*4*/ 'ave',
R_Ending^ <:: 'a', R_Ending .
/*
Definitions */
/*5*/ S_Ending ::= 'sti' | 'stis' .
/*6*/ R_Ending ::= 'rvnt' .
In the case of the input "navigasti",
symbol "S_Ending" will contain the string
"sti". Application of context sensitive rule 3 will transform the
input to "avi" plus the contents of "S_Ending", thus to "avisti". Now the parsing process will
restart in the situation that existed when rule 3 started. Whereas rule 1
originally failed (because the input did not contain "av"), it now
succeeds.
If the cover sign were left out in rule 3, the
parse would fail, since rule 1 does not match a symbol "S_Ending".
In order to make the grammar work without the
cover signs in rule 3 and 4, we would have to change rule 2 into:
/*2'*/ PerfectumEnding ::= 'i' |
'isti' | 'it’ | 'imvs' | 'istis' | 'ervnt' |
S_Ending | R_Ending .
The
following grammar[6]
shows that it is also possible to use intermediates in recognition grammars. It
is a modified version of the grammar given above (see section 4.2.2 (Cover)
non-terminals in context sensitive grammars). The
purpose of the intermediates VTRACE, ITRACE and ETRACE is to ensure that output parameter "Word" of rule 1 always
reflects the original input (note that in case of an input
"navigasti" the context sensitive rules transform the input into
"navigavisti").
/*
Modified grammar for perfect forms of Latin verbs of type "LAVDO" */
/*1*/ Perfectum(O:Word) ::=
$LAVDO(Stem), 'a', V_or_VTRACE(Ending1) ,
PerfectumEnding(Ending2)
{Word = Stem||'a'||Ending1||Ending2} .
/*2*/ PerfectumEnding(O:Ending) ::=
'i'
{Ending = 'i'} |
'it'
{Ending = 'it'} |
'imvs'
{Ending = 'imvs'} |
I_ or_ ITRACE(Ending),
'sti'
{Ending = Ending||'sti'},
(
's' {Ending = Ending||'s'} )? |
E_ or_ ETRACE(ending),
'rvnt'
{Ending = Ending||'rvnt'} .
/*
Context sensitive rules */
/*3*/ 'a',
VTRACE, ITRACE, S_ Ending^ <:: 'a',
S_ending.
/*4*/ 'a',
VTRACE, ETRACE, R_ Ending^ <:: 'a',
R_ending .
/*
Definitions */
/*5*/ S_ending ::= 'sti' | 'stis' .
/*6*/ R_ending ::= 'rvnt' .
/*7*/ V_or_VTRACE(O:Ending) ::= 'v' {Ending
= 'v'} |
VTRACE {Ending = ''}.
/*8*/ I_or_ITRACE(O:Ending) ::= 'i' {Ending
= 'i'} |
ITRACE {Ending = ''}.
/*9*/ E_ or_ ETRACE(O:Ending) ::= 'e' {Ending
= 'e'} |
ETRACE {Ending = ''}.
Transduction
grammars differ in many aspects of grammars used for recognition and/or
parsing. Much of what has been said in the preceding chapters about
transduction will be repeated here, together with some new information. A
description of transduction rules is found in section 4.3.1 Description
of transduction rules. In the sections 4.3.2 Non-terminals
in transduction rules and 4.3.3 Intermediates
in transduction rules, one can find examples of the use of non-terminal
symbols and intermediates in this grammar type. In section 4.3.4 Ambiguity and
"looping" problems in transduction grammars some problems in
connection with transduction will be discussed. Recommendations for the
compilation and execution can be found in section section 4.3.5 Compiling and
running a transduction grammar. The subject of section section 4.3.6 Cascaded grammars
is an extended feature, the cascaded grammars, by which several transduction
grammars can be combined.
A
transduction grammar is a grammar by means of which an input stream can be
transformed into a (different) output stream. The transduction is described by
rules, which have the appearance of context sensitive rules.
As was mentioned before, transduction grammars do
not have a first or starting symbol. They consist of at least one transduction
rule and zero or more context free rules; in these rules non-terminals can be
defined which are used in the context sensitive rules and/or other context free
rules. Each transduction rule consists of a left-hand side and a right-hand
side and can be interpreted as "the right-hand side (the input side), when
recognized, will be replaced by the left-hand side (the output side)".
For the rewrite sign between the left-hand side
and the right-hand side in a transduction rule, one can choose between
"<::" and "::=", where the former is recommended. If the
left-hand side consists of only one symbol, the rule must be marked as a context sensitive
rule with "<::".
The left-hand side of a context sensitive rule
can be empty (a pattern in the input matched by the right-hand side will be
deleted); the right-hand side on the other hand must contain at least one
symbol.
Besides
the absence of a starting symbol, in a transduction grammar there is no need to
describe the whole input as is the case in a grammar used for recognition and
parsing. Only parts of the input may be covered by the rules. Input not described
by any transduction rule will be simply emitted to the output.
Remark: If the
input contains control characters not described by any rule, those control
characters will appear again in the transduction output file (see also section 5.2 Appearance of
symbols in output files). Control characters can also be emitted explicitly
by the left-hand side transduction rule. These characters will not be
transformed by any built-in mechanism. For example, the rule:
#x16
<:: '16'.
will transform any occurrence of the
number "16" in the input file into the control character with ASCII
code 16.
It is important to be aware of the fact that
during run time, as soon as a transduction rule reduces, the input covered by
the right-hand side will be replaced by the symbols mentioned in the
left-hand side. The run time system resumes scanning the input with the first
symbol of the newly inserted ones, thus treating the replaced part as new input.
Examples:
/*1*/
'a' <:: 'a',
'a' . : each "aa" in the
input will be replaced by "a".
/*2*/
'b' 'a' <:: 'a',
'b' . : each "ab" will be
replaced by "ba".
/*3*/ <:: 'd', 'd' . : each "dd" will be omitted.
A grammar consisting of only rule 1 will
transform a large number of "a"'s into only one "a". The
first "aa" will be replaced by one "a", which together with
the next "a" forms a new pair, which will again be replaced by one
"a", until finally the last "a" remains untouched. By the
same mechanism, a grammar consisting of only rule 2 rewrites an input
"aaabbb" to the output "bbbaaa" and an input
"cabc" to the output "cbac". Finally, a grammar consisting
of only rule 3 transforms an input "cddcd" into "ccd".
Transduction grammars can only be compiled to a
finite state machine (with instructions
for storage) (if the switch TRANSDUCE is
set to TRUE then the default value for the switch FINITE is also TRUE). This
implies that no lexicon symbols (see section 3.2.2 Lexicon symbols) can be used in transduction grammars and also that no use can be made
of non-terminals with parameters (see section 3.2.1 Non-terminals
with parameters). On the other hand, negations (see section 3.3 Negation) can be used freely.
Warning: Some
transduction grammars can result in very large programs and need a long
compilation time. In developing a large transduction grammar one can reach a
point where adding one transduction rule to the grammar causes the compilation
to break off with the message that more memory is needed than the computer can
offer. There is no simple formula to describe the relationship between the
number of transduction rules in the grammar and the number of states to be
generated by the compiler. When compilation is no longer possible or even when
this process needs too much time, one could consider the possibility of
splitting the grammar into parts and building a cascaded grammar (see section 4.3.6 Cascaded grammars).
Non-terminals
used in context sensitive rules in transduction grammars are by default cover
non-terminals. There is therefore no need to mark the non-terminals with a
cover sign.
All non-terminals that are defined in the grammar
can be used on the right-hand side of a transduction rule, but on the left-hand
side only non-terminals that also occur on the right-hand side of the same rule
can be used. Moreover each appearance of a non-terminal on the left-hand side
must be unambiguously interpretable in terms of the non-terminals on the
right-hand side. If a non-terminal on the left-hand side does not reoccur on
the right-hand side or reoccurs more than once on that side, an error message
will be generated by the compiler.
Thus the rule:
Digit <:: Digit, Digit .
Digit ::= [0-9] .
is not a good example, because it is not clear whether
the digit on the left-hand side is the same as the first one or the second one
on the right-hand side.
A solution for problems like this is often to
double the definition for the non-terminals; we can rewrite this grammar as
follows:
Digit <:: Digit, DigitScnd .
Digit ::= [0-9] .
DigitScnd ::= Digit
.
Note that this problem does not exist in the case
of terminals and intermediates (see below).
An intermediate (see section 3.2.4 Intermediate symbols) can only be introduced on the left-hand side of a transduction rule.
As soon as it is introduced, the same intermediate can play a role on the
right-hand side of other rules (either context sensitive or context free).
Example:
/*
Part of a grammar by which numbers are translated in Dutch words. */
Digit1,
en, Digit2, tig <:: Digit2,
Digit1 .
Digit1 ::= [1-9] .
Digit2 ::= [2-9] .
twin,
tig <:: '2',
tig .
twee,
en <:: '2',
en .
der, tig <:: '3',
tig .
drie, en <:: '3',
en .
In the first rule, the intermediates
"en" and "tig" are introduced. They are also used in the
four last rules. Because there is no physical ordering of the rules in a
transduction grammar, it is not necessary to write the last four rules after
the first one; the grammar has the same effect if these rules precede the first
one. There is, however, an implicit ordering of the rules,
determined by the order of applicability of the rules. At the starting moment
of the treatment of "32", only the first rule is applicable, and the
input is replaced by "2en3tig". At this point the right-hand side of
two other rules cover the newly formed input string one after the other and the
final output will be "tweeendertig". This string consists of four
intermediates: "twee", "en", "der" and
"tig".
Remark 1: Do
not confuse intermediates with string terminals. In a string terminal like
'this is a string', every character is treated as an entity. An intermediate is
always treated as a whole. So, in the example above, the rule:
twin, tig <:: '2',
tig .
can
not be
replaced by the rule:
twin, 'tig' <:: '2',
'tig' .
because
the string 'tig' (in the last rule) can not refer to the intermediate with the
name "tig" (in the first rule).
Remark 2: If a
transduction grammar contains intermediate symbols and at the same time use is
made of negations, one has to decide whether or not to use the compiler switch SHARED_ UNIVERSE.
Special
attention must be paid to ambiguous transduction. This occurs if two (or more)
context sensitive rules describe (part of) the same input and will reduce at
the same moment in run time. For example:
/*1*/
'a', 'c', 'b' <:: 'a',
'b' .
/*2*/
'd', 'b' <:: 'b' .
If (part of) the input is "ab" then as
soon as the run time system reads the "b", both rules reduce. Rule 1
rewrites this input to "acb" and rule 2 to "adb" and the
grammar creates ambiguous output. By default the switch ONE_ CS_
REDUCE will be set to TRUE for
transduction grammars. The effect of this value of the switch is:
‑ if two or more
context sensitive rules are applicable at the same moment of reduction, the run
time system chooses the one with the longest right-hand side (the number of
terminals that have been covered by the rule) and
‑ if two rules
have the same length on the right-hand side, the one that is mentioned first in
the grammar text will be chosen.
If ambiguous output is really wanted, the
compiler switch
ONE_ CS_ REDUCE=FALSE can
be added to the switches file. In each case, the compiler generates a warning
if ambiguous rewritings are to be expected.
In designing a transduction grammar, care must be
taken not to write transduction rules that will bring the run time system in an
endless loop. For example, the following innocent looking rule will cause such a
looping problem:
The input of only one "a" will be
replaced by two "a"'s, each of which in turn will be replaced by two
"a"'s, etc., until the program crashes due to a shortage of memory.
Sometimes two different rules can have the same
effect:
'a' <:: 'b' .
'b' <:: 'a' .
Again, the run time system will never stop to
alter each "a" into a "b" and replace the "b"
again by an "a".
Problems like this can sometimes be solved by
introducing a more or less superfluous intermediate, and/or by anchoring the
input to an extra "start" and/or "end" symbol. For example,
the "a‑problem" can be solved by means of the next rules:
'a', treated,
'a', treated, End <:: 'a',
End .
End ::= 'a' | ' ' .
The output "atreatedatreated " of the
input "a " can not be rewritten again, because no part of the
right-hand side of the transduction rule covers the newly formed output. By
doing this, we can prevent a looping problem, but at the same time we create a
new problem: this output is not what we would like. Adding the rule:
'a' <:: 'a',
treated .
will bring us in a loop again. However, there is
a solution in the form of cascaded grammars (see section 4.3.6 Cascaded grammars).
If the
grammar starts with a transduction rule, the compiler automatically sets the
switch
TRANSDUCE to the value TRUE. If, however, the grammar is
meant to be a transduction grammar, but starts for some reason with a defining
(context free) rule, one needs a switches file; this file must contain at least
the switch TRANSDUCE=TRUE. By default the switch FINITE will be set to TRUE.
The default value of the switch ONE_ CS_ REDUCE is also TRUE for transduction
grammars, so in the default case no ambiguous rewriting is to be expected.
Adding run time switches to the switches file is
usually not necessary. While running the grammar, the run time system will
write the output to the .TDO file (transduce output file). If the grammar contains no ambiguous rewritings, this
file consists of the rewritten output. If, however, during compilation time a
warning has been generated that ambiguous rewritings could be expected, the .TDO file is not directly
readable and must be transformed by means of the program Drawgraph (see section
9.3 Drawgraph).
It is possible to split
a grammar into two (or more) grammars, and to link these grammars together into
a so-called cascaded grammar. The run time system will treat
the input initially with the first grammar and will use the resulting output as
input for the second grammar, and so on. The linking can be performed by the
program Plink (see section 9.4 Plink). All grammars have to be transduction grammars,
except the last one. A cascaded grammar can consist of a maximum of 25
individually compiled grammars.
The a‑problem in the example on page 46 can be solved in the following way:
Grammar1:
'a',
treated, 'a', treated, End <:: 'a',
End .
End ::= 'a' | cr .
Grammar2:
<:: treated . /* This transduction rule has an empty left-hand side */
After linking to a cascaded grammar, the run time system
will transform the input without the superfluous intermediate
"treated".
Remark: By
linking to the cascaded grammar, each one of the composing grammars knows the
intermediates of other grammars. However, they do not know each other's
non-terminal symbols.
Cascaded grammars can also be helpful in other cases:
1. In
developing a very large transduction grammar (with a long compilation time),
adding new rules is sometimes difficult because one has to wait a long time to
test the effect. In cases like this one can write a small grammar, consisting
of one or more new transduction rules, and link the two grammars together.
Later on, the new grammar can be added to the existing one, and recompiled to
one grammar.
2. Some
transduction grammars are so large that they need more memory during the
compilation than the computer can offer. In this case you also can split the
grammar into two or more smaller parts, compile them independently, and link
them together.
If a grammar does not contain any intermediates but will
be part of a cascaded grammar in which the preceding grammar outputs this kind
of symbol, it must be compiled with the switch INTERM_ INPUT set to TRUE.
Remark: If a
transduction grammar can produce ambiguous rewriting it can only be part of a cascaded
grammar if it is the last one. Otherwise the program Plink will generate an
error message.
The last grammar of a cascaded grammar can be a
transducer or a recognizer/parser. In the latter case, one can transform the
input by means of the transduction grammars followed by a check on the
rewriting. However, such a configuration does not result in a file with the
rewritten input (the file with extension .TDO) but an output as requested
by the last grammar; rejecting/accepting the transformed input in case of a
recognizer, a parse tree in case of a parser and eventual output written by
output instructions.
When the run time system performs the instructions of a compiled
grammar on an input file, it always creates some kind of output. This output
varies from a simple message on the screen like "input correct" to
large new files. The form of the output is partly dependent on the kind of
grammar and the features used in it and partly on the values of the run time
switches. In this chapter some remarks will be made about output related to
parse trees (section 5.3 Parse tree output),
reports (section 5.4 Report
output) and
errors (section 5.5 Error output).
These sections will be preceded by section 5.1 Input
files in
which the input files are discussed and by section 5.2 Appearance of symbols in output files with some general remarks concerning the appearance of different types
of symbols in output files. Details of the output of transduction grammars can
be found in section section 4.3.5 Compiling and running a transduction grammar.
For a description of the
remaining messages on the screen during run time, the reader should refer to section 7.2.2 Screen output and keyboard
input during the execution phase.
Every
file can be analyzed or transformed by means of a suitable compiled Transform
grammar as long as it is a file of characters. There are no constraints on its
content; it can consist of all characters represented by an ASCII code from 0
to 255 (decimal).
It is also possible to use the
run time system interactively (see section 7.2.1 Commands for the run time
system in batch). In
that case the input will be read from the standard input
file,
the keyboard.
Recognizers and parsers can
define different scopes of the input by means of the definition of their
starting symbol:
‑ Some grammars
give a description of a whole file. This is the case, for example, if one wishes
to analyze documents consisting of subdocuments, paragraphs and articles with
titles, headings etc. Typically, the first rule in such a grammar is a rule
like:
File ::= ( FilePart )+ .
‑ The main purpose
of other grammars is to analyze smaller units of a file. For example, with a grammar for the syntactical
analysis of English sentences, one is primarily interested in the parse tree of
individual sentences. A typical first rule could be:
Sentence ::=
... . /* syntactic description of a
sentence */
Of course, one can
widen the scope of such a grammar by first adding another rule (making the
original first rule the second one):
File ::= ( Sentence, cr )+ .
When parse trees have
to be built, adding such a rule slows down the speed of the parsing process
dramatically [7]. A better solution is to use the switch SINGLE_ LINES, by which one can force the run time system to treat every single line
in the input as a separate unit for the parsing process. In such a situation
there are two limitations:
1. The
grammar must not contain the special terminal "cr" or the terminal
"<013>".
2. Every sentence in the
input file must be written on a separate line, terminated by an end of
line symbol; the sentence itself may not contain any other end of line symbol.
For the scope of transducers
we refer to what has been said on page 44.
One can also adjust the run
time system to skip and/or convert some of the characters in the input with the
switches IGNORE_ BLANK, IGNORE_ CR and CASE_ CONVERT.
Depending on the grammar type
and the setting of some relevant switches, the run time system emits characters
originating from the input file to one or more output files; terminal
characters appear in parse trees, in reports, in messages with variables, in
reported output variables of the starting symbol, in transduction output and in
error messages. In three cases, an emitted character is not equal to the
corresponding input character:
‑ If the
conversion switch CASE_ CONVERT is set to one of the values UPPER_ TO_ LOWER and LOWER_ TO_ UPPER, the
run time system emits letter characters only in their converted format. This
affects not only the appearance of terminal characters and strings in parse
trees for example, but also corresponding reports and variables, written to the
report file.
‑ If one of the
switches IGNORE_ CR or
IGNORE_ BLANKS has the value TRUE, end of line symbols or spaces and tabs will not
appear in any output.
‑ If the input file contains control
characters (characters with a
decimal ASCII code smaller than 32 or greater than 126), those characters will
eventually be emitted as control characters only to the report file or the
transduction output file. Control characters will be written to all other
relevant files (general output file, run time error file and the screen or
batch output file[8] as a number between angle brackets, the number being the ASCII code of
the character[9]. The user can influence the overall base of the number format of the
control characters for all those files at the same time. This base is equal to
the value of the run time switch RUN_ BASE
or, if this switch has not been added to the switches file, equal to the
value of the switch PARSE_ BASE during the compilation of the grammar [10]. The default value
for both switches is HEX.
The possible differences
between input and output format applies only to terminal characters and
strings. This also includes words that match lexical entries.
Because non-terminals and
intermediates find their origin in the grammar, if they appear in parse trees their
names will be written exactly as they were defined by the grammar writer.
Parse trees can only result from grammars with a starting symbol. While
treating the input, the run time system is able to build a tree-like structure
in which the leaves carry the input characters and the nodes are labeled with
the names of the encountered non-terminal symbols. The root of the tree always
consists of the starting symbol.
The resulting parse trees will
be written to the general output file. If the run time switch CPU_ LOG has one of the values TO_ FILE
or TO_FILE_SCREEN, each parse tree in this file will be preceded by a
line containing the CPU time used for the parse.
By default, the run time
system only builds (parts of) parse trees if it needs to store information
concerning treated input necessary for the recognition of the remaining part of
the input. This need exists, for example, in cases of (temporary) ambiguity of
the parse. The building of the internal structures that correspond to those
parse trees costs memory. Therefore, as soon as the structure is no longer
needed, it will be destroyed.
However, by adding the switch BUILD_ PARSETREE=TRUE to the switches file, the run time system can be instructed to build
the total parse tree, to store it and to emit the result for inspection. A file
with the extension .OUT (also called the general
output file) will be created for this purpose.
If no precautions are taken
during compilation time, parse trees can be incomplete; unless the compiler
switch PREPARE_ PARSETREE has been set to TRUE non-terminals that are defined by means of unit
rules will not occur in the parse tree.
If the grammar contains
messages, the report and signal numbers will also occur in the emitted trees.
At the same time, a separate report file is created (see section 5.4 Report
output).
The number of parse trees that
will be written to the file depends on the outcome of the parsing process:
‑ There will be no
resulting parse tree if an error is found in the input during the parse. In
this case, the run time system will have created an error file in which the
exact position of the error can be found and the surrounding input.
‑ If the input can
be described in more than one way in terms of the grammar, the output will
consist of two (or more) alternative parse trees. In the output file these
parse trees are marked with a "#" followed by a number.
‑ In the case of
one possible description, there will be one parse tree.
If an input file consists of
lines that must be parsed separately by the run time system and consequently
the switch SINGLE_ LINES=TRUE has been added to the switches file, the parse tree output of each
line will be appended to the output of the preceding parse. For some
applications it is useful to separate different kinds of parse tree output. If
the run time switch SPLIT_ OUTPUT is set to TRUE, the run time system creates
three output files instead of one: non-ambiguous parse trees will still appear
in the file with extension .OUT, but the ambiguous parse trees
will be written to a file with extension .AMB and a file with extension .NOP (no parse tree) will contain
the original lines in which an error occurred during the parse.
The resulting parse trees have
a very simple lay-out. For example, an input file PLATES.TXT containing:
12‑34‑AB
AB‑12‑34
parsed by the grammar LICENCE.GRM (mentioned in section 1.1 Hands-on and for this purpose compiled with the switch PREPARE_ PARSETREE=TRUE) resulted in the following parse tree (in the file PLATES.OUT):
<
LicencePlates:(LicencePlate:(Digits:(1 2)
Hyphen:(‑)
Digits:(3 4)
Hyphen:(‑)
Letters:(A B)
)
cr:<13>)
LicencePlate:(Letters:(A B)
Hyphen:(‑)
Digits:(1 2)
Hyphen:(‑)
Digits:(3 4)
)
cr:<13>)
)
>
Using the program Drawtree
(see section 9.2 Drawtree), one can transform a file with parse trees into another format.
The
run time system creates a report file (a file with the extension .REP) in two cases: if the grammar
contains messages or if the starting symbol of the grammar has output
parameters. Therefore, output in a report file can consist of:
‑ signals
‑ reports with (a concatenation of) input
characters
‑ signals and reports with expressions
‑ output parameters of the starting symbol.
The variables and parameters in turn are built from literals (mentioned in actions) and/or characters found in the input file. In general, the characters resulting from the input will be emitted unchanged, so a report file can contain non-printable characters. If the switch CASE_ CONVERT has a value other than NONE, letters will appear in the report file in their converted form (which may be different from their appearance in the input file).
If a grammar contains
messages
(reports and/or signals, with or without variables, see sections section 3.1.2 Messages
and section 3.1.2.2 Messages with variables), every line of the resulting report file starts
with a number, followed by a colon (":"). In the case of simple
reports, the colon is followed by the concatenation of all the input characters
subsequently emitted with the same report number. In the case of messages with
variables, the colon will be followed by the string value of the expression
occurring in the grammar after the report number. In the sections mentioned
above and in section 3.1.3 Places of actions in
rules, the
reader will find various examples.
The resulting report file can
be used as input file for other programs. For example, if parts of the input
file are transferred to the report file by means of r‑messages, a simple
program can be used to reconstruct an extract of the original file.
By default, the run time
system will emit the messages online as long as the parse is not ambiguous. In case of
temporary ambiguity, the run time system will stop the emission and stores the
messages in parse trees. As soon as the parse becomes unambiguous again (and
only one parse tree is left), the gathered messages will be written to the report
file, and the superfluous parse tree will be destroyed (unless the switch BUILD_ PARSETREE has the value TRUE). There are two situations in which problems will
arise; in both ambiguity of the grammar plays the crucial role:
‑ The input file
is very large and for a substantial part the parse is temporarily ambiguous. In
this situation the intermediate parse trees can not be broken down and must be
constantly extended. This results in a slowing down of the parse and eventually
to a crash of the run time system because of shortage of memory. In such a case
the user must make his grammar as disambiguous as possible.
‑ The
grammar contains inherent ambiguities, but this fact will be detected by the
run time system after the
emission of the first message output. As a consequence of the ambiguity, two or
more parse trees will be built, but when the parse comes to an end and the
stored message information has to be emitted, the run time system has no
remaining knowledge of the already emitted messages. The messages belonging to
all but the first parse tree will be written without the leading ones. In this
case, the run time system will generate a warning. The resulting report file
will be useless for most subsequent applications.
At this moment the user can only try
to overcome this problem by altering his grammar. In the near future however,
the run time system will be adapted in order to cope with situations like this
in a correct manner.
If the starting symbol of the grammar has one or more
output parameters (see section 3.2.1 Non-terminals with parameters), the content of these resulting parameters will
only be written to the report file if the switches file contains the switch OUTPUT_ PARAMETERS set to the value TO_FILE or TO_FILE_SCREEN.
Upon recognizing the starting
symbol, the run time system writes the contents of these variables to the
report file. The value of the first parameter is prefixed by "1:",
the second by "2:" and so on. If the parse was ambiguous, the values
are repeated for every possible analysis. The last parameter of the last
analysis is followed by a line containing "0:". This "0:"
line can be used as a delimiter in the case of SINGLE_LINES=TRUE.
This feature is applied in a grammar for the morphological analysis of
Latin words (a small part of this grammar was presented as an example in
section 4.2.3 Intermediates in
context sensitive grammars).
The
run time system will only create an error file (a file with the extension .ERR) if errors are found in the
input file. Therefore, such a file can never be the result of a transducer, but
only of a recognizer/parser.
If no FIND_ON_ERROR switch (see section 3.4 Error recovery) has
been added to the switches file while running the grammar, the run time system
will halt as soon as an error has been detected in the input. In the run time
error file the user can read the exact position of the error in the input file
(line number and position in the line), together with the context and a list of
the expected terminals at the moment the error was found.
If the switches file contains FIND_ON_ERROR, the run time system will list all detected errors in the manner
described above, supplemented by the string of characters that has been skipped
and the character where the parse was resumed.
Both the compiler and the run
time system can be adapted to the wishes of the user; some options can be
specified to modify the behavior of the programs to some extent. These options
(or switches) are specified in a so-called switches file for the compiler and the run
time system. The switches file has the extension .SW. Each switch has a default
value; if a switch value is not specified by the user, this default value is
assumed. The default values of the switches will be specified below; they can
also be found in the USER.SW file.
The names of the switches (and
of their values) are not case sensitive: uppercase and lowercase characters may
be used. Most of the switches are Boolean: they are either TRUE or FALSE. Such Boolean switches may be specified as TRUE by merely
stating the name of the switch and as FALSE by NO plus the name of the switch,
for example:
FINITE=TRUE or FINITE
FINITE=FALSE or NOFINITE
Other switches require an
integer
value (default value 0, other possible values depending on the switch) or a
string
value. For example:
TRACE_ ERROR = 1
LEXICON = MY_LEXICON
CASE_CONVERT = UPPER_TO_LOWER
There are also switches that
resemble grammar rules. The value of those switches consists of an enumeration
of terminal symbols. For example:
FIND_ON_ERROR = '.' | CR .
UNIVERSE = [a-z] | [A-Z] | ….
It is possible to add blank
lines and/or comments to a switches file.
In section 6.1 Compiler Directives the switches for the compiler are described. Section 6.2 Runtime Directives covers the switches for the run time system. In both sections the
switches are listed with their domain of values (if no value domain is given
the switch is a Boolean one with the values TRUE and FALSE), the default value and an explanation of their meaning and operation.
Not all switches are relevant
for ordinary users; those which are, are listed here. Other switches are only
relevant for the programmers of the Transform system; these switches can be
found in Appendix A.
For a quick overview of the
related switches while preparing a grammar for compilation or execution, one
can refer to the checklist presented in section 10.3 Checklist for switches.
The switches are grouped together according to the different functions
they serve.
related to: grammar types
default value: FALSE if TRANSDUCE=FALSE
TRUE if TRANSDUCE=TRUE
callout: "Generate
a finite state automaton"
explanation:
If FINITE=TRUE the grammar will be compiled into a finite
state automaton (see section 1.4.1 Finite state automata). If the run time system will need a stack for running the compiled
grammar the switch must be set to FALSE.
Transduction grammars can only be compiled to finite
state automata (with instructions for operations on a data structure); the
default for FINITE is therefore TRUE if the switch TRANSDUCE has been added to
the switches file. For other types of grammars the default is FALSE.
The value FALSE is the correct value in case of
recognizing grammars containing:
‑ nested
recursion (see also the switch RECURSION_ MESSAGE in section Switches for grammar debugging)
‑ non-terminals with parameters
‑ lexicon symbols
‑ actions with variables (assignments, tests and
reports with variables).
However, one must add FINITE=TRUE for recognizing
grammars containing negations.
FINITE can be FALSE when negations only appear in unit rules.
The interaction of the
above restrictions may give rise to conflicting values of FINITE. Therefore, in
grammars with variables and in recursive grammars, negations may only occur in
unit rules without actions.
related to: grammar types
default value: FALSE if TRANSDUCE=FALSE
TRUE
if TRANSDUCE=TRUE
callout: "Solve
conflicts between CS rules"
explanation:
This switch only has effect if there are two or more
context sensitive rules in the grammar. Moreover, the right-hand side of one of
those rules must cover input that is also totally covered by the right-hand
side of another rule or is also covered by an ending part of the right-hand
side of another rule. If ONE_ CS_REDUCE is FALSE, the existence of such
(partly) overlapping rules will result in ambiguous parse trees or ambiguous
rewritings. If, however, ONE_CS_REDUCE is set to TRUE, no ambiguity will occur,
because some precautions will be taken:
‑ When there are
two or more active transduction rules which can reduce at the same time, only
the rule with the longest right-hand side (in terms of the number of covered
terminals) will be reduced.
‑ When there are
two or more reducable rules with the same length, the rule which is first
mentioned in the grammar is chosen.
related to: parse trees
default value: FALSE if TRANSDUCE=FALSE
TRUE
if TRANSDUCE=TRUE
callout: "Solve
ambiguity for negation/coordination"
explanation:
When the grammar contains
negated regular expressions and/or negated non-terminal symbols, different
parse trees will be built during run time, even if the grammar is not
inherently ambiguous. In the case of transduction grammars ambiguous
transduction output will be generated. In order to avoid this, ONE_ALTERNATIVE
can be set to TRUE. The user has no influence as to which of the alternative
rules will be selected for the parse tree.
related to: grammar
debugging
default value: FALSE if FINITE=FALSE
TRUE
if FINITE=TRUE
callout: "Suppress
recursion warning"
explanation:
When FINITE=TRUE, an error message is generated when the
grammar contains nested recursion. When FINITE=FALSE and RECURSION_MESSAGE is
TRUE, this message is also generated, but as a warning message.
related to: symbol types
default value: FALSE .
callout: "Chars
and intermediates share one universe"
explanation:
The use of the switch SHARED_UNIVERSE is reserved for
grammars with negations and a mixture of terminal and intermediate symbols.
When SHARED_UNIVERSE is FALSE, terminal symbols and
intermediate symbols have separate universes for negation. The complement of a
terminal symbol is a set of terminal symbols; the complement of an intermediate
symbol is a set of intermediate symbols. When SHARED_UNIVERSE is TRUE, terminal
and intermediate symbols have a shared universe, the union of the two
universes.
The universe of terminal symbols is defined by default as
the character set [<000>-<255>]; it can be defined otherwise by the
switch UNIVERSE. The universe of intermediate symbols is always the set of all
intermediates that are found in the grammar.
related to: symbol types
value domain: enumeration of terminals
default value: #x00-#xFF .
callout: "Numeric
base for character constants"
explanation:
If the grammar contains negations, the negation of a
terminal will be the complement of that terminal with respect to the universe.
For example, if the universe is defined as "[a-z]", "`a"
(not "a") covers all terminals inside the range [b-z]; if, however,
the universe is defined as "[a-z] | [A-Z]", "`a" covers all
terminals inside the range "[b-z]" and inside the range "[A-Z]" .
related to: grammar types
default value: FALSE if the first rule in the grammar contains the rewrite sign
"::="
TRUE if the first rule in the grammar
contains the rewrite sign "<".
callout: "Transduction
grammar"
explanation:
When TRANSDUCE=TRUE, the grammar is compiled as a transduction grammar, otherwise as a
recognizing and/or interpreting grammar with a starting symbol. The user has to
add this switch to the switches file only in the case of a transduction
grammar, that, for some reason, starts with a defining (context free) rule.
default value: FALSE .
callout: "Never
optimize away nonterminals"
explanation:
The switch PREPARE_PARSETREE concerns the generation of
program code for non-terminals defined with unit rules (reduction in one step).
When PREPARE_PARSETREE is FALSE, all these non-terminals
were removed from the grammar. As a result, the non-terminals will not appear
in the parse trees. One only need use this switch, if one also sets the run
time switch BUILD_PARSETREE to TRUE.
callout: "Grammar
in cascade with intermediate symbols"
explanation:
If a grammar which does not contain intermediates itself
will be part of a cascaded grammar in which the preceding grammar will output
this type of symbols, the switch INTERM_INPUT should be set to true.
related to: grammar
debugging
default value: FALSE .
callout: "Write
local vars to error file"
explanation:
When there are local variables in the grammar, they are
listed in the compiler error file. The user can use this switch to check typing
errors in variable names (if the user intends to use a parameter, but mistypes
its name, Transform assumes a local variable).
related to: symbol types
value domain: DEC,
OCT, HEX .
default value: HEX .
callout: "Character
universe for negation"
explanation:
For older versions of Transform: if the grammar contains
terminals in ASCII code representation, the numbers will be interpreted
according to the decimal, octal or hexadecimal base, depending on the value of
this switch. In the current implementation of Transform ASCII-values always
have a hexadecimal notation. Currently, it is not necessary to adjust this
switch.
The switches that are of interest to programmers can be found in Appendix
A. The switches of interest to ordinary users are listed below.
related to: input and output files
default value: FALSE if the program is called using the
PARSEER command
TRUE
if the program is called using the IPARSEER command
(see section 7.2.1 Commands for the run time
system in batch for
these commands)
Callout:
explanation:
The input is echoed in the output file (the screen or the
log file of a batch job). Note that echoing slows down the run time.
related to: input characters
default value: FALSE .
callout: "Ignore
line breaks"
explanation:
When IGNORE_CR is FALSE, an end of line character is
considered as input (to be matched with the symbol "cr" in the
grammar) an so does a form feed. When IGNORE_CR is TRUE, end of line characters
and form feed characters are ignored. When the switch SINGLE_LINES is set to
TRUE, IGNORE_CR is automatically set to FALSE. The same effect can be reached
by adding the switch:
SKIP_CHARS = <012> | <013> .
related to: input and output files
default value: FALSE .
callout: "Output
according to ambiguity and success"
explanation:
When SPLIT_OUTPUT is TRUE (and BUILD_PARSETREE is TRUE),
the unambiguous parse trees are written to the .OUT file; the ambiguous parse
trees to the .AMB file and the input in which an error occurs during the parse
is echoed to the .NOP file (no proceeding).
related to: output of the results of the parse
default value: FALSE .
callout: "Output
a parsetree"
explanation:
When BUILD_PARSETREE is TRUE, a parse tree will be
constructed for each successful parse and written to the general output file (a
file with the extension .OUT). The parse tree is only complete if the grammar
was compiled with default value TRUE of the switch PREPARE_PARSETREE. If the
grammar was compiled with this switch set to FALSE, non-terminals defined by
unit rules (see section 1.5.3 Unit rules and empty rules) will not occur in the parse tree.
related to: input characters
value domain: enumeration of terminals
default value: the empty set
callout: "Characters
that should be skipped"
explanation:
The enumerated terminals will be ignored by the run time
system.
related to: error recovery
value domain: enumeration of terminals
default value: the empty set
callout: "Characters
to be skipped during error recovery"
explanation:
This switch can only be used in combination with the
switch FIND_ON_ERROR. It lists the terminals that can introduce some unwanted
domain for the terminals mentioned in FIND_ON_ERROR.
related to: output of the results of the parse
value domain: NO_OUTPUT, TO_FILE, TO_SCREEN,
TO_FILE_SCREEN .
default value: NO_OUTPUT .
callout: "Output
parameters of start symbol"
explanation:
With the switch OUTPUT_PARAMETERS, one can regulate
whether or not the values of the output parameters of the starting symbol of
the grammar will be written to the report file and/or the screen. When the
value is TO_FILE, the values are listed in the report file; when the value is
TO_SCREEN, they will appear on the screen; when the value is TO_FILE_SCREEN,
the output will be directed to the report file as well as to the screen and
with the default value no output will be emitted at all.
The value of the first parameter is preceded by
"1:", the second by "2:" and so on. If the parse was
ambiguous, the values are repeated for every possible analysis. The last
parameter of the last analysis is followed by a line containing "0:".
This "0:" line can be used as a delimiter in case of
SINGLE_LINES=TRUE.
value domain: a string, the (path) name of a lexicon,
either between simple or double quotes or no quotes at all.
default value: the empty string
callout: "Pathname
of the compiled lexicon"
explanation:
If the grammar contains lexical symbols, the name of the
lexicon must be supplied to the run time system. Specify the pathname of the
lexicon (without any extension), for example: LEXICON=D:MY_LINGWARE\MY_LEXICON.
value domain: enumeration of terminals
default value: [a-z] | [A-Z] | [0-9] | '''' | '_'
callout: "Characters
for lexicon tokenization"
explanation:
A specification of the characters, that define a word in
the lexicon. This switch has only effect if the switch LEX_INCREMENT has the
value FALSE.
related to: input and output files
default value: FALSE if the program is called using the
PARSEER command
TRUE if the program is called using the IPARSEER
command
callout: "Treat
end-of-line as end-of-file"
explanation:
When SINGLE_LINES is TRUE, each line in the input is
treated as if it were a separate file. Parsing is restarted at every line. All
output (such as messages, output variables of the starting symbol, parse trees)
is appended to the output of the preceding parse. The combination
SINGLE_LINES=TRUE and IGNORE_CR=TRUE is not possible. In that case, the run
time system will change the latter value into FALSE.
related to: input characters
default value: FALSE .
callout: "Ignore
spaces and tabs"
When IGNORE_BLANKS is FALSE, spaces and tabs are treated
as input symbols, which must be handled by the grammar. When IGNORE_BLANKS is
TRUE, spaces and tabs are ignored. The same effect can be reached by adding the
switch:
SKIP_CHARS = <008> | <032> .
related to: debug tools
default value: FALSE if the grammar is not a transduction
grammar
TRUE
if a transduction grammar contains no ambiguous rewriting and FALSE if
it does.
callout: "flatten
unambiguous transduction result"
explanation:
This switch is only
significant for transduction grammars. Normally, this switch is transferred
from the compiler to the run time system. If FLAT_OUTPUT=FALSE has been added
to the switches file of a non-ambiguous transduction grammar, the .TDO file is
an intermediate file and has to be further processed by Drawgraph (see section 9.3 Drawgraph). The value TRUE in case of ambiguous transduction output makes no
sense. If the program Plink detects a grammar with possible ambiguous rewriting
in another position than the last one, a (fatal) error will be generated.
related to: input characters
value domain: NONE, UPPER_TO_LOWER, LOWER_TO_UPPER .
default value: NONE .
callout: "Conversion
from/to lowercase"
explanation:
When CASE_CONVERT=UPPER_TO_LOWER is added to the switches
file, all capitals in the input are treated in their lowercase form. This is
especially recommended if use is made of a lexicon in which the entries are
usually written in lowercase letters. The value LOWER_TO_UPPER has the opposite
result.
related to: error recovery
value domain: enumeration of
terminals
default value: the empty set
callout: "Find characters on error"
explanation:
If the run time system
detects an error in the input, it will skip forward to one of the mentioned
terminals and resumes parsing starting from that point.
related to: output of the results of the
parse
value domain: DEC, OCT, HEX
default value: HEX .
callout: "Numeric base for character constants"
explanation:
Printable terminals
appear in the parse trees (and the error and trace file) in their character
representation. The switch RUN_BASE regulates the appearance of the control
characters in these files. By default, these characters will be written as a
decimal number, enclosed by angle brackets. The number represents the
underlying ASCII code. If the switch has the value OCT, the ASCII code is
represented by an octal number, and in the case of the value HEX, by a
hexadecimal number. For example:
character decimal octal hexadecimal
carriage return <013> <015> <00F>
blank <032> <040> <020>
Remark: This switch does not affect
the output in the report file or the file with transduction output.
default value: TRUE .
callout: "No lexical tokenization"
explanation:
When the switch LEXICON has been supplied and
LEX_INCREMENT is FALSE, the lexicon entries must match words in the input, i.e.
strings of alphanumeric characters enclosed by non-alphanumeric characters (see
section section 3.2.2 Lexicon symbols). There may also be entries
for punctuation marks; these entries match one non-alphanumeric character in
the input. When LEXICON has a non-empty
value and LEX_INCREMENT is TRUE, the lexicon entries may match words or
prefixes of words in the input.
related to: output of the results of the parse
value domain: NO_OUTPUT, TO_FILE, TO_SCREEN,
TO_FILE_SCREEN
default value: NO_OUTPUT if
SINGLE_LINES=TRUE
else
TO_SCREEN
explanation:
The switch CPU_LOG deals with the CPU time used for each
successful parse. The meaning of the values is the same as in the case of the
switch OUTPUT_PARAMETERS, except that in the case of the values TO_FILE and
TO_FILE_SCREEN the target file is not the report file but the general output
file (a file with the extension .OUT). If parse trees are also written to the
general output file, the CPU time is reported after the parse tree(s).
The setup and the GUI of the
Transform system is already dealt with by the Transform User Manual. In this chapter we explain the operation of the Transform system. The
system can operate in batch as well as in the GUI. The description is taken
from the point of view of batch operations.
A newly written grammar must
be compiled before it can be executed and it must be recompiled every time
changes are made in the original grammar. Actually, the parser generation is
achieved in two steps. In the first step, the grammar
is parsed with Bootstrap, the meta grammar for grammars, and an intermediate
(report) file is created with the extension .IM.
The metagrammer describes
itself and is preprocessed by Transform as a normal grammar (hence the name
Bootstrap). Therefore, parsing of a grammar is essentially the same process as
parsing an input file.
The intermediate file is a
coded representation of the grammar which serves as input for the program
generation process. In the second step, the program code is generated on the basis
of the intermediate file. The first process is only executed when necessary; if
there is no intermediate file for the grammar or if there is an intermediate
file which is older than the present grammar file [11].
The command name for the
compiler is GENPARS (for generate parser):
GENPARS
<GRAMMAR>
When the compiler switches are
defined in a switches file named <GRAMMAR>.SW, this file will
automatically be taken into account.
When the compiler switches are
defined in a switches file with another name, the command is:
GENPARS
<GRAMMAR> <SWITCHES>
Examples:
To compile a grammar named MY_GRAMMAR.GRM with switches values
contained in the file MY_GRAMMAR.SW, the command is:
GENPARS MY_GRAMMAR
If the switches values are in
a file other than MY_GRAMMAR.SW (for example in a file named MY_SWITCHES.SW), the command is:
GENPARS MY_GRAMMAR MY_SWITCHES.SW
When the first parse begins,
the screen shows the message:
Syntax checking of the grammar ...
If the parsing of the grammar
fails at some point, a message appears together with the part of the grammar
which does not conform to Bootstrap. These errors will also be listed in the
compiler error file. If no errors are found the message:
Input correct.
appears on the screen.
When the actual compilation
starts, the revision number of the compiler appears on the screen, together
with a listing of the actual compiler switches (values may have been set or may
be the default values) and the message:
Compilation has started, please wait ....
If an error occurs, an error
message will be written to the compiler error file.
Sometimes warnings appear on the screen. They deal
with peculiarities of the grammar of which the compiler can not decide if they
are typing errors. For example, context free rules with a
non-terminal on the left-hand side that never occurs in one of the other rules
are shown. Also, all intermediates that have been found in the grammar will be
written to the screen; sometimes mistyped non-terminals appear among them. If
the user is sure he made no mistakes, he can ignore these warnings.
When the compiler is ready,
some statistics about the compilation process will be written to the output
(compilation time and the amount of program code).
The input files for the compiler [12] are:
|
grammar file |
<GRAMMAR>.GRM |
|
switches file |
<GRAMMAR>.SW or <FILENAME> |
The output files of the compiler inherit their
name from the name of the grammar with a specific extension:
|
descriptive name |
extension |
function |
|
intermediate file |
.IM |
result of the first parse of the grammar |
|
program code file |
.PG |
contains the program instructions |
|
label index file |
.LB |
index for the former file |
|
symbol table file |
.ST |
contains the names of the symbols and literals used in the grammar which are necessary for the output of the run time system |
|
symbolic file |
.SM |
contains symbolic output, to be used in the future [13] |
|
parsout file |
.POU |
contains information about the value of the switches, some statistics concerning the compilation and the output resulting
from PRINT_... switches (only
useful for programmers) |
|
compiler error file |
.ERR |
contains error messages and warnings |
|
|
|
|
|
trace file |
.TRA |
contains tracing information, only useful for
programmers |
|
statistics file |
.STA |
contains statistical information, only useful for
programmers |
All listed files are printable
with exception of the program code and the label index files which are binary
files.
During the first parse of the
grammar, the rules are checked for errors against the syntax. The exact
position of all the detected errors will be written to the output and to the
compiler error file. The actual compilation will not start after discovery of errors.
Remark 1: As
explained in section 3.4 Error recovery, upon
finding an error some input will be skipped and the parsing will be resumed at
the first period. Therefore, no error checking takes place in the skipped part.
It is possible that new errors will emerge after correction of the grammar.
Remark 2: In
some rare cases, the error is so severe that there is no possibility to recover
the error during the first parse. A long list of rather puzzling error messages
can result. A good strategy is then to concentrate upon the first mentioned
error, correct the grammar at this position and try again.
During the actual compilation,
errors may be discovered again. Some of them will stop the compilation process,
others will only result in a warning. Errors and warnings will be written to
the compiler error file, some warnings also to the output (screen or batch log
file). If an compiler error file has been created, this will be reported to the
screen or log file.
All errors and warnings have a
three digit code. Apart from this code the meaning of the error and, in some
cases, advice on how to modify the grammar will be given.
There are six types of
possible errors and warnings. The error numbers of each type start with a
specific digit:
1. The error numbers starting with a "1" concern
errors due to conflicting switches, or switches that do not match
the contents of the grammar. In this case, the user must alter the switches
file.
2. If a limit in one of the specified numbers in the compiler is
reached an error will also be given, the number of which starts with a
"2", for example, if the number of rules exceeds the prespecified
maximum. The user can not handle this problem alone, but must contact the
programmer of Transform. The programmer has to alter the maximum, and recompile
the modules of Transform.
3. Some features are not yet fully implemented in Transform; in rare
circumstances the continuation of the compilation would produce erroneous code.
In these cases an error number (or a warning) starting with a "3"
will be given. In some cases advice for the modification of the grammar will be
given; if the user has difficulties he can ask the programmer(s) for help.
4. Some errors against the syntax are not checked during the first
parse, but will be discovered during the compilation process, for example, a
regular expression used on the left-hand side of a transduction rule. If an
error of this type (starting with a "4"), is discovered, the user has
to modify his grammar.
5. Errors detected while the compiler is reading the
intermediate file with extension .IM start with a "5".
6. Some compiler switches can be used for debugging grammars (see section 8.2 Debugging Grammars). The result of a check, induced by these
switches, will appear as a warning; the number starts with a "6".
7. In order to prevent a program trap during the compilation,
some unexpected circumstances (out of scope of the user) cause an error (the
error number starts with a "9") and the compilation stops. In this
case one must contact the programmer; please save a copy of the grammar (and
the switches file) that cause the error.
A complete list of all
possible error messages during compilation time is listed in Appendix C, with
the exception of the errors in the last mentioned category.
The run time system executes a
grammar which is compiled by Transform over an input file specified by the
user.
The command name for the run time system is PARSEER:
For interactive use, the
command is:
IPARSEER
<GRAMMAR>
In this case, the input will
be read from the keyboard.
Again, when the switches are
defined in a switches file named <GRAMMAR>.SW, this file is automatically
taken into consideration.
When the switches are defined
in a switches file with another name, the commands are:
PARSEER
<GRAMMAR> <INPUT> <SWITCHES>
IPARSEER
<GRAMMAR> <SWITCHES>
Examples:
To parse a file named MY_FILE.TXT with a grammar named MY_GRAMMAR.GRM and run time switches values
contained in the file MY_GRAMMAR.SW, the command is:
PARSEER MY_GRAMMAR MY_FILE.TXT
If the values of the run time
switches are in a file other than MY_GRAMMAR.SW (say in a file with the name MY_RUNSWITCHES.SW), the command is:
PARSEER MY_GRAMMAR MY_FILE.TXT MY_RUNSWITCHES.SW
When the run time system starts with parsing an input, the revision
number of the run time system and the actual run time switches (a combination
of the switches set by the user and the default switches) are listed. If the
parse fails at some point, a message appears together with an indication of the
part of the input which does not match the grammar.
If the run time switch ECHO is
set to TRUE, the run time system echoes every character (or word) that has been
treated to the output.
The PARSEER command assumes that the
input comes from a file while the IPARSEER command assumes input from
the keyboard. [14]
When
the run time system is called by the command IPARSEER, the user can input to the
run time system by typing his text followed by a ctrl‑z. The parsing starts at the moment of
the ctrl‑z, so one can edit the text beforehand. If the switch
SINGLE_LINES is set to TRUE as well, after parsing of a sentence the user will
be asked whether to continue or to quit.
Remark: In case
of the IPARSEER command: if the parse produces output, it will be sent to the file
<GRAMMAR>_IN.<EXTENSION>. For instance, reports
from MYGRAM.GRM go to MYGRAM_IN.REP.
If the grammar is a cascaded
grammar, the output of all grammars will be echoed to the standard output file.
The user must provide for two
input files for the run time system:
|
input file |
<INPUTFILE> |
|
switches file |
<GRAMMAR>.SW or <FILENAME> |
However, if the run time
system is used interactively (IPARSEER), the input file will automatically be replaced by the keyboard.
Furthermore, the run time
system needs the three main output files of the compiler:
|
program code file |
.PG |
contains the program instructions |
|
label index file |
.LB |
index for the former file |
|
symbol table file |
.ST |
contains the names of the symbols and literals used in the grammar which are necessary for the output of the run time system |
In the case of calling the run
time system by the PARSEER command, all output files inherit the name of the input file with a specific
extension. With the IPARSEER command, where the input comes from the keyboard, the names of the
output files consist of the name of the grammar followed by the string "IN" and the specific
extension. For instance, a report file created while running a grammar called MY_GRAMMAR with an input file called MY_TEXT is named:
MY_TEXT.REP with PARSEER
MY_GRAMMAR_IN.REP with IPARSEER
The creation of the different
output files depends in some cases on the content of the grammar and/or the
settings of the switches:
|
descriptive name |
extension |
function |
|
general output file |
.OUT |
contains
information about the parsing dependent on
the switch values of BUILD_PARSETREE and CPU_LOG. |
|
ambiguous output file |
.AMB |
contains
the ambiguous parse trees (only created if SPLIT_OUTPUT=TRUE) |
|
no proceeding file |
.NOP |
contains
the echoed input in which an error was found
(only created if SPLIT_OUTPUT=TRUE) |
|
report file |
.REP |
contains
messages or the values of the output
variables of the starting symbol, so this file will be only created if -
the grammar contains messages or -
the starting symbol has output parameters and the
switch OUTPUT_PARAMETERS
has the value TO_FILE or TO_FILE_SCREEN |
|
transduce output file |
.TDO |
contains
the output of transduction (only created in the
case of a transduction grammar) |
|
run time error file |
.ERR |
contains
run time errors (only created if an error has been found) |
|
trace file |
.TRA |
contains trace information (used by programmers, only
created if TRACE_LEVEL<>0 and/or TRACE_USECOUNT=TRUE) |
|
statistics file |
.STA |
contains statistics in connection with the parsing
process (used by programmers, only created if RUN_STATISTICS=TRUE and/or one
of the COUNT_... switches is TRUE) |
All output files created by
the run time system are printable; depending on the grammar and the input, the .TDO file and the .REP file can contain control
characters.
In grammars used for
recognition and/or parsing a FIND_ON_ERROR switch can be specified (see section 3.4 Error recovery).
If no such switch is present
and an error is discovered in the input, the run time system will stop the
parsing process and give a message to the run time error file and the output.
The message includes the context of the location of the discovered error; the
exact spot is marked by a repeated arrow ("<<<<<").
The user can modify his
grammar and/or his input file. When the input file consists of different items
and the switch SINGLE_LINES has been set to TRUE, the run time system will
continue with the next item after the discovery of an error.
The situation is different if
FIND_ON_ERROR has been specified. If an error is found the run time system
tries to recover from the error state and skips to the first occurrence of the
terminal or of one of the terminals that are mentioned in the FIND_ON_ERROR
switch. In most cases the run time system will continue parsing. The position
where the error occurred and the position where the parsing is resumed are
included in a message to the output and to the run time error file. The
terminals that were expected at the moment that the error occurred will also be
written to the run time error file.
In transduction grammars, no
errors will ever be detected; when the input file is not correct, the output
will not be as expected. The user must detect this by himself.
Sooner or later every grammar writer reaches a stage not mentioned in section 1.2 From grammar to program: a problem phase. This occurs when after a run of a compiled
grammar the output seems different from what has been expected. Some groups of
problems can be distinguished:
‑ A recognizer finds errors in an input while the input has
to be considered as correct or,
alternatively, incorrect input is judged as "correct".
‑ A parser produces "wrong" parse trees.
‑ A recognizer or parser finds unexpected ambiguities in the
input.
‑ The output of a transducer is different from what was
intended, or the transducer produces ambiguous rewriting.
Although it is difficult, or
even impossible, to give solutions for all the problems that could arise, some
general advice will be given in this chapter. The chapter starts with section 8.1 Current limitations on the known limitations (and some effects of bugs) of the Transform
system; if a problem resembles the ones presented there, one can try to get
round the imperfection due to the system by rewriting or restructuring (parts
of) the grammar, for example, by making use of alternative ways to reach the
intended effect.
The second part (section 8.2 Debugging Grammars) contains some practical advice for "debugging" grammars.
Some of the current limitations will be discussed here. In addition,
some known effects of imperfect implementation ("bugs" in the
programs) will be mentioned.
It is possible to create
grammars that need more memory in the compilation phase than the computer can
offer. There is however no clear connection between the physical size of the
grammar text file (or even the number of rules the grammar contains) and the
number of states that will be generated; therefore no general rule can be given
for defining beforehand whether a grammar will be too big for compilation. In
the programs of Transform, there are still possibilities to optimize the
process of state generation.
The main limitation lies in
the incompatibility of some of the extended features with the different grammar
types in which the switch FINITE plays a crucial role. Only grammars with a
nested recursion can never be compiled into a finite state machine; for
grammars not containing this type of recursion a choice between the two types
of resulting machines could be possible, in principle. Practically however, the
use of some features has the effect of a fixation of the resulting automaton
type. The following table contains an overview of the restrictions concerning
the use of the switch FINITE:
─────────────────────────────────────────────────────
Item
Can be used in combination with
FINITE=FALSE FINITE=TRUE
─────────────────────────────────────────────────────
Transduction
rules ‑ +
"Recognizing" context sensitive rules + ‑
Negation ‑ +
Negation in unit rules + +
Parameters + ‑
Actions with variables + ‑
Lexicon symbols + ‑
─────────────────────────────────────────────────────
The
restrictions for context sensitive rules apply to rules in recognizing grammars
as well as to the transduction rules.
The left-hand side of a context sensitive rule can never contain:
‑ regular expressions
‑ the alternative sign
‑ ranges or wildcards
‑ non-terminals with
parameters
‑ lexicon symbols with
parameters
‑ negations
‑ actions
In
section section 4.3.1 Description of transduction rules, solutions were been given to overcome these
restrictions.
The right-hand side of context sensitive rules may never be empty.
A
transduction grammar that outputs ambiguous rewritings can only be the last one
in a cascaded grammar.
1. The "or"-operator can only be
used between tests.
2. In
combination with tests, the "or"-operator has precedence over the
conjunction operator; there are (still) no means to reverse this order of
performance. As a consequence there is no equivalence for an "if‑then‑else"
construction. A method to overcome this limitation is described in section section 3.1.3.4 Actions in empty units.
3. Actions can not be placed at the start of the right-hand side of a rule.
1. Non-terminals
with input parameters can not be used in a left recursive
position in a grammar rule.
2. In
a grammar to be compiled with FINITE=FALSE, the successor of an "arb" can only
consist of a terminal. In addition, in such grammars an
"arb" can
never be
the last symbol in a rule.
3. A
non-terminal with output parameters can not be defined by two or more empty
paths with actions.
1. Signals
with variables used inside regular expressions sometimes yield incorrect
output.
2. Chains
of empty rewritings with actions can produce the wrong effects.
Debugging
grammars can be a hard and time consuming job. It demands a lot of
concentration, hard thinking, inventivity and possibly also some experience.
From our own experience, we can only present some suggestions here ... and wish
you inspiration.
Advice
1: Make
use of the warnings the compiler can produce and the built-in debug facilities
of the run time system:
‑ Take
notice of the information in the run time error file if an error occurred.
Sometimes the reported
expected terminals can give a clue.
Warning:
The run time system reports the last read character before the error occurs.
However, if the grammar contains tests, it is possible that the problem is
caused in a much earlier stage.
One
can also set the switch TRACE_ERROR to TRUE (see Appendix A) to get even more information
about the rules that are applicable at the moment of the error. However, in the
case of substantial grammars, this information is too extensive to be of any
help.
‑ If
the run time system signals a possible error in the consulted lexicon, check and, if necessary, correct the entries
before trying again.
Remark:
Repetition of identical entries (with possibly different features) in the
lexicon can cause ambiguity.
‑ Look
at the screen (or batch output) or in the compiler error file for the report
of:
‑
unused rules.
If
the grammar contains a context free rule of which the left-hand non-terminal
does not reoccur in at least one of the right-hand sides of other rules, the
compiler reports such a rule as unused. The
cause can be a typing error in the name of the non-terminal.
‑ a list of intermediates.
The
compiler treats every name of a symbol that does not reoccur on the left-hand
side of a context free rule as an intermediate (see section section
3.2.4 Intermediate symbols). A mistyping in the name of a non-terminal
causes that name to appear in the reported list of intermediates. Also errors
in the names of intermediates themselves can be traced in this list.
‑ In
the case where the grammar contains variables and/or
parameters:
recompile the grammar with the switch CHECK_VARIABLES set to TRUE. The compiler makes a list of all the local
variables used in a rule in the compiler error file; maybe there is a typing
error in the name of a parameter or local variable.
‑ Check
the contents of the switches file (the compiler makes a list in the .POU file) and compare these switches with the
checklist given in section 10.3 Checklist for switches of this manual.
‑ If
the grammar is compiled with NOFINITE, recompile it with the switch RECURSION_MESSAGE; maybe there is an unintended (possibly
implicit) nested recursion.
Advice 2: Take a closer look at your grammar.
‑ Check
the parameters (is their input/output type correct?) and the variables. Check
the assignments and tests.
‑ Check,
if used, the ASCII code representation of characters and control the base of
the numbers.
‑ Turn
yourself into a kind of "machine"; try to follow the problem input
through the course of the grammar. Make note of the reductions and write down
the contents of the variables.
Advice 3: Try to locate the problem in a more specific
way. Two methods can be used:
1. systematically varying the input
2. systematically adapting the
grammar
One
can choose to concentrate on one of the methods or apply them both at the same
time.
Warning:
Before altering the grammar, save a copy of the original and also of the
original switches file and the "problem" input file. It occurs too
often that a problem can no longer be reproduced.
Advice 3a: Vary the input. Figure out what (kind of) input
is treated correctly and what not. Try to do this in a systematic manner.
‑ Run
the program several times, starting with a simple input and make it gradually
more complex. Or, alternatively, start with a complex input and simplify it
stepwise. A closer look at the transition point from "correct treatment'
to "incorrect treatment" can contribute to the solution.
‑ Choose
different inputs that will be processed by different parts of the grammar. By
this one could locate the problem section in the grammar in more detail.
Advice 3b: Adapt the grammar. Try to locate the problem to
a restricted area of the grammar text.
‑ Often
a problem arises "suddenly". While developing and expanding a grammar
an input string treated correctly in the past causes a problem after a certain
change in the grammar. Inspection of the last change can give some clues.
Note:
This illustrates the need for documentation of a grammar; it is recommended to
keep note of the development of a grammar. This can be done by saving copies of
earlier versions, by writing a "log file" and/or by documenting the
grammar itself using the comment feature.
‑ Switch
off one or more parts of the grammar by enclosing rules using comments and/or
replace them with dummy rules. Of course, after every alteration, the grammar
has to be recompiled. Run the grammar every time with the input that caused the
problem and try to locate the transition point of "problem‑no
problem".
‑ Make
use of the
message-feature; at
certain points of the rules one can write reports or signals. Inspection of the
report file can tell whether or not certain points of the grammar were reached
by the input.
‑ If
the grammar is of a suitable type, make a combination of variables (to "store" parts of the input) and output parameters for the starting symbol. If the switches file
contains OUTPUT_PARAMETERS, one also has a check on the rules that were
reached.
‑ If
the problem grammar is a
transducing one,
remember that small parts of such a grammar can be compiled to separate
transducers, that can be linked together to one cascaded grammar (see section section 4.3.6 Cascaded grammars). To some extent, the parts are independently testable.
Besides
the compiler and the run time system, the Transform system contains some
auxiliary programs, used in combination with the two main ones.
When lexicon symbols are used
in a grammar, the run time system must use a lexicon; the steps to build one
will be described in section 9.1 Lexicon. A tool for transforming the resulting parse tree into a better
lay-out is the program Drawtree, described in section 9.2 Drawtree. Drawgraph, found in section section 9.3 Drawgraph, is useful in cases where the transduction output has been produced in
an intermediate, not directly interpretable form. In section 9.4 Plink the
program for building a cascaded grammar will be described. Section 9.6 Disassembler with the description of a disassembler program, mainly to be used by
programmers, concludes this chapter.
All tools can be run in a
command shell. Many tools run also in the gui.
When a
lexicon is used, this lexicon must first be built, using a separate program
called Parserlex. Parserlex creates an ltree lexicon for use in cooperation with a
Transform grammar. The lexicon is built on the basis of a text file which
consists of entries in a specified format.
The format of the entries is
illustrated by the following examples:
$prep('in')
$noun('program', 'sg')
where "in" is
classified as a "prep" (preposition) and "program" is
classified as a "noun" with the characteristic "sg" (singular).
The single quotes may be
replaced by double quotes. To include a quote in a string, it must be written
twice when it is the same quote as the quotes which bound the string. Thus:
"This is a double quote: ""."
'This is a single quote: ''.'
'This is a double quote: ".'
"This is a single quote: '."
The lexical category may also
be enclosed between quotes. The name may contain non-alphanumeric characters,
but it must still start with a dollar sign. For example:
'$NO(x)'('abc', 'xyz')
Blanks (tabs, spaces, end of
line characters) and comments may be added to the text file.
The syntax of the Parserlex
command is in batch:
PARSERLEX
<LEXICON_NAME>
<TEXTFILE>
The name of the text file
should include the extension, for instance .TXT. The program Parserlex
creates three files, all with the name of the original text file with the
extensions .LEX, .ADM and .INF.
A lexicon may be composed of
more than one input file by repeated calls of Parserlex with the same lexicon
name. It is recommended to make a backup copy of the lexicon before a new file
is added to it.
Remark: Identical entries will be ignored but a
warning will be displayed.
There is a demonstration
program for the Ltree lexicon software, which can also be used for the
examination of an existing lexicon. The program is called without arguments:
LEXDEMO
Another program shows the
specifications of an existing lexicon. This program also enables the user to
lower the "use count" of a lexicon. The "use count"
registers the number of processes that simultaneously consult the lexicon. When
the run time system crashes because of a console interrupt or a system failure,
the use count is usually not decremented. Since a use count of zero is needed
for modification of a lexicon, the use count should be corrected by the program
Lexspecs, which is also called without arguments.
LEXSPECS
Using the program Unparserlex,
it is possible to check the contents of a specific lexicon. With the same
program a lexicon can be translated to a text file. The command for running
this program is:
UNPARSERLEX
As mentioned before, the lay-out of the resulting parse trees is a very
simple one. With the program Drawtree one can transform the trees to another
lay-out. For example, the parse tree given in section 5.3 has been treated by
this program and resulted in:
LicencePlates
(‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑*‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑)
| | | |
| | | |
LicencePlate
cr
LicencePlate cr
(‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑*‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑) |
(‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑*‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑) |
| |
| | |
| | |
| | |
|
| |
| | |
| | |
| |
| |
Digits Hyphen Digits
Hyphen Letters |
Letters Hyphen Digits
Hyphen Digits |
| |
| | |
| | |
| | |
|
| |
| | |
| |
| | |
| |
12 ‑ 34
‑ AB <13> AB
‑ 12 ‑ 34
<13>
The user can partly influence
the lay-out produced by Drawtree by means of a switches file.
The command for the program
is:
DRAWTREE
<INPUTFLE>
<OUTPUTFILE> <DRAWTREE_SWITCHES_FILE>
For example, after running a
grammar over the input file MY_INPUT the resulting general output
file MY_INPUT.OUT can be modified to a file
called MY_INPUT.TREE by means of the call:
DRAWTREE MY_INPUT.OUT MY_INPUT.TREE
or, if a Drawtree switches
file with the name MY_SWITCHES.DSW is used:
DRAWTREE MY_INPUT.OUT MY_INPUT.TREE MY_SWITCHES.DSW
The switches that can be added
to a Drawtree switches file are listed below:
Switch: DRAW_SKIPBLANKS
default value: TRUE .
explanation:
In the resulting parse trees, the leaves that consist of
a blank or a tab are omitted.
Switch: DRAW_MESSAGES
default value: FALSE .
explanation:
If DRAW_MESSAGES is FALSE, no messages are represented in the resulting tree(s).
Switch: DRAW_SKIPCHARS
possible values: enumeration of terminals
default value: the empty set
explanation:
In writing a grammar, some
non-terminals are only introduced in order to write the rules in an efficient
way. If these non-terminals are superfluous in the parse tree(s), one can
choose to name them with strings that all start with the same unique character,
for instance the underscore ("_"),
and to add to the Drawtree switches file: DRAW_SKIPCHARS = '_'. Another application is to omit names of
lexicon symbols by adding
DRAW_SKIPCHARS = '$' to this
file. You can do both at the same time by adding the switch
DRAW_SKIPCHARS = '_' | '$'.
Switch: DRAW_UPSIDEDOWN
default value: FALSE .
explanation:
DRAW_UPSIDEDOWN=FALSE results in a tree with the root at the top and
the leaves at the bottom, the value TRUE results in the opposite.
The program Drawgraph is only
of interest for output of transduction grammars. If such a grammar produced
ambiguous output the transduction output file (.TDO) is not immediately readable.
This output has to be converted by means of the program Drawgraph. The calling
format for this program is:
DRAWGRAPH
<INPUTFILE>
<OUTPUTFILE>
For example:
DRAWGRAPH MY_INPUT.TDO MY_INPUT.TRANS
The
linking of two or more compiled grammars into one cascaded grammar (see section
section 4.3.6 Cascaded grammars) can
be performed by the next command:
PLINK
<RESULTGRAMMAR>
<GRAMMAR1>,<GRAMMAR2>,…,<GRAMMARN>
The name of the cascaded
grammar which can be run will be <RESULTGRAMMAR>.
The linking of grammars can be
performed in steps: after linking two (or more) grammars to a resulting grammar
"result1" and linking two other grammars to "result2", the
two resulting grammars can again be linked together.
The maximum number of grammars
that can be linked is 25.
Example:
PLINK
LARGEGRAM SMALL1,SMALL2,SMALL3
As can be expected, the
program Plink constructs files with extensions .PG, .LB, .ST and .SM. It also creates a file <RESULTGRAMMAR>.SAM, which lists the grammars
that were used to build the resulting grammar. Its only use is as
documentation; it can safely be deleted.
Remark: If at
least one of the grammars that will be part of a cascaded grammar contains
intermediates, the other related grammars that do not make use of
intermediates, must be compiled with the switch INTERM_INPUT set to TRUE.
The program Plink does not
create a switches file; the user has to create one if necessary. The run time
switches will be applied to all the composing grammars.
The
utility Xgram links the program code file together with the run time system
into one executable program file. This leads to a faster start-up time of the
run time system. It also makes the compiled grammar independent of changes in
the run time system, which may lead to code incompatibility. The files
<grammar>.EXE and <grammar>.COM are created by the following
command:
XGRAM
<GRAMMAR>
or, if there are run time
switches in a file with a name not equal to <GRAMMAR>.SW:
XGRAM
<GRAMMAR> <MY_SWITCHES>
It is recommended to supply a
full path name for <grammar>, since in this case the generated command
file becomes independent of the current default directory. Better still,
precede <grammar> with a logical name referring to the directory.
The grammar transformed into
an executable program can be called by the following command:
<GRAMMAR>
<INPUTFILE>
or, for interactive use (input
from the keyboard), simply:
<GRAMMAR>
Remark:
Currently, the utility XGRAM is not functioning. Some maintenance is needed in
order to be operational again.
The utility
Disassembler is mainly of interest for programmers. It operates on the program
code file and the symbol table file. It produces a copy of the program code in
a symbolic and annotated form, which can be useful in debugging the grammar (or
modules of the Transform system itself). The program is called as follows:
DASS <GRAMMAR>
Its output is a file with name <GRAMMAR>.DA.
Remark:
The program code file can become very large in the case of substantial
grammars.
All
meta symbols used in Transform grammars together with their names and a short
description of their functions are listed below:
Delimiters and operators used in rules:
|
name |
meta symbol |
function |
|
rewrite sign |
::= |
rewrites a left-hand side into a right-hand side |
|
transduction sign |
<:: |
transduction: written between left- and right-hand
side |
|
concatenation sign |
, |
concatenation |
|
alternative sign |
| |
between alternatives |
|
end of rule sign |
. |
end of a rule |
|
comment signs |
/ * and * / |
start and end comment |
|
action brackets |
{ and } |
between these brackets actions are notated |
Reserved symbols for regular expressions:
|
name |
meta symbol |
function |
|
regular expression open |
( |
open bracket of a regular expression |
|
negated reg. expr. open |
`( |
open bracket for negated regular expression |
|
r.e. close zero or one |
)? |
zero or one times the enclosed part |
|
r.e. close one |
) |
one times the enclosed part |
|
r.e. close zero or more |
)* |
zero or more times the enclosed part |
|
r.e. close one or more |
)+ |
one or more times the enclosed part |
Reserved non-terminals and
terminals:
|
meta symbol |
function |
|
|
dontcare symbol |
* |
one arbitrary character |
|
line symbol |
‑ |
zero or more arbitrary
characters |
|
arb |
= |
zero or more arbitrary
characters, not including the terminal that follows |
|
carriage return |
cr or CR |
end of line character |
Operands
and indicators for symbol names:
|
name |
meta symbol |
function |
|
string sign |
' |
single character terminals
or strings are denoted between single quotes [15] |
|
escape sign |
\ |
terminal indication for a
non-name character |
|
octal brackets |
< and > |
between the angle brackets a
number representing the ASCII code of
a character |
|
base indicator |
#X or #x |
the next number has a
hexadecimal notation |
|
ranges sign |
[..-..] |
defines range of characters |
|
negation sign |
` |
before the symbol: the
negation of that symbol |
|
lexicon sign |
$ |
start of the name of a
lexicon symbol |
|
cover sign |
^ |
end of the name of a cover
non-terminal |
|
parameter brackets |
( and ) |
between these brackets
formal or actual parameters are denoted |
Indicators [16] in formal (and actual) parameter lists:
|
name |
meta symbol |
function |
|
input indicator |
i: |
type indicator for input
parameter |
|
output indicator |
o: |
type indicator for output
parameter |
|
input/output indicator |
io: |
type indicator for
input/output parameter |
Indicators
16 and
operators in actions:
|
name |
meta symbol |
function |
|
report indicator |
r: |
report message |
|
signal indicator |
s: |
signal message |
|
action conjunction |