5. LingDoc and Parsing and Transduction Technology: Position
Paper
1
Computational layers
For each linguistic layer there may be
several computational layers. Their advantage is that the overall usability of
a system is improved.
The total complexity decreases because of
the principle “divide and conquer”. The understandability increases because
each layer is studied on its own. Extensions are made easier and maintenance
becomes easier.
o
Requirements
§
There shall be an efficient
handling of ambiguities at all levels.
§
There shall be an integrated
error handling throughout all levels, with correct explanations at the user
level.
o
Current solutions
In current practice of document and
language engineering the following computational layers may be in use:
§
A workbench for the entry and maintenance of specifications
in a special formalism for a specific type of application, with eventually an
automatic translation onto a grammatical description.
§
A workbench for the entry and maintenance of grammatical
descriptions written in a grammatical formalism.
§
An interpreter for the
grammatical formalism or a program generator (which may be called a compiler)
which translates the grammatical description into code for a formal automaton.
The last approach is not mentioned by Jurafsky.
§
An interpreter for the
automaton, or a compiler which translates the code for a formal automaton into
a programming language or directly into computer instructions.
§
The resulting program which is
an interpreter, a parser or a transducer for a document/text.
§
Principles, theory and practice
of computer science can be applied for the interpreters and compilers involved.
o
Position of Transform and
Clarity
The computational layers are:
§
A workbench for the entry and maintenance of grammatical
descriptions written in a grammatical formalism.
§
A compiler which translates the
grammatical description into code for a formal automaton
§
An interpreter for the formal
automaton.
§
The resulting program which is
an interpreter, a parser or a transducer for a document/text. This program may
be embedded in another process, like an authors workbench.
2
Recognizer, parser, transducer
Recognizers, parsers and transducers are
programs that execute the grammatical specification for each layer. They read
the input and generate output.
§
A recognizer is a process which takes as input a string and either
accepts or rejects it depending on whether or not the string is a sentence
that may be produced by the syntactic description.
§
A recognition process outputs,
in principle, "yes" or "no" .
§
A parser is a recognizer which also outputs one or more structural
descriptions (“parse”) of the sentence according to
the syntactic description.
§
A transducer is a recognizer or parser which emits
output-symbols. We distinguish two situations. The first one concerns a
recognition process which generates a message when some point in the grammar is
reached. The second one concerns a process where output-symbols are emitted.
o
Position of Transform
§
Transform can act as a recognizer,
a parser or a transducer.
§
The output may contain:
structural descriptions (with values of variables), reports, the variables which are
associated with the start-symbol and transformed text.
o
Position of Capri
§
Capri can act as a recognizer, a parser or a transducer.
§
The output after each sentence
may contain: a message, structural descriptions (with values of variables), transformed/translated sentence.
3
Off-line, on-line, real-time
Processes are embedded within an
environment which can impose time-constraints. The following definitions are
relevant.
A process is said to be :
§
Off-line when a response about
rejection or acceptation is given after processing the whole input. This
assumes that the output is of limited length. E.g., the strategy of
backtracking forces a process to be off-line.
§
On-line when there is a response
after each symbol of the input, indicating possible continuation or rejection.
The input is potentially of unlimited length. In the case of a transducer the
output is sometimes called “streaming output”.
§
Real-time; when the on-line response will come within a
guaranteed time. The term indicates the cooperation with other processes where
this guarantee is imperative; usually, rejection will not occur and the input
is potentially of unlimited length. In principle there is no upper bound for
the guaranteed time. In a more colloquial definition of real-time the delay
is, for human experience, ignorable.
o
Requirements
§
Embedding in continuous
processes requires the on-line property.
§
Cooperating processes will need
a signal when a specific position in the input is reached.
§
If a partial semantic
representation has to be built during on-line processing, then evaluated
expressions with variables have to be emitted after each input symbol.
o
Position of Transform
§
Transform has the on-line
(streaming) property. It emits output when the process is in a deterministic
state.
o
Position of Capri
§
Capri has the off-line property for a complete sentence. It emits output
when a sentence is completed.
4
Indeterminism
During recognition, parsing and
transduction indeterminism may arise.
With on-line recognition of ambiguous grammars there are 3
possibilities of proceeding after a new symbol is read in:
§
no proceeding possible (and an
error-recovery scheme may have to be invoked)
§
there is exactly one
possibility of proceeding (then the process is said to be deterministic)
§
there are two or more possible
ways of proceeding (the non-deterministic case).
o
Requirements
§
In linguistic processes all
forms of ambiguities amongst all linguistic layers must be handled.
§
Subsequently, in up-conversion
processes and in author workbenches multiple choices have to be presented to
the user.
o
Current solutions
§
Temporal ambiguities may be
resolved by lookahead.
§
Inherent ambiguities may be
resolved by
-
Explicit backtracking ("depth first", provided in some
interpreters of formalisms and in some programming languages (like Icon, Prolog etc.).
-
Parallel parsing; where all possible paths are
tried in parallel ("breadth first") and common sub-paths
are shared. This is attractive for the parsing of context-free grammars,
because of the possibility of cubic processing time, e.g. in a parse forest
with Tomita style parsing (GLR), or in the (interpretative) algorithm of
Earley.
o
Problems with current solutions
§
Temporal ambiguities are often
introduced by inexperienced users who try to write an initial grammar. In that
case a constraint to non-ambiguous grammars is frustrating.
§
Inherent ambiguities mostly
have semantic reasons. They may be minimized by maximizing the use of syntactic
clues in the input, making use of lookahead. A grammar formalism together with
the indefinite lookahead property provides for a context of indefinite size, to
the left and to the right. This is in contrast with event-based formalisms.
§
In many implementations of
grammatical systems the handling of ambiguity is poor. Most implementations
work with chronological backtracking techniques, which operate
off-line. This also leads to exponential runtime behavior.
o
Position of Transform
§
In the design of LingDoc
Transform, special care has been paid to handle ambiguity in an efficient way.
§
Parallel parsing uses
recombination of paths; this accounts for the on-line property. There is no
backtracking.
§
Also, attributes and
expressions are evaluated on-line.
-
There is error repair when no proceeding is
possible.
-
If one analysis path remains, the output, which
is stored in the remaining parse-forest, is streamed.
o
Position of Capri
§
For each sentence there is
parallel parsing according to the Tomita algorithm. After the analysis is
finished attributes are unified, expressions are evaluated and output is
emitted.
5
Interpreters and compilers
o
Interpreters
§
Description of issue
An interpreter for a grammatical formalism
calculates after each input symbol what will be the next state of the process.
During the development of a grammar the
use of an interpreter may be advantagous because the grammar need not to be
compiled.
§
Requirements
For efficiency the formalism may be
pre-processed.
§
Current solutions
A well-known example for the parsing of
context-free languages is the dynamic programming algorithm of Earley. This
algorithm has been extended for a number of extensions of context-free grammars
(see also Jurafsky).
§
Position of Transform
As an option, Transform may perform
parsing for ecfg’s with Earley’s algorithm. The pre-processing is more or less
the same as for LR-parser generation. The idea is that for testing grammars an
interpreter may be faster than compilation with subsequent parsing.
Accordingly, in the course of time, when
additions were made to the ecfg formalism of Transform, Earley’s algorithm has
been extended. However, there has been a delay in implementation. Chapter 6 of
the thesis describes which pre-processing can be performed for the full
formalism.
o
Compilers
§
Description of issue
In general: a compiler transforms a
specification written in a formalism into an executable program. This may be
machine code or intermediate code which will be interpreted.
After the development of a grammar the use
of a compiler may be advantagous because the run time will drastically improve.
A more or less equivalent term for a
compiler is a program generator; if the program is a parser then the compiler
may be called a parser generator.
A parser generator takes a formal
description of a grammar (e.g. in BNF) and outputs code for a parser. The
parser will recognize valid strings according to the grammar and will perform
actions associated with grammar rules.
In generating the code the compiler describes every possible so-called
state of the parser during the treatment of the input. This state is only
dependent on the section of the input already treated.
§
Position of Transform and Capri
Both have a compiler.
See further the separate section on
program generation.
6
Automata
In Computer Science a correspondence is
maintained of grammatical formalisms and formal automata.
A formal automaton consists of a number of
states and of set of data structures. The states are pre-calculated by a parser
generator. The data structures are updated during the processing of the input.
Each state is associated with a unique situation. Upon reading a symbol a
transition is made from one state to another, together with a possible update
of the data structures and an emittance of output.
The complexity of time and space of
recognition, parsing and transduction can be studied more easily with the aid
of automata.
The automata which correspond to ecfg’s
operate on-line.
o
Requirements
§
Automata shall be able to
handle non-determinism.
§
Automata shall run with the
lowest complexity possible.
o
Current solutions
§
Jurafsky e.a. discuss a number
of solutions. However, they do not discuss solutions for non-deterministic
breadth-first parsing where a parse forest is maintained, like in the
algorithms of Tomita (GLR).
o
Position of Transform
§
The development of Transform
started with a solution comparable with Tomita’s solution. (This development
occurred independently.) The extensions of the formalism are reflected in
extensions to the formal automaton. The result was the so-called Parallel
Transduction Automaton (PTA), described in the thesis.
§
Upon reading the next symbol a
number of instructions are executed which operate on two dag-structured stacks
and on the parse forest. The instructions are calculated by the compiler.
§
The developed run time system
is an interpreter for the PTA. The run time system tries to follow all possible
paths in parallel (breadth first). If there is more then one path, common parts
of the resulting parse trees will be shared as much as possible in a “parse
forest”.
§
There is a “deep binding” of
variables and output in the parse forest.
§
In the case of a transduction
grammar pieces of a text will be transformed. If no more rules are applicable
the transformed text is emitted.
§
Multiple grammars may be
cascaded. Output from one PTA is then submitted to another PTA, as soon as
output becomes available.
o
Position of Capri
§
Capri follows the solution of Tomita. After parsing a sentence all
resulting parsetrees are reconstructed. For each parsetree the variables and
expressions are evaluated as if there was one parsetree.
7
Lexical scanner, lexicon and parser
Commonly, the lexical scanner, the lexicon
and the parser operate in different linguistic layers.
o
Requirements
§
The lexicon shall provide for
-
Multiwords
-
Idiomatic expressions.
§
The lexical scanner has to
identify tokens for the parser/transducer, markup of the document and strings
of separators. In the output of a transducer the markup and strings of
separators have to be re-inserted in the right place.
o
Problems with current
solutions, according to the usability criteria
Conceptually, the lexicon is part of a syntactic
description, but in practice it plays a separate role. Most implementations
make use of an external data structure for which it is necessary to know the
whole entry before it can be looked up. This may cause troubles with multiwords and words that appear in the grammar itself.
o
Position of Transform
§
The lexical scanner handles
symbols on the character- and on the word-level.
§
Words (in general strings) may
be described in the lexicon as well as in the grammar.
§
The lexicon is essentially
treated as an extension of the grammar so that ambiguities on the lexical level
are dealt with in parallel with ambiguities on the syntactic level.
§
The access to the lexicon runs in parallel with the
parsing process and is transparent for ambiguities, multiwords and words which appear in the grammar.
§
The lexicon is implemented as
an external “trie” data structure. Because of this, two valuable properties
result:
-
the on-line property is maintained at the
character level
-
the access-time of the lexicon is independent of
the size of the lexicon.
o
Position of Capri
§
In Capri
a word is the smallest lexical unit. It has to appear in the lexicon.
§
The lexicon resides in a
database. It can handle ambiguities and multiwords. The handling of ambiguities in the lexicon and in the grammar is
transparent.
§
The lexical scanner
distinguishes words and strings of separators. A string of separators may
contain XML markup. During translation the XML markup is kept with the lexical
units with which they are associated.
§
Rules for sandhi in the
translated text are, up till now, programmed for each target language.
8
Robustness
A typical aspect of practical Document and
Language Engineering is integrated error handling among all linguistic layers.
Errors arise from ill-formed input and
from wrong or inconsistent grammatical specifications.
o
Requirements
§
Error handling shall be robust.
§
Errors have to be detected as
soon as possible.
§
Errors have to be
"repaired", as correctly as possible, in order to continue the process.
§
All errors have to be reported,
with an exact positioning and with suggestions for correction.
o
Position of Transform
§
Errors are repaired as long as
possible and as correctly as possible, in order to continue the process.
§
The grammar writer may define a
set of “recovery symbols”. If an error is encountered the process continues
along all paths that are available until a recovery symbol is met. There is a
strategy to determine which input symbols have to be skipped and which recovery
paths in the dag-structured stack and parse forest have to be pruned.
o
Position of Capri
§
Capri works with grammar rules in which grammatical errors are specified,
together with their corrections.
1
Workbenches
Linguists should be provided with a
workbench for the development of lingware.
o
Requirements
§
There have to be facilities for
tracing and debugging of grammars.
o
Position of Transform and Capri
§
Detailed tracing of the
parsing/transduction process is possible. However, the LR parsing processing
works bottom-up and is therefore difficult to follow.
§
More visual tools for the
grammar writer are needed.
2
Embedding
A process for recognition,
parsing or transduction may be embedded within other processes.
o
Requirements
§
The embedding has to follow
standard procedures for cooperating processes.
o
Position of Transform
§
There are no specific
provisions for embedding. However, Transform is programmed as a multitasking
system.
o
Position of Clarity: Capri, Lexbench, Author
§
Capri is the engine which performs the analysis, correction and
translation; it runs in the background of Author or as a standalone program.
§
Author is the generic name for
the interface between Capri and an external
authoring system.
§
Author assists an author during
normal editing tasks with correction and standardization of spelling,
terminology, grammar and style. If errors are encountered, suggestions for
corrections are given to the author.
§
Author has been implemented in
three versions:
-
As an interface to LEditor, a standalone editor,
developed internally by Lingware services. This was the first version of Author
with a tight integration of editing functions, functions of Capri, functions
for inspecting and updating the lexicon and functions for the workflow of
authoring documents in controlled language. LEditor has been productive in a
number of projects. The
experience with it suggested a more formal approach for the development of an
independent version of Author with an own identity.
-
FmEdit, an add-in for Framemaker, was the second
version of Author.
-
WoEdit, an add-in for Microsoft
Word, was the third version of Author.
-
A still more generalized version of Author has
been envisaged but has not yet been implemented.
3
Compiler/program generator/parser generator
The theory of parser generation stems from
computer science. LL (bottom-up) and LR (top-down) parsing have drawn a lot of
attention because they cover most types of context-free grammars, while
indeterminism of the parser may be resolved by lookahead. Parser generation
involves a direct mapping of a grammar onto a stack automaton.
Variants of LR parsing are LALR and SLR.
LALR(1) means: LALR parsing with one lookahead symbol. SLR(k) means: SLR
parsing with an indefinite number of lookahead symbols.
o
Requirements
§
The extensions to ecfg’s
according to Position Paper 4 shall also be handled, including ambiguous
grammars.
§
The grammatical formalism shall
be translated directly into code for a formal automaton.
§
Errors shall be handled.
§
The theoretical lower bounds for
time and space shall be approached.
o
Current solutions
§
Examples of open source parser
generators are
-
Yacc (LALR(1))
-
Cup (LALR(k))
-
Antlr (LL(k))
-
Precc
(LL(k)), where the formalism is EBNF
with attributes.
o
Problems with current solutions
§
there is no ambiguity allowed
§
there are no extensions to
cfg’s
§
difficulty of debugging.
o
Position of Transform
- The
LR method accommodates more types of grammar than the LL method.
- The
larger the number of lookahead symbols, the more time is needed for parser
generation. Therefore, if ambiguous grammars are accommodated then there is
less need for a large number of lookahead symbols. SLR has been chosen instead
of LALR because it generates less states than LALR and subsequently less code
is generated.
§ Extensions to LR parser theory have been made for the formalism of
Position Paper 4.
§ Optimizations are built in for code generation for the lowest type
of formal automaton possible. E.g.: grammars without middle recursion are
mapped onto a finite state automaton. Empty or unit rules are eliminated.
o
Position of Capri
§
an ECFG is first translated
into a CFG
§
variables are handled
separately from the CFG
§
SLR(1) parser generation is
used
§
parsing is done by a stack
automaton with a parse forest (Tomita style).
4
Complexity of time and space
Typical stages of the evolution of
algorithms and programs are
§
the resolution of
undecidability problems
§
the development of naive
algorithms (usually exponential in running time or NP-complete)
§
the improvement in speed to
polynomial or linear time algorithms
§
the improvement of the constant
factors
§
the optimizations within the
program, not dealing with the algorithm, and
§
the adaptations in the hardware
of the machine (e.g. development of specialized micro code).
o
Requirements
§
Complexity of time and space
has to be as small as possible; the theoretical lower bounds for time and space
shall be approached.
o
Current solutions
§
There is ongoing research on
the complexity of algorithms; in general the strategy is to find sub-classes of
classes of problems for which the complexity can be improved. For the
complexity of time this may concern:
-
From NP to P
-
From P to linear time
-
From linear to sub-linear time.
o
Position of Transform
§
In Transform a number of improvements
are implemented. These are described in the thesis. There it is shown
-
how to parse on-line in exponential time ambiguous type-0 grammars and transduction grammars,
enriched with variables and with an on-line construction of the parses, with
the aid of the PTA
-
how to parse on-line in polynomial time,
with an on-line construction of the parse trees,
extended context-free grammars with don't cares
-
how to reformulate the problem of pattern
matching of Aho and Corasick into the problem of LR-parsing, with the same
linear time behavior
-
how to extend the LR parser generation method for the treatment of grammars
with Boolean expressions for the treatment of term-rewriting systems in such a way that pattern matching with a number of keywords,
containing regular expressions
and don't cares,
can be implemented to run in sub-linear time
-
how to generate in all these cases the minimum
of code for the PTA in order to use only those parts which are
necessary
-
how to implement the PTA on parallel hardware.
o
Position of Clarity: Capri, Lexbench, Author
§
No special improvements of complexity
have been implemented. In the runtime system the algorithm of Tomita is
implemented.
5
To be done
o
Transform: a number of sub
formalisms have been disabled temporarily. They have to be activated again.
They concern:
§
Interpretation according to
Earley.
§
“Cooperation’s” between
(pattern-)grammar rules (not mentioned before; this concerns the
synchronization of patterns, as an extension to Boolean “and”).
§
Tree pattern matching (its
maintenance was postponed because Xquery became available).
§
Booleans of nonterminals.
o
Transform and Capri:
probabilities of grammar rules.
o
Transform: full unification by
on-line lazy evaluation.
o
Capri: from explicit to implicit unification (a matter of
pre-processing).
o
Transform and Capri:
easier facilities for debugging grammars.
6
Literature
o
Van der Steen, G.J., A program generator for recognition,
parsing and transduction with syntactic patterns. CWI Tract 55, Centre for
Mathematics and Computer Science, P.O.
Box 4079, 1009 AB Amsterdam,
The Netherlands,
284 pp, 1988, ISBN 9061963613.
o
Van der Steen,
G.J., “Unrestricted on-line parsing and transduction with graph structured
stacks”. In: Lankhorst, M. et all (eds.), “Tomita’s Algorithm - Extensions and
Applications”, Twente Workshop on Language Technology 1 (TWLT1), University of Twente, 1991, pp. 36-78.
o
User Manual Transform.
o
Reference Manual Transform.