4. LingDoc and Grammar Formalisms: Position Paper

 

In this paper we position the grammar formalisms of Transform and Clarity.

Formalisms for Document and Language Engineering range from general purpose programming languages to specific application oriented languages. Language designers want to maximize the applicability of their formalism; problem solvers want to specify their problem in the shortest way possible, with implicit assumptions about the domain of their problem.

For a new problem domain the natural tendency is that problem solvers start with a general purpose programming language and that in the course of time more specific languages are developed. Computational Linguistics has not been an exception to this tendency. As more linguistic theories emerged more formalisms were invented to suit their theories. Sometimes, in that process computer scientists tend to enforce their ideas about formalisms on linguists. Also, linguists may have wrong ideas about the complexity of computer programs and for that reason take inappropriate decisions about the use of formalisms. They are not aware of internal optimizations which influence the complexity.

1         Requirements

In general, the following requirements for formalisms for Document and Language Engineering may be stated, based upon our experience and observation.

o      A formalism for a problem domain shall

§       have high expressive power.

§       have expressive flexibility.

§       have minimal complexity, like control of flow; it shall be more functional than procedural.

§       be adjustable for weak and strong equivalency.

§       be adjustable in order to restrict users in the choice they can make of sub-for­malisms.

§       allow for ambiguity (an intrinsic property of natural language).

§       This all without the formalism be­coming too powerful and introducing new problems on this account, like undecidability.

o      The blend of Document Engineering and Language Engineering requires that a formalism shall express:

§       Context sensitivity to document markup

§       Patterns for information extraction, in the same way as queries for document databases.

o      The complexity of the formalism shall be minimized by use of computational layers

§       These layers hide from the user all details that can be automated.

§       Program generation shall be possible, like parser generation for grammatical formalisms.

2         Current solutions: formalisms for Computational Linguistics

Jurafsky and Martin (2000, 2008) discuss a large number of syntactic formalisms for Computational Linguistics. Some of them have been used only experimentally, others are in use in practical natural language systems. We are in­terested in the latter category and in the question how we can build effective systems.

Current formalisms, among others, are the following.

§       (Extended) Context-Free Grammars (ECFG, borrowed from Computer Science)

§       Recursive transition networks (RTN)

§       Augmented transition networks (ATN)

§       Attribute Grammars (borrowed from Computer Science)

§       Affix grammars

§       Augmented Phrase Structure Grammars (APSG)

§       Unification based APSG grammars

§       Definite Clause Grammars (DCG: ECFG+features+unification, mostly written in Prolog)

§       Probabilistic Context-Free Grammar (PCFG)

§       Head Phrase Structure Grammars (HPSG)

§       Tree adjoin Grammars (TAG), in which restrictions are imposed on parsetrees

§       Lexical Functional Grammars (LFG)

§       Syntax directed translation schemata

§       Pattern Matching and Replacement languages (PMR).

3         Current solutions: packages for Computational Linguistics

There are some commercial packages available for the deployment of one or more of the abovementioned formalisms. Some try to introduce a graphical representation of the syntax.

Most implementations are provided by universities, as an offspring of dedicated projects.

4         Current solutions: formalisms for Document Engineering

Document Engineering has been facilitated by a number of standards, notably those of the W3C. Syntactic formalisms have been developed for, among others, document structure description (like DTD and schema languages), document transformation (like XSLT) and document querying (like XQUERY).

5         Problems with current solutions

Some problems manifest themselves gradually during the course of a project. One should pay attention beforehand in order to avoid them.

§       There is a wrong match between a problem statement and the applied formalism. An example is a linguist who uses a programming language for parsing.

§       The syntax is too verbose or too terse.

§       No ambiguities are allowed.

§       There is no context-sensitivity to document markup.

§       No program generation possible.

6         Positions of LingDoc Transform and Clarity

Both Transform and Clarity try to reach the requirements by the unification of frequently used formalisms in Computational Linguistics and Computer Science. Starting with the latter, extensions which were necessary for the solution of practical problems in Language and Document Engineering were gradually added. Document structure, linguistic structure and subsequent transformations can be expressed by one common formalism.

The emphasis during the development of Transform was to exploit and extend techniques from Computer Science in order to reach efficient solutions for a large number of applications. The property of on-line processing has constantly been maintained.

o      The main line of development of Transform has been:

§       ECFG, with full handling of ambiguities

§       Extension with a metagrammar in which the formalism is described

§       Extension with regular expressions on the character level

§       Extension to attribute grammars

§       Extension to (ambiguous) Chomsky type 0 grammars

§       Extension to pattern grammars, enabling syntactic pattern recognition and information extraction (PMR)

§       Extension to pattern rewriting

§       Extension with program statements within grammar rules

§       Maintaining on-line integration with lexical scanner and lexicon

§       Extension of LR parser generation techniques for a formal automaton which corresponds to the extended formalism; the automaton has the on-line property (see Position Paper 5).

o      The emphasis during the development of Capri was to accommodate the use of Controlled Languages.

§       Start with ECFG, with full handling of ambiguities

§       Extension to attribute grammars

§       Extension with Syntax directed rewriting

§       Extension with program statements within grammar rules

§       Extension of LR parser generation for an automaton which conducts Tomita-style parsing; the automaton has the off-line property (see Position Paper 5).

The formalisms of Transform and Capri have a large common core with their own extensions. The details of both formalisms will now be described.

o      Chomsky-type grammar

§       Transform: type 0

§       Capri: type 2

§       Grammars may be left- and right-recursive.

o      Regular expressions on the grammar rule level at the right-hand side

§       Like in ECFG (EBNF notation).

o      Regular expressions on the character level. Transform only.

§       Integrated with regular expressions on the grammar rule level. One of the advantages is that, compared to regular expressions in PM languages, the expressions become much more readable and more powerful.

§       Characters, literary strings and parts of the lexicon may be used in the grammar.

o      Wildcard symbols for pattern matching purposes. Transform only.

o      Variables, synthesized and inherited (like in attribute grammars)

o      Unification of variables. Capri only.

The unification is specified explicitly by comparison of variables.

o      Grammatical correction.

Capri only. This is performed by correction rules and by overruling attributes, e.g. for the correction of agreement. Connected to the correction rules are messages with explanations for the author.

o      Programming statements.

The grammatical formalism is declarative, but within grammar rules programming statements may be inserted. They communicate with the grammar rule through variables.

§       Transform: program statements are executed sequentially.

-    Assignment

-    Concatenation

-    If-then-else

-    Halt/continue current parse

-    Output.

§       Capri: program statements are executed when the value of variables becomes available.

-    Assignment: within a grammar rule a variable may be assigned only once; because of this the sequence of statements is not important

-    Concatenation

-    If-then-else

-    Halt current parse

-    Skip next notion in grammar

-    Output.

o      Types of output

§       For recognition only: the answer yes/no.

§       For parsing: the parse structure, for all ambiguous parses.

§       Transduction. (Transform only.) By output instructions. These may concern “signals” and “reports”. A “signal” is an integer which is emitted when some position in the grammar is reached. Signals may trigger other cooperating processes. These pro­cesses are often embedded in an on-line environment. A “report” is a signal together with the value of an expression. If the integer value of the signal is left out an on-line stream of values of expressions is emitted.

o      Cascaded grammars.

Transform only. The output of one cascaded grammar is the input for the next one. The grammars are connected by pipes.

o      Reordering constructions for transduction.

§       Transform: transduction rules are written as Chomsky type-0 rules. The right-hand side (rhs) of a rule will rewrite to its left-hand side (lhs). In the case of a transduction grammar, pieces of a text will be transformed. If no more transduction rules are applicable the transformed text is emitted.

§       Capri: transduction rules are written as rules for syntax directed translation. Translation and correction are treated with the same mechanism. Correction is seen as translation from incorrect to correct sentences.

o      Probabilities.

Not in Capri and not in Transform. In Transform provisions for probabilities of grammar rules were made, but there has not been a final implementation. The reason was that because of the gradual extension of the formalism the implementation of the handling of probabilities had to be adjusted over and over again.

o      Transparent for all ambiguities, grammatical and/or lexical.

o      Generation of programs in pseudo-code.

The code is interpreted by a formal automaton.

o      Metagrammar.

Transform only. The syntax of the formalism is described by a metagrammar which is adaptable and which is written in the same formalism. Therefore, the user may adjust the syntax to become more terse or verbose or to disallow some sub-formalisms. Transform comes with two metagrammars. The “classic” one was used until 2003 (also used in the thesis). The “modern” one follows the syntax for the W3C recommendations.

7         Tables for comparison

The following tables give a comprehensive comparison of Transform, Capri and the formalisms that were listed before.

 

 

 

strong

equi

valence

cfg

Chom

sky type-0

grammar

regular

expres

sions

on

char

level

regular

expres

sions

on

grammar

rule

level

characters

as tokens

RTN

 

 

 

 

x

 

ATN

 

 

 

 

x

 

Attribute

 

 

 

 

x

 

Affix

 

 

 

 

x

 

APSG

 

 

 

 

x

 

APSG +unification

 

 

 

 

x

 

DCG (in  Prolog)

 

x

 

 

 

 

PCFG

 

x

 

 

x

 

HPSG

 

x

 

 

 

 

TAG

 

 

 

 

 

 

LFG

 

 

 

 

 

 

Syntax

directed

translation

schemata

 

x

 

 

 

 

PMR

 

 

 

x

 

 

Transform

x

 

x

x

x

x

Capri

 

 

 

 

x

 

 

 

 

 

lexe

mes as

tokens

wild

cards

varia

bles

unifi

cation

of varia

bles

bool

eans

cor

rec

tion / trans

lation

RTN

x

 

 

 

 

 

ATN

x

 

x

 

 

 

Attribute

x

 

x

 

 

 

Affix

x

 

x

 

 

 

APSG

x

 

x

 

 

 

APSG +unification

x

 

x

im

plicit

 

 

DCG (in  Prolog)

x

 

x

im

plicit

 

 

PCFG

 

 

 

 

 

 

HPSG

 

 

 

 

 

 

TAG

 

 

 

 

 

 

LFG

 

 

 

 

 

 

Syntax

directed

translation

schemata

 

 

 

 

 

x

PMR

 

x

x

 

x

 

Transform

x

x

x

ex

plicit

on

line

x

 

Capri

x

 

x

ex

plicit

off

line

 

x

 

 

 

pro

grams

attached

output

instruc

tions

 

proba

bilies

weak equi

valent nota

tions

left recur

sion

ambi

gui

ties

allo

wed

error

reco

very

RTN

 

 

 

 

x

?

 

ATN

 

 

 

 

x

?

 

Attribute

 

 

 

 

 

 

 

Affix

 

 

 

 

 

x

 

APSG

 

 

 

 

 

x

 

APSG +unification

 

 

 

 

 

x

 

DCG (in  Prolog)

 

 

 

 

 

x

 

PCFG

 

 

x

 

 

x

 

HPSG

 

 

 

 

 

x

 

TAG

 

 

 

 

 

x

 

LFG

 

 

 

 

 

x

 

Syntax

directed

translation

schemata

 

 

 

 

 

 

 

PMR

 

x

 

 

 

 

 

Transform

x

x

 

x

x

x

x

Capri

x

 

 

 

x

x

 

 

8         Wishes

Some wishes for LingDoc Transform are:

§       Full unification of variables, like in LingDoc Capri.

§       Generalized lazy evaluation of statements.

 

9         Literature

§       Documents within LingDoc

-    Position Papers

-    User Manual Transform

-    Reference Manual Transform

§       International Standard ISO/IEC 14977  Information technology – Syntactic metalanguage – Extended BNF

§       Jeffrey Friedl, Mastering Regular Expressions (O’Reilly), ISBN 0-596-00289-0 en Pocket Reference; contains also EBNF in Perl.