LingDoc Transform
A tool for document and language engineering
User Manual
Version 3.2
Palstar
Uffelte
The Netherlands
www.lingdoc.eu
www.palstar.nl
Contents
2 Working Procedure
3 Install Transform
4 Directory structure
5 Explore the Interface
7 Syntax: EBNF grammars and Regular Expressions
7.1 The formalism of Transform in a technical nutshell
7.1.3 When to use recognition and when transduction grammars
8 A running example
9 Example 1: getting started
9.1 Source document (Ex_1.txt)
10 Syntax: extensions for output, variables and expressions
11.1.2 Messages with variables
12 Example 2: Output
12.1 Source document (Ex_2.txt, same as Ex_1.txt)
12.2 DTD (Ex_2.dtd, same as Ex_1.dtd)
12.3 Grammar (Ex_2.grm, as an extension to Ex_1.grm)
12.4 Target document (Ex_2.xml)
13 Syntax: Variables and expressions
13.2 Variables, literals and expressions
13.4.3 Combinations of tests and/or assignments
14 Example 3: Variables
14.1 Source document (Ex_3.txt)
14.4 Target document (Ex_3.xml)
15 Methodology for up-conversion: from DTD to grammar
16 Example 4: The complete conversion
17 Syntax: Extensions of context-sensitive grammars
18 Syntax: Extensions of transduction grammars
18.1.3 Ambiguity and "looping" problems in transduction grammars
18.2 More reading on transduction grammars
19 Example 5: Floating elements
19.2 Source document (Ex_5.txt, same as Ex_4.txt)
19.4 Target document (Ex_5.tdo)
20 Example 6: Two-stage conversion
21 Syntax: Cascaded grammars
22 Example 7: Cascaded grammars
24 Example 8a: the construction of a lexicon
24.2 Lexicon for the running example
24.3 Gui: compilation of a lexicon
24.4 Instructions for compiling a lexicon
25 Example 8b: conversion with a lexicon
25.3 Instructions for using the lexicon
26 Example 9: Natural language parsing
26.4 Source document (ex_9.txt)
26.7 Instructions for running the example
27 Example 10: Structure sensitive Natural Language parsing
27.3 Source document (ex_10.txt)
27.5 Lexicon (same as ex_9_lex.txt)
28 Syntax: Ambiguity and non-determinism
29 Example 11: Ambiguity
29.2 Source document (ex_11.txt)
30.1 Run Time Input and Output files
30.4 Drawing (ambiguous) parsetrees
30.5 Drawing (ambiguous) transduction output
LingDoc Transform (in short: Transform) is a general-purpose system for the analysis of natural language texts and for the conversion of documents. It is introduced in the “Management Summary of LingDoc Transform” and in a number of position papers.
This User Manual explains the working of Transform as concisely as possible and provides immediate hands-on experience. It explains the syntax and shows a number of examples in the form of exercises. You may modify the exercises and try them out on your computer. There is one running example: the analysis and conversion of an aerospace manual into XML, together with the linguistic analysis of its text. The running example gives a comprehensive overview of the capabilities for up-conversion to XML and for language engineering tasks. Along the way, hyperlinks refer to the accompanying Reference Manual, which gives a more exhaustive description of the complete system at its own pace.
This User Manual is aimed at the more experienced user who already has a basic understanding of the syntax of XML DTDs, EBNF and regular expressions. We advise the impatient reader, experienced or not, to follow this User Manual and to use the hyperlinks in case something is not clear or not precise enough. We welcome any suggestions for further improvement of these manuals.
The Transform specify-parse-adjust-convert cycle enables you to easily specify the structure of a document, parse it, adjust it and then convert it to valid XML. Here’s how it works:
1. SPECIFY GRAMMAR FOR DOCUMENTS. Obtain a number of source documents together with their target documents in XML and the document schema. Convert the schema, in a one-to-one correspondence, into a grammar in the W3C notation. The terminal symbols of the grammar are the characters in the source documents. Add regular expressions to fine-tune the grammar to the peculiarities of the source documents.
2. PARSE DOCUMENTS. Compile the grammar into a parser. Parse the documents and let errors be reported.
3. ADJUST GRAMMAR AND DOCUMENTS. Errors may be caused by flaws in the grammar and/or by errors and inconsistencies in the documents. Edit the grammar and/or the documents. The parse-adjust cycle may be repeated until no errors are reported.
4. CONVERT DOCUMENTS TO XML. Add output specifications to the grammar. The documents will be converted into XML. Alternatively, whichever is easier, specify in Transform a second type of grammar which contains rewriting patterns. Or use your own rewriting program. In the last case, you use Transform mainly as a tool for getting documents correct, using the cycle 1-2-3.
For the installation of Transform from the CD: follow the instructions in the file Install.txt.
Two directories will be available: Transform and Java. The directory Java is used only by the Gui. The directory structure of Transform is as follows:
BIN | contains the executables for the system Transform
ETC | contains a text file for error-messages, to be used by the system
Examples_more | contains additional examples, referred to by the User Manual
Examples_myself | the directory you will work with
Examples_User_Manual | initially the same as Examples_myself; in case you mess up your work you can always copy the original
GRM | contains the bootstrap grammars for the system (the formalisms used within the system are described by Transform grammars themselves)
Gui | contains the java classes for the graphical user interface
Manuals | contains the User Manual and the Reference Manual
several .bat files | for running the system from the Gui or in batch
click-for-Transform | a Windows pif file; click on it to start the Gui; you may copy this pif file to the desktop for easier use
read.me | instructions for adjusting path names; initially, there is no need to do this
Start the Gui of Transform by clicking on “click-for-Transform”. This brings you to the main menu, also called “Basic Mode”. The menu contains comments on each of the six buttons. The buttons are further explained hereafter.
The behavior of the compiler and the run time system can be adapted, to some extent, by setting options. The options are set by switches in the GUI. Nearly all switches may be specified and altered using the Advanced mode.
The switches are kept in a so-called switches file, one for the compiler and one for the run time system. The switches files have the extension .SW. Each switch has a default value; if a switch value is not specified by the user, this default value is assumed.
A switches file is maintained automatically by the GUI, but can also be altered by hand.
Some settings of switches are related. The GUI takes care of the automatic change of the setting of a switch in case the setting of a related switch is altered.
Clicking in the Main menu on “Advanced mode…” will bring you to the menu “Advanced Mode”, where you can select five tab pages. In general, you can set a number of switches for the compiler and the parser, disassemble a compiled grammar, compile a lexicon and link several grammars together. Moving the cursor over one of the switches will reveal a short explanation.
For each exercise, the values of the switches are contained in a switches file, with extension .sw.
However, for doing the exercises in this User Manual it is not necessary to use the Advanced Mode.
The Advanced Mode is explained in more detail in the Reference Manual.
You are encouraged to experiment with the examples in this User Manual. In doing so you will operate on text files. You may use your favorite text editor; a simple editor like Wordpad is sufficient.
In Windows, use “Folder Options” to associate the extensions of the relevant input and output files of Transform with your editor. The extensions are:
- dtd (this type could also be associated with an XML-editor)
- grm (the grammar file)
- tdo (the target document created by the transducer)
- txt (the source document)
- xml (the target document) (this type could also be associated with a browser like Internet Explorer)
A few more types of files are created by the compiler and the run time system. However, you do not have to inspect or alter them. (If you are interested: read more on files related to compilation and files related to the runtime system.)
The operation of the Transform system is described here. A description is given on types of files, error handling, batch operation and screen output.
The basic syntax of Transform is that of EBNF grammars, together with Regular Expressions for terminals. The symbols for EBNF are the same as for content models in XML DTDs.
A grammar consists of a set of rewrite rules, which have a left hand side (“lhs”) and a right hand side (“rhs”). The set is unordered.
The basic syntax is extended in a number of ways:
- everywhere in a rhs so-called “Actions” may be specified, which contain small programs; the programs use variables whose scope is the grammar rule. With the aid of variables one may construct the target document.
- nonterminals may have parameters, which may be the variables used in the Actions. Parameters are used to distribute components of the source document, e.g. for the construction of XML attributes or for re-ordering purposes.
If the lhs of a rule contains one symbol, the rule is called context-free; if it contains more than one symbol, the rule is of type 1 or type 0 in the Chomsky hierarchy (type 0 if the lhs is longer than the rhs).
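The classification above depends only on the number of symbols on each side of a rule. The following Python sketch makes that explicit. It assumes a rule is represented as two lists of symbol names; this representation and the function name `classify_rule` are illustrative and not part of Transform.

```python
def classify_rule(lhs, rhs):
    """Classify a rewrite rule by the lengths of its two sides.

    lhs and rhs are lists of symbol names. This representation is an
    assumption made for illustration, not Transform's internal format.
    """
    if not lhs:
        raise ValueError("a rule needs at least one symbol on its left-hand side")
    if len(lhs) == 1:
        # One symbol on the lhs, including empty rules with an absent rhs.
        return "context-free"
    # More than one symbol on the lhs: type 0 if the lhs is longer than the rhs.
    return "type-0" if len(lhs) > len(rhs) else "type-1"


print(classify_rule(["Sentence"], ["Subject", "Verb"]))
print(classify_rule(["A", "B"], ["B", "A"]))
print(classify_rule(["A", "B"], ["C"]))
```

The three calls print `context-free`, `type-1` and `type-0` respectively.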
Type-0 rules may be written as transduction rules. In that case a rhs becomes a pattern that, if located in the source document, will be rewritten according to the lhs.
If the compiler switch Transduce is set to true then a grammar will be treated as a transduction grammar.
Transduction rules behave completely differently from recognition rules, although the syntax is nearly the same. All transduction rules run in parallel. The output of a rule is again subject to the matching of all rules.
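The statement that rule output is matched again can be illustrated with plain string rewriting. The Python sketch below is not Transform's algorithm: it applies the rules one after another and repeats until nothing changes, which only approximates the parallel behaviour described above. The function name `rewrite_until_stable` and the `max_passes` guard are assumptions introduced for the example.

```python
def rewrite_until_stable(text, rules, max_passes=100):
    """Apply string rewriting rules until no rule changes the text.

    rules is a list of (pattern, replacement) pairs of plain strings.
    The sequential loop is a simplification for illustration only.
    """
    for _ in range(max_passes):
        previous = text
        for pattern, replacement in rules:
            text = text.replace(pattern, replacement)
        if text == previous:
            return text  # stable: no rule matched any more
    raise RuntimeError("rules keep rewriting their own output (looping)")


# The output of the first rule ("b" becomes "c") is matched by the second rule.
print(rewrite_until_stable("abba", [("b", "c"), ("cc", "d")]))  # ada
```

A rule whose output matches its own pattern again, such as rewriting `a` to `ab`, never reaches a stable text; the guard turns that case into an error instead of an endless run.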
Recognition rules describe precisely the sequence of characters within the source document. For each notion in the grammar its context is automatically determined by the whole grammar. The parsing of a source document with a recognition grammar may fail or succeed.
By contrast, the transduction of a source document always succeeds. This is the way XSLT and tools based on pattern-matching operate.
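The difference between the two modes can be sketched with Python's standard `re` module, used here only as an analogy. The heading format (a two-digit number, a space, a capitalised word) and the function names are hypothetical and chosen for the example.

```python
import re

# Hypothetical source format: two digits, a space, a capitalised word.
HEADING = r"([0-9]{2}) ([A-Z][a-z]+)"


def recognise_heading(text):
    """Recognition: the pattern must describe the whole text, else parsing fails."""
    return re.fullmatch(HEADING, text) is not None


def transduce_heading(text):
    """Transduction: rewrite every match; text without a match passes through."""
    return re.sub(HEADING, r"<title n='\1'>\2</title>", text)


print(recognise_heading("07 Syntax"))   # True
print(recognise_heading("7 Syntax"))    # False: an error would be reported
print(transduce_heading("07 Syntax"))   # <title n='07'>Syntax</title>
print(transduce_heading("7 Syntax"))    # 7 Syntax (unchanged, no failure)
```

Recognition reports the malformed heading, whereas transduction silently leaves it as it is. This is one reason to start a conversion with a recognition grammar.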
With transduction rules you may delete, insert, replace and reorder components of the source document within context. Sets of transduction rules may be cascaded in order to provide for a number of conversion phases within one run of Transform.
Transduction rules may contain non-terminals which are, in their turn, defined by recognition rules.
Transduction grammars can be made very powerful. To get an impression: see the examples in the directories examples_more\cijfers and examples_more\chiffres on the translation of numbers into Dutch and French alphabetic representations.
It is advised to use recognition grammars in the first stage of a conversion process. With a recognition grammar you may specify the syntax of the source document on the character level. All inconsistencies and errors will be reported. In that case either the grammar or the source document will have to be adjusted.
When recognition succeeds one may proceed in two ways. The recognition grammar may be supplemented with actions and parameters in order to produce the output document. Alternatively, one may reorganize the recognition grammar into a transduction grammar. In the case of up-conversion to XML the first strategy suffices most of the time.
In case the structure of the source document is quite different from the structure of the target XML document one may first specify a document schema according to the structure of the source document and convert according to that schema. In a second stage XSLT may be used for the conversion to the target document.
In the case of up-conversion of “floating elements” into an XML mixed content model the use of recognition and transduction grammars may be combined. Floating elements escape, by definition, a rigorous syntactic description of their position in a document. A limited transduction grammar may then be used to identify floating elements in the source document and to supply them with XML tags (and attributes). This can be done in a pre- or post-processing stage adjacent to the recognition stage. This will be demonstrated in example 5.
Not all extensions to the syntax can be used together; the allowed combinations are listed. Read more on the allowed combinations.
The meta symbols used in Transform grammars for EBNF notation are listed below, together with their names and a short description of their functions:
Delimiters and operators used in rules are:
name | meta symbol | function
rewrite sign | ::= | recognition: written between left- and right-hand side
transduction sign | <:: | transduction: written between left- and right-hand side
concatenation sign | , | concatenation of symbols (optional)
alternative sign | | | between alternatives
end of rule sign | . | end of a rule
comment signs | /* and */ | start and end comment
action brackets | { and } | between these brackets actions are notated
Regular expressions are used for the description of strings of characters as well as for the grouping of symbols, as in an XML content model. Reserved symbols for regular expressions are:
name | meta symbol | function
regular expression open | ( | open bracket of a regular expression
negated reg. expr. open | `( | open bracket for a negated regular expression
r.e. close zero or one | )? | zero or one times the enclosed part
r.e. close one | ) | one time the enclosed part
r.e. close zero or more | )* | zero or more times the enclosed part
r.e. close one or more | )+ | one or more times the enclosed part
The comma in between symbols is optional. For readability, whitespace may be added freely. Empty rules are allowed: in that case the rhs is absent. All types of recursion are allowed (left-, middle- and right-recursion).
For examples: see grammar Ex_1.grm below.
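The closing brackets `)?`, `)*` and `)+` behave like the quantifiers found in other regular expression notations. The Python sketch below uses the standard `re` module, not Transform, to show which strings each closure accepts. The negated form is left out because its exact semantics are not spelled out in this section; the helper name `matches` is an assumption.

```python
import re


def matches(pattern, text):
    """True if the whole text is described by the pattern."""
    return re.fullmatch(pattern, text) is not None


# ( 'a' 'b' )?   zero or one times the enclosed part
print(matches(r"(ab)?", ""), matches(r"(ab)?", "ab"), matches(r"(ab)?", "abab"))

# ( 'a' 'b' )*   zero or more times the enclosed part
print(matches(r"(ab)*", ""), matches(r"(ab)*", "ababab"))

# ( 'a' | 'b' )+ one or more times, with alternatives inside the brackets
print(matches(r"(a|b)+", ""), matches(r"(a|b)+", "abba"))
```

The three lines print `True True False`, `True True` and `False True`.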
All characters (including the characters that have a meta function), will be recognized by the system as terminals if they are surrounded by single quotes, for example:
Vowel ::= 'a' | 'e' | 'i' | 'o' | 'u' .
s ::= ' ' | #x09 | CR. /* space, TAB or CR */
Quote ::= '''' | '"'. /* The single quote itself must be doubled. */
The reserved symbol "cr" or "CR" matches an end of line character.
All terminal symbols may also be represented by the number representing their underlying code. A value in the (decimal) range ‘
In summary, there are alternative ways to represent a single character in a grammar:
"name characters" like "a" | "single quote" character | "end of line" character | other characters like "."
'a' | '''' | CR or cr | '.'
#x61 | #x27 | #x0D | #x2E
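The `#x` codes in the table are hexadecimal character codes. A small Python sketch shows the correspondence with the quoted characters; the function name `terminal_from_code` is illustrative and not part of Transform.

```python
def terminal_from_code(notation):
    """Return the character denoted by a '#x..' hexadecimal code."""
    if not notation.startswith("#x"):
        raise ValueError("expected a code such as #x61")
    return chr(int(notation[2:], 16))


# Print each code from the table next to the character it denotes.
for code in ("#x61", "#x27", "#x0D", "#x2E"):
    print(code, repr(terminal_from_code(code)))
```

The codes correspond to the letter a, the single quote, the carriage return and the full stop.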
The matching of characters is by default case sensitive. This may be suppressed by a compiler switch.
Sequences of terminal symbols may be given as a string between single quotes. For example:
abcString ::= 'abc' .
is another notation for:
abcString ::= 'a', 'b', 'c' .
or for:
abcString ::= 'a' 'b' 'c' .
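The equivalence between a quoted string and a sequence of single-character terminals can be sketched in Python. The helper below is hypothetical: it writes out the long form of a string, doubling the single quote as in the Quote rule shown earlier.

```python
def expand_string_terminal(chars):
    """Write a sequence of characters as single-character terminals.

    A single quote is doubled inside its quotes, following the Quote rule.
    """
    return ", ".join("''''" if ch == "'" else f"'{ch}'" for ch in chars)


print(expand_string_terminal("abc"))  # 'a', 'b', 'c'
```

Because the comma is optional in a grammar, the same sequence may also be written with spaces only.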
A range of characters is specified like in