LingDoc Transform

A tool for document and language engineering

User Manual

Version 3.2

Palstar
Uffelte
The Netherlands

www.lingdoc.eu
www.palstar.nl

 


Contents

 

1     In a nutshell
1.1     What is LingDoc Transform
1.2     About this manual
2     Working Procedure
2.1     SPECIFY
2.2     PARSE
2.3     ADJUST
2.4     CONVERT
3     Install Transform
4     Directory structure
5     Explore the Interface
5.1     Gui: Basic Mode
5.3     Gui: Select Grammar…
5.4     Gui: Compile Grammar…
5.5     Gui: Select Inputfile…
5.6     Gui: Run Grammar…
5.7     Gui: Advanced mode…
5.8     Use your favorite editor
6     Operation
7     Syntax: EBNF grammars and Regular Expressions
7.1     The formalism of Transform in a technical nutshell
7.1.1     Recognition grammars
7.1.2     Transduction grammars
7.1.3     When to use recognition and when transduction grammars
7.2     EBNF
7.2.1     Symbols
7.2.2     Characters
7.2.3     Strings
7.2.4     Ranges of characters
7.2.5     Wildcards
7.2.6     Non-terminal symbols
7.2.7     Regular expressions
7.3     More reading
8     A running example
9     Example 1: getting started
9.1     Source document (Ex_1.txt)
9.2     DTD (Ex_1.dtd)
9.3     Grammar (Ex_1.grm)
9.4     Compile and parse Ex_1
9.5     Experiment
10     Syntax: extensions for output, variables and expressions
10.1     Introduction
11     Syntax: Output
11.1     Introduction
11.1.1     Simple messages
11.1.2     Messages with variables
11.2     More reading
12     Example 2: Output
12.1     Source document (Ex_2.txt, same as Ex_1.txt)
12.2     DTD (Ex_2.dtd, same as Ex_1.dtd)
12.3     Grammar (Ex_2.grm, as an extension to Ex_1.grm)
12.4     Target document (Ex_2.xml)
13     Syntax: Variables and expressions
13.1     Introduction
13.2     Variables, literals and expressions
13.3     Assignment
13.4     Tests on expressions
13.4.1     Introduction
13.4.2     Types of tests
13.4.3     Combinations of tests and/or assignments
13.5     Parameters
13.6     More reading
14     Example 3: Variables
14.1     Source document (Ex_3.txt)
14.2     DTD (Ex_3.dtd)
14.3     Grammar (Ex_3.grm)
14.4     Target document (Ex_3.xml)
15     Methodology for up-conversion: from DTD to grammar
15.1     Introduction
15.2     SPECIFY
15.3     PARSE
15.4     ADJUST
15.5     CONVERT
16     Example 4: The complete conversion
16.1     Introduction
16.2     Source document
16.3     DTD
16.4     Target document in XML
16.5     Grammar
17     Syntax: Extensions of context-sensitive grammars
17.1     Introduction
17.2     Example
18     Syntax: Extensions of transduction grammars
18.1     Introduction
18.1.1     Cover symbols
18.1.2     Intermediate symbols
18.1.3     Ambiguity and "looping" problems in transduction grammars
18.1.4     Negation
18.2     More reading on transduction grammars
19     Example 5: Floating elements
19.1     Introduction
19.2     Source document (Ex_5.txt, same as Ex_4.txt)
19.3     Grammar (Ex_5_td.grm)
19.4     Target document (Ex_5.tdo)
20     Example 6: Two-stage conversion
21     Syntax: Cascaded grammars
21.1     Introduction
21.2     Example
22     Example 7: Cascaded grammars
22.1     Introduction
22.2     Gui: Cascade linker
22.3     Instructions
23     Syntax: lexicons
24     Example 8a: the construction of a lexicon
24.1     Maintenance of a lexicon
24.2     Lexicon for the running example
24.3     Gui: compilation of a lexicon
24.4     Instructions for compiling a lexicon
25     Example 8b: conversion with a lexicon
25.1     Source document
25.2     Grammar
25.3     Instructions for using the lexicon
26     Example 9: Natural language parsing
26.1     Introduction
26.2     The input
26.3     Switches used
26.4     Source document (ex_9.txt)
26.5     Grammar (ex_9.grm)
26.6     Lexicon (ex_9_lex.txt)
26.7     Instructions for running the example
26.8     Parsetrees
27     Example 10: Structure sensitive Natural Language parsing
27.1     Introduction
27.2     Switches used
27.3     Source document (ex_10.txt)
27.4     Grammar (ex_10.grm)
27.5     Lexicon (same as ex_9_lex.txt)
27.6     Parsetrees (ex10.out)
28     Syntax: Ambiguity and non-determinism
28.1     Introduction
29     Example 11: Ambiguity
29.1     Introduction
29.2     Source document (ex_11.txt)
29.3     Grammar (ex_11.grm)
29.4     Lexicon (ex_11_lex.txt)
29.5     Parsetrees (ex11.out)
30     Miscellaneous
30.1     Run Time Input and Output files
30.2     Error recovery
30.3     Switches
30.4     Drawing (ambiguous) parsetrees
30.5     Drawing (ambiguous) transduction output
30.6     Debugging grammars
31     Contact

 

1     In a nutshell

1.1     What is LingDoc Transform

LingDoc Transform (in short: Transform) is a general-purpose system for the analysis of natural language texts and for the conversion of documents. It is introduced in the “Management Summary of LingDoc Transform” and in a number of position papers.

1.2     About this manual

This User Manual tries to explain the working of Transform in the shortest way possible and provides immediate hands-on experience. It explains the syntax and shows a number of examples in the form of exercises. You may modify the exercises and try them out on your computer. There is one running example: the analysis and conversion of an aerospace manual into XML, together with the linguistic analysis of its text. The running example gives a comprehensive overview of the capabilities for up-conversion to XML and for language engineering tasks. Along the way references are made, by hyperlinks, to the accompanying Reference Manual. That manual gives a rather exhaustive description of the complete system at its own pace.

This User Manual is aimed at the more experienced user who already has a basic understanding of the syntax of XML DTDs, EBNF and regular expressions. We advise the impatient reader, experienced or not, to follow this User Manual and to use the hyperlinks in case something is not clear or not precise enough. We welcome any suggestions for further improvement of these manuals.

2     Working Procedure

The Transform specify-parse-adjust-convert cycle enables you to easily specify the structure of a document, parse it, adjust it and then convert it to valid XML. Here’s how it works:

 

1. SPECIFY GRAMMAR FOR DOCUMENTS

2. PARSE DOCUMENTS

3. ADJUST GRAMMAR AND DOCUMENTS

4. CONVERT DOCUMENTS TO XML

2.1     SPECIFY

Obtain a number of source documents together with their target documents in XML and the document schema. Convert the schema, in a one-to-one correspondence, into a grammar in the W3C notation. The terminal symbols of the grammar are the characters in the source documents. Add regular expressions to fine-tune the grammar to the peculiarities of the source documents.

 

2.2     PARSE

Compile the grammar into a parser. Parse the documents and let errors be reported.

 

2.3     ADJUST

Errors may be caused by flaws in the grammar and/or by errors and inconsistencies in the documents. Edit the grammar and/or the documents. The cycle of parse-adjust may be repeated until no errors are reported.

 

2.4     CONVERT

Add output specifications to the grammar, so that the documents are converted into XML. Alternatively, whichever is easier, specify in Transform a second type of grammar which contains rewriting patterns, or use your own rewriting program. In the last case, you use Transform mainly as a tool for getting the documents correct, using steps 1 to 3 of the cycle.

 

3     Install Transform

For the installation of Transform from the CD: follow the instructions in the file Install.txt.

4     Directory structure

Two directories will be available: Transform and Java. The directory Java is only used by the Gui. The directory structure of Transform is as follows:

BIN                     contains the executables for the system Transform
ETC                     contains a text file for error messages, to be used by the system
Examples_more           contains additional examples, referred to in the User Manual
Examples_myself         the directory you will work with
Examples_User_Manual    initially the same as Examples_myself; in case you mess up your work you can always copy the original
GRM                     contains the bootstrap grammars for the system (the formalisms used within the system are described by Transform grammars themselves)
Gui                     contains the Java classes for the graphical user interface
Manuals                 contains the User Manual and the Reference Manual
several .bat files      for running the system from the Gui or in batch
click-for-Transform     a Windows pif file; click on it to start the Gui; you may copy this pif file to the desktop for easier use
read.me                 instructions for adjusting path names; initially, there is no need to do this

5     Explore the Interface

 

Start the Gui of Transform by clicking on “click-for-Transform”. This brings you to the main menu, also called “Basic Mode”. The menu contains comments on each of the six buttons. The buttons are further explained below.

5.1     Gui: Basic Mode


5.3     Gui: Select Grammar…

5.4     Gui: Compile Grammar…

5.5     Gui: Select Inputfile…

5.6     Gui: Run Grammar…

5.7     Gui: Advanced mode…

The behavior of the compiler and the run time system can be adapted through options, which are set by switches in the Gui. Nearly all switches can be specified and altered in the Advanced mode.

The switches are kept in so-called switches files, one for the compiler and one for the run time system. A switches file has the extension .sw. Each switch has a default value, which is assumed if the user does not specify a value. A switches file is maintained automatically by the Gui, but can also be altered by hand.

Some switch settings are related. When you alter the setting of a switch, the Gui automatically changes the settings of the related switches.

More reading on switches.
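
The way default values combine with user-specified values can be pictured with a small Python sketch. It is an illustration only: the switch name CaseSensitive and both default values are assumptions made for this example (only the switch Transduce is named in this manual), and the sketch says nothing about the actual layout of a switches file.

```python
# Hypothetical defaults; the names and values are assumptions for this sketch.
DEFAULTS = {"Transduce": False, "CaseSensitive": True}


def effective_switches(user_settings, defaults=DEFAULTS):
    """Return the switch values in effect: user values override the defaults."""
    unknown = set(user_settings) - set(defaults)
    if unknown:
        raise KeyError("unknown switch(es): " + ", ".join(sorted(unknown)))
    merged = dict(defaults)       # start from the default of every switch
    merged.update(user_settings)  # a value specified by the user wins
    return merged


print(effective_switches({"Transduce": True}))
# {'Transduce': True, 'CaseSensitive': True}
```

A switch that the user leaves unspecified keeps its default value, which is the behavior described above.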

 

Clicking on “Advanced mode…” in the main menu brings you to the menu “Advanced Mode”, which offers five tab pages. There you can set a number of switches for the compiler and the parser, disassemble a compiled grammar, compile a lexicon and link several grammars together. Moving the cursor over one of the switches reveals a short explanation.

For each exercise, the values of the switches are contained in the switches file, with extension .sw.

For doing the exercises in this User Manual, however, it is not necessary to use the Advanced Mode.

The Advanced Mode is explained in more detail in the Reference Manual.

5.8     Use your favorite editor

You are encouraged to experiment with the examples in this User Manual. In doing so you will operate on text files. You may use your favorite text editor (a simple editor like WordPad will be sufficient).

In Windows, use “Folder Options” to associate the extensions of the relevant input and output files of Transform with your editor. The extensions are:

-         dtd (this type could also be associated with an XML-editor)

-         grm (the grammar file)

-         tdo (the target document created by the transducer)

-         txt (the source document)

-         xml (the target document) (this type could also be associated with a browser like Internet Explorer)

A few more types of files are created by the compiler and the run time system. However, you do not have to inspect or alter them. (If you are interested: read more on files related to compilation and files related to the runtime system.)

6     Operation

The operation of the Transform system is described in the Reference Manual. It describes the types of files, error handling, batch operation and screen output.

7     Syntax: EBNF grammars and Regular Expressions

7.1     The formalism of Transform in a technical nutshell

The basic syntax of Transform is that of EBNF grammars, together with Regular Expressions for terminals. The symbols for EBNF are the same as for content models in XML DTDs.

7.1.1        Recognition grammars

A grammar consists of a set of rewrite rules, which have a left hand side (“lhs”) and a right hand side (“rhs”). The set is unordered.

The basic syntax is extended in a number of ways:

-              so-called “Actions” may be specified anywhere in a rhs. Actions contain small programs; the programs use variables whose scope is the grammar rule. With the aid of variables one may construct the target document.

-              nonterminals may have parameters, which may be the variables used in the Actions. Parameters are used to distribute components of the source document, e.g. for the construction of XML attributes or for re-ordering purposes.

 

7.1.2        Transduction grammars

If the lhs of a rule contains one symbol, the rule is context-free; if it contains more than one symbol, the rule is of type 1 or type 0 in the Chomsky hierarchy (type 0 if the lhs is longer than the rhs).
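
The classification above can be written down as a small Python sketch. The function name classify_rule and the representation of a rule as two lists of symbol names are assumptions made for this illustration; they are not part of Transform.

```python
def classify_rule(lhs, rhs):
    """Classify a rewrite rule by the sizes of its left- and right-hand side.

    lhs and rhs are lists of symbol names (a hypothetical representation).
    """
    if not lhs:
        raise ValueError("a rule needs at least one symbol in its lhs")
    if len(lhs) == 1:
        return "context-free"
    if len(lhs) > len(rhs):
        return "type-0"   # lhs longer than rhs
    return "type-1"


print(classify_rule(["S"], ["a", "S", "b"]))   # context-free
print(classify_rule(["A", "B"], ["B", "A"]))   # type-1
print(classify_rule(["A", "B"], ["c"]))        # type-0
```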

Type-0 rules may be written as transduction rules. In that case a rhs becomes a pattern that, if located in the source document, will be rewritten according to the lhs.

If the compiler switch Transduce is set to true then a grammar will be treated as a transduction grammar.

Transduction rules behave completely differently from recognition rules, although the syntax is nearly the same. All transduction rules run in parallel. The output of a rule is again subject to the matching of all rules.

Recognition rules describe precisely the sequence of characters within the source document. For each notion in the grammar its context is automatically determined by the whole grammar. The parsing of a source document with a recognition grammar may fail or succeed.

In contrast, the transduction of a source document always succeeds. This is the way XSLT and other tools based on pattern matching operate.
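
The difference in behavior can be pictured with a small Python sketch of repeated rewriting. It is a simplification under stated assumptions: the function transduce and the two rules are hypothetical, the patterns are plain Python regular expressions rather than Transform rules, and the rules are tried one at a time in the listed order instead of in parallel. The sketch only shows that the output of one rule is offered to all rules again, and that input matching no rule passes through unchanged.

```python
import re


def transduce(text, rules, max_passes=100):
    """Rewrite text with (pattern, replacement) rules until no rule applies."""
    for _ in range(max_passes):
        for pattern, replacement in rules:
            new_text, count = re.subn(pattern, replacement, text, count=1)
            if count and new_text != text:
                text = new_text   # the result is offered to all rules again
                break
        else:
            return text           # no rule changed the text: finished
    raise RuntimeError("the rules did not settle (possible looping)")


rules = [("aa", "b"), ("bb", "c")]   # hypothetical rewriting patterns
print(transduce("aaaa", rules))      # c
print(transduce("xyz", rules))       # xyz
```

In the first call the two "b" symbols produced by the first rule are picked up by the second rule. The limit on the number of passes is there because a rule set whose output keeps matching its own patterns would never settle.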

With transduction rules you may delete, insert, replace and reorder components of the source document within context. Sets of transduction rules may be cascaded in order to provide for a number of conversion phases within one run of Transform.

Transduction rules may contain non-terminals which are, in their turn, defined by recognition rules.

Transduction grammars can be made very powerful. To get an impression: see the examples in the directories examples_more\cijfers and examples_more\chiffres on the translation of numbers into Dutch and French alphabetic representations.

 

7.1.3        When to use recognition and when transduction grammars

It is advised to use recognition grammars in the first stage of a conversion process. With a recognition grammar you may specify the syntax of the source document on the character level. All inconsistencies and errors will be reported. In that case either the grammar or the source document will have to be adjusted.

When recognition succeeds one may proceed in two ways. The recognition grammar may be supplemented with actions and parameters in order to produce the output document. Alternatively, one may reorganize the recognition grammar into a transduction grammar. In the case of up-conversion to XML the first strategy suffices most of the time.

In case the structure of the source document is quite different from the structure of the target XML document one may first specify a document schema according to the structure of the source document and convert according to that schema. In a second stage XSLT may be used for the conversion to the target document.

In the case of up-conversion of “floating elements” into an XML mixed content model the use of recognition and transduction grammars may be combined. Floating elements escape, by definition, a rigorous syntactic description of their position in a document. A limited transduction grammar may then be used to identify floating elements in the source document and to supply them with XML tags (and attributes). This can be done in a pre- or post-processing stage adjacent to the recognition stage. This will be demonstrated in Example 5.

 

Not all extensions to the syntax can be used together. Read more on the allowed combinations.

 

7.2     EBNF

The meta symbols used in Transform grammars for EBNF notation are listed below, together with their names and a short description of their functions:

 

Delimiters and operators used in rules are:

 

name                  meta symbol   function
rewrite sign          ::=           recognition: written between left- and right-hand side
transduction sign     <::           transduction: written between left- and right-hand side
concatenation sign    ,             concatenation of symbols (optional)
alternative sign      |             between alternatives
end of rule sign      .             end of a rule
comment signs         /* and */     start and end comment
action brackets       { and }       between these brackets actions are notated

                       

Regular expressions are used for the description of strings of characters as well as for the grouping of symbols, like in an XML content model. Reserved symbols for regular expressions are:

 

name                       meta symbol   function
regular expression open    (             open bracket of a regular expression
negated reg. expr. open    `(            open bracket for a negated regular expression
r.e. close zero or one     )?            zero or one times the enclosed part
r.e. close one             )             one time the enclosed part
r.e. close zero or more    )*            zero or more times the enclosed part
r.e. close one or more     )+            one or more times the enclosed part

 

The comma in between symbols is optional. For readability, whitespace may be added freely. Empty rules are allowed: in that case the rhs is absent. All types of recursion are allowed (left-, middle- and right-recursion).
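
For readers who know regular expressions from programming languages, the four closing brackets can be compared with the quantifiers of Python's re module. The sketch below is an analogy only: the function accepts and the table CLOSERS are made up for this illustration, and the pattern it builds is a Python regular expression, not Transform syntax.

```python
import re

# Transform closing bracket -> comparable Python quantifier (an analogy only).
CLOSERS = {")?": "?", ")": "", ")*": "*", ")+": "+"}


def accepts(closer, body, text):
    """True if text is body repeated as often as the closing bracket allows."""
    pattern = "(?:" + re.escape(body) + ")" + CLOSERS[closer]
    return re.fullmatch(pattern, text) is not None


print(accepts(")?", "ab", ""))        # True: zero times is allowed
print(accepts(")", "ab", ""))         # False: exactly one time is required
print(accepts(")*", "ab", "ababab"))  # True: zero or more times
print(accepts(")+", "ab", ""))        # False: at least one time is required
```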

For examples: see grammar Ex_1.grm below.

 

7.2.1        Symbols

7.2.2        Characters

All characters, including those that have a meta function, are recognized by the system as terminals if they are surrounded by single quotes, for example:

 

      Vowel ::=   'a' | 'e' | 'i' | 'o' | 'u' .
      s     ::=   ' '  |  #x09 | CR.      /* space, TAB or CR */
      Quote ::=   '''' | '"'.             /* The single quote itself must be doubled. */

 

The reserved symbol "cr" or "CR" matches an end of line character.

All terminal symbols may also be represented by their underlying character code. A value in the decimal range 0 to 255 represents a character from the extended ASCII set, ISO 8859-1. (In the future, Transform will support wide-range character sets.) The values have to be written in hexadecimal notation, like #x61.

In summary, there are alternative ways to represent a single character in a grammar:

 

character                      quoted or reserved notation   hexadecimal notation
"name characters" like "a"     'a'                           #x61
"single quote" character       ''''                          #x27
"end of line" character        CR or cr                      #x0D
other characters like "."      '.'                           #x2E
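
The relation between a character and its hexadecimal notation can be checked with a small Python sketch. The function names char_to_code and code_to_char are hypothetical helpers for this illustration; they assume the ISO 8859-1 range 0 to 255 described above.

```python
def char_to_code(ch):
    """Return the #x notation for a single ISO 8859-1 character."""
    if len(ch) != 1 or ord(ch) > 255:
        raise ValueError("expected one character in the range 0 to 255")
    return "#x%02X" % ord(ch)


def code_to_char(notation):
    """Return the character denoted by a #x notation such as '#x61'."""
    if not notation.startswith("#x"):
        raise ValueError("expected a notation like #x61")
    code = int(notation[2:], 16)
    if not 0 <= code <= 255:
        raise ValueError("code outside the range 0 to 255")
    return bytes([code]).decode("latin-1")


print(char_to_code("a"))           # #x61
print(char_to_code("'"))           # #x27
print(code_to_char("#x2E"))        # .
print(repr(code_to_char("#x0D")))  # '\r'
```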

 

The matching of characters is by default case sensitive. This may be suppressed by a compiler switch.

 

7.2.3        Strings

Sequences of terminal symbols may be given as a string between single quotes. For example:

      abcString   ::=   'abc' .

is another notation for:

      abcString   ::=   'a', 'b', 'c' .

or for:

      abcString   ::=   'a' 'b' 'c' .

 

7.2.4        Ranges of characters

A range of characters is specified like in