2. LingDoc and XML Up-Conversion: Position Paper

1         Introduction

In this position paper we describe one of the hardest problems of Document Engineering: the up-conversion of documents to XML. We show how the problem can be solved with LingDoc Transform.

o      Importance

§       XML encoding increases the usability of a document, be it a manual, a newspaper or a website. Up-conversion is the conversion of a non-XML document into an XML-encoded document. It can be a one-time operation (legacy conversion) or part of a continuous production process. In the latter case on-line processing may be a requirement.

§       Up-conversion is not a trivial process. For legacy documents the costs involved may be high, sometimes amounting to as much as 90% of the total cost of introducing an XML-based production system.

§       Technically, up-conversion may be defined as the conversion of a source document from some arbitrary encoding format into a valid XML instance of some target document schema, using (at least partly) automated processing techniques. The automated part is sometimes called “auto-tagging”.

§       In order to design a conversion process a mapping has to be made between the structures of the source and target documents. The assumption is that the source file is a valid instance of the target document schema, in which structure has only been “hidden” in some way and has to be “uncovered”.

§       For the vast majority of legacy documents (typesetting code, word processor formats, result of OCR processes), the structure is primarily related to visual presentation.

§       The mapping between the two structures may require extensive specifications of context and of linguistic knowledge.
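To make the notion of "hidden" structure concrete, the following is a minimal sketch and not LingDoc Transform itself. It assumes a hypothetical source format in which every line starts with a presentation code (`@BIG` for large type, `@TXT` for body text). The codes, the table `STYLE_TO_ELEMENT` and the element names `document`, `title` and `para` are illustrative assumptions, not taken from any real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from presentation codes to target element names.
STYLE_TO_ELEMENT = {"@BIG": "title", "@TXT": "para"}

def up_convert(source: str) -> str:
    """Turn presentation-coded lines into a small XML instance."""
    root = ET.Element("document")
    for line in source.splitlines():
        if not line.strip():
            continue  # skip empty lines
        code, _, text = line.partition(" ")
        # An unknown code raises KeyError; a real tool would report it instead.
        element = ET.SubElement(root, STYLE_TO_ELEMENT[code])
        element.text = text
    return ET.tostring(root, encoding="unicode")

SOURCE = "@BIG Safety instructions\n@TXT Read this manual first."
print(up_convert(SOURCE))
```

Real legacy sources rarely allow such a one-to-one mapping from presentation code to element type, which is why the mapping may need context and linguistic knowledge.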

o      Error handling

§       It is a fact of life that legacy documents contain errors and inconsistencies and that up-conversion may require human intervention to resolve semantic ambiguities. Legacy up-conversion therefore nearly always depends on human decisions to some degree, and the manual share of the work can range from 0 to 100%. In the extreme case the target XML document is tagged completely by hand, a solution that is chosen for a surprisingly large number of conversion projects.

§       The handling of errors during auto-tagging is not trivial. Sometimes errors are not detected, or their exact position cannot be reported. Error messages may be misleading. Error recovery may be difficult for the process. Finally, corrections made by hand may introduce new errors.

There are many parallels between the processes of XML up-conversion and Machine Translation. Seen in that way, XML up-conversion may be regarded as the hardest problem of Document Engineering.

2         Requirements

In our opinion, the following requirements hold for XML up-conversion.

§       The tagging of the target XML-documents shall be unambiguous and correct.

§       The conversion process must be able to handle ambiguity and undecidability. In these cases correct choices have to be presented to human decision makers.

§       The conversion process has to be robust. Errors and inconsistencies in the source document have to be detected as soon as possible. A further requirement may be that the error will be "repaired", as correctly as possible, in order to continue the conversion process.

§       All errors have to be reported, with a position that is as exact as possible.

§       Specifications for the conversion shall be understandable and adjustable in order to be re-used for other projects.

§       Up-conversion in continuous processes requires the on-line (or “streaming”) property.
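The requirements on robustness and error reporting can be illustrated with a small sketch. It assumes a hypothetical source convention in which italic text is enclosed in the markers `{i}` and `{/i}`. The function `check_and_repair` and the record `SourceError` are illustrative names. The sketch reports each inconsistency with its line and column and applies a simple repair so that processing can continue.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

class SourceError(NamedTuple):
    line: int      # 1-based line number
    column: int    # 1-based column of the offending marker
    message: str

MARKER = re.compile(r"\{(/?)i\}")  # hypothetical italic markers {i} and {/i}

def check_and_repair(line_no: int, text: str) -> Tuple[str, List[SourceError]]:
    """Report unbalanced italic markers and repair them so conversion can continue."""
    errors: List[SourceError] = []
    out: List[str] = []
    pos = 0
    open_col: Optional[int] = None  # column of the currently open {i}, if any
    for m in MARKER.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        closing = m.group(1) == "/"
        if closing and open_col is None:
            # Stray {/i}: report it and drop it.
            errors.append(SourceError(line_no, m.start() + 1,
                                      "closing marker without opening marker"))
        elif not closing and open_col is not None:
            # Second {i} inside an open span: report it and drop it.
            errors.append(SourceError(line_no, m.start() + 1,
                                      "opening marker inside open italic span"))
        else:
            out.append(m.group(0))
            open_col = None if closing else m.start() + 1
    out.append(text[pos:])
    if open_col is not None:
        # Unclosed span: report where it started and close it at end of line.
        errors.append(SourceError(line_no, open_col,
                                  "italic span not closed; closed at end of line"))
        out.append("{/i}")
    return "".join(out), errors

repaired, found = check_and_repair(7, "See {i}Figure 3 for details.")
print(repaired)
print(found)
```

Whether closing the span at the end of the line is the correct repair cannot be decided automatically; the report gives a human editor the position at which to check it.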

3         Current solutions

For the mapping between the structures of source and target documents several methods and techniques are used, implemented with

§       general programming languages,

§       systems for pattern-matching and replacement (PMR),

§       grammar based systems.

For human intervention, pre- and/or post-editing systems are used, such as in Machine Aided Human Translation (MAHT) or in Human Aided Machine Translation (HAMT).

4         Implementations with general programming languages

In the early days of up-conversion to XML, general programming languages were used. Their advantage may be that every detail of the conversion process can be programmed. The disadvantages are numerous:

§       Complexity: programming is ad hoc and time consuming; optimization is ad hoc.

§       Reliability: the handling of errors, inconsistencies, ambiguity and undecidability is problematic.

§       Adjustability and understandability are poor; some reuse is possible with program libraries.

5         Implementations with tools for Pattern Matching and Replacement (PMR)

The most frequent implementation of auto-tagging uses languages and tools for pattern matching and replacement, such as Perl and Omnimark, and their predecessors such as Snobol/Spitbol, Icon, Awk, DynaTag and FastTAG.

All languages and tools make use of pattern rules which are event-driven, based on lexical events, sometimes within some (XML) context. The rules contain regular expressions. E.g., the rules

§       Regular expression A ==> P

§       Regular expression B ==> Q

will fire independently of each other as soon as A or B is recognized.

Problems with this approach are as follows:

§       Context. PMR tools tend to restrict the programmer to a linear vision of the conversion process, and make it difficult to address cases where structure disambiguation requires look-ahead and look-behind.

§       Overlapping rules. If the regular expressions of A and B overlap, then a choice has to be made about which to fire. Most of the time the programmer has no overview of all possible occurrences in the source documents. Therefore wrong or incomplete decisions may be built in.

§       Completeness. There is no possible check on the completeness of the set of rules.

§       Errors. If the source document contains errors and/or inconsistencies (and most source documents do) then it is possible that some rules will not fire.
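The problem of overlapping rules can be sketched in a few lines. The two rules below are hypothetical, as are the tag names `section-number` and `page-number`; the function `apply_rules` is only a toy stand-in for a PMR engine that fires the first matching rule at each position.

```python
import re

# Two hypothetical PMR rules: a pattern and the tag it produces.
RULE_A = (re.compile(r"\d+\.\d+"), "section-number")  # e.g. "3.2"
RULE_B = (re.compile(r"\d+"), "page-number")          # e.g. "17"

def apply_rules(text, rules):
    """At each position, fire the first rule (in the given order) that matches."""
    out, pos = [], 0
    while pos < len(text):
        for pattern, tag in rules:
            m = pattern.match(text, pos)
            if m:
                out.append(f"<{tag}>{m.group(0)}</{tag}>")
                pos = m.end()
                break
        else:
            # No rule matched here: copy one character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)

TEXT = "see 3.2 on page 17"
print(apply_rules(TEXT, [RULE_A, RULE_B]))
print(apply_rules(TEXT, [RULE_B, RULE_A]))
```

With rule A first, `3.2` is tagged as one section number. With rule B first, it is split into two page numbers separated by a full stop. Neither ordering has any knowledge of the surrounding context, so the choice is built in by the programmer and not derived from the document.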

6         Implementations with grammar based tools

In grammar based tools the specifications are written as formal grammars; e.g. as context-free grammars, with extensions for easy reference of characters, strings and regular expressions and with extensions for pattern replacement and for output. Compared with tools for PMR there are a number of advantages:

§       PMR-rules are implicit within grammar rules.

§       Therefore context sensitivity is implicit; it is unlimited to the left and to the right of a partial expression within the grammar.

§       Therefore specifications are precise.

§       The problem of the overlap of PMR rules is nonexistent because of the unlimited context sensitivity.

§       Reporting of undecidability is implicit in the general handling of ambiguity and nondeterminism.

§       Detection of errors and inconsistencies in the source document is implicit in the general handling of parsing errors.

§       The grammar for the source document is akin to the document schema for the target document; a first version of the grammar may be obtained automatically from the document schema (under the assumption that the source document is a valid instance of the target document schema, in which the structure has been “hidden”).
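The advantages listed above can be illustrated with a toy grammar. The sketch assumes that each source line has already been reduced to one of two hypothetical lexical classes, `BOLD` or `PLAIN`, and that the grammar is: doc = section+, section = heading para+, heading = BOLD, para = BOLD or PLAIN. The code simply enumerates all readings by brute force. It is not how LingDoc Transform or any parser generator works internally; it only shows that context, ambiguity and errors all fall out of the grammar.

```python
BOLD, PLAIN = "BOLD", "PLAIN"  # hypothetical lexical classes of source lines

def parse_section(tokens, pos):
    """section -> heading para+ ; heading -> BOLD ; para -> BOLD | PLAIN.
    Yields (section, next_position) for every possible section length."""
    if pos < len(tokens) and tokens[pos] == BOLD:
        end = pos + 1
        while end < len(tokens) and tokens[end] in (BOLD, PLAIN):
            end += 1
            yield {"heading": pos, "paras": list(range(pos + 1, end))}, end

def parse_doc(tokens, pos=0):
    """doc -> section+ ; yields every reading that covers all tokens from pos."""
    for section, nxt in parse_section(tokens, pos):
        if nxt == len(tokens):
            yield [section]
        else:
            for rest in parse_doc(tokens, nxt):
                yield [section] + rest

def convert(tokens):
    """Classify the input as uniquely parsable, ambiguous or erroneous."""
    parses = list(parse_doc(tokens))
    if not parses:
        return "error", parses       # source does not match the grammar
    if len(parses) > 1:
        return "ambiguous", parses   # undecidable: present readings to a human
    return "ok", parses

print(convert([BOLD, PLAIN, BOLD]))         # the final bold line can only be a paragraph
print(convert([BOLD, PLAIN, BOLD, PLAIN]))  # one section or two: two readings
print(convert([PLAIN, BOLD, PLAIN]))        # no heading at the start: error
```

In the first input the last bold line is recognized as a paragraph only because nothing follows it. That decision depends on the context to the right, which a single pattern rule for "bold line" cannot see.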

o      Requirements

The requirements for grammar-based tools for up-conversion are implicit in the above description. They are:

§       The grammar formalism shall offer sufficient provisions for the specification of regular expressions at the character level.

§       Parsers shall not impose restrictions on the use of the formalism. They must provide for unlimited lookahead and shall handle ambiguity and errors.

o      Current solutions

Partial solutions are offered by:

§       Open source parser generators like Yacc, PRECC, Antlr and CUP.

§       Open source packages like Chaperon, which includes a tree builder for creating an XML document.

§       Commercial packages, like CambridgeDocs.

o      Problems with current solutions

§       If lookahead is limited then it is difficult to write grammars with no temporal ambiguities.

§       If lookahead is limited then error handling is difficult.

§       None of these solutions has the on-line (streaming) property.

7         Position of LingDoc Clarity

Clarity is not meant for conversion purposes.

8         Position of LingDoc Transform

LingDoc Transform uses pattern-matching rules as well as grammar rules, with nearly the same syntax. The generated programs:

§       have the unlimited look-ahead property

§       handle all kinds of ambiguity

§       have the on-line property

§       report errors as soon as possible.

Transform may be used in a one-stage or a multi-stage (pipelined) process, possibly with a document schema for each stage.

The first stage is typically used for the detection of errors and inconsistencies in the source document and for their correction.

In the second stage a correct target document is created.

Alternatively, only the first stage may be used, to detect and correct errors and inconsistencies in a document without subsequent conversion.
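The shape of such a two-stage process can be sketched with Python generators. This is an analogy under stated assumptions, not Transform syntax: the style codes `@H` and `@P`, the element names and both stage functions are hypothetical. Because each stage is a generator, a line passes through both stages before the next line is read, which mimics the on-line property.

```python
from typing import Iterable, Iterator, List
from xml.sax.saxutils import escape

def stage1_clean(lines: Iterable[str], errors: List[str]) -> Iterator[str]:
    """Stage 1: detect and correct inconsistencies, one line at a time."""
    for number, line in enumerate(lines, start=1):
        code, _, text = line.partition(" ")
        if code.upper() in ("@H", "@P"):
            if code != code.upper():
                errors.append(f"line {number}: code {code!r} normalized to {code.upper()!r}")
            yield f"{code.upper()} {text}"
        else:
            # No recognizable style code: report it and assume body text.
            errors.append(f"line {number}: missing style code, assumed '@P'")
            yield f"@P {line}"

def stage2_tag(lines: Iterable[str]) -> Iterator[str]:
    """Stage 2: convert the corrected lines to XML elements."""
    names = {"@H": "title", "@P": "para"}
    for line in lines:
        code, _, text = line.partition(" ")
        yield f"<{names[code]}>{escape(text)}</{names[code]}>"

errors: List[str] = []
source = ["@h Engine & gearbox", "Check the oil level.", "@P Replace the filter."]
output = list(stage2_tag(stage1_clean(source, errors)))
print("\n".join(output))
print("\n".join(errors))
```

Running only `stage1_clean` corresponds to the use without subsequent conversion: the corrected source and the error list are the result.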

9         Development methodology for LingDoc Transform

In Transform, writing an initial grammar for conversion is essentially the same as writing an XML document schema. The only difference is that lexical elements, rather than elements with XML markup, are described.

The assumption is that the structure of the source document is more or less the same as that of the target document. If that is not the case the best strategy is to define an intermediate XML document with the same structure as the source document. In a subsequent stage the intermediate document can be converted to the target document. This can be done, again, by Transform or by an XML transformation language, like XSLT.

The following method may be followed.

§       Determine where in the source document information appears that may play a role in the identification of markup in the target document. For example: fonts, sizes, styles, margins, headers/footers, columns, tables, formulas, sub/superscript, index marks, footnotes, sidenotes, revision bars, revision/version control, hyphenation, outliner marks, etcetera.

-    More formally, this involves an identification of “visual presentation classes”; that is, groups of text objects sharing a common set of formatting properties (quote from Chahuneau, 1988). The visual presentation classes are mapped onto XML element types from the target document schema, thus assigning XML names, attributes and entities to the identified object types. This may involve a reorganization of data and the addition of missing structures (both elements and attributes) to comply with the rules imposed by the target document schema.

§       Determine which identifying information is used more than once, without a clue for disambiguation. This includes situations in which human assistance may be required, probably in a post-editing stage. In that case markers can be generated in the target document at the spots where possible ambiguities arise.

§       Determine also in which cases there is insufficient identifying information. This also indicates a situation where human assistance may be required, probably in a pre-editing stage. In that case markers have to be added to the source document at the spots where there is insufficient identifying information.

§       After the grammar has stabilized, additional rules or attributes may be added for reordering and for generating streaming output in which markup is included.

§       This may be an iterative process, for two reasons:

-    Sufficient clues have to be found for identification in order to arrive at a stable grammar.

-    Errors and inconsistencies in the documents are found. They have to be corrected in the source document. It is advisable not to correct them in the target document, because in that case new errors may be introduced.

§       In a multistage conversion process, the correctness of the generated output of a stage may be secured by parsing it according to the input grammar for the next stage.
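The first steps of the method above can be sketched as follows. The record `TextObject`, the table `CLASS_TO_ELEMENTS`, the element names and the marker element `unresolved` are all hypothetical. The sketch groups text objects into visual presentation classes, maps each class onto candidate element types, and leaves a marker in the target where a class gives no clue for disambiguation.

```python
from collections import defaultdict
from typing import NamedTuple
from xml.sax.saxutils import escape

class TextObject(NamedTuple):
    text: str
    font: str
    size: int
    bold: bool

def presentation_classes(objects):
    """Group text objects that share the same set of formatting properties."""
    classes = defaultdict(list)
    for obj in objects:
        classes[(obj.font, obj.size, obj.bold)].append(obj.text)
    return dict(classes)

# Hypothetical mapping of presentation classes onto candidate element types.
# A class with more than one candidate is used for more than one purpose.
CLASS_TO_ELEMENTS = {
    ("Helvetica", 14, True): ["title"],
    ("Times", 10, False): ["para"],
    ("Times", 10, True): ["warning", "term"],
}

def tag(obj: TextObject) -> str:
    """Tag one text object, or mark it for post-editing if it is ambiguous."""
    candidates = CLASS_TO_ELEMENTS.get((obj.font, obj.size, obj.bold), [])
    if not candidates:
        # Insufficient identifying information: the source needs pre-editing.
        raise ValueError(f"no mapping for {obj!r}; pre-edit the source")
    text = escape(obj.text)
    if len(candidates) == 1:
        return f"<{candidates[0]}>{text}</{candidates[0]}>"
    options = " ".join(candidates)
    return f'<unresolved candidates="{options}">{text}</unresolved>'

objects = [
    TextObject("Fuel system", "Helvetica", 14, True),
    TextObject("Do not smoke.", "Times", 10, True),
    TextObject("Open the valve.", "Times", 10, False),
]
print(presentation_classes(objects))
print([tag(obj) for obj in objects])
```

In a grammar-based approach many of these markers can be avoided, because the surrounding context may decide between `warning` and `term`. The table lookup shown here uses no context at all.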

10    Literature

o      Documents within LingDoc

§       Position Papers

o      Ref user manual Transform

§       With a running example of up-conversion of an aerospace manual.

§      Van der Steen, G.J., “Methods and Tools for Automatic Mark-up of Documents”, Conference “International MarkUp’89” organized by the Graphic Communications Association, Gmunden, 1989