2. LingDoc and XML Up-Conversion: Position Paper
1 Introduction
In this position paper we describe one of
the hardest problems of Document Engineering: the up-conversion of
documents to XML. We show how the problem can be solved with LingDoc Transform.
o
Importance
§
XML encoding leverages the
usability of a document, be it a manual, a newspaper or a website.
Up-conversion is the conversion of a non-XML document into an
XML-encoded document. This process can be one-time (legacy) or part of a
continuous production process. In the latter case on-line processing may be a
requirement.
§
Up-conversion is not a trivial
process. For legacy documents the costs involved may be high, sometimes
amounting to as much as 90% of the total cost of introducing an XML-based
production system.
§
Technically, up-conversion may
be defined as the conversion of a source document from some arbitrary encoding
format into a valid XML instance of some target document schema, using (at
least partly) automated processing techniques. The automated part is sometimes
called “auto-tagging”.
§
In order to design a conversion
process a mapping has to be made between the structures of the source and
target documents. The assumption is that the source file is a valid instance of
the target document schema, in which structure has only been “hidden” in some
way and has to be “uncovered”.
§
For the vast majority of legacy
documents (typesetting code, word processor formats, result of OCR processes),
the structure is primarily related to visual presentation.
§
The mapping between the two
structures may require extensive specifications of context and of linguistic
knowledge.
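As a minimal illustration of "uncovering" hidden structure, the following Python sketch maps a visual convention onto XML elements. The convention (a line in capitals is a section title, any other non-empty line is a paragraph) and the element names `doc`, `section`, `title` and `para` are assumptions made for this example only; real source formats are far less regular.

```python
from xml.sax.saxutils import escape


def up_convert(lines):
    """Map visually encoded lines onto XML elements.

    Assumed (hypothetical) source convention: a line written entirely in
    capitals is a section title; any other non-empty line is a paragraph.
    """
    out = ["<doc>"]
    in_section = False
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines carry no structure in this sketch
        if text.isupper():
            if in_section:
                out.append("</section>")
            out.append("<section><title>%s</title>" % escape(text.title()))
            in_section = True
        else:
            if not in_section:
                # Paragraph before any title: open an untitled section.
                out.append("<section>")
                in_section = True
            out.append("<para>%s</para>" % escape(text))
    if in_section:
        out.append("</section>")
    out.append("</doc>")
    return "".join(out)
```

Even this toy converter has to make a decision the source never states explicitly, namely what to do with a paragraph that precedes the first title.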
o
Error handling
§
It is a fact of life that
legacy documents contain errors and inconsistencies and that up-conversion may
require human intervention for the resolution of semantic ambiguities.
Therefore, legacy up-conversion nearly always requires some problems to be
solved by human decision, with the share of work done by hand ranging from 0 to
100%. At 100% the target XML document is completely tagged by hand, a solution
that is chosen for a surprisingly large number of conversion projects.
§
The handling of errors during
auto-tagging is not trivial. Sometimes errors are not detected, or their exact
position cannot be reported. Error messages may be misleading. Error recovery
may be difficult for the conversion process. Finally, corrections made by hand
may introduce new errors.
There are many parallels between the
processes of XML up-conversion and Machine Translation. Seen in that way, XML
up-conversion may be regarded as the hardest problem of Document Engineering.
2 Requirements
In our
opinion, the following requirements hold for XML up-conversion.
§
The tagging of the target
XML-documents shall be unambiguous and correct.
§
The conversion process must be
able to handle ambiguity and undecidability. In these cases correct choices
have to be presented to human decision makers.
§
The conversion process has to
be robust. Errors and inconsistencies in the source document have to be detected
as soon as possible. A further requirement may be that the error will be
"repaired", as correctly as possible, in order to continue the
conversion process.
§
All errors have to be reported,
with the position of the error given as exactly as possible.
§
Specifications for the
conversion shall be understandable and adjustable in order to be re-used for
other projects.
§
Up-conversion in continuous
processes requires the on-line (or “streaming”) property.
3 Current solutions
For the
mapping between the structures of source and target documents several methods
and techniques are used, implemented by
§
general programming languages,
§
systems for pattern-matching
and replacement (PMR),
§
grammar based systems.
For human
intervention, pre- and/or post-editing systems are used, such as in Machine
Aided Human Translation (MAHT) or in Human Aided Machine Translation (HAMT).
4 Implementations with general programming languages
In the
early days of up-conversion to XML, general programming languages were used.
Their advantage is that every detail of the conversion process can be
programmed. The disadvantages are numerous:
§
Complexity: programming is ad
hoc and time consuming; optimization is ad hoc.
§
Reliability: the handling of
errors, inconsistencies, ambiguity and undecidability is problematic.
§
Adjustability and
understandability are poor; some reuse is possible with program libraries.
5 Implementations with tools for Pattern Matching and Replacement (PMR)
Auto-tagging is most often implemented
with languages and tools for pattern matching and replacement,
such as Perl and Omnimark, and their predecessors such as Snobol/Spitbol, Icon, Awk,
DynaTag and FastTAG.
All languages and tools make use of
pattern rules which are event-driven, based on lexical events, sometimes within
some (XML) context. The rules contain regular expressions. E.g., the rules
§
Regular expression A ==> P
§
Regular expression B ==> Q
will fire independently of each other as
soon as A or B is recognized.
Problems with this approach are as
follows:
§
Context. PMR tools tend to
restrict the programmer to a linear vision of the conversion process, and make
it difficult to address cases where structure disambiguation requires
look-ahead and look-behind.
§
Overlapping rules. If the
regular expressions of A and B overlap, then a choice has to be made about
which to fire. Most of the time the programmer has no overview of all possible
occurrences in the source documents. Therefore wrong or incomplete decisions
may be built in.
§
Completeness. There is no
way to check the set of rules for completeness.
§
Errors. If the source document
contains errors and/or inconsistencies (and most do) then
some rules may not fire.
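The overlap and completeness problems can be made concrete with a small Python sketch. The two rules and their patterns are hypothetical and stand in for the rules A ==> P and B ==> Q above; they are not taken from any of the tools mentioned.

```python
import re

# Two hypothetical PMR rules with overlapping patterns: every line that
# matches the "heading" pattern also matches the "item" pattern.
RULES = [
    ("item", re.compile(r"^\d+\.\s+(.*)$")),
    ("heading", re.compile(r"^\d+\.\s+([A-Z][A-Za-z ]*)$")),
]


def tag_line(line, rules=RULES):
    """Fire the first rule whose pattern matches; later rules are shadowed."""
    for name, pattern in rules:
        match = pattern.match(line)
        if match:
            return "<%s>%s</%s>" % (name, match.group(1), name)
    return None  # no rule fired: the line is silently left untagged


def matching_rules(line, rules=RULES):
    """List every rule that could fire on this line, to expose overlaps."""
    return [name for name, pattern in rules if pattern.match(line)]
```

With the rules in this order, `tag_line("1. Introduction")` produces an `item`; with the order reversed it produces a `heading`. Neither order is right for all documents, because the correct choice depends on what surrounds the line. A line such as `"1) Introduction"`, which a human would read as the same construct with a typing inconsistency, matches no rule at all and passes through without any report.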
6 Implementations with grammar based tools
In grammar
based tools the specifications are written as formal grammars, e.g. as
context-free grammars, with extensions for easy reference to characters,
strings and regular expressions and with extensions for pattern replacement and
for output. Compared with tools for PMR there are a number of advantages:
§
PMR-rules are implicit within
grammar rules.
§
Therefore context sensitivity
is implicit; it is unlimited to the left and to the right of a partial
expression within the grammar.
§
Therefore specifications are
precise.
§
The problem of the overlap of
PMR rules is nonexistent because of the unlimited context sensitivity.
§
Reporting of undecidability is
implicit in the general handling of ambiguity and nondeterminism.
§
Detection of errors and
inconsistencies in the source document is implicit in the general handling of
parsing errors.
§
The grammar for the source
document is akin to the document schema for the target document; a first
version of the grammar may be obtained automatically from the document schema
(under the assumption that the source document is a valid instance of the
target document schema, in which the structure has been “hidden”).
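The role of context can be sketched with a small backtracking parser in Python. The grammar below is a toy assumption for this illustration (a numbered line is a section heading when paragraphs follow it and a list item otherwise); it is not the notation of any of the tools discussed, and the function names are hypothetical. The parser enumerates all parses, so that "no parse" signals an error in the source and "more than one parse" signals an ambiguity.

```python
import re

NUMBERED = re.compile(r"^\d+\.\s+(.*)$")


def tokenize(lines):
    """Classify each non-empty line as NUM (numbered) or TEXT (plain)."""
    tokens = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        match = NUMBERED.match(text)
        tokens.append(("NUM", match.group(1)) if match else ("TEXT", text))
    return tokens


def run(tokens, pos, kind):
    """Return every end index e > pos such that tokens[pos:e] are all `kind`."""
    ends = []
    while pos < len(tokens) and tokens[pos][0] == kind:
        pos += 1
        ends.append(pos)
    return ends


def sections(tokens, pos=0):
    """Yield every way to parse tokens[pos:] as a sequence of sections.

    Assumed grammar:  section := NUM TEXT+ NUM*
    (a numbered heading, one or more paragraphs, then optional list items).
    """
    if pos == len(tokens):
        yield []
        return
    if tokens[pos][0] != "NUM":
        return  # a section cannot start with plain text
    for text_end in run(tokens, pos + 1, "TEXT"):
        for list_end in [text_end] + run(tokens, text_end, "NUM"):
            section = (
                tokens[pos][1],
                [text for _, text in tokens[pos + 1:text_end]],
                [text for _, text in tokens[text_end:list_end]],
            )
            # Backtracking: a choice survives only if the rest parses too.
            for rest in sections(tokens, list_end):
                yield [section] + rest


def parse(lines):
    """Return the unique parse; raise if there is none or more than one."""
    parses = list(sections(tokenize(lines)))
    if len(parses) != 1:
        raise ValueError("%d parses found" % len(parses))
    return parses[0]
```

For the input `1. Tools / You need the following. / 1. Wrench / 2. Gloves / 2. Procedure / Drain the oil.` the parser decides that "Wrench" and "Gloves" are list items and "Procedure" is a heading purely from what follows each numbered line. No ordering of independent pattern rules can express this. Exhaustive enumeration is chosen here for brevity, not efficiency.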
o
Requirements
The
requirements for grammar-based tools for up-conversion are implicit in the
above description. They are:
§
The grammar formalism shall
offer sufficient provisions for the specification of regular expressions at the
character level.
§
Parsers shall not impose
restrictions on the use of the formalism. They must provide for unlimited lookahead
and shall handle ambiguity and errors.
o
Current solutions
Partial
solutions are offered by:
§
Open source parser generators
like Yacc, PRECC, Antlr and CUP.
§
Open source packages like
Chaperon which includes a tree builder in order to create an XML document.
§
Commercial packages, like
CambridgeDocs.
o
Problems with current solutions
§
If lookahead is limited then it
is difficult to write grammars with no temporal ambiguities.
§
If lookahead is limited then
error handling is difficult.
§
None of these solutions has the on-line
(streaming) property.
7 Position of LingDoc Clarity
Clarity is not meant for conversion
purposes.
8 Position of LingDoc Transform
LingDoc Transform uses pattern-matching
rules as well as grammar rules, with nearly the same syntax. The generated
programs:
§
have the unlimited look-ahead
property
§
handle all kinds of ambiguity
§
have the on-line property
§
report errors as soon as
possible.
Transform may be used in a one-stage or
multi-stage (pipelined) process, optionally with a document schema for each stage.
The first stage is typically used for the
detection of errors and inconsistencies in the source document and for their
correction.
In the second stage a correct target
document is created.
Alternatively, only the first stage may
be used, for the detection and correction of errors and
inconsistencies in a document, without subsequent conversion.
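The two-stage arrangement can be sketched in Python with generators, which also gives the pipeline a simple on-line character: each line is checked and converted before the next one is read. This is an illustration of the idea only and does not show how Transform itself works; the `STEP n: text` source convention, the function names and the element names are assumptions.

```python
from xml.sax.saxutils import escape


def stage_check(lines, errors):
    """Stage 1: detect inconsistencies, record them, and repair them.

    Hypothetical rule: step lines look like 'STEP n: text' and must be
    numbered consecutively from 1. Each error is recorded with its line number.
    """
    expected = 1
    for number, line in enumerate(lines, start=1):
        if line.startswith("STEP "):
            label, _, text = line.partition(":")
            found = label[5:].strip()
            if found != str(expected):
                errors.append(
                    (number, "expected step %d, found %r" % (expected, found))
                )
                line = "STEP %d:%s" % (expected, text)  # repair and continue
            expected += 1
        yield line


def stage_convert(lines):
    """Stage 2: turn the checked lines into XML, one element at a time."""
    yield "<procedure>"
    for line in lines:
        if line.startswith("STEP "):
            label, _, text = line.partition(":")
            yield '<step n="%s">%s</step>' % (
                label[5:].strip(),
                escape(text.strip()),
            )
        else:
            yield "<note>%s</note>" % escape(line.strip())
    yield "</procedure>"
```

Running `list(stage_check(source, errors))` alone corresponds to using only the first stage: the document is checked and repaired, and the error list is available for review, without any conversion taking place.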
9 Development methodology for LingDoc Transform
In Transform, writing an initial grammar
for conversion is essentially the same process as writing
an XML document schema. The only difference is that lexical elements, rather
than elements with XML markup, are described.
The assumption is that the structure of
the source document is more or less the same as the target document. If that is
not the case the best strategy is to define an intermediate XML document with
the same structure as the source document. In a subsequent stage the
intermediate document can be converted to the target document. This can be
done, again, by Transform or by an XML transformation language, like XSLT.
The following method may
be followed.
§
Determine where in the source
document information appears that may play a role in the identification of
markup in the target document. For example: fonts, sizes, styles, margins,
headers/footers, columns, tables, formulas, sub/superscript, index marks,
footnotes, sidenotes, revision bars, revision/version control, hyphenation,
outliner marks, etcetera.
-
More formally, this involves an
identification of “visual presentation classes”; that is, groups of text
objects sharing a common set of formatting properties (Chahuneau,
1988). A mapping is made of visual presentation classes onto XML element types
from the target document schema, thus putting XML names, attributes and
entities upon identified object types. This involves a possible reorganization
of data and addition of missing structures (both elements and attributes) to
comply with the rules imposed by the target document schema.
§
Determine which identifying information
is used more than once, without a clue for disambiguation. This includes
situations in which human assistance may be required, probably in a
post-editing stage. In that case markers can be generated in the target
document at the spots where possible ambiguities arise.
§
Determine also in which cases
there is insufficient identifying information. This also indicates a situation
where human assistance may be required, probably in a pre-editing stage. In
that case markers have to be added to the source document at the spots where
there is insufficient identifying information.
§
After the grammar is stabilized
additional rules or attributes may be added for reordering and for generating
streaming output in which markup is included.
§
This may be an iterative
process, for two reasons:
§
sufficient clues have to be
found for identification in order to arrive at a stable grammar
§
errors and inconsistencies in
the documents are found. They have to be corrected in the source document. It
is advisable not to correct them in the target document, because in that case
new errors may be introduced.
§
In a multistage conversion
process, the correctness of the generated output of a stage may be secured by
parsing it according to the input grammar for the next stage.
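The first steps of the method can be sketched as a table that maps visual presentation classes onto element types. In this Python sketch the classes are (font, size) pairs, and the fonts, sizes and element names are all hypothetical. A class that maps to more than one element type is ambiguous and receives a marker for post-editing; a class that is missing from the table is reported as a spot where the source needs pre-editing.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping of visual presentation classes onto element types
# of an assumed target schema. More than one candidate means ambiguity.
CLASS_TO_ELEMENTS = {
    ("Helvetica-Bold", 14): ["title"],
    ("Times", 10): ["para"],
    ("Times-Italic", 10): ["emph", "term"],
}


def convert_runs(runs):
    """Tag a sequence of (font, size, text) runs.

    Returns the tagged fragments plus the indices of runs whose class is
    unknown, i.e. spots with insufficient identifying information.
    """
    fragments, needs_pre_edit = [], []
    for index, (font, size, text) in enumerate(runs):
        candidates = CLASS_TO_ELEMENTS.get((font, size))
        if candidates is None:
            needs_pre_edit.append(index)
            continue
        element = candidates[0]  # provisional choice: first candidate
        tagged = "<%s>%s</%s>" % (element, escape(text), element)
        if len(candidates) > 1:
            # Ambiguous class: leave a marker for the post-editing stage.
            tagged = "<?post-edit %s?>%s" % ("|".join(candidates), tagged)
        fragments.append(tagged)
    return fragments, needs_pre_edit
```

The marker is written as an XML processing instruction so that the output stays well-formed while the open decision remains visible to a post-editor; other marker conventions would serve equally well.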
10 Literature
o
Documents within LingDoc
§
Position Papers
§
With a running example of
up-conversion of an aerospace manual.
§ Van der Steen, G.J., “Methods and Tools for Automatic Mark-up of
Documents”, Conference “International MarkUp’89” organized by the Graphic
Communications Association, Gmunden, 1989