
Document Complexity in XML Conversion

During the conversion process, Exegenix software uses its built-in knowledge of typographic principles to identify constructs such as sections, paragraphs, quotes, lists, tables, and footnotes, and applies a variety of techniques across the entire document to form a complete, cohesive internal representation of its structure.

Because Exegenix Conversion Solutions takes an automated approach, it produces the best results on documents that use common graphical objects, traditional ways of formatting those objects, and regular, repeating patterns.

Three broad document categories indicate the increasing difficulty of conversion:

Conventional
A set of documents formatted in a consistent style using common typographical conventions. An example would be several volumes of an encyclopedia.
Idiosyncratic
A single document formatted in an "internally consistent" style that may use less common typographical conventions. A book that was individually typeset and not published as part of a series might fall into this category.
Chaotic
A small number of pages formatted in a style that may not be internally consistent, and that use unconventional typography. Design-heavy publications such as “Wired” magazine are an example of this.

Note that these categories do not refer to the traditional concept of a document’s "complexity". For example, a document that has many levels of nested lists can be converted as readily as one that has only a single level, provided that the lists are formatted in regular ways.

In fact, estimating the ease of conversion is an art rather than a science, and Exegenix uses a combination of the following guidelines on text flow, graphical subtlety, and other considerations:

Text flow

Text flow describes the path that the eye takes when reading the information on the page, and can be characterized as:

Smooth
A page in which the text flows in a straight line downward from the top to the bottom of the page (in one or more columns). This is typical of conventional documents.
Lateral
A page in which the text flows both top-to-bottom and left-to-right. This could happen, for example, because a one-column layout switches to two columns partway down the page, or because information is organized in an explicit or implicit table grid. Idiosyncratic documents may have this characteristic.
Turbulent
A page in which the text flow cannot be described by a single path. This could happen if there are sidebars or pull-quotes that are offset from the main text flow; if a page contains parts of two or more logical text flows; or if text flows around images. Turbulent text flow is typical of chaotic documents.

Difficulties in correctly identifying and understanding the text flow in the original document can result in XML output that contains out-of-order text or elements, requiring post-conversion processing.
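To illustrate how out-of-order text can arise, here is a minimal sketch in Python. It is not Exegenix's method: the `Block` type, the coordinates, and the `column_split` parameter are all hypothetical, chosen only to show the idea. A reader that sorts text blocks strictly top-to-bottom interleaves the two columns of a smooth two-column page, while a reader that knows where the column boundary lies recovers the intended order.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical text block positioned on a page."""
    text: str
    x: float  # left edge of the block
    y: float  # top edge of the block, increasing downward


def naive_order(blocks):
    # Sort strictly top-to-bottom, then left-to-right.
    # On a multi-column page this interleaves the columns.
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, column_split):
    # Read the whole left column first, then the whole right column.
    # column_split is an assumed, already-known x position of the gutter.
    left = sorted((b for b in blocks if b.x < column_split), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= column_split), key=lambda b: b.y)
    return [b.text for b in left + right]


# A two-column page: column A on the left, column B on the right.
page = [
    Block("A1", x=50, y=100),
    Block("B1", x=320, y=100),
    Block("A2", x=50, y=200),
    Block("B2", x=320, y=200),
]

print(naive_order(page))        # columns interleaved
print(column_order(page, 300))  # left column, then right column
```

In practice the hard part is the step this sketch assumes away: working out where the columns, sidebars, and pull-quotes actually are, which is why lateral and turbulent flows are harder to convert than smooth ones.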

Graphical subtlety

Graphical subtlety describes the ease with which text and graphical objects can be distinguished within the original document, and the degree to which graphics are used to convey textual information. It can be characterized as:

Clear
No textual information is conveyed using graphical objects. This is typical in conventional documents.
Mixed use
Some textual information is contained in or conveyed by graphical objects. Examples include a diagram containing text labels, text displayed in an image, a header contained in an image, rules used as dividers, and graphical ornamentation. Idiosyncratic documents may have these characteristics.
Overlap
Text and graphical objects are painted on the same areas of the page. Examples include a complex graphical page background, or an "underlaid" capital in which a letter is drawn in grey underneath the text. Overlapping text and graphics are typical of chaotic documents.

Difficulties in distinguishing between text and graphics in the original document can result in XML output that contains text captured as a graphic, or textual parts of a graphic captured separately, requiring post-conversion processing.
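The two scales above can be combined into a rough first estimate of the document category. The following Python sketch rests on an assumption that the text does not state: that a document is only as easy to convert as its worst characteristic. The enum and function names are hypothetical, and since estimating the ease of conversion is an art rather than a science, this should be read as a rule of thumb and not as a description of how Exegenix classifies documents.

```python
from enum import IntEnum


class TextFlow(IntEnum):
    SMOOTH = 0
    LATERAL = 1
    TURBULENT = 2


class GraphicalSubtlety(IntEnum):
    CLEAR = 0
    MIXED_USE = 1
    OVERLAP = 2


class Category(IntEnum):
    CONVENTIONAL = 0
    IDIOSYNCRATIC = 1
    CHAOTIC = 2


def estimate_category(flow, subtlety):
    # Assumption: the harder of the two characteristics decides the category.
    return Category(max(int(flow), int(subtlety)))


# A page with smooth text flow but text labels inside diagrams.
print(estimate_category(TextFlow.SMOOTH, GraphicalSubtlety.MIXED_USE).name)
```

The ordering of each enum mirrors the typical associations given above: smooth and clear with conventional documents, lateral and mixed use with idiosyncratic ones, turbulent and overlap with chaotic ones.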

Other considerations

Other considerations that help determine the level of conversion difficulty include:

Frequency of images
Images tend to break the text flow, convey non-parsable information, and require manual design work.
Mathematics
Complex math can be handled by Exegenix in a variety of ways, depending on customer requirements.
Ratio of white/dark space
Text-heavy documents are easier to convert and less likely to have unconventional text flows.
Semantic structure
Any structure that is conveyed by content alone cannot be detected, although it may be possible to recover it in post-conversion processing.
Vertical markets
Exegenix Conversion Solutions deals best with widely observed formatting conventions that span disciplines. Certain types of very specialized documents will initially fall into the idiosyncratic category but will move into the conventional category as our development team incorporates their conventions.

It’s also important to note that the easier a document is to convert, the more richly structured its output can be: easily converted documents can be rendered into richly structured XML or SGML; more difficult documents may be rendered into HTML that captures the formatting but not the structure of the original document; and the most difficult documents can be rendered sensibly only into plain text.
