Conversion Challenges

Human labor is the most expensive part of most XML conversion processes. The main factors that determine how much human intervention a conversion requires, and hence its cost, are:

Complexity of source material

Source documents fall into three broad categories, listed here in order of increasing conversion difficulty, and therefore of increasing human intervention and cost:

  • Conventional: A set of documents formatted in a consistent style using common typographical conventions. An example would be several volumes of an encyclopaedia.

  • Idiosyncratic: A single document formatted in an "internally consistent" style that may use less common typographical conventions. A book that was individually typeset and not published as part of a series might fall into this category.

  • Chaotic: A small number of pages formatted in a style that may not be internally consistent and that uses unconventional typography. Design-heavy publications such as "Wired" magazine are examples.

For additional information on this subject, see “Document Complexity and XML Conversion”.

Consistency of source material

Content authors and contributors typically use a variety of tools (word processors, desktop publishing applications, presentation software, etc.), so information is created and stored in many different file formats. Even within a single tool, the same visual result can be produced in several ways, which complicates the conversion process.

For example, a user can construct a simple indented bulleted list item by:

  • Inserting a number of spaces using the space bar, and then inserting a bullet character
  • Inserting a tab, and then a bullet character
  • Doing either of the above, but instead of entering a bullet character, entering a period, increasing its font size, and superscripting it
  • Clicking on a word processor's "Bulleted list" tool

For each of these methods, a different set of typographical codes will be generated.

In addition, different word processors and typesetting applications emit different codes for the same user actions. Rule-based conversion software therefore needs a programmer to account for every input variation; otherwise, low-quality XML will result.
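The dependence on one rule per input variation can be illustrated with a minimal Python sketch. It assumes the source has already been reduced to plain text lines; the rule names, the patterns, and the `<item>` element are hypothetical and are not taken from any particular conversion product.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical rules: each pattern covers one way an author might have
# typed an indented bulleted item. A new input variation needs a new rule.
BULLET_RULES = [
    ("spaces + bullet", re.compile(r"^ {2,}•\s+(?P<text>.+)$")),
    ("tab + bullet", re.compile(r"^\t+•\s+(?P<text>.+)$")),
]


def to_list_item(line):
    """Return (rule_name, xml) for the first matching rule, else None."""
    for name, pattern in BULLET_RULES:
        match = pattern.match(line)
        if match:
            return name, "<item>" + escape(match.group("text")) + "</item>"
    return None


if __name__ == "__main__":
    samples = ["    • First point", "\t• First point", "    . First point"]
    for sample in samples:
        print(repr(sample), "->", to_list_item(sample))
```

The third sample stands in for the enlarged, superscripted period: no rule recognizes it, so it falls through unconverted until someone writes a rule for it. In a real source file that variation is distinguished by font size and superscript codes rather than by the character alone, so the rule would have to inspect formatting information that this plain-text sketch does not model.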

See our paper “There are no unstructured documents” for more details.


Type of markup required

The intended application of the XML output dictates the type of markup required. The more sophisticated the markup, the more versatile it is, and the more human effort its production requires. Three distinct and increasingly useful levels of markup are:

  • Formatting-based markup: supports only simple text search and/or minimal republishing requirements
  • Structural markup: supports context-sensitive search and indexing, single source publishing, content syndication
  • Semantic markup: supports vertical content indexing applications, including topic maps, and compliance with government regulations

For more information, see “Classes of XML Output”.
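
As a rough illustration of the three levels, the Python sketch below marks up the same sentence three ways and shows which queries each version can answer. The sample content and every element name (`b`, `section`, `title`, `dosage`, `drug`, `amount`, and so on) are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content marked up at each of the three levels.
FORMATTING = "<doc><b>Dosage</b><p><i>Aspirin</i>: 75 mg daily</p></doc>"
STRUCTURAL = (
    "<doc><section><title>Dosage</title>"
    "<para>Aspirin: 75 mg daily</para></section></doc>"
)
SEMANTIC = (
    "<doc><dosage><drug>Aspirin</drug>"
    "<amount unit='mg'>75</amount>"
    "<frequency>daily</frequency></dosage></doc>"
)


def full_text(xml):
    """Simple text search works at every level: just concatenate the text."""
    return "".join(ET.fromstring(xml).itertext())


def section_titles(xml):
    """Context-sensitive queries need structural elements such as <title>."""
    return [title.text for title in ET.fromstring(xml).iter("title")]


def drug_amounts(xml):
    """Domain-specific queries need semantic elements such as <drug>."""
    results = []
    for dosage in ET.fromstring(xml).iter("dosage"):
        amount = dosage.find("amount")
        results.append((dosage.findtext("drug"), amount.text, amount.get("unit")))
    return results


if __name__ == "__main__":
    print("Aspirin" in full_text(FORMATTING))
    print(section_titles(FORMATTING), section_titles(STRUCTURAL))
    print(drug_amounts(STRUCTURAL), drug_amounts(SEMANTIC))
```

Only the semantic version can answer "which drug, at what amount?" without re-parsing prose, and producing it requires someone (or some rule) to decide that "Aspirin" is a drug and "75 mg" is an amount, which is where the extra human effort goes.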