Classes of XML Output

There are three general classes of XML output generated by conversion tools. In increasing order of utility, they are:

  • Formatting-Based
  • Structural
  • Semantic

Formatting-based markup

Just as HTML is designed to deliver documents to Web browsers, formatting-based XML markup is designed to deliver content that looks a specific way in a specific output format, for example, 8½" x 11" portrait. Change the paper size and you change the page breaks, text flow, and so on, which changes the markup substantially.

This type of markup says nothing about the content itself, only how it should look when published. Because the markup is closely tied to a specific authoring application or output format, it is poorly suited to publishing to other devices or to indexing for context-sensitive searching.

Furthermore, if the formatting codes are inconsistently ordered, so that "bold italic" sometimes appears as "italic bold", the markup is very difficult to post-process into more useful forms.
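The ordering problem can be illustrated with a short Python sketch. It is not taken from any conversion tool: the tag names `b`, `i`, and `u`, the `FORMAT_TAGS` set, and the `formatting_runs` helper are all assumptions made for this example. The idea is that two fragments whose formatting tags are nested in a different order differ as markup, yet reduce to the same runs of text once the active formatting is treated as an unordered set.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of inline formatting tags; real tools emit many more.
FORMAT_TAGS = {"b", "i", "u"}

def formatting_runs(xml_text):
    """Return a list of (text, set of active formatting tags) pairs."""
    root = ET.fromstring(xml_text)
    runs = []

    def walk(elem, active):
        # Entering a formatting tag adds it to the active set.
        if elem.tag in FORMAT_TAGS:
            active = active | {elem.tag}
        if elem.text:
            runs.append((elem.text, frozenset(active)))
        for child in elem:
            walk(child, active)
            # Tail text follows the child, so it uses the parent's formatting.
            if child.tail:
                runs.append((child.tail, frozenset(active)))

    walk(root, frozenset())
    return runs

bold_italic = "<p><b><i>Note</i></b> text</p>"
italic_bold = "<p><i><b>Note</b></i> text</p>"

# The raw markup differs, but the normalized runs are identical.
print(bold_italic == italic_bold)
print(formatting_runs(bold_italic) == formatting_runs(italic_bold))
```

A post-processor that compares the normalized runs rather than the raw tag sequence is not affected by this particular inconsistency, although it still knows nothing about what the text means.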

This is the type of markup generated by all off-the-shelf XML conversion tools, as well as "Save as XML" from word processors or desktop publishing tools.

Example:

<font name="Arial" size="20pt"><b>Formatting Based Markup: Is it useful in a practical way, or just the Status Quo?</b></font>

<font name="TimesNewRoman" size="12pt">Most conversion technologies leverage specific combinations of formatting codes to produce their output, or simply dump the formatting codes discovered in an XML-like syntax.</font>

<font name="Arial" size="10pt">Formatting codes are rarely consistently applied by authors, nor consistently encoded in data files by authoring applications.</font>

<font name="TimesNewRoman" size="12pt">Output from conversion technologies that deliver or rely on formatting-based code is therefore inconsistent. Where consistency of data is important to long-term management and republishing of content, formatting-based markup is simply not good enough.</font>

Formatting Based Markup: Is it useful in a practical way, or just the Status Quo?

Most conversion technologies leverage specific combinations of formatting codes to produce their output, or simply dump the formatting codes discovered in an XML-like syntax.

Formatting codes are rarely consistently applied by authors, nor consistently encoded in data files by authoring applications.

Output from conversion technologies that deliver or rely on formatting-based code is therefore inconsistent. Where consistency of data is important to long-term management and republishing of content, formatting-based markup is simply not good enough.

 

Structural markup

Markup that represents document structures is not tied to a specific application, output medium or device.

Common structures, like sections, titles, paragraphs, lists, tables, etc., transcend specific formatting combinations. Together with a style sheet for each output device or medium, large volumes of structure-based markup can be published on demand in a consistent, pleasing way.
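As a rough illustration of one structural source feeding several outputs, the Python sketch below renders the same markup through two different "style sheets". The `STYLES` table, the `render` function, and the two media names are hypothetical stand-ins invented for this example; in practice this role is usually played by a style sheet language such as XSLT or CSS.

```python
import xml.etree.ElementTree as ET

DOC = """<section>
<title>Structural Markup: A Better Way</title>
<sectionbody>
<para>Deriving structure from input data results in common tagging.</para>
<para>Structure is unambiguous and consistent.</para>
</sectionbody>
</section>"""

# Hypothetical style sheets: one template per structural element, per medium.
STYLES = {
    "html": {"title": "<h1>{}</h1>", "para": "<p>{}</p>"},
    "text": {"title": "== {} ==", "para": "    {}"},
}

def render(xml_text, medium):
    """Render titles and paragraphs using the templates for one medium."""
    style = STYLES[medium]
    root = ET.fromstring(xml_text)
    lines = []
    # iter() walks the tree in document order, so output order matches input.
    for elem in root.iter():
        template = style.get(elem.tag)
        if template is not None:
            lines.append(template.format((elem.text or "").strip()))
    return "\n".join(lines)

print(render(DOC, "html"))
print(render(DOC, "text"))
```

Supporting a new device means adding one entry to the style table; the stored markup does not change.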

Additionally, structure-based markup provides the proper foundation for increasingly sophisticated markup, which also stores information about the actual meaning of the content, known as semantic markup.

This is the type of markup generated by Exegenix Conversion Solutions.

Example:

<section>

<title font-family="Arial" font-size="20pt" font-weight="bold">Structural Markup: A Better Way</title>

<sectionbody><para font-family="TimesNewRoman" font-size="12pt">Deriving structure from input data results in output that has common tagging despite a diverse input dataset.</para>

<para font-family="TimesNewRoman">Input formatting is retained, and is available for use in post-processing or republishing of converted material.</para>

<para font-family="TimesNewRoman" font-size="12pt">Regardless of minor variations in input formatting, structure is unambiguous and consistent, and therefore, a better choice for long-term storage and management of content.</para></sectionbody>

</section>

Structural Markup: A Better Way

Deriving structure from input data results in output that has common tagging despite a diverse input dataset.

Input formatting is retained, and is available for use in post-processing or republishing of converted material.

Regardless of minor variations in input formatting, structure is unambiguous and consistent, and therefore, a better choice for long-term storage and management of content.

Semantic markup

Semantic markup is the most difficult type of markup to produce, because it requires human intervention to interpret the text and identify content that matches specific requirements. Often, the person making the determination must be a subject-matter expert in order to properly understand what they are reading, and apply the proper semantic tagging.

Consider the example below, and how one would attempt to find the term “Semantic Tagging”. With the semantically tagged content shown, it is possible to restrict the search to return only those documents where the search phrase “Semantic Tagging” is found in the “subject” of a “concept”.

Structure-based markup allows searching on structures, such as “section title” or “figure caption”, and serves as a foundation for semantic searches. Combining the two, it is possible to search for a term in a “concept” in “Chapter 1”. Fewer results are returned, and those results match what was wanted.

With formatting-based markup, no context-sensitive search or indexing is possible. Search queries can return hundreds, possibly thousands, of results, leaving the reader to check each one to find the document they want.
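The difference between the two kinds of search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sample documents in `DOCS` and the function names `keyword_search` and `subject_search` are invented for the example, and a real system would query an index rather than parse every document on each search.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents: only the first is actually about the topic.
DOCS = {
    "doc1": "<concept><subject>Semantic Tagging</subject>"
            "<premise><para>Structural tags form the basis of semantic tagging."
            "</para></premise></concept>",
    "doc2": "<section><title>Conversion Notes</title><sectionbody>"
            "<para>Semantic Tagging is mentioned here only in passing.</para>"
            "</sectionbody></section>",
}

def keyword_search(docs, phrase):
    """Return names of documents whose text contains the phrase anywhere."""
    phrase = phrase.lower()
    return [name for name, xml_text in docs.items()
            if phrase in "".join(ET.fromstring(xml_text).itertext()).lower()]

def subject_search(docs, phrase):
    """Return names of documents where the phrase is in a concept's subject."""
    phrase = phrase.lower()
    hits = []
    for name, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # iter() includes the root itself, so a top-level <concept> is matched.
        for concept in root.iter("concept"):
            subjects = concept.findall("subject")
            if any(phrase in (s.text or "").lower() for s in subjects):
                hits.append(name)
                break
    return hits

print(keyword_search(DOCS, "Semantic Tagging"))
print(subject_search(DOCS, "Semantic Tagging"))
```

The keyword search matches both documents, while the search restricted to the “subject” of a “concept” matches only the one that is about the topic.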

Semantically tagged output can be costly to achieve, but properly semantically tagged content makes the most sophisticated content publishing and indexing applications possible.

Exegenix Conversion Solutions deliver structural markup automatically, and provide tools to add semantic value to content during the conversion process.

Example:

<concept>

<subject>Semantic Tagging</subject>

<title font-family="Arial" font-size="20pt" font-weight="bold">Semantic Tagging: The Ultimate</title>

<premise><para font-family="TimesNewRoman" font-size="12pt">Structural tags form the basis of semantic tagging.</para>

<para>Semantic tagging enables extremely accurate indexing and search.</para></premise>

<proof><para font-family="TimesNewRoman" font-size="12pt">Consider searching for the term “Semantic Tagging” in the “subject” of a “concept”, versus a simple keyword search on the phrase “Semantic Tagging” across a large dataset.</para></proof>

<conclusion><para>Semantic Tagging rules!</para></conclusion>

</concept>

Semantic Tagging: The Ultimate

Structural tags form the basis of semantic tagging.

Semantic tagging enables extremely accurate indexing and search.

Consider searching for the term “Semantic Tagging” in the “subject” of a “concept”, versus a simple keyword search on the phrase “Semantic Tagging” across a large dataset.

Semantic Tagging rules!
