Handling Heterogeneous Doctypes

The autosectioner examples we’ve seen so far work relatively well when all of the documents share the same doctype, but this cannot always be guaranteed. Information aggregators, for instance, receive data from multiple sources that usually do not share a common DTD. XSL transformation could be used to convert incoming documents to a single standard, but this approach suffers from two limitations:

  • It may not be possible to come up with a single standard DTD that can accommodate all expressible data in the various incoming DTDs. Even when this is possible, the transformation may be irreversible, so that the original document cannot be recovered, or the standard DTD may be so broad that conforming documents still vary considerably.

  • Transformation may be an expensive process, limiting data import capacity.

In most such cases, documents of the various doctypes are stored in the same table, creating a heterogeneous document set that the autosectioner may have difficulty searching as a whole. For instance:

  • Different doctypes may use different tags to represent the same information. This forces queries to use OR-combined WITHIN clauses, which look messy and are less efficient than a single WITHIN clause.

  • Different doctypes may use the same tags to represent different information, or the autosectioner’s inability to distinguish tag case may lead to a tag collision. These situations will make queries less precise because the WITHIN clause will be unable to ...
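
Both problems can be illustrated with a small sketch. The Python below is only an illustration under stated assumptions: the helper names (within_any, case_collisions), the doctype names, and the tag names are all hypothetical and do not come from any Oracle API. The first function assembles the kind of OR-combined WITHIN expression a query needs when several tags carry the same information; the second flags tags that differ only by case, which a case-insensitive sectioner would treat as one section.

```python
from collections import defaultdict


def within_any(term, tags):
    """Build a text-query expression matching `term` inside any of `tags`.

    Hypothetical helper: it only concatenates strings of the form
    (term WITHIN tag) joined by OR; it does not talk to a database.
    """
    if not tags:
        raise ValueError("at least one tag is required")
    clauses = [f"({term} WITHIN {tag})" for tag in tags]
    return " OR ".join(clauses)


def case_collisions(tags_by_doctype):
    """Return tags that differ across doctypes only by letter case.

    The result maps the upper-cased tag name to the sorted list of
    (doctype, tag) pairs that would collapse into a single section.
    """
    seen = defaultdict(set)
    for doctype, tags in tags_by_doctype.items():
        for tag in tags:
            seen[tag.upper()].add((doctype, tag))
    return {
        folded: sorted(uses)
        for folded, uses in seen.items()
        # Keep only names that appear with more than one spelling.
        if len({tag for _, tag in uses}) > 1
    }


# Assumed example: three doctypes name their title element differently.
query = within_any("oracle", ["title", "headline", "subject"])

# Assumed example: two doctypes use "title" and "Title" for different data.
collisions = case_collisions({
    "news": ["title", "Body"],
    "report": ["Title", "summary"],
})
```

Each additional doctype adds another clause to the query expression, which is why this style of query becomes unwieldy as the number of sources grows.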
