Text-Based XML Processing

At their foundation, XML documents are text. The content and markup are both represented as text, and text-editing tools can be extremely useful for XML document inspection, creation, and modification. XML’s textual foundations make it possible for developers to work with XML directly, using XML-specific tools only when they choose to.

One of the original design goals of XML was for documents to be easy to parse. For very simple documents that do not depend on features such as attribute defaulting and validation, it is possible to parse tags, attributes, and text data using standard programming tools such as regular expressions and tokenizers, but the complexity of processing grows rapidly as documents use more features. Unless the application can completely control the content of incoming documents, it is almost always preferable to use one of the many high-quality XML parsers that are freely available for most programming languages.
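The trade-off described above can be illustrated with a small sketch in Python. It tokenizes a deliberately simple document with regular expressions, then reads the same document with the standard library's `xml.etree.ElementTree` parser. The sample documents, the `tokenize` and `items_with_parser` helpers, and the patterns are all assumptions made for illustration; the patterns handle only double-quoted attributes and ignore namespaces, comments, entities, and DTDs.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: no DTD, entities, comments, CDATA, or namespaces.
SAMPLE = '<order id="42"><item sku="A1">Widget</item><item sku="B2">Gadget</item></order>'

# One alternative for a start/end/empty tag, one for a run of character data.
TOKEN = re.compile(r'<(/?)([A-Za-z_][\w.-]*)((?:\s+[\w.-]+="[^"]*")*)\s*(/?)>|([^<]+)')
ATTR = re.compile(r'([\w.-]+)="([^"]*)"')


def tokenize(text):
    """Split very simple XML into start, end, and text tokens."""
    tokens = []
    for match in TOKEN.finditer(text):
        closing, name, attrs, empty, chars = match.groups()
        if chars is not None:
            tokens.append(("text", chars))
        elif closing:
            tokens.append(("end", name))
        else:
            tokens.append(("start", name, dict(ATTR.findall(attrs))))
            if empty:
                # An empty-element tag such as <br/> acts as start plus end.
                tokens.append(("end", name))
    return tokens


def items_with_parser(text):
    """Read the same data with a real XML parser."""
    root = ET.fromstring(text)
    return [(item.get("sku"), item.text) for item in root.iter("item")]


# A CDATA section is enough to defeat the regex tokenizer; the parser copes.
TRICKY = "<note><![CDATA[1 < 2]]></note>"

if __name__ == "__main__":
    for token in tokenize(SAMPLE):
        print(token)
    print(items_with_parser(SAMPLE))
    print(tokenize(TRICKY))            # text tokens are split and mangled
    print(ET.fromstring(TRICKY).text)  # the parser recovers the content
```

The regular-expression approach holds up only while the input stays as plain as `SAMPLE`. The `TRICKY` document shows how quickly one additional XML feature breaks it, which is why a real parser is usually the safer choice.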

Textual tools are a key part of the XML toolset, however. Many developers use text editors such as vi, Emacs, Notepad, WordPad, BBEdit, and UltraEdit to create or modify XML documents. Regular expressions (in environments such as sed, grep, Perl, and Python) can be used for search and replace or for tweaking documents prior to XML parsing or XSLT processing. Various standards are also beginning to apply regular expression matching to content after a document has been parsed. The W3C’s XML Schema recommendation, for instance, ...
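As a sketch of the "tweaking documents prior to XML parsing" use of regular expressions, the following Python example escapes bare ampersands so that an otherwise ill-formed document can be handed to a parser. The sample text, the `BARE_AMP` pattern, and the `escape_bare_ampersands` helper are hypothetical. The pattern assumes the input contains no CDATA sections or comments, where a literal `&` is legal and should not be rewritten.

```python
import re
import xml.etree.ElementTree as ET

# Match "&" unless it already begins an entity or character reference
# such as &amp;, &#169;, or &#xA9;.
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)')


def escape_bare_ampersands(text):
    """Replace unescaped ampersands with &amp; before parsing."""
    return BARE_AMP.sub("&amp;", text)


# Hypothetical input: the first ampersand is not well-formed XML.
RAW = "<title>Tom & Jerry &amp; friends &#169;</title>"

if __name__ == "__main__":
    try:
        ET.fromstring(RAW)
    except ET.ParseError as err:
        print("raw input rejected:", err)

    fixed = escape_bare_ampersands(RAW)
    print(fixed)
    print(ET.fromstring(fixed).text)
```

A cleanup pass like this is reasonable when the source of the damage is known and narrow. Treating it as a general repair tool would run into the same feature-by-feature complexity that makes hand-written XML parsing fragile.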

