XML and HTML are called markup languages because of the way they add structure to plain-text documents—by surrounding parts of the text with tags that indicate structure or meaning, much as someone with a pen might highlight a sentence and add a note. While HTML predefines a set of tags and their structure, XML is a blank slate in which the author gets to define the tags, the rules, and their meanings.
Both XML and HTML owe their lineage to Standard Generalized Markup Language (SGML)—the mother of all markup languages. SGML has been used in the publishing industry for decades (including at O’Reilly). But it wasn’t until the Web captured the world that it came into the mainstream through HTML. HTML started as a very small application of SGML, and if HTML has done anything at all, it has proven that simplicity reigns.
When Tim Berners-Lee began postulating the Web back at CERN in the late 1980s, he wanted to organize project information using hypertext with links embedded in plain text. When the Web needed a protocol, HTTP—a simple, text-based client-server protocol—was invented. So, what exactly is so enchanting about the idea of plain text? Why, for example, didn’t Tim turn to the Microsoft Word format as the basis for web documents? Surely a binary, non-human-readable format and a similarly machine-oriented protocol would be more efficient? Since the Web’s inception, there have now been literally trillions of HTTP transactions. Was it really a good idea for them to use (English) words like “GET” and “POST” as part of the protocol?
The answer, as we’ve all seen, is yes! Whatever humans can read and undertstand, human developers can work with more easily. There is a time and place for a high level of optimization (and obscurity), but when the goal is universal acceptance and cross-platform portability, simplicity and transparency are paramount. This is the first fundamental proposition of XML: simple and nominally human-readable data.
Using text to exchange data is not exactly a new idea,
either, but historically, for every new document format that came along,
a new parser would have to be written. A
parser is an application that reads a document and understands its
formatting conventions, usually enforcing some rules about the content.
For example, the Java
class has a parser for the standard properties file format (Chapter 11). In our simple spreadsheet in Chapter 18, we wrote a parser capable of
understanding basic mathematical expressions. As we’ve seen, depending
on complexity, parsing can be quite tricky.
With XML, we can represent data without having to write this kind of custom parser. This isn’t to say that it’s reasonable to use XML for everything (e.g., typing math expressions into our spreadsheet), but for the common types of information that we exchange on the Net, we shouldn’t have to write parsers that deal with basic syntax and string manipulation. In conjunction with document-verifying components (Document Type Definitions [DTDs] or XML Schema), much of the complex error checking is also done automatically. This is the second fundamental proposition of XML: standardized parsing and validation.
All the basic APIs for working with XML are now bundled
with the standard release of Java. This included the
extension packages for working with Simple API for XML (SAX), Document
Object Model (DOM), XML Binding JAXB, and Extensible Stylesheet Language (XSL)
transforms, as well as APIs such as XPath, and XInclude. If you are using an older
version of Java, you can still use many of these tools but you will have
to download these packages separately.
When viewed in older browsers or in contexts that do not explicitly format XML for viewing, the browser will generally simply display the text of the document with all the tags (structural information) stripped off. This is the prescribed behavior for working with unknown XML markup in a viewing environment. Remember that you can always use the “view source” option to display the text of a file in your browser if you want to see the original source.