Chapter 1. Perl and XML

Perl is a mature but eccentric programming language that is tailor-made for text manipulation. XML is a fiery young upstart of a text-based markup language used for web content, document processing, web services, or any situation in which you need to structure information flexibly. This book is the story of the first few years of their sometimes rocky (but ultimately happy) romance.

Why Use Perl with XML?

First and foremost, Perl is ideal for crunching text. It has filehandles, “here” docs, string manipulation, and regular expressions built into its syntax. Anyone who has ever written code to manipulate strings in a low-level language like C and then tried to do the same thing in Perl has no trouble telling you which environment is easier for text processing. XML is text at its core, so Perl is uniquely well suited to work with it.

Furthermore, starting with Version 5.6, Perl has been getting friendly with Unicode-flavored character encodings, especially UTF-8, which is important for XML processing. You’ll read more about character encoding in Chapter 3.

Second, the Comprehensive Perl Archive Network (CPAN) is a multimirrored heap of modules free for the taking. You could say that it takes a village to make a program; anyone who undertakes a programming project in Perl should check the public warehouse of packaged solutions and building blocks to save time and effort. Why write your own parser when CPAN has plenty of parsers to download, all tested and chock full of configurability? CPAN is wild and woolly, with contributions from many people and not much supervision. The good news is that when a new technology emerges, a module supporting it pops up on CPAN in short order. This responsiveness complements XML nicely, since XML is always changing and acquiring new accessory technologies.

Early on, modules sprouted up around XML like mushrooms after a rain. Each module brought with it a unique interface and style that was innovative and Perlish, but not interchangeable. Recently, there has been a trend toward creating a universal interface so modules can be interchangeable. If you don’t like this SAX parser, you can plug in another one with no extra work. Thus, the CPAN community does work together and strive for internal coherence.

Third, Perl’s flexible, object-oriented programming capabilities are very useful for dealing with XML. An XML document is a hierarchical structure made of a single basic atomic unit, the XML element, that can hold other elements as its children. Thus, the elements that make up a document can be represented by one class of objects that all have the same, simple interface. Furthermore, XML markup encapsulates content the way objects encapsulate code and data, so the two complement each other nicely. You’ll also see that objects are useful for modularizing XML processors. These objects include parser objects, parser factories that serve up parser objects, and parsers that return objects. It all adds up to clean, portable code.

Fourth, the link between Perl and the Web is important. Java and JavaScript get all the glamour, but any web monkey knows that Perl lurks at the back end of most servers. Many web-munging libraries in Perl are easily adapted to XML. The developers who have worked in Perl for years building web sites are now turning their nimble fingers to the XML realm.

Ultimately, you’ll choose the programming language that best suits your needs. Perl is ideal for working with XML, but you shouldn’t just take our word for it. Give it a try.

XML Is Simple with XML::Simple

Many people, understandably, think of XML as the invention of an evil genius bent on destroying humanity. The embedded markup, with its angle brackets and slashes, is not exactly a treat for the eyes. Add to that the business about nested elements, node types, and DTDs, and you might cower in the corner and whimper for nice, tab-delimited files and a split function.

Here’s a little secret: writing programs to process XML is not hard. A whole spectrum of tools that handle the mundane details of parsing and building data structures for you is available, with convenient APIs that get you started in a few minutes. If you really need the complexity of a full-featured XML application, you can certainly get it, but you don’t have to. XML scales nicely from simple to bafflingly complex, and if you deal with XML on the simple end of the continuum, you can pick simple tools to help you.

To prove our point, we’ll look at a very basic module called XML::Simple, created by Grant McLean. With minimal effort up front, you can accomplish a surprising amount of useful work when processing XML.

A typical program reads in an XML document, makes some changes, and writes it back out to a file. XML::Simple was created to automate this process as much as possible. One subroutine call reads in an XML document and stores it in memory for you, using nested hashes to represent elements and data. After you make whatever changes you need to make, call another subroutine to print it out to a file.

Let’s try it out. As with any module, you have to introduce XML::Simple to your program with a use statement like this:

use XML::Simple;

When you do this, XML::Simple exports two subroutines into your namespace:

XMLin( )

This subroutine reads an XML document from a file or string and builds a data structure to contain the data and element structure. It returns a reference to a hash containing the structure.

XMLout( )

Given a reference to a hash containing an encoded document, this subroutine generates XML markup and returns it as a string of text.
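
To make the shape of that data structure concrete, here is a minimal sketch. The sample markup is made up for illustration, and the Data::Dumper call is there only so you can inspect the result; the order of keys in the dump may vary.

```perl
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# A made-up fragment, passed as a string rather than a filename
my $xml = '<customer><first-name>Joe</first-name><age>42</age></customer>';

# By default, an element that holds only text collapses to a plain string
my $plain = XMLin($xml);

# With ForceArray, every child element is wrapped in an array reference
my $forced = XMLin($xml, ForceArray => 1);

print Dumper($plain, $forced);
```

In the first structure you would reach the name as $plain->{'first-name'}; in the second, as $forced->{'first-name'}[0]. Note that the root element name itself is not kept in either case.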

If you like, you can build the document from scratch by simply creating the data structures from hashes, arrays, and strings. You’d have to do that if you wanted to create a file for the first time. Just be careful to avoid using circular references, or the module will not function properly.
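
As a rough sketch of building a document from scratch, the following hand-assembles a small structure and passes it to XMLout( ). The data is invented for illustration, and the RootName option is used here only to give the root element a meaningful name.

```perl
use strict;
use warnings;
use XML::Simple;

# Invented mailing-list data, built by hand rather than parsed from a file.
# Plain scalar values become attributes; values wrapped in array references
# become child elements.
my $list = {
    version  => '3.5',
    customer => [
        { 'first-name' => ['Joe'],       surname => ['Wrigley']  },
        { 'first-name' => ['Henrietta'], surname => ['Pussycat'] },
    ],
};

# RootName names the outermost element of the generated markup
my $xml = XMLout($list, RootName => 'spam-document');
print $xml;
```

The distinction between scalars (attributes) and array references (child elements) is the main thing to watch when assembling a structure by hand.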

For example, let’s say your boss is going to send email to a group of people using the world-renowned mailing list management application, WarbleSoft SpamChucker. Among its features is the ability to import and export XML files representing mailing lists. The only problem is that the boss has trouble reading customers’ names as they are displayed on the screen and would prefer that they all be in capital letters. Your assignment is to write a program that can edit the XML datafiles to convert just the names into all caps.

Accepting the challenge, you first examine the XML files to determine the style of markup. Example 1-1 shows such a document.

Example 1-1. SpamChucker datafile

 <?xml version="1.0"?>
 <spam-document version="3.5" timestamp="2002-05-13 15:33:45">
 <!-- Autogenerated by WarbleSoft Spam Version 3.5 -->
 <customer>
  <first-name>Joe</first-name>
  <surname>Wrigley</surname>
  <address>
    <street>17 Beable Ave.</street>
    <city>Meatball</city>
    <state>MI</state>
    <zip>82649</zip>
  </address>
  <email>joewrigley@jmac.org</email>
  <age>42</age>
 </customer>
 <customer>
  <first-name>Henrietta</first-name>
  <surname>Pussycat</surname>
  <address>
    <street>R.F.D. 2</street>
    <city>Flangerville</city>
    <state>NY</state>
    <zip>83642</zip>
  </address>
  <email>meow@263A.org</email>
  <age>37</age>
 </customer>
 </spam-document>

Having read the perldoc page describing XML::Simple, you might feel confident enough to craft a little script, shown in Example 1-2.

Example 1-2. A script to capitalize customer names

# This program capitalizes all the customer names in an XML document
# made by WarbleSoft SpamChucker.

# Turn on strict and warnings, for it is always wise to do so (usually)
use strict;
use warnings;

# Import the XML::Simple module
use XML::Simple;

# Turn the file into a hash reference, using XML::Simple's "XMLin"
# subroutine.
# We'll also turn on the 'forcearray' option, so that every child element
# is stored as an arrayref, even when it appears only once.
my $cust_xml = XMLin('./customers.xml', forcearray=>1);

# Loop over each customer sub-hash; they are all stored in an
# anonymous array under the 'customer' key
for my $customer (@{$cust_xml->{customer}}) {
  # Capitalize the contents of the 'first-name' and 'surname' elements
  # by running Perl's built-in uc(  ) function on them
  foreach (qw(first-name surname)) {
    $customer->{$_}->[0] = uc($customer->{$_}->[0]);
  }
}

# print out the hash as an XML document again, with a trailing newline
# for good measure
print XMLout($cust_xml);
print "\n";

Running the program (with a little trepidation, perhaps, since the data belongs to your boss), you get this output:

<opt version="3.5" timestamp="2002-05-13 15:33:45">
  <customer>
    <address>
      <state>MI</state>
      <zip>82649</zip>
      <city>Meatball</city>
      <street>17 Beable Ave.</street>
    </address>
    <first-name>JOE</first-name>
    <email>joewrigley@jmac.org</email>
    <surname>WRIGLEY</surname>
    <age>42</age>
  </customer>
  <customer>
    <address>
      <state>NY</state>
      <zip>83642</zip>
      <city>Flangerville</city>
      <street>R.F.D. 2</street>
    </address>
    <first-name>HENRIETTA</first-name>
    <email>meow@263A.org</email>
    <surname>PUSSYCAT</surname>
    <age>37</age>
  </customer>
</opt>

Congratulations! You’ve written an XML-processing program, and it worked perfectly. Well, almost perfectly. The output is a little different from what you expected. For one thing, the elements are in a different order, since hashes don’t preserve the order of items they contain. The root element has also been renamed from spam-document to opt, and the XML declaration and the comment have disappeared. Also, the spacing between elements may be off. Could this be a problem?

This scenario brings up an important point: there is a trade-off between simplicity and completeness. As the developer, you have to decide what’s essential in your markup and what isn’t. Sometimes the order of elements is vital, and then you might not be able to use a module like XML::Simple. Or, perhaps you want to be able to access processing instructions and keep them in the file. Again, this is something XML::Simple can’t give you. Thus, it’s vital that you understand what a module can or can’t do before you commit to using it. Fortunately, you’ve checked with your boss and tested the SpamChucker program on the modified data, and everyone was happy. The new document is close enough to the original to fulfill the application’s requirements.[1] Consider yourself initiated into processing XML with Perl!
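
If the root element name or the XML declaration does matter to the consuming application, XMLout( ) accepts options that bring them back. The sketch below works on a cut-down, inline version of the datafile (an assumption made to keep the example self-contained) and uses the RootName and XMLDecl options. Element order and comments are still not preserved.

```perl
use strict;
use warnings;
use XML::Simple;

# A shortened stand-in for the SpamChucker datafile
my $source = <<'XML';
<spam-document version="3.5">
  <customer>
    <first-name>Joe</first-name>
    <surname>Wrigley</surname>
  </customer>
</spam-document>
XML

my $doc = XMLin($source, ForceArray => 1);

# Capitalize the names, as in Example 1-2
for my $customer (@{ $doc->{customer} }) {
    $customer->{$_}[0] = uc $customer->{$_}[0] for qw(first-name surname);
}

# RootName restores the original root element; XMLDecl adds a declaration
my $result = XMLout($doc, RootName => 'spam-document', XMLDecl => 1);
print $result;
```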

This is only the beginning of your journey. Most of the book still lies ahead of you, chock full of tips and techniques to wrestle with any kind of XML. Not every XML problem is as simple as the one we just showed you. Nevertheless, we hope we’ve made the point that there’s nothing innately complex or scary about banging XML with your Perl hammer.

XML Processors

Now that you see the easy side of XML, we will expose some of XML’s quirks. You need to consider these quirks when working with XML and Perl.

When we refer in this book to an XML processor (or simply a processor, not to be confused with a computer’s central processing unit, which shares the nickname), we mean software that can read or generate XML documents. We use the term in its most general sense: what the program actually does with the content it finds in the XML is not the processor’s concern, nor is it the processor’s job to determine where a document came from or to decide what to do with one that it generates.

As you might expect, a raw XML processor working alone isn’t very interesting. For this reason, a computer program that actually does something cool or useful with XML uses a processor as just one component. It usually reads an XML file and, through the magic of parsing, turns it into in-memory structures that the rest of the program can do whatever it likes with.

In the Perl world, this behavior becomes possible through the use of Perl modules: typically, a program that needs to process XML embraces, through a use directive, an existing package that makes a programmer interface available (usually an object-oriented one). This is why, before they get down to business, many XML-handling Perl programs start out with use XML::Parser; or something similar. With one little line, they’re able to leave all the dirty work of XML parsing to another, previously written module, leaving their own code to decide what to do pre- and post-processing.
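
As an illustration of that division of labor, here is a minimal sketch using XML::Parser. The inline document and the @seen array are invented for the example; the parser does the tokenizing, and the program's own code (the Start handler) merely records each element name it is told about.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;    # element names, in document order

# The parser calls our Start handler once for every opening tag
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $name, %attrs) = @_;
            push @seen, $name;
        },
    },
);

$parser->parse(
    '<customer><first-name>Joe</first-name><surname>Wrigley</surname></customer>'
);

print join(', ', @seen), "\n";
```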

A Myriad of Modules

One of Perl’s strengths is that it’s a community-driven language. When Perl programmers identify a need and write a module to handle it, they are encouraged to distribute it to the world at large via CPAN. The advantage of this is that if there’s something you want to do in Perl and there’s a possibility that someone else wanted to do it previously, a Perl module is probably already available on CPAN.

However, for a technology that’s as young, popular, and creatively interpretable as XML, the community-driven model has a downside. When XML first caught on, many different Perl modules written by different programmers appeared on CPAN, seemingly all at once. Without a governing body, they all coexisted in inconsistent glee, with a variety of structures, interfaces, and goals.

Don’t despair, though. In the time since the mist-enshrouded elder days of 1998, a movement towards some semblance of organization and standards has emerged from the Perl/XML community (which primarily manifests on ActiveState’s perl-xml mailing list, as mentioned in the preface). The community built on these first modules to make tools that followed the same rules that other parts of the XML world were settling on, such as the SAX and DOM parsing standards, and implemented XML-related technologies such as XPath. Later, the field of basic, low-level parsers started to widen. Recently, some very interesting systems have emerged (such as XML::SAX) that bring truly Perlish levels of DWIMminess out of these same standards.[2]

Of course, the goofy, quick-and-dirty tools are still there if you want to use them, and XML::Simple is among them. We will try to help you understand when to reach for the standards-using tools and when it’s OK to just grab your XML and run giggling through the daffodils.

Keep in Mind...

In many cases, you’ll find that the XML modules on CPAN satisfy 90 percent of your needs. Of course, that final 10 percent is the difference between being an essential member of your company’s staff and ending up slated for the next round of layoffs. We’re going to give you your money’s worth out of this book by showing you in gruesome detail how XML processing in Perl works at the lowest levels (relative to any other kind of specialized text munging you may perform with Perl). To start, let’s go over some basic truths:

  • It doesn’t matter where it comes from.

    By the time the XML parsing part of a program gets its hands on a document, it doesn’t give a camel’s hump where the thing came from. It could have been received over a network, constructed from a database, or read from disk. To the parser, it’s good (or bad) XML, and that’s all it knows.

    Mind you, the program as a whole might care a great deal. If we write a program that implements XML-RPC, for example, it better know exactly how to use TCP to fetch and send all that XML data over the Internet! We can have it do that fetching and sending however we like, as long as the end product is the same: a clean XML document fit to pass to the XML processor that lies at the program’s core.

    We will get into some detailed examples of larger programs later in this book.

  • Structurally, all XML documents are similar.

    No matter why or how they were put together or to what purpose they’ll be applied, all XML documents must follow the same basic rules of well-formedness: exactly one root element, no overlapping elements, all attributes quoted, and so on. Every XML processor’s parser component will, at its core, need to do the same things as every other XML processor. This, in turn, means that all these processors can share a common base. Perl XML-processing programs usually observe this in their use of one of the many free parsing modules, rather than having to reimplement basic XML parsing procedures every time.

    Furthermore, the one-document, one-element nature of XML makes processing a pleasantly fractal experience, as any document invoked through an external entity by another document magically becomes “just another element” within the invoker, and the same code that crawled the first document can skitter into the meat of any reference (and anything to which the reference might refer) without batting an eye.

  • In meaning, all XML applications are different.

    An XML application is the raison d'être of any XML document that belongs to it: a higher-level set of rules the document follows in order to serve some useful purpose, be it filling out a configuration file, preparing a network transmission, or describing a comic strip. XML applications exist not only to bless humble documents with a higher sense of purpose, but also to require that the documents be written according to a given application specification.

    DTDs help enforce the consistency of this structure. However, you don’t have to have a formal validation scheme to make an application. You may want to create some validation rules, though, if you need to make sure that your successors (including yourself, two weeks in the future) do not stray from the path you had in mind when they make changes to the program. You should also create a validation scheme if you want to allow others to write programs that generate the same flavor of XML.

Most of the XML hacking you’ll accomplish will capitalize on this document/application duality. In most cases, your software will consist of parts that cover all three of these facts:

  • It will accept input in an appropriate way—listening to a network socket, for example, or reading a file from disk. This behavior is very ordinary and Perlish: do whatever’s necessary here to get that data.

  • It will pass captured input to some kind of XML processor. Dollars to doughnuts says you’ll use one of the parsers that other people in the Perl community have already written and continue to maintain, such as XML::Simple, or the more sophisticated modules we’ll discuss later.

  • Finally, it will Do Something with whatever that processor did to the XML. Maybe it will output more XML (or HTML), update a database, or send mail to your mom. This is the defining point of your XML application—it takes the XML and does something meaningful with it. While we won’t cover the infinite possibilities here, we will discuss the crucial ties between the XML processor and the rest of your program.

XML Gotchas

This section introduces topics we think you should keep in mind as you read the book. They are the source of many of the problems you’ll encounter when working with XML.

Well-formedness

XML has built-in quality control. A document has to pass some minimal syntax rules in order to be blessed as well-formed XML. Most parsers fail to handle a document that breaks any of these rules, so you should make sure any data you input is of sufficient quality.
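
One way to guard against bad input is to test for well-formedness before doing anything else. The following is a sketch under the assumption that XML::Parser is installed; the is_well_formed helper is a name made up for this example. It relies on the fact that XML::Parser dies when it meets a well-formedness error, so the parse is wrapped in eval.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if the string parses cleanly, 0 otherwise
sub is_well_formed {
    my ($xml) = @_;
    my $ok = eval { XML::Parser->new->parse($xml); 1 };
    return $ok ? 1 : 0;
}

my $good = is_well_formed('<a><b/></a>');      # properly nested
my $bad  = is_well_formed('<a><b></a>');       # <b> is never closed

print "good: $good, bad: $bad\n";
```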

Character encodings

Now that we’re in the 21st century, we have to pay attention to things like character encodings. Gone are the days when you could be content knowing only about ASCII, the little character set that could. Unicode is the new king, presiding over all major character sets of the world. XML prefers to work with Unicode, but there are many ways to represent it, including Perl’s favorite Unicode encoding, UTF-8. You usually won’t have to think about it, but you should still be aware of the potential.
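
To see why the encoding matters, consider how one short string is stored under two different encodings. This sketch uses Perl's core Encode module; the sample word is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Four characters; the last one is e-acute (U+00E9)
my $text = "caf\x{e9}";

my $utf8  = encode('UTF-8',      $text);   # e-acute takes two bytes
my $latin = encode('ISO-8859-1', $text);   # e-acute takes one byte

# Decoding with the right encoding recovers the original characters
my $round_trip = decode('UTF-8', $utf8);

printf "characters: %d, UTF-8 bytes: %d, Latin-1 bytes: %d\n",
    length($text), length($utf8), length($latin);
```

The same four characters occupy five bytes in UTF-8 and four in ISO-8859-1, which is why a parser must know which encoding a document uses before it can interpret the bytes.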

Namespaces

Not everyone works with or even knows about namespaces. It’s a feature in XML whose usefulness is not immediately obvious, yet it is creeping into our reality slowly but surely. These devices categorize markup and declare tags to be from different places. With them, you can mix and match document types, blurring the distinctions between them. Equations in HTML? Markup as data in XSLT? Yes, and namespaces are the reason. Older modules don’t have special support for namespaces, but the newer generation will. Keep it in mind.

Declarations

Declarations aren’t part of the document per se; they just define pieces of it. That makes them weird, and something you might not pay enough attention to. Remember that documents often use DTDs and have declarations for such things as entities and attributes. If you forget, you could end up breaking something.

Entities

Entities and entity references seem simple enough: they stand in for content that you’d rather not type in at that moment. Maybe the content is in another file, or maybe it contains characters that are difficult to type. The concept is simple, but the execution can be a royal pain. Sometimes you want to resolve references and sometimes you’d rather keep them there. Sometimes a parser wants to see the declarations; at other times it doesn’t care. Entities can contain other entities to an arbitrary depth. They’re tricky little beasties and we guarantee that if you don’t give careful thought to how you’re going to handle them, they will haunt you.

Whitespace

According to XML, anything that isn’t a markup tag is significant character data. This fact can lead to some surprising results. For example, it isn’t always clear what should happen with whitespace. By default, an XML processor will preserve all of it—even the newlines you put after tags to make them more readable or the spaces you use to indent text. Some parsers will give you options to ignore space in certain circumstances, but there are no hard and fast rules.
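
A small experiment shows the effect. In this sketch (the inline document is invented, and XML::Parser is assumed to be installed), a Char handler collects every piece of character data the parser reports, including the newlines and indentation that were added only for readability. The parser may deliver the text in several chunks, so the example joins them before looking at the result.

```perl
use strict;
use warnings;
use XML::Parser;

my @chunks;    # every piece of character data the parser hands us

my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ($expat, $string) = @_; push @chunks, $string; },
    },
);

$parser->parse("<list>\n  <item>one</item>\n</list>");

# All character data, in document order, whitespace included
my $text = join '', @chunks;

# Chunks that consist of nothing but whitespace
my @blank = grep { /^\s*$/ } @chunks;

printf "%d chunk(s), %d of them whitespace only\n",
    scalar @chunks, scalar @blank;
```

If the whitespace is not meaningful to your application, your own code has to discard it, for example by skipping whitespace-only chunks in the handler.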

In the end, Perl and XML are well suited for each other. There may be a few traps and pitfalls along the way, but with the generosity of various module developers, your path toward Perl/XML enlightenment should be well lit.



[1] Some might say that, disregarding the changes we made on purpose, the two documents are semantically equivalent, but this is not strictly true. The order of elements changed, which is significant in XML. We can say for sure that the documents are close enough to satisfy all the requirements of the software for which they were intended and of the end user.

[2] DWIM = “Do What I Mean,” one of the fundamental philosophies governing Perl.
