This final chapter focuses on tasks that come up when working with an assortment of common markup languages and formats: HTML, XHTML, XML, CSV, and INI. Although we’ll assume at least basic familiarity with these technologies, a brief description of each is included at the start of the chapter to make sure we’re on the same page before digging in. The descriptions here concentrate on the basic syntax rules needed to correctly search through the data structures of each format. Other details will be introduced as we encounter relevant issues.
Although it’s not always apparent on the surface, some of these formats can be surprisingly complex to process and manipulate accurately, at least when using regular expressions. It’s usually best to use dedicated parsers and APIs instead of regular expressions when performing many of the tasks in this chapter, especially if accuracy is critical (e.g., if your processing might have security implications). Nevertheless, these recipes show useful techniques that apply to many quick processing tasks.
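To make the trade-off concrete, here is a minimal sketch in Python that compares a naive regular expression with the standard library's `html.parser` module. The sample string, the pattern, and the `StartTagCollector` class are illustrative assumptions of ours, not taken from any recipe in this chapter.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a ">" inside an attribute value and a commented-out tag.
SAMPLE = '<p title="a > b">first</p><!-- <p>hidden</p> --><p>second</p>'

# Naive pattern: "<p", a word boundary, then anything up to the next ">".
naive_tags = re.findall(r"<p\b[^>]*>", SAMPLE)


class StartTagCollector(HTMLParser):
    """Records each opening tag the parser reports, with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


collector = StartTagCollector()
collector.feed(SAMPLE)
collector.close()

print(naive_tags)      # what the regex matched
print(collector.tags)  # what the parser reported
```

With this input, the regex reports three matches: it cuts the first tag short at the `>` inside the quoted attribute value, and it also picks up the `<p>` inside the comment. The parser reports only the two real paragraph tags, with the `title` value intact. A more careful regex can close that gap for quick jobs, which is the kind of technique these recipes explore.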
So let’s look at what we’re up against. Many of the difficulties we’ll encounter throughout this chapter involve how we should handle cases that deviate from the following rules in expected or unexpected ways.
HTML is used to describe the structure, semantics, and appearance of billions of web pages and other documents. It’s common to want to process HTML using regular expressions, ...