Chapter 11. Tagging and Structure

Structured PDF

As you’ve seen in all the previous chapters, PDF provides the ability to draw text, vector graphics, raster images, and even video and 3D onto a page that can be displayed or printed. However, the content is just that: a series of drawing instructions. It has no semantic or structural context. There is nothing that delineates one paragraph from another or one image from another. In fact, there isn’t even a concept of a paragraph or a word, just a collection of glyphs and their associated encoding.

This limitation is addressed by a feature of PDF called logical structure. It associates a hierarchy of objects, called structure elements, with the graphic objects on the page, along with any additional attributes needed to describe those objects. This is similar in concept to markup languages such as HTML or XML, but in PDF the structure and the content are kept in separate logical areas of the file rather than being intermixed (as they are in HTML, for example). This separation allows the ordering and nesting of logical elements to be entirely independent of the order and location of graphic objects on the document’s pages.
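
To make the separation concrete, here is a minimal Python sketch of the idea. It is not actual PDF syntax: the `StructElem` class, the `page_content` dictionary, and the integer content IDs are hypothetical stand-ins chosen for illustration. In a real file, structure elements are dictionaries in a structure tree, and they point at page content through marked-content identifiers. The sketch only shows that the logical order can differ from the drawing order.

```python
from dataclasses import dataclass, field


@dataclass
class StructElem:
    """Simplified, hypothetical stand-in for a PDF structure element."""
    type: str                                  # e.g. "Sect", "H1", "P", "Figure"
    kids: list = field(default_factory=list)   # child StructElems or integer content IDs


# Page content listed in drawing order, keyed by an illustrative content ID.
# The figure happens to be drawn first and the heading last.
page_content = {
    0: "figure image",
    1: "body paragraph glyphs",
    2: "heading glyphs",
}

# The structure tree lives apart from the content and imposes its own order:
# heading, then paragraph, then figure.
tree = StructElem("Document", [
    StructElem("Sect", [
        StructElem("H1", [2]),
        StructElem("P", [1]),
        StructElem("Figure", [0]),
    ]),
])


def logical_order(elem, content):
    """Walk the tree depth-first and return (structure type, content) pairs."""
    pairs = []
    for kid in elem.kids:
        if isinstance(kid, StructElem):
            pairs.extend(logical_order(kid, content))
        else:
            # An integer kid refers to a piece of page content by its ID.
            pairs.append((elem.type, content[kid]))
    return pairs


reading = logical_order(tree, page_content)
for struct_type, text in reading:
    print(struct_type, "->", text)
```

Walking the tree visits the heading, the paragraph, and then the figure, even though the page draws them in the opposite order. This is the independence between logical order and page order described above.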

While there is a series of predefined types of structure elements that enable the organization of a document into chapters and sections or the identification of special elements such as figures, tables, and footnotes, the facilities provided by PDF are quite extensible. This extensibility allows writers ...
