Two months ago, I was slated to participate in a conference panel on metadata but unfortunately had to cancel. Very fortunately, Todd Carpenter, Executive Director of the National Information Standards Organization (NISO), filled in for me at the last moment. You can take a look at his slides here. The following is a brief precis of the talk that I was going to give, with a couple of nods to Todd and to audience commentary that a colleague shared with me post-conference.
First and foremost, metadata drives the functionality of your online publishing platform. Without it, nothing really works. Todd put it very well during his presentation when he said that “without metadata, users will ‘walk by’”. Whenever publishers ask us whether our PubFactory platform has been developed to maximize discoverability or to support the latest cool new curation features, my typical response is “Yes…and how will you provide the metadata needed to enable the requested feature(s)?” Surprisingly often, the answer I get is “I have no idea”.
The kind of metadata that is needed is driven by what you are trying to accomplish. Most metadata discussions center on discovery, but for this discussion I would like to examine the metadata needs of a publisher-created platform. The use case that I have in mind is a publisher supplying content and metadata for an online delivery and reading system for their books, one that will primarily serve an institutional/academic market.
Just as search is the starting point for the typical user journey when accessing scholarly content, so too is it a critical component of any online platform, and it has many and varied effects on the metadata that needs to be supplied along with the content. One of the first considerations is how search should function for a collection of books. In some cases having search return whole-book results makes sense, but in many others the need to provide very granular results means that returning chapters, or even sections within chapters, is critical.
If chapter-level results are returned, you must provide title and author metadata at the chapter level. The platform may be able to generate this metadata automatically for monographs, but for edited collections it is likely that the publisher will need to supply it. This is a case where internal publisher workflows may need to be updated: if chapter-level metadata is not currently being created and managed, it cannot be supplied to the platform.
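As a rough illustration of that difference, here is a minimal Python sketch. The `Book` and `Chapter` classes and the `resolve_chapter_authors` function are hypothetical names invented for this example, not part of any platform's API. It assumes a monograph chapter can inherit the book's authors, while an edited collection must have authors supplied per chapter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chapter:
    title: str
    # Empty list means the publisher did not supply chapter-level authors.
    authors: List[str] = field(default_factory=list)


@dataclass
class Book:
    title: str
    authors: List[str]
    is_edited_collection: bool
    chapters: List[Chapter]


def resolve_chapter_authors(book: Book) -> List[Chapter]:
    """Return chapters with authors filled in (hypothetical helper).

    Assumption: monograph chapters inherit the book-level authors;
    edited-collection chapters must carry their own.
    """
    resolved = []
    for chapter in book.chapters:
        if chapter.authors:
            resolved.append(chapter)
        elif not book.is_edited_collection:
            resolved.append(Chapter(chapter.title, list(book.authors)))
        else:
            raise ValueError(
                f"Chapter {chapter.title!r} needs publisher-supplied authors"
            )
    return resolved
```

Failing loudly on an edited collection with missing authors is one possible design choice; a real workflow might instead log the gap and route it back to the publisher.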
The next thing to consider is which metadata fields should be searchable in a standard search. This will vary depending on the types of books in the collection. Also important when returning chapter-level results is recognizing that not all chapters should be presented in the search results. For example, returning front matter sections such as the title or copyright pages, or the index from the back of the book, doesn’t generally address the users’ needs. As a result, it is necessary to provide data for each section of the book indicating whether or not it should be included in search and whether or not it should be accessible on the platform.
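One way to picture these per-section flags is the Python sketch below. The `Section` class, its field names, and `search_candidates` are all made up for illustration. It also assumes that a section hidden from the platform should never surface in search, which is a design decision rather than a given.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    title: str
    section_type: str   # e.g. "front-matter", "chapter", "index" (illustrative values)
    searchable: bool    # should this section appear in search results?
    accessible: bool    # should this section be viewable on the platform?


def search_candidates(sections: List[Section]) -> List[Section]:
    """Keep only sections flagged as both searchable and accessible."""
    return [s for s in sections if s.searchable and s.accessible]


book_sections = [
    Section("Copyright Page", "front-matter", searchable=False, accessible=True),
    Section("Chapter 1", "chapter", searchable=True, accessible=True),
    Section("Index", "index", searchable=False, accessible=True),
]

for section in search_candidates(book_sections):
    print(section.title)
```

Keeping the two flags separate lets the copyright page and index stay readable on the platform while being left out of search results.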
It can also be helpful to provide abstracts in search results. Publishers can usually supply abstracts for books but may not have chapter-level abstracts readily at hand. If the data is not available, are there alternatives? For instance, one possibility is to extract the first page of content in lieu of an authored abstract. In general, the key is to think through and consciously decide on all the metadata issues that come up.
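The first-page fallback could look something like this Python sketch. The function name `abstract_for`, the `max_chars` limit, and the assumption that chapter text is already available as a list of page strings are all hypothetical choices made for the example.

```python
from typing import List, Optional


def abstract_for(
    authored_abstract: Optional[str],
    pages: List[str],
    max_chars: int = 500,
) -> str:
    """Prefer the authored abstract; otherwise fall back to the first page."""
    if authored_abstract and authored_abstract.strip():
        return authored_abstract.strip()
    if pages:
        # Collapse line breaks and repeated spaces left over from page layout.
        first_page = " ".join(pages[0].split())
        if len(first_page) <= max_chars:
            return first_page
        # Cut at the last word boundary before the limit.
        return first_page[:max_chars].rsplit(" ", 1)[0] + "..."
    return ""
```

Whether a truncated first page is an acceptable stand-in for an abstract is exactly the kind of decision that should be made deliberately, chapter type by chapter type.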
How does content format impact metadata?
The second major determinant of the type of metadata required will be the format of the content itself. There is a wide range of choices for book content, from the ever-popular PDF to ePub and various forms of XML. The most common industry standards for book XML include NLM (which has evolved into BITS), DocBook, TEI, and DITA. The XML formats typically include support for most of the needed metadata in the format itself, whereas ePub includes some metadata but will likely need to be supplemented, and PDFs generally have almost no metadata associated with them.
If a separate metadata format is necessary, then ONIX is one obvious industry standard to use. Despite the fact that ONIX 3.X has been around for a long time, a surprising number of publishers are still using ONIX 2.X. ONIX is most typically thought of as a format for title-level metadata, but it can support chapter-level metadata as well.
There are a number of important details to address when dealing with ONIX metadata, including:
- Confirming that you and your platform host support the same version of ONIX.
- Understanding how you are matching ONIX records to the content and any other associated files, e.g. book covers.
- Deciding whether you will provide different ISBNs for different formats of the same book. Publishers will sometimes do this if they are providing both PDF and ePub or Mobi versions of a book even though the platform will not treat them as separate books.
- Ensuring that you and your platform host encode chapter-level information in the same way.
- Making sure that the way the book is divided up exactly matches the content format and the metadata provided.
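Several of these checks can be automated before content is loaded. The Python sketch below is only an illustration: it does not parse ONIX, and it assumes the chapter identifiers have already been extracted from the metadata and from the content into plain dictionaries keyed by ISBN. The function name, the `isbn_aliases` mapping for format-specific ISBNs, and the placeholder identifiers are all hypothetical.

```python
from typing import Dict, List, Optional


def find_mismatches(
    metadata_chapters: Dict[str, List[str]],
    content_chapters: Dict[str, List[str]],
    isbn_aliases: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Report where metadata records and content disagree.

    isbn_aliases maps a format-specific ISBN to the ISBN the content is
    filed under, for publishers who assign one ISBN per format.
    """
    aliases = isbn_aliases or {}
    problems = []
    matched = set()
    for isbn, meta_ids in metadata_chapters.items():
        canonical = aliases.get(isbn, isbn)
        matched.add(canonical)
        content_ids = content_chapters.get(canonical)
        if content_ids is None:
            problems.append(f"{isbn}: no content found for metadata record")
        elif meta_ids != content_ids:
            missing = [c for c in meta_ids if c not in content_ids]
            extra = [c for c in content_ids if c not in meta_ids]
            problems.append(
                f"{isbn}: chapter list differs "
                f"(not in content: {missing}, not in metadata: {extra})"
            )
    for isbn in content_chapters:
        if isbn not in matched:
            problems.append(f"{isbn}: content has no metadata record")
    return problems
```

An empty list means the two sides agree on both the books and the way each book is divided up; anything else is worth resolving with your platform host before launch.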
Although properly creating and maintaining metadata for your content, not just basic metadata but rich metadata, can be quite complex and require significant effort, the investment is more than worth it. During the conference session, Chuck Koscher, Director of Technology at Crossref, said something that is worth repeating here (and elsewhere): “Metadata is its own entity. It goes places that your content never goes to”. The corollary that I’m emphasizing here is that metadata is critical to the functioning of your platform.
Watch out for my next post in which I will delve into taxonomies, business models and pricing as core components of your metadata management.