Filed under bookworm, epub, tools.

ePub books are supposed to have a unique identifier: the Dublin Core identifier found in the OPF file. Unfortunately, the ePub spec has no mechanism to enforce the uniqueness of the ID, so in practice many epubs don’t have truly unique identifiers (or indeed any identifiers at all).

Early in Bookworm’s history we didn’t specifically extract the identifier from the ePub and put it in the database, but we do now, so I can query the state of identifiers in the more recent books.

18,571 epubs in Bookworm have some kind of identifier in the database. Many of these were auto-generated by Bookworm, for example if the book had none at all.

14,735 of these are unique. Most of the duplicates came from earlier versions of Calibre, which simply generated identifiers like “25.” Others are just duplicate copies of the same book, which of course is okay.
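Counts like these are easy to reproduce once the identifiers are in one place. Here is a minimal sketch, assuming the identifiers have already been pulled out of the database into a plain list; the function name and sample values are hypothetical, and it treats “unique” as “distinct values”, which may not be exactly how the numbers above were computed.

```python
from collections import Counter

def identifier_stats(identifiers):
    """Return (total, distinct, duplicated) for a list of identifier strings."""
    counts = Counter(identifiers)
    # Keep only the identifiers that occur more than once.
    duplicated = {ident: n for ident, n in counts.items() if n > 1}
    return len(identifiers), len(counts), duplicated

# Hypothetical sample: two books share the Calibre-style identifier "25".
sample = [
    "25",
    "urn:isbn:9780596158347",
    "25",
    "http://www.gutenberg.org/ebooks/11",
]
total, distinct, duplicated = identifier_stats(sample)
print(total, distinct, duplicated)
```

Short numeric values showing up in the duplicated set are a good hint that a conversion tool, rather than a publisher, assigned the ID.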

There are three truly useful identifier types. I’ll specify these in order of my personal preference:

ISBN

Modern commercial books already have unique identifiers. Convenient! Use the ISBN as your primary ID in the epub if you have it, and be sure to use the correct ISBN for your ebook edition:

    <dc:identifier xmlns:dc="http://purl.org/dc/elements/1.1/"
                   id="bookid"
                   opf:scheme="ISBN">urn:isbn:9780596158347</dc:identifier>

Including the “urn:isbn:” prefix in the identifier value, as well as the optional opf:scheme attribute, lets intelligent reading systems recognize the ISBN and use it to look the book up in other systems. (1,132 books in Bookworm are courteous in this way.)
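To show what a reading system might do with such an identifier, here is a rough sketch that pulls ISBNs out of an OPF document and checks the ISBN-13 check digit. The helper names and the sample package are my own for illustration, not Bookworm’s code; the sample also declares the opf namespace, which the fragment above leaves implicit.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"

# Hypothetical minimal OPF document wrapping the identifier shown above.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier id="bookid" opf:scheme="ISBN">urn:isbn:9780596158347</dc:identifier>
  </metadata>
</package>"""

def is_valid_isbn13(digits):
    """Check the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0

def find_isbns(opf_xml):
    """Return valid ISBN-13s found in the dc:identifier elements."""
    root = ET.fromstring(opf_xml)
    isbns = []
    for el in root.iter(f"{{{DC}}}identifier"):
        text = (el.text or "").strip()
        scheme = el.get(f"{{{OPF}}}scheme", "")
        if text.lower().startswith("urn:isbn:"):
            candidate = text[len("urn:isbn:"):]
        elif scheme.upper() == "ISBN":
            candidate = text
        else:
            continue
        # Tolerate hyphenated or spaced ISBNs.
        candidate = candidate.replace("-", "").replace(" ", "")
        if is_valid_isbn13(candidate):
            isbns.append(candidate)
    return isbns

print(find_isbns(SAMPLE_OPF))
```

Accepting either signal (the prefix or the scheme attribute) is a deliberate choice here, since many books carry only one of the two.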

URI

Project Gutenberg epubs use a URI. This is a great method for digital-native books with their own stable identifiers. Gutenberg’s URIs include the Gutenberg book ID:

    <dc:identifier opf:scheme="URI" id="etextno">http://www.gutenberg.org/ebooks/11</dc:identifier>

Like ISBNs, these identifiers are stable, and they let reading systems do more with the book if they’re able (for example, direct users to the canonical information page for that book). If all publishers had their own websites with a page for every book, I’d even prefer URIs to ISBNs for this reason.

939 books on Bookworm are from Gutenberg directly, based on searching for this style of identifier. Only about 100 other books use URIs as identifiers.
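Searching for this style of identifier can be as simple as a regular expression. The sketch below is an assumption about what such a search could look like, not the query I actually ran; the pattern and function name are hypothetical and only cover the URI form shown above.

```python
import re

# Matches identifiers shaped like http://www.gutenberg.org/ebooks/11
GUTENBERG_RE = re.compile(r"^https?://www\.gutenberg\.org/ebooks/(\d+)/?$")

def gutenberg_book_id(identifier):
    """Return the Gutenberg book id as an int, or None if the identifier doesn't match."""
    match = GUTENBERG_RE.match(identifier.strip())
    return int(match.group(1)) if match else None

print(gutenberg_book_id("http://www.gutenberg.org/ebooks/11"))
print(gutenberg_book_id("urn:isbn:9780596158347"))
```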

UUID

A Universally Unique Identifier (UUID) solves the “required unique” problem by using very large (128-bit) numbers, generated by one of several schemes.

It’s trivial to generate a UUID in most programming languages. Here’s Python:

    >>> import uuid
    >>> uuid.uuid4()
    UUID('58dce2ac-7aec-45c3-a6de-903a30061545')

Wikipedia lists UUID implementations in other programming languages. You can even just go to a UUID-generating web site. These days, if Bookworm gets an epub with no identifier, it generates a UUID.
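A fallback along those lines might look like the following sketch. The function name is hypothetical and this is not Bookworm’s actual code; it only shows the idea of keeping an existing identifier and generating a “urn:uuid:” value when there is none.

```python
import uuid

def ensure_identifier(existing):
    """Return the book's own identifier, or a fresh urn:uuid: value if it has none."""
    if existing and existing.strip():
        return existing.strip()
    # UUID.urn formats the value as 'urn:uuid:xxxxxxxx-xxxx-...'
    return uuid.uuid4().urn

print(ensure_identifier("urn:isbn:9780596158347"))
print(ensure_identifier(None))
```

The generated string can be dropped straight into a dc:identifier element, ideally with a matching opf:scheme attribute such as “UUID”.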

UUID provides the least useful information of the three identifier types, but, as Wikipedia points out, “One’s annual risk of being hit by a meteorite is estimated to be one chance in 17 billion […], equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate.” So at least it’s unique.
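For the curious, the odds can be estimated with the standard birthday approximation. This is a back-of-the-envelope sketch, assuming random (version 4) UUIDs from a good random source, which carry 122 random bits; the function name is my own.

```python
import math

RANDOM_BITS = 122  # a version-4 UUID has 128 bits, 6 of which are fixed

def collision_probability(n, bits=RANDOM_BITS):
    """Approximate chance that n random identifiers contain at least one duplicate."""
    pairs = n * (n - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed with expm1 to keep precision for tiny values
    return -math.expm1(-pairs / 2 ** bits)

# Every identifier currently in Bookworm, if all of them were random UUIDs.
print(collision_probability(18_571))
# A few tens of trillions of UUIDs, as in the quote above.
print(collision_probability(26 * 10**12))
```

The second figure should land in the same range as the one-in-17-billion number in the quote, and the first is far smaller still.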


4 Responses to “What’s in an identifier?”

  1. Dave Cramer

    Note that there can be many dc:identifier elements in a single ePub, but only one can be the unique identifier. This must be specified on the package element itself:

    <package unique-identifier="bob"…

    Then the dc:identifier with id="bob" becomes the unique identifier.

    –Dave, who believes that real ISBNs contain hyphens!
