Filed under bookworm, ebooks, epub.

We were curious about the distribution of languages among the ePubs on Bookworm. (Ibis Reader doesn’t have enough titles to be representative yet.)

The following information is derived from the dc:language field in the OPF file.
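For anyone who wants to try this on their own collection, here is a minimal sketch of how the dc:language values could be pulled out of an epub. It is not the code Bookworm runs; the function names `opf_languages` and `epub_languages` are made up for illustration. It assumes a standard epub layout in which META-INF/container.xml points at the OPF file.

```python
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
DC_NS = "http://purl.org/dc/elements/1.1/"


def opf_languages(opf_xml):
    """Return the stripped text of every dc:language element in an OPF document."""
    root = ET.fromstring(opf_xml)
    return [(el.text or "").strip() for el in root.iter("{%s}language" % DC_NS)]


def epub_languages(path_or_file):
    """Open an epub (a zip archive), locate its OPF file, and return its dc:language values."""
    with zipfile.ZipFile(path_or_file) as z:
        container = ET.fromstring(z.read("META-INF/container.xml"))
        rootfile = container.find(".//{%s}rootfile" % CONTAINER_NS)
        if rootfile is None or not rootfile.get("full-path"):
            raise ValueError("container.xml does not name an OPF file")
        return opf_languages(z.read(rootfile.get("full-path")))
```

An empty list means the book has no dc:language element at all; an empty string in the list means the element is present but blank.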

Here’s the chart:

Missing from the chart, of course, is English. It’s so overrepresented it skews the chart to the point of being unreadable.

Of the 62,000 epubs on Bookworm right now:

  • 29,642 have no language value
  • A little over 20,000 are English (combining various values like “en”, “en-GB”, or — embarrassingly — “American”)
  • The remainder, 5,874, are distributed among all other languages
  • Almost half of the distinct values appear only once (likely bad data)
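The grouping behind those numbers can be sketched roughly as follows. This is an illustration, not the script used for the counts above: the alias set is an assumption (only “en”, “en-GB”, and “American” are mentioned in this post), and the helper names `bucket`, `tally`, and `singletons` are hypothetical.

```python
from collections import Counter

# Assumed alias set: "en" and "American" come from the post,
# "eng" and "english" are illustrative guesses.
ENGLISH_ALIASES = {"en", "eng", "english", "american"}


def bucket(value):
    """Map a raw dc:language value to a bucket used for counting."""
    if value is None or not value.strip():
        return "(none)"
    v = value.strip().lower().replace("_", "-")
    primary = v.split("-")[0]  # "en-gb" -> "en"
    if primary in ENGLISH_ALIASES:
        return "en"
    return v


def tally(values):
    """Count how many books fall into each bucket."""
    return Counter(bucket(v) for v in values)


def singletons(counts):
    """Buckets that occur exactly once, which are likely bad data."""
    return sorted(k for k, n in counts.items() if n == 1)
```

For example, `tally(["en", "en-GB", "American", None, "cs", "cs", "Robert"])` puts three books under "en", one under "(none)", two under "cs", and leaves "robert" as the only singleton.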

I found it very interesting that the most represented non-English language code is cs — Czech — by a huge margin. Any ideas why?

Wondering which values are correct? The OPF 2.0 spec is unambiguous:

The content of this element [dc:language] must comply with RFC 3066
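As a rough illustration of what complying with RFC 3066 means in practice, here is a small sketch of a syntax check. It covers only the tag grammar (a primary subtag of 1 to 8 letters, followed by optional subtags of 1 to 8 letters or digits) plus the rule that primary subtags are two-letter or three-letter ISO 639 codes, “i”, or “x”. It does not look codes up in the ISO 639 tables, so a well-formed but unassigned code would still pass. The function names are hypothetical.

```python
import re

# RFC 3066 grammar: Primary-subtag *( "-" Subtag )
_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_wellformed(tag):
    """True if the tag matches the RFC 3066 syntax."""
    return _TAG.fullmatch(tag) is not None


def has_plausible_primary(tag):
    """True if the tag is well formed and its primary subtag is a
    2- or 3-letter code, or the reserved "i" / "x" prefix."""
    if not is_wellformed(tag):
        return False
    primary = tag.split("-")[0].lower()
    return len(primary) in (2, 3) or primary in ("i", "x")
```

Under this check “en-GB” and “cs” pass, “en_US” fails on syntax, and “American” and “Robert” are syntactically well formed but fail the primary-subtag test.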

(Also, does anyone speak “Robert”?)

4 Responses to “Languages in real-world ePubs”

  1. Colin Hazlehurst

    Thanks for providing this information. I recently added a language dropdown list to opubWriter (http://opubwriter.com) so authors can select a language code for their new epub books. I wondered if I should limit the list to the languages I thought would be most used and I’m glad I didn’t.

    It’s interesting that nearly half of the books in your study have no language code, yet the Open Packaging Format schema says there must be one-or-more of title, identifier, and language. In the rush to create epubs there’s clearly not much validation going on out there.

    Colin Hazlehurst

  2. Liza Daly

    You should limit the language values to RFC 3066 for sure.

    Bookworm doesn’t disallow invalid epubs, but it does require at least dc:title. To me that was a minimal standard of metadata; any less than that and I didn’t want to guarantee that we could render it.