Most of my work in maintaining the Bookworm ePub reader is keeping up with all of the variations of the format that people try to upload. There are some consistent problems that I’m seeing “out in the wild,” some serious, some understandable.
Lots of these problems would be caught by epubcheck, which can be used via the threepress.org epubcheck service, but I imagine that many people are testing only by opening the ePub in Adobe Digital Editions. ADE is very forgiving. In the long run it’s best to validate all ePubs, as that gives them the best chance of working properly in future rendering systems that might not be so generous.
Missing required attributes in the metadata. This is the one that’s most likely to get your ePub rejected by Bookworm, and the most common case is missing playOrder attributes in the NCX table of contents file. The playOrder attribute specifies the order in which the table of contents should be laid out, and it’s easy to miss because its information is usually redundant — generally the navPoints are laid out in document order anyway. A recent update of Bookworm will allow books that are missing their playOrder to be loaded (it then relies on document order), but strictly speaking, playOrder is required.
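If you want to catch this yourself before uploading, a check along these lines is one way to do it. This is only a minimal sketch, not Bookworm's or epubcheck's code: the function name is made up for illustration, and it assumes the NCX file uses the standard DAISY NCX namespace.

```python
import xml.etree.ElementTree as ET

# Standard NCX namespace; the function name below is hypothetical.
NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"

def navpoints_missing_playorder(ncx_xml):
    """Return the id of every navPoint that has no playOrder attribute."""
    root = ET.fromstring(ncx_xml)
    missing = []
    for nav_point in root.iter(f"{{{NCX_NS}}}navPoint"):
        if nav_point.get("playOrder") is None:
            missing.append(nav_point.get("id", "(no id)"))
    return missing

sample = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="ch1" playOrder="1">
      <navLabel><text>Chapter 1</text></navLabel>
      <content src="ch1.html"/>
    </navPoint>
    <navPoint id="ch2">
      <navLabel><text>Chapter 2</text></navLabel>
      <content src="ch2.html"/>
    </navPoint>
  </navMap>
</ncx>"""

print(navpoints_missing_playorder(sample))  # ['ch2']
```

An empty list means every navPoint carries a playOrder; it does not check that the values are sequential.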
Metadata that hasn’t been proofread. I’m not going to name names, but there’s a major publisher releasing ePubs with their own name misspelled in the dc:publisher field. That’s not only embarrassing, it prevents web-aware ereaders like Bookworm from doing anything useful with that data, like automatically linking back to the publisher’s web site, or showing other books by that publisher.
Improper nesting of the ePub zip file. The mimetype file inside an ePub must be at the top level of the archive, not in a sub-folder. Bookworm won’t accept these documents and epubcheck rejects them. It’s a requirement I might loosen in the future but doing so is not a high priority for me.
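A quick way to spot a nested archive is to look at the entry names in the zip. The sketch below uses Python's standard zipfile module; the function name and messages are invented for illustration, and the content check is deliberately lenient about surrounding whitespace, which a strict validator would not be.

```python
import io
import zipfile

def mimetype_problems(epub_file):
    """Return a list of problems with the archive's mimetype file (empty if none)."""
    with zipfile.ZipFile(epub_file) as archive:
        names = archive.namelist()
        if "mimetype" in names:
            # Present at the top level; check what it contains as well.
            if archive.read("mimetype").strip() != b"application/epub+zip":
                return ["mimetype does not contain application/epub+zip"]
            return []
        nested = [n for n in names if n.endswith("/mimetype")]
        if nested:
            return [f"mimetype is nested at {nested[0]}, not at the top level"]
        return ["mimetype file is missing"]

# Usage with an in-memory archive that wraps everything in a sub-folder.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("book/mimetype", "application/epub+zip")
print(mimetype_problems(demo))
```

This usually happens when someone zips the enclosing folder rather than its contents, so the fix is to re-zip from inside the folder.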
Items declared in the OPF file that are missing from the archive. I could “fix” this in Bookworm by ignoring any missing files, but I’ve been reluctant to do it because it could easily lead to Bookworm appearing to be buggy when it isn’t. For example, I could remove any missing pages from the TOC, but internal document links will be broken, and if what’s missing is ‘Chapter 7’, I think most readers would want to know this. I feel this is a serious enough problem that I plan to continue to reject books that have this issue.
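Checking for this is mostly a matter of comparing the OPF manifest against the zip listing. Here is a rough sketch under a few assumptions: the OPF is located through META-INF/container.xml, manifest hrefs are relative to the OPF file, and the function name is hypothetical. It is not how Bookworm itself does it.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

OPF_NS = "http://www.idpf.org/2007/opf"
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

def missing_manifest_items(epub_file):
    """Return archive paths that the OPF manifest declares but the zip lacks."""
    with zipfile.ZipFile(epub_file) as archive:
        names = set(archive.namelist())
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
        opf_path = rootfile.get("full-path")
        opf = ET.fromstring(archive.read(opf_path))
        opf_dir = posixpath.dirname(opf_path)
        missing = []
        for item in opf.iter(f"{{{OPF_NS}}}item"):
            # Manifest hrefs are resolved against the OPF file's own folder.
            href = unquote(item.get("href", ""))
            full = posixpath.normpath(posixpath.join(opf_dir, href))
            if full not in names:
                missing.append(full)
        return missing

# Usage: a tiny in-memory ePub whose manifest lists a chapter that is absent.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("mimetype", "application/epub+zip")
    z.writestr("META-INF/container.xml",
               f'<container xmlns="{CONTAINER_NS}" version="1.0"><rootfiles>'
               '<rootfile full-path="OEBPS/content.opf" '
               'media-type="application/oebps-package+xml"/>'
               '</rootfiles></container>')
    z.writestr("OEBPS/content.opf",
               f'<package xmlns="{OPF_NS}" version="2.0"><manifest>'
               '<item id="c1" href="ch1.html" media-type="application/xhtml+xml"/>'
               '<item id="c7" href="ch7.html" media-type="application/xhtml+xml"/>'
               '</manifest></package>')
    z.writestr("OEBPS/ch1.html", "<html/>")
print(missing_manifest_items(demo))  # ['OEBPS/ch7.html']
```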
Invalid XHTML. This is pretty common but not serious in the scope of things. A lot of “XHTML” in ePub is really HTML 4.01 or broken XHTML pretending to be valid. Bookworm does want to parse the content a bit (to do some pre-processing like rendering inline SVG as external links, and to extract just the <body> from the file), but if the content isn’t strictly XHTML it can still cope, just as a web browser does. Nevertheless, if your ePub content isn’t really XML, it limits the number of ways that it could be reused.
The exceptional cases are ePubs which are themselves generated from web content, such as blogs or fanfic. Cleaning up real-world HTML is an art form and I don’t expect automated tools like BookGlutton’s HTML to ePub converter (which uses tidy) to be able to make it perfect.
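To see whether a content file is "really XML", you can simply hand it to an XML parser. The sketch below uses Python's standard library and a made-up function name. It only tests well-formedness, not validity against the XHTML DTD, and because this parser does not load the DTD it will also report named entities such as &nbsp; as errors.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(markup):
    """Return the parser's error message, or None if the markup parses as XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None

xhtml = ('<html xmlns="http://www.w3.org/1999/xhtml">'
         '<body><p>A line<br/>break</p></body></html>')
html4 = '<html><body><p>A line<br>break</p></body></html>'

print(well_formedness_error(xhtml))  # None
print(well_formedness_error(html4))  # a "mismatched tag" message from the parser
```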
One idiosyncrasy that isn’t technically a problem but has caused me no end of headaches is the issue of internal links within XHTML content. For example, imagine you have all your content files in a sub-folder, so your ePub looks like:
image1.png as an inline image, what does the value of the src look like?
I see both forms. Bookworm will try to locate the full path first, and then fall back to just looking for the image name anywhere in the archive. This means it could potentially pull the wrong image if you have multiple images with the same name in different sub-folders, but I haven’t seen this happen. (If the src contains an absolute path, it will fail to find the image, a problem that epubcheck would flag.)
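The lookup order described above can be sketched roughly as follows. This is an illustration of the described behavior, not Bookworm's actual code: the function name, the folder names, and the explicit rejection of absolute paths are all assumptions made for the example.

```python
import posixpath

def find_image(src, archive_names):
    """Look up src as a full archive path first, then by bare file name."""
    if src.startswith("/"):
        # Assumption: absolute paths are treated as not found, per the text.
        return None
    if src in archive_names:
        return src
    wanted = posixpath.basename(src)
    for name in archive_names:
        # First match wins, which is how a same-named image in another
        # sub-folder could be picked up by mistake.
        if posixpath.basename(name) == wanted:
            return name
    return None

# Hypothetical archive layout, for illustration only.
names = ["mimetype", "OEBPS/chapter1.html", "OEBPS/images/image1.png"]
print(find_image("OEBPS/images/image1.png", names))  # OEBPS/images/image1.png
print(find_image("image1.png", names))               # OEBPS/images/image1.png
print(find_image("/images/image1.png", names))       # None
```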
To be strictly accurate, any references inside an XHTML file should be relative to that file’s location in the archive. In the above example, the link should be