Filed under bookworm, ebooks.

One of the great things about sitting on 30,000+ ePub books (as uploaded to Bookworm) is the ability to look at what’s happening in real-world ebook production. Today I’m examining file size, which is useful if you happen to be doing resource planning for a cloud-based ebook reading system.

Smallest:    1.6 kilobytes
Largest:     233 megabytes
Total count: 35,854
Total size:  20 gigabytes

I did a frequency analysis of all the individual sizes across the entire corpus:

[Screenshot: frequency histogram of file sizes across the entire corpus]

And then zoomed in on that huge spike in the middle range, between 1M and 5M:

[Screenshot: the same histogram, zoomed in on the 1M to 5M range]

So there’s a peak at 3M, but really anything between 1M and 4M is typical.

I did this analysis earlier in the year when Bookworm had only a paltry 7,800 books, and while the 3M median held, I can say that ebooks in general have gotten larger on average. I attribute that to an increased number of commercially produced books which include images.
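For anyone who wants to run the same kind of analysis on their own collection, here is a minimal Python sketch. It is not the script used for the numbers above. It assumes the books sit as `.epub` files somewhere under a single directory, and the function names (`epub_sizes`, `size_histogram`, `summarize`) and the 1M bucket width are my own choices for illustration.

```python
import os
import statistics
from collections import Counter

MB = 1024 * 1024  # assumption: "1M" is treated as 2**20 bytes


def epub_sizes(root):
    """Yield the size in bytes of every .epub file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".epub"):
                yield os.path.getsize(os.path.join(dirpath, name))


def size_histogram(sizes, bucket_bytes=MB):
    """Count files per bucket; key k covers [k * bucket, (k + 1) * bucket)."""
    return Counter(size // bucket_bytes for size in sizes)


def summarize(sizes):
    """Return count, smallest, largest, total, and median size in bytes."""
    ordered = sorted(sizes)
    if not ordered:
        raise ValueError("no sizes given")
    return {
        "count": len(ordered),
        "smallest": ordered[0],
        "largest": ordered[-1],
        "total": sum(ordered),
        "median": statistics.median(ordered),
    }
```

Calling `summarize(list(epub_sizes(path)))` gives the four figures in the table above plus a median, and `size_histogram` gives the per-megabyte counts behind a frequency chart. Passing a smaller `bucket_bytes` is one way to zoom in on a narrow range.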


9 Responses to “How big is the average ePub book?”

  1. India

    Oh, the things you could do with that corpus! What fun.

    I’d also be interested to know which titles and genres of books have been uploaded most, what devices they’re being downloaded to, how many users have uploaded those 30,000+ books, . . .

    More, more, more said the baby.

  2. Fran Toolan

    Hey Liza,
    do you have similar data for PDFs? I’d be curious to see how that correlates with what we are seeing. Thanks.

  3. Eric Lease Morgan

    While the size of a book in bytes is interesting in and of itself, I would advocate that you measure the size of a book in number of words, especially if you want to give an idea of how long a book is. I have begun to do this against content in my Alex Catalogue.

    Furthermore, the field of digital humanities offers a great many other types of analysis: counting the number of action words, counting this type of sound or that kind of sound, counting the number of times a text mentions “great ideas” or “big names”, or calculating an item’s “readability index”.

    Fun with computers and books.

  4. Dave Cramer

    My biggest is 180MB; smallest is 40k, from a sample of around 3,000 from a major trade publisher. Average around 1.2MB, but the big books push that up. Two-thirds are between 300k and 1MB.


  5. liza

    As you might imagine, doing any kind of real analysis on 30 gigabytes of data is pretty time-consuming. Bytes are easy since the operating system has that information readily available.

    I can probably come up with the total size of all the Bookworm images and subtract that from the epub total, but that isn’t as useful as it could be since the text is compressed.

  6. bowerbird

    yes, 30 gigs is a big chunk, for sure, no question about it.

    on the other hand, it’s useful to know how big the text is,
    as opposed to the images, because you have to manipulate
    the text, whereas you merely have to display the pictures…

    i asked the question because the .epub format combines
    the text and pictures into the .zip file, thereby making it
    difficult to easily compute the size of the text all by itself.
    so the format you’ve chosen complicates finding an answer.

    but you will find that fact more palatable if _you_ say it
    than if _i_ say it.

    (the fact that it’s all compressed is another complication,
    but since most graphics are already at a compressed size,
    and most text compresses at a fairly standard rate, this is
    not a very big concern.)

    your corpus is also dependent on who did the uploading
    — as evidenced by the comment up above that says “hey,
    most of the big books are probably ones we uploaded” —
    so a more useful answer to the question of the size of text
    would probably be found in the project gutenberg corpus.

    which is not to deny that the total size — text and pictures,
    not to mention audio and video in the days to come — is
    a very useful measure in and of itself, especially when you
    consider the matter of “cloud storage” which you mention.
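
The text-versus-pictures split discussed in the comments above can be estimated without unpacking anything, because a zip archive's directory records both the compressed and the uncompressed size of each entry. Below is a rough Python sketch using the standard `zipfile` module. The extension lists and the names `classify` and `epub_breakdown` are assumptions made for illustration, and a stricter version would read media types from the package's OPF manifest rather than guessing from file extensions.

```python
import zipfile

# Assumption: file extensions are a good enough guide to content type.
TEXT_EXTS = (".html", ".xhtml", ".htm", ".xml", ".opf", ".ncx", ".css", ".txt")
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".svg")


def classify(name):
    """Label an archive entry as 'text', 'image', or 'other' by extension."""
    lower = name.lower()
    if lower.endswith(TEXT_EXTS):
        return "text"
    if lower.endswith(IMAGE_EXTS):
        return "image"
    return "other"


def epub_breakdown(path):
    """Return {kind: [compressed_bytes, uncompressed_bytes]} for one ePub.

    Only the zip directory is read; no entry is decompressed.
    """
    totals = {"text": [0, 0], "image": [0, 0], "other": [0, 0]}
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            kind = classify(info.filename)
            totals[kind][0] += info.compress_size
            totals[kind][1] += info.file_size
    return totals
```

Summing the `text` entries over a whole collection gives both the on-disk cost of the text and its uncompressed size, which is the closer proxy for how much text there is to process.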