Over on Tools of Change there’s a post of mine discussing the so-called “analog hole” as it applies to digital books. It was a fun article to write, especially the hands-on part. I used Google’s OCRopus open-source OCR software, which was a little impenetrable to someone outside the machine-learning community but did a good job once I had fumbled around with it for a while.
Also on that page at the moment is a giant photo of my head advertising What Publishers Need to Know About Digitization, a web seminar I’ll be hosting with O’Reilly Media on November 12. It will be a very high-level, introductory overview aimed at non-technical staff in publishing who are considering a digitization project.
Coming full circle, I wonder if there would be interest in a simple web-based OCR service where publishers could upload a scanned document to see how well bare-bones OCR performs on an image-only PDF or JPEG scan. I imagine it might help them predict the complexity of a digitization project and understand some of the challenges inherent in the process.