31

Recovering Text from Corrupted Documents

The ingest process is straightforward when it works correctly: everything runs predictably, and the content flows through the system. Occasionally, though, a rogue document turns up. Perhaps it was saved incorrectly, or the computer crashed while writing it. Some content may still be in the file, but the editing application won’t open it.

Some document formats allow you to extract some or even all of the text from a damaged file. A few formats are so heavily compressed or tokenized that extraction is not feasible. You can approach this at different levels depending on the format of the file and the ...
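
As a rough illustration of the simplest kind of extraction, the sketch below scans the raw bytes of a damaged file and keeps any run of printable characters, much as the Unix strings utility does. It is not taken from the chapter itself. The function name extract_text_runs and the min_len threshold are hypothetical, and the approach assumes the format stores its text uncompressed in a single-byte, ASCII-compatible encoding. It will recover nothing useful from the compressed or tokenized formats mentioned above.

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII characters found in raw file bytes.

    Hypothetical helper: assumes the text is stored uncompressed in an
    ASCII-compatible encoding. Runs shorter than min_len are discarded
    because they are usually binary noise rather than real content.
    """
    # Printable ASCII (space through tilde) plus tab, repeated min_len+ times.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Simulated damaged content: text fragments surrounded by binary debris.
sample = b"\x00\x01Hello, world\x00\xff\x02ab\x00Second paragraph\x07"
recovered = extract_text_runs(sample)
# recovered == ["Hello, world", "Second paragraph"]; "ab" is too short to keep
```

For a real file, you might read the bytes with open(path, "rb").read() and pass them in. The recovered fragments usually need manual review, since formatting codes and stray binary data that happen to be printable will appear alongside the genuine text.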
