Extract Structure from Unstructured Content

As mentioned many times before, the term “unstructured content” is somewhat misleading, because all useful information has some kind of definable structure. Even so, the category covers any content item whose structure is not rigid: e-mails, Microsoft Word documents, spreadsheets with tables and graphs, images, video files, scanned paper documents, and the like.

Most unstructured content items have structured characteristics in common: titles, authors, sections, sub-sections, diagrams, tables, and images. It is certainly possible to create a database schema to store a Microsoft Word document in such atomic units. However, would this be of value to your end users? You could define a document template ...
