The last set of Gutenberg HTML books planned for demonstration on threepress has been added. As usual, data loading took more time and uncovered more problems than expected, which is always a reason to add as many samples as possible. This set includes one non-fiction book (On the Origin of Species) and one with verse components (The Jungle Book); both required significant updates to the XSLT that converts the Gutenberg DTD to TEI.
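
To give a rough sense of what handling verse involves, here is a minimal sketch of mapping stanzas and lines onto TEI's `<lg>` (line group) and `<l>` (line) elements. The actual conversion is done in XSLT; Python's standard library is used here only for illustration. The source element names (`poem`, `stanza`, `line`), the `verse_to_tei` helper, and the placeholder text are all assumptions, not taken from the Gutenberg DTD or the threepress stylesheets.

```python
import xml.etree.ElementTree as ET

# Hypothetical source markup: <poem>, <stanza> and <line> are illustrative
# names, not the real Gutenberg DTD elements. The text is placeholder text.
SOURCE = """<poem>
  <stanza>
    <line>  First line of verse </line>
    <line>Second line of verse</line>
  </stanza>
  <stanza>
    <line>Third line of verse</line>
  </stanza>
</poem>"""


def verse_to_tei(poem):
    """Map a hypothetical <poem> element to nested TEI <lg>/<l> elements."""
    outer = ET.Element("lg", {"type": "poem"})
    for stanza in poem.findall("stanza"):
        group = ET.SubElement(outer, "lg", {"type": "stanza"})
        for line in stanza.findall("line"):
            tei_line = ET.SubElement(group, "l")
            # Trim stray whitespace left over from the source layout.
            tei_line.text = (line.text or "").strip()
    return outer


tei = verse_to_tei(ET.fromstring(SOURCE))
print(ET.tostring(tei, encoding="unicode"))
```

Prose paragraphs need no such nesting, which is one reason a stylesheet written for novels may need extra templates before it can handle a book with embedded verse.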

To expand the project in useful ways I’d like to be able to add:

  1. Other content types besides novels, especially reference works
  2. Content from other document formats, such as DocBook
  3. Native, highly-tagged TEI documents

Wikipedia and its cohorts are by far the largest source of freely licensed data on the web now, but they aren’t encoded in XML. Publishers are unlikely to use wiki formatting to mark up their content, so developing a workflow to convert from wiki to TEI doesn’t seem productive.

XML data welcome!

Tags: project gutenberg, tei, wiki, wikipedia, XML, xslt
