
IBM DeveloperWorks has just released an article of mine on High-Performance XML Parsing in Python.  Although there is nothing publishing-centric about the article itself, it was based on my own experience in dealing with large XML datasets in academic publishing.

Massive XML files are uncommon in the general web development world, where XML serves primarily either as configuration files, read only infrequently, or as an interchange format on the web, where the files are necessarily small. It's rare to encounter XML measured in gigabytes or more; data at that scale usually lives in a relational database.

For that reason I find myself frustrated with many XML tools, even those ostensibly designed to handle large amounts of data. Too often they don't scale well, or at least not easily. I don't believe that scaling should be a black art that each individual developer needs to solve independently. Unfortunately, in commercial products ease of use is a key bullet point, and computationally difficult problems are hard to summarize in a user's guide.
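To give a rough sense of what "scaling" means here, below is a minimal sketch of incremental parsing with lxml's iterparse, one way to keep memory flat on multi-gigabyte files. The file and element names are hypothetical, and this is an illustration of the general technique rather than a recipe from the article itself.

    from lxml import etree

    def fast_iter(source, tag, func):
        # Stream the document, handing each complete <tag> element to func.
        context = etree.iterparse(source, events=("end",), tag=tag)
        for _event, elem in context:
            func(elem)
            # Free the element's content, then drop already-processed
            # siblings, so memory use stays flat regardless of file size.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    # Hypothetical usage: print each record's title from a huge file.
    fast_iter("records.xml", "record",
              lambda elem: print(elem.findtext("title")))

The element-clearing step is the part most tools leave you to discover on your own: without it, even a "streaming" parser quietly accumulates the whole tree.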

I tend to recommend open-source software most strongly in two scenarios: for small projects with limited budgets and for large projects with unique challenges.  There simply isn’t going to be a one-size-fits-all application for most interesting publishing work.

This is one of many reasons I'm excited by Google's willingness to open its Google Books archive to researchers: Python is a first-class programming language in the Google ecosystem, and Google has a good track record of open-sourcing internal tools that have limited commercial value. I expect a lot of interesting work to come out of that archive once it's available.

Tags: article, digitization, Google, IBM, lxml, Python
