#32 Extracting Text from XML (xml_text_extractor.rb)

Counting occurrences of tags is fine, but XML is designed to hold text wrapped in tags, and those tags provide organization beyond what the content alone offers. Even so, sometimes having just the text content is handy. When I was preparing a document using DocBook, I found myself wanting to run a spell checker on it. There are XML-aware spell checkers, but another approach is to run a text extractor on the XML and pass its output to a spell checker that expects plain text. xml_text_extractor.rb is just such a script.

The Code

 #!/usr/bin/env ruby
 # xml_text_extractor.rb

 ❶ CHOMP_TAG = lambda { |tag| tag.to_s.chomp }

 =begin rdoc
 This script uses the Rexml parser, ...
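The listing above is cut off, so the following is only a rough, independent sketch of the general idea described earlier (pull the text out of an XML document so that a plain-text tool can read it). It is not the book's xml_text_extractor.rb. It assumes the REXML library is available, and the method names `collect_text` and `extract_text`, along with the sample DocBook-style markup, are hypothetical names chosen for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: walk the children of +node+ in document order,
# appending the content of each non-blank text node to +out+.
def collect_text(node, out = [])
  node.each do |child|
    case child
    when REXML::Text
      text = child.value.strip          # value returns entity-decoded text
      out << text unless text.empty?    # skip whitespace-only nodes
    when REXML::Element
      collect_text(child, out)          # recurse into nested elements
    end
  end
  out
end

# Hypothetical entry point: parse an XML string and return its text
# content, one text node per line.
def extract_text(xml_source)
  root = REXML::Document.new(xml_source).root
  return '' if root.nil?                # empty input has no root element
  collect_text(root).join("\n")
end

# Illustrative sample input (not taken from the book).
sample = <<~XML
  <article>
    <title>Speling Test</title>
    <para>Some <emphasis>marked-up</emphasis> text.</para>
  </article>
XML

puts extract_text(sample)
```

Because each text node ends up on its own line, inline markup such as the `emphasis` element above splits a sentence across several lines. That is usually harmless for a spell checker, which works word by word, but a tool that cares about sentence boundaries would need the pieces joined differently.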
