Summary

This chapter has shown how to solve several text processing problems, none of which would be simple to do in most programming languages. The critical lessons of this chapter are:

  • Data markup is extremely valuable, although it need not be complex. A unique single character, such as a tab, colon, or comma, often suffices.

  • Pipelines of simple Unix tools and short, often inline, programs in a suitable text processing language, such as awk, can exploit data markup to pass multiple pieces of data through a series of processing stages, emerging with a useful report; the first sketch after this list illustrates the idea.

  • By keeping the data markup simple, the output of our tools can readily become input to new tools, as shown by our little analysis of the output of the word-frequency filter, wf, applied to Shakespeare's texts (see the second sketch below).

  • By preserving some minimal markup in the output, we can later come back and massage that data further, as we did to turn a simple ASCII office directory into a web page (third sketch below). Indeed, it is wise never to consider any form of electronic data as final: there is a growing demand in some quarters for page-description languages, such as PCL, PDF, and PostScript, to preserve the original markup that led to the page formatting. Word processor documents are currently almost devoid of useful logical markup, but that may change in the future. At the time of this writing, one prominent word processor vendor was reported to be considering an XML representation for document storage. The GNU Project's gnumeric spreadsheet, the Linux ...
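
As a minimal sketch of the first two lessons, suppose a hypothetical file named staff holds name:office:phone records, one per line. The single-character colon markup is all that sort and a one-line inline awk program need to cooperate on a simple report:

    # Sort the hypothetical staff file by name, then format each
    # colon-delimited record into aligned columns.
    sort -t: -k1,1 staff |
        awk -F: '{ printf("%-25s %-10s %s\n", $1, $2, $3) }'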
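
Because wf emits plain count-and-word pairs, its output feeds directly into further filters. A sketch, assuming (as in this chapter) that wf takes the number of entries to report as its argument, and using hamlet.txt as a stand-in filename for one of Shakespeare's texts:

    # How many words occur exactly once? A large argument makes wf
    # report every word; awk keeps the singletons; wc counts them.
    wf 999999 < hamlet.txt |
        awk '$1 == 1' |
        wc -l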
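
The same minimal markup makes the later conversion to a web page almost mechanical. The following is only a sketch in awk, not necessarily the chapter's own solution, and it again assumes the hypothetical staff file:

    # Wrap each name:office:phone record in simple HTML table markup.
    awk -F: 'BEGIN { print "<html><body><table>" }
             { printf("<tr><td>%s</td><td>%s</td><td>%s</td></tr>\n",
                      $1, $2, $3) }
             END   { print "</table></body></html>"' staff > staff.html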
