You are previewing Beautiful Code.

Beautiful Code

Cover of Beautiful Code by Greg Wilson... Published by O'Reilly Media, Inc.
  1. Beautiful Code
    1. SPECIAL OFFER: Upgrade this ebook with O’Reilly
    2. A Note Regarding Supplemental Files
    3. Foreword
    4. Preface
      1. How This Book Is Organized
      2. Conventions Used in This Book
      3. Using Code Examples
      4. How to Contact Us
      5. Safari® Enabled
    5. 1. A Regular Expression Matcher
      1. 1.1. The Practice of Programming
      2. 1.2. Implementation
      3. 1.3. Discussion
      4. 1.4. Alternatives
      5. 1.5. Building on It
      6. 1.6. Conclusion
    6. 2. Subversion's Delta Editor: Interface As Ontology
      1. 2.1. Version Control and Tree Transformation
      2. 2.2. Expressing Tree Differences
      3. 2.3. The Delta Editor Interface
      4. 2.4. But Is It Art?
      5. 2.5. Abstraction As a Spectator Sport
      6. 2.6. Conclusions
    7. 3. The Most Beautiful Code I Never Wrote
      1. 3.1. The Most Beautiful Code I Ever Wrote
      2. 3.2. More and More with Less and Less
      3. 3.3. Perspective
      4. 3.4. What Is Writing?
      5. 3.5. Conclusion
      6. 3.6. Acknowledgments
    8. 4. Finding Things
      1. 4.1. On Time
      2. 4.2. Problem: Weblog Data
      3. 4.3. Problem: Who Fetched What, When?
      4. 4.4. Search in the Large
      5. 4.5. Conclusion
    9. 5. Correct, Beautiful, Fast (in That Order): Lessons from Designing XML Verifiers
      1. 5.1. The Role of XML Validation
      2. 5.2. The Problem
      3. 5.3. Version 1: The Naïve Implementation
      4. 5.4. Version 2: Imitating the BNF Grammar O(N)
      5. 5.5. Version 3: First Optimization O(log N)
      6. 5.6. Version 4: Second Optimization: Don't Check Twice
      7. 5.7. Version 5: Third Optimization O(1)
      8. 5.8. Version 6: Fourth Optimization: Caching
      9. 5.9. The Moral of the Story
    10. 6. Framework for Integrated Test: Beauty Through Fragility
      1. 6.1. An Acceptance Testing Framework in Three Classes
      2. 6.2. The Challenge of Framework Design
      3. 6.3. An Open Framework
      4. 6.4. How Simple Can an HTML Parser Be?
      5. 6.5. Conclusion
    11. 7. Beautiful Tests
      1. 7.1. That Pesky Binary Search
      2. 7.2. Introducing JUnit
      3. 7.3. Nailing Binary Search
      4. 7.4. Conclusion
    12. 8. On-the-Fly Code Generation for Image Processing
    13. 9. Top Down Operator Precedence
      1. 9.1. JavaScript
      2. 9.2. Symbol Table
      3. 9.3. Tokens
      4. 9.4. Precedence
      5. 9.5. Expressions
      6. 9.6. Infix Operators
      7. 9.7. Prefix Operators
      8. 9.8. Assignment Operators
      9. 9.9. Constants
      10. 9.10. Scope
      11. 9.11. Statements
      12. 9.12. Functions
      13. 9.13. Array and Object Literals
      14. 9.14. Things to Do and Think About
    14. 10. The Quest for an Accelerated Population Count
      1. 10.1. Basic Methods
      2. 10.2. Divide and Conquer
      3. 10.3. Other Methods
      4. 10.4. Sum and Difference of Population Counts of Two Words
      5. 10.5. Comparing the Population Counts of Two Words
      6. 10.6. Counting the 1-Bits in an Array
      7. 10.7. Applications
    15. 11. Secure Communication: The Technology Of Freedom
      1. 11.1. The Heart of the Start
      2. 11.2. Untangling the Complexity of Secure Messaging
      3. 11.3. Usability Is the Key
      4. 11.4. The Foundation
      5. 11.5. The Test Suite
      6. 11.6. The Functioning Prototype
      7. 11.7. Clean Up, Plug In, Rock On…
      8. 11.8. Hacking in the Himalayas
      9. 11.9. The Invisible Hand Moves
      10. 11.10. Speed Does Matter
      11. 11.11. Communications Privacy for Individual Rights
      12. 11.12. Hacking the Civilization
    16. 12. Growing Beautiful Code in BioPerl
      1. 12.1. BioPerl and the Bio::Graphics Module
      2. 12.2. The Bio::Graphics Design Process
      3. 12.3. Extending Bio::Graphics
      4. 12.4. Conclusions and Lessons Learned
    17. 13. The Design of the Gene Sorte
      1. 13.1. The User Interface of the Gene Sorter
      2. 13.2. Maintaining a Dialog with the User over the Web
      3. 13.3. A Little Polymorphism Can Go a Long Way
      4. 13.4. Filtering Down to Just the Relevant Genes
      5. 13.5. Theory of Beautiful Code in the Large
      6. 13.6. Conclusion
    18. 14. How Elegant Code Evolves with Hardware The Case of Gaussian Elimination
      1. 14.1. The Effects of Computer Architectures on Matrix Algorithms
      2. 14.2. A Decompositional Approach
      3. 14.3. A Simple Version
      4. 14.4. LINPACK's DGEFA Subroutine
      5. 14.5. LAPACK DGETRF
      6. 14.6. Recursive LU
      7. 14.7. ScaLAPACK PDGETRF
      8. 14.8. Multithreading for Multi-Core Systems
      9. 14.9. A Word About the Error Analysis and Operation Count
      10. 14.10. Future Directions for Research
      11. 14.11. Further Reading
    19. 15. The Long-Term Benefits of Beautiful Design
      1. 15.1. My Idea of Beautiful Code
      2. 15.2. Introducing the CERN Library
      3. 15.3. Outer Beauty
      4. 15.4. Inner Beauty
      5. 15.5. Conclusion
    20. 16. The Linux Kernel Driver Model: The Benefits of Working Together
      1. 16.1. Humble Beginnings
      2. 16.2. Reduced to Even Smaller Bits
      3. 16.3. Scaling Up to Thousands of Devices
      4. 16.4. Small Objects Loosely Joined
    21. 17. Another Level of Indirection
      1. 17.1. From Code to Pointers
      2. 17.2. From Function Arguments to Argument Pointers
      3. 17.3. From Filesystems to Filesystem Layers
      4. 17.4. From Code to a Domain-Specific Language
      5. 17.5. Multiplexing and Demultiplexing
      6. 17.6. Layers Forever?
    22. 18. Python's Dictionary Implementation: Being All Things to All People
      1. 18.1. Inside the Dictionary
      2. 18.2. Special Accommodations
      3. 18.3. Collisions
      4. 18.4. Resizing
      5. 18.5. Iterations and Dynamic Changes
      6. 18.6. Conclusion
      7. 18.7. Acknowledgments
    23. 19. Multidimensional Iterators in NumPy
      1. 19.1. Key Challenges in N-Dimensional Array Operations
      2. 19.2. Memory Models for an N-Dimensional Array
      3. 19.3. NumPy Iterator Origins
      4. 19.4. Iterator Design
      5. 19.5. Iterator Interface
      6. 19.6. Iterator Use
      7. 19.7. Conclusion
    24. 20. A Highly Reliable Enterprise System for NASA's Mars Rover Mission
      1. 20.1. The Mission and the Collaborative Information Portal
      2. 20.2. Mission Needs
      3. 20.3. System Architecture
      4. 20.4. Case Study: The Streamer Service
      5. 20.5. Reliability
      6. 20.6. Robustness
      7. 20.7. Conclusion
    25. 21. ERP5: Designing for Maximum Adaptability
      1. 21.1. General Goals of ERP
      2. 21.2. ERP5
      3. 21.3. The Underlying Zope Platform
      4. 21.4. ERP5 Project Concepts
      5. 21.5. Coding the ERP5 Project
      6. 21.6. Conclusion
    26. 22. A Spoonful of Sewage
    27. 23. Distributed Programming with MapReduce
      1. 23.1. A Motivating Example
      2. 23.2. The MapReduce Programming Model
      3. 23.3. Other MapReduce Examples
      4. 23.4. A Distributed MapReduce Implementation
      5. 23.5. Extensions to the Model
      6. 23.6. Conclusion
      7. 23.7. Further Reading
      8. 23.8. Acknowledgments
      9. 23.9. Appendix: Word Count Solution
    28. 24. Beautiful Concurrency
      1. 24.1. A Simple Example: Bank Accounts
      2. 24.2. Software Transactional Memory
      3. 24.3. The Santa Claus Problem
      4. 24.4. Reflections on Haskell
      5. 24.5. Conclusion
      6. 24.6. Acknowledgments
    29. 25. Syntactic Abstraction: The syntax-case Expander
      1. 25.1. Brief Introduction to syntax-case
      2. 25.2. Expansion Algorithm
      3. 25.3. Example
      4. 25.4. Conclusion
    30. 26. Labor-Saving Architecture: An Object-Oriented Framework for Networked Software
      1. 26.1. Sample Application: Logging Service
      2. 26.2. Object-Oriented Design of the Logging Server Framework
      3. 26.3. Implementing Sequential Logging Servers
      4. 26.4. Implementing Concurrent Logging Servers
      5. 26.5. Conclusion
    31. 27. Integrating Business Partners the RESTful Way
      1. 27.1. Project Background
      2. 27.2. Exposing Services to External Clients
      3. 27.3. Routing the Service Using the Factory Pattern
      4. 27.4. Exchanging Data Using E-Business Protocols
      5. 27.5. Conclusion
    32. 28. Beautiful Debugging
      1. 28.1. Debugging a Debugger
      2. 28.2. A Systematic Process
      3. 28.3. A Search Problem
      4. 28.4. Finding the Failure Cause Automatically
      5. 28.5. Delta Debugging
      6. 28.6. Minimizing Input
      7. 28.7. Hunting the Defect
      8. 28.8. A Prototype Problem
      9. 28.9. Conclusion
      10. 28.10. Acknowledgments
      11. 28.11. Further Reading
    33. 29. Treating Code As an Essay
    34. 30. When a Button Is All That Connects You to the World
      1. 30.1. Basic Design Model
      2. 30.2. Input Interface
      3. 30.3. Efficiency of the User Interface
      4. 30.4. Download
      5. 30.5. Future Directions
    35. 31. Emacspeak: The Complete Audio Desktop
      1. 31.1. Producing Spoken Output
      2. 31.2. Speech-Enabling Emacs
      3. 31.3. Painless Access to Online Information
      4. 31.4. Summary
      5. 31.5. Acknowledgments
    36. 32. Code in Motion
      1. 32.1. On Being "Bookish"
      2. 32.2. Alike Looking Alike
      3. 32.3. The Perils of Indentation
      4. 32.4. Navigating Code
      5. 32.5. The Tools We Use
      6. 32.6. DiffMerge's Checkered Past
      7. 32.7. Conclusion
      8. 32.8. Acknowledgments
      9. 32.9. Further Reading
    37. 33. Writing Programs for "The Book"
      1. 33.1. The Nonroyal Road
      2. 33.2. Warning to Parenthophobes
      3. 33.3. Three in a Row
      4. 33.4. The Slippery Slope
      5. 33.5. The Triangle Inequality
      6. 33.6. Meandering On
      7. 33.7. "Duh!"—I Mean "Aha!"
      8. 33.8. Conclusion
      9. 33.9. Further Reading
    38. A. Afterword
    39. B. Contributors
    40. Colophon
    41. SPECIAL OFFER: Upgrade this ebook with O’Reilly
O'Reilly logo

Problem: Weblog Data

Let's look at a sample problem to get a feel for how a search works in real life. I have a directory containing logfiles from my weblog (http://www.tbray.org/ongoing) from early 2003 to late 2006; as of the writing of this chapter, they recorded 140,070,104 transactions and occupied 28,489,788,532 bytes (uncompressed). All these statistics, properly searched, can answer lots of questions about my traffic and readership.

Let's look at a simple question first: which articles have been read the most? It may not be instantly obvious that this problem is about search, but it is. First of all, you have to search through the logfiles to find the lines that record someone fetching an article. Second, you have to search through those lines to find the name of the article they fetched. Third, you have to keep track, for each article, of how often it was fetched.

Here is an example of one line from one of these files, which wraps to fit the page in this book, but is a single long line in the file:

	c80-216-32-218.cm-upc.chello.se - - [08/Oct/2006:06:37:48 -0700] "GET /ongoing/When/
	200x/2006/10/08/Grief-Lessons HTTP/1.1" 200 5945 "http://www.tbray.org/ongoing/"
	"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

Reading from left to right, this tells us that:

Somebody from an organization named chello in Sweden,
who provided neither a username nor a password,
contacted my weblog early in the morning of October 8, 2006 (my server's time zone is seven hours off Greenwich),
and requested a resource named /ongoing/When/200x/2006/10/08/Grief-Lessons
using the HTTP 1.1 protocol;
the request was successful and returned 5,945 bytes;
the visitor had been referred from my blog's home page,
and was using Internet Explorer 6 running on Windows XP.

This is an example of the kind of line I want: one that records the actual fetch of an article. There are lots of other lines that record fetching stylesheets, scripts, pictures, and so on, and attacks by malicious users. You can spot the kind of line I want by the fact that the article's name starts with /ongoing/When/ and continues with elements for the decade, year, month, and day.

Our first step, then, should be to find lines that contain something like:

	/ongoing/When/200x/2006/10/08/

Whatever language you're programming in, you could spend lots of time writing code to match this pattern character by character. Or you could apply regular expressions.

Regular Expressions

Regular expressions are special languages designed specifically for matching patterns in text. If you learn how to use them well, you'll save yourself immense amounts of time and irritation. I've never met a really accomplished programmer who wasn't a master of regular expressions (often called regexps for short). Chapter 1, by Brian Kernighan, is dedicated to the beauty of regular expressions.

Because the filenames on my web site match such a strict, date-based pattern, a very straightforward regular expression can find the logfile lines I'm interested in. Other sites' logfiles might require a more elaborate one. Here it is:

	"GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+ "

A glance at this line instantly reveals one of the problems with regular expressions; they're not the world's most readable text. Some people might challenge their appearance in a book called Beautiful Code. Let's put that issue aside for a moment and look at this particular expression. The only thing you need to know is that in this particular flavor of regular expression:

\d

Means "match any digit, 0 through 9"

[^ .]

Means "match any character that's not a space or period"[1]

+

Means "match one or more instances of whatever came just before the +"

That [^ .]+, then, means that the last slash has to be followed by a bunch of nonspace and nonperiod characters. There's a space after the + sign, so the regular expression stops when that space is found.

This regular expression won't match a line where the filename contains a period. So it will match Grief-Lessons, the example I showed earlier from my logfile, but not IMG0038.jpg.

Putting Regular Expressions to Work

A regular expression standing by itself, as shown above, can be used on the command line to search files. But it turns out that most modern computer languages allow you to use them directly in program code. Let's do that, and write a program that prints out only the lines that match the expression, which is to say a program that records all the times someone fetched an article from the weblog.

This example (and most other examples in this chapter) is in the Ruby programming language because I believe it to be, while far from perfect, the most readable of languages.

If you don't know Ruby, learning it will probably make you a better programmer. In Chapter 29, the creator of Ruby, Yukihiro Matsumoto (generally known as "Matz"), discusses some of the design choices that have attracted me and so many other programmers to the language.

Example 4-1 shows our first Ruby program, with added line numbers on the left side. (All the examples in this chapter are available from the O'Reilly web site for this book.)

Example 4-1. Printing article-fetch lines

1 ARGF.each_line do |line|
2   if line =~ %r{GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+ }
3   puts line
4   end 
5 end

Running this program prints out a bunch of logfile lines that look like the first example. Let's have a line-by-line look at it:

Line 1

We want to read all the lines of the input, and we don't care whether they're from files named on the command line or are being piped in from another program on the standard input. The designers of Ruby believe strongly that programmers shouldn't have to write ugly code to deal with common situations, and this is a common situation. So, ARGF is a special variable that represents all the input sources. If the command line includes arguments, ARGF assumes they're names of files and opens them one by one; if there aren't any, it uses the standard input.

each_line is a method that you can call on pretty well any file-like object, such as ARGF. It reads the lines of input and passes them, one at a time, to a "block" of following code.

The following do says that the block getting the input stretches from there to the corresponding end, and the |line| asks that the each_line method load each line into the variable line before giving it to the block.

This kind of loop may surprise the eyes of a new convert to Ruby, but it's concise, powerful, and very easy to follow after just a bit of practice.

Line 2

This is a pretty straightforward if statement. The only magic is the =~, which means "matches" and expects to be followed by regular expression. You can tell Ruby that something is a regular expression by putting slashes before and after it—for example, /this-is-a-regex/. But the particular regular expression we want to use is full of slashes. So to use the slash syntax, you'd have to "escape" them by turning each / into \/, which would be ugly. In this case, therefore, the %r trick produces more beautiful code.

Line 3

We're inside the if block now. So, if the current line matches the regexp, the program executes puts line, which prints out the line and a line feed.

Lines 4 and 5

That's about all there is to it. The first end terminates the if, and the second terminates the do. They look kind of silly dangling off the bottom of the code, and the designers of Python have figured out a way to leave them out, which leads to some Python code being more beautiful than the corresponding Ruby.

So far, we've shown how regular expressions can be used to find the lines in the logfile that we're interested in. But what we're really interested in is counting the fetches for each article. The first step is to identify the article names. Example 4-2 is a slight variation on the previous program.

Example 4-2. Printing article names

1 ARGF.each_line do |line|
2   if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) } 
3     puts $1
4   end
5 end

The differences are subtle. In line 2, I've added a pair of parentheses (in boldface) around the interesting part of the article name in the regular expression. In line 3, instead of printing out the whole value of line, I print out $1, which in Ruby (and several other regular-expression-friendly languages) means "the first place in the regular expression marked with parentheses." You can mark lots of different pieces of the expression, and thus use $2, $3, and so on.

The first few lines of output produced by running this program over some logfile data look like this:

	2003/10/10/FooCampMacs
	2006/11/13/Rough-Mix
	2003/05/22/StudentLookup
	2003/11/13/FlyToYokohama
	2003/07/31/PerlAngst
	2003/05/21/RDFNet
	2003/02/23/Democracy
	2005/12/30/Spolsky-Recursion
	2004/05/08/Torture
	2004/04/27/RSSticker

Before we go to work determining the popularity of different articles, I'd like to argue that in some important ways, this code is beautiful. Take a moment and think of the code you'd have to write to look at an arbitrary chunk of text and do the same matching and selection work done by the parenthesized regexp. It would be quite a few lines of code, and it would be easy to get wrong. Furthermore, if the format of the logfile changed, fixing the pattern matcher would be error-prone and irritating.

Under the covers, the way that regular expressions work is also among the more wonderful things in computer science. It turns out that they can conveniently be translated into finite automata. These automata are mathematically elegant, and there are astoundingly efficient algorithms for matching them against the text you're searching. The great thing is that when you're running an automaton, you have to look only once at each character in the text you're trying to match. The effect is that a well-built regular expression engine can do pattern matching and selection faster than almost any custom code, even if it were written in hand-optimized assembly language. That's beautiful.

I think that the Ruby code is pretty attractive, too. Nearly every character of the program is doing useful work. Note that there are no semicolons on the ends of the lines, nor parentheses around the conditional block, and that you can write puts line instead of puts(line). Also, variables aren't declared—they're just used. This kind of stripped-down language design makes for programs that are shorter and easier to write, as well as (more important) easier to read and easier to understand.

Thinking in terms of time, regular expressions are a win/win. It takes the programmer way less time to write them than the equivalent code, it takes less time to deliver the program to the people waiting for it, it uses the computer really efficiently, and the program's user spends less time sitting there bored.

Content-Addressable Storage

Now we're approaching the core of our problem, computing the popularity of articles. We'll have to pull the article name out of each line, look it up to see how many times it's been fetched, add one to that number, and then store it away again.

This may be the most basic of search patterns: we start with a key (what we're using to search—in this case, an article name), and we're looking for a value (what we want to find—in this case, the number of times the article has been fetched). Here are some other examples:

Key

Value

Word

List of web pages containing the word

Employee number

Employee's personnel record

Passport number

"true" or "false," indicating whether the person with that passport should be subject to extra scrutiny

What programmers really want in this situation is a very old idea in computer science: content-addressable memory, also known as an associative store and various other permutations of those words. The idea is to put the key in and get the value out. There actually exists hardware which does just that; it mostly lives deep in the bowels of microprocessors, providing rapid access to page tables and memory caches.

The good news is that you, the programmer, using any modern computer language, have access to excellent software implementations of associative memory. Different languages call these implementations different things. Often they are implemented as hash tables; in Java, Perl, and Ruby, which use this technique, they are called Hashes, HashMaps, or something similar. In Python, they are called dictionaries, and in the computer algebra language Maple, simply tables.

Now if you're an eager search-algorithm fan just itching to write your own super-efficient search, this may sound like bad news, not good news. But think about those flavors of time; if you use the built-in associative store, the amount of programmer time and management invested in writing search algorithms goes to nearly zero.

By writing your own search, you might be able to save a little computer (and thus end-user) time, compared to the built-in version, but on the other hand, you might not; the people who write these things tend to be pretty clever. Andrew Kuchling has written Chapter 18 of this book on one such effort.

Associative stores are so important that dynamically typed languages such as Ruby and Python have not only built-in support, but special syntax for defining and using them. Let's use Ruby's hashes to count article popularity in Example 4-3.

Example 4-3. Counting article fetches

1 counts = {}
2 counts.default = 0
3
4 ARGF.each_line do |line|
5   if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
6     counts[$1] += 1
7   end
8 end

This program isn't that much different from the version in Example 4-2. Line 1 creates an empty Hash called counts. Line 2 gives the array a "default value" of zero; hold on for an explanation of that.

Then, in line 6, instead of printing out the article name, the name serves as the key to look up the number of fetches of this article seen so far in counts, add one to it, and store the value.

Now, consider what happens when the program sees some article name stored in $1 for the first time. I could write code along the lines of "if there is a counts[$1], then add one to it; otherwise, set counts[$1] to one." The designers of Ruby hate that kind of awkwardness; this is why they provided the notion of a "default value" for a Hash. If you look up a key the Hash doesn't know about, it says "OK, zero," allowing you to write counts[$1]+=1 and have it always just work.

I originally stated the problem as "Which of my articles have been read the most?" That's kind of fuzzy; let's interpret it to mean "Print out the top 10 most popular articles." The resulting program is shown in Example 4-4.

Example 4-4. Reporting the most popular articles

1 counts = {}
2 counts.default = 0
3
4 ARGF.each_line do |line|
5   if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
6     counts[$1] += 1
7   end
8 end
9
10 keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
11 keys_by_count[0 .. 9].each do |key|
12   puts "#{counts[key]}: #{key}"
13 end

Line 10 looks a little less beautiful to me than most Ruby code, but it's easy enough to understand. The keys method of counts returns an array containing all of the Hash's keys. Because of the hash implementation, the keys are stored in no predictable order, and are also returned by the keys method in random order. So, I have to sort them and store them back in a new array.

In Ruby, sort is accompanied by a code block, here enclosed in curly braces. (In Ruby, you can delimit a block either with do and end or with { and }.) The sort works its way back and forth through the array being sorted, passing pairs of elements to the block, which has to return a negative number, 0, or a positive number depending on whether the first element is less than, equal to, or greater than the second.

In this case, we want to get the data out of the hash in an order defined by the values (the counts themselves) rather than by the filenames (the keys), so we have to sort the keys by their values. Have a close look at the code, and you'll see how it works. Because this is something that people do all the time, I'm surprised that Ruby's Hash doesn't come with sort_by_value.

We use a decreasing order for the sort so that, no matter how many articles we've found, we know the first 10 items in keys_by_count represent the top 10 articles in popularity.

Now that we have an array of keys (article names) sorted in descending order of how many times they've been fetched, we can accomplish our assignment by printing out the first 10. Line 11 is simple, but a word is in order about that each method. In Ruby, you almost never see a for statement because anything whose elements you might want to loop through has an each method that does it for you.

Line 12 may be a little hard to read for the non-Rubyist because of the #{} syntax, but it's pretty straightforward.

So, let's declare victory on our first assignment. It took us only 13 lines of easy-to-read code. A seasoned Rubyist would have squeezed the last three lines into one.

Let's run this thing and see what it reports. Instead of running it over the whole 28 GB, let's just use it on a week's data: a mere 1.2 million records comprising 245 MB.

	~/dev/bc/ 548>zcat ~/ongoing/logs/2006-12-17.log.gz | \
                                   time ruby code/report-counts.rb
     4765: 2006/12/11/Mac-Crash
     3138: 2006/01/31/Data-Protection
     1865: 2006/12/10/EMail
     1650: 2006/03/30/Teacup
     1645: 2006/12/11/Java
     1100: 2006/07/28/Open-Data
     900: 2006/11/27/Choose-Relax
     705: 2003/09/18/NXML
     692: 2006/07/03/July-1-Fireworks
     673: 2006/12/13/Blog-PR
            13.54 real         7.49 user                      0.73 sys

This run took place on my 1.67 GHz Apple PowerBook. The results are unsurprising, but the program does seem kind of slow. Should we worry about performance?

Time to Optimize?

I was wondering whether my sample run was really unreasonably slow, so I pulled together a very similar program in Perl, a language that is less beautiful than Ruby but is extremely fast. Sure enough, the Perl version took half the time. So, should we try to optimize?

We need to think about time again. Yes, we might be able to make this run faster, and thus reduce the program execution time and the time a user spends waiting for it, but to do this we'd have to burn some of the programmer's time, and thus the time the user waits for the programmer to get the program written. In most cases, my instinct would be that 13.54 seconds to process a week's data is OK, so I'd declare victory. But let's suppose we're starting to get gripes from people who use the program, and we'd like to make it run faster.

Glancing over Example 4-4, we can see that the program falls into two distinct parts. First, it reads all the lines and tabulates the fetches; then it sorts them to find the top 10.

There's an obvious optimization opportunity here: why bother sorting all the fetch tallies when all we really want to do is pick the top 10? It's easy enough to write a little code to run through the array once and pick the 10 highest elements.

Would that help? I found out by instrumenting the program to find out how much time it spent doing its two tasks. The answer was (averaging over a few runs) 7.36 seconds in the first part and 0.07 in the second. Which is to say, "No, it wouldn't help."

Might it be worthwhile to try to optimize the first part? Probably not; all it does is match regular expressions, and store and retrieve data using a Hash, and these are among the most heavily optimized parts of Ruby.

So, getting fancy in replacing that sort would probably waste the time of the programmer and the customer waiting for the code, without saving any noticeable amount of computer or waiting-user time. Also, experience would teach that you're not apt to go much faster than Perl does for this kind of task, so the amount of speedup you're going to get is pretty well bounded.

We've just finished writing a program that does something useful and turns out to be all about search. But we haven't come anywhere near actually writing any search algorithms. So, let's do that.



[1] People who have used regular expressions know that a period is a placeholder for "any character," but it's harder to remember that when a period is enclosed in square brackets, it loses the special meaning and refers to just a period.

The best content for your career. Discover unlimited learning on demand for around $1/day.