Regular Expressions

At the time of writing this chapter, I was spending some time watching the Dow Jones Industrial Average, as the world was in the middle of a major financial meltdown. If you’re wondering what this has to do with Ruby or regular expressions, take a quick look at the following code:

require "open-uri"
loop do
  # On Ruby 3.0 and later, open-uri fetches URLs through URI.open rather than a bare open
  puts( URI.open("http://finance.google.com/finance?cid=983582").read[
  /<span class="\w+" id="ref_983582_c">([+-]?\d+\.\d+)/m, 1] )
  sleep(30)
end

In just a couple of lines, I was able to throw together a script that would poll Google Finance and pull down the current average price of the Dow. This sort of “find a needle in the haystack” extraction is what regular expressions are all about.

Of course, the art of constructing regular expressions is often veiled in mystery. Even simple patterns such as this one might make some folks feel a bit uneasy:

/<span class="\w+" id="ref_983582_c">([+-]?\d+\.\d+)/m

This expression is simple compared with some other examples we could show, but it still makes use of a number of regular expression concepts. All in one line, we can see character classes (both general and special), escapes, quantifiers, groups, and a switch that enables multiline matching.
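To make those pieces easier to see, here is a sketch of the same pattern spread out with the /x (extended) switch, which ignores unescaped whitespace and allows comments. The sample markup and the names DOW_PATTERN and dow_figure are assumptions made for illustration, not actual Google Finance output. The /m switch is left off in this sketch because the pattern contains no unescaped dot for it to affect.

```ruby
# The ticker pattern, one concept per line. In /x mode a literal space
# has to be escaped, which is why "\ " appears in the markup portions.
DOW_PATTERN = /
  <span\ class="\w+"      # literal text, then a shortcut class with +
  \ id="ref_983582_c">    # more literal text
  (                       # open capture group 1
    [+-]?                 # explicit character class, made optional
    \d+ \. \d+            # digits, an escaped literal dot, more digits
  )                       # close capture group 1
/x

# Returns the captured figure as a string, or nil when nothing matches.
def dow_figure(html)
  html[DOW_PATTERN, 1]
end

# Hypothetical sample markup, shaped like what the script expects.
puts dow_figure('<span class="chg" id="ref_983582_c">+123.45</span>')
```

Keeping the extraction in its own method also means the pattern can be checked against a saved string without hitting the network.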

Patterns are dense because they are written in a special syntax, which acts as a sort of domain language for matching and extracting text. The reason that it may be considered daunting is that this language is made up of so few special characters:

\ [ ] . ^ $ ? * + { } | ( )

At their heart, regular expressions are nothing more than a facility to do find and replace operations. This concept is so familiar that anyone who has used a word processor has a strong grasp of it. Using a regex, you can easily replace all instances of the word “Mitten” with “Kitten”, just like your favorite text editor or word processor can:

some_string.gsub(/\bMitten\b/,"Kitten")
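As a small sketch of what the \b markers buy you in that replacement, the following wraps the call in a hypothetical helper named kittenize (the method name and sample sentences are made up for illustration). Whole-word occurrences are replaced, while a longer word that merely contains “Mitten” is left alone.

```ruby
# Replace the whole word "Mitten" only; \b keeps "Mittens" untouched.
def kittenize(text)
  text.gsub(/\bMitten\b/, "Kitten")
end

puts kittenize("Mitten chased another Mitten")
puts kittenize("A pair of Mittens")
```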

Many programmers get this far and stop. They learn to use regex as if it were a necessary evil rather than an essential technique. We can do better than that. In this section, we’ll look at a few guidelines for how to write effective patterns that do what they’re supposed to without getting too convoluted. I’m assuming you’ve done your homework and are at least familiar with regex basics as well as Ruby’s pattern syntax. If that’s not the case, pick up your favorite language reference and take a few minutes to review the fundamentals.

As long as you can comfortably read the first example in this section, you’re ready to move on. If you can convince yourself that writing regular expressions is actually much easier than people tend to think it is, the tips and tricks to follow shouldn’t cause you to break a sweat.

Don’t Work Too Hard

Although regular expression syntax is compact, it’s relatively easy to write bloated patterns if you don’t consciously remember to keep things clean and tight. We’ll now take a look at a couple of sources of extra fat and see how to trim them down.

Alternation is a very powerful regex tool. It allows you to match one of a series of potential sequences. For example, if you want to match the name “James Gray” but also match “James gray”, “james Gray”, and “james gray”, the following code will do the trick:

>> ["James Gray", "James gray", "james gray", "james Gray"].all? { |e|
?>   e.match(/(James|james) (Gray|gray)/) }
=> true

However, you don’t need to work so hard. You’re really talking about possible alternations of simply two characters, not two full words. You could write this far more efficiently using a character class:

>> ["James Gray", "James gray", "james gray", "james Gray"].all? { |e|
?>   e.match(/[Jj]ames [Gg]ray/) }
=> true

This makes your pattern clearer and also gives Ruby’s regex engine less work to do, because a character class is checked in a single step rather than by trying each alternative in turn. So in addition to looking better, this code is typically faster.
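One detail worth checking when writing the alternation form by hand is precedence: | binds more loosely than anything else in a pattern, so without parentheses each alternative runs all the way to the next |. The sketch below compares three spellings; the constant names and the “gray whale” test string are assumptions chosen for illustration.

```ruby
# Without groups this reads as: "James" OR "james Gray" OR "gray".
UNGROUPED  = /James|james Gray|gray/
# Parentheses limit each alternation to a single word.
GROUPED    = /(James|james) (Gray|gray)/
# The character class version says the same thing more compactly.
CHAR_CLASS = /[Jj]ames [Gg]ray/

def name_match?(pattern, text)
  text.match?(pattern)
end

puts name_match?(UNGROUPED, "gray whale")   # matches on "gray" alone
puts name_match?(GROUPED, "gray whale")
puts name_match?(CHAR_CLASS, "gray whale")
```

The character class version sidesteps the precedence question entirely, which is one more reason to prefer it.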

In a similar vein, it is unnecessary to use explicit character classes when a shortcut will do. To match a four-digit number, we could write:

/[0-9][0-9][0-9][0-9]/

which can of course be cleaned up a bit using repetitions:

/[0-9]{4}/

However, we can do even better by using the special class built in for this:

/\d{4}/

It pays to learn what shortcuts are available to you. Here’s a quick list for further study, in case you’re not already familiar with them:

. \s \S \w \W \d \D

Each one of these shortcuts corresponds to a literal character class that is more verbose when written out. Using shortcuts increases clarity and decreases the chance of bugs creeping in via ill-defined patterns. Though it may seem a bit terse at first, you’ll be able to sight-read them with ease over time.
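As a rough illustration of that correspondence, the sketch below pairs two of the shortcuts with their spelled-out equivalents and compares the results. The method names and sample string are made up for this example, and the equivalence is only claimed for plain ASCII input.

```ruby
# \d and its verbose equivalent
def digits_shortcut(text)
  text.scan(/\d+/)
end

def digits_verbose(text)
  text.scan(/[0-9]+/)
end

# \w and its verbose equivalent
def words_shortcut(text)
  text.scan(/\w+/)
end

def words_verbose(text)
  text.scan(/[a-zA-Z0-9_]+/)
end

sample = "Order 66, aisle_7 at 10:30"
p digits_shortcut(sample) == digits_verbose(sample)
p words_shortcut(sample) == words_verbose(sample)
```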

Anchors Are Your Friends

One way to match my name in a string is to write the following simple pattern:

string =~ /Gregory Brown/

However, consider the following:

>> "matched" if "Mr. Gregory Browne".match(/Gregory Brown/)
=> "matched"

Oftentimes we mean “match this phrase,” but we write “match this sequence of characters.” The solution is to make use of anchors to clarify what we mean.

Sometimes we want to match only if a string starts with a phrase:

>> phrases = ["Mr. Gregory Browne", "Mr. Gregory Brown is cool",
              "Gregory Brown is cool", "Gregory Brown"]

>> phrases.grep /\AGregory Brown\b/
=> ["Gregory Brown is cool", "Gregory Brown"]

Other times we want to ensure that the string contains the phrase:

>> phrases.grep /\bGregory Brown\b/
=> ["Mr. Gregory Brown is cool", "Gregory Brown is cool", "Gregory Brown"]

And finally, sometimes we want to ensure that the string matches an exact phrase:

>> phrases.grep /\AGregory Brown\z/
=> ["Gregory Brown"]

Although I am using English names and phrases here for simplicity, this can of course be generalized to encompass any sort of matching pattern. You could be verifying that a sequence of numbers fits a certain form, or something equally abstract. The key thing to take away from this is that when you use anchors, you’re being much more explicit about how you expect your pattern to match, which in most cases means that you’ll have a better chance of catching problems faster, and an easier time remembering what your pattern was supposed to do.
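For instance, here is a minimal sketch of checking that a sequence of numbers fits a certain form. The three-digits, dash, four-digits format and both method names are assumptions invented for this example. Without anchors the pattern is satisfied by any string that merely contains the form; with \A and \z the entire string has to fit.

```ruby
# Unanchored: true if the form appears anywhere in the text.
def contains_number_form?(text)
  text.match?(/\d{3}-\d{4}/)
end

# Anchored: true only if the whole string is exactly the form.
def exact_number_form?(text)
  text.match?(/\A\d{3}-\d{4}\z/)
end

puts contains_number_form?("call 555-1234 now")
puts exact_number_form?("call 555-1234 now")
puts exact_number_form?("555-1234")
```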

An interesting thing to note about anchors is that they don’t actually match characters. Instead, they match between characters to allow you to assert certain expectations about your strings. So when you use something like \b, you are actually matching at one of four positions: between \w and \W, between \W and \w, at \A when the string starts with a word character, or at \z when it ends with one. In English, that means that you’re transitioning from a word character to a nonword character, or from a nonword character to a word character, or you’re at the beginning or end of the string next to a word character. If you review the use of \b in the previous examples, it should now be clear how anchors work.
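One way to see that \b sits between characters is to substitute a visible marker at every position where it matches. In this sketch (mark_boundaries is a hypothetical helper name), no original character is consumed by the substitution; the markers are simply inserted at the word boundaries.

```ruby
# Insert "|" at each word boundary; \b itself is zero characters wide.
def mark_boundaries(text)
  text.gsub(/\b/, "|")
end

puts mark_boundaries("Gregory Brown")
# The match for an anchored pattern is no longer than the word itself.
puts "Gregory Brown"[/\bBrown\b/].length
```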

The anchors you will use most often in Ruby are \A, \Z, \z, ^, $, and \b, along with \B, the negation of \b. Each has its own merits, so be sure to read up on them.
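As a starting point for that reading, here is a small sketch of how the line anchors differ from the string anchors in Ruby. The method names and the “bar” sample strings are assumptions for illustration. In Ruby, ^ and $ match at the start and end of any line, \A and \z match only at the very start and end of the string, and \Z also tolerates a single trailing newline.

```ruby
# ^ and $ work per line.
def line_anchored?(text)
  text.match?(/^bar$/)
end

# \A and \z work on the whole string.
def string_anchored?(text)
  text.match?(/\Abar\z/)
end

# \Z allows one newline at the very end.
def newline_tolerant?(text)
  text.match?(/\Abar\Z/)
end

puts line_anchored?("foo\nbar")
puts string_anchored?("foo\nbar")
puts newline_tolerant?("bar\n")
```

For validating input, the string anchors are usually the safer choice, since a line anchor can be satisfied by any one line of a multiline value.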

Use Caution When Working with Quantifiers

One of the most common antipatterns I picked up when first learning regular expressions was to make use of .* everywhere. Though this practice may seem innocent, it is similar to my bad habit of using rm -Rf on the command line all the time instead of just rm. Both can result in catastrophe when used incorrectly.

But maybe you’re not as crazy as I am. Instead, maybe you’ve been writing innocent things like /(\d*)Foo/ to match any number of digits prepended to the word “Foo”.

For some cases, this works great:

>> "1234Foo"[/(\d*)Foo/,1]
=> "1234"

But does this surprise you?

>> "xFoo"[/(\d*)Foo/,1]
=> ""

It may not, but then again, it may. It’s relatively common to forget that * always matches. At first glance, the following code seems fine:

if num = string[/(\d*)Foo/,1]
  Integer(num)
end

However, when the digits are missing, the match still succeeds and captures an empty string rather than returning nil, so the condition is truthy and Integer("") raises an ArgumentError. The solution is simple. If you really mean “at least one,” use + instead:

if num = string[/(\d+)Foo/,1]
  Integer(num)
end
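To see both behaviors side by side, here is a sketch that wraps each variant in a method. The method names are hypothetical. With *, a string such as "xFoo" yields an empty capture, which is truthy in Ruby and makes Integer raise; with +, the same string yields nil and the conversion is skipped.

```ruby
# Star version: the capture may be "", which still passes the nil check.
def number_before_foo_star(string)
  num = string[/(\d*)Foo/, 1]
  Integer(num) if num
end

# Plus version: no digits means no match, so num is nil.
def number_before_foo_plus(string)
  num = string[/(\d+)Foo/, 1]
  Integer(num) if num
end

p number_before_foo_plus("1234Foo")
p number_before_foo_plus("xFoo")

begin
  number_before_foo_star("xFoo")
rescue ArgumentError => e
  puts "star version failed: #{e.message}"
end
```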

Though more experienced folks might not easily be trapped by something so simple, there are more subtle variants. For example, if we intend to match only “Greg” or “Gregory”, the following code doesn’t quite work:

>> "Gregory"[/Greg(ory)?/]
=> "Gregory"
>> "Greg"[/Greg(ory)?/]
=> "Greg"
>> "Gregor"[/Greg(ory)?/]
=> "Greg"

Even if the pattern looks close to what we want, we can see the results don’t fit. The following modifications remedy the issue:

>> "Gregory"[/\bGreg(ory)?\b/]
=> "Gregory"
>> "Greg"[/\bGreg(ory)?\b/]
=> "Greg"
>> "Gregor"[/\bGreg(ory)?\b/]
=> nil

Notice that the pattern now properly matches Greg or Gregory, but no other words. The key thing to take away here is that unbounded zero-matching quantifiers are tautologies. They can never fail to match, so you need to be sure to account for that.
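The same comparison can be run over a list. In this sketch the sample names and both method names are made up for illustration; the unbounded pattern accepts every entry because the optional group can always match nothing, while the \b version keeps only entries that contain “Greg” or “Gregory” as a whole word.

```ruby
# Optional group with no boundaries: anything containing "Greg" passes.
def loose_gregs(list)
  list.grep(/Greg(ory)?/)
end

# Word boundaries on both sides: only the two intended words pass.
def strict_gregs(list)
  list.grep(/\bGreg(ory)?\b/)
end

names = ["Greg", "Gregory", "Gregor", "Gregorian", "Hi, Greg!"]
p loose_gregs(names)
p strict_gregs(names)
```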

A final gotcha about quantifiers is that they are greedy by default. This means they’ll try to consume as much of the string as possible while still letting the overall pattern match. The following is an example of a greedy match:

>> "# x # y # z #"[/#(.*)#/,1]
=> " x # y # z "

As you can see, this code matches everything between the first and last # character. But sometimes, we want processing to happen from the left and end as soon as we have a match. To do this, append a ? to the repetition:

>> "# x # y # z #"[/#(.*?)#/,1]
=> " x "

All quantifiers can be made nongreedy this way. Remembering this will save a lot of headaches in the long run.
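Here is a short sketch of the same idea applied to + on a made-up snippet of markup (the method names and sample text are assumptions for illustration). It also shows a common alternative to a nongreedy quantifier: a negated character class that simply cannot run past the closing delimiter.

```ruby
# Greedy: runs to the last ">" in the text.
def first_tag_greedy(text)
  text[/<.+>/]
end

# Nongreedy: stops at the first ">" that completes a match.
def first_tag_lazy(text)
  text[/<.+?>/]
end

# Negated class: greedy, but unable to cross a ">".
def first_tag_negated(text)
  text[/<[^>]+>/]
end

sample = "<b>bold</b> and <i>italic</i>"
puts first_tag_greedy(sample)
puts first_tag_lazy(sample)
puts first_tag_negated(sample)
```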

Though our treatment of regular expressions has been by no means comprehensive, these few basic tips will really carry you a long way. The key things to remember are:

  • Regular expressions are nothing more than a special language for find-and-replace operations, built on simple logical constructs.

  • There are lots of shortcuts built in for common regular expression operations, so be sure to make use of special character classes and other simplifications when you can.

  • Anchors provide a way to set up some expectation about where in a string you want to look for a match. These help with both optimization and pattern correctness.

  • Quantifiers such as * and ? will always match, so they should not be used without sufficient boundaries.

  • Quantifiers are greedy by default, and can be made nongreedy via ?.

By following these guidelines, you’ll write clearer, more accurate, and faster regular expressions. As a result, it’ll be a whole lot easier to revisit them when you run into them in your own old code a few months down the line.

A final note on regular expressions is that sometimes we are seduced by their power and overlook other solutions that may be more robust for certain needs. In both the stock ticker and AFM parsing examples, we were working within the realm where regular expressions are a quick, easy, and fine way to go.

However, as documents take on more complex structures, and your needs move from extracting some values to attempting to fully parse a document, you will probably need to look to other techniques that involve full-blown parsers such as Treetop, Ghost Wheel, or Racc. These libraries can solve problems that regular expressions can’t solve, and if you find yourself with data that’s hard to map a regex to, it’s worth looking at these alternative solutions.

Of course, your mileage will vary based on the problem at hand, so don’t be afraid of trying a regex-based solution first before pulling out the big guns.
