Chapter 17. How Regexes Work

Mark Jason Dominus

This isn’t an article about how to use regexes; you’ve seen plenty of those already. It’s about how you would write a regex package from scratch, in a language like C that didn’t already have one. I demonstrate a new module, Regex.pm, which implements regexes from nothing, in Perl. This will give you an idea of how regex matching is possible, although the details differ rather substantially from what Perl actually does.

Here’s the basic strategy. We’ll see a kind of “machine” that reads some input, one character at a time, and then, depending on what’s in the input and on the various wheels and gears in the machine, says either “yes” or “no”. The machines are simple, and it turns out that if we have a regex, it’s not hard to construct a machine that says “yes” for exactly those strings that match the regex, and “no” for all the other strings.

When our program wants to see if S matched /R/, it’ll do something like this:

  • Look at R.

  • Construct the machine that corresponds to R.

  • Feed S into the machine.

  • If the machine says “yes”, then S matched /R/. (Otherwise, it didn’t.)

Maybe this sounds bizarre, but it’s what Perl does. If you can follow what happens in this article, you’ll know what Perl is really up to when it performs a regex match.

Machines

We’re on a tight budget here, so our machines will be made of circles and arrows instead of wheels and gears. This diagram shows a machine.

Figure 17-1. 

Let’s see if this machine says “yes” to the string ...

Get Computer Science & Perl Programming now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.