How REs Work in Practice

Problem

You want to know how these metacharacters work in practice.

Solution

Wherein I give a few more examples for the benefit of those who have not been exposed to REs.

In building patterns, you can use any combination of ordinary text and the metacharacters or special characters in Chapter 4. For example, the two-character RE ^T would match beginning of line (^) immediately followed by a capital T, i.e., any line beginning with a capital T. It doesn’t matter whether the line begins with Tiny trumpets, or Titanic tubas, or Triumphant trombones, as long as the capital T is present in the first position.

But here we’re not very far ahead. Have we really invested all this effort in RE technology just to be able to do what we could already do with the java.lang.String method startsWith( ) ? Hmmm, I can hear some of you getting a bit restless. Stay in your seats! What if you wanted to match not only a letter T in the first position, but also a vowel (a, e, i, o, or u) immediately after it, followed by any number of letters in a word, followed by an exclamation point? Surely you could do this in Java by checking startsWith(“T”) and charAt(1) == ‘a’ || charAt(1) == ‘e', and so on? Yes, but by the time you did that, you’d have written a lot of very highly specialized code that you couldn’t use in any other application. With regular expressions, you can just give the pattern ^T[aeiou]\w*. That is, ^ and T as before, followed by a character classlisting the vowels, followed by any number of word characters (\w*), followed by the exclamation point.

“But wait, there’s more!” as my late great boss Yuri Rubinsky used to say. What if you want to be able to change the pattern you’re looking for at runtime? Remember all that Java code you just wrote to match T in column 1 plus a vowel, some word-characters and an exclamation point? Well, it’s time to throw it out. Because this morning we need instead to match Q, followed by a letter other than u, followed by a number of digits, followed by a period. While some of you start writing a new function to do that, the rest of us will just saunter over to the RegExp Bar & Grille, order a ^Q[^u]\d+\. from the bartender, and be on our way.

Huh? Oh, the [^u] means “match any one character that is not the character u.” The \d+ means one or more numeric digits. Remember that + is a multiplier meaning one or more, and \d is any one numeric digit. (Remember that \n -- which sounds as though it might mean numeric digit -- actually means a newline.) Finally, the \.? Well, . by itself is a metacharacter. Single metacharacters are switched off by preceding them with an escape character. No, don’t hit that ESC key on your keyboard. The RE “escape” character is a backslash. Preceding a metacharacter like . with escape turns off its special meaning. Preceding a few selected alphabetic characters (n, r, t, s, w) with escape turns them into metacharacters. In some other implementations, escape also precedes (, ), <, and > to turn them into metacharacters.

One good way to think of regular expressions is as a “little language” for matching patterns of characters in text contained in strings. Give yourself extra points if you’ve already recognized this as the design pattern known as Interpreter. A regular expression API is an interpreter for matching regular expressions.

As for how REs work in theory -- the logic behind it and the different types of RE engines -- the reader is referred to the book Mastering Regular Expressions.

Get Java Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.