Regular Expressions Cookbook by Steven Levithan, Jan Goyvaerts

Basic Regular Expression Skills

The problems presented in this chapter aren’t the kind of real-world problems that your boss or your customers ask you to solve. Rather, they’re technical problems you’ll encounter while creating and editing regular expressions to solve real-world problems. The first recipe, for example, explains how to match literal text with a regular expression. This isn’t a goal on its own, because you don’t need a regex when all you want to do is to search for literal text. But when creating a regular expression, you’ll likely need it to match certain text literally, and you’ll need to know which characters to escape. Recipe 2.1 tells you how.

The recipes start out with very basic regular expression techniques. If you’ve used regular expressions before, you can probably skim or even skip them. The recipes further down in this chapter will surely teach you something new, unless you have already read Mastering Regular Expressions by Jeffrey E. F. Friedl (O’Reilly) cover to cover.

We devised the recipes in this chapter in such a way that each explains one aspect of the regular expression syntax. Together, they form a comprehensive tutorial to regular expressions. Read it from start to finish to get a firm grasp of regular expressions. Or dive right in to the real-world regular expressions in Chapters 4 through 8, and follow the references back to this chapter whenever those chapters use some syntax you’re not familiar with.

This tutorial chapter deals with regular expressions only and completely ignores any programming considerations. The next chapter is the one with all the code listings. You can peek ahead to Programming Languages and Regex Flavors in Chapter 3 to find out which regular expression flavor your programming language uses. The flavors themselves, which this chapter talks about, were introduced in Regex Flavors Covered by This Book.

2.1. Match Literal Text

Problem

Create a regular expression to exactly match this gloriously contrived sentence: The punctuation characters in the ASCII table are: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

Solution

The punctuation characters in the ASCII table are: !"#\$%&'\(\)\*\+,-\./:;<=>\?@\[\\]\^_`\{\|}~
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
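
As a quick check, the solution can be run against the sentence from the problem. The sketch below uses Python's re module purely as an illustration, since this chapter itself is language-neutral. The names SENTENCE, SOLUTION, and matches_exactly are made up for the example, and the period that closes the problem statement is assumed not to be part of the text to match.

```python
import re

# Text from the problem statement (raw string, so the backslash stays a backslash).
SENTENCE = r'''The punctuation characters in the ASCII table are: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'''

# The solution regex, with only the metacharacters escaped.
SOLUTION = r'''The punctuation characters in the ASCII table are: !"#\$%&'\(\)\*\+,-\./:;<=>\?@\[\\]\^_`\{\|}~'''


def matches_exactly(regex, text):
    """Return True if the regex matches the whole text, not just part of it."""
    return re.fullmatch(regex, text) is not None


print(matches_exactly(SOLUTION, SENTENCE))  # True
```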

Discussion

Any regular expression that does not include any of the dozen characters $()*+.?[\^{| simply matches itself. To find whether Mary had a little lamb in the text you’re editing, simply search for Mary had a little lamb. It doesn’t matter whether the “regular expression” checkbox is turned on in your text editor.
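
In code, the same point looks like this. The following is a minimal Python sketch with an invented sample text: a regex without metacharacters finds exactly what a plain substring search finds.

```python
import re

text = "Everyone knows that Mary had a little lamb."
needle = "Mary had a little lamb"  # no metacharacters, so it matches itself

regex_hit = re.search(needle, text)

# The regex search and the plain substring search agree on the position.
print(regex_hit.start() == text.find(needle))  # True
```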

The 12 punctuation characters that make regular expressions work their magic are called metacharacters. If you want your regex to match them literally, you need to escape them by placing a backslash in front of them. Thus, the regex:

\$\(\)\*\+\.\?\[\\\^\{\|

matches the text:

$()*+.?[\^{|
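
This can also be verified in code. The Python sketch below, with illustrative variable names, confirms that putting a backslash in front of each of the twelve metacharacters produces the regex shown, and that this regex matches the metacharacters literally.

```python
import re

METACHARACTERS = r"$()*+.?[\^{|"             # the twelve metacharacters as plain text
ESCAPED_REGEX = r"\$\(\)\*\+\.\?\[\\\^\{\|"  # each one preceded by a backslash

# Adding the backslashes one character at a time gives the same regex.
built = "".join("\\" + ch for ch in METACHARACTERS)

print(built == ESCAPED_REGEX)                                   # True
print(re.fullmatch(ESCAPED_REGEX, METACHARACTERS) is not None)  # True
```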

Notably absent from the list are the closing square bracket ], the hyphen -, and the closing curly bracket }. The first two become metacharacters only after an unescaped [, and the } only after an unescaped {. There’s no need to ever escape }. Metacharacter rules for the blocks that appear between [ and ] are explained in Recipe 2.3.

Escaping any other nonalphanumeric character does not change how your regular expression works—at least not when working with any of the flavors discussed in this book. Escaping an alphanumeric character either gives it a special meaning or throws a syntax error.
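
To illustrate with one flavor, the sketch below uses Python's re module. Python 3.6 or later is assumed, where an escaped ASCII letter without a defined meaning is rejected. Other flavors may treat a given escaped letter differently, and the helper name compiles is made up for the example.

```python
import re

# Escaping punctuation that is not a metacharacter changes nothing.
print(re.fullmatch(r'\!\"\@', '!"@') is not None)  # True
print(re.fullmatch(r'!"@', '!"@') is not None)     # True

# Escaping a letter can give it a special meaning: \d matches a digit, not "d".
print(re.fullmatch(r"\d", "7") is not None)  # True
print(re.fullmatch(r"\d", "d") is not None)  # False


def compiles(regex):
    """Return True if this flavor accepts the regex, False on a syntax error."""
    try:
        re.compile(regex)
        return True
    except re.error:
        return False


# \q has no special meaning in this flavor, so it is reported as an error.
print(compiles(r"\q"))  # False
```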

People new to regular expressions often escape every punctuation character in sight. Don’t let anyone know you’re a newbie. Escape judiciously. A jungle of needless backslashes makes regular expressions hard to read, particularly when all those backslashes have to be doubled up to quote the regex as a literal string in source code.

Variations

Block escape

The punctuation characters in the ASCII table are: \Q!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~\E
Regex options: None
Regex flavors: Java 6, PCRE, Perl

Perl, PCRE, and Java support the regex tokens \Q and \E. \Q suppresses the meaning of all metacharacters, including the backslash, until \E. If you omit \E, all characters after the \Q until the end of the regex are treated as literals.

The only benefit of \Q...\E is that it is easier to read than \.\.\..
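
Python's re module is not among the flavors listed for this variation and does not accept \Q...\E. As a stand-in rather than an equivalent token, its standard re.escape function backslash-escapes a string so that it can be embedded in a regex as literal text. A minimal sketch, with illustrative variable names:

```python
import re

LITERAL = r'''!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'''
PREFIX = "The punctuation characters in the ASCII table are: "

# re.escape stands in for \Q...\E: everything in LITERAL is matched literally.
regex = PREFIX + re.escape(LITERAL)

print(re.fullmatch(regex, PREFIX + LITERAL) is not None)  # True
```

Unlike \Q...\E, this approach hides the escaping in code rather than in the regex itself, so the readability benefit applies to the source file, not to the compiled pattern.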

Warning

Though Java 4 and 5 support this feature, you should not use it. Bugs in the implementation cause regular expressions with \Q\E to match different things from what you intended, and from what PCRE, Perl, or Java 6 would match. These bugs were fixed in Java 6, making it behave the same way as PCRE and Perl.

Case-insensitive matching

ascii
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

(?i)ascii
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

By default, regular expressions are case sensitive. regex matches regex but not Regex, REGEX, or ReGeX. To make regex match all of those, you need to turn on case insensitivity.

In most applications, that’s a simple matter of marking or clearing a checkbox. All programming languages discussed in the next chapter have a flag or property that you can set to make your regex case insensitive. Recipe 3.4 in the next chapter explains how to apply the regex options listed with each regular expression solution in this book in your source code.
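
In Python, for example, that option is the re.IGNORECASE flag. The sketch below is only one instance of such a flag and does not replace the per-language instructions in Recipe 3.4.

```python
import re

# The option is passed alongside the regex instead of being written inside it.
caseless = re.compile("regex", re.IGNORECASE)
sensitive = re.compile("regex")

for candidate in ["regex", "Regex", "REGEX", "ReGeX"]:
    print(candidate,
          caseless.fullmatch(candidate) is not None,
          sensitive.fullmatch(candidate) is not None)
```

With the flag set, all four spellings match; without it, only the all-lowercase one does.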

If you cannot turn on case insensitivity outside the regex, you can do so within by using the (?i) mode modifier, such as (?i)regex. This works with the .NET, Java, PCRE, Perl, Python, and Ruby flavors.
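
A typical case is a function that accepts only the regex itself. In the Python sketch below, find_all is a hypothetical helper of that kind, so the mode modifier is the only way to ask for case-insensitive matching.

```python
import re


def find_all(regex, text):
    """Hypothetical helper that takes only a regex string, with no way to pass options."""
    return re.findall(regex, text)


text = "regex, Regex, REGEX, and ReGeX"

print(find_all("regex", text))      # ['regex']
print(find_all("(?i)regex", text))  # ['regex', 'Regex', 'REGEX', 'ReGeX']
```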

.NET, Java, PCRE, Perl, and Ruby support local mode modifiers, which affect only part of the regular expression. sensitive(?i)caseless(?-i)sensitive matches sensitiveCASELESSsensitive but not SENSITIVEcaselessSENSITIVE. (?i) turns on case insensitivity for the remainder of the regex, and (?-i) turns it off for the remainder of the regex. They act as toggle switches.

Recipe 2.10 shows how to use local mode modifiers with groups instead of toggles.
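
Python is not among the flavors listed as supporting the toggle form, but Python 3.6 and later do accept the group form, which limits the modifier to the part of the regex inside the group. A small sketch under that assumption:

```python
import re

# (?i:...) applies case insensitivity only to the text inside the group.
regex = r"sensitive(?i:caseless)sensitive"

print(re.fullmatch(regex, "sensitiveCASELESSsensitive") is not None)  # True
print(re.fullmatch(regex, "SENSITIVEcaselessSENSITIVE") is not None)  # False
```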

See Also

Recipes 2.3 and 5.14
