Learning Ruby by Michael Fitzgerald

Regular Expressions

You have already seen regular expressions in action. A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. The syntax for regular expressions was invented by mathematician Stephen Kleene in the 1950s.

I'll spend a little time demonstrating some patterns to search for strings. In this short discussion, you'll learn the fundamentals: how to use basic string patterns, square brackets, alternation, grouping, anchors, shortcuts, repetition operators, and braces. Table 4-1 lists the syntax for regular expressions in Ruby.

We need a little text to munch on. Here are the opening lines of Shakespeare's 29th sonnet:

opening = "When in disgrace with fortune and men's eyes\nI all alone beweep my outcast state,\n"

Note that this string contains two lines, set off by the newline character \n.

You can match the first line just by using a word in the pattern:

opening.grep(/men/) # => ["When in disgrace with fortune and men's eyes\n"]

By the way, grep is not a String method; it comes from the Enumerable module, which the String class includes in Ruby 1.8, so it is available for processing strings. (In Ruby 1.9 and later, String no longer includes Enumerable; call grep on the result of opening.lines instead.) grep takes a pattern as an argument, and can also take a block (see http://www.ruby-doc.org/core/classes/Enumerable.html).
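As a minimal sketch of grep with and without a block, assuming Ruby 1.9 or later (so the string is first split with lines), the block form transforms each matching line before it is collected. The variable names matches and shouted are only illustrative:

```ruby
# Assumes Ruby 1.9+, where String#lines yields a collection that supports grep.
opening = "When in disgrace with fortune and men's eyes\nI all alone beweep my outcast state,\n"

# Without a block, grep returns the lines that match the pattern.
matches = opening.lines.grep(/men/)

# With a block, grep returns the block's result for each matching line.
shouted = opening.lines.grep(/men/) { |line| line.chomp.upcase }

p matches
p shouted
```

The block is called only for lines that match, so it is a compact way to filter and transform in one pass.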

When you use a pair of square brackets ([]), you can match any one of the characters in the brackets. Let's try to match the word man or men using []:

opening.grep(/m[ae]n/) # => ["When in disgrace with fortune and men's eyes\n"]

It would also match a line with the word man in it.
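To illustrate, here is a small sketch that runs the same pattern over a made-up list of strings (the samples array is hypothetical and not part of the sonnet):

```ruby
# Hypothetical sample strings, used only to show what the class matches.
samples = ["one man", "two men", "the moon"]

# [ae] matches exactly one character: either a or e.
hits = samples.grep(/m[ae]n/)

p hits
```

Only the strings containing man or men are returned; moon is skipped because o is not in the brackets.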

Alternation lets you match alternate forms of a pattern using the pipe character (|):

opening.grep(/men|man/) # => ["When in disgrace with fortune and men's eyes\n"]

Grouping uses parentheses to group a subexpression, like this one that contains an alternation:

opening.grep(/m(e|a)n/) # => ["When in disgrace with fortune and men's eyes\n"]
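Grouping does more than limit the scope of the alternation: the parentheses also remember what they matched. A minimal sketch using the match method (the variable names are illustrative):

```ruby
line = "When in disgrace with fortune and men's eyes"

# match returns a MatchData object; index 0 is the whole match,
# index 1 is the text captured by the first pair of parentheses.
m = line.match(/m(e|a)n/)
whole = m[0]
vowel = m[1]

p whole
p vowel
```

Here the group records which alternative, e or a, was actually found.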

Anchors anchor a pattern to the beginning (^) or end ($) of a line:

opening.grep(/^When in/) # => ["When in disgrace with fortune and men's eyes\n"]
opening.grep(/outcast state,$/) # => ["I all alone beweep my outcast state,\n"]

The ^ means that a match is found when the text When in is at the beginning of a line, and $ matches outcast state, only when it is found at the end of a line.
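Because ^ and $ work per line, they can match several times inside one multiline string. A short sketch using scan, which collects every match (this assumes the same opening string as above):

```ruby
opening = "When in disgrace with fortune and men's eyes\nI all alone beweep my outcast state,\n"

# ^ matches at the start of the string and after each newline.
first_words = opening.scan(/^\w+/)

# $ matches just before each newline (and at the end of the string).
last_words = opening.scan(/\S+$/)

p first_words
p last_words
```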

One way to specify the beginning and ending of strings in a pattern is with shortcuts. Shortcut syntax is brief: a single character preceded by a backslash. For example, the \d shortcut represents a digit; it is the same as using [0-9] but, well, shorter. Similar to ^, the shortcut \A matches the beginning of a string, not a line:

opening.grep(/\AWhen in/) # => ["When in disgrace with fortune and men's eyes\n"]

Similar to $, the shortcut \z matches the end of a string, not a line. Because opening ends with a newline, \z finds no match immediately after the comma:

opening.grep(/outcast state,\z/) # => []

The shortcut \Z matches at the end of a string or, if the string ends with a newline character (\n), just before that newline, so it does match here:

opening.grep(/outcast state,\Z/) # => ["I all alone beweep my outcast state,\n"]
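The difference between $, \z, and \Z is easiest to see on a short string that ends with a newline. A minimal sketch (the text and bare variables are made up for illustration):

```ruby
text = "outcast state,\n"   # ends with a newline
bare = "outcast state,"     # no trailing newline

end_of_line = text =~ /state,$/    # $ matches before the newline
strict_end  = text =~ /state,\z/   # \z requires the very end of the string
lenient_end = text =~ /state,\Z/   # \Z tolerates one trailing newline
bare_end    = bare =~ /state,\Z/   # \Z also matches with no newline at all

p end_of_line
p strict_end
p lenient_end
p bare_end
```

Only the \z attempt fails, returning nil, because a newline still follows the comma.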

Let's figure out how to match a phone number in the form (555)123-4567. Supposing that the string phone contains a phone number like this, the following pattern will find it:

phone.grep(/(\(\d\d\d\))?\d\d\d-\d\d\d\d/) # => ["(555)123-4567"]

The backslash precedes the inner parentheses (\(...\)) to let the regexp engine know that these are literal characters. Otherwise, the engine will see the parentheses as enclosing a subexpression. The three \ds between the escaped parentheses represent three digits. The hyphen (-) is just an unambiguous character, so you can use it in the pattern as is.

The question mark (?) is a repetition operator. It indicates zero or one occurrence of the previous pattern. So the phone number you are looking for can have an area code in parentheses, or not. The area-code pattern is surrounded by unescaped parentheses, ( and ), which group it so that the ? operator applies to the entire area code. (Square brackets would not do this: they define a character class that matches only a single character.) Either form of the phone number, with or without the area code, will work. Here is a way to use ? with just a single character, u:

color.grep(/colou?r/) # => ["I think that colour is just right for your office."]
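To see ? applied to a whole area code, here is a minimal sketch that assumes the area code is made optional with a parenthesized group (a character class in square brackets would match only one character). The sample numbers are hypothetical:

```ruby
# The unescaped parentheses group the area code; ? makes the group optional.
phone_pattern = /(\(\d\d\d\))?\d\d\d-\d\d\d\d/

# String#[] with a regexp returns the matched text, or nil if nothing matches.
with_area    = "(555)123-4567"[phone_pattern]
without_area = "123-4567"[phone_pattern]
too_short    = "12-4567"[phone_pattern]

p with_area
p without_area
p too_short
```

Both complete numbers match in full, while the malformed one returns nil.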

The plus sign (+) operator indicates one or more of the previous pattern, in this case digits:

phone.grep(/(\(\d+\))?\d+-\d+/) # => ["(555)123-4567"]

Braces ({}) let you specify the exact number of digits, such as \d{3} or \d{4}:

phone.grep(/(\(\d{3}\))?\d{3}-\d{4}/) # => ["(555)123-4567"]
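The + and brace forms are not interchangeable: + accepts any run of digits, while braces pin the count. The sketch below compares the two on a hypothetical list of candidates. As an added assumption, both patterns are anchored with \A and \z so that the whole string must match:

```ruby
# Anchored variants of the patterns (the anchors are an assumption of this sketch).
loose  = /\A(\(\d+\))?\d+-\d+\z/
strict = /\A(\(\d{3}\))?\d{3}-\d{4}\z/

# Hypothetical candidates; only the first two are well-formed phone numbers.
candidates = ["(555)123-4567", "123-4567", "1-2", "(5)12-34567"]

loose_hits  = candidates.grep(loose)
strict_hits = candidates.grep(strict)

p loose_hits
p strict_hits
```

The loose pattern accepts every candidate, whereas the strict one keeps only the correctly sized numbers.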

Tip

It is also possible to indicate an "at least" amount with {m,}, and a minimum/maximum number with {m,n}.
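A brief sketch of the two range forms, using made-up digit strings and anchoring each pattern with \A and \z (an assumption of this example) so the counts apply to the whole string:

```ruby
at_least_three = /\A\d{3,}\z/    # three or more digits
two_to_four    = /\A\d{2,4}\z/   # at least two, at most four digits

# =~ returns the match offset (0 here, because of \A) or nil.
short_vs_min = "12"     =~ at_least_three
long_vs_min  = "123456" =~ at_least_three
mid_vs_range = "1234"   =~ two_to_four
long_vs_range = "123456" =~ two_to_four

p short_vs_min
p long_vs_min
p mid_vs_range
p long_vs_range
```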

The String class also has the =~ and !~ operators. If =~ finds a match, it returns the offset position where the match starts in the string; otherwise, it returns nil:

color =~ /colou?r/ # => 13

The !~ operator returns true if the pattern does not match the string, false otherwise:

color !~ /colou?r/ # => false
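Since =~ returns an offset or nil, it works directly as a condition. A minimal sketch, assuming color holds the sentence shown in the earlier output (purple is a made-up pattern that does not occur in it):

```ruby
color = "I think that colour is just right for your office."

position = color =~ /colou?r/   # offset where the match starts
missing  = color =~ /purple/    # nil, because the pattern is absent
negated  = color !~ /purple/    # true, because the pattern is absent

# nil is falsy and any offset (even 0) is truthy, so =~ fits in an if.
verdict = (color =~ /colou?r/) ? "found" : "not found"

p position
p missing
p negated
p verdict
```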

Also of interest are the Regexp and MatchData classes. The Regexp class (http://www.ruby-doc.org/core/classes/Regexp.html) lets you create a regular expression object. The MatchData class (http://www.ruby-doc.org/core/classes/MatchData.html) is the type of the special $~ variable, which encapsulates all search results from a pattern match.
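A short sketch of both classes in use, assuming a sample sentence like the one used earlier. Regexp.new builds the pattern from a string, and match returns a MatchData object that is also available through $~:

```ruby
pattern = Regexp.new("colou?r")   # equivalent to the literal /colou?r/
color = "I think that colour is just right."

md = pattern.match(color)   # a MatchData object, or nil if nothing matched
last = $~                   # the MatchData from the most recent match

matched = md[0]             # the matched text
before  = md.pre_match      # everything before the match
after   = md.post_match     # everything after the match
offset  = md.begin(0)       # where the match starts

p matched
p before
p after
p offset
```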

This discussion has given you a decent foundation in regular expressions (see Table 4-1 for a listing). With these fundamentals, you can define almost any pattern.

Table 4-1. Regular expressions in Ruby

Pattern           Description
/pattern/options  Pattern pattern in slashes, followed by optional options, that is, one or more of: i for case-insensitive; o for performing #{...} substitution only once; x for ignore whitespace, allow comments; m for match multiple lines, newlines as normal characters
%r!pattern!       General delimited string for a regular expression, where ! can be an arbitrary character
^                 Matches beginning of line
$                 Matches end of line
.                 Matches any character except newline (any character, including newline, with the m option)
\1...\9           Matches nth grouped subexpression
\10               Matches nth grouped subexpression, if already matched; otherwise, refers to octal representation of a character code
\n, \r, \t, etc.  Matches character in backslash notation
\w                Matches word character, as in [0-9A-Za-z_]
\W                Matches nonword character
\s                Matches whitespace character, as in [ \t\n\r\f]
\S                Matches nonwhitespace character
\d                Matches digit, same as [0-9]
\D                Matches nondigit
\A                Matches beginning of a string
\Z                Matches end of a string, or before newline at the end
\z                Matches end of a string
\b                Matches word boundary outside [], or backspace (0x08) inside []
\B                Matches nonword boundary
\G                Matches point where last match finished
[..]              Matches any single character in brackets, such as [ch]at
[^..]             Matches any single character not in brackets
*                 Matches zero or more of the previous regular expression
*?                Matches zero or more of the previous regular expression (nongreedy)
+                 Matches one or more of the previous regular expression
+?                Matches one or more of the previous regular expression (nongreedy)
{m}               Matches exactly m of the previous regular expression
{m,}              Matches at least m of the previous regular expression
{m,n}             Matches at least m but at most n of the previous regular expression
{m,n}?            Matches at least m but at most n of the previous regular expression (nongreedy)
?                 Matches zero or one of the previous regular expression
|                 Alternation, such as color|colour
( )               Grouping regular expressions or subexpression, such as col(o|ou)r
(?#..)            Comment
(?:..)            Grouping without back-references (without remembering matched text)
(?=..)            Lookahead: specifies a position where the pattern matches, without consuming it
(?!..)            Negative lookahead: specifies a position where the pattern does not match
(?>..)            Matches independent pattern without backtracking
(?imx)            Toggles i, m, or x options on
(?-imx)           Toggles i, m, or x options off
(?imx:..)         Toggles i, m, or x options on within parentheses
(?-imx:..)        Toggles i, m, or x options off within parentheses
(?ix-ix: )        Turns on (or off) i and x options within this noncapturing group
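Several entries in Table 4-1 are not demonstrated in the discussion above. The following sketch tries a few of them on made-up strings; every sample string and variable name here is hypothetical:

```ruby
html = "<b>bold</b> and <i>italic</i>"

# Greedy + grabs as much as it can; nongreedy +? stops at the first chance.
greedy    = html[/<.+>/]
nongreedy = html[/<.+?>/]

# (?=..) checks what follows without including it in the match.
amount = "price: 100 USD"[/\d+(?= USD)/]

# (?:..) groups without capturing, so no captures are recorded.
captures = "colour".match(/col(?:o|ou)r/).captures

# \b restricts the match to whole words.
whole_words = "concatenate cat".scan(/\bcat\b/)

# %r{..} avoids escaping slashes; the i option ignores case.
path_offset = "/usr/bin/ruby" =~ %r{/usr/bin}
case_offset = "RUBY" =~ /ruby/i

p greedy, nongreedy, amount, captures, whole_words, path_offset, case_offset
```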
