Chapter 2. Advanced Regular Expressions

Regular expressions, or just regexes, are at the core of Perl’s text processing, and certainly are one of the features that made Perl so popular. All Perl programmers pass through a stage where they try to program everything as regexes and, when that’s not challenging enough, everything as a single regex. Perl’s regexes have many more features than I can, or want to, present here, so I include those advanced features I find most useful and expect other Perl programmers to know about without referring to perlre, the documentation page for regexes.

References to Regular Expressions

I don’t have to know every pattern at the time that I code something. Perl allows me to interpolate variables into regexes. I might hard code those values, take them from user input, or get them in any other way I can get or create data. Here’s a tiny Perl program to do grep’s job. It takes the first argument from the command line and uses it as the regex in the while statement. That’s nothing special (yet); we showed you how to do this in Learning Perl. I can use the string in $regex as my pattern, and Perl compiles it when it interpolates the string in the match operator:^[1]

#!/usr/bin/perl
# perl-grep.pl

my $regex = shift @ARGV;

print "Regex is [$regex]\n";

while( <> )
        {
        print if m/$regex/;
        }

I can use this program from the command line to search for patterns in files. Here I search for the pattern new in all of the Perl programs in the current directory:

% perl-grep.pl new *.pl
Regex is [new]
my $regexp = Regexp::English->new
my $graph = GraphViz::Regex->new($regex);
                [ qr/\G(\n)/,                "newline"     ],
                                                { ( $1, "newline char"     ) }
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;

What happens if I give it an invalid regex? I try it with a pattern that has an opening parenthesis without its closing mate:

$ ./perl-grep.pl "(perl" *.pl
Regex is [(perl]
Unmatched ( in regex; marked by <-- HERE in m/( <-- HERE perl/ 
        at ./perl-grep.pl line 10, <> line 1.

When I interpolate the regex in the match operator, Perl compiles the regex and immediately complains, stopping my program. To catch that, I want to compile the regex before I try to use it.

The qr// is a regex quoting operator that stores my regex in a scalar (and as a quoting operator, its documentation shows up in perlop). The qr// compiles the pattern so it’s ready to use when I interpolate $regex in the match operator. I wrap the eval operator around the qr// to catch the error, even though I end up die-ing anyway:

#!/usr/bin/perl
# perl-grep2.pl

my $pattern = shift @ARGV;

my $regex = eval { qr/$pattern/ };
die "Check your pattern! $@" if $@;

while( <> )
        {
        print if m/$regex/;
        }

The regex in $regex has all of the features of the match operator, including back references and memory variables. This pattern searches for a three-character sequence where the first and third characters are the same, and none of them are whitespace. The input is the plain text version of the perl documentation page, which I get with perldoc -t:

% perldoc -t  perl | perl-grep2.pl "\b(\S)\S\1\b"
    perl583delta        Perl changes in version 5.8.3
    perl582delta        Perl changes in version 5.8.2
    perl581delta        Perl changes in version 5.8.1
    perl58delta         Perl changes in version 5.8.0
    perl573delta        Perl changes in version 5.7.3
    perl572delta        Perl changes in version 5.7.2
    perl571delta        Perl changes in version 5.7.1
    perl570delta        Perl changes in version 5.7.0
    perl561delta        Perl changes in version 5.6.1
http://www.perl.com/       the Perl Home Page
http://www.cpan.org/       the Comprehensive Perl Archive
http://www.perl.org/       Perl Mongers (Perl user groups)

It’s a bit hard, at least for me, to see what Perl matched, so I can make another change to my grep program to see what matched. The $& variable holds the portion of the string that matched:

#!/usr/bin/perl
# perl-grep3.pl

my $pattern = shift @ARGV;

my $regex = eval { qr/$pattern/ };
die "Check your pattern! $@" if $@;

while( <> )
        {
        print "$_\t\tmatched >>>$&<<<\n" if m/$regex/;
        }

Now I see that my regex is matching a literal dot, character, literal dot, as in .8.:

% perldoc -t perl | perl-grep3.pl  "\b(\S)\S\1\b"
                perl587delta        Perl changes in version 5.8.7
                                matched >>>.8.<<<
                perl586delta        Perl changes in version 5.8.6
                                matched >>>.8.<<<
                perl585delta        Perl changes in version 5.8.5
                                matched >>>.8.<<<

Just for fun, how about seeing what matched in each memory group, the variables $1, $2, and so on? I could try printing their contents, whether or not I had capturing groups for them, but how many do I print? Perl already knows because it keeps track of all of that in the special arrays @- and @+, which hold the string offsets for the beginning and end, respectively, for each match. That is, for the match string in $_, the number of memory groups is the last index in @- or @+ (they’ll be the same length). The first element in each is for the part of the string matched (so, $&), and the next element, with index 1, is for $1, and so on for the rest of the array. The value in $1 is the same as this call to substr:

my $one = substr( 
        $_,              # string
        $-[1],           # start position for $1
        $+[1] - $-[1]    # length of $1 (not end position!)
        );

To print the memory variables, I just have to go through the indices in the array @-:

#!/usr/bin/perl
# perl-grep4.pl
        
my $pattern = shift @ARGV;
        
my $regex = eval { qr/$pattern/ };
die "Check your pattern! $@" if $@;

while( <> )
        {
        if( m/$regex/ )
                {

                print "$_";

                print "\t\t\$&: ", 
                        substr( $_, $-[0], $+[0] - $-[0] ), 
                        "\n";

                foreach my $i ( 1 .. $#- )
                        {
                        print "\t\t\$$i: ", 
                                substr( $_, $-[$i], $+[$i] - $-[$i] ),
                                "\n";
                        }
                }
        }

Now I can see the part of the string that matched as well as the submatches:

% perldoc -t perl | perl-grep4.pl  "\b(\S)\S\1\b"
                perl587delta        Perl changes in version 5.8.7
                                $&: .8.
                                $1: .

If I change my pattern to have more submatches, I don’t have to change anything to see the additional matches:

% perldoc -t perl | perl-grep4.pl  "\b(\S)(\S)\1\b"
                perl587delta        Perl changes in version 5.8.7
                                $&: .8.
                                $1: .
                                $2: 8

(?imsx-imsx:PATTERN)

What if I want to do something a bit more complex for my grep program, such as a case-insensitive search? Using my program to search for either “Perl” or “perl” I have a couple of options, neither of which is too much work:

% perl-grep.pl "[pP]erl"
% perl-grep.pl "(p|P)erl"

If I want to make the entire pattern case-insensitive, I have to do much more work, and I don’t like that. With the match operator, I could just add the /i flag on the end:

print if m/$regex/i;

I could do that with the qr// operator, too, although this makes all patterns case-insensitive now:

my $regex = qr/$pattern/i;

To get around this, I can specify the match options inside my pattern. The special sequence (?imsx) allows me to turn on the features for the options I specify. If I want case-insensitivity, I can use (?i) inside the pattern. Case-insensitivity applies for the rest of the pattern after the (?i) (or for the rest of the enclosing parentheses):

% perl-grep.pl "(?i)perl"

In general, I can enable flags for part of a pattern by specifying which ones I want in the parentheses, possibly with the portion of the pattern they apply to, as shown in Table 2-1.

Table 2-1. Options available in the (?options:PATTERN)

Inline option	Description
`(?i:PATTERN)`	Make case-insensitive
`(?m:PATTERN)`	Use multiline matching mode
`(?s:PATTERN)`	Let `.` match a newline
`(?x:PATTERN)`	Turn on extended formatting (ignore whitespace, allow comments)

I can even group them:

(?si:PATTERN)   Let . match a newline and make case-insensitive

If I preface the options with a minus sign, I turn off those features for that group:

(?-s:PATTERN)   Don’t let . match a newline
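
To see how far each inline option reaches, here’s a minimal sketch. The sample strings and the @checks list are my own inventions for illustration; they aren’t part of the grep program:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each entry: description, string, pattern. All sample data is hypothetical.
my @checks = (
        [ '(?i:...) covers only its group',  "PERL rocks",  qr/(?i:perl) rocks/   ],
        [ '... so the rest stays sensitive', "PERL ROCKS",  qr/(?i:perl) rocks/   ],
        [ '(?i) covers the rest of pattern', "PERL ROCKS",  qr/(?i)perl rocks/    ],
        [ '. normally skips a newline',      "perl\nrocks", qr/perl.rocks/        ],
        [ '(?s:.) lets . match a newline',   "perl\nrocks", qr/perl(?s:.)rocks/   ],
        [ '(?-i:...) turns /i off in group', "PERL rocks",  qr/perl (?-i:ROCKS)/i ],
        );

my @results;
foreach my $check ( @checks )
        {
        my( $description, $string, $regex ) = @$check;

        # record whether this string matched this pattern
        my $matched = $string =~ $regex ? 'matches' : 'fails';
        push @results, $matched;

        print "$description: $matched\n";
        }
```

The checks alternate between matching and failing, which shows that an option inside a group stops at that group’s closing parenthesis.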

This is especially useful since I’m getting my pattern from the command line. In fact, when I use the qr// operator to create my regex, I’m already using these. I’ll change my program to print the regex after I create it with qr// but before I use it:

#!/usr/bin/perl
# perl-grep3.pl

my $pattern = shift @ARGV;

my $regex = eval { qr/$pattern/ };
die "Check your pattern! $@" if $@;

print "Regex ---> $regex\n";

while( <> )
        {
        print if m/$regex/;
        }

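When I print the regex, I see it starts with all of the options turned off. The string version of the regex uses (?-OPTIONS:PATTERN) to turn off all of the options (Perl 5.14 and later print the equivalent (?^:PATTERN) instead, where the ^ means “use the default options”):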

% perl-grep3.pl "perl"
Regex ---> (?-xism:perl)

I can turn on case-insensitivity, although the string form looks a bit odd, turning off i just to turn it back on:

% perl-grep3.pl "(?i)perl"
Regex ---> (?-xism:(?i)perl)

Perl’s regexes have many similar sequences that start with a parenthesis, and I’ll show a few of them as I go through this chapter. Each starts with an opening parenthesis followed by some characters to denote what’s going on. The full list is in perlre.

References As Arguments

Since references are scalars, I can use my compiled regex just like any other scalar, including storing it in an array or a hash, or passing it as the argument to a subroutine. The Test::More module, for instance, has a like function that takes a regex as its second argument. I can test a string against a regex and get richer output when it fails to match:

use Test::More 'no_plan';

my $string = "Just another Perl programmer,";
like( $string, qr/(\S+) hacker/, "Some sort of hacker!" );

Since $string uses programmer instead of hacker, the test fails. The output shows me the string, what I expected, and the regex it tried to use:

not ok 1 - Some sort of hacker!
1..1
#   Failed test 'Some sort of hacker!'
#                   'Just another Perl programmer,'
#     doesn't match '(?-xism:(\S+) hacker)'
# Looks like you failed 1 test of 1.

The like function doesn’t have to do anything special to accept a regex as an argument, although it does check its reference type^[2] before it tries to do its magic:

if( ref $regex eq 'Regexp' ) { ... }

Since $regex is just a reference (of type Regexp), I can do reference sorts of things with it. I use isa to check the type, or get the type with ref:

print "I have a regex!\n" if $regex->isa( 'Regexp' );
print "Reference type is ", ref( $regex ), "\n";
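
Storing compiled regexes in a hash and handing them to a subroutine looks something like this sketch. The %patterns hash, the count_matches subroutine, and the sample string are all hypothetical names I made up for the illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table of named patterns; the names and regexes are mine.
my %patterns = (
        word   => qr/\b[a-z]+\b/i,
        number => qr/\b\d+\b/,
        );

# count_matches is an illustrative helper, not a library function.
sub count_matches
        {
        my( $string, $regex ) = @_;

        # same sort of check that like() makes before it uses its argument
        die "Second argument must be a compiled regex\n"
                unless ref $regex eq 'Regexp';

        my $count = () = $string =~ /$regex/g;
        return $count;
        }

my $string = "Perl 5 has 3 cats";

my %counts;
foreach my $name ( sort keys %patterns )
        {
        $counts{$name} = count_matches( $string, $patterns{$name} );
        print "$name: $counts{$name}\n";
        }
```

The subroutine never sees the pattern as a string. It gets the reference, checks its type, and interpolates it into the match operator.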

Noncapturing Grouping, (?:PATTERN)

Parentheses in regexes don’t have to trigger memory. I can use them simply for grouping by using the special sequence (?:PATTERN). This way, I don’t get unwanted data in my capturing groups.

Perhaps I want to match the names on either side of one of the conjunctions and or or. In @strings I have some strings that express pairs. The conjunction may change, so in my regex I use the alternation and|or. My problem is precedence. The alternation has lower precedence than sequence, so without grouping the pattern means either “(\S+) and” or “or (\S+)”. I need to enclose the alternation in parentheses, (\S+) (and|or) (\S+), to make it work:

#!/usr/bin/perl

my @strings = (
        "Fred and Barney",
        "Gilligan or Skipper",
        "Fred and Ginger",
        );

foreach my $string ( @strings )
        {
        # $string =~ m/(\S+) and|or (\S+)/; # doesn't work
        $string =~ m/(\S+) (and|or) (\S+)/;

        print "\$1: $1\n\$2: $2\n\$3: $3\n";
        print "-" x 10, "\n";
        }

The output shows me an unwanted consequence of grouping the alternation: the part of the string in the parentheses shows up in the memory variables as $2 (Table 2-2). That’s an artifact.

Table 2-2. Unintended match memories

Not grouping and|or	Grouping and|or
$1: Fred	$1: Fred
$2:	$2: and
$3:	$3: Barney
----------	----------
$1:	$1: Gilligan
$2: Skipper	$2: or
$3:	$3: Skipper
----------	----------
$1: Fred	$1: Fred
$2:	$2: and
$3:	$3: Ginger
----------	----------

Using the parentheses solves my precedence problem, but now I have that extra memory variable. That gets in the way when I change the program to use a match in list context. All the memory variables, including the conjunction, show up in @names:

# extra element!
my @names = ( $string =~ m/(\S+) (and|or) (\S+)/ );

I want to simply group things without triggering memory. Instead of the regular parentheses I just used, I add ?: right after the opening parenthesis of the group, which turns them into noncapturing parentheses. Instead of (and|or), I now have (?:and|or). This form doesn’t trigger the memory variables, and they don’t count toward the numbering of the memory variables either. I can apply quantifiers just like the plain parentheses as well. Now I don’t get my extra element in @names:

# just the names now
my @names = ( $string =~ m/(\S+) (?:and|or) (\S+)/ );
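
Since I can quantify a noncapturing group, I can skip over any number of middle names without collecting them. This is only a sketch; the three-name string is a made-up example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input with more than two names
my $string = "Fred and Barney or Wilma";

# The starred noncapturing group skips the middle names.
# Only the first and last names land in capturing groups.
my @names = ( $string =~ m/^(\S+)(?: (?:and|or) \S+)* (?:and|or) (\S+)$/ );

print "First: $names[0]\nLast: $names[1]\n";
```

Neither the quantified group nor the alternations inside it show up in @names, so I get two elements no matter how many names are in the middle.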

Readable Regexes, /x and (?#...)

Regular expressions have a well-deserved reputation for being hard to read. Regexes have their own terse language that uses as few characters as possible to represent virtually infinite numbers of possibilities, and that’s just counting the parts that most people use every day.

Luckily for other people, Perl gives me the opportunity to make my regexes much easier to read. Given a little bit of formatting magic, not only will others be able to figure out what I’m trying to match, but a couple of weeks later, so will I. We touched on this lightly in Learning Perl, but it’s such a good idea that I’m going to say more about it. It’s also in Perl Best Practices by Damian Conway (O’Reilly).

When I add the /x flag to either the match or substitution operators, Perl ignores literal whitespace in the pattern. This means that I can spread out the parts of my pattern to make the pattern more discernible. Gisle Aas’s HTTP::Date module parses a date by trying several different regexes. Here’s one of his regular expressions, although I’ve modified it to appear on a single line, wrapped to fit on this page:

/^(\d\d?)(?:\s+|[-\/])(\w+)(?:\s+|[-\/])↲
(\d+)(?:(?:\s+|:)(\d\d?):(\d\d)(?::(\d\d))↲
?)?\s*([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)?\s*(?:\(\w+\))?\s*$/

Quick: Can you tell which one of the many date formats that parses? Me neither. Luckily, Gisle uses the /x flag to break apart the regex and add comments to show me what each piece of the pattern does. With /x, Perl ignores literal whitespace and Perl-style comments inside the regex. Here’s Gisle’s actual code, which is much easier to understand:

        /^
         (\d\d?)               # day
            (?:\s+|[-\/])
         (\w+)                 # month
            (?:\s+|[-\/])
         (\d+)                 # year
         (?:
               (?:\s+|:)       # separator before clock
            (\d\d?):(\d\d)     # hour:min
            (?::(\d\d))?       # optional seconds
         )?                    # optional clock
                \s*
         ([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)? # timezone
                \s*
         (?:\(\w+\))?          # ASCII representation of timezone in parens.
                \s*$
        /x

Under /x, to match whitespace I have to specify it explicitly. I can use \s, which matches any whitespace; one of the specific escapes \f, \r, \n, or \t; or an octal or hexadecimal sequence, such as \040 or \x20 for a literal space.^[3] Likewise, if I need a literal hash symbol, #, I have to escape it too, \#.
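
Here’s a small sketch of those rules in action. The issue-tracker string is a made-up example, and I use a space inside a character class, [ ], as one more way to spell a literal space:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input
my $string = "issue #42 closed";

my( $number ) = $string =~ m/
        issue [ ]       # the word, then a space inside a character class
        \# (\d+)        # an escaped hash, then the digits to capture
        \x20 closed     # a space by its hex escape, then the last word
        /x;

print "Issue number is $number\n";

# Without the escape, the hash starts a comment, so this pattern
# is really just "issue" and it matches more than I intended
my $loose = "issues" =~ / issue #42 /x;
print "The unescaped version matched 'issues'\n" if $loose;
```

The second match succeeds because everything from the unescaped # to the end of the pattern is a comment under /x.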

I don’t have to use /x to put comments in my regex. The (?#COMMENT) sequence does that for me. It probably doesn’t make the regex any more readable at first glance, though. I can mark the parts of a string right next to the parts of the pattern that represent it. Just because you can use (?#) doesn’t mean you should. I think the patterns are much easier to read with /x:

$isbn = '0-596-10206-2';

$isbn =~ m/(\d+)(?#country)-(\d+)(?#publisher)-(\d+)(?#item)-([\dX])/i;

print <<"HERE";
Country code:   $1
Publisher code: $2
Item:           $3
Checksum:       $4
HERE

Global Matching

In Learning Perl we told you about the /g flag that you can use to make all possible substitutions, but it’s more useful than that. I can use it with the match operator, where it does different things in scalar and list context. We told you that the match operator returns true if it matches and false otherwise. That’s still true (we wouldn’t have lied to you), but it’s not just a boolean value. The list context behavior is the most useful. With the /g flag, the match operator returns all of the memory matches:

$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"

Even though I only have one set of memory parentheses in my regular expression, it makes as many matches as it can. Once it makes a match, Perl starts where it left off and tries again. I’ll say more on that in a moment. I often run into another Perl idiom that’s closely related to this, in which I don’t want the actual matches, but just a count:

my $word_count = () = /(\S+)/g;

This uses a little-known but important rule: the result of a list assignment in scalar context is the number of elements in the list on the right side. In this case, that’s the number of elements the match operator returns. This only works for a list assignment, which is assigning from a list on the right side to a list on the left side. That’s why I have the extra () in there.
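
To compare the contexts side by side, here’s a minimal sketch. The variable names are mine, and I use a named variable instead of $_ so the binding is explicit:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Just another Perl hacker,";

# list assignment evaluated in scalar context: the number of matches
my $count = () = $text =~ /(\S+)/g;

# list context with one variable on the left: only the first match survives
my( $first ) = $text =~ /(\S+)/g;

# scalar context: a single application of the pattern, true or false
my $succeeded = $text =~ /(\S+)/g;

print "Count: $count\n";
print "First: $first\n";
print "Succeeded: ", ( $succeeded ? 'yes' : 'no' ), "\n";
```

The only difference between the first two statements is the left side, and that’s enough to change the answer from a count to a word.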

In scalar context, the /g flag does some extra work we didn’t tell you about earlier. During a successful match, Perl remembers its position in the string, and when I match against that same string again, Perl starts where it left off in that string. It returns the result of one application of the pattern to the string:

$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"

while( /(\S+)/g ) # scalar context
        {
        print "Next word is '$1'\n";
        }

When I match against that same string again, Perl gets the next match:

Next word is 'Just'
Next word is 'another'
Next word is 'Perl'
Next word is 'hacker,'

I can even look at the match position as I go along. The built-in pos() operator returns the match position for the string I give it (or $_ by default). Every string maintains its own position. The first position in the string is 0, so pos() can’t use 0 to mean “no position.” Instead, it returns undef when there hasn’t been a match yet or when the position has been reset. This only works when I’m using the /g flag (since there’s no point in pos() otherwise):

$_ = "Just another Perl hacker,";
my $pos = pos( $_ );            # same as pos()
print "I'm at position [$pos]\n"; # undef

/(Just)/g;
$pos = pos();
print "[$1] ends at position $pos\n"; # 4

When my match fails, Perl resets the value of pos() to undef. If I continue matching, I’ll start at the beginning (and potentially create an endless loop):

my( $third_word ) = /(Java)/g;
print "The next position is " . pos() . "\n";

As a side note, I really hate these print statements where I use the concatenation operator to get the result of a function call into the output. Perl doesn’t have a dedicated way to interpolate function calls, so I can cheat a bit. I call the function in an anonymous array constructor, [ ... ], and then immediately dereference it by wrapping @{ ... } around it:^[4]

print "The next position is @{ [ pos( $line ) ] }\n";

The pos() operator can also be an lvalue, which is the fancy programming way of saying that I can assign to it and change its value. I can fool the match operator into starting wherever I like. After I match the first word in $line, the match position is somewhere after the beginning of the string. After I do that, I use index to find the next h after the current match position. Once I have the offset for that h, I assign the offset to pos($line) so the next match starts from that position:

my $line = "Just another regex hacker,";

$line =~ /(\S+)/g;
print "The first word is $1\n";
print "The next position is @{ [ pos( $line ) ] }\n";

pos( $line ) = index( $line, 'h', pos( $line) );

$line =~ /(\S+)/g;
print "The next word is $1\n";
print "The next position is @{ [ pos( $line ) ] }\n";

Global Match Anchors

So far, my subsequent matches can “float,” meaning they can start matching anywhere after the starting position. To anchor my next match exactly where I left off the last time, I use the \G anchor. It’s just like the beginning of string anchor, ^, except that \G anchors at the current match position. If my match fails, Perl resets pos(), and I start at the beginning of the string.

In this example, I anchor my pattern with \G. After that, I use noncapturing parentheses to group optional whitespace, \s*, and word match, \w+. I use the /x flag to spread out the parts to enhance readability. My match only gets the first four words, since it can’t match the comma (it’s not in \w) after the first hacker. Since the next match must start where I left off, which is the comma, and the only thing I can match is whitespace or word characters, I can’t continue. That next match fails, and Perl resets the match position to the beginning of $line:

my $line = "Just another regex hacker, Perl hacker,";

while( $line =~ /  \G (?: \s* (\w+) )  /xg )
        {
        print "Found the word '$1'\n";
        print "Pos is now @{ [ pos( $line ) ] }\n";
        }
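
To watch the reset happen, I can remember the last good position inside the loop and then ask pos() again once the loop ends. This is a sketch of the same loop with a couple of extra variables, @words and $last_pos, that I added for the demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "Just another regex hacker, Perl hacker,";

my @words;
my $last_pos;

while( $line =~ /  \G (?: \s* (\w+) )  /xg )
        {
        push @words, $1;
        $last_pos = pos( $line );  # remember where this match ended
        }

print "Words: @words\n";
print "Last successful position: $last_pos\n";

# the failed match that ended the loop reset the position
print "After the failure, pos() is ",
        ( defined pos( $line ) ? pos( $line ) : 'undef' ), "\n";
```

The loop stops at the first comma, at offset 25, and by the time I look at pos() afterward it’s undef again.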

I have a way to get around Perl resetting the match position. If I want to try a match without resetting the starting point even if it fails, I can add the /c flag, which simply means to not reset the match position on a failed match. I can try something without suffering a penalty. If that doesn’t work, I can try something else at the same match position. This feature is a poor man’s lexer. Here’s a simple-minded sentence parser:

my $line = "Just another regex hacker, Perl hacker, and that's it!\n";

while( 1 )
        {
        my( $found, $type )= do {
                if( $line =~ /\G([a-z]+(?:'[ts])?)/igc )
                        { ( $1, "word"             ) }
                elsif( $line =~ /\G (\n) /xgc             )
                        { ( $1, "newline char"     ) }
                elsif( $line =~ /\G (\s+) /xgc            )
                        { ( $1, "whitespace"       ) }
                elsif( $line =~ /\G ( [[:punct:]] ) /xgc  )
                        { ( $1, "punctuation char" ) }
                else
                        { last; ()                   }
                };

        print "Found a $type [$found]\n";
        }

Look at that example again. What if I wanted to add more things I could match? I’d have to add another branch to the decision structure. That’s no fun. That’s a lot of repeated code structure doing the same thing: match something, then return $1 and a description. It doesn’t have to be like that, though. I rewrite this code to remove the repeated structure. I can store the regexes in the @items array. I use the qr// quoter that I showed earlier, and I put the regexes in the order that I want to try them. The foreach loop goes through them successively until it finds one that matches. When it finds a match, it prints a message using the description and whatever showed up in $1. If I want to add more tokens, I just add their description to @items:

#!/usr/bin/perl
use strict;
use warnings;

my $line = "Just another regex hacker, Perl hacker, and that's it!\n";

my @items = (
        [ qr/\G([a-z]+(?:'[ts])?)/i, "word"        ],
        [ qr/\G(\n)/,                "newline"     ],
        [ qr/\G(\s+)/,               "whitespace"  ],
        [ qr/\G([[:punct:]])/,       "punctuation" ],
        );

LOOP: while( 1 )
        {
        MATCH: foreach my $item ( @items )
                {
                my( $regex, $description ) = @$item;
                my( $type, $found );

                next unless $line =~ /$regex/gc;

                print "Found a $description [$1]\n";
                last LOOP if $1 eq "\n";

                next LOOP;
                }
        }

Look at some of the things going on in this example. All matches need the /gc flags, so I add those flags to the match operator inside the foreach loop. My regex to match a word, however, also needs the /i flag. I can’t add that to the match operator because I might have other branches that don’t want it. I add the /i assertion to my word regex in @items, turning on case-insensitivity for just that regex. If I wanted to keep the nice formatting I had earlier, I could have made that (?ix). As a side note, if most of my regexes should be case-insensitive, I could add /i to the match operator, then turn that off with (?-i) in the appropriate regexes.

Lookarounds

Lookarounds are arbitrary anchors for regexes. We showed several anchors in Learning Perl, such as ^, $, and \b, and I just showed the \G anchor. Using a lookaround, I can describe my own anchor as a regex, and just like the other anchors, they don’t count as part of the pattern or consume part of the string. They specify a condition that must be true, but they don’t add to the part of the string that the overall pattern matches.

Lookarounds come in two flavors: lookaheads that look ahead to assert a condition immediately after the current match position, and lookbehinds that look behind to assert a condition immediately before the current match position. This sounds simple, but it’s easy to misapply these rules. The trick is to remember that it anchors to the current match position and then figure out on which side it applies.

Both lookaheads and lookbehinds have two types: positive and negative. The positive lookaround asserts that its pattern has to match. The negative lookaround asserts that its pattern doesn’t match. No matter which I choose, I have to remember that they apply to the current match position, not anywhere else in the string.

Lookahead Assertions, (?=PATTERN) and (?!PATTERN)

Lookahead assertions let me peek at the string immediately ahead of the current match position. The assertion doesn’t consume part of the string, so if it succeeds, the rest of the pattern picks up from that same match position.

Positive lookahead assertions

In Learning Perl, we included an exercise to check for both “Fred” and “Wilma” on the same line of input, no matter the order they appeared on the line. The trick we wanted to show to the novice Perler is that two regexes can be simpler than one. One way to do this repeats both Wilma and Fred in the alternation so I can try either order. A second try separates them into two regexes:

#!/usr/bin/perl
# fred-and-wilma.pl

$_ = "Here come Wilma and Fred!";
print "Matches: $_" if /Fred.*Wilma|Wilma.*Fred/;
print "Matches: $_" if /Fred/ && /Wilma/;

I can make a simple, single regex using a positive lookahead assertion, denoted by (?=PATTERN). This assertion doesn’t consume text in the string, but if it fails, the entire regex fails. In this example, in the positive lookahead assertion I use .*Wilma. That pattern must be true immediately after the current match position:

$_ = "Here come Wilma and Fred!";
print "Matches: $_" if /(?=.*Wilma).*Fred/;

Since I used that at the start of my pattern, that means it has to be true at the beginning of the string. Specifically, at the beginning of the string, I have to be able to match any number of characters except a newline followed by Wilma. If that succeeds, it anchors the rest of the pattern to its position (the start of the string). Figure 2-1 shows the two ways that can work, depending on the order of Fred and Wilma in the string. The .*Wilma anchors where it started matching. The elastic .*, which can match any number of non-newline characters, anchors at the start of the string.

Figure 2-1. The positive lookahead assertion (?=.*Wilma) anchors the pattern at the beginning of the string

It’s easier to understand lookarounds by seeing when they don’t work, though. I’ll change my pattern a bit by removing the .* from the lookahead assertion. At first it appears to work, but it fails when I reverse the order of Fred and Wilma in the string:

$_ = "Here come Wilma and Fred!";
print "Matches: $_" if /(?=Wilma).*Fred/; # Works

$_ = "Here come Fred and Wilma!";
print "Matches: $_" if /(?=Wilma).*Fred/; # Doesn't work

Figure 2-2 shows what happens. In the first case, the lookahead anchors at the start of Wilma. The regex tried the assertion at the start of the string, found that it didn’t work, then moved over a position and tried again. It kept doing this until it got to Wilma. When it succeeded it set the anchor. Once it sets the anchor, the rest of the pattern has to start from that position.

Figure 2-2. The positive lookahead assertion (?=Wilma) anchors the pattern at Wilma

In the first case, .*Fred can match from that anchor because Fred comes after Wilma. The second case in Figure 2-2 does the same thing. The regex tries that assertion at the beginning of the string, finds that it doesn’t work, and moves on to the next position. By the time the lookahead assertion matches, it has already passed Fred. The rest of the pattern has to start from the anchor, but it can’t match.
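
I can check where each pattern anchors by looking at the start offset in $-[0]. This sketch tries both patterns against both orderings; the @report array and the loop structure are my additions for the demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @strings = ( "Here come Wilma and Fred!", "Here come Fred and Wilma!" );
my @regexes = ( qr/(?=.*Wilma).*Fred/, qr/(?=Wilma).*Fred/ );

my @report;
foreach my $string ( @strings )
        {
        foreach my $regex ( @regexes )
                {
                if( $string =~ $regex )
                        {
                        # offset where the overall match started
                        my $start   = $-[0];
                        my $matched = substr( $string, $start, $+[0] - $start );

                        push @report, "$start $matched";
                        print "$regex starts at $start, matches [$matched]\n";
                        }
                else
                        {
                        push @report, "no match";
                        print "$regex does not match [$string]\n";
                        }
                }
        }
```

With the .* inside the lookahead, the match starts at offset 0 for either order. Without it, the match starts at offset 10 (the W in Wilma) in the first string and can’t succeed at all in the second.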

Since the lookahead assertions don’t consume any of the string, I can use them in a pattern for split when I don’t really want to discard the parts of the string that match. In this example, I want to break apart the words in the studly cap string. I want to split it based on the initial capital letter. I want to keep the initial letter, though, so I use a lookahead assertion instead of a character-consuming pattern. This is different from the separator retention mode because the split pattern isn’t really a separator; it’s just an anchor:

my @words = split /(?=[A-Z])/, 'CamelCaseString';
print join '_', map { lc } @words; # camel_case_string

Negative lookahead assertions

Suppose I want to find the input lines that contain Perl, but only if that isn’t Perl6 or Perl 6. I might try a negated character class to specify the pattern right after the l in Perl to ensure that the next character isn’t a 6. I also use the word boundary anchors \b because I don’t want to match in the middle of other words, such as “BioPerl” or “PerlPoint”:

#!/usr/bin/perl
# not-perl6.pl

print "Trying negated character class:\n";
while( <> )
        {
        print if /\bPerl[^6]\b/;
        }

I’ll try this with some sample input:

# sample input
Perl6 comes after Perl 5.
Perl 6 has a space in it.
I just say "Perl".
This is a Perl 5 line
Perl 5 is the current version.
Just another Perl 5 hacker,
At the end is Perl
PerlPoint is PowerPoint
BioPerl is genetic

It doesn’t work for all the lines it should. It only finds four of the lines that have Perl without a trailing 6, and a line that has a space between Perl and 6:

Trying negated character class:
        Perl6 comes after Perl 5.
        Perl 6 has a space in it.
        This is a Perl 5 line
        Perl 5 is the current version.
        Just another Perl 5 hacker,

That doesn’t work because there has to be a character after the l in Perl. Not only that, I specified a word boundary. If that character after the l is a nonword character, such as the " in I just say "Perl", the word boundary at the end fails. If I take off the trailing \b, now PerlPoint matches. I haven’t even tried handling the case where there is a space between Perl and 6. For that I’ll need something much better.
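
I can reproduce each of those failures in isolation. In this sketch, the @samples list pairs a line (without a trailing newline) with the pattern I want to try; the list and the @results array are just scaffolding I made up for the test:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each entry: a sample line and the pattern to try against it
my @samples = (
        [ 'At the end is Perl',      qr/\bPerl[^6]\b/ ],  # no character after the l
        [ 'I just say "Perl".',      qr/\bPerl[^6]\b/ ],  # quote then dot: no boundary
        [ 'PerlPoint is PowerPoint', qr/\bPerl[^6]\b/ ],  # P then o: no boundary
        [ 'PerlPoint is PowerPoint', qr/\bPerl[^6]/   ],  # no trailing \b: PerlP matches
        [ 'Perl 6 has a space',      qr/\bPerl[^6]\b/ ],  # the space satisfies [^6]
        );

my @results;
foreach my $sample ( @samples )
        {
        my( $string, $regex ) = @$sample;

        my $result = $string =~ $regex ? 'matches' : 'fails';
        push @results, $result;

        print "[$string] $result\n";
        }
```

The first three fail even though I want the first two to match, and the last two match even though I want them to fail.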

To make this really easy, I can use a negative lookahead assertion. I don’t want to match a character after the l, and since an assertion doesn’t match characters, it’s the right tool to use. I just want to say that if there’s anything after Perl, it can’t be a 6, even if there is some whitespace between them. The negative lookahead assertion uses (?!PATTERN). To solve this problem, I use \s?6 as my pattern, denoting the optional whitespace followed by a 6:

print "Trying negative lookahead assertion:\n";
while( <> )
        {
        print if /\bPerl(?!\s?6)\b/;
        }

Now the output finds all of the right lines:

Trying negative lookahead assertion:
        Perl6 comes after Perl 5.
        I just say "Perl".
        This is a Perl 5 line
        Perl 5 is the current version.
        Just another Perl 5 hacker,
        At the end is Perl
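
To compare the two attempts without piping a file through each program, I can put the sample input in an array and grep it with each pattern. This is only a sketch: the array and variable names are my own, and the lines have no trailing newlines since they don't come from a filehandle:

```perl
use strict;
use warnings;

my @lines = (
        'Perl6 comes after Perl 5.',
        'Perl 6 has a space in it.',
        'I just say "Perl".',
        'This is a Perl 5 line',
        'Perl 5 is the current version.',
        'Just another Perl 5 hacker,',
        'At the end is Perl',
        'PerlPoint is PowerPoint',
        'BioPerl is genetic',
        );

my @by_class     = grep { /\bPerl[^6]\b/ }     @lines;
my @by_lookahead = grep { /\bPerl(?!\s?6)\b/ } @lines;

printf "negated class:      %d lines\n", scalar @by_class;      # 5
printf "negative lookahead: %d lines\n", scalar @by_lookahead;  # 6

# The lines that only the lookahead version accepts
my %in_class = map { $_, 1 } @by_class;
print "only lookahead: $_\n" for grep { ! $in_class{$_} } @by_lookahead;
```

The two lines that only the lookahead finds are the ones where Perl is followed by a quote mark or by nothing at all. Those are the cases the negated character class can't handle because it insists on consuming a character.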

Remember that (?!PATTERN) is a lookahead assertion, so it looks after the current match position. That’s why this next pattern still matches. The lookahead asserts that, right before the b in bar, the next thing isn’t foo. Since the next thing is bar, which is not foo, it matches. People often misread this to mean that the thing before bar can’t be foo, but the lookahead and the literal bar both start at the same match position, and since bar is not foo, they both succeed:

if( 'foobar' =~ /(?!foo)bar/ )
        {
        print "Matches! That's not what I wanted!\n";
        }
else
        {
        print "Doesn't match! Whew!\n";
        }
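
If I want to convince myself where that match happens, I can ask Perl for the starting offset of the match, which it keeps in $-[0]. This sketch, with variable names of my own choosing, shows what the lookahead was looking at when it succeeded:

```perl
use strict;
use warnings;

my $string = 'foobar';
my( $offset, $ahead );

if( $string =~ /(?!foo)bar/ )
        {
        $offset = $-[0];                        # where the match started
        $ahead  = substr( $string, $offset, 3 ); # the text the lookahead saw

        print "Matched at offset $offset, looking ahead at [$ahead]\n";
        }
```

The match starts at offset 3, right before the b. At that position the next three characters are bar, so the assertion that they aren't foo is trivially true.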

Lookbehind Assertions, (?<!PATTERN) and (?<=PATTERN)

Instead of looking ahead at the part of the string coming up, I can use a lookbehind to check the part of the string the regular expression engine has already processed. Due to Perl’s implementation details, the lookbehind assertions have to be a fixed width, so I can’t use variable width quantifiers in them.

Now I can try to match bar that doesn’t follow a foo. In the previous section I couldn’t use a negative lookahead assertion because that looks forward in the string. A negative lookbehind, denoted by (?<!PATTERN), looks backward. That’s just what I need. Now I get the right answer:

#!/usr/bin/perl
# correct-foobar.pl

if( 'foobar' =~ /(?<!foo)bar/ )
        {
        print "Matches! That's not what I wanted!\n";
        }
else
        {
        print "Doesn't match! Whew!\n";
        }
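
To see both lookbehinds from this section's heading at work, I can run a few strings through the negative form and its positive counterpart, (?<=PATTERN). This is a small sketch; the test strings and the %result hash are mine:

```perl
use strict;
use warnings;

my %result;

foreach my $string ( qw( foobar bazbar bar ) )
        {
        $result{$string} = {
                negative => ( $string =~ /(?<!foo)bar/ ? 'match' : 'no match' ),
                positive => ( $string =~ /(?<=foo)bar/ ? 'match' : 'no match' ),
                };

        printf "%-6s  (?<!foo)bar: %-8s  (?<=foo)bar: %s\n",
                $string, @{ $result{$string} }{ qw(negative positive) };
        }
```

The plain bar string matches the negative lookbehind because there is nothing behind the match position, and nothing is certainly not foo.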

Now, since the regex has already processed that part of the string by the time it gets to bar, my lookbehind assertion can’t be a variable width pattern. I can’t use the quantifiers to make a variable width pattern because the engine is not going to backtrack in the string to make the lookbehind work. I won’t be able to check for a variable number of os in fooo:

'foooobar' =~ /(?<!fo+)bar/;

When I try that, I get the error telling me that I can’t do that, and even though it merely says not implemented, don’t hold your breath waiting for it:

Variable length lookbehind not implemented in regex...
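
If the pattern comes from a string rather than from my source code, I can trap that error with eval instead of letting it stop my program. This is a sketch under a couple of assumptions: the variable names are mine, and the exact wording of the message differs between Perl versions, so I only show whatever ends up in $@:

```perl
use strict;
use warnings;

my $pattern = '(?<!fo+)bar';   # unbounded quantifier inside a lookbehind

# qr// compiles the interpolated string at runtime, so eval can catch
# the failure
my $regex = eval { qr/$pattern/ };
my $error = $@;

if( defined $regex )
        {
        print "Compiled [$pattern]\n";
        }
else
        {
        print "Couldn't compile [$pattern]: $error";
        }
```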

The positive lookbehind assertion, (?<=PATTERN), also looks backward, but its pattern must match. The only time I seem to use these is in substitutions in concert with another assertion. Using both a lookbehind and a lookahead assertion, I can make some of my substitutions easier to read.

For instance, throughout the book I’ve used variations of hyphenated words because I couldn’t decide which one I should use. Should it be builtin or built-in? Depending on my mood or typing skills, I used either of them.^[5]

I needed to clean up my inconsistency. I knew the part of the word on the left of the hyphen, and I knew the text on the right of the hyphen. At the position where they meet, there should be a hyphen. If I think about that for a moment, I’ve just described the ideal situation for lookarounds: I want to put something at a particular position, and I know what should be around it. Here’s a sample program to use a positive lookbehind to check the text on the left and a positive lookahead to check the text on the right. Since the regex only matches when those sides meet, that means that it’s discovered a missing hyphen. When I make the substitution, it puts the hyphen at the match position, and I don’t have to worry about the particular text:

@hyphenated = qw( built-in );

foreach my $word ( @hyphenated )
        {
        my( $front, $back ) = split /-/, $word;

        $text =~ s/(?<=$front)(?=$back)/-/g;
        }
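
That fragment assumes $text already holds something. Here is a self-contained sketch of the same idea with a sample sentence I made up. I also wrap the interpolated parts in \Q...\E, which is my addition, so that any regex metacharacters in a word stay literal:

```perl
use strict;
use warnings;

my $text = 'A builtin is built-in, and so are the other builtins.';

my @hyphenated = qw( built-in );

foreach my $word ( @hyphenated )
        {
        my( $front, $back ) = split /-/, $word;

        # \Q...\E keeps any regex metacharacters in the word parts literal
        $text =~ s/(?<=\Q$front\E)(?=\Q$back\E)/-/g;
        }

print "$text\n";  # A built-in is built-in, and so are the other built-ins.
```

The copy that already had its hyphen is untouched because the two sides don't meet there: the hyphen sits between them, so the lookahead fails.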

If that’s not a complicated enough example, try this one. Let’s use the lookarounds to add commas to numbers. Jeffrey Friedl shows one attempt in Mastering Regular Expressions, adding commas to the U.S. population:^[6]

$pop = 301139843;  # that's for Feb 10, 2007

# From Jeffrey Friedl   
$pop =~ s/(?<=\d)(?=(?:\d\d\d)+$)/,/g;

That works, mostly. The positive lookbehind (?<=\d) wants to match a digit, and the positive lookahead (?=(?:\d\d\d)+$) wants to find groups of three digits all the way to the end of the string. This breaks when I have floating point numbers, such as currency. For instance, my broker tracks my stock positions to four decimal places. When I try that substitution, I get no comma on the left side of the decimal point and one on the fractional side. It’s because of that end of string anchor:

$money = '$1234.5678';

$money =~ s/(?<=\d)(?=(?:\d\d\d)+$)/,/g;  # $1234.5,678

I can modify that a bit. Instead of the end of string anchor, I’ll use a word boundary, \b. That might seem weird, but remember that a digit is a word character. That gets me the comma on the left side, but I still have that extra comma:

$money = '$1234.5678';

$money =~ s/(?<=\d)(?=(?:\d\d\d)+\b)/,/g;  # $1,234.5,678

What I really want for that first part of the regex is to use the lookbehind to match a digit, but not when it’s preceded by a decimal point. That’s the description of a negative lookbehind, (?<!\.\d). Since all of these match at the same position, it doesn’t matter that some of them might overlap as long as they all do what I need:

$money = '$1234.5678';

$money =~ s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g; # $1,234.5678

That works! It’s a bit too bad that it does because I’d really like an excuse to get a negative lookahead in there. It’s too complicated already, so I’ll just add the /x to practice what I preach:

$money =~ s/
        (?<!\.\d)         # not a . digit right before the position
        
        (?<=\d)           # a digit right before the position
                          # <--- CURRENT MATCH POSITION
        (?=               # this group right after the position
        (?:\d\d\d)+       # one or more groups of three digits
          \b              # word boundary (left side of decimal or end)
        )
        
        /,/xg;
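
To try the finished pattern on more than one value, I can wrap it in a subroutine. This is a sketch: the commify name and the list of inputs are my own choices, while the pattern is the /x version from the text:

```perl
use strict;
use warnings;

sub commify
        {
        my( $number ) = @_;

        $number =~ s/
                (?<!\.\d)         # not a . digit right before the position
                (?<=\d)           # a digit right before the position
                (?=               # this group right after the position
                (?:\d\d\d)+       # one or more groups of three digits
                  \b              # word boundary (left side of decimal or end)
                )
                /,/xg;

        return $number;
        }

foreach my $input ( '301139843', '$1234.5678', '12', '1234.56789' )
        {
        printf "%-12s => %s\n", $input, commify( $input );
        }
```

The last input shows the limit of this pattern. The negative lookbehind only guards the position right after the first fractional digit, so a fraction longer than four digits still picks up a comma: 1234.56789 comes back as 1,234.56,789. That's fine for my broker's four decimal places, but it isn't a general-purpose solution.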

Deciphering Regular Expressions

While trying to figure out a regex, whether one I found in someone else’s code or one I wrote myself (maybe a long time ago), I can turn on Perl’s regex debugging mode.^[7] Perl’s -D switch turns on debugging options for the Perl interpreter (not for your program, as in Chapter 4). The switch takes a series of letters or numbers to indicate what it should turn on. The -Dr option turns on regex parsing and execution debugging.

I can use a short program to examine a regex. The first argument is the match string and the second argument is the regular expression. I save this program as explain-regex:

#!/usr/bin/perl

$ARGV[0] =~ /$ARGV[1]/;

When I try this with the target string Just another Perl hacker, and the regex Just another (\S+) hacker,, I see two major sections of output, which the perldebguts documentation explains at length. First, Perl compiles the regex, and the -Dr output shows how Perl parsed the regex. It shows the regex nodes, such as EXACT and NSPACE, as well as any optimizations, such as anchored "Just another ". Second, it tries to match the target string, and shows its progress through the nodes. It’s a lot of information, but it shows me exactly what it’s doing:

$ perl -Dr explain-regex 'Just another Perl hacker,' 'Just another (\S+) hacker,'
Omitting $` $& $' support.

EXECUTING...

Compiling REx `Just another (\S+) hacker,'
size 15 Got 124 bytes for offset annotations.
first at 1
rarest char k at 4
rarest char J at 0
   1: EXACT <Just another >(6)
   6: OPEN1(8)
   8:   PLUS(10)
   9:     NSPACE(0)
  10: CLOSE1(12)
  12: EXACT < hacker,>(15)
  15: END(0)
anchored "Just another " at 0 floating " hacker," at 14..2147483647 (checking anchored) minlen 22
Offsets: [15]
                1[13] 0[0] 0[0] 0[0] 0[0] 14[1] 0[0] 17[1] 15[2] 18[1] 0[0] 19[8] 0[0] 0[0] 27[0]
Guessing start of match, REx "Just another (\S+) hacker," against "Just another Perl hacker,"...
Found anchored substr "Just another " at offset 0...
Found floating substr " hacker," at offset 17...
Guessed: match at offset 0
Matching REx "Just another (\S+) hacker," against "Just another Perl hacker,"
  Setting an EVAL scope, savestack=3
   0 <> <Just another>    |  1:  EXACT <Just another >
  13 <ther > <Perl ha>    |  6:  OPEN1
  13 <ther > <Perl ha>    |  8:  PLUS
                                                   NSPACE can match 4 times out of 2147483647...
  Setting an EVAL scope, savestack=3
  17 < Perl> < hacker>    | 10:    CLOSE1
  17 < Perl> < hacker>    | 12:    EXACT < hacker,>
  25 <Perl hacker,> <>    | 15:    END
Match successful!
Freeing REx: `"Just another (\\S+) hacker,"'

The re pragma, which comes with Perl, has a debugging mode that doesn’t require a -DDEBUGGING enabled interpreter. In Perl 5.8, once I turn on use re 'debug', it applies to the entire program; it’s not lexically scoped like most pragmata. Starting with Perl 5.10, it is lexically scoped like the others. I modify my previous program to use the re pragma instead of the command-line switch:

#!/usr/bin/perl

use re 'debug';

$ARGV[0] =~ /$ARGV[1]/;

I don’t have to modify my program to use re since I can also load it from the command line:

$ perl -Mre=debug explain-regex 'Just another Perl hacker,' 'Just another (\S+) hacker,'

When I run this program with a target string and a regex as its arguments, I get almost exactly the same output as my previous -Dr example.

The YAPE::Regex::Explain module, although a bit old, might be useful in explaining a regex in mostly plain English. It parses a regex and provides a description of what each part does. It can’t explain the semantic purpose, but I can’t have everything. With a short program I can explain the regex I specify on the command line:

#!/usr/bin/perl

use YAPE::Regex::Explain;

print YAPE::Regex::Explain->new( $ARGV[0] )->explain;

When I run the program even with a short, simple regex, I get plenty of output:

$ perl yape-explain 'Just another (\S+) hacker,'
The regular expression:

(?-imsx:Just another (\S+) hacker,)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  Just another             'Just another '
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
   hacker,                 ' hacker,'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

Final Thoughts

It’s almost the end of the chapter, but there are still so many regular expression features I find useful. Consider this section a quick tour of the things you can look into on your own.

I don’t have to be content with the simple character classes such as \w (word characters), \d (digits), and the others denoted by backslash sequences. I can also use the POSIX character classes. I write the class name between [: and :], and that whole thing goes inside the square brackets of an ordinary character class:

print "Found alphabetic character!\n" if  $string =~ m/[[:alpha:]]/;
print "Found hex digit!\n"            if  $string =~ m/[[:xdigit:]]/;

I negate those with a caret, ^, after the first colon. A negated class still has to match a character, so it tells me that the string has at least one character outside the class:

print "Found a non-alphabetic character!\n" if  $string =~ m/[[:^alpha:]]/;
print "Found a non-space character!\n"      if  $string =~ m/[[:^space:]]/;
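
A POSIX class only has its special meaning inside a bracketed character class, where it can sit next to other classes and literal characters. This sketch, with a sample string and variable names I made up, uses one class on its own and then several in the same set of brackets:

```perl
use strict;
use warnings;

my $string = 'Perl 5.8';

# A POSIX class on its own inside a bracketed character class
my $has_alpha     = $string =~ m/[[:alpha:]]/  ? 1 : 0;
my $has_non_alpha = $string =~ m/[[:^alpha:]]/ ? 1 : 0;

# Several POSIX classes and a literal dot sharing one bracketed class
my $all_expected  = $string =~ m/\A[[:alnum:][:space:].]+\z/ ? 1 : 0;

print "alpha: $has_alpha, non-alpha: $has_non_alpha, all expected: $all_expected\n";
```

The string has letters and it has characters that aren't letters, so the first two tests both succeed. Matching a negated class doesn't mean the positive class is absent.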

I can say the same thing in another way by specifying a named property. The \p{Name} sequence (little p) includes the characters for the named property, and the \P{Name} sequence (big P) is its complement:

print "Found ASCII character!\n"    if  $string =~ m/\p{IsASCII}/;
print "Found control characters!\n" if  $string =~ m/\p{IsCntrl}/;

print "Found a non-punctuation character!\n" if  $string =~ m/\P{IsPunct}/;
print "Found a non-uppercase character!\n"   if  $string =~ m/\P{IsUpper}/;
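
Since \p{Name} and \P{Name} are complements, every character in a string matches exactly one of them. A short sketch with a sample string of my own shows the split:

```perl
use strict;
use warnings;

my $string = 'Mastering Perl';

my @upper     = $string =~ m/\p{IsUpper}/g;   # every uppercase character
my @not_upper = $string =~ m/\P{IsUpper}/g;   # everything else

print "Uppercase: @upper\n";                  # Uppercase: M P
print "Others:    ", scalar @not_upper, "\n"; # Others:    12
```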

The Regexp::Common module provides pretested and known-to-work regexes for, well, common things such as web addresses, numbers, postal codes, and even profanity. It gives me a multilevel hash %RE whose values are regexes. If I don’t like that, I can use its function interface, which I ask for by importing RE_ALL:

use Regexp::Common qw(RE_ALL);

print "Found a real number\n" if $string =~ /$RE{num}{real}/;

print "Found a real number\n" if $string =~ RE_num_real();

If I want to build up my own pattern, I can use Regexp::English, which uses a series of chained methods to return an object that stands in for a regex. It’s probably not something you want in a real program, but it’s fun to think about:

use Regexp::English;

my $regexp = Regexp::English->new
        ->literal( 'Just' )
                ->whitespace_char
        ->word_chars
                ->whitespace_char
        ->remember( \$type_of_hacker )
        ->word_chars
        ->end
                ->whitespace_char
        ->literal( 'hacker' );

$regexp->match( 'Just another Perl hacker,' );

print "The type of hacker is [$type_of_hacker]\n";

If you really want to get into the nuts and bolts of regular expressions, check out O’Reilly’s Mastering Regular Expressions by Jeffrey Friedl. You’ll not only learn some advanced features, but how regular expressions work and how you can make yours better.

Summary

This chapter covered some of the more useful advanced features of Perl’s regex engine. The qr() quoting operator lets me compile a regex for later and gives it back to me as a reference. With the special (?) sequences, I can make my regular expression much more powerful, as well as less complicated. The \G anchor allows me to anchor the next match where the last one left off, and using the /c flag, I can try several possibilities without resetting the match position if one of them fails.

Mastering Perl by brian d foy