Regular expressions, or just regexes, are at the core of Perl’s text processing, and certainly are one of the features that made Perl so popular. All Perl programmers pass through a stage where they try to program everything as regexes and, when that’s not challenging enough, everything as a single regex. Perl’s regexes have many more features than I can, or want to, present here, so I include the advanced features I find most useful and expect other Perl programmers to know without referring to perlre, the documentation page for regexes.
I don’t have to know every pattern at the time that I code something. Perl allows me to interpolate variables into regexes. I might hard code those values, take them from user input, or get them in any other way I can get or create data. Here’s a tiny Perl program to do grep’s job. It takes the first argument from the command line and uses it as the regex in the while statement. That’s nothing special (yet); we showed you how to do this in Learning Perl. I can use the string in $regex as my pattern, and Perl compiles it when it interpolates the string in the match operator:[1]
    #!/usr/bin/perl
    # perl-grep.pl

    my $regex = shift @ARGV;

    print "Regex is [$regex]\n";

    while( <> )
        {
        print if m/$regex/;
        }
I can use this program from the command line to search for patterns in files. Here I search for the pattern new in all of the Perl programs in the current directory:
    % perl-grep.pl new *.pl
    Regex is [new]
    my $regexp = Regexp::English->new
    my $graph = GraphViz::Regex->new($regex);
    [ qr/\G(\n)/, "newline" ],
    { ( $1, "newline char" ) }
    print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
What happens if I give it an invalid regex? I try it with a pattern that has an opening parenthesis without its closing mate:
    % ./perl-grep.pl "(perl" *.pl
    Regex is [(perl]
    Unmatched ( in regex; marked by <-- HERE in m/( <-- HERE perl/ at ./perl-grep.pl line 10, <> line 1.
When I interpolate the regex in the match operator, Perl compiles the regex and immediately complains, stopping my program. To catch that, I want to compile the regex before I try to use it.
The qr// is a regex quoting operator that stores my regex in a scalar (and as a quoting operator, its documentation shows up in perlop). The qr// compiles the pattern so it’s ready to use when I interpolate $regex in the match operator. I wrap the eval operator around the qr// to catch the error, even though I end up die-ing anyway:
    #!/usr/bin/perl
    # perl-grep2.pl

    my $pattern = shift @ARGV;

    my $regex = eval { qr/$pattern/ };
    die "Check your pattern! $@" if $@;

    while( <> )
        {
        print if m/$regex/;
        }
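If I don’t want to die on a bad pattern, I can move the same eval-around-qr// idea into a subroutine and let the caller decide what to do. This is only a sketch: compile_pattern is a name I made up for illustration, and it assumes the caller is happy to look in $@ for the error text.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the compiled regex, or undef when the
# pattern doesn't compile. The error message is left in $@.
sub compile_pattern {
    my( $pattern ) = @_;
    my $regex = eval { qr/$pattern/ };
    return $regex;
}

foreach my $pattern ( 'perl', '(perl' ) {
    if( my $regex = compile_pattern( $pattern ) ) {
        print "[$pattern] compiled to $regex\n";
    }
    else {
        print "[$pattern] failed: $@";
    }
}
```

A compiled regex is always a true value, so the simple truth test in the loop is enough to tell the two cases apart.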
The regex in $regex has all of the features of the match operator, including back references and memory variables. This pattern searches for a three-character sequence where the first and third characters are the same, and none of them are whitespace. The input is the plain text version of the perl documentation page, which I get with perldoc -t:
    % perldoc -t perl | perl-grep2.pl "\b(\S)\S\1\b"
        perl583delta    Perl changes in version 5.8.3
        perl582delta    Perl changes in version 5.8.2
        perl581delta    Perl changes in version 5.8.1
        perl58delta     Perl changes in version 5.8.0
        perl573delta    Perl changes in version 5.7.3
        perl572delta    Perl changes in version 5.7.2
        perl571delta    Perl changes in version 5.7.1
        perl570delta    Perl changes in version 5.7.0
        perl561delta    Perl changes in version 5.6.1
        http://www.perl.com/       the Perl Home Page
        http://www.cpan.org/       the Comprehensive Perl Archive
        http://www.perl.org/       Perl Mongers (Perl user groups)
It’s a bit hard, at least for me, to see what Perl matched, so I change my grep program to show me. The $& variable holds the portion of the string that matched:
    #!/usr/bin/perl
    # perl-grep3.pl

    my $pattern = shift @ARGV;

    my $regex = eval { qr/$pattern/ };
    die "Check your pattern! $@" if $@;

    while( <> )
        {
        print "$_\t\tmatched >>>$&<<<\n" if m/$regex/;
        }
Now I see that my regex is matching a literal dot, character, literal dot, as in .8.:
    % perldoc -t perl | perl-grep3.pl "\b(\S)\S\1\b"
        perl587delta    Perl changes in version 5.8.7
            matched >>>.8.<<<
        perl586delta    Perl changes in version 5.8.6
            matched >>>.8.<<<
        perl585delta    Perl changes in version 5.8.5
            matched >>>.8.<<<
Just for fun, how about seeing what matched in each memory group, the variables $1, $2, and so on? I could try printing their contents, whether or not I had capturing groups for them, but how many do I print? Perl already knows because it keeps track of all of that in the special arrays @- and @+, which hold the string offsets for the beginning and end, respectively, for each match. That is, for the match string in $_, the number of memory groups is the last index in @- or @+ (they’ll be the same length). The first element in each is for the part of the string matched (so, $&), and the next element, with index 1, is for $1, and so on for the rest of the array. The value in $1 is the same as this call to substr:
    my $one = substr(
        $_,              # string
        $-[1],           # start position for $1
        $+[1] - $-[1]    # length of $1 (not end position!)
        );
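To convince myself that the offsets really do line up with the memory variables, I can rebuild every group from @- and @+ in one place. This is a sketch under my own assumptions: matched_parts is a hypothetical helper, and it uses $#+ as the last index so that groups that didn’t take part in the match show up as undef instead of being skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild $&, $1, $2, ... from the offsets
# in @- and @+ after a successful match.
sub matched_parts {
    my( $string, $regex ) = @_;
    return unless $string =~ $regex;

    my @parts;
    foreach my $i ( 0 .. $#+ ) {
        # a group that didn't take part in the match has undef offsets
        push @parts, defined $-[$i]
            ? substr( $string, $-[$i], $+[$i] - $-[$i] )
            : undef;
    }
    return @parts;
}

my @parts = matched_parts( 'version 5.8.7', qr/\b(\S)(\S)\1\b/ );
print "\$&: $parts[0]\n";
print "\$$_: $parts[$_]\n" for 1 .. $#parts;
```

The first element of the returned list stands in for $&, and the rest follow the numbering of the memory variables.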
To print the memory variables, I just have to go through the indices in the array @-. The whole match, $&, is at index 0:
    #!/usr/bin/perl
    # perl-grep4.pl

    my $pattern = shift @ARGV;

    my $regex = eval { qr/$pattern/ };
    die "Check your pattern! $@" if $@;

    while( <> )
        {
        if( m/$regex/ )
            {
            print "$_";

            # index 0 holds the offsets for the entire match
            print "\t\t\$&: ",
                substr( $_, $-[0], $+[0] - $-[0] ),
                "\n";

            foreach my $i ( 1 .. $#- )
                {
                print "\t\t\$$i: ",
                    substr( $_, $-[$i], $+[$i] - $-[$i] ),
                    "\n";
                }
            }
        }
Now I can see the part of the string that matched as well as the submatches:
    % perldoc -t perl | perl-grep4.pl "\b(\S)\S\1\b"
        perl587delta    Perl changes in version 5.8.7
            $&: .8.
            $1: .
If I change my pattern to have more submatches, I don’t have to change anything to see the additional matches:
    % perldoc -t perl | perl-grep4.pl "\b(\S)(\S)\1\b"
        perl587delta    Perl changes in version 5.8.7
            $&: .8.
            $1: .
            $2: 8
What if I want to do something a bit more complex for my grep program, such as a case-insensitive search? Using my program to search for either “Perl” or “perl” I have a couple of options, neither of which is too much work:
    % perl-grep.pl "[pP]erl"
    % perl-grep.pl "(p|P)erl"
If I want to make the entire pattern case-insensitive, I have to do much more work, and I don’t like that. With the match operator, I could just add the /i flag on the end:
    print if m/$regex/i;
I could do that with the qr// operator, too, although this makes all patterns case-insensitive now:
    my $regex = qr/$pattern/i;
To get around this, I can specify the match options inside my pattern. The special sequence (?imsx) allows me to turn on the features for the options I specify. If I want case-insensitivity, I can use (?i) inside the pattern. Case-insensitivity applies for the rest of the pattern after the (?i) (or for the rest of the enclosing parentheses):
    % perl-grep.pl "(?i)perl"
In general, I can enable flags for part of a pattern by specifying which ones I want in the parentheses, possibly with the portion of the pattern they apply to, as shown in Table 2-1.
I can even group them:
(?si:PATTERN) Let . match a newline and make case-insensitive
If I preface the options with a minus sign, I turn off those features for that group:
(?-s:PATTERN) Don’t let . match a newline
This is especially useful since I’m getting my pattern from the command line. In fact, when I use the qr// operator to create my regex, I’m already using these. I’ll change my program to print the regex after I create it with qr// but before I use it:
    #!/usr/bin/perl
    # perl-grep3.pl

    my $pattern = shift @ARGV;

    my $regex = eval { qr/$pattern/ };
    die "Check your pattern! $@" if $@;

    print "Regex ---> $regex\n";

    while( <> )
        {
        print if m/$regex/;
        }
When I print the regex, I see it starts with all of the options turned off. The string version of the regex uses (?-OPTIONS:PATTERN) to turn off all of the options:
    % perl-grep3.pl "perl"
    Regex ---> (?-xism:perl)
I can turn on case-insensitivity, although the string form looks a bit odd, turning off i just to turn it back on:
    % perl-grep3.pl "(?i)perl"
    Regex ---> (?-xism:(?i)perl)
Perl’s regexes have many similar sequences that start with a parenthesis, and I’ll show a few of them as I go through this chapter. Each starts with an opening parenthesis followed by some characters to denote what’s going on. The full list is in perlre.
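To see how far each inline flag reaches, I can put the three forms side by side. This is a small sketch with strings and variable names I made up for the purpose; it only checks case-insensitivity, although I’d expect the other flags to scope the same way.

```perl
use strict;
use warnings;

# (?i) applies from where it appears to the end of the enclosing group
my $rest_of_pattern = qr/(?i)perl hacker/;

# (?i:...) applies only inside its own parentheses
my $one_group = qr/(?i:perl) hacker/;

# (?-i:...) turns the flag off for one group when it is on for the whole regex
my $switched_off = qr/perl (?-i:Hacker)/i;

foreach my $string ( 'PERL HACKER', 'PERL hacker', 'perl Hacker' ) {
    printf "%-12s rest:%d group:%d off:%d\n",
        $string,
        ( $string =~ $rest_of_pattern ? 1 : 0 ),
        ( $string =~ $one_group       ? 1 : 0 ),
        ( $string =~ $switched_off    ? 1 : 0 );
}
```

Only the first pattern accepts every string. The second insists on a lowercase hacker, and the third insists on exactly Hacker.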
Since references are scalars, I can use my compiled regex just like any other scalar, including storing it in an array or a hash, or passing it as the argument to a subroutine. The Test::More module, for instance, has a like function that takes a regex as its second argument. I can test a string against a regex and get richer output when it fails to match:
    use Test::More 'no_plan';

    my $string = "Just another Perl programmer,";
    like( $string, qr/(\S+) hacker/, "Some sort of hacker!" );
Since $string uses programmer instead of hacker, the test fails. The output shows me the string, what I expected, and the regex it tried to use:
    not ok 1 - Some sort of hacker!
    1..1
    #   Failed test 'Some sort of hacker!'
    #   'Just another Perl programmer,'
    #     doesn't match '(?-xism:(\S+) hacker)'
    # Looks like you failed 1 test of 1.
The like function doesn’t have to do anything special to accept a regex as an argument, although it does check its reference type[2] before it tries to do its magic:
    if( ref $regex eq 'Regexp' ) { ... }
Since $regex is just a reference (of type Regexp), I can do reference sorts of things with it. I use isa to check the type, or get the type with ref:
    print "I have a regex!\n" if $regex->isa( 'Regexp' );
    print "Reference type is ", ref( $regex ), "\n";
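The same ref check lets my own subroutines take either a compiled regex or a plain string. Here’s a sketch of that idea; matching_lines is a hypothetical name, and treating a plain string as literal text (by running it through \Q) is my choice for this example, not something like does.

```perl
use strict;
use warnings;

# Hypothetical helper: accepts a compiled regex or a plain string.
# A plain string is treated as literal text, so it is quoted with \Q.
sub matching_lines {
    my( $pattern, @lines ) = @_;

    my $regex = ref $pattern eq 'Regexp'
        ? $pattern
        : qr/\Q$pattern\E/;

    return grep { $_ =~ $regex } @lines;
}

my @lines = ( 'Perl 5.8', 'perl -e 1', 'a.b', 'axb' );

print "Regex:   $_\n" for matching_lines( qr/perl/i, @lines );
print "Literal: $_\n" for matching_lines( 'a.b', @lines );
```

With the compiled regex I get both of the Perl lines. With the string, the dot is literal, so axb doesn’t show up.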
Parentheses in regexes don’t have to trigger memory. I can use them simply for grouping by using the special sequence (?:PATTERN). This way, I don’t get unwanted data in my capturing groups.
Perhaps I want to match the names on either side of one of the conjunctions and or or. In @strings I have some strings that express pairs. The conjunction may change, so in my regex I use the alternation and|or. My problem is precedence. The alternation has lower precedence than sequence, so Perl reads (\S+) and|or (\S+) as a choice between “(\S+) and” and “or (\S+)”. I need to enclose the alternation in parentheses, (\S+) (and|or) (\S+), to make it work:
    #!/usr/bin/perl

    my @strings = (
        "Fred and Barney",
        "Gilligan or Skipper",
        "Fred and Ginger",
        );

    foreach my $string ( @strings )
        {
        # $string =~ m/(\S+) and|or (\S+)/; # doesn't work
        $string =~ m/(\S+) (and|or) (\S+)/;

        print "\$1: $1\n\$2: $2\n\$3: $3\n";
        print "-" x 10, "\n";
        }
The output shows me an unwanted consequence of grouping the alternation: the part of the string in the parentheses shows up in the memory variables as $2 (Table 2-2). That’s an artifact.
Using the parentheses solves my precedence problem, but now I have that extra memory variable. That gets in the way when I change the program to use a match in list context. All the memory variables, including the conjunction, show up in @names:
    # extra element!
    my @names = ( $string =~ m/(\S+) (and|or) (\S+)/ );
I want to simply group things without triggering memory. Instead of the regular parentheses I just used, I add ?: right after the opening parenthesis of the group, which turns them into noncapturing parentheses. Instead of (and|or), I now have (?:and|or). This form doesn’t trigger the memory variables, and they don’t count toward the numbering of the memory variables either. I can apply quantifiers just like the plain parentheses as well. Now I don’t get my extra element in @names:
    # just the names now
    my @names = ( $string =~ m/(\S+) (?:and|or) (\S+)/ );
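Here’s a short, self-contained comparison of the two forms in list context. The pair_names subroutine is a hypothetical wrapper I added so that the noncapturing version is easy to reuse.

```perl
use strict;
use warnings;

# Hypothetical helper: returns only the two names
sub pair_names {
    my( $string ) = @_;
    return $string =~ m/(\S+) (?:and|or) (\S+)/;
}

foreach my $string ( 'Fred and Barney', 'Gilligan or Skipper' ) {
    my @with_extra = $string =~ m/(\S+) (and|or) (\S+)/;
    my @names      = pair_names( $string );

    print scalar( @with_extra ), " elements: @with_extra\n";
    print scalar( @names ),      " elements: @names\n";
}
```

The capturing version gives me three elements for each string, and the noncapturing version gives me two.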
Regular expressions have a well-deserved reputation for being hard to read. Regexes have their own terse language that uses as few characters as possible to represent virtually infinite numbers of possibilities, and that’s just counting the parts that most people use every day.
Luckily for other people, Perl gives me the opportunity to make my regexes much easier to read. Given a little bit of formatting magic, not only will others be able to figure out what I’m trying to match, but a couple weeks later, so will I. We touched on this lightly in Learning Perl, but it’s such a good idea that I’m going to say more about it. It’s also in Perl Best Practices by Damian Conway (O’Reilly).
When I add the /x flag to either the match or substitution operators, Perl ignores literal whitespace in the pattern. This means that I can spread out the parts of my pattern to make the pattern more discernible. Gisle Aas’s HTTP::Date module parses a date by trying several different regexes. Here’s one of his regular expressions, although I’ve modified it to appear on a single line, wrapped to fit on this page:
    /^(\d\d?)(?:\s+|[-\/])(\w+)(?:\s+|[-\/])↲
    (\d+)(?:(?:\s+|:)(\d\d?):(\d\d)(?::(\d\d))↲
    ?)?\s*([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)?\s*(?:\(\w+\))?\s*$/
Quick: Can you tell which one of the many date formats that parses? Me neither. Luckily, Gisle uses the /x flag to break apart the regex and add comments to show me what each piece of the pattern does. With /x, Perl ignores literal whitespace and Perl-style comments inside the regex. Here’s Gisle’s actual code, which is much easier to understand:
    /^
     (\d\d?)               # day
        (?:\s+|[-\/])
     (\w+)                 # month
        (?:\s+|[-\/])
     (\d+)                 # year
     (?:
           (?:\s+|:)       # separator before clock
        (\d\d?):(\d\d)     # hour:min
        (?::(\d\d))?       # optional seconds
     )?                    # optional clock
        \s*
     ([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)? # timezone
        \s*
     (?:\(\w+\))?          # ASCII representation of timezone in parens.
        \s*$
    /x
Under /x, to match whitespace I have to specify it explicitly, using \s, which matches any whitespace; one of \f, \r, \n, or \t; or an octal or hexadecimal sequence, such as \040 or \x20 for a literal space.[3] Likewise, if I need a literal hash symbol, #, I have to escape it too, \#.
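A tiny pattern shows those escapes at work. The input string and the pattern are my own invention for illustration. I also use a space inside a character class, [ ], which plain /x leaves alone.

```perl
use strict;
use warnings;

my $regex = qr/
    (\w+)      # a word
    [ ]        # a literal space, protected inside a character class
    \#         # a literal hash symbol
    \x20       # another literal space, this time as a hex escape
    (\d+)      # a number
/x;

if( 'issue # 42' =~ $regex ) {
    print "Word is [$1] and number is [$2]\n";
}
```

If I forgot the backslash on the hash symbol, Perl would treat the rest of that line as a comment and the pattern would quietly mean something else.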
I don’t have to use /x to put comments in my regex. The (?#COMMENT) sequence does that for me. It probably doesn’t make the regex any more readable at first glance, though. I can mark the parts of a string right next to the parts of the pattern that represent it. Just because you can use (?#) doesn’t mean you should. I think the patterns are much easier to read with /x:
    $isbn = '0-596-10206-2';

    $isbn =~ m/(\d+)(?#country)-(\d+)(?#publisher)-(\d+)(?#item)-([\dX])/i;

    print <<"HERE";
    Country code: $1
    Publisher code: $2
    Item: $3
    Checksum: $4
    HERE
In Learning Perl we told you about the /g flag that you can use to make all possible substitutions, but it’s more useful than that. I can use it with the match operator, where it does different things in scalar and list context. We told you that the match operator returns true if it matches and false otherwise. That’s still true (we wouldn’t have lied to you), but it’s not just a boolean value. The list context behavior is the most useful. With the /g flag, the match operator returns all of the memory matches:
    $_ = "Just another Perl hacker,";
    my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"
Even though I only have one set of memory parentheses in my regular expression, it makes as many matches as it can. Once it makes a match, Perl starts where it left off and tries again. I’ll say more on that in a moment. I often run into another Perl idiom that’s closely related to this, in which I don’t want the actual matches, but just a count:
my $word_count = () = /(\S+)/g;
This uses a little-known but important rule: the result of a list assignment, in scalar context, is the number of elements in the list on the right side. In this case, that’s the number of elements the match operator returns. This only works for a list assignment, which is assigning from a list on the right side to a list on the left side. That’s why I have the extra () in there.
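Wrapped in a subroutine, the counting idiom looks like this. The name word_count is hypothetical, and the sketch assumes that “word” just means a run of non-whitespace characters, as in the pattern above.

```perl
use strict;
use warnings;

# Hypothetical helper: count the runs of non-whitespace characters
sub word_count {
    my( $string ) = @_;

    # the empty list forces list context on the match; the scalar
    # assignment then sees the number of elements the match returned
    my $count = () = $string =~ /(\S+)/g;

    return $count;
}

print word_count( 'Just another Perl hacker,' ), "\n";
```

Without the extra (), the match would be in scalar context and I’d only find out whether there was at least one word.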
In scalar context, the /g flag does some extra work we didn’t tell you about earlier. During a successful match, Perl remembers its position in the string, and when I match against that same string again, Perl starts where it left off in that string. It returns the result of one application of the pattern to the string:
    $_ = "Just another Perl hacker,";
    my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"

    while( /(\S+)/g ) # scalar context
        {
        print "Next word is '$1'\n";
        }
When I match against that same string again, Perl gets the next match:
    Next word is 'Just'
    Next word is 'another'
    Next word is 'Perl'
    Next word is 'hacker,'
I can even look at the match position as I go along. The built-in pos() operator returns the match position for the string I give it (or $_ by default). Every string maintains its own position. Since the first position in the string is 0, pos() can’t use 0 to mean “no position,” so it returns undef when the match fails and the position has been reset. This only works when I’m using the /g flag (since there’s no point in pos() otherwise):
    $_ = "Just another Perl hacker,";
    my $pos = pos( $_ );              # same as pos()
    print "I'm at position [$pos]\n"; # undef

    /(Just)/g;
    $pos = pos();
    print "[$1] ends at position $pos\n"; # 4
When my match fails, Perl resets the value of pos() to undef. If I continue matching, I’ll start at the beginning (and potentially create an endless loop):
    my( $third_word ) = /(Java)/g;
    print "The next position is " . pos() . "\n";
As a side note, I really hate these print statements where I use the concatenation operator to get the result of a function call into the output. Perl doesn’t have a dedicated way to interpolate function calls, so I can cheat a bit. I call the function in an anonymous array constructor, [ ... ], and then immediately dereference it by wrapping @{ ... } around it:[4]
    print "The next position is @{ [ pos( $line ) ] }\n";
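To watch the position come and go, I can record pos() after a match that works and again after one that fails. This is a sketch with a string and variable names of my own choosing. Since pos() is undef after the failure, I check for that inside the interpolation trick so I don’t trigger a warning.

```perl
use strict;
use warnings;

my $line = 'Just another Perl hacker,';

$line =~ /(Just)/g;                   # succeeds and sets the position
my $after_success = pos( $line );

$line =~ /(Java)/g;                   # fails and resets the position
my $after_failure = pos( $line );

print "After success: $after_success\n";
print "After failure: @{ [ defined $after_failure ? $after_failure : 'undef' ] }\n";
```

The first print shows the offset just past Just, and the second shows that the failed match threw the position away.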
The pos() operator can also be an lvalue, which is the fancy programming way of saying that I can assign to it and change its value. I can fool the match operator into starting wherever I like. After I match the first word in $line, the match position is somewhere after the beginning of the string. After I do that, I use index to find the next h after the current match position. Once I have the offset for that h, I assign the offset to pos($line) so the next match starts from that position:
    my $line = "Just another regex hacker,";

    $line =~ /(\S+)/g;
    print "The first word is $1\n";
    print "The next position is @{ [ pos( $line ) ] }\n";

    pos( $line ) = index( $line, 'h', pos( $line) );

    $line =~ /(\S+)/g;
    print "The next word is $1\n";
    print "The next position is @{ [ pos( $line ) ] }\n";
So far, my subsequent matches can “float,” meaning they can start matching anywhere after the starting position. To anchor my next match exactly where I left off the last time, I use the \G anchor. It’s just like the beginning of string anchor, ^, except that \G anchors at the current match position. If my match fails, Perl resets pos(), and I start at the beginning of the string.
In this example, I anchor my pattern with \G. After that, I use noncapturing parentheses to group optional whitespace, \s*, and a word match, \w+. I use the /x flag to spread out the parts to enhance readability. My match only gets the first four words, since it can’t match the comma (it’s not in \w) after the first hacker. Since the next match must start where I left off, which is the comma, and the only thing I can match is whitespace or word characters, I can’t continue. That next match fails, and Perl resets the match position to the beginning of $line:
    my $line = "Just another regex hacker, Perl hacker,";

    while( $line =~ / \G (?: \s* (\w+) ) /xg )
        {
        print "Found the word '$1'\n";
        print "Pos is now @{ [ pos( $line ) ] }\n";
        }
I have a way to get around Perl resetting the match position. If I want to try a match without resetting the starting point even if it fails, I can add the /c flag, which simply means to not reset the match position on a failed match. I can try something without suffering a penalty. If that doesn’t work, I can try something else at the same match position. This feature is a poor man’s lexer. Here’s a simple-minded sentence parser:
    my $line = "Just another regex hacker, Perl hacker, and that's it!\n";

    while( 1 )
        {
        my( $found, $type )= do {
            if( $line =~ /\G([a-z]+(?:'[ts])?)/igc )
                { ( $1, "word" ) }
            elsif( $line =~ /\G (\n) /xgc )
                { ( $1, "newline char" ) }
            elsif( $line =~ /\G (\s+) /xgc )
                { ( $1, "whitespace" ) }
            elsif( $line =~ /\G ( [[:punct:]] ) /xgc )
                { ( $1, "punctuation char" ) }
            else
                { last; () }
            };

        print "Found a $type [$found]\n";
        }
Look at that example again. What if I wanted to add more things I could match? I’d have to add another branch to the decision structure. That’s no fun. That’s a lot of repeated code structure doing the same thing: match something, then return $1 and a description. It doesn’t have to be like that, though. I rewrite this code to remove the repeated structure. I can store the regexes in the @items array. I use the qr// quoter that I showed earlier, and I put the regexes in the order that I want to try them. The foreach loop goes through them successively until it finds one that matches. When it finds a match, it prints a message using the description and whatever showed up in $1. If I want to add more tokens, I just add their description to @items:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $line = "Just another regex hacker, Perl hacker, and that's it!\n";

    my @items = (
        [ qr/\G([a-z]+(?:'[ts])?)/i, "word"        ],
        [ qr/\G(\n)/,                "newline"     ],
        [ qr/\G(\s+)/,               "whitespace"  ],
        [ qr/\G([[:punct:]])/,       "punctuation" ],
        );

    LOOP: while( 1 )
        {
        MATCH: foreach my $item ( @items )
            {
            my( $regex, $description ) = @$item;
            my( $type, $found );

            next unless $line =~ /$regex/gc;

            print "Found a $description [$1]\n";
            last LOOP if $1 eq "\n";

            next LOOP;
            }
        }
Look at some of the things going on in this example. All matches need the /gc flags, so I add those flags to the match operator inside the foreach loop. My regex to match a word, however, also needs the /i flag. I can’t add that to the match operator because I might have other branches that don’t want it. I add the /i flag to the qr// for my word regex in @items, turning on case-insensitivity for just that regex. If I wanted to keep the nice formatting I had earlier, I could have used /ix there, or (?ix) inside the pattern. As a side note, if most of my regexes should be case-insensitive, I could add /i to the match operator, then turn that off with (?-i) in the appropriate regexes.
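The table-driven loop above stops only when it sees a newline, so a string with a character that none of the regexes match would keep it spinning. Here’s one way I might guard against that, as a sketch rather than the definitive version: tokenize is a hypothetical name, it returns the tokens instead of printing them, and it gives up at the first position where nothing matches.

```perl
use strict;
use warnings;

my @items = (
    [ qr/\G([a-z]+(?:'[ts])?)/i, 'word'        ],
    [ qr/\G(\n)/,                'newline'     ],
    [ qr/\G(\s+)/,               'whitespace'  ],
    [ qr/\G([[:punct:]])/,       'punctuation' ],
);

# Hypothetical helper: returns a list of [ description, text ] pairs
sub tokenize {
    my( $line ) = @_;
    my @tokens;

    pos( $line ) = 0;
    TOKEN: while( pos( $line ) < length $line ) {
        foreach my $item ( @items ) {
            my( $regex, $description ) = @$item;
            next unless $line =~ /$regex/gc;

            push @tokens, [ $description, $1 ];
            next TOKEN;
        }
        last TOKEN;    # nothing matched here, so stop instead of spinning
    }

    return @tokens;
}

print "Found a $_->[0] [$_->[1]]\n" for tokenize( "it's OK!\n" );
```

Because of /c, the failed attempts inside the foreach loop leave the position alone, so each regex gets its turn at the same spot.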
Lookarounds are arbitrary anchors for regexes. We showed several anchors in Learning Perl, such as ^, $, and \b, and I just showed the \G anchor. Using a lookaround, I can describe my own anchor as a regex, and just like the other anchors, it doesn’t count as part of the pattern or consume part of the string. Lookarounds specify a condition that must be true, but they don’t add to the part of the string that the overall pattern matches.
Lookarounds come in two flavors: lookaheads that look ahead to assert a condition immediately after the current match position, and lookbehinds that look behind to assert a condition immediately before the current match position. This sounds simple, but it’s easy to misapply these rules. The trick is to remember that it anchors to the current match position and then figure out on which side it applies.
Both lookaheads and lookbehinds have two types: positive and negative. The positive lookaround asserts that its pattern has to match. The negative lookaround asserts that its pattern doesn’t match. No matter which I choose, I have to remember that they apply to the current match position, not anywhere else in the string.
Lookahead assertions let me peek at the string immediately ahead of the current match position. The assertion doesn’t consume part of the string, and if it succeeds, matching picks up right after the current match position.
In Learning Perl, we included an exercise to check for both “Fred” and “Wilma” on the same line of input, no matter the order they appeared on the line. The trick we wanted to show to the novice Perler is that two regexes can be simpler than one. One way to do this repeats both Wilma and Fred in the alternation so I can try either order. A second try separates them into two regexes:
    #!/usr/bin/perl
    # fred-and-wilma.pl

    $_ = "Here come Wilma and Fred!";
    print "Matches: $_" if /Fred.*Wilma|Wilma.*Fred/;
    print "Matches: $_" if /Fred/ && /Wilma/;
I can make a simple, single regex using a positive lookahead assertion, denoted by (?=PATTERN). This assertion doesn’t consume text in the string, but if it fails, the entire regex fails. In this example, in the positive lookahead assertion I use .*Wilma. That pattern must be true immediately after the current match position:
    $_ = "Here come Wilma and Fred!";
    print "Matches: $_" if /(?=.*Wilma).*Fred/;
Since I used that at the start of my pattern, that means it has to be true at the beginning of the string. Specifically, at the beginning of the string, I have to be able to match any number of non-newline characters followed by Wilma. If that succeeds, it anchors the rest of the pattern to its position (the start of the string). Figure 2-1 shows the two ways that can work, depending on the order of Fred and Wilma in the string. The .*Wilma anchors where it started matching. The elastic .*, which can match any number of non-newline characters, anchors at the start of the string.
Figure 2-1. The positive lookahead assertion (?=.*Wilma) anchors the pattern at the beginning of the string
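Taking that idea one step further, I can give each name its own lookahead and anchor both at the start of the string. This is a variation I’m adding as a sketch, not the pattern from the exercise, and has_fred_and_wilma is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: one lookahead per name, both tried at the
# start of the string, so the order in the string doesn't matter
sub has_fred_and_wilma {
    my( $string ) = @_;
    return $string =~ /^(?=.*Fred)(?=.*Wilma)/ ? 1 : 0;
}

foreach my $string (
    'Here come Wilma and Fred!',
    'Here come Fred and Wilma!',
    'Here comes Fred!',
    ) {
    print has_fred_and_wilma( $string ) ? 'Matches' : 'No match', ": $string\n";
}
```

Neither lookahead consumes anything, so both of them start looking from the same position. As before, the dot won’t cross a newline, so both names have to be on the first line of the string.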
It’s easier to understand lookarounds by seeing when they don’t work, though. I’ll change my pattern a bit by removing the .* from the lookahead assertion. At first it appears to work, but it fails when I reverse the order of Fred and Wilma in the string:
    $_ = "Here come Wilma and Fred!";
    print "Matches: $_" if /(?=Wilma).*Fred/; # Works

    $_ = "Here come Fred and Wilma!";
    print "Matches: $_" if /(?=Wilma).*Fred/; # Doesn't work
Figure 2-2 shows what happens. In the first case, the lookahead anchors at the start of Wilma. The regex tried the assertion at the start of the string, found that it didn’t work, then moved over a position and tried again. It kept doing this until it got to Wilma. When it succeeded it set the anchor. Once it sets the anchor, the rest of the pattern has to start from that position.
In the first case, .*Fred can match from that anchor because Fred comes after Wilma. The second case in Figure 2-2 does the same thing. The regex tries that assertion at the beginning of the string, finds that it doesn’t work, and moves on to the next position. By the time the lookahead assertion matches, it has already passed Fred. The rest of the pattern has to start from the anchor, but it can’t match.
Since the lookahead assertions don’t consume any of the string, I can use one in a pattern for split when I don’t really want to discard the parts of the string that match the pattern. In this example, I want to break apart the words in the studly cap string. I want to split it based on the initial capital letter. I want to keep the initial letter, though, so I use a lookahead assertion instead of a character-consuming pattern. This is different from the separator retention mode because the split pattern isn’t really a separator; it’s just an anchor:
    my @words = split /(?=[A-Z])/, 'CamelCaseString';
    print join '_', map { lc } @words; # camel_case_string
Suppose I want to find the input lines that contain Perl, but only if that isn’t Perl6 or Perl 6. I might try a negated character class to specify the pattern right after the l in Perl to ensure that the next character isn’t a 6. I also use the word boundary anchors \b because I don’t want to match in the middle of other words, such as “BioPerl” or “PerlPoint”:
    #!/usr/bin/perl
    # not-perl6.pl

    print "Trying negated character class:\n";
    while( <> )
        {
        print if /\bPerl[^6]\b/;
        }
I’ll try this with some sample input:
    # sample input
    Perl6 comes after Perl 5.
    Perl 6 has a space in it.
    I just say "Perl".
    This is a Perl 5 line
    Perl 5 is the current version.
    Just another Perl 5 hacker,
    At the end is Perl
    PerlPoint is PowerPoint
    BioPerl is genetic
It doesn’t work for all the lines it should. It only finds four of the lines that have Perl without a trailing 6, and a line that has a space between Perl and 6:
    Trying negated character class:
        Perl6 comes after Perl 5.
        Perl 6 has a space in it.
        This is a Perl 5 line
        Perl 5 is the current version.
        Just another Perl 5 hacker,
That doesn’t work because there has to be a character after the l in Perl. Not only that, I specified a word boundary. If that character after the l is a nonword character, such as the " in I just say "Perl", the word boundary at the end fails. If I take off the trailing \b, now PerlPoint matches. I haven’t even tried handling the case where there is a space between Perl and 6. For that I’ll need something much better.
To make this really easy, I can use a negative lookahead assertion. I don’t want to match a character after the l, and since an assertion doesn’t match characters, it’s the right tool to use. I just want to say that if there’s anything after Perl, it can’t be a 6, even if there is some whitespace between them. The negative lookahead assertion uses (?!PATTERN). To solve this problem, I use \s?6 as my pattern, denoting the optional whitespace followed by a 6:
    print "Trying negative lookahead assertion:\n";
    while( <> )
        {
        print if /\bPerl(?!\s?6)\b/;
        }
Now the output finds all of the right lines:
    Trying negative lookahead assertion:
        Perl6 comes after Perl 5.
        I just say "Perl".
        This is a Perl 5 line
        Perl 5 is the current version.
        Just another Perl 5 hacker,
        At the end is Perl
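To compare the two attempts without feeding a file to the program, I can put the sample input in an array and count what each pattern lets through. The array and variable names here are my own; the lines and the patterns are the ones from the text.

```perl
use strict;
use warnings;

my @lines = (
    'Perl6 comes after Perl 5.',
    'Perl 6 has a space in it.',
    'I just say "Perl".',
    'This is a Perl 5 line',
    'Perl 5 is the current version.',
    'Just another Perl 5 hacker,',
    'At the end is Perl',
    'PerlPoint is PowerPoint',
    'BioPerl is genetic',
);

my @char_class = grep { /\bPerl[^6]\b/ }     @lines;
my @lookahead  = grep { /\bPerl(?!\s?6)\b/ } @lines;

print "Negated character class: ", scalar @char_class, " lines\n";
print "Negative lookahead:      ", scalar @lookahead,  " lines\n";
```

The negated character class finds five lines, including the Perl 6 line I didn’t want. The lookahead finds six, and none of them is the Perl 6 line.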
Remember that (?!PATTERN) is a lookahead assertion, so it looks after the current match position. That’s why this next pattern still matches. The lookahead asserts, at the position right before the b in bar, that the next thing isn’t foo. Since the next thing is bar, which is not foo, it matches. People often misread this pattern to mean that the thing before bar can’t be foo, but the assertion and bar both start at the same match position, and since bar is not foo, they both succeed:
    if( 'foobar' =~ /(?!foo)bar/ )
        {
        print "Matches! That's not what I wanted!\n";
        }
    else
        {
        print "Doesn't match! Whew!\n";
        }
Instead of looking ahead at the part of the string coming up, I can use a lookbehind to check the part of the string the regular expression engine has already processed. Due to Perl’s implementation details, the lookbehind assertions have to be a fixed width, so I can’t use variable width quantifiers in them.
Now I can try to match bar that doesn’t follow a foo. In the previous section I couldn’t use a negative lookahead assertion because that looks forward in the string. A negative lookbehind, denoted by (?<!PATTERN), looks backward. That’s just what I need. Now I get the right answer:
    #!/usr/bin/perl
    # correct-foobar.pl

    if( 'foobar' =~ /(?<!foo)bar/ )
        {
        print "Matches! That's not what I wanted!\n";
        }
    else
        {
        print "Doesn't match! Whew!\n";
        }
Now, since the regex has already processed that part of the string by the time it gets to bar, my lookbehind assertion can’t be a variable width pattern. I can’t use the quantifiers to make a variable width pattern because the engine is not going to backtrack in the string to make the lookbehind work. I won’t be able to check for a variable number of os in fooo:
    'foooobar' =~ /(?<!fo+)bar/;
When I try that, I get the error telling me that I can’t do that, and even though it merely says not implemented, don’t hold your breath waiting for it:
    Variable length lookbehind not implemented in regex...
The positive lookbehind assertion, denoted by (?<=PATTERN), also looks backward, but its pattern must match. The only time I seem to use these is in substitutions in concert with another assertion. Using both a lookbehind and a lookahead assertion, I can make some of my substitutions easier to read.
For instance, throughout the book I’ve used variations of hyphenated words because I couldn’t decide which one I should use. Should it be builtin or built-in? Depending on my mood or typing skills, I used either of them.[5]
I needed to clean up my inconsistency. I knew the part of the word on the left of the hyphen, and I knew the text on the right of the hyphen. At the position where they meet, there should be a hyphen. If I think about that for a moment, I’ve just described the ideal situation for lookarounds: I want to put something at a particular position, and I know what should be around it. Here’s a sample program to use a positive lookbehind to check the text on the left and a positive lookahead to check the text on the right. Since the regex only matches when those sides meet, that means that it’s discovered a missing hyphen. When I make the substitution, it puts the hyphen at the match position, and I don’t have to worry about the particular text:
    @hyphenated = qw( built-in );

    foreach my $word ( @hyphenated ) {
        my( $front, $back ) = split /-/, $word;

        $text =~ s/(?<=$front)(?=$back)/-/g;
    }
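To try that on real text, I can wrap the loop in a subroutine. This is a minimal sketch: the add_hyphens name and the sample sentence are my own inventions, it assumes each word in the list has a single hyphen, and I’ve added \Q and \E so that any regex metacharacters in the word parts are taken literally:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: insert the missing hyphen wherever the two
# halves of a known hyphenated word sit right next to each other.
sub add_hyphens {
    my( $text, @hyphenated ) = @_;

    foreach my $word ( @hyphenated ) {
        my( $front, $back ) = split /-/, $word;

        # zero-width match: nothing is replaced, the hyphen is
        # inserted at the position between the lookarounds
        $text =~ s/(?<=\Q$front\E)(?=\Q$back\E)/-/g;
    }

    return $text;
}

print add_hyphens(
    "The builtin functions and the built-in types\n",
    qw( built-in )
    );
```

Both spellings come out as built-in. The already hyphenated one is untouched because the lookahead sees a hyphen, not in, at that position. The pattern also matches inside longer words, so builtins becomes built-ins, which is probably what I want here but might not be for other word lists.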
If that’s not a complicated enough example, try this one. Let’s use the lookarounds to add commas to numbers. Jeffrey Friedl shows one attempt in Mastering Regular Expressions, adding commas to the U.S. population:[6]
    $pop = 301139843;  # that's for Feb 10, 2007

    # From Jeffrey Friedl
    $pop =~ s/(?<=\d)(?=(?:\d\d\d)+$)/,/g;
That works, mostly. The positive lookbehind (?<=\d) wants to match a digit, and the positive lookahead (?=(?:\d\d\d)+$) wants to find groups of three digits all the way to the end of the string. This breaks when I have floating point numbers, such as currency. For instance, my broker tracks my stock positions to four decimal places. When I try that substitution, I get no comma on the left side of the decimal point and one on the fractional side. It’s because of that end of string anchor:
    $money = '$1234.5678';

    $money =~ s/(?<=\d)(?=(?:\d\d\d)+$)/,/g;  # $1234.5,678
I can modify that a bit. Instead of the end of string anchor, I’ll use a word boundary, \b. That might seem weird, but remember that a digit is a word character. That gets me the comma on the left side, but I still have that extra comma:
    $money = '$1234.5678';

    $money =~ s/(?<=\d)(?=(?:\d\d\d)+\b)/,/g;  # $1,234.5,678
What I really want for that first part of the regex is to use the lookbehind to match a digit, but not when it’s preceded by a decimal point. That’s the description of a negative lookbehind, (?<!\.\d). Since all of these match at the same position, it doesn’t matter that some of them might overlap as long as they all do what I need:
    $money = '$1234.5678';

    $money =~ s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g;  # $1,234.5678
That works! It’s a bit too bad that it does because I’d really like an excuse to get a negative lookahead in there. It’s too complicated already, so I’ll just add the /x flag to practice what I preach:
    $money =~ s/
        (?<!\.\d)         # not a . digit right before the position
        (?<=\d)           # a digit right before the position
                          # <--- CURRENT MATCH POSITION
        (?=               # this group right after the position
            (?:\d\d\d)+   # one or more groups of three digits
            \b            # word boundary (left side of decimal or end)
        )
        /,/xg;
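Here’s the final substitution wrapped up so I can try it on several values. The commify name and the test values are my own choices for illustration; the regex is the one I just developed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the lookaround substitution
sub commify {
    my( $number ) = @_;

    $number =~ s/
        (?<!\.\d)         # not a . digit right before the position
        (?<=\d)           # a digit right before the position
        (?=               # this group right after the position
            (?:\d\d\d)+   # one or more groups of three digits
            \b            # word boundary
        )
        /,/xg;

    return $number;
}

foreach my $number ( '301139843', '$1234.5678', '999' ) {
    print commify( $number ), "\n";
}
```

That prints 301,139,843, then $1,234.5678, then 999. It isn’t a general solution, though. The negative lookbehind only protects the first position after the decimal point, so a value with seven or more fractional digits, such as 0.1234567, still picks up a comma in its fraction.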
While trying to figure out a regex, whether one I found in someone else’s code or one I wrote myself (maybe a long time ago), I can turn on Perl’s regex debugging mode.[7] Perl’s -D switch turns on debugging options for the Perl interpreter (not for your program, as in Chapter 4). The switch takes a series of letters or numbers to indicate what it should turn on. The -Dr option turns on regex parsing and execution debugging.
I can use a short program to examine a regex. The first argument is the match string and the second argument is the regular expression. I save this program as explain-regex:
    #!/usr/bin/perl

    $ARGV[0] =~ /$ARGV[1]/;
When I try this with Just another Perl hacker, as the target string and Just another (\S+) hacker, as the regex, I see two major sections of output, which the perldebguts documentation explains at length. First, Perl compiles the regex, and the -Dr output shows how Perl parsed the regex. It shows the regex nodes, such as EXACT and NSPACE, as well as any optimizations, such as anchored "Just another ". Second, it tries to match the target string, and shows its progress through the nodes. It’s a lot of information, but it shows me exactly what it’s doing:
    $ perl -Dr explain-regex 'Just another Perl hacker,' 'Just another (\S+) hacker,'
    Omitting $` $& $' support.

    EXECUTING...

    Compiling REx `Just another (\S+) hacker,'
    size 15 Got 124 bytes for offset annotations.
    first at 1
    rarest char k at 4
    rarest char J at 0
       1: EXACT <Just another >(6)
       6: OPEN1(8)
       8:   PLUS(10)
       9:     NSPACE(0)
      10: CLOSE1(12)
      12: EXACT < hacker,>(15)
      15: END(0)
    anchored "Just another " at 0 floating " hacker," at 14..2147483647 (checking anchored) minlen 22
    Offsets: [15]
        1[13] 0[0] 0[0] 0[0] 0[0] 14[1] 0[0] 17[1] 15[2] 18[1] 0[0] 19[8] 0[0] 0[0] 27[0]
    Guessing start of match, REx "Just another (\S+) hacker," against "Just another Perl hacker,"...
    Found anchored substr "Just another " at offset 0...
    Found floating substr " hacker," at offset 17...
    Guessed: match at offset 0
    Matching REx "Just another (\S+) hacker," against "Just another Perl hacker,"
      Setting an EVAL scope, savestack=3
       0 <> <Just another>    |  1:  EXACT <Just another >
      13 <ther > <Perl ha>    |  6:  OPEN1
      13 <ther > <Perl ha>    |  8:  PLUS
                               NSPACE can match 4 times out of 2147483647...
      Setting an EVAL scope, savestack=3
      17 < Perl> < hacker>    | 10:    CLOSE1
      17 < Perl> < hacker>    | 12:    EXACT < hacker,>
      25 <Perl hacker,> <>    | 15:    END
    Match successful!
    Freeing REx: `"Just another (\\S+) hacker,"'
The re pragma, which comes with Perl, has a debugging mode that doesn’t require a -DDEBUGGING enabled interpreter. Once I turn on use re 'debug', it applies to the entire program. In the Perl 5.8 series it’s not lexically scoped like most pragmata, although Perl 5.10 and later do scope it lexically. I modify my previous program to use the re pragma instead of the command-line switch:
    #!/usr/bin/perl

    use re 'debug';

    $ARGV[0] =~ /$ARGV[1]/;
I don’t have to modify my program to use re since I can also load it from the command line:
$ perl -Mre=debug explain-regex 'Just another Perl hacker,' 'Just another (\S+) hacker,'
When I run this program with a regex as its argument, I get almost exactly the same output as my previous -Dr example.
The YAPE::Regex::Explain module, although a bit old, might be useful in explaining a regex in mostly plain English. It parses a regex and provides a description of what each part does. It can’t explain the semantic purpose, but I can’t have everything. With a short program I can explain the regex I specify on the command line:
    #!/usr/bin/perl

    use YAPE::Regex::Explain;

    print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
When I run the program even with a short, simple regex, I get plenty of output:
    $ perl yape-explain 'Just another (\S+) hacker,'
    The regular expression:

    (?-imsx:Just another (\S+) hacker,)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      Just another             'Just another '
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        \S+                      non-whitespace (all but \n, \r, \t, \f,
                                 and " ") (1 or more times (matching the
                                 most amount possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
       hacker,                 ' hacker,'
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
It’s almost the end of the chapter, but there are still so many regular expression features I find useful. Consider this section a quick tour of the things you can look into on your own.
I don’t have to be content with the simple character classes such as \w (word characters), \d (digits), and the others denoted by backslash sequences. I can also use the POSIX character classes. I put colons on both sides of the name and enclose that in square brackets, and since a POSIX class only works inside a bracketed character class, I end up with two sets of brackets:
    print "Found alphabetic character!\n" if $string =~ m/[[:alpha:]]/;
    print "Found hex digit!\n" if $string =~ m/[[:xdigit:]]/;
I negate those with a caret, ^, after the first colon. A negated class matches a single character that isn’t in the class, so these report that some character falls outside the class:
    print "Found a non-alphabetic character!\n" if $string =~ m/[[:^alpha:]]/;
    print "Found a non-space character!\n" if $string =~ m/[[:^space:]]/;
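Here’s a short sketch of the POSIX classes at work. Note the doubled brackets: the [:alpha:] part is only special inside a bracketed character class. The posix_classes_in name and the sample string are hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: list the classes that match at least one
# character somewhere in $string.
sub posix_classes_in {
    my( $string ) = @_;
    my @found;

    push @found, 'alpha'  if $string =~ /[[:alpha:]]/;
    push @found, 'digit'  if $string =~ /[[:digit:]]/;
    push @found, 'space'  if $string =~ /[[:space:]]/;

    # negated class: some character that is NOT alphabetic
    push @found, '^alpha' if $string =~ /[[:^alpha:]]/;

    return @found;
}

my $string = 'Perl 5';
print join( ' ', posix_classes_in( $string ) ), "\n";

# To say the string has no alphabetic characters at all, I negate
# the match instead of the class.
print "No alphabetic characters\n" if $string !~ /[[:alpha:]]/;
```

For Perl 5, the first print shows alpha digit space ^alpha and the second prints nothing. The last statement shows the difference between negating the class, which finds one character outside it, and negating the match, which says no character is inside it.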
I can say the same thing in another way by specifying a named property. The \p{Name} sequence (little p) includes the characters for the named property, and the \P{Name} sequence (big P) is its complement:
    print "Found ASCII character!\n" if $string =~ m/\p{IsASCII}/;
    print "Found control characters!\n" if $string =~ m/\p{IsCntrl}/;
    print "Found a non-punctuation character!\n" if $string =~ m/\P{IsPunct}/;
    print "Found a non-uppercase character!\n" if $string =~ m/\P{IsUpper}/;
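Since \P{Name} matches a single character without the property, I can combine it with a negated match to ask whether every character has the property. This is a minimal sketch with hypothetical helper names, reusing property names from the examples above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# true if at least one character is uppercase
sub has_upper {
    my( $string ) = @_;
    return $string =~ /\p{IsUpper}/ ? 1 : 0;
}

# true if no character falls outside ASCII: the double negative
# turns "found a non-ASCII character" into "everything is ASCII"
sub all_ascii {
    my( $string ) = @_;
    return $string !~ /\P{IsASCII}/ ? 1 : 0;
}

# \x{EF} is a lowercase i with a diaeresis, which is not ASCII
foreach my $string ( 'perl', 'Perl', "na\x{EF}ve" ) {
    printf "has_upper=%d all_ascii=%d\n",
        has_upper( $string ), all_ascii( $string );
}
```

The same double negative works with any property, and with the negated POSIX classes too.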
The Regexp::Common module provides pretested and known-to-work regexes for, well, common things such as web addresses, numbers, postal codes, and even profanity. It gives me a multilevel hash, %RE, whose values are regexes. If I don’t like that, I can use its function interface, which I get by importing RE_ALL:
    use Regexp::Common 'RE_ALL';

    print "Found a real number\n" if $string =~ /$RE{num}{real}/;

    print "Found a real number\n" if $string =~ RE_num_real();
If I want to build up my own pattern, I can use Regexp::English, which uses a series of chained methods to return an object that stands in for a regex. It’s probably not something you want in a real program, but it’s fun to think about:
    use Regexp::English;

    my $regexp = Regexp::English->new
        ->literal( 'Just' )
        ->whitespace_char
        ->word_chars
        ->whitespace_char
        ->remember( \$type_of_hacker )
            ->word_chars
        ->end
        ->whitespace_char
        ->literal( 'hacker' );

    $regexp->match( 'Just another Perl hacker,' );

    print "The type of hacker is [$type_of_hacker]\n";
If you really want to get into the nuts and bolts of regular expressions, check out O’Reilly’s Mastering Regular Expressions by Jeffrey Friedl. You’ll not only learn some advanced features, but how regular expressions work and how you can make yours better.
This chapter covered some of the more useful advanced features of Perl’s regex engine. The qr() quoting operator lets me compile a regex for later and gives it back to me as a reference. With the special (?) sequences, I can make my regular expression much more powerful, as well as less complicated. The \G anchor allows me to anchor the next match where the last one left off, and using the /c flag, I can try several possibilities without resetting the match position if one of them fails.
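As a compact reminder of how those last pieces fit together, here’s a small tokenizer sketch. The token types, the patterns, and the tokenize name are illustrative assumptions of mine; the technique is just qr() for the precompiled rules and \G with /gc to try each rule at the current position:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each rule pairs a token type with a regex compiled once by qr().
my @rules = (
    [ number => qr/\d+/          ],
    [ word   => qr/[A-Za-z_]\w*/ ],
    [ space  => qr/\s+/          ],
    [ other  => qr/./s           ],
);

sub tokenize {
    my( $input ) = @_;
    my @tokens;

    pos( $input ) = 0;   # start matching at the beginning

    TOKEN: while( pos( $input ) < length $input ) {
        foreach my $rule ( @rules ) {
            my( $type, $regex ) = @$rule;

            # \G anchors at the current position; /c keeps that
            # position when this rule fails so the next one can try
            next unless $input =~ /\G($regex)/gc;

            push @tokens, "$type:$1" unless $type eq 'space';
            next TOKEN;
        }

        last;   # no rule matched; stop rather than loop forever
    }

    return @tokens;
}

print join( ' ', tokenize( 'x = 42;' ) ), "\n";
```

For that input it prints word:x other:= number:42 other:;. The order of the rules matters because the first one that matches at the current position wins.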
perlre is the documentation for Perl regexes, and perlretut gives a regex tutorial. Don’t confuse that with perlreftut, the tutorial on references. To make it even more complicated, perlreref is the regex quick reference.
The details for regex debugging show up in perldebguts. It explains the output of -Dr and re 'debug'.
Perl Best Practices has a section on regexes, and gives the /x “Extended Formatting” flag pride of place.
Mastering Regular Expressions covers regexes in general, and compares their implementation in different languages. Jeffrey Friedl has an especially nice description of lookahead and lookbehind operators. If you really want to know about regexes, this is the book to get.
Simon Cozens explains advanced regex features in two articles for Perl.com: “Regexp Power” (http://www.perl.com/pub/a/2003/06/06/regexps.html) and “Power Regexps, Part II” (http://www.perl.com/pub/a/2003/07/01/regexps.html).
The web site http://www.regular-expressions.info has good discussions about regular expressions and their implementations in different languages.
[1] As of Perl 5.6, if the string does not change, Perl will not recompile that regex. Before Perl 5.6, I had to use the /o flag to get that behavior. I can still use /o if I don’t want to recompile the pattern even if the variable changes.
[2] That actually happens in the maybe_regex method in Test::Builder.
[3] I can also escape a literal space character with a \, but since I can’t really see the space, I prefer to use something I can see, such as \x20.
[4] This is the same trick I need to use to interpolate function calls inside a string: print "Result is: @{ [ func(@args) ] }".
[5] As a publisher, O’Reilly Media has dealt with this many times, so it maintains a word list to say how they do it, although that doesn’t mean that authors like me read it: http://www.oreilly.com/oreilly/author/stylesheet.html.
[6] The U.S. Census Bureau has a population clock so you can use the latest number if you’re reading this book a long time from now: http://www.census.gov/main/www/popclock.html.
[7] The regular expression debugging mode requires an interpreter compiled with -DDEBUGGING. Running perl -V shows the interpreter’s compilation options.