Chapter 4. Pattern Matching with Regular Expressions

Introduction

Suppose you have been on the Internet for a few years and have been very faithful about saving all your correspondence, just in case you (or your lawyers, or the prosecution) need a copy. The result is that you have a 50-megabyte disk partition dedicated to saved mail. And let’s further suppose that you remember that there is one letter, somewhere in there, from someone named Angie or Anjie. Or was it Angy? But you don’t remember what you called it or where you stored it. Obviously, you will have to go look for it.

But while some of you go and try to open up all 15,000,000 documents in a word processor, I’ll just find it with one simple command. Any system that provides regular expression support will allow me to search for the pattern:

An[^ dn]

in all the files. The “A” and the “n” match themselves, in effect finding words that begin with “An”, while the cryptic [^ dn] is a negated character class: it requires the “An” to be followed by one more character that is not a space (to eliminate the very common English word “an” at the start of a sentence), not “d” (to eliminate the common word “and”), and not “n” (to eliminate Anne, Announcing, etc.). Has your word processor gotten past its splash screen yet? Well, it doesn’t matter, because I’ve already found the missing file. To find the answer, I just typed the command:[14]

grep 'An[^ dn]' *
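
To see what the pattern accepts and rejects, here is a minimal sketch in Java. It assumes a JDK that ships the java.util.regex package (1.4 or later), which is not the RE API used in this chapter; the class name AnPatternDemo and the sample lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical demo class; the name and sample lines are invented for illustration. */
public class AnPatternDemo {

    // The same pattern given to grep: "An" followed by any one character
    // that is not a space, 'd', or 'n'.
    static final Pattern NAME = Pattern.compile("An[^ dn]");

    /** Returns the lines that contain at least one match, much as grep prints them. */
    static List<String> matchingLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            // find() looks for the pattern anywhere in the line, like grep;
            // matches() would insist that the whole line match.
            if (NAME.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "An apple a day",      // rejected: "An" is followed by a space
            "And then we left",    // rejected: "An" is followed by 'd'
            "Anne called twice",   // rejected: "An" is followed by 'n'
            "Love, Angie",         // accepted
            "From: Anjie",         // accepted
            "Another day gone");   // accepted too: the pattern is only a coarse filter
        for (String line : matchingLines(sample)) {
            System.out.println(line);
        }
    }
}
```

Running main prints the last three sample lines. The “Another” line shows that the pattern narrows the search rather than pinpointing one name, which is usually all you need when hunting for a file. Note also that a line ending in a bare “An” does not match, because the character class must consume one more character.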

Regular expressions, or REs for short, provide a concise and precise specification of patterns to be matched in text. Java 2 did not include any facilities for describing regular expressions in text. This is mildly surprising given how powerful regular expressions are, how ubiquitous they are on the Unix operating system where Java was first brewed, and how powerful they are in modern scripting languages like sed, awk, Python, and Perl.

At any rate, there were no RE packages for Java when I first learned the language, and because of this, I wrote my own RE package. More recently, I had planned to submit a JSR[15] to Sun Microsystems, proposing to add to Java a regular expressions API similar to the one used in this chapter. However, the Apache Jakarta Regular Expressions project[16] has achieved sufficient momentum to become nearly a standard, but without the politics and meetings required of a JSR. Accordingly, my JSR has not been submitted yet. Conveniently, the Jakarta folk used a similar syntax to mine, so I was mostly able to migrate to theirs just by changing the imports. However, the Apache code is vastly more efficient than mine and should be used whenever possible. Mine was written for pedagogical display, and compiles the RE into an array of SubExpression objects. The Jakarta package, borrowing a trick from Java,[17] compiles to an array of integer commands, making it run much faster: around a factor of 3 or 4, even for simple cases like searching for the string “java” in a few dozen files. There are in fact a half dozen or so regular expression packages for Java; see Table 4-1.

Table 4-1. Java RE packages

Package                                             Notes                                    URL
--------------------------------------------------  ---------------------------------------  -------------------------------------
Richard Emberson’s                                  Unknown license; not being maintained.   None; posted to
Ian Darwin’s RE                                     Simple, but SLOW. Incomplete; didactic.  http://www.darwinsys.com/java/
Apache Jakarta RegExp (original by Jonathan Locke)  Apache (BSD-like) license.               http://jakarta.apache.org/regexp/
Apache Jakarta ORO                                  Apache license. More comprehensive?      http://jakarta.apache.org/oro/
Daniel Savarese                                     Unknown.                                 http://www.cs.umd.edu/users/dfs/java/
“GNU Java Regexp”                                   GPL; fairly fast.                        http://www.gjt.org (Giant Java Tree)

The syntax of REs themselves is discussed in Section 4.2, hints on using them in Section 4.3, and the syntax of the Java API for using REs in Section 4.4.

See Also

O’Reilly’s Mastering Regular Expressions by Jeffrey E. F. Friedl is the definitive guide to all the details of regular expressions. Most introductory Unix tomes include some discussion of REs; O’Reilly’s UNIX Power Tools devotes a chapter to them.



[14] Non-Unix fans rejoice, for you can do this on Win32 using a package variously called CygWin (after Cygnus Software) or GnuWin32 (http://sources.redhat.com/cygwin/). Or you can use my Grep program in Section 4.9 if you don’t have grep on your system. Incidentally, the name grep comes from an ancient Unix line editor command g/RE/p, the command to globally find the RE (regular expression) in all lines in the edit buffer and print the lines that match: just what the grep program does to lines in files.

[15] A JSR is a Java Specification Request, the process by which new standards are submitted by the Java Community and discussed in public prior to adoption. See Sun’s Java Community web site (http://developer.java.sun.com/developer/community/).

[16] Apache has, in fact, two regular expressions packages. The second, Oro, provides full Perl5-style regular expressions, AWK-like regular expressions, glob expressions, and utility classes for performing substitutions, splits, filtering filenames, etc. This library is the successor to the OROMatcher, AwkTools, PerlTools, and TextTools libraries from ORO, Inc. (http://www.oroinc.com).

[17] Java perhaps got the idea from the UCSD P-system, which used portable bytecodes in the early 1980s and ran on all the popular microcomputers of the day.
