Regular Expressions

Now it’s time to take a brief detour on our trip through Java and enter the land of regular expressions. A regular expression, or regex for short, describes a text pattern. Regular expressions are used with many tools—including the java.util.regex package, text editors, and many scripting languages—to provide sophisticated text-searching and powerful string-manipulation capabilities.

If you are already familiar with the concept of regular expressions and how they are used with other languages, you may wish to skim through this section. At the very least, you’ll need to look at the “The java.util.regex API” section later in this chapter, which covers the Java classes necessary to use them. On the other hand, if you’ve come to this point on your Java journey with a clean slate on this topic and you’re wondering exactly what regular expressions are, then pop open your favorite beverage and get ready. You are about to learn about the most powerful tool in the arsenal of text manipulation and what is, in fact, a tiny language within a language, all in the span of a few pages.

Regex Notation

A regular expression describes a pattern in text. By pattern, we mean just about any feature you can imagine identifying in text from the literal characters alone, without actually understanding their meaning. This includes features such as words, word groupings, lines and paragraphs, punctuation, case, and more generally, strings and numbers with a specific structure to them, such as phone numbers, email addresses, and quoted phrases. With regular expressions, you can search the dictionary for all the words that have the letter “q” without its pal “u” next to it, or words that start and end with the same letter. Once you have constructed a pattern, you can use simple tools to hunt for it in text or to determine if a given string matches it. A regex can also be arranged to help you dismember specific parts of the text it matched, which you could then use as elements of replacement text if you wish.

Write once, run away

Before moving on, we should say a few words about regular expression syntax in general. At the beginning of this section, we casually mentioned that we would be discussing a new language. Regular expressions do, in fact, constitute a simple form of programming language. If you think for a moment about the examples we cited earlier, you can see that something like a language is going to be needed to describe even simple patterns—such as email addresses—that have some variation in form.

A computer science textbook would classify regular expressions at the bottom of the hierarchy of computer languages, in terms of both what they can describe and what you can do with them. They are still capable of being quite sophisticated, however. As with most programming languages, the elements of regular expressions are simple, but they can be built up in combination to arbitrary complexity. And that is where things start to get sticky.

Since regexes work on strings, it is convenient to have a very compact notation that can be easily wedged between characters. But compact notation can be very cryptic, and experience shows that it is much easier to write a complex statement than to read it again later. Such is the curse of the regular expression. You may find that in a moment of late-night, caffeine-fueled inspiration, you can write a single glorious pattern to simplify the rest of your program down to one line. When you return to read that line the next day, however, it may look like Egyptian hieroglyphics to you. Simpler is generally better. If you can break your problem down and do it more clearly in several steps, maybe you should.

Escaped characters

Now that you’re properly warned, we have to throw one more thing at you before we build you back up. Not only can the regex notation get a little hairy, but it is also somewhat ambiguous with ordinary Java strings. An important part of the notation is the escaped character, a character with a backslash in front of it. For example, the escaped d character, \d, (backslash ‘d’) is shorthand that matches any single digit character (0-9). However, you cannot simply write \d as part of a Java string, because Java uses the backslash for its own special characters and to specify Unicode character sequences (\uxxxx). Fortunately, Java gives us a replacement: an escaped backslash, which is two backslashes (\\), means a literal backslash. The rule is, when you want a backslash to appear in your regex, you must escape it with an extra one:

    "\\d" // Java string that yields backslash "d"

And just to make things crazier, because regex notation itself uses backslash to denote special characters, it must provide the same “escape hatch” as well—allowing you to double up backslashes if you want a literal backslash. So if you want to specify a regular expression that includes a single literal backslash, it looks like this:

    "\\\\"  // Java string yields two backslashes; regex yields one

Most of the “magic” operator characters you read about in this section operate on the character that precedes them, so these also must be escaped if you want their literal meaning. This includes such characters as ., *, +, braces {}, and parentheses ().

If you need to create part of an expression that has lots of literal characters in it, you can use the special delimiters \Q and \E to help you. Any text appearing between \Q and \E is automatically escaped. (You still need the Java String escapes—double backslashes for backslash, but not quadruple.) There is also a static method Pattern.quote(), which does the same thing, returning a properly escaped version of whatever string you give it.

Beyond that, our only suggestion to help maintain your sanity when working with these examples is to keep two copies: a comment line showing the naked regular expression, and the real Java string, in which you must double up all backslashes.
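The doubling rules can be checked with a short sketch. The class and helper names here (EscapeDemo and its methods) are illustrative assumptions rather than part of the chapter; each comment shows the naked regex next to the Java string that produces it.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // regex: \d   (a single digit)
    static boolean isSingleDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    // regex: \\   (a single literal backslash)
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    // Pattern.quote() escapes the whole literal, so characters such as
    // + and . lose their special meaning.
    static boolean matchesLiterally(String literal, String s) {
        return Pattern.matches(Pattern.quote(literal), s);
    }

    // regex: \Q1+1=2\E   (the same idea, written inline)
    static boolean matchesQuotedSum(String s) {
        return Pattern.matches("\\Q1+1=2\\E", s);
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigit("7"));
        System.out.println(isSingleBackslash("\\"));
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));
        System.out.println(matchesQuotedSum("1+1=2"));
    }
}
```

Without the quoting, the + in 1+1=2 would be read as the "one or more" operator and the literal text would not match itself.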

Characters and character classes

Now, let’s dive into the actual regex syntax. The simplest form of a regular expression is plain, literal text, which has no special meaning and is matched directly (character for character) in the input. This can be a single character or more. For example, in the following string, the pattern “s” can match the character s in the words rose and is:

    "A rose is $1.99."

The pattern “rose” can match only the literal word rose. But this isn’t very interesting. Let’s crank things up a notch by introducing some special characters and the notion of character “classes.”

Any character: dot (.)

The special character dot (.) matches any single character. The pattern “.ose” matches rose, nose, _ose (space followed by ose) or any other character followed by the sequence ose. Two dots match any two characters, and so on. The dot operator is not discriminating; it normally stops only for an end-of-line character (and, optionally, you can tell it not to; we discuss that later).

We can consider “.” to represent the group or class of all characters. And regexes define more interesting character classes as well.

Whitespace or nonwhitespace character: \s, \S

The special character \s matches a literal space character or one of the following characters: \t (tab), \r (carriage return), \n (newline), \f (formfeed), and \x0B (vertical tab). The corresponding special character \S does the inverse, matching any character except whitespace.

Digit or nondigit character: \d, \D

\d matches any of the digits 0-9. \D does the inverse, matching all characters except digits.

Word or nonword character: \w, \W

\w matches a “word” character, including upper- and lowercase letters A-Z, a-z, the digits 0-9, and the underscore character (_). \W matches everything except those characters.

Custom character classes

You can define your own character classes using the notation [...]. For example, the following class matches any of the characters a, b, c, x, y, or z:

    [abcxyz]

The special x-y range notation can be used as shorthand for a run of consecutive characters, such as the alphabetic characters. The following example defines a character class containing all upper- and lowercase letters:

    [A-Za-z]

Placing a caret (^) as the first character inside the brackets inverts the character class. This example matches any character except uppercase A-F:

    [^A-F]    //  G, H, I, ..., a, b, c, ... etc.

Nesting character classes simply adds them:

    [A-F[G-Z]]   // A-Z

The && logical AND notation can be used to take the intersection (characters in common):

    [a-p&&[l-z]]  // l, m, n, o, p
    [A-Z&&[^P]]  // A through Z except P
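To see these classes in action, here is a minimal sketch that tests single characters against a class. The CharClassDemo class and its inClass() helper are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns true if the one-character string ch belongs to charClass.
    static boolean inClass(String charClass, String ch) {
        return Pattern.matches(charClass, ch);
    }

    public static void main(String[] args) {
        System.out.println(inClass("\\s", "\t"));          // regex: \s
        System.out.println(inClass("\\w", "_"));           // regex: \w
        System.out.println(inClass("[^A-F]", "G"));        // inverted class
        System.out.println(inClass("[A-F[G-Z]]", "Q"));    // nested classes add up
        System.out.println(inClass("[a-p&&[l-z]]", "m"));  // intersection
        System.out.println(inClass("[A-Z&&[^P]]", "P"));   // P is excluded
    }
}
```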

Position markers

The pattern “[Aa] rose” (including an upper- or lowercase A) matches three times in the following phrase:

    "A rose is a rose is a rose"

Position characters allow you to designate the relative location of a match. The most important are ^ and $, which match the beginning and end of a line, respectively:

    ^[Aa] rose  // matches "A rose" at the beginning of line
    [Aa] rose$  // matches "a rose" at end of line

By default, ^ and $ match the beginning and end of “input,” which is often a line. If you are working with multiple lines of text and wish to match the beginnings and endings of lines within a single large string, you can turn on “multiline” mode as described later in this chapter.

The position markers \b and \B match a word boundary or nonword boundary, respectively. For example, the following pattern matches rose and rosemary, but not primrose:

    \brose
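The effect of the position markers is easiest to see by listing where each pattern matches. The following is a small sketch; PositionDemo and its starts() helper are assumed names, and the sample strings are the ones used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionDemo {
    // Collects the start index of every match of regex in text.
    static List<Integer> starts(String regex, String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String phrase = "A rose is a rose is a rose";
        System.out.println(starts("[Aa] rose", phrase));   // three matches
        System.out.println(starts("^[Aa] rose", phrase));  // only at the start
        System.out.println(starts("[Aa] rose$", phrase));  // only at the end
        // regex: \brose   (rose and rosemary, but not primrose)
        System.out.println(starts("\\brose", "rose rosemary primrose"));
    }
}
```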

Iteration (multiplicity)

Simply matching fixed character patterns would not get us very far. Next, we look at operators that count the number of occurrences of a character (or more generally, of a pattern, as we’ll see in Capture groups):

Any (zero or more iterations): asterisk (*)

Placing an asterisk (*) after a character or character class means “allow any number of that type of character”—in other words, zero or more. For example, the following pattern matches a digit with any number of leading zeros (possibly none):

    0*\d   // match a digit with any number of leading zeros

Some (one or more iterations): plus sign (+)

The plus sign (+) means “one or more” iterations and is equivalent to XX* (pattern followed by pattern asterisk). For example, the following pattern matches a number with one or more digits, plus optional leading zeros:

    0*\d+   // match a number (one or more digits) with optional leading 
            // zeros

It may seem redundant to match the zeros at the beginning of an expression because zero is a digit and is thus matched by the \d+ portion of the expression anyway. However, we’ll show later how you can pick apart the string using a regex and get at just the pieces you want. In this case, you might want to strip off the leading zeros and keep only the digits.

Optional (zero or one iteration): question mark (?)

The question mark operator (?) allows exactly zero or one iteration. For example, the following pattern matches a credit-card expiration date, which may or may not have a slash in the middle:

    \d\d/?\d\d  // match four digits with an optional slash in the middle

Range (between x and y iterations, inclusive): {x,y}

The {x,y} curly-brace range operator is the most general iteration operator. It specifies a precise range to match. A range takes two arguments: a lower bound and an upper bound, separated by a comma. This regex matches any word with five to seven characters, inclusive:

    \b\w{5,7}\b  // match words with at least 5 and at most 7 characters

At least x or more iterations (y is infinite): {x,}

If you omit the upper bound, simply leaving a dangling comma in the range, the upper bound becomes infinite. This is a way to specify a minimum of occurrences with no maximum.
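The iteration operators can be tried side by side with a brief sketch. QuantifierDemo and its matches() wrapper are illustrative names; the \w{3,} pattern is an example of our own for the open-ended range, and the other patterns come from the text.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Whole-string match, so leftover characters count as a failure.
    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        System.out.println(matches("0*\\d", "0007"));            // regex: 0*\d
        System.out.println(matches("0*\\d+", "00042"));          // regex: 0*\d+
        System.out.println(matches("\\d\\d/?\\d\\d", "12/25"));  // slash present
        System.out.println(matches("\\d\\d/?\\d\\d", "1225"));   // slash omitted
        System.out.println(matches("\\b\\w{5,7}\\b", "regex"));  // 5 characters
        System.out.println(matches("\\w{3,}", "abcdefghij"));    // 3 or more
    }
}
```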

Grouping

Just as in logical or mathematical operations, parentheses can be used in regular expressions to make subexpressions or to put boundaries on parts of expressions. This power lets us extend the operators we’ve talked about to work not only on characters, but also on words or other regular expressions. For example:

    (yada)+

Here we are applying the + (one or more) operator to the whole pattern yada, not just one character. It matches yada, yadayada, yadayadayada, and so on.

Using grouping, we can start building more complex expressions. For example, while many email addresses have a three-part structure (e.g., foo@bar.com), the domain name portion can, in actuality, contain an arbitrary number of dot-separated components. To handle this properly, we can use an expression like this one:

    \w+@\w+(\.\w+)+   // Match an email address

This expression matches a word, followed by an @ symbol, followed by another word and then one or more literal dot-separated words, e.g., foo@bar.com, foo@mail.bar.com, or foo@mail.bar.co.uk.
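A quick way to check a grouped pattern is to run it against a few candidates. This sketch uses \w+ inside the group so that each dot is followed by a whole word; the class name and sample addresses are assumptions for illustration, and real email addresses allow characters (such as hyphens) that \w does not cover.

```java
import java.util.regex.Pattern;

class GroupingDemo {
    // regex: (yada)+
    static boolean isYada(String s) {
        return Pattern.matches("(yada)+", s);
    }

    // regex: \w+@\w+(\.\w+)+
    static boolean looksLikeEmail(String s) {
        return Pattern.matches("\\w+@\\w+(\\.\\w+)+", s);
    }

    public static void main(String[] args) {
        System.out.println(isYada("yadayadayada"));
        System.out.println(looksLikeEmail("foo@bar.com"));
        System.out.println(looksLikeEmail("foo@mail.bar.co.uk"));
        System.out.println(looksLikeEmail("foo@bar")); // no dot-separated part
    }
}
```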

Capture groups

In addition to basic grouping of operations, parentheses have an important, additional role: the text matched by each parenthesized subexpression can be separately retrieved. That is, you can isolate the text that matched each subexpression. There is then a special syntax for referring to each capture group within the regular expression by number. This important feature has two uses.

First, you can construct a regular expression that refers to the text it has already matched and uses this text as a parameter for further matching. This allows you to express some very powerful things. For example, we can show the dictionary example we mentioned in the introduction. Let’s find all the words that start and end with the same letter:

    \b(\w)\w*\1\b  // match words beginning and ending with the same letter

See the \1 in this expression? It’s a reference to the first capture group in the expression, (\w). References to capture groups take the form \n where n is the number of the capture group, counting from left to right. In this example, the first capture group matches a word character on a word boundary. Then we allow any number of word characters up to the special reference \1 (also followed by a word boundary). The \1 means “the value matched in capture group one.” Because these characters must be the same, this regex matches words that start and end with the same character.
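Here is a minimal sketch of the backreference at work. The BackrefDemo class, its sameEnds() method, and the sample sentence are all made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // regex: \b(\w)\w*\1\b
    static final Pattern SAME_ENDS = Pattern.compile("\\b(\\w)\\w*\\1\\b");

    // Returns every word that starts and ends with the same character.
    static List<String> sameEnds(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = SAME_ENDS.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(sameEnds("area level java dad"));
    }
}
```

Note that this pattern needs at least two characters, so a one-letter word such as "a" is not reported.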

The second use of capture groups is in referring to the matched portions of text while constructing replacement text. We’ll show you how to do that a bit later when we talk about the Regular Expression API.

Capture groups can contain more than one character, of course, and you can have any number of groups. You can even nest capture groups. Next, we discuss exactly how they are numbered.

Numbering

Capture groups are numbered, starting at 1, and moving from left to right, by counting the number of open parentheses it takes to reach them. The special group number 0 always refers to the entire expression match. For example, consider the following expression:

    one ((two) (three (four)))

Matched against the string “one two three four,” this expression creates the following groups:

    Group 0: one two three four
    Group 1: two three four
    Group 2: two
    Group 3: three four
    Group 4: four

Before going on, we should note one more thing. So far in this section we’ve glossed over the fact that parentheses are doing double duty: creating logical groupings for operations and defining capture groups. What if the two roles conflict? Suppose we have a complex regex that uses parentheses to group subexpressions and to create capture groups? In that case, you can use a special noncapturing group operator (?:) to do logical grouping instead of using parentheses. You probably won’t need to do this often, but it’s good to know.
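The numbering rule can be verified directly with Matcher's group(int) and groupCount() methods. In this sketch, GroupNumberDemo and its groups() helper are hypothetical names; the second pattern is our own variation that swaps two of the groups for noncapturing ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    // Returns groups 0..groupCount() for a whole-string match,
    // or an empty array if the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "one two three four";
        String[] all = groups("one ((two) (three (four)))", input);
        for (int i = 0; i < all.length; i++) {
            System.out.println("Group " + i + ": " + all[i]);
        }
        // With (?:...) the outer and innermost groups no longer capture,
        // so only "two" and "three four" are numbered.
        String[] fewer = groups("one (?:(two) (three (?:four)))", input);
        System.out.println(fewer.length - 1 + " capture groups");
    }
}
```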

Alternation

The vertical bar (|) operator denotes the logical OR operation, also called alternation or choice. The | operator does not operate on individual characters but instead applies to everything on either side of it. It splits the expression in two unless constrained by parentheses grouping. For example, a slightly naive approach to parsing dates might be the following:

    \w+, \w+ \d+, \d+|\d\d/\d\d/\d\d\d\d  // pattern 1 or pattern 2

In this expression, the left side matches patterns such as Fri, Oct 12, 2001, and the right side matches 10/12/2001.

The following regex might be used to match email addresses with one of three domains (net, edu, and gov):

    \w+@[\w\.]*\.(net|edu|gov)  // email address ending in .net, .edu, or .gov
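Both forms of alternation can be exercised with a short sketch. The date pattern used here includes the comma after the day and a four-digit year so that both sample dates match in full; the class name, method names, and sample addresses are assumptions for illustration.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // regex: \w+, \w+ \d+, \d+|\d\d/\d\d/\d\d\d\d
    // The | splits the whole expression into two alternatives.
    static boolean isDate(String s) {
        return Pattern.matches(
            "\\w+, \\w+ \\d+, \\d+|\\d\\d/\\d\\d/\\d\\d\\d\\d", s);
    }

    // regex: \w+@[\w\.]*\.(net|edu|gov)
    // Parentheses confine the | to the three domain suffixes.
    static boolean hasAllowedDomain(String s) {
        return Pattern.matches("\\w+@[\\w\\.]*\\.(net|edu|gov)", s);
    }

    public static void main(String[] args) {
        System.out.println(isDate("Fri, Oct 12, 2001"));
        System.out.println(isDate("10/12/2001"));
        System.out.println(hasAllowedDomain("pat@cs.school.edu"));
        System.out.println(hasAllowedDomain("pat@foo.com"));
    }
}
```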

Special options

There are several special options that affect the way the regex engine performs its matching. These options can be applied in two ways:

  • You can pass in one or more flags during the Pattern.compile() step (discussed later in this chapter).

  • You can include a special block of code in your regex.

We’ll show the latter approach here. To do this, include one or more flags in a special block (?x), where x is the flag for the option we want to turn on. Generally, you do this at the beginning of the regex. You can also turn off flags by adding a minus sign (?-x), which allows you to apply flags to select parts of your pattern.

The following flags are available:

Case-insensitive: (?i)

The (?i) flag tells the regex engine to ignore case while matching, for example:

    (?i)yahoo   // match Yahoo, yahoo, yahOO, etc.

Dot all: (?s)

The (?s) flag turns on “dot all” mode, allowing the dot character to match anything, including end-of-line characters. It is useful if you are matching patterns that span multiple lines. The s stands for “single-line mode,” a somewhat confusing name derived from Perl.

Multiline: (?m)

By default, ^ and $ don’t really match the beginning and end of lines (as defined by carriage return or newline combinations); they instead match the beginning or end of the entire input text. Turning on multiline mode with (?m) causes them to match the beginning and end of every line as well as the beginning and end of input. Specifically, this means the spot before the first character, the spot after the last character, and the spots just after and before line terminators inside the string.

Unix lines: (?d)

The (?d) flag limits the definition of the line terminator for the ^, $, and . special characters to Unix-style newline only (\n). By default, other terminators, such as a lone carriage return (\r) and the carriage return newline pair (\r\n), are also recognized.
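The embedded flags are easy to experiment with. The following sketch compares each flag against the default behavior; FlagDemo and its helpers are illustrative names, and the sample strings are our own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    // The same option supplied as a compile-time flag instead of (?i).
    static boolean matchesIgnoringCase(String regex, String s) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(s).matches();
    }

    // Counts how many times regex is found in text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(matches("(?i)yahoo", "YaHoO"));         // ignore case
        System.out.println(matchesIgnoringCase("yahoo", "YaHoO")); // same, via flag
        System.out.println(matches("a.b", "a\nb"));       // dot stops at newline
        System.out.println(matches("(?s)a.b", "a\nb"));   // dot all
        System.out.println(count("^\\w+", "one\ntwo\nthree"));     // start of input only
        System.out.println(count("(?m)^\\w+", "one\ntwo\nthree")); // start of each line
        System.out.println(count("(?m)^\\w+", "one\rtwo"));   // \r ends a line
        System.out.println(count("(?md)^\\w+", "one\rtwo"));  // Unix lines: only \n does
    }
}
```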

Greediness

We’ve seen hints that regular expressions are capable of sorting some complex patterns. But there are cases where what should be matched is ambiguous (at least to us, though not to the regex engine). Probably the most important example has to do with the number of characters the iterator operators consume before stopping. The .* operation best illustrates this. Consider the following string:

    "Now is the time for <bold>action</bold>, not words."

Suppose we want to search for all the HTML-style tags (the parts between the < and > characters), perhaps because we want to remove them.

We might naively start with this regex:

    </?.*>  // match <, optional /, and then anything up to >

We then get the following match, which is much too long:

    <bold>action</bold>

The problem is that the .* operation, like all the iteration operators, is by default “greedy,” meaning that it consumes absolutely everything it can, up until the last match for the terminating character (in this case, >) in the file or line.

There are solutions for this problem. The first is to “say what it is,” that is, to be specific about what is allowed between the angle brackets. The content of an HTML tag cannot include just anything; for example, it cannot include a closing bracket (>). So we could rewrite our expression as:

    </?\w*>  // match <, optional /, any number of word characters, then >

But suppose the content is not so easy to describe. For example, we might be looking for quoted strings in text, which could include just about any text. In that case, we can use a second approach and “say what it is not.” We can invert our logic from the previous example and specify that anything except a closing bracket is allowed inside the brackets:

    </?[^>]*>

This is probably the most efficient way to tell the regex engine what to do. It then knows exactly what to look for to stop reading. This approach has limitations, however. It is not obvious how to do this if the delimiter is more complex than a single character. It is also not very elegant.

Finally, we come to our general solution: the use of “reluctant” operators. For each of the iteration operators, there is an alternative, nongreedy form that consumes as few characters as possible, while still trying to get a match with what comes after it. This is exactly what we needed in our previous example.

Reluctant operators take the form of the standard operator with a “?” appended. (Yes, we know that’s confusing.) We can now write our regex as:

    </?.*?> // match <, optional /, minimum number of any chars, then >

We have appended ? to .* to cause .* to match as few characters as possible while still making the final match of >. The same technique (appending the ?) works with all the iteration operators, as in the two following examples:

    .+?   // one or more, nongreedy
    .{x,y}?  // between x and y, nongreedy
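The difference between the greedy and reluctant forms shows up clearly when each pattern is run over the sample string. This is a small sketch; GreedyDemo and findAll() are assumed names introduced for the comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    static final String TEXT =
        "Now is the time for <bold>action</bold>, not words.";

    // Returns the text of every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("</?.*>", TEXT));    // greedy: one long match
        System.out.println(findAll("</?\\w*>", TEXT));  // say what it is
        System.out.println(findAll("</?[^>]*>", TEXT)); // say what it is not
        System.out.println(findAll("</?.*?>", TEXT));   // reluctant
    }
}
```

The greedy pattern reports a single match spanning both tags, while the other three each report the opening and closing tags separately.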

Lookaheads and lookbehinds

In order to understand our next topic, let’s return for a moment to the position marking characters (^, $, \b, and \B) that we discussed earlier. Think about what exactly these special markers do for us. We say, for example, that the \b marker matches a word boundary. But the word “match” here may be a bit too strong. In reality, it “requires” a word boundary to appear at the specified point in the regex. Suppose we didn’t have \b; how could we construct it? Well, we could try constructing a regex that matches the word boundary. It might seem easy, given the word and nonword character classes (\w and \W):

    \w\W|\W\w  // match the start or end of a word

But now what? We could try inserting that pattern into our regular expressions wherever we would have used \b, but it’s not really the same. We’re actually matching those characters, not just requiring them. This regular expression matches the two characters composing the word boundary in addition to whatever else matches afterward, whereas the \b operator simply requires the word boundary but doesn’t match any text. The distinction is that \b isn’t a matching pattern but a kind of lookahead. A lookahead is a pattern that is required to match next in the string, but is not consumed by the regex engine. When a lookahead pattern succeeds, the pattern moves on, and the characters are left in the stream for the next part of the pattern to use. If the lookahead fails, the match fails (or it backtracks and tries a different approach).

We can make our own lookaheads with the lookahead operator (?=). For example, to match the letter X at the end of a word, we could use:

    (?=\w\W)X  // Find X at the end of a word

Here the regex engine requires the \w\W pattern to match but not consume the characters, leaving them for the next part of the pattern. This effectively allows us to write overlapping patterns (like the previous example). For instance, we can match the word “Pat” only when it’s part of the word “Patrick,” like so:

    (?=Patrick)Pat  // Find Pat only in Patrick

Another operator, (?!), the negative lookahead, requires that the pattern not match. We can find all the occurrences of Pat not inside of a Patrick with this:

    (?!Patrick)Pat  // Find Pat never in Patrick

It’s worth noting that we could have written all of these examples in other ways, by simply matching a larger amount of text. For instance, in the first example we could have matched the whole word “Patrick.” But that is not as precise, and if we wanted to use capture groups to pull out the matched text or parts of it later, we’d have to play games to get what we want. For example, suppose we wanted to substitute something for Pat (say, change the font). We’d have to use an extra capture group and replace the text with itself. Using lookaheads is easier.

In addition to looking ahead in the stream, we can use the (?<=) and (?<!) lookbehind operators to look backward in the stream. For example, we can find my last name, but only when it refers to me:

    (?<=Pat )Niemeyer  // Niemeyer, only when preceded by Pat

Or we can find the string “bean” when it is not part of the phrase “Java bean”:

    (?<!Java )bean   // The word bean, not preceded by "Java "

In these cases, the lookbehind and the matched text didn’t overlap because the lookbehind was before the matched text. But you can place a lookahead or lookbehind at either point, before or after the match. For example, we could also find Niemeyer only when it is preceded by Pat like this:

    Niemeyer(?<=Pat Niemeyer)
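To confirm that lookaheads and lookbehinds constrain a match without consuming text, we can list where each pattern matches. In this sketch, LookaroundDemo, starts(), and the sample sentences are our own assumptions. The bean example uses a fixed-width lookbehind, (?<!Java ), because some Java releases reject unbounded quantifiers such as * inside a lookbehind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Collects the start index of every match of regex in text.
    static List<Integer> starts(String regex, String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String names = "Pat and Patrick";
        System.out.println(starts("(?=Patrick)Pat", names)); // Pat inside Patrick
        System.out.println(starts("(?!Patrick)Pat", names)); // Pat on its own

        String people = "Dan Niemeyer and Pat Niemeyer";
        System.out.println(starts("(?<=Pat )Niemeyer", people));
        System.out.println(starts("Niemeyer(?<=Pat Niemeyer)", people));

        System.out.println(starts("(?<!Java )bean", "coffee bean, Java bean"));
    }
}
```

Both Niemeyer patterns report the same position, since each one matches only the surname and merely requires "Pat " in front of it.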

The java.util.regex API

Now that we’ve covered the theory of how to construct regular expressions, the hard part is over. All that’s left is to investigate the Java API for applying regexes: searching for them in strings, retrieving captured text, and replacing matches with substitution text.

Pattern

As we’ve said, the regex patterns that we write as strings are, in actuality, little programs describing how to match text. At runtime, the Java regex package compiles these little programs into a form that it can execute against some target text. Several simple convenience methods accept strings directly to use as patterns. More generally, however, Java allows you to explicitly compile your pattern and encapsulate it in an instance of a Pattern object. This is the most efficient way to handle patterns that are used more than once, because it eliminates needlessly recompiling the string. To compile a pattern, we use the static method Pattern.compile():

    Pattern urlPattern = Pattern.compile("\\w+://[\\w/]*");

Once you have a Pattern, you can ask it to create a Matcher object, which associates the pattern with a target string:

    Matcher matcher = urlPattern.matcher( myText );

The matcher executes the matches. We’ll talk about that next. But before we do, we’ll just mention one convenience method of Pattern. The static method Pattern.matches() simply takes two strings—a regex and a target string—and determines if the target matches the regex. This is very convenient if you want to do a quick test once in your application. For example:

    boolean match = Pattern.matches( "\\d+\\.\\d+f?", myText );

This line of code can test if the string myText is a Java-style floating-point number such as “42.0f.” Note that the string must match completely in order to be considered a match.

The Matcher

A Matcher associates a pattern with a string and provides tools for testing, finding, and iterating over matches of the pattern against it. The Matcher is “stateful.” For example, the find() method tries to find the next match each time it is called. But you can clear the Matcher and start over by calling its reset() method.

If you’re just interested in “one big match”—that is, you’re expecting your string to either match the pattern or not—you can use matches() or lookingAt(). These correspond roughly to the methods equals() and startsWith() of the String class. The matches() method asks if the string matches the pattern in its entirety (with no string characters left over) and returns true or false. The lookingAt() method does the same, except that it asks only whether the string starts with the pattern and doesn’t care if the pattern uses up all the string’s characters.
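The following sketch contrasts matches() and lookingAt() on the same inputs, with find() (covered in the paragraphs that follow) included for comparison. MatcherDemo, its method names, and the \d+ pattern are assumptions chosen for illustration; the pattern is compiled once and reused.

```java
import java.util.regex.Pattern;

class MatcherDemo {
    // regex: \d+   (compiled once, reused for every call)
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Like String.equals(): the whole string must be a number.
    static boolean wholeMatch(String s) {
        return NUMBER.matcher(s).matches();
    }

    // Like String.startsWith(): the string must begin with a number.
    static boolean prefixMatch(String s) {
        return NUMBER.matcher(s).lookingAt();
    }

    // A number may appear anywhere in the string.
    static boolean anyMatch(String s) {
        return NUMBER.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "123", "123abc", "abc123" }) {
            System.out.println(s + ": matches=" + wholeMatch(s)
                + " lookingAt=" + prefixMatch(s)
                + " find=" + anyMatch(s));
        }
    }
}
```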

More generally, you’ll want to be able to search through the string and find one or more matches. To do this, you can use the find() method. Each call to find() returns true or false for the next match of the pattern and internally notes the position of the matching text. You can get the starting and ending character positions with the Matcher start() and end() methods, or you can simply retrieve the matched text with the group() method. For example:

    import java.util.regex.*;

    String text="A horse is a horse, of course of course...";
    String pattern="horse|course";

    Matcher matcher = Pattern.compile( pattern ).matcher( text );
    while ( matcher.find() )
      System.out.println(
        "Matched: '"+matcher.group()+"' at position "+matcher.start() );

The previous snippet prints the starting location of the words “horse” and “course” (four in all):

    Matched: 'horse' at position 2
    Matched: 'horse' at position 13
    Matched: 'course' at position 23
    Matched: 'course' at position 33

The method to retrieve the matched text is called group() because it refers to capture group zero (the entire match). You can also retrieve the text of other numbered capture groups by giving the group() method an integer argument. You can determine how many capture groups you have with the groupCount() method:

    // groupCount() excludes group 0, so the last index is groupCount()
    for (int i=1; i <= matcher.groupCount(); i++)
      System.out.println( matcher.group(i) );

Splitting and tokenizing strings

A very common need is to parse a string into a bunch of fields based on some delimiter, such as a comma. It’s such a common problem that in Java 1.4, a method was added to the String class for doing just this. The split() method accepts a regular expression and returns an array of substrings broken around that pattern. For example:

    String text = "Foo, bar ,   blah";
    String [] fields = text.split( "\\s*,\\s*" );

yields a String array containing Foo, bar, and blah. You can control the maximum number of matches and also whether you get “empty” strings (for text that might have appeared between two adjacent delimiters) using an optional limit argument.

If you are going to use an operation like this more than a few times in your code, you should probably compile the pattern and use its split() method, which is identical to the version in String. The String split() method is equivalent to:

    Pattern.compile(pattern).split(string);
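As a sketch of how a precompiled pattern and the limit argument behave, consider the following. SplitDemo and its fields() methods are hypothetical names, and the "a,b,,c,," input is our own example chosen to show empty strings.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // regex: \s*,\s*   (a comma with optional surrounding whitespace)
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Default behavior: trailing empty strings are dropped.
    static String[] fields(String line) {
        return COMMA.split(line);
    }

    // A negative limit keeps trailing empty strings; a positive limit
    // caps the number of array elements.
    static String[] fields(String line, int limit) {
        return COMMA.split(line, limit);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("Foo, bar ,   blah")));
        System.out.println(fields("a,b,,c,,").length);
        System.out.println(fields("a,b,,c,,", -1).length);
        System.out.println(Arrays.toString(fields("a,b,,c,,", 2)));
    }
}
```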

Another look at Scanner

As we mentioned when we introduced it, the Scanner class in Java 5.0 can use regular expressions to tokenize strings. You can specify a regular expression to use as the delimiter (instead of the default whitespace) either at construction time or with the useDelimiter() method. The Scanner next(), hasNext(), skip(), and findInLine() methods all take regular expressions as well. You can specify these either as strings or with a compiled Pattern object.

You can use the findInLine() method of Scanner as an improved Matcher. For example:

    Scanner scanner = new Scanner( "Quantity: 42 items, Price $2.34" );
    scanner.findInLine("[Qq]uantity[:\\s]*");
    int quantity=scanner.nextInt();
    scanner.findInLine("[Pp]rice.*\\$");
    float price=scanner.nextFloat();

The previous snippet locates the quantity and price values, allowing for variations in capitalization and spacing before the numbers.

Before we move on, we’ll also mention a “Stupid Scanner Trick” that, although we don’t recommend it, you might find amusing. Using the \A boundary marker, which denotes the beginning of input, as a delimiter, we can tell the Scanner to return the whole input as a single string. This is an easy way to read the contents of any stream into one large string:

    InputStream source  = new URL("http://www.oreilly.com/").openStream();
    String text = new Scanner( source ).useDelimiter("\\A").next();

This is probably not the most efficient or understandable way to do it, but it may save you a little typing in your experimentation.

Replacing text

A common reason that you’ll find yourself searching for a pattern in a string is to change it to something else. The regex package not only makes it easy to do this but also provides a simple notation to help you construct replacement text using bits of the matched text.

The most convenient form of this API is Matcher’s replaceAll() method, which substitutes a replacement string for each occurrence of the pattern and returns the result. For example:

    String text = "Richard Nixon's social security number is: 567-68-0515.";
    Matcher matcher =
      Pattern.compile("\\d\\d\\d-\\d\\d-\\d\\d\\d\\d").matcher( text );
    String output = matcher.replaceAll("XXX-XX-XXXX");

This code replaces all occurrences of U.S. government Social Security numbers with “XXX-XX-XXXX” (perhaps for privacy considerations).

Using captured text in a replacement

Literal substitution is nice, but we can make this more powerful by using capture groups in our substitution pattern. To do this, we use the simple convention of referring to numbered capture groups with the notation $n, where n is the group number. For example, suppose we wanted to show just a little of the Social Security number in the previous example, so that the user would know if we were talking about him. We could modify our regex to catch, for example, the last four digits like so:

    \d\d\d-\d\d-(\d\d\d\d)

We can then use that in the substitution text:

    String output = matcher.replaceAll("XXX-XX-$1");
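
Putting the pieces together, a complete version might look like the following sketch. The class name MaskDemo and the method mask() are hypothetical; the pattern and the replacement text are the ones described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaskDemo {
    // The last four digits are wrapped in parentheses, making them group 1
    static final Pattern SSN =
        Pattern.compile("\\d\\d\\d-\\d\\d-(\\d\\d\\d\\d)");

    // Hypothetical helper: hides all but the last four digits
    static String mask(String text) {
        Matcher matcher = SSN.matcher(text);
        // $1 is replaced by whatever group 1 captured in each match
        return matcher.replaceAll("XXX-XX-$1");
    }

    public static void main(String[] args) {
        String text =
            "Richard Nixon's social security number is: 567-68-0515.";
        // prints Richard Nixon's social security number is: XXX-XX-0515.
        System.out.println(mask(text));
    }
}
```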

The static method Matcher.quoteReplacement() can be used to escape a literal string (so that it ignores the $ notation) before using it as replacement text.
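
To see why quoting matters, consider a replacement value that happens to contain a dollar sign. In the sketch below, the class QuoteDemo, the method setPrice(), and the PRICE placeholder are hypothetical. Without the call to quoteReplacement(), the $1 in "$1.50" would be read as a reference to capture group 1, which this pattern does not have, and replaceAll() would throw an exception.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteDemo {
    static final Pattern PLACEHOLDER = Pattern.compile("PRICE");

    // Hypothetical helper: inserts a price string literally,
    // even if it contains $ or \ characters
    static String setPrice(String text, String price) {
        return PLACEHOLDER.matcher(text)
            .replaceAll(Matcher.quoteReplacement(price));
    }

    public static void main(String[] args) {
        // prints Total: $1.50
        System.out.println(setPrice("Total: PRICE", "$1.50"));
    }
}
```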

Controlling the substitution

The replaceAll() method is useful, but you may want more control over each substitution. You may want to change each match to something different or base the change on the match in some programmatic way.

To do this, you can use the Matcher appendReplacement() and appendTail() methods. These methods can be used in conjunction with the find() method as you iterate through matches to build a replacement string. appendReplacement() and appendTail() operate on a StringBuffer that you supply. The appendReplacement() method builds a replacement string by keeping track of where you are in the text and appending all nonmatched text to the buffer for you as well as the substitute text that you supply. Each call to appendReplacement() appends the intervening text since the previous match, followed by your replacement, then skips over all the matched characters to prepare for the next one. Finally, when you have reached the last match, you should call appendTail(), which appends any remaining text after the last match. We’ll show an example of this next, as we build a simple “template engine.”
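
Before the larger example, here is a minimal sketch of the find(), appendReplacement(), and appendTail() loop on its own. It doubles every integer it finds in a string, something replaceAll() with a fixed replacement string could not do. The class DoubleDemo and the method doubleNumbers() are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoubleDemo {
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Hypothetical helper: replaces each integer with twice its value
    static String doubleNumbers(String text) {
        Matcher matcher = NUMBER.matcher(text);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            int n = Integer.parseInt(matcher.group());
            // Appends the text since the previous match, then the
            // computed replacement for this match
            matcher.appendReplacement(buffer, String.valueOf(n * 2));
        }
        // Appends whatever follows the last match
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        // prints 6 apples and 24 pears
        System.out.println(doubleNumbers("3 apples and 12 pears"));
    }
}
```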

Our simple template engine

Let’s tie what we’ve discussed together in a nifty example. A common problem in Java applications is working with bulky, multiline text. In general, you don’t want to store the text of messages in your application code because it makes them difficult to edit or internationalize. But when you move them to external files or resources, you need a way for your application to plug in information at runtime. The best example of this is in Java servlets; a generated HTML page is often 99% static text with only a few “variable” pieces plugged in. Technologies such as JSP and XSL were developed to address this. But these are big tools, and we have a simple problem. So let’s create a simple solution—a template engine.

Our template engine reads text containing special template tags and substitutes values that we provide. And because generating HTML or XML is one of the most important applications of this, we’ll be friendly to those formats by making our tags conform to the style of an XML comment. Specifically, our engine searches the text for tags that look like this:

    <!--TEMPLATE:name  This is the template for the user name -->

XML-style comments start with <!-- and can contain anything up to a closing -->. We’ll add the convention of requiring a TEMPLATE:name field to specify the name of the value we want to use. Aside from that, we’ll still allow any descriptive text the user wants to include. To be friendly (and consistent), we’ll allow any amount of whitespace to appear in the tags, including multiline text in the comments. We’ll also ignore the text case of the “TEMPLATE” identifier, just in case. Now, we could do this all with low-level String commands, looping over whitespace and taking many substrings. But using the power of regexes, we can do it much more cleanly and with only about seven lines of relevant code. (We’ve rounded out the example with a few more to make it more useful.)

    import java.util.*;
    import java.util.regex.*;


    public class Template
    {
        Properties values = new Properties();
        Pattern templateComment =
            Pattern.compile("(?si)<!--\\s*TEMPLATE:(\\w+).*?-->");

        public void set( String name, String value ) {
            values.setProperty( name, value );
        }

        public String fillIn( String text ) {
            Matcher matcher = templateComment.matcher( text );

            StringBuffer buffer = new StringBuffer();
            while( matcher.find() ) {
                String name = matcher.group(1);
                String value = values.getProperty( name );
                // Quote the value so any $ or \ in it is inserted literally
                matcher.appendReplacement( buffer,
                    Matcher.quoteReplacement( value ) );
            }
            matcher.appendTail( buffer );
            return buffer.toString();
        }
    }

You’d use the Template class like this:

    String input = "<!-- TEMPLATE:name --> lives at "
       +"<!-- TEMPLATE:address -->";
    Template template = new Template();
    template.set("name", "Bob");
    template.set("address", "1234 Main St.");
    String output = template.fillIn( input );

In this code, input is a string containing tags for name and address. The set() method provides the values for those tags.

Let’s start by picking apart the regex, templateComment, in the example:

    (?si)<!--\s*TEMPLATE:(\w+).*?-->

It looks scary, but it’s actually very simple. Just start reading from left to right. First, we have the special flags declaration (?si), telling the regex engine that it should be in single-line mode, in which the dot matches all characters including newlines (s), and that it should ignore case (i). Next, there is the literal <!-- followed by any amount of whitespace (\s*) and the TEMPLATE: identifier. After the colon, we have a capture group (\w+), which reads our name identifier and saves it for us to retrieve later. We allow anything (.*) up to the -->, being careful to specify that .* should be nongreedy (.*?). We don’t want .* to consume other opening and closing comment tags all the way to the last one, but instead to find the smallest match (one tag).
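
The difference between the greedy and nongreedy forms is easy to see by running both against text that contains two tags. In this sketch, the class GreedyDemo and the method firstMatch() are hypothetical; the two patterns differ only in the question mark after the star.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Hypothetical helper: returns the first match of regex in text,
    // or null if there is none
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String text = "<!-- TEMPLATE:a --> x <!-- TEMPLATE:b -->";

        // Greedy: .* runs to the last --> and swallows both tags
        // prints <!-- TEMPLATE:a --> x <!-- TEMPLATE:b -->
        System.out.println(
            firstMatch("(?si)<!--\\s*TEMPLATE:(\\w+).*-->", text));

        // Nongreedy: .*? stops at the first --> it can
        // prints <!-- TEMPLATE:a -->
        System.out.println(
            firstMatch("(?si)<!--\\s*TEMPLATE:(\\w+).*?-->", text));
    }
}
```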

Our fillIn() method does the work, accepting a template string, searching it, and “replacing” the tag values with the values from set(), which we have stored in a Properties table. Each time fillIn() is called, it creates a Matcher to wrap the input string and get ready to apply the pattern. It then creates a temporary StringBuffer to hold the output and loops, using the Matcher find() method to get each tag. For each match, it retrieves the value of the capture group (group one) that holds the tag name. It looks up the corresponding value and replaces the tag with this value in the output string buffer using the appendReplacement() method. (Remember that appendReplacement() fills in the intervening text on each call, so we don’t have to.) All that remains is to call appendTail() at the end to get the remaining text after the last match and return the string value. That’s it!

We hope this section has shown you some of the power provided by these tools and whetted your appetite for more. Regexes allow you to work in ways you may not have considered before. Especially now, when the software world is focused on textual representations of almost everything—from data to user interfaces—via XML and HTML, having powerful text-manipulation tools is fundamental. Just remember to keep those regexes simple so you can reuse them again and again.
