Regular Expression Syntax

Problem

You need to learn the syntax of regular expressions.

Solution

Consult Table 4-2 for a list of the regular expression characters that the Apache Regular Expression API matches.

Table 4-2. Regular expression syntax

  Subexpression   Will match                              Notes

General
  a               The letter a (and similarly for any other Unicode character not listed in this table)
  ^               Start of line/string
  $               End of line/string
  .               Any one character
  [...]           “Character class”; any one character from those listed
  [^...]          Any one character not from those listed

Normal (greedy) multipliers (“greedy closures”)
  {m,n}           Multiplier (closure) for m to n repetitions
  {m,}            Multiplier for m or more repetitions
  {,n}            Multiplier for 0 up to n repetitions
  *               Multiplier for 0 or more repetitions    Short for {0,}
  +               Multiplier for 1 or more repetitions    Short for {1,}
  ?               Multiplier for 0 or 1 repetitions       Short for {0,1}

Reluctant (non-greedy) multipliers (“reluctant closures”)
  *?              Reluctant multiplier: 0 or more
  +?              Reluctant multiplier: 1 or more
  ??              Reluctant multiplier: 0 or 1 times

Alternation and grouping
  ( )             Grouping
  |               Alternation

Escapes and shorthands
  \               Escape character: turns metacharacters off, and turns following alphabetics (t, w, d, and s) into metacharacters
  \t              Tab character
  \w              Character in a word                     Use \w+ for a word
  \d              Numeric digit                           Use \d+ for a number
  \s              Whitespace                              Space, tab, etc., as determined by java.lang.Character.isWhitespace()
  \W, \D, \S      Inverse of above (\W is a non-word character, etc.)

POSIX-style character classes
  [:alnum:]       Alphanumeric characters
  [:alpha:]       Alphabetic characters
  [:blank:]       Space and tab characters
  [:space:]       Space characters
  [:cntrl:]       Control characters
  [:digit:]       Numeric digit characters
  [:graph:]       Printable and visible characters (not spaces)
  [:print:]       Printable characters
  [:punct:]       Punctuation characters
  [:lower:]       Lowercase characters
  [:upper:]       Uppercase characters
  [:xdigit:]      Hexadecimal digit characters
  [:javastart:]   Start of a Java language identifier     Not in POSIX
  [:javapart:]    Part of a Java identifier               Not in POSIX

These pattern characters can be used in any combination that makes sense. For example, a+ means any number of occurrences of the letter a, from one up to a million or a gazillion. The pattern Mrs?\. matches either Mr. or Mrs. (including the period). And .* means “any character, any number of times,” much as * does in most command-line interpreters.
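The three example patterns can be tried out with a few lines of code. This is a minimal sketch that assumes the JDK's java.util.regex package instead of the Apache package, simply because it ships with Java; the constructs used here (+, ?, \. and .*) are written the same way in both. The class name ComboDemo and the helper found() are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex (an assumption), not the Apache API.
class ComboDemo {

    // True if the pattern matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("a+", "banana"));           // true
        System.out.println(found("a+", "xyz"));              // false: no letter a at all
        System.out.println(found("Mrs?\\.", "Mr. Smith"));   // true
        System.out.println(found("Mrs?\\.", "Mrs. Smith"));  // true
        System.out.println(found("Mrs?\\.", "Mrs Smith"));   // false: the literal period is missing
        System.out.println(found(".*", ""));                 // true: zero repetitions is enough
    }
}
```

Note that in Java source code the backslash must itself be escaped inside a string literal, so the pattern Mrs?\. is typed as "Mrs?\\.".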

It’s important to remember that REs will match anyplace possible in the input, and that patterns ending in a greedy closure will consume as much as possible without compromising any other subexpressions.
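The difference between greedy and reluctant closures is easiest to see on a small input. The following sketch again assumes java.util.regex as a stand-in for the Apache package; GreedyDemo and firstMatch() are hypothetical names. It shows a match starting in the middle of the input, a greedy .* consuming as much as it can, and the same closure giving characters back so that a later subexpression can still match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex (an assumption), not the Apache API.
class GreedyDemo {

    // Returns the text of the first match, or null if the pattern matches nowhere.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "say <b>one</b> and <b>two</b>";

        // Greedy: runs from the first '<' to the last '>' in the input.
        System.out.println(firstMatch("<.*>", input));    // <b>one</b> and <b>two</b>

        // Reluctant: stops at the first '>' it can.
        System.out.println(firstMatch("<.*?>", input));   // <b>

        // Greedy, but .* backs off so the trailing ">two" can still match.
        System.out.println(firstMatch("<.*>two", input)); // <b>one</b> and <b>two
    }
}
```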

Also, unlike some RE packages, the Apache package was designed to handle Unicode characters from the beginning. Actually, it came for free, as its basic units are the Java char and String types, which are Unicode-based. In fact, the standard Java escape sequence \unnnn is used to specify a Unicode character in the pattern. And we use methods of java.lang.Character to determine Unicode character properties, such as whether or not a given character is a space.
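As a rough illustration of both points, the sketch below puts a Unicode escape into a pattern and calls a few java.lang.Character methods directly. It assumes java.util.regex in place of the Apache package, and UnicodeDemo and found() are hypothetical names. The identifier checks at the end are included on the assumption that they are the kind of property test behind classes such as [:javastart:] and [:javapart:]; the source does not say so explicitly.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex (an assumption), not the Apache API.
class UnicodeDemo {

    // True if the pattern matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // The escape below is e-acute; the Java compiler expands it
        // before the regex engine ever sees the pattern string.
        System.out.println(found("caf\u00e9", "un caf\u00e9 noir"));  // true
        System.out.println(found("caf\u00e9", "un cafe noir"));       // false

        // Character decides what counts as whitespace.
        System.out.println(Character.isWhitespace('\t'));       // true
        System.out.println(Character.isWhitespace('\u2003'));   // true  (EM SPACE)
        System.out.println(Character.isWhitespace('\u00a0'));   // false (NO-BREAK SPACE)

        // Java identifier properties.
        System.out.println(Character.isJavaIdentifierStart('$'));  // true
        System.out.println(Character.isJavaIdentifierStart('1'));  // false
        System.out.println(Character.isJavaIdentifierPart('1'));   // true
    }
}
```

The no-break space result is worth noticing: isWhitespace() deliberately excludes non-breaking spaces, so a whitespace test built on it will not treat them as spaces.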
