3.3. Wildcards and Character Classes

[...]

Square brackets define a character class, which may be given as individual characters, a range of characters, or a combination of these. [a-z] stands for all lower-case letters; [A-Z] denotes all upper-case letters; [0-9] stands for all digits. Character classes can be combined: [A-Za-z] matches all letters; [a-z_] represents all lower-case letters and the underscore character.

[^...]

Negated character class (clarified below).

.

Any character (except \n and \r).

\d

Digit character, i.e. [0-9].

\D

Non-digit character, i.e. [^0-9].

\w

Word character: upper- and lower-case letters, digits, and underscore, i.e. [A-Za-z0-9_].

\W

Anything that is not a word character.

\s

White space characters: all spaces (normal, em, en, thin, etc.), tabs, and return characters.

\S

Anything that is not a whitespace character.
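The classes and wildcards listed above can be tried out in a few lines. The following is only a sketch in plain JavaScript: the sample string and the variable names are invented for illustration, and it assumes a standard JavaScript engine rather than any particular InDesign setup.

```javascript
// Illustrative sample string; the content is arbitrary.
const sample = "Page_12 costs 30 EUR";

// [0-9] and \d both match a single digit; the g flag collects every match.
const digitsByClass = sample.match(/[0-9]/g);
const digitsByShorthand = sample.match(/\d/g);

// [a-z_] combines a range with an individual character (the underscore).
const lowerOrUnderscore = sample.match(/[a-z_]/g);

// \w+ matches runs of word characters; \s+ splits on runs of white space.
const words = sample.match(/\w+/g);
const parts = sample.split(/\s+/);

// [^0-9] is a negated class: it matches any character that is NOT a digit,
// so replacing its matches with nothing leaves only the digits.
const digitsOnly = sample.replace(/[^0-9]/g, "");

console.log(digitsByClass, digitsByShorthand);
console.log(lowerOrUnderscore, words, parts, digitsOnly);
```

In this sample, [0-9] and \d find the same four digits, which is what the definition of \d as [0-9] leads one to expect.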

Character classes are extremely useful. They can be given as individual characters or in the form of a range. For example, the regex /d[aeiou]/ matches every letter d that is followed by a vowel. The square brackets thus define just one character position, which can be filled by any one of the vowels. Ranges of characters can be given as ranges of literals, as in [a-z], or in Unicode notation. For instance, /[\u0250-\u02FF]/ matches any of the phonetic characters in the range U+0250 to U+02FF in a text.
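As a rough illustration of the two patterns in this paragraph, here is a short sketch in plain JavaScript. The sample strings and variable names are made up for the example. Note that the Unicode range is written inside square brackets so that it is read as a character class; without the brackets the hyphen would be taken as a literal character.

```javascript
// Illustrative sample text; the words are arbitrary.
const text = "The dog did a dry deed";

// d[aeiou] matches a "d" only when the very next character is a vowel,
// so "dr" in "dry" and a "d" at the end of a word are skipped.
const dPlusVowel = text.match(/d[aeiou]/g);

// A string containing two phonetic characters, written as escapes:
// \u026A (small capital I) and \u0283 (esh).
const phonetic = "fish: f\u026A\u0283";

// The range is given in Unicode notation inside a character class.
const phoneticChars = phonetic.match(/[\u0250-\u02FF]/g);

console.log(dPlusVowel);
console.log(phoneticChars.length);
```

Each match of /d[aeiou]/ is exactly two characters long, because the class stands for a single character position.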

In some script languages, regexes are implemented in such a way that the character classes [a-z] and [A-Z] and the \w wildcard include accented ...
