5.13. Replace Repeated Whitespace with a Single Space

Problem

As part of a cleanup routine for user input or other data, you want to replace repeated whitespace characters with a single space. Any tabs, line breaks, or other whitespace should also be replaced with a space.

Solution

To use either of the following regular expressions, replace all matches with a single space character. Recipe 3.14 shows the code to do this.

Clean any whitespace characters

\s+
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Clean horizontal whitespace characters

[ \t]+
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
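
As a rough illustration of how either regex might be applied, here is a minimal Python sketch that uses re.sub to replace every match with a single space. The function names and the sample string are hypothetical and are not taken from Recipe 3.14. The horizontal variant assumes the character class is meant to cover both spaces and tabs.

```python
import re

def clean_all_whitespace(text):
    # Replace each run of whitespace (spaces, tabs, line breaks) with one space.
    return re.sub(r"\s+", " ", text)

def clean_horizontal_whitespace(text):
    # Replace each run of spaces and tabs with one space; line breaks are kept.
    return re.sub(r"[ \t]+", " ", text)

# Hypothetical sample input mixing spaces, a tab, and line breaks.
sample = "one \t two\n\nthree"
print(repr(clean_all_whitespace(sample)))         # 'one two three'
print(repr(clean_horizontal_whitespace(sample)))  # 'one two\n\nthree'
```

Note that neither function trims the result, so leading or trailing whitespace in the input still leaves a single space at that end of the string.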

Discussion

A common text cleanup routine is to replace repeated whitespace characters with a single space. In HTML, for example, repeated whitespace is simply ignored when rendering a page (with a few exceptions), so removing repeated whitespace can help to reduce the file size of pages without any negative effect.

Clean any whitespace characters

In this solution, any sequence of whitespace characters (line breaks, tabs, spaces, etc.) is replaced with a single space. Since the + quantifier repeats the whitespace class (\s) one or more times, even a single tab character, for example, will be replaced with a space. If you replaced the + with {2,}, only sequences of two or more whitespace characters would be replaced. This could result in fewer replacements and thus improved performance, but it could also leave behind ...
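
The difference between the + and {2,} quantifiers can be seen in a small Python sketch. The sample string is an assumption chosen so that it contains both a lone whitespace character and longer runs.

```python
import re

# Hypothetical input: a lone tab, a double space, and a line break plus space.
sample = "a\tb  c\n d"

# + matches even a single whitespace character, so the lone tab is replaced.
with_plus = re.sub(r"\s+", " ", sample)

# {2,} matches only runs of two or more, so the lone tab is left untouched.
with_min_two = re.sub(r"\s{2,}", " ", sample)

print(repr(with_plus))     # 'a b c d'
print(repr(with_min_two))  # 'a\tb c d'
```

With {2,}, isolated tabs or line breaks survive the cleanup, which may or may not be acceptable depending on what the cleaned text is used for.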
