8.5. Convert Plain Text to HTML by Adding <p> and <br> Tags

Problem

Given a plain text string, such as a multiline value submitted via a form, you want to convert it to an HTML fragment to display within a web page. Paragraphs, separated by two line breaks in a row, should be surrounded with <p></p>. Additional line breaks should be replaced with <br> tags.

Solution

This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.

Step 1: Replace HTML special characters with character entity references

As we’re converting plain text to HTML, the first step is to convert the three special HTML characters &, <, and > to character entity references (see Table 8-3). Otherwise, the resulting markup could lead to unintended results when displayed in a web browser.

Table 8-3. HTML special character substitutions

Search for

Replace with

&

«&amp;»

<

«&lt;»

>

«&gt;»

Ampersands (&) must be replaced first, since you’ll be adding additional ampersands to the subject string as part of the character entity references.

Step 2: Replace all line breaks with <br>

Search for:

\r\n?|\n
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
\R
Regex options: None
Regex flavors: PCRE 7, Perl 5.10

Replace with:

<br>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby

Step 3: Replace double <br> tags with </p><p>

Search for:

<br>\s*<br>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, ...

Get Regular Expressions Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.