8.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
Problem
Given a plain text string, such as a multiline value submitted via a form, you want
to convert it to an HTML fragment to display within a web page.
Paragraphs, separated by two line breaks in a row, should be
surrounded with <p>⋯</p>. Additional line breaks should be replaced
with <br> tags.
Solution
This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.
Step 1: Replace HTML special characters with character entity references
As we’re converting plain text to HTML, the first step is to
convert the three special HTML characters &, <, and > to character
entity references (see Table 8-3). Otherwise, the resulting markup
could lead to unintended results when displayed in a web browser.
Table 8-3. HTML special character substitutions
Search for | Replace with |
---|---|
& | &amp; |
< | &lt; |
> | &gt; |
Ampersands (&) must be replaced first, since you’ll be adding
additional ampersands to the subject string as part of the character
entity references.
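The ordering rule is easy to check with plain string replacement. The following is a minimal Python sketch, not code from the recipe itself; the function names `escape_html` and `escape_html_wrong_order` are hypothetical and exist only for this illustration.

```python
def escape_html(text: str) -> str:
    """Replace the three HTML special characters with entity references."""
    # Ampersands go first; otherwise the "&" that starts "&lt;" and "&gt;"
    # would itself be escaped by the ampersand replacement.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_html_wrong_order(text: str) -> str:
    """Deliberately wrong: escapes ampersands last, for comparison only."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text.replace("&", "&amp;")


print(escape_html("a < b & c > d"))       # a &lt; b &amp; c &gt; d
print(escape_html_wrong_order("a < b"))   # a &amp;lt; b
```

The second call shows the double-escaping problem: the ampersand introduced by `&lt;` is escaped again, so a browser would display the literal text `&lt;` instead of `<`.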
Step 2: Replace all line breaks with <br>
Search for:
\r\n?|\n
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
\R
Regex options: None
Regex flavors: PCRE 7, Perl 5.10
Replace with:
<br>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby
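As a rough illustration of this step, the sketch below applies the first regex with Python's `re` module. It uses `\r\n?|\n` rather than `\R` because the flavor list above names only PCRE 7 and Perl 5.10 for `\R`. The names `LINE_BREAK` and `breaks_to_br` are hypothetical.

```python
import re

# Matches a Windows (\r\n), old Mac (\r), or Unix (\n) line break.
# "\r\n?" is tried first, so a \r\n pair is consumed as one break.
LINE_BREAK = re.compile(r"\r\n?|\n")


def breaks_to_br(text: str) -> str:
    """Replace every line break in text with a <br> tag."""
    return LINE_BREAK.sub("<br>", text)


print(breaks_to_br("one\r\ntwo\rthree\nfour"))  # one<br>two<br>three<br>four
print(breaks_to_br("first\n\nsecond"))          # first<br><br>second
```

Because a `\r\n` pair produces a single `<br>`, a blank line between paragraphs always yields exactly two `<br>` tags in a row, regardless of the line-break convention.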
Step 3: Replace double <br> tags with </p><p>
Search for:
<br>\s*<br>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, ...
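A minimal Python sketch of this step follows. It takes the replacement text `</p><p>` from the step heading, and the names `DOUBLE_BR` and `split_paragraphs` are hypothetical. It assumes the input has already had its line breaks converted to `<br>` tags.

```python
import re

# Two <br> tags, optionally separated by whitespace, mark a paragraph break.
DOUBLE_BR = re.compile(r"<br>\s*<br>")


def split_paragraphs(text: str) -> str:
    """Replace each pair of <br> tags with a paragraph boundary."""
    return DOUBLE_BR.sub("</p><p>", text)


print(split_paragraphs("one<br><br>two<br>three"))  # one</p><p>two<br>three
print(split_paragraphs("one<br> <br>two"))          # one</p><p>two
```

The single `<br>` between "two" and "three" is left alone, which matches the problem statement: only two breaks in a row separate paragraphs, and additional line breaks stay as `<br>` tags.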