Strings

Problem

You need a regex that matches a string, which is a sequence of zero or more characters enclosed by double quotes. A string with nothing between the quotes is an empty string. Two consecutive double quotes inside a string denote a single literal double quote character. Strings cannot include line breaks. Backslashes and other characters have no special meaning in strings.

Your regular expression should match any string, including empty strings, and it should return a single match for strings that contain double quotes. For example, it should return "before quote""after quote" as a single match, rather than matching "before quote" and "after quote" separately.

Solution

"[^"\r\n]*(?:""[^"\r\n]*)*"
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
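
A quick way to try the solution is a short Python sketch (Python is one of the flavors listed above). The sample text and the names STRING_RE and sample are made up for illustration; they are not part of the recipe.

```python
import re

# The solution regex, written as a raw string so that the backslashes
# reach the regex engine unchanged.
STRING_RE = re.compile(r'"[^"\r\n]*(?:""[^"\r\n]*)*"')

# Hypothetical input containing a doubled quote, an empty string,
# and an ordinary string.
sample = 'x = "before quote""after quote"; y = ""; z = "plain"'

matches = STRING_RE.findall(sample)
for m in matches:
    print(m)
```

Because the group is noncapturing, findall() returns the whole matches, so the string with the doubled quote comes back as one item rather than two.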

Discussion

Matching a string that cannot contain quotes or line breaks would be easy with "[^\r\n"]*". Double quotes are literal characters in regular expressions, and we can easily match a sequence of characters that are not quotes or line breaks with a negated character class.
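
To see why this simpler regex falls short for the problem as stated, the following Python sketch applies it to the example string from the problem description. The variable names are illustrative assumptions.

```python
import re

# Simple version from the discussion: no provision for doubled quotes.
SIMPLE_RE = re.compile(r'"[^\r\n"]*"')

text = '"before quote""after quote"'

# The doubled quote in the middle is treated as one string ending
# and another one starting, so the text is split into two matches.
parts = SIMPLE_RE.findall(text)
print(parts)
```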

But our strings can contain quotes if they are specified as two consecutive quotes. Matching these is not much more difficult if we handle the quotes separately. After the opening quote, we use [^\r\n"]* to match anything but quotes and line breaks. This may be followed by zero or more pairs of double quotes. We could match those with (?:"")*, but after each pair of double quotes, the string can have more characters that ...
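
As a rough illustration of this step, the Python sketch below compares a pattern that only allows (?:"")* before the closing quote with the full solution regex. The names PAIRS_ONLY and FULL are hypothetical labels chosen for this example.

```python
import re

# Quote pairs are allowed, but no further characters may follow a pair.
PAIRS_ONLY = re.compile(r'"[^\r\n"]*(?:"")*"')

# Full solution: each pair may be followed by more non-quote characters.
FULL = re.compile(r'"[^"\r\n]*(?:""[^"\r\n]*)*"')

text = '"before quote""after quote"'

# fullmatch() succeeds only if the pattern covers the entire text.
pairs_only_ok = PAIRS_ONLY.fullmatch(text) is not None
full_ok = FULL.fullmatch(text) is not None
print(pairs_only_ok, full_ok)
```

The pairs-only pattern cannot cover the whole text because ordinary characters follow the doubled quote; the full pattern can.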
