8.10. Change the Delimiter Used in CSV Files

Problem

You want to change all field-delimiting commas in a CSV file to tabs. Commas that occur within double-quoted values should be left alone.

Solution

The following regular expression matches an individual CSV field along with its preceding delimiter, if any. The preceding delimiter is usually a comma, but can also be an empty string (i.e., nothing) when matching the first field of the first record, or a line break when matching the first field of any subsequent record. Every time a match is found, the field itself, including the double quotes that may surround it, is captured to backreference 2, and its preceding delimiter is captured to backreference 1.

Tip

The regular expressions in this recipe are designed to work correctly only with valid CSV files, according to the format rules under the term “Comma-Separated Values (CSV)” that was discussed earlier in this book.

(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?
Regex options: None (“^ and $ match at line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Here is the same regular expression again in free-spacing mode:

( , | \r?\n | ^ )   # Capturing group 1 matches field delimiters
                    #   or the beginning of the string
(                   # Capturing group 2 matches a single field:
  [^",\r\n]+        #   a non-quoted field
|                   #  or...
  " (?:[^"]|"")* "  #   a quoted field (may contain escaped double-quotes)
)?                  # The group is optional because fields may be empty
Regex options: Free-spacing (“^ and $ ...

Get Regular Expressions Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.