8.11. Extract CSV Fields from a Specific Column

Problem

You want to extract every field from the third column of a CSV file.

Solution

The regular expressions from Recipe 8.10 can be reused here to iterate over each field in a CSV subject string. With a bit of extra code, you can count the number of fields from left to right in each row, or record, and extract the fields at the position you’re interested in.

The following regular expression (shown with and without the free-spacing option) matches a single CSV field and its preceding delimiter in two separate capturing groups. Since line breaks can appear within double-quoted fields, it would not be accurate to simply search from the beginning of each line in your CSV string. By matching and stepping past fields one by one, you can easily determine which line breaks appear outside of double-quoted fields and therefore start a new record.
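To see why searching from the beginning of each line falls short, consider the following minimal Python sketch. The sample string is made up for illustration: it holds two records, and the second field of the first record is a double-quoted field that contains a line break.

```python
# Hypothetical sample data: two records. The second field of the first
# record is double-quoted and contains a line break.
csv_text = 'a,"b\nc",d\ne,f,g'

# Naive approach: treat every line break as a record separator.
naive_rows = csv_text.split("\n")

# The string holds two records, but the naive split produces three
# pieces, because the line break inside the quoted field is counted too.
print(len(naive_rows))
print(naive_rows)
```

Stepping through the string one field at a time avoids this, because a line break consumed as part of a quoted field is never mistaken for a record boundary.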

Tip

The regular expressions in this recipe are designed to work correctly only with valid CSV files, according to the format rules discussed in “Comma-Separated Values (CSV).”

(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?
Regex options: None (“^ and $ match at line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
( , | \r?\n | ^ )   # Capturing group 1 matches field delimiters
                    # or the beginning of the string
(                   # Capturing group 2 matches a single field:
  [^",\r\n]+        #   a non-quoted field
|                   #  or...
  " (?:[^"]|"")* "  #   a quoted field (may contain escaped double-quotes)
)?                  # The group ...
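The field-counting approach described in this recipe might be sketched in Python as follows. This is one possible arrangement, not the only one: the names `FIELD_RE`, `unquote`, and `extract_column`, along with the sample data, are assumptions made for this illustration. The sketch uses the non-free-spacing regex shown above, and it assumes valid CSV input.

```python
import re

# The single-line regex from this recipe. No flags are set, so ^ matches
# only at the very start of the string.
FIELD_RE = re.compile(r'(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?')

def unquote(field):
    """Strip the surrounding quotes from a quoted field and unescape "" pairs."""
    if field.startswith('"'):
        return field[1:-1].replace('""', '"')
    return field

def extract_column(csv_text, index):
    """Return the field at zero-based position `index` from every record."""
    results = []
    field_no = 0
    for match in FIELD_RE.finditer(csv_text):
        if match.group(1) == ",":
            # A comma delimiter: the next field in the same record.
            field_no += 1
        else:
            # Start of the string, or a line break outside a quoted
            # field: a new record begins.
            field_no = 0
        if field_no == index:
            # Group 2 does not participate when the field is empty.
            results.append(unquote(match.group(2) or ""))
    return results

# Hypothetical sample data; the last record has only two fields.
sample = (
    'id,name,note\r\n'
    '1,Ann,"likes ""regex"""\r\n'
    '2,Bob,"line one\nline two"\r\n'
    '3,Cy'
)

third_column = extract_column(sample, 2)  # index 2 is the third column
print(third_column)
# prints ['note', 'likes "regex"', 'line one\nline two']
```

Records with fewer than three fields, such as the last one in the sample, contribute nothing to the result. The line break inside the quoted field of the third record is consumed as part of that field, so it does not reset the field counter.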
