  • david p moore thinks this is interesting:

This is a common text file format in which each line represents a single record,


From Spark: The Definitive Guide


Wrong. Per the CSV specification in RFC 4180, a field in a record can contain embedded newline characters, as long as the field is enclosed in double quotes. Spark does support this with the multiLine option. However, if you set multiLine to true, Spark can no longer split a CSV file on line boundaries, so each file has to be read by a single worker and you won't be able to process the data within that file in parallel.
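To make the point concrete, here is a minimal sketch using only Python's standard csv module (not Spark itself). The sample data is made up for illustration: it has a header plus two records, and one quoted field contains a newline. Counting physical lines and counting parsed records give different answers, which is why a reader that assumes "one line, one record" gets this wrong.

```python
import csv
import io

# Hypothetical sample data: a header and two records. The comment field of the
# first record contains an embedded newline inside double quotes.
raw = 'id,comment\r\n1,"first line\nsecond line"\r\n2,plain\r\n'

# Naive view: treat every physical line as a record.
physical_lines = raw.splitlines()

# Quote-aware view: csv.reader keeps a quoted field together even when it
# spans several physical lines. newline="" preserves the original line endings.
records = list(csv.reader(io.StringIO(raw, newline="")))

print(len(physical_lines))  # 4 physical lines
print(len(records))         # 3 records (header + 2 rows)
print(records[1])           # ['1', 'first line\nsecond line']
```

In PySpark the equivalent read would look something like `spark.read.option("header", True).option("multiLine", True).csv(path)`, where `path` is a placeholder for your own file location.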