Deduplication of nonconflicting data items

Duplication is a common problem when collecting large amounts of data. In this recipe, we will combine similar records in a way that ensures no information is lost.

Getting ready

Create an input.csv file with repeated data. Each row holds a name, a color, and a cost, and the color and cost fields may be left empty.

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

  1. Import the CSV parser, fromListWith from Data.Map, and the (<|>) operator from Control.Applicative, which we will use to combine Maybe values:
    import Text.CSV (parseCSV, Record)
    import Data.Map (fromListWith)
    import Control.Applicative ((<|>))
  2. Define the Item data type corresponding to the CSV input:
    data Item = Item { name  :: String
                     , color :: Maybe String
                     , cost  :: Maybe Float
                     } deriving ...
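
The merging idea behind this recipe can be sketched on its own, without reading a CSV file. This is a minimal sketch under stated assumptions, not the book's full listing: the helper names merge and dedupe, the sample items, and the deriving (Show, Eq) clause are all hypothetical, and records are assumed to be duplicates when they share a name. It relies on two facts: fromListWith combines values that land on the same key, and (<|>) on Maybe keeps the first value that is present.

```haskell
import Control.Applicative ((<|>))
import qualified Data.Map as Map

-- One input row; color and cost may be missing.
-- The deriving clause is an assumption made for this sketch.
data Item = Item
  { name  :: String
  , color :: Maybe String
  , cost  :: Maybe Float
  } deriving (Show, Eq)

-- Hypothetical helper: merge two records that share a name.
-- For each optional field, (<|>) keeps the first Just it finds,
-- so a known value is never replaced by Nothing.
merge :: Item -> Item -> Item
merge a b = Item (name a) (color a <|> color b) (cost a <|> cost b)

-- Hypothetical helper: key every item by name and let fromListWith
-- fold duplicates together with merge.
dedupe :: [Item] -> [Item]
dedupe items = Map.elems (Map.fromListWith merge [(name i, i) | i <- items])

-- Made-up sample data: two partial records for the same item.
sampleItems :: [Item]
sampleItems =
  [ Item "glasses" (Just "black") Nothing
  , Item "glasses" Nothing        (Just 60)
  , Item "shirt"   (Just "red")   (Just 15)
  ]

main :: IO ()
main = mapM_ print (dedupe sampleItems)
```

Because the two "glasses" records never disagree on a field, the merged record carries both the color and the cost. If two records did disagree, this sketch would silently keep one of the values, which is why it only suits nonconflicting data.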
