Selecting and cleansing the dataset

We have selected the transaction history of a single customer account as our dataset, which we will use to build the fraud detection model. A large bank can have millions of customers, and the historical transaction data of each customer can be used to build a model that is unique to that customer, based on their spending patterns.

We will start with the last three years of transaction history. Let's look at a single transaction to understand the information it captures:

Datum;Naam / Omschrijving;Rekening;Tegenrekening;Code;AfBij;Bedrag (EUR);MutatieSoort;Mededelingen
20151022;ZIGGO SERVICES BV;NL54INGB07XXX32XXX;NL98INGB0000845745;IC;Af;52,5;Incasso;Europese Incasso, doorlopend IBAN: NL98INGB0000845745 BIC: ...
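The sample record is semicolon-delimited, uses a compact YYYYMMDD date, and writes the amount with a decimal comma, so a first cleansing pass has to normalize those fields. The following Python sketch shows one way this could be done. It is not the pipeline used in the book: the function name parse_transactions, the English field names, and the choice to store debits ("Af") as negative amounts are assumptions made for illustration, as is the absence of thousands separators in the amount column. The truncated remarks field is kept exactly as it appears in the sample.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Header and record copied from the sample above; the remarks field is
# truncated in the source and is left that way here.
HEADER = ("Datum;Naam / Omschrijving;Rekening;Tegenrekening;Code;AfBij;"
          "Bedrag (EUR);MutatieSoort;Mededelingen")
RECORD = ("20151022;ZIGGO SERVICES BV;NL54INGB07XXX32XXX;NL98INGB0000845745;"
          "IC;Af;52,5;Incasso;Europese Incasso, doorlopend IBAN: "
          "NL98INGB0000845745 BIC: ...")


def parse_transactions(text):
    """Parse semicolon-delimited bank records into cleaned dictionaries.

    Hypothetical helper: the output field names are illustrative only.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for raw in reader:
        # Amounts use a decimal comma ("52,5"); assume no thousands separator.
        amount = Decimal(raw["Bedrag (EUR)"].replace(",", "."))
        # "Af" marks a debit and "Bij" a credit; store debits as negative
        # values (an assumption made for this sketch).
        if raw["AfBij"] == "Af":
            amount = -amount
        rows.append({
            "date": datetime.strptime(raw["Datum"], "%Y%m%d").date(),
            "description": raw["Naam / Omschrijving"],
            "account": raw["Rekening"],
            "counter_account": raw["Tegenrekening"],
            "code": raw["Code"],
            "signed_amount": amount,
            "mutation_type": raw["MutatieSoort"],
            "remarks": raw["Mededelingen"],
        })
    return rows


transactions = parse_transactions(HEADER + "\n" + RECORD)
print(transactions[0]["date"], transactions[0]["signed_amount"])
```

Decimal is used instead of float so that currency values are not subject to binary rounding. A real export may contain quoted fields or other formatting that this sketch does not handle.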
