Now that we have a way to invoke mappers in parallel, let's look at the logic they implement. Remember that our task is to count how many times each (From, To) pair of email addresses occurs across a large number of email messages.
The work involved here is relatively straightforward. With each mapper receiving a unique 100 MB file, each invocation will perform the same set of tasks:
- Download the file from S3
- Parse each message and extract the From and To fields, making sure to account for group sends (where the From user sends to multiple To addresses)
- Count the number of (From, To) occurrences
- Write the results to S3
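To make the parsing and counting steps concrete, here is a minimal sketch using only the Python standard library. It is not the mapper.py listing itself. The function name `count_from_to` is hypothetical, the messages are assumed to arrive as raw RFC 822 text strings, and the S3 download and upload steps are left out entirely.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parseaddr


def count_from_to(raw_messages):
    """Count (From, To) pairs across an iterable of raw message strings.

    Hypothetical helper for illustration; assumes each item is the full
    RFC 822 text of one email message.
    """
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # parseaddr returns (display_name, address); keep only the address.
        _, sender = parseaddr(msg.get("From", ""))
        if not sender:
            continue  # skip messages without a usable From header
        # A group send lists several recipients in the To header(s);
        # getaddresses splits them so each recipient yields its own pair.
        for _, recipient in getaddresses(msg.get_all("To", [])):
            if recipient:
                counts[(sender, recipient)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "From: alice@example.com\nTo: bob@example.com, carol@example.com\n\nhi",
        "From: alice@example.com\nTo: bob@example.com\n\nhello again",
    ]
    for pair, n in count_from_to(sample).items():
        print(pair, n)
```

A `Counter` keyed on the (From, To) tuple keeps the group-send case simple: a message with three recipients just increments three separate keys.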
I've shown the full listing of mapreduce/mapper.py in the following code block:
import csv
import itertools