
A guest post by John Sullivan, a 15-year Java veteran who has been programming in Scala for 2+ years. He enjoys posting in-depth articles on his popular Scala-oriented blog, and he is currently employed as a Principal Software Engineer at the Broad Institute.

The Scala collections API is a very powerful framework. Many collections that I used to build by hand in Java, I now create in Scala with a single method call. Instead of dissecting the Scala collections API method by method, in this post I will share a story that contains examples of using the API. I won’t describe what every method does, but you can read the Scaladoc to learn more about Seq and Map, the two collection types used here.

Demographic Data

My friend Sasha is opening a pet store in the Los Angeles area. In order to get a sense of the potential market there, she purchased some demographic data from a third party which provides her with information about people and their pets. She has a file that contains the following columns:

  • First Name
  • Last Name
  • Age
  • Annual Income
  • Street Address
  • City
  • State
  • Zip Code
  • Pet Type
  • Pet Name
  • Pet Age
  • Breed

For people that do not have any pets, the last four columns are empty. For people with multiple pets, the data in the earlier columns is duplicated, giving one line per pet.

Sasha asked my advice about what kind of database to store her data in that would allow her to run some basic analyses on the data set. When I learned that she only had a hundred thousand rows in the file, I told her that she didn’t need a database. Instead, she could load the data directly into program memory.

Processing the Data into Scala

I worked with Sasha to build a little domain model using Scala case classes:
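A minimal sketch of such a model, assuming one case class each for addresses, pets, and people. The class and field names are illustrative guesses derived from the column list above, not the original code:

```scala
// Hypothetical domain model; all names here are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)

case class Pet(petType: String, name: String, age: Int, breed: String)

// One Person per individual; the duplicated rows for people with several
// pets collapse into a single Person holding a sequence of pets.
case class Person(
  firstName: String,
  lastName: String,
  age: Int,
  annualIncome: Int,
  address: Address,
  pets: Seq[Pet])
```

Because these are case classes, structural equality and `copy` come for free, which is convenient for the duplicate checks later on.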

Loading the Data

Next, we wrote a function that processes the file, loading the contents into a sequence. We now have all of the file data loaded into our domain model:
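A loader along these lines might look like the sketch below. It assumes a comma-separated file with no header row and no quoted or embedded commas, with columns in the order listed earlier; the function names `parseLine`, `parsePeople`, and `loadPeople` and the domain model field names are hypothetical:

```scala
import scala.io.Source

// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

// Parse one line into a person (without pets yet) and the optional pet on that line.
def parseLine(line: String): (Person, Option[Pet]) = {
  // A limit of -1 keeps the trailing empty columns for people without pets.
  val cols = line.split(",", -1).map(_.trim)
  val address = Address(cols(4), cols(5), cols(6), cols(7))
  val person = Person(cols(0), cols(1), cols(2).toInt, cols(3).toInt, address, Seq.empty)
  val pet =
    if (cols(8).isEmpty) None
    else Some(Pet(cols(8), cols(9), cols(10).toInt, cols(11)))
  (person, pet)
}

// Merge the one-line-per-pet rows into one Person per individual.
def parsePeople(lines: Seq[String]): Seq[Person] = {
  val parsed = lines.filter(_.trim.nonEmpty).map(parseLine)
  // distinct keeps the first occurrence, so file order is preserved.
  val inFileOrder = parsed.map(_._1).distinct
  val petsByPerson = parsed.groupBy(_._1).map { case (p, rows) => p -> rows.flatMap(_._2) }
  inFileOrder.map(p => p.copy(pets = petsByPerson(p)))
}

// Read the whole file into memory; fine for a file of this size.
def loadPeople(path: String): Seq[Person] = {
  val source = Source.fromFile(path)
  try parsePeople(source.getLines().toList) finally source.close()
}
```

A real data file with quoted fields would need a proper CSV parser instead of `split`.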

While the data set covers many Los Angeles neighborhoods, Sasha is focusing her business plans on high-income pet owners in West Hollywood. So the first thing we do is create sub-lists for her focus areas:
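These sub-lists are plain `filter` calls. A sketch, assuming the hypothetical domain model below and an income threshold passed in by the caller (the source does not say what counts as high income):

```scala
// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

// People whose address is in West Hollywood.
def westHollywood(people: Seq[Person]): Seq[Person] =
  people.filter(_.address.city == "West Hollywood")

// People at or above an income threshold chosen by the caller.
def highIncome(people: Seq[Person], threshold: Int): Seq[Person] =
  people.filter(_.annualIncome >= threshold)

// People with at least one pet.
def petOwners(people: Seq[Person]): Seq[Person] =
  people.filter(_.pets.nonEmpty)
```

The filters compose, so `petOwners(highIncome(westHollywood(people), 100000))` narrows the data to the core target market in one expression.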

She wants to check for duplicates, as a quality check on the data she bought:
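One plausible way to write `numDuplicates` is to compare the size of the sequence with the size of its `distinct` version. This is a sketch over an assumed domain model, not necessarily how the original function was written:

```scala
// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

// Counts exact duplicates: elements beyond the first occurrence of each value.
// Works for any element type with sensible equality, such as case classes.
def numDuplicates[A](items: Seq[A]): Int =
  items.size - items.distinct.size
```

Since case classes compare by value, two `Person` instances count as duplicates only if every field, including the address and the pets, is equal.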

The numDuplicates function returns 0, but we are concerned about non-exact duplicates, such as misspellings, or the same person at multiple addresses. We talk it over for a while, and agree that while a robust solution would take some effort, we can smoke-test for problems like these with something quick and dirty. So we check for people with the same name and age:
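A quick and dirty check of this kind can project each person down to name and age before counting. In the sketch below, the `Vitals` class, the `vitals` function, and `numDuplicateVitals` are hypothetical names, as are the domain model fields:

```scala
// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

// The subset of fields we treat as identifying a person for this smoke test.
case class Vitals(firstName: String, lastName: String, age: Int)

def vitals(p: Person): Vitals = Vitals(p.firstName, p.lastName, p.age)

// Counts people whose name and age repeat an earlier person's.
def numDuplicateVitals(people: Seq[Person]): Int =
  people.size - people.map(vitals).distinct.size
```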

We turn up more duplicates this way than we are comfortable with, so we decide to group them together and examine them, to determine if they are false matches. We use the groupBy method to group people by their vitals, and then we filter out groups of size 1:
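The grouping step could be sketched as follows, again with hypothetical names (`Vitals`, `vitals`, `duplicateVitals`) and an assumed domain model:

```scala
// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

case class Vitals(firstName: String, lastName: String, age: Int)

def vitals(p: Person): Vitals = Vitals(p.firstName, p.lastName, p.age)

// groupBy builds a Map[Vitals, Seq[Person]]; keeping only groups with more
// than one member discards everyone whose vitals are unique.
def duplicateVitals(people: Seq[Person]): Seq[Seq[Person]] =
  people.groupBy(vitals).values.filter(_.size > 1).toSeq
```

The order of the groups is not specified, because `groupBy` returns an unordered map; the people within each group keep their original relative order.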

This returns a list of lists, where each inner list holds the people who share the same vitals. Examining the results, we assure ourselves that most of our duplicates are actually unique people with the same names and ages. By now, we’ve poked around with the data enough to have some degree of confidence that it is clean.

Data Mining

Now we can start doing some simple analyses. For instance, for any given sequence of people, we want to know how many share the same address:
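One reading of this question is "how many people live at an address that at least one other person in the sequence also lives at". Under that assumption, and with a hypothetical domain model and function name, a sketch:

```scala
// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

// Group people by address, keep the addresses with more than one resident,
// and add up the residents of those addresses.
def numSharingAddress(people: Seq[Person]): Int =
  people.groupBy(_.address).values.filter(_.size > 1).map(_.size).sum
```

This relies on addresses being spelled identically; "1 Main St" and "1 Main Street" would count as different addresses.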

How about a sequence of pets for a sequence of people? How many pets are in a sequence of people?
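Both questions are one-liners with `flatMap` and `map`. A sketch over an assumed domain model, with hypothetical function names:

```scala
// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

// flatMap concatenates every person's pets into one flat sequence.
def allPets(people: Seq[Person]): Seq[Pet] =
  people.flatMap(_.pets)

// Sum the per-person pet counts.
def numPets(people: Seq[Person]): Int =
  people.map(_.pets.size).sum
```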

And, how many people are there by zip code? How many pets are there by zip code?
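Counting by zip code is a `groupBy` followed by a `map` over the resulting groups. A sketch, assuming the domain model below; the function names are hypothetical:

```scala
// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

// Number of people in each zip code.
def peopleByZip(people: Seq[Person]): Map[String, Int] =
  people.groupBy(_.address.zipCode).map { case (zip, residents) => zip -> residents.size }

// Number of pets in each zip code, including zip codes with zero pets.
def petsByZip(people: Seq[Person]): Map[String, Int] =
  people.groupBy(_.address.zipCode).map { case (zip, residents) =>
    zip -> residents.map(_.pets.size).sum
  }
```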

Statistics on Income and Number of Pets

Sasha wants to collect some statistics on people’s income, and the number of pets they own. We decide to start with just the mean and standard deviation:
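A small case class is enough to carry the two numbers. The `Stats` name appears in the text; the field names in this sketch are assumptions:

```scala
// Holds the two summary statistics; the field names are assumptions.
case class Stats(mean: Double, stdDev: Double)
```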

This story is mostly about operations for manipulating Scala collections, but in case you are interested, here is how we compute our Stats for a list of numbers:
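A sketch of that computation, assuming the population standard deviation (dividing by n rather than n - 1; the source does not say which was used) and a hypothetical `stats` function name:

```scala
// Holds the two summary statistics; the field names are assumptions.
case class Stats(mean: Double, stdDev: Double)

// Mean and population standard deviation of a non-empty list of numbers.
def stats(xs: Seq[Double]): Stats = {
  require(xs.nonEmpty, "cannot compute statistics for an empty sequence")
  val mean = xs.sum / xs.size
  // Average squared distance from the mean.
  val variance = xs.map(x => math.pow(x - mean, 2)).sum / xs.size
  Stats(mean, math.sqrt(variance))
}
```

For the numbers 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so the variance is 4 and the standard deviation is 2.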

Now we can produce income statistics on our overall list of people, our West Hollywood subset, and our high-income subset, with the following methods:
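Since every subset is just another sequence of people, a single function covers all three cases. A sketch with an assumed domain model, a population standard deviation, and hypothetical function names:

```scala
// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

case class Stats(mean: Double, stdDev: Double)

// Mean and population standard deviation (an assumption) of a non-empty list.
def stats(xs: Seq[Double]): Stats = {
  require(xs.nonEmpty, "cannot compute statistics for an empty sequence")
  val mean = xs.sum / xs.size
  val variance = xs.map(x => math.pow(x - mean, 2)).sum / xs.size
  Stats(mean, math.sqrt(variance))
}

// Income statistics for any sequence of people.
def incomeStats(people: Seq[Person]): Stats =
  stats(people.map(_.annualIncome.toDouble))

// Usage: pass the full list or any filtered subset, for example
//   incomeStats(people)
//   incomeStats(people.filter(_.address.city == "West Hollywood"))
```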

We are also interested in statistics on the number of pets people have:
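The same approach works for pet counts: map each person to the number of pets they own, then summarize. A sketch with an assumed domain model, a population standard deviation, and hypothetical function names:

```scala
// Hypothetical domain model; the field names are assumptions.
case class Address(street: String, city: String, state: String, zipCode: String)
case class Pet(petType: String, name: String, age: Int, breed: String)
case class Person(firstName: String, lastName: String, age: Int, annualIncome: Int,
                  address: Address, pets: Seq[Pet])

case class Stats(mean: Double, stdDev: Double)

// Mean and population standard deviation (an assumption) of a non-empty list.
def stats(xs: Seq[Double]): Stats = {
  require(xs.nonEmpty, "cannot compute statistics for an empty sequence")
  val mean = xs.sum / xs.size
  val variance = xs.map(x => math.pow(x - mean, 2)).sum / xs.size
  Stats(mean, math.sqrt(variance))
}

// Statistics on the number of pets per person, counting people with no pets as 0.
def petCountStats(people: Seq[Person]): Stats =
  stats(people.map(_.pets.size.toDouble))
```

To look only at pet owners, filter first: `petCountStats(people.filter(_.pets.nonEmpty))`.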


Sasha and I continued on with many more such basic analyses. We benefited tremendously from the ease of use of the Scala collections library to interactively explore our little data set. Because many of these methods are so easy to write, we did a lot of experimentation directly in the Scala REPL. I can’t help but think about how each little example here would be its own little project in Java.

About the author

John Sullivan is a professional software engineer and technical blogger. A 15-year Java veteran, he has been happily programming in Scala for 2+ years. He enjoys posting in-depth articles on his popular Scala-oriented blog. His major interests outside of Scala include software engineering best practices, the agile development process, and Domain Driven Design. John has a Master of Science in Computer Science from UMass Boston. He is currently employed as a Principal Software Engineer at the Broad Institute.

Tags: Data Mining, databases, map, pets, Scala, Scala Collections API, Seq

2 Responses to “Scala Collections API: For People and their Pets”

  1. Jim

    Very nice.

    Just curious as to why you chose Seq. Scala has quite a number of different collection types and I’m still trying to understand their different strengths and uses.


  2. John Sullivan

    Thanks Jim. Why Seq – that’s an excellent question. When I first started writing Scala, I defaulted to the Java practice of using abstract types such as List for typing variables. The equivalent to Java’s List in Scala is Seq. It’s the most general type for ordered collections.

    But I’m starting to wonder if this is the best thing. For one, different Java List implementations, and different Scala Seqs, have different performance characteristics that should be noted. I’ve come to feel that it should be part of the API of the collection whether operations such as head and tail, or indexed access, are O(1) or O(n). I think in the future, I will be changing my practice to explicitly using either Vector or List in Scala, but at this point, I’m still using Seq. In these examples, I am just mapping over my Seqs, and the performance for map operations should be O(n) regardless of the underlying implementation.

    Best, John