Analyzing the Analyzers by Marck Vaisman, Sean Murphy, Harlan Harris

Chapter 1. Introduction

Binita, Chao, Dmitri, and Rebecca are data scientists. What does that statement tell you about them? Probably not as much as you’d like. You know they probably know something about statistics, programming, and data visualization. You’d hope that they had some experience finding insights from data, maybe even “big data.” But if you’re trying to find the best person for a job, you need to be more specific than just “doctor,” or “athlete,” or “data scientist.” And that’s a problem. Finding the right people for a task is all about efficient communication and, without the appropriate shared vocabulary, data science talent and data science problems are too often kept apart.

The three of us, organizers of data science events in Washington, DC, decided that we wanted to do something about this problem after too many personal experiences of failures caused by miscommunication. So in mid-2012 we surveyed data scientists, asking about their experiences and how they viewed their own skills and careers. The results may help us, as a professional community, settle on finer-grained descriptions and more effective means of communicating about what we do for a living.

We start by describing four fictitious data scientists, each typical of one of four categories that emerged from the survey. Their variety is striking.

Binita works for Acme Industries, a Fortune 100 manufacturing company, as Director of Analytics. She manages a small team of technical analysts and spends rather more time in meetings than she wishes. She really likes getting her hands dirty, diving into data sets when she has time, and helping her team design compelling visualizations and predictive models that will go into production. But the payoff is the presentation to senior management, translating statistical jargon into business lingo, p-values into profits. Binita has a bachelor’s in Industrial Engineering and an MBA, and she spent several years in consulting before moving to her role at Acme. She reads all about the new “big data” and “data science” buzzwords in the business press, and sees value in her skills, but isn’t sure which labels apply to her. Maybe she’ll start an analytics consulting firm of her own soon?

Chao has a finger in every pie. By day he builds interactive web graphics for a major newspaper, but by night he goes to technical Meetups and works on an open-source Python package for mapping spatial data. A few times a year he goes to hackathons, teaming up with others to prototype new businesses or dive into public data sets. Chao has an undergraduate degree in economics, minoring in computer science. He started a Master’s in statistics before dropping out and trying unsuccessfully to start a statistical consulting firm. He’s been following the blogs and tweets about data science since 2009, and his business cards (the ones he made on Moo with colorful data visualizations, not the boring ones he gets from work) say “Chao, Data Scientist Extraordinaire!”

Dmitri writes really fast, elegant, maintainable machine learning code. He works for a medium-sized consulting firm that provides predictive models for companies without the resources to build systems themselves. The skills section of his resume has five dense lines of technologies like Hadoop, SVM, and Scala. Dmitri keeps up with the machine learning literature, which he started reading when he was writing his Master’s thesis in computer science. He’s contributed a few patches to an open-source big data package that he uses in his work. Dmitri is pretty happy with his job, but imagines he’ll find a different development job in a few years. Maybe something using Dremel or other massive columnar databases; that stuff looks pretty cool.

Rebecca works for an internet retailer and has the title Data Scientist. Ten years ago, if you had asked her what she’d be doing now, she’d have said, “I guess I’ll be a professor by then.” After spending 10 years studying molecular biology, building statistical models, programming simulations, managing complex data sets, publishing papers, and presenting at conferences, she decided she was bored. Rebecca left her post-doc and started farming out her resume, tweaking the language based on articles she’d read about the need for data science in industry. Now she helps the company figure out which marketing practices are actually useful, builds predictive models of future sales, and looks for relevant patterns in Twitter data. Fun stuff! She still gets to read academic papers, learn new tools, and play with a vast array of data. But now her insights get noticed, and her work turns into real changes in the business.

Why do people use the term “data scientist” to describe all of these professionals? Does it clarify expectations, distinguish people with different strengths, and let practitioners and organizations communicate effectively and make good decisions? Does it define an attainable career path and suggest professional growth options? Or does it instead lead to confusion, misunderstandings, and missed opportunities?

We think that terms like “data scientist,” “analytics,” and “big data” are the result of what one might call a “buzzword meat grinder.” The people doing this work used to come from more traditional and established fields: statistics, machine learning, databases, operations research, business intelligence, social or physical sciences, and more. All of those professions have clear expectations about what a practitioner is able to do (and not do), substantial communities, and well-defined educational and career paths, including specializations based on the intersection of available skill sets and market needs. This is not yet true of the new buzzwords. Instead, ambiguity reigns, leading to impaired communication (Grice, 1975) and failures to efficiently match talent to projects.

In the rest of this article, we’ll see how miscommunication about data science skills and roles led to wasted time and effort for Dmitri and Binita. We’ll use the survey results to identify a new, more precise vocabulary for talking about their work, based on how data scientists describe themselves and their skills. We’ll discuss how data scientists are both broad and deep and what this means for career growth and effectiveness. And finally, we’ll turn from the practitioner’s to the organization’s point of view and consider how to apply the survey results when trying to identify, train, integrate, team up, and promote data scientists.
