Chapter 2. An Operating System for Big Data

Data teams are usually structured as small teams of five to seven members who employ a hypothesis-driven workflow using agile methodologies. Although data scientists typically see themselves as jack-of-all-trades generalists with a wide array of data-oriented skills,1 they tend to specialize in either software, statistics, or domain expertise. Data teams therefore are composed of members who fit into three broad categories: data engineers are responsible for the practical aspects of the wiring and mechanics of data, usually relating to software and computing resources; data modelers focus on the exploration and explanation of data and creating inferential or predictive data products; and finally, subject matter experts provide domain knowledge to problem solving both in terms of process and application.

Data teams that utilize Hadoop tend to place a primary emphasis on the data engineering aspects of data science due to the technical nature of distributed computing. Big datasets lend themselves to aggregation-based approaches (over instance-based approaches) and a large toolset for distributed machine learning and statistical analyses exists already. For this reason, most literature about Hadoop is targeted at software developers, who usually specialize in Java—the software language the Hadoop API is written in. Moreover, those training materials tend to focus on the architectural aspects of Hadoop as those aspects demonstrate the fundamental ...

Get Data Analytics with Hadoop now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.