
Agile Data Science by Russell Jurney


Chapter 1. Theory

We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

That is, while there is value in the items on the right, we value the items on the left more.

The Agile Manifesto

Agile Big Data

Agile Big Data is a development methodology that copes with the unpredictable realities of creating analytics applications from data at scale. It is a guide for operating the Hadoop ‘data refinery’ to harness the power of Big Data.

Warehouse-scale computing has given us enormous storage and compute resources to solve new kinds of problems: storing and processing unprecedented amounts of data. There is great interest in bringing new tools to bear on formerly intractable problems, to derive entirely new products from raw data, to refine raw data into profitable insight, and to productize and productionize insight in new kinds of applications: analytics applications. Our tools are processor cores and disk spindles, paired with visualization, statistics and machine learning. This is Data Science.

At the same time, during the last twenty years, the World Wide Web has emerged as the dominant medium for information exchange. During this time, software engineering has been transformed by the “Agile” revolution in how applications are conceived, built and maintained. These new processes bring more projects and products in on time and under budget, and enable small teams or single actors to develop entire applications spanning broad domains. This is Agile Software Development.

But there’s a problem: working with real data in the wild, doing data science, performing serious research takes time. Longer than an agile cycle: on the order of months. That is more time than many organizations allow for a project sprint, so today’s applied researcher is pressed for time. Data science is stuck on the old-school software schedule known as the ‘waterfall method.’

Our problem and our opportunity come at the intersection of these two trends: how can we incorporate data science, which is applied research and requires exhaustive effort on an unpredictable timeline, into the agile application? How can analytics applications do better than the waterfall method that we’ve long left behind? How can we craft applications for unknown, evolving data models?

This book attempts a synthesis of two fields, agile development and big data science, to meld research and engineering into a productive relationship. To achieve this, we present a lightweight toolset that can cope with the uncertain, shifting sea of raw data. We then show how to iteratively build value using this stack: to get back to agility, mine data for value, and turn data into dollars.

Agile Big Data aims to put you back in the driver’s seat, ensuring your applied research produces useful products meeting the needs of real users.

Big Words Defined

Scalability, NoSQL, Cloud Computing, Big Data: these are all controversial terms. We define them here as they pertain to Agile Big Data.

  • Scalability is the ease with which one can grow or shrink an operation in response to demand. In Agile Big Data, it means software tools and techniques whose cost and complexity grow sub-linearly as the load and complexity of an application grow linearly. We use the same tools for data large and small, and we embrace a methodology that lets us build once, rather than re-engineer continuously as we go.

  • NoSQL means ‘Not only SQL.’ This means escaping the bounds imposed by storing structured data in monolithic relational databases. It means going beyond tools that were optimized for Online Transaction Processing (OLTP) and extended to Online Analytic Processing (OLAP), to use a broader set of tools that are better suited to viewing data in terms of analytic structures and algorithms. It means escaping the bounds of a single machine with expensive storage, and starting out with concurrent systems that will grow linearly as users and load increase. It means not hitting a wall as soon as our database gets bogged down, then struggling to tune, shard and mitigate problems continuously.

    The NoSQL tools we’ll be using are Hadoop, a highly parallel batch-processing system, and MongoDB, a distributed document store.
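    Hadoop’s batch model can be sketched in plain Python as a map phase followed by a reduce phase over records. This toy word count illustrates the programming model only; it is not Hadoop itself, and the function names are ours.

```python
# A toy illustration of Hadoop's batch model: map each record to
# key/value pairs, group by key, then reduce each group to a result.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Emit (word, 1) for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group pairs by key and sum the values, as a reducer would."""
    # groupby requires its input sorted by the grouping key
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(value for _, value in group))

records = ["big data", "agile big data"]
counts = dict(reduce_phase(map_phase(records)))
print(counts)  # {'agile': 1, 'big': 2, 'data': 2}
```

    In Hadoop, the map and reduce phases run in parallel across many machines, with the framework handling the sort-and-group step between them.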

  • Cloud Computing means employing infrastructure as a service from providers like Amazon Web Services to compose applications at the level of data-center as computer. As application developers, we use cloud computing to avoid getting bogged down in the details of infrastructure, while building applications that scale.

  • Big Data is a market around the belief that enormous value will be extracted from the ever-increasing pile of transaction logs being aggregated by the mission-critical systems of today and tomorrow. Big Data systems use local storage, commodity server hardware, and free and open source software to cheaply process data at a scale where it becomes feasible to work with atomic records, voluminously logged and processed.

Agile Big Data Teams

Products are built by teams of people, and agile methods focus on people over process, so Agile Big Data starts with the team.

Data Science is a broad discipline, spanning analysis, design, development, business and research. The roles of an Agile Big Data team, defined on a spectrum from customer to operations, look something like this:

Figure 1-1. The roles in an Agile Big Data team, from customer to operations: Customer, Business Development, Market Strategist, Product Manager, Experience Designer, Interaction Designer, Web Developer, Engineer, Data Scientist, Researcher, Platform Engineer, DevOps Engineer

We define these roles as:

Customers use your product. They click your buttons and links. Or they ignore you completely. Your job is to create value for them repeatedly. Their interest determines the success of your product.
Business Development signs early customers, either firsthand or through the creation of landing pages and promotion, and delivers traction from the product in market.
Market Strategists talk to customers to determine which markets to pursue. They determine the starting perspective from which an Agile Big Data product begins.
Product Managers take in the perspectives of each role, synthesizing them to build consensus about the vision and direction of the product.
User Experience Designers are responsible for fitting the design around the data to match the perspective of the customer. This role is critical, as the output of statistical models can be difficult to interpret for ‘normal’ users who have no concept of the semantics of the model’s output: how can something be 75% true?
Interaction Designers design interactions around data models so users find their value.
Web Developers create the web applications that deliver data to a web browser.
Engineers build the systems that deliver data to applications.
Data Scientists explore and transform data in novel ways to create and publish new features and combine data from diverse sources to create new value. Data scientists make visualizations with researchers, engineers, web developers and designers to expose raw, intermediate and refined data early and often.
Applied Researchers solve the heavy problems that data scientists uncover and that stand in the way of delivering value. These problems take intense focus and time and require novel methods from statistics and machine learning.
Platform Engineers solve problems in the distributed infrastructure that enable Agile Big Data at scale to proceed without undue pain. Platform engineers handle work tickets for immediate blocking bugs and implement long-term plans and projects to maintain and improve usability for researchers, data scientists and engineers.
Operations or DevOps professionals ensure smooth setup and operation of production data infrastructure. They automate deployment and take pages when things go wrong.

Opportunity and Problem

The broad skill set needed to build data products presents both an opportunity and a problem. If these skills can be brought to bear by experts in each role working as a team on a rich dataset, problems can be decomposed into parts and directly attacked. Data science is then an efficient assembly line.

However, as team size increases to satisfy the need for expertise in these diverse areas, communication overhead quickly dominates. A researcher who is eight persons removed from customers is unlikely to solve relevant problems, and more likely to solve arcane ones. Likewise, team meetings of a dozen individuals are unlikely to be productive. We might split this team into multiple departments and establish contracts of delivery between them, but then we lose both agility and cohesion. Waiting on the output of research, we invent specifications, and soon we find ourselves back in the waterfall method.

Figure 1-2. Expert Contributor Workflow: the flow of work among Customer, Business Development, Market Strategist, Product Manager, Experience Designer, Interaction Designer, Web Developer, Engineer, Data Hacker, Applied Researcher, Platform Engineer and DevOps

And yet we know that agility, a cohesive vision and consensus about a product are essential to our success in building products. The worst product problem is one team working on more than one vision. How are we to reconcile the increased span of expertise and the disjoint timelines of applied research, data science, software development and design?

Adapting to Change

In order to remain agile, we must embrace and adapt to these new conditions. We must adopt changes in line with lean methodologies to keep productive.

Several changes in particular make a return to agility possible.

  • Generalists over specialists.

  • Small teams over large teams.

  • High level tools and platforms: cloud computing, distributed systems and Platforms as a Service (PaaS).

  • Continuous and iterative sharing of intermediate work, even when that work may be incomplete.

In Agile Big Data, a small team of generalists uses scalable, high-level tools and cloud computing to iteratively refine data into increasingly higher states of value. We embrace a software stack leveraging cloud computing, distributed systems and platforms as a service. Then we use this stack to iteratively publish the intermediate results of even our most in-depth research, to snowball value from simple records to predictions and actions that create value and let us capture some of it, turning data into dollars. Let’s examine each item in detail.

The Power of Generalists

Figure 1-3. Broad roles in an Agile Big Data team: three actors cover the roles of (Business Development, Market Strategist, Product Manager), (Experience Designer, Interaction Designer, Web Developer) and (Engineer, Data Hacker, Applied Researcher, Platform Engineer, DevOps)

In Agile Big Data we value generalists over specialists. In other words, we measure the breadth of a teammate’s skills as much as the depth of their knowledge and their talent in any one area. Examples of good Agile Big Data team members include:

  • Designers that deliver working CSS.

  • Web developers that build entire applications and understand user interface and experience.

  • Data scientists capable of both research and building web services and applications.

  • Researchers that check in working source code, explain results and share intermediate data.

  • Product managers able to understand the nuances in all areas.

Designing a product is keeping five thousand things in your brain and fitting them all together in new and different ways to get what you want. And every day you discover something new that is a new problem or a new opportunity to fit these things together a little differently.

And it’s that process that is the magic.

Steve Jobs, The Lost Interview

Design in particular is a critical role on the Agile Big Data team. Design does not end with appearance or experience. Design encompasses all aspects of the product, from architecture and distribution to user experience and work environment.

Agile Platforms

In Agile Big Data we use the easiest, most approachable distributed systems, along with cloud computing and platforms as a service, to minimize infrastructure costs and maximize productivity. The simplicity of our stack helps enable a return to agility. We’ll use this stack to compose scalable systems in as few steps as possible. This lets us move fast and consume all available data without quickly running into scalability problems that would cause us to discard data or remake our application in flight. That is to say, we build it only once.

Sharing Intermediate Results

Finally, to address the very real differences in timelines between researchers, data scientists and the rest of the team, we adopt a sort of ‘data collage’ as our mechanism for mending these disjoint timescales. ‘Data collage’ means we piece our app together from an abundance of views, visualizations and properties that form the ‘menu’ for our application.

Researchers and data scientists, who work on longer timelines than agile sprints typically allow, generate data daily, albeit not in a ‘publishable’ state. In Agile Big Data, there is no unpublishable state. The rest of the team must see weekly, if not daily (or more frequent), updates on the state of the data. This kind of engagement with researchers is essential to unifying the team and enabling product management.

That means publishing intermediate results: incomplete data, the scraps of analysis. These ‘clues’ keep the team united, and as these results become interactive, everyone becomes informed as to the true nature of the data, the progress of the research, and how to combine clues into features of value. Development and design must proceed from this shared reality. The audience for these continuous releases can start small and grow as the results become presentable, but customers must be included quickly.
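One minimal way to publish intermediate results is to have every stage of a pipeline write its output somewhere the whole team can inspect it. The sketch below writes each stage’s records as a JSON file; the stage names and the `publish()` helper are hypothetical, for illustration only.

```python
# A minimal sketch of publishing intermediate results: each pipeline
# stage dumps its output as JSON so teammates can see the data at every
# state of refinement, from raw scraps to a refined aggregate.
import json
import os
import tempfile

def publish(stage_name, records, out_dir):
    """Write a stage's records to <out_dir>/<stage_name>.json and return the path."""
    path = os.path.join(out_dir, f"{stage_name}.json")
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
    return path

out_dir = tempfile.mkdtemp()

raw = [{"email": "RUSSELL@EXAMPLE.COM "}, {"email": "bob@example.com"}]
publish("01_raw", raw, out_dir)            # scraps: raw records, dirt and all

cleaned = [{"email": r["email"].strip().lower()} for r in raw]
publish("02_cleaned", cleaned, out_dir)    # intermediate: normalized fields

by_domain = {"example.com": len(cleaned)}
publish("03_by_domain", by_domain, out_dir)  # refined: an aggregate view

print(sorted(os.listdir(out_dir)))
# ['01_raw.json', '02_cleaned.json', '03_by_domain.json']
```

Numbering the stage files makes the pipeline’s progression visible at a glance; swapping the flat files for a document store like MongoDB follows the same pattern.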

Figure 1-4. Growing audience from conception to launch

Agile Big Data Process

The Agile Big Data process embraces the iterative nature of data science and the efficiency our tools enable, letting us iteratively build and extract increasing levels of structure and value from our data.

With the spectrum of skills on a data product team, the possibilities are endless. Spanning so many disciplines, building web products is inherently collaborative. To collaborate, teams need direction: every team member passionately and stubbornly pursuing a common goal. To get that direction, you need consensus.

Building and maintaining consensus while collaborating is the hardest part of building software. The principal risk in software product teams is building to different blueprints. Clashing visions result in a lack of cohesion: holes that sink products.

Applications are sometimes mocked before they are built: product managers conduct market research while designers iterate mocks with feedback from prospective users. These mocks serve as a common blueprint for the team.

Real-world requirements change even when our data is static, as we learn from our users and conditions change. So our blueprints must change with time. Agile methods were created to facilitate implementation of evolving requirements, and to replace mockups with real working systems as soon as possible.

Typical web products, those driven by forms backed by predictable, constrained transaction data in relational databases, have fundamentally different properties than products featuring mined data. In CRUD applications, data is relatively consistent. The models are predictable SQL tables or documents, and changing them is a product decision. The data’s ‘opinion’ is irrelevant, and the product team is free to impose its will on the model to match the business logic of the application.

In interactive products driven by mined data, none of that holds. Real data is dirty. Mining always involves dirt; if the data weren’t dirty, it wouldn’t be data mining. Even carefully extracted and refined mined information can be fuzzy and unpredictable. Presenting it on the consumer internet requires long labor and great care.

In data products, the data is ruthlessly opinionated. Whatever we wish the data to say, it is unconcerned with our own opinions. It says what it says. This means the waterfall model has no application. It also means that mocks are an insufficient blueprint to establish consensus in software teams.

Mocks of data products are a specification of the application without its essential character: the true value of the information being presented. Mocks as blueprints make assumptions about complex data models that they have no reasonable basis for. When specifying lists of recommendations, mocks often mislead. When mocks specify full-blown interactions, they do more than that: they suppress reality and promote assumption. And yet we know that good design and user experience are about minimizing assumption. What are we to do?

The goal of agile product development is to identify the essential character of an application and to build that first, before continuing to add features. This imparts agility to the project, making it more likely to satisfy its real, essential requirements as they evolve. In data products, that essential character will surprise you. If it doesn’t, you are either doing it wrong, or your data isn’t very interesting. Information has context, and when that context is interactive, insight is not predictable.

Code Review and Pair Programming

In order to avoid systemic errors, it is essential that data scientists share their code with the rest of the team on a regular basis, so code review is important. Errors in parsing can easily hide systemic errors in algorithms. Pair programming, in which pairs of data hackers go over code line by line, checking its output and explaining the semantics, can help detect these errors.

Agile Environments: Engineering Productivity

Generalists require more uninterrupted concentration and quiet than do specialists. That is because the context of their work is broader, therefore their immersion is deeper. Their environment must suit this need.

Invest in 2-3x the space of a typical cube farm, or you are wasting your people. In this setup, some people don’t need desks, which drives costs down.

Rows of cubicles like cells of a hive. Overbooked conference rooms camped and decamped. Microsoft Outlook a modern punchcard. Monolithic insanity. A sea of cubes.

Deadlines interrupted by oscillating cacophonies of rumors shouted, spread like waves uninterrupted by naked desks. Headphone budgets. Not working, close together. Decibel induced telecommuting. The open plan.

Competing monstrosities seeking productivity but not finding it.

Poem by Author

Before very long, people get very confused that the process is the content. That’s ultimately the downfall of IBM. IBM has the best process people in the world. They just forgot about the content.

Steve Jobs

We can do better. We should do better. It costs more, but it is inexpensive.

In Agile Big Data, we recognize team members as creative workers, not office workers. We therefore structure our environment more like a studio than an office. At the same time, we recognize that employing advanced mathematics on data to build products requires quiet contemplation and intense focus, so we incorporate elements of the library as well.

Many enterprises limit their productivity enhancement of employees to the acquisition of skills. However, about 86% of productivity problems reside in the work environment of organizations. The work environment has effect on the performance of employees. The type of work environment in which employees operate determines the way in which such enterprises prosper.

Akinyele Samuel Taiwo

It costs much more to employ people than it does to maintain and operate a building; hence, spending money on improving the work environment is the most cost-effective way of improving productivity, because a small percentage increase in productivity of 0.1% to 2% can have dramatic effects on the profitability of the company.

Derek Clements-Croome and Li Baizhan

The Sane Workspace

Creative workers need three kinds of spaces to collaborate and build together. From open to closed, they are: collaboration space, personal space and private space.

Collaboration Space

Collaboration space is where ideas are hatched. Situated along main thoroughfares and between departments, collaborative spaces are bright, open, comfortable and inviting. They have no walls. They are flexible and reconfigurable. They are ever changing, always rearranged. Bean bags, pillows and comfortable chairs. Collaboration space is where you feel the energy of your company: laughter, big conversations, excited voices talking over one another. Invest in and showcase these areas. Real, not plastic, plants keep sound from carrying and they make air :)

Private Space

Private space is where deadlines get met. Enclosed and soundproof, private spaces are libraries. There is no talking. Private space minimizes distractions. Dim light, white noise. There are bean bags, couches and chairs, but ergonomics demand proper workstations too. Separated sit/stand desks with docking stations behind (bead) curtains with 30” LCDs.

Personal Space

Personal space is where people call home. In between collaboration and private space in its degree of openness, personal space should be personalized by each individual to suit his or her needs. Shared office or open desks, half or whole cube. Personal space should come with a menu and a budget. Themes and plant-life should be encouraged. For some people, this is where you spend most of your time. For others… given adequate collaborative and private space, a notebook and mobile device, some people don’t need personal space at all.

Above all, the goal of the Agile Environment is to create immersion in data through the physical environment: printouts, posters, books, whiteboard, etc.

Data wall collage
Figure 1-5. Data immersion through collage

Realizing Ideas with Large Format Printing

Easy access to large format printing is a requirement for the Agile Environment. The realization of visualization in material form encourages sharing, collage, expressiveness and creativity.

The HP DesignJet 111 is a 24” wide large format printer that costs less than $1,000. Continuous ink delivery systems are available for less than $100, bringing the operational cost of a 24” x 36” poster to less than one dollar.

At this price point, there is no excuse not to give a data team easy access to several large format printers for both plain-paper proofs and glossy prints. It is very easy to get people excited about data across departments when they can see concrete proof of the progress of the data science team.
