18 Months to Build Software that Absolutely Has to Work for 4 Days

With the team’s focus so much on producing the software, adding features, and engineering for scale, a culture and way of working formed organically that generally functioned but had a couple of major flaws. We found ourselves shipping certain products (like the core of our API) and ignoring others (like the underpinnings for what would become our targeted Facebook sharing). This was happening without a grander view of what was important and with a focus on the things that we as engineers found compelling, at the expense of other important products that needed to be built. In an effort to address these flaws while maintaining our agility and overall productivity, we formed small teams that each focused on a single workstream.

As they were asked to do, these teams put their focus on servicing the needs of their workstream and only their workstream. Dividing the labor in this way ensured that we were not sacrificing functionality that would be incredibly important to one department in order to go deeper on other functionality for another department. In other words, it allowed us to service more widely rather than incredibly deeply.

The unfortunate flip side of this division of labor was an attitudinal shift away from everyone working together toward a single goal and toward servicing each workstream in a vacuum. This shift manifested itself in increased pain and frustration in integration, decreased inter-team communication, and increased team fracturing. Having that level of fracturing when we were metaphorically attempting to rebuild an airplane mid-flight was a serious cause for concern. The fracturing grew over time, and after about a year it forced us to question the decision to divide and conquer and left us searching for ways to help the various teams work together more smoothly.

A fortuitous dinner with Harper Reed (the Chief Technology Officer at OFA) and a friend (Marc Hedlund) led to a discussion about the unique problems a campaign faces (most startups aren’t dealing with arcane election laws) and the more common issues that you’d find in any engineering team.

While discussing team fracturing, insanely long hours, and focusing on the right thing, Marc suggested organizing a “game day” as a way to bring the team together. It would give the individual teams a much-needed shared focus and also allow everyone to have a bit of fun.

This plan struck an immediate chord. When Harper and I had worked at Threadless, large sales were planned each quarter. The time between sales was spent refactoring and adding new features. A sale would give the team a laser focus on the things that were the most important, in this order: keep the servers up, take money, and let people find things to buy. Having that hard deadline with a defined desired outcome always helped the engineers put their specific tasks in the context of the greater goal and also helped the business stakeholders prioritize functionality and define their core needs.

It also dovetailed nicely with the impending election. We were about two months out from the ultimate test of the functionality and infrastructure we had been building for all of these months. Over that time we had some scares. A couple of unplanned incidents showed us some of the limitations of our systems. The engineers had been diligent at addressing these failures by incorporating the failures into our unit tests and making small architectural changes to eliminate these points of failure. However, we knew that we had only seen the tip of the iceberg in terms of the kinds of scale and punishment our systems and applications would encounter.

We knew that we needed to be better prepared: to know not only what could fail but also what that failure would look like, what to do when it happened, and whether we as a team could deal with it.

If we could do this game day right, we could touch and improve on a bunch of these issues and at the very least have a ton of fun doing it. With that in mind, we set out to do the game day as soon as possible. By this point we were about six weeks before election day.
