Preface

As applications grow, two things begin to happen: they become significantly more complicated (and hence more brittle), and they handle significantly larger traffic volume (which requires more novel and complex mechanisms to manage). This can lead to a death spiral for an application, with users experiencing brownouts, blackouts, and other quality-of-service and availability problems.

But your customers don’t care. They just want to use your application to do the job they expect it to do. If your application is down, slow, or inconsistent, customers will simply abandon it and seek out competitors that can handle their business.

This book helps you avoid the aforementioned death spiral by teaching you basic techniques that you can utilize to build and manage your large-scale applications. Once you’ve mastered these skills, your applications will be able to reliably handle huge quantities of traffic as well as huge variability in traffic without affecting the quality your customers expect.

Who Should Read This Book

This book is intended for software engineers, architects, engineering managers, and directors who build and operate large-scale applications and systems. If you manage software developers, system reliability engineers, or DevOps engineers, or you run an organization that contains large-scale applications and systems, the suggestions and guidance provided in this book will help you make your applications run smoother and more reliably.

If your application started small and has seen incredible growth (and is now suffering from some of the growing pains associated with that growth), you might be suffering from reduced reliability and reduced availability. If you struggle with managing technical debt and associated application failures, this book will provide guidance in reducing that technical debt to make your application able to handle larger scale more easily.

Why I Wrote This Book

After spending many years at Amazon building highly scaled applications in both the retail and the Amazon Web Services (AWS) worlds, I moved to New Relic, which was in the midst of hyper growth. The company felt the pain of needing the systems and processes required to manage highly scaled applications, but hadn’t yet fully developed the processes and disciplines to scale its application.

At New Relic, I saw firsthand the struggles of a company going through the process of trying to scale, and realized that there were many other companies experiencing the same struggles every day.

My intent with this book is to help others working in these hyper-growth applications learn processes and best practices that can assist them in avoiding the pitfalls awaiting them as they scale.

Whether your application is growing tenfold or just 10 percent each year, whether the growth is in number of users, number of transactions, amount of data stored, or code complexity, this book can help you build and maintain your application to handle that growth, while maintaining a high level of availability.

A Word on Scale Today

Cloud-based services are growing and expanding at extremely high speeds. Software as a Service (SaaS) is becoming the norm for application development, primarily because of the need for providing these cloud-based services. SaaS applications are particularly sensitive to scaling issues due to their multitenant nature.

As our world changes and we focus more and more on SaaS services, cloud-based services, and high-volume applications, scaling becomes increasingly important. There does not seem to be an end in sight to the size and complexity to which our cloud applications can grow.

The very mechanisms that are state of the art today for managing scale will be nothing more than basic tenets tomorrow, and the solutions to tomorrow’s scaling issues will make today’s solutions look simplistic and minimalistic. Our industry will demand more and more complex systems and architectures to handle the scale of tomorrow.

The intent with this book is to provide content that stands the test of time.

Navigating This Book

Managing scale is not only about managing traffic volume—it also involves managing risk and availability. Often, all these things are different ways of describing the same problem, and they all go hand in hand. Thus, to properly discuss scale, we must also consider availability, risk management, and modern architecture paradigms such as microservices and cloud computing.

As such, this book is organized as follows:

Part I, “Availability”

Availability and availability management are often the first areas that are affected when an application begins to scale.

Chapter 1, What Is Availability?

To begin, we’ll establish what high availability means and how it differs from reliability.

Chapter 2, Five Focuses to Improve Application Availability

In this chapter, I provide five core areas to focus on in building your application in order to improve its availability.

Chapter 3, Measuring Availability

This chapter describes a standard algorithm for measuring availability and further explores the meaning of high availability.

Chapter 4, Improving Your Availability When It Slips

If your application is suffering from availability problems (or you want to make sure it never does), we provide some organization-level steps you can take to help you improve your application’s availability.

Part II, “Risk Management”

Understanding risk in your system is essential to improving availability as well as enhancing your application’s ability to scale to the high levels needed today and in the future.

Chapter 5, What Is Risk Management?

This chapter opens the topic of managing risk with highly scaled applications by outlining the basics of what risk management is all about.

Chapter 6, Likelihood Versus Severity

This chapter discusses the difference between the severity of a risk and the likelihood of it occurring. Both are important, but in different ways.

Chapter 7, The Risk Matrix

In this chapter, I present a system designed for helping you understand and manage the risk within your application.

Chapter 8, Risk Mitigation

This chapter discusses how to take known risks within your system and reduce the impact they have on your application.

Chapter 9, Game Days

This chapter looks at ongoing testing and evaluation of your risk-management plans, mitigation plans, and disaster plans. It reviews the techniques for doing this in production environments and the advantage of doing so.

Chapter 10, Building Systems with Reduced Risk

In this chapter, I give suggestions on how to reduce risk within your applications and build applications with lower risk.

Part III, “Services and Microservices”

Services and microservices are an architecture strategy for building larger and more complicated applications that need to operate at higher scale.

Chapter 11, Why Use Services?

This chapter explores why services are important to building highly scalable applications.

Chapter 12, Using Microservices

Here, I provide an introduction on creating microservice-based architectures, focusing on sizing of services and determining where service boundaries should be created in order to improve scaling and availability.

Chapter 13, Dealing with Service Failures

In the final chapter of this part, we’ll discuss how to build services to handle failures.

Part IV, “Scaling Applications”

Scaling is not just about traffic; it’s about your organization and how it responds to larger application needs.

Chapter 14, Two Mistakes High

This chapter describes how to scale your system to maintain high availability, even in light of other failures.

Chapter 15, Service Ownership

This chapter looks at how paying attention to ownership of services can help your organization and application scale.

Chapter 16, Service Tiers

This chapter describes a way of labeling the criticalness of your services that helps manage service expectations.

Chapter 17, Using Service Tiers

After defining service tiers, we put them to use to help manage the impact of service failures, responsiveness requirements, and expectation management.

Chapter 18, Service-Level Agreements

In this chapter, we’ll discuss using SLAs as a way of managing interdependence between service owners.

Chapter 19, Continuous Improvement

This chapter provides techniques and guidelines for how to improve the overall scalability of your application.

Part V, “Cloud Services”

Cloud-based services are becoming increasingly important in building and managing large, critical applications with significant scaling requirements.

Chapter 20, Change and the Cloud

This chapter explores the ways cloud computing has changed how we think about building highly scaled web applications.

Chapter 21, Distributing the Cloud

This chapter outlines how to effectively use regions and availability zones to improve availability and scale.

Chapter 22, Managed Infrastructure

This chapter describes how you can use managed services such as RDS, SQS, SNS, and SES to scale your application and reduce management load.

Chapter 23, Cloud Resource Allocation

Here, we discuss how cloud resources are allocated and the implications of different allocation techniques on your application’s scalability.

Chapter 24, Scalable Computing Options

This chapter looks at highly scalable programming models such as AWS Lambda, which you can use to improve scaling, availability, and application manageability.

Chapter 25, AWS Lambda

The final chapter in this part provides a more in-depth exploration of AWS Lambda, a technology that offers extremely high scalability options for events with simple computational requirements.

Part VI, “Conclusion”

Chapter 26, Putting It All Together

This chapter pulls together the major topics from each of the previous sections into a simple summary, which can be read as a reminder of what was covered in each chapter.

Online Resources

The Architecting for Scale website offers additional information about this book, including links to the supplementary material. You can find more information about me on my website, and you can also follow my blog.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This icon signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This icon indicates a warning or caution.

Safari® Books Online

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals.

Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/architecting-for-scale.

To comment or ask technical questions about this book, send email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

While there are more people who helped make this book possible than I could possibly ever list here, I do want to mention several people specifically who were particularly helpful to me:

  • Bjorn Freeman-Benson, who supported me significantly in the early stages of developing this book, and who gave me opportunities at New Relic that helped provide me the insights I needed for this book.

  • Kevin McGuire, who has been a friend and confidant. We started at New Relic together, and it was his foresight and imagination that have helped give my career the needed focus and direction that guides me today.

  • Natasha Litt, who has been a good friend and provided much encouragement and support.

  • Jade Rubick, who, with his constant smile and positive outlook, has provided me well-reasoned advice and guidance. What a great friend to have.

  • Jim Gochee, who introduced me to the magic that was New Relic, both as a product and eventually as a career.

  • Lew Cirne, whose vision has given us New Relic, and me a career and a home. The joy and driven enthusiasm you get after meeting with Lew one on one is highly infectious and hugely empowering. No wonder New Relic is so successful.

  • Abner Germanow, Jay Fry, Bharath Gowda, and Robson Grieve, who took a chance on me and fought to get me my current role at New Relic. Who says you can’t take a square peg and put it into a round hole? And have it actually fit! This is, by far, the most fun, rewarding, and personally fulfilling role I have ever had.

  • Mikey Butler, Nic Benders, Matthew Flaming, and the rest of the New Relic engineering leadership, for all of their support over the years.

  • Kurt Kufeld, who mentored me and helped me fit into the weird, chaotic, challenging, draining, and ultimately hugely rewarding work environment at Amazon.

  • Greg Hart, Scott Green, Patrick Franklin, Suresh Kumar, Colin Bodell, and Andy Jassy, who gave me opportunities at Amazon and AWS I could not have ever imagined.

  • Brian Anderson, my editor, who took a chance on me in writing this book, and helped me every step of the way.

I would like to give a very special acknowledgment to Abner Germanow and Bjorn Freeman-Benson. They both made it possible for me to work on this book. This book would not, could not, have happened without their support. For that, I will always be grateful.

To my family, and especially my wife Beth, who is my constant light and guide through this life we have together. My days are brighter, and my path is clearer, because she is with me.

To all these people, and all the people I did not mention, my heartfelt thank you.

I can’t end without also mentioning the furry ones: Izzy, the snoring spaniel, and Abby, the joyful corgi. And finally, Budha, the krazy kitty, who contributed more than his share of typos to this book.
