Preface

Does your company have IT incidents?

Have you ever had a high-severity IT incident that disrupted the service your customers rely on you to provide?

Did that incident damage your company’s reputation, erode trust with your customers or investors, or create adverse business or financial impact?

Was the incident response and resolution slow, unorganized, or poorly managed?

If you answered “Yes” to any of these questions, ask yourself one more question: are you happy with how your company currently responds to IT incidents? If the answer is “No,” then you must be looking for a better way, and this book was written for you.

We will show you a new way to think about responding to and resolving IT incidents. This new way works in small DevOps teams all the way up to the largest enterprise and service provider organizations. Collectively, we have more than 100 years of incident command experience leading special operations teams in busy urban fire departments on the east and west coasts and managing critical infrastructure throughout the United States. We also have specialized incident management experience in more than 40 countries at the highest levels of government and industry. In this book, we identify the best practices from outside the IT industry that have literally undergone trial by fire, and apply the same incident response methodology to the world of IT operations.

How and why did we make the jump from the fire service to IT? In the fall of 2012, at the suggestion of Jesse Robbins (see Foreword), the first person to adopt Incident Management System (IMS) tactics and strategies in IT, we met with a select group of IT professionals who were responsible for uptime at large social media companies and other large-scale web operations. The goal of the meeting was to discuss and better understand the challenges IT companies face when resolving IT incidents. These IT professionals wanted to look outside their industry for a better way because the models inside the IT industry were not working well. They wanted to hear how the fire service manages time-sensitive incidents in which the stakes are high, the decision-making environment is poor, the conditions are changing, and the outcomes are uncertain.

As the discussion progressed, the group recognized the striking similarities between urgent public safety incidents requiring a fire department response and urgent technology incidents requiring an IT response. In fact, the first IT people to respond to a service disruption or a cybersecurity attack are the “first responders,” just like the first firefighters on the scene of a fire. By the end of the meeting, we all agreed that there were significant opportunities to translate the best parts of the public safety IMS, especially as it pertains to leading people and managing time, directly to the IT industry. The principles are the same and directly applicable to the type of high severity/priority incidents experienced by the IT industry.

“A few years ago, we looked for a better way to manage incidents at Salesforce,” says Kwesi Ames, VP of Site Reliability Engineering at the company. “By adapting the principles of IMS, the same system used by firefighters, we saw a tremendous improvement in responding to critical technology incidents. It was a real game changer for us,” he emphasized. “It provided a disciplined template that allowed us to be consistently efficient in our response and handling of major issues in our production environment. This has resulted in faster recovery times and minimized loss of customer trust.”

IMS establishes the framework of incident response and the norms of behavior for the incident responders. High severity/priority events place the incident responders under critical time pressure to resolve IT incidents when customer trust, adverse financial impacts, and the company’s reputation are at stake. Using IMS increases your chances of a good outcome and of protecting the company’s business.

It is a fact that the future of computing promises more scale, more complexity, and certainly more change—all at greater and greater speed. It’s also true that the odds increase every day that your organization will have a major technology incident, created internally or externally.

Without a predictable way to respond to incidents, any organization—growing or mature—is at risk. Torsten Rueter, global platform services leader at GE Capital, looks at it this way: “Frameworks such as ITIL have helped mature large-scale IT operations. The Incident Management System addresses some of the more subtle—yet extremely important—aspects of managing incidents. Without strong leadership, collaboration, and shared working patterns, an Incident Response Team won’t live up to its full potential.”

Building on Jesse Robbins’s work, we adapted the Incident Management System (IMS) from public safety to corporate IT environments. We incorporate IMS into ITIL, DevOps, Agile, and Lean practices. We collaborate with customers to build a culture of incident response.

We bring a unique viewpoint to the IT industry. Collectively, we bring over 100 years of fire service and critical infrastructure experience to IT incident management. We blend our deep global experience in fire, hazardous materials (HazMat), weapons of mass destruction (WMD), and counterterrorism incident response with fiber networks, data centers, oil and gas, power, and capital markets to improve incident management performance.

Our customers generate $400 billion of revenue and create $1 trillion of market cap, while employing over 850,000 people around the world. These companies rank in the top 10% of the Fortune 500 and PwC Global 100 Software Leaders, operating globally in the industrial, financial services, consumer products, telecommunications, and software sectors, serving markets in North America, Europe, the Middle East, Africa, Asia, and the Pacific.

Assessment, training, evaluation, and exercises are the best predictors of future performance. We’ve delivered our incident management programs across three continents and into nine countries to thousands of Incident Commanders (ICs), subject matter experts (SMEs), executives, and corporate staff. They work on site reliability, cybersecurity, mission-critical support, unified command, enterprise IT, operations, R&D and engineering/technology (network, database, SAN/storage, server, automation, applications), legal, crisis communications, and executive management teams. Those teams work in global command centers, emergency operations centers, regional operations centers, war rooms, and board rooms at some of the biggest companies running the largest technology stacks in the world. We maximize uptime during high-severity IT incidents.

This book will show you a best practice in incident management for the IT industry, even though it comes from outside the IT industry. We’ll offer not just a better way to think about incident response but a look inside the battle-tested techniques of IMS used in other IT organizations and throughout the United States fire service to organize, lead, and resolve life-or-death situations. We will offer the same thoughts, perspectives, and advice that helped a Fortune 50 financial services company reduce its Mean Time To Repair (MTTR) by 35%. We’ll also discuss use cases representative of some of the largest (and some of the smallest) IT operations teams in the world.

 

There is a better way to respond to incidents, and you just found it.

 

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

To comment or ask technical questions about this book, send email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://shop.oreilly.com/product/0636920036159.do

Acknowledgments

The authors would like to thank the staff at O’Reilly Media for all their support and assistance, especially Brian Anderson, who guided us through the process. In addition, the authors would like to thank Jesse Robbins, who had the foresight and vision to bring the IMS concepts to the IT world. In addition to being a wonderful friend, he has provided a tremendous amount of advice and counsel to each of us, for which we are very grateful. Another person who was invaluable in the development of this text was Andrea Walter, who did our initial editing and formatting. She had a challenging task in wrangling our thoughts into this text. Tom Welch, a Fire Chief from the San Francisco Bay area, weighed in on the text, providing insights and comments on the content from a fire service perspective. We would also like to thank our reviewers, Jason Hand and John Allspaw, who provided guidance on enhancing this text. Ashley and Amber Vidal provided edits and honest feedback, which helped us streamline our points and explain them in plain English. We must also acknowledge our customers and the thousands of IT responders we have trained and interacted with around the world. They are contributors to this book as much as we are, in that their experiences, challenges, and successes are brought to you in its pages.
