Chapter 1. What Is Incident Response?

All too often, when organizations develop information security programs, they treat security as a simple “check-box” on the list of required corporate functions. After giving security its due attention (oftentimes very little), senior executives happily and honestly check the box indicating they’ve somehow dealt with IT security and move on to the next issue. Many of these organizations assume that once the security program is established (the box is checked, remember), complete security is assured and that, like a painting hung on the wall, the program requires little further attention once in place. Nothing could be further from the truth.

For one thing, there is no such thing as total security. Good security controls keep honest people honest and make it so challenging for an adversary to get around them that the adversary gives up and moves on to an easier target. Security supports business operations by ensuring the uptime and efficiency of the mission-critical systems a business relies on to generate revenue and profit. From that perspective, security is as critical to business operations as the reliability and stability of the company’s networks, servers, and phone lines. But what happens when something unexpected occurs, or someone gets around the established security controls in a way that threatens business operations -- and, with them, revenue and profitability?

For years, mature IT organizations have recognized the advantages of keeping data centers running effectively and efficiently. Yet far too many otherwise mature IT organizations fail to adequately address incident response (or IT security in general). Some view incident planning as a self-fulfilling prophecy that ensures something will go wrong. Those same organizations would not dare build a data center without adequate fire protection or burglar alarms, however -- not because they expect a fire to occur, but because they want to be able to respond quickly to one, contain the damage, and get back to doing business. While most companies today invest some effort and funding in security protection mechanisms for their IT infrastructures, very little attention is paid to planning how to handle information security incidents.

Effective incident response planning is similar in some ways to effective fire planning, and incident response operations are very similar to firefighting: both disciplines strive to prevent, detect, contain, extinguish, and investigate bad things that may or do happen. The goal of incident response, just like the goal of fire response, is to minimize the impact of an incident to a company and allow it to get back to work as quickly as possible. And, much like a fire response system, an incident response system should be constantly vigilant and able to respond instantly.

While an enterprise information security program looks to mitigate risks by implementing security and process controls, an incident response program is necessary to conduct post-event crisis management activities when something manages to get around the security controls.

This book will help readers of all technical skills and management levels learn the importance of an incident response program, how to implement a reliable incident response capability, and where to go for help and assistance if necessary.

Real-Life Incidents

Contrary to public perception, not all incidents involve dramatic dollar losses or make front-page news in sensational stories of computer terrorists wreaking havoc around the cyber world. Most incidents barely rate a passing glance from even the most investigative reporter, and are often rather mundane and uninteresting to people outside the affected area or company. As this book is not sensational and makes no unrealistic claims of gloom and doom, let’s look at some typical situations that incident responders deal with on an almost-daily basis.

One case occurred a few years ago at a major university’s (University X) primary computer lab. Apparently out of the blue, the 25-plus Unix workstations in the lab started crashing one by one in rapid succession, until each monitor displayed a single message: “Kernel panic, core dump.” Fortunately, these high-end Unix systems recognized that something had gone wrong and began to reset and reboot themselves to recover.

The issue was resolved and written off as a computer “hiccup” until thirty minutes later, when the exact same thing happened again. The “Kernel panic, core dump” message appeared, and the staff reset and rebooted the computers. Suspecting wrongdoing, one of the more vigilant system administrators in the lab placed a network sniffer (a diagnostic device that records network traffic) on the network and began to watch for suspicious activity.

Thirty minutes went by and the sequence occurred again. This time, however, the episode was captured on the sniffer, much as a television program is recorded on a VCR. A quick examination of the recorded events showed that each of the Unix computers had received an SMTP (Simple Mail Transfer Protocol) packet from another well-known university (University Y) immediately before crashing. The system administrator then placed a phone call to the Carnegie Mellon CERT (Computer Emergency Response Team) to report a possible security incident. CERT is now called the CERT Coordination Center, or CERT/CC.

The pager of the on-duty CERT/CC member went off a few minutes later. To get a first-hand picture of what was happening, he called the system administrator in the University X lab. After spending several minutes on the phone trying to figure out what was happening to the Unix computers, they drew up a laundry list of things to do. One of the first priorities was to track down where the malicious SMTP packets were originating (University Y), and to contact the manufacturers of the two universities’ computers to see if there were any unknown features in the hardware or operating systems being used.

Tip

CERT/CC members are always on call to handle incidents, but to avoid team member burnout, they rotate a “first alert” pager among the team as the initial point of contact for security incidents.

After a few such phone calls, it turned out that University Y was running a new IBM mainframe version of Unix known as AIX 370, and had recently upgraded its TCP/IP networking code. University X was running its computing lab on Digital Equipment Corporation’s (DEC) Ultrix 3.1 operating system and DECstation 3100 computers. Although somewhat dated at the time of the incident, the Ultrix 3.1 operating system was still in wide use in the university community.

After several more phone calls with the University X system administrator, the CERT/CC staffer discovered that the problem was, in fact, two problems. The new TCP/IP networking code on University Y’s IBM mainframe contained a minor bug that, when combined with a minor bug in the Ultrix 3.1 operating system, caused the index of a network memory array to become a negative number (one should never, ever, use a signed integer as an array index!). Having that information made it easy to fix the vulnerability in the University X computer systems.

What lessons can be learned from this episode?

For one thing, although University X’s Information Technology (IT) staff reported the incident to CERT/CC, they could probably have handled the situation entirely themselves, given their ample and technically competent IT staff. By contacting CERT/CC, however, both parties were able to learn about this bizarre and previously unknown interaction between an IBM mainframe and DEC workstations. This was good news for the CERT/CC: it could document the new incompatibility so that future incidents would be handled much more quickly. At that time -- around 1991 -- this insight was quite helpful, because DECstations were commonly found in universities, and AIX 370 installations (and similar problem reports) were beginning to pop up more frequently. Although system administrators may be able to handle a perceived problem or attack on their own, in many cases they play it safe and place a call for help to the CERT/CC. A call to an objective and knowledgeable third party may shed light on a new security issue (in which case both parties -- CERT/CC and the caller -- learn something new), or the CERT/CC may be able to provide guidance on a problem that is new to the caller but familiar to CERT/CC.

Secondly, the incident helped the CERT/CC in ways the two universities were probably not aware of; it helped CERT/CC locate critical engineering staff at both IBM and DEC to add to its database of vendor contacts. CERT/CC uses these contacts to work with system vendors when vulnerabilities or actual incidents are identified and reported.

Thirdly, this incident clearly illustrated how two computer systems can interact in a way that has every appearance of being an actual security-related event. It is all too easy to jump to incorrect conclusions in the world of computer security. Once the pagers start chirping and the adrenaline gets pumping, it’s critical to keep a level head. Above all, incident responders must never, ever assume!

This incident was a rather innocuous event in the grand scheme of computer incident response. Let’s now look at a major corporation whose networks are filled with intellectual property worth millions, if not billions, of dollars and what can happen when a devious IT staffer decides to do his own thing on corporate resources.

A system administrator at Corporation A discovered that one of his coworkers had set up an FTP (File Transfer Protocol) site on the company’s own network and appeared to be using it to exchange illegally copied commercial software (warez) with Internet users. It was a Saturday afternoon when a member of the company’s IT staff, walking past a colleague’s cubicle, noticed activity on the screen of the computer inside. Looking closer, he realized he was watching an FTP session in progress, and that the files being transferred appeared to be commercial software products. Finding this a little suspicious, he contacted the immediate manager of the employee whose cubicle it was.

Later that day, the manager confronted the staff member who had set up the FTP site. The staff member admitted setting it up, but claimed that it enabled him “to support the company’s staff more effectively.” His manager refused to believe him and informed him that he was being terminated for violating the company’s Acceptable Use of Information Resources Policy. The surprised employee was told to collect his personal effects and leave the premises immediately. Over the next 20-30 minutes, the employee went back to his cubicle, put all of his belongings into a box, and left the building without saying another word.

Several hours later, when senior IT staff found out what had happened, they called in a commercial incident response team to determine the extent of the potential damage done by the employee -- in particular, to find out what else he might have done. That same evening, the commercial team arrived at the client site and began the damage assessment. Fortunately for the company and the response team, the now-terminated employee had not deleted any of the evidence of his activities (such as log files, caches, and the archives of commercial software on his FTP server), so the team was able to quickly find and verify everything he had done on the network. After some hours spent verifying and cross-checking log files, the team concluded that the FTP server was indeed the employee’s only misuse of the company’s network. However, the team also discovered that the FTP site had been active for several days and had been visited hundreds of times by third parties, who illegally downloaded commercial software via the Internet. Had this incident come to light, Corporation A could have been held legally liable for trafficking in pirated software, and might have had to pay a hefty sum to the makers of the software found on the FTP site.

One of the major lessons to be learned from this episode is that there was a lack of focus in resolving the situation, and a general lack of awareness of company policies and procedures for responding to it. For example, after being informed that he was terminated, the employee was left alone in his cubicle for a considerable period of time. During that time, he could easily have removed all traces of his activities and made it much harder to determine what he had been doing. More dangerously -- since he was an employee of the company’s IT department with significant network access privileges -- he could have deleted company files or, worse yet, left back doors in the network to use at a later date from an off-site location. Furthermore, the employee was asked to cooperate by describing his actions only after he had been terminated, when he had nothing to lose by refusing to cooperate or answer questions truthfully. Such “shoot first and ask questions later” approaches -- kill the symptom now, investigate the underlying causes later -- rarely produce the quality of answers one would want. Unfortunately, these mistakes are all too common in today’s organizations.

A third example of an incident involves a financial services company that signed a contract with a commercial IT security contracting firm to provide general information protection services, including around-the-clock on-call response to potential IT security crises. As part of that service, the contractor kept a list of key people and their work, pager, and home phone numbers so that it could rapidly reach any of the company’s critical security personnel. The contractor had also already spent time with the client learning its unique IT architecture, and was familiar with all of the client’s critical business systems.

One of the client’s critical business systems was a World Wide Web site used for electronic commerce. The web site was protected by a commercial firewall product against typical security threats, such as ping sweeps, inbound FTP, and attacks on internal Windows machines, among many others.

For a couple of months, the firewall had been reporting suspicious activity on the network. The company’s in-house IT staff reviewed the collected firewall log data but was unable to determine the cause of the warnings. The IT staff then contacted their security contractor and sent the firewall logs over for analysis. After reviewing them, the contractor was likewise unable to discover the cause of the problem. The client decided to continue monitoring the situation and wait and see.

On a particularly busy Friday afternoon, the company requested on-site assistance from the contractor, which sent two of its experienced incident response staff members to the client’s data center. They arrived on Saturday morning and were back home by Sunday evening, having diagnosed the problem. Working side by side with the company’s on-site IT staff, they discovered -- and, more importantly, confirmed -- that the cause of the problem was not a cracker, as the client had assumed, but a bug in the commercial firewall. The contractor verified the bug with the firewall vendor and recommended that the client upgrade its firewall software to a version that did not contain the bug.

Within a couple of days, the contractor was able to pass along the technical details of the incident, such as the specifics of the firewall bug, to a group of other incident response experts, enabling others to benefit from the knowledge and recognize the symptoms in similar situations.

Tip

It should be noted that at this time, incident response was in its infancy: incident response “experts” were a loose group of academic system administrators who took an interest in security on top of their regular duties.

As a result of this proactive and objective response, the total downtime for the client’s e-commerce web site was zero, and the cost of the contractor’s support in resolving the problem was minimal compared to the uninterrupted revenue stream produced by the site. Most importantly, confidence in the web commerce system was restored without any customer ever knowing there had been a potential problem. When it comes to IT security incidents, silence can be golden to a company!

What makes these stories any different from other situations where companies or organizations are faced with security events? What lessons are there to be learned? This book describes the process and the discipline of incident response and analyzes what could have been done differently in these and other situations. You should come away more knowledgeable about the ways of incident response and better prepared to handle incidents in your own organization.
