Live Online training

Network Troubleshooting Using the Half Split and OODA

Russ White

Troubleshooting is a fundamental skill for all network engineers, from the least to the most experienced. However, there is little material on correct and efficient troubleshooting techniques in a network engineering context, and no (apparent) live training in this area. Some book chapters exist (such as in Computer Networking Problems and Solutions, published in December 2017), as do some presentations at Cisco Live, but the level of coverage for this critical skill is far below what engineers working in the field need to develop solid troubleshooting skills.

This training focuses on one process, the half split, and one model, the Observe/Orient/Decide/Act (OODA) loop, to give engineers a solid set of mental tools for troubleshooting problems effectively. It considers the difference between the root cause and the immediate cause, and the concept of technical debt in terms of break/fix. It also covers some basic concepts of resilience, including the tradeoffs around redundancy and how they affect the Mean Time to Repair (MTTR).

What you'll learn and how you can apply it

In this live training, you will learn two basic processes or action models useful for troubleshooting computer networks at any scale. The first of these, the half split, has been used in electronic and radio frequency engineering for decades; it is one of the most useful and productive troubleshooting techniques for complex systems in real life. The second, the OODA loop, is often applied to security, but it applies equally well to troubleshooting (and to preparing to troubleshoot).

You can apply these techniques to real-world failures and outages, reducing the time required to find a solution, in turn reducing MTTR.

This training course is for you because...

  • You want to move from ad hoc styles of troubleshooting to more systematic styles
  • You want to have specific, actionable methods to use for troubleshooting network problems and to stage information to improve MTTR
  • You want to understand the relationship between redundancy and resilience better
  • You want to understand the relationship between technical debt, root causes, and problem repair better

Prerequisites

  • A basic understanding of network design and operation (perhaps at the network professional level)
  • A basic understanding of OSPF, IS-IS, BGP, and IP forwarding

Resources

Common Misunderstandings

  • Troubleshooting is best learned through experience alone; there are no processes or techniques that can help
  • Troubleshooting always leads to the root cause, and repairs always improve the overall stance of the system
  • Troubleshooting is almost always ad hoc
  • Finding the problem quickly is most often just luck or instinct

About your instructor

  • Russ White has more than twenty years' experience in designing, deploying, breaking, and troubleshooting large-scale networks. Over that time, he has co-authored more than forty software patents, spoken at venues throughout the world, participated in the development of several internet standards, helped develop the CCDE and the CCAr, and worked in Internet governance with the Internet Society. Russ is currently a member of the Architecture Team at LinkedIn, where he works on next-generation data center designs, complexity, security, and privacy. He is also currently on the Routing Area Directorate at the IETF, and co-chairs the IETF I2RS and BABEL working groups. His most recent works are The Art of Network Architecture, Navigating Network Complexity, Unintended Features, the Intermediate System to Intermediate System LiveLesson, and Computer Networking Problems and Solutions.

    MSIT Capella University, MACM Shepherds Theological Seminary, PhD in progress from Southeastern Theological Seminary

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Understanding MTBF, MTBM, MTTR, and the redundancy versus resilience tradeoff (50 minutes)

  • Redundancy as the traditional mechanism to add resilience to a network system
  • Why this works from the perspective of MTBF calculations
  • Why this doesn’t work from the perspective of MTBM, MTTR (through complexity), and grey failures
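The MTBF argument for redundancy, and the way a longer MTTR erodes it, can be sketched with the standard steady-state availability formula, MTBF / (MTBF + MTTR). This is only an illustration and not necessarily the calculation used in the class: it assumes independent failures, every number is hypothetical, and the "repairs take four times as long" case is an invented stand-in for added complexity. It does not model MTBM or grey failures.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independence.

    The group is down only when every copy is down at the same time.
    """
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical figures, chosen only for illustration.
    one_link = availability(mtbf_hours=1000.0, mttr_hours=2.0)
    two_links = parallel_availability(one_link, copies=2)

    # Same redundant pair, but assume the extra complexity makes each
    # repair take four times as long (an invented factor).
    slow_repair = parallel_availability(availability(1000.0, 8.0), copies=2)

    print(f"single link:              {one_link:.7f}")
    print(f"redundant pair:           {two_links:.7f}")
    print(f"redundant pair, slow fix: {slow_repair:.7f}")
```

Under these assumptions the second link improves availability on paper, while the longer repair time gives some of that gain back.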

10 Minute Break

Segment 2: Staging Troubleshooting: The OODA Loop (50 minutes)

  • An introduction to the OODA loop
  • How to improve observation for troubleshooting
  • How to improve orientation for troubleshooting
  • How to lay out premade decisions to counter failure

10 Minute Break

Segment 3: The Half Split Method (50 minutes)

  • Understanding the half split method
  • How the half split interacts with OODA
  • An example of half splitting to find a problem
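As a rough illustration of the idea behind the half split (not the worked example from the class), the sketch below treats a forwarding path as an ordered list of points and repeatedly tests the midpoint to decide which half contains the fault. It assumes a single fault, so every point before the fault passes the check and every point from the fault onward fails. The hop names and the `is_good_at` check are hypothetical; in practice the check might be a ping, a routing table lookup, or a packet capture.

```python
from typing import Callable, Optional, Sequence


def half_split(path: Sequence[str],
               is_good_at: Callable[[str], bool]) -> Optional[str]:
    """Return the first point on `path` where the check fails, or None.

    Assumes a single fault: every point before it passes the check and
    every point from it onward fails.
    """
    if not path or is_good_at(path[-1]):
        return None  # empty path, or the far end is healthy
    low, high = 0, len(path) - 1
    while low < high:
        mid = (low + high) // 2
        if is_good_at(path[mid]):
            low = mid + 1  # fault lies beyond the midpoint
        else:
            high = mid     # fault is at or before the midpoint
    return path[low]


if __name__ == "__main__":
    # Hypothetical path from a host to a server.
    hops = ["host", "access", "aggregation", "core", "border", "server"]
    broken_from = hops.index("core")  # pretend traffic dies at the core

    checks = []

    def probe(hop: str) -> bool:
        checks.append(hop)  # record each measurement taken
        return hops.index(hop) < broken_from

    print("first failing point:", half_split(hops, probe))
    print("points checked:", checks)
```

Each measurement discards about half of the remaining path, so the number of checks grows with the logarithm of the path length rather than with the path length itself.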

10 Minute Final Question and Answer Period