“Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.”
H.L. Mencken, "The Divine Afflatus," New York Evening Mail (16 November 1917)
“Your product is broken!”
Sometime in 2007 or 2008, a customer called with bad news. The company’s employees used our product to sign on to almost all the applications they used to do their jobs. Attempts to reach the sign-on page returned an error message. An unknown failure had locked employees out of every tool. In industry parlance, they were experiencing a P0 issue–a total outage, with no functionality working and no clear cause.
Our customer success engineers had so far failed to diagnose the issue. They struggled to get timely information from the customer–especially detailed log data from hundreds of distributed servers. The customer’s support engineers appeared panicked and lacked a troubleshooting methodology.
The customer, short on patience, demanded a product manager join an open conference call where the two teams worked together to figure out the problem. My phone rang half a dozen times, each caller demanding I drop everything I was doing and immediately join the call. I grabbed a member of my product team, and we jumped on.
Product manager as punching bag
The executive leading the customer’s efforts was irate. I waited while he screamed like a banshee for ten minutes or so. He was having a bad day and needed to vent. When he finished, my product manager and I started asking questions. Did we have error logs from the product server and the web servers where all the distributed agents were running? Was the behavior the same for all of the distributed agents? What had changed in the environment before the outage? Was anyone from the customer’s infrastructure or networking teams available to help us troubleshoot?
The clock ticked while we tested several hypotheses. We uncovered that every agent had started refusing connections at the same time–at midnight the night before the outage. Finally, my product manager asked whether anyone had checked the agents’ digital certificates–the things used to secure the connections between the server and the agents. We quickly learned the certificates on all of the agents had expired. At midnight. The night before.
We verified that updating the agents’ certificates restored the product to normal operation. There was no product bug, no code to fix. Even though the product relied on the certificates to communicate securely, certificate handling was outside our product’s control. The product documentation recommended regularly rotating the certificates, and we reiterated that recommendation before dropping off the call. The customer executive was angrily contrite (if that is even a thing).
Root Cause Analysis
Feeling pretty beat up, we followed the incident with a half-hearted root cause analysis exercise. We promptly vindicated ourselves, indicted the customer for their lack of insight and process, and added a couple of low-priority features to have the agent warn the server upon imminent certificate expiration and improve the information collected in our logs.
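For what it’s worth, the expiry-warning feature we deprioritized doesn’t take much code. Here is a minimal sketch in Python using the standard library’s `ssl.cert_time_to_seconds` to parse the `notAfter` date embedded in a certificate; the 30-day warning window and the status labels are my illustrative assumptions, not anything our product actually shipped:

```python
import ssl

# Assumed policy: flag certificates expiring within 30 days.
WARN_WINDOW_SECONDS = 30 * 24 * 60 * 60

def expiry_status(not_after: str, now: float) -> str:
    """Classify a certificate's notAfter field (OpenSSL text format,
    e.g. 'Jun 26 21:41:46 2024 GMT') relative to 'now' in epoch seconds."""
    expires_at = ssl.cert_time_to_seconds(not_after)  # parse embedded date
    if expires_at <= now:
        return "expired"
    if expires_at - now <= WARN_WINDOW_SECONDS:
        return "expiring-soon"
    return "ok"
```

An agent running a check like this on a schedule could have reported “expiring-soon” weeks before the midnight cutoff, instead of leaving an opaque connection failure in the logs.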
The customer’s Six Sigma Black Belt conducted a more thorough root cause analysis using (I found out later) the Five Whys method. Their analysis concluded that our product needed to include a full-blown certificate management system. I decided the conclusion was ludicrous. Certificate management was its own market category occupied by mature vendors with sophisticated products developed over many years.
During the incident’s post-mortem, I explained our position on building certificate management into the product while the customer’s executive excoriated us through gritted teeth. Years later, I ran into the executive at an industry conference, and he was still mad at me.
The Five Whys
The Five Whys is a root cause analysis tool attributed to Sakichi Toyoda and made central to the Toyota Production System (TPS) by Taiichi Ohno. The software industry adopted and adapted lean practices, and if you use agile methodologies, chances are you’ve used a Five Whys when something has gone wrong. Five Whys is more or less the gold standard for RCAs in agile software development.
I’ve found limited success using the Five Whys method. I assumed I was somehow doing it wrong: maybe I didn’t prepare properly, didn’t invite the right people, didn’t frame the problem well, didn’t refine the problem statement with the team, or didn’t guide the team through the Why sequence to produce quality insights.
Doing more research, I found critics pointed out structural issues with the method. So, maybe it wasn’t me?
The part that has always worked well is gathering people for the ceremony. Taking sixty minutes to think deliberately about a problem and how to solve it, using the full attention of stakeholders who’ve experienced pain because of the problem, creates focus and a sense of urgency. The ceremony encourages accountability.
Having the ceremony is good, but executing it to an outcome is harder than it looks. For me, the two biggest challenges have been framing the problem and knowing when your Five Whys is yielding inaccurate answers.
Framing the problem
Framing the problem is difficult, and two factors complicate developing an accurate and compelling problem statement.
The first is what I call the ‘interested parties’ problem. I’ve found it hard to know who should and shouldn’t participate. I tend toward openness and candor and leave the invitation wider than perhaps I should. Unfortunately, in larger groups, people succumb to the conformity pressure demonstrated in the Asch experiments: they tend to adopt the majority view of the group. The willingness to speak hard truths diminishes as the audience grows.
Narrowing the group doesn’t always work either, because you lose the specialized knowledge you need to get accurate answers. After numerous experiments, I’m still not satisfied that I’ve hit the right balance. Something always feels out of whack.
I’ve also observed a tendency to frame the problem toward whichever root cause sits furthest from your own team. In our case, we subconsciously framed the problem so our analysis would find the customer was the root cause. Everyone directly involved with the troubleshooting and resolution left the call thinking the same thing: “That guy is a jerk, and his team doesn’t know what they are doing.”
People who were not directly involved but had an incentive to protect our reputation quickly conformed to that general view. There were issues with our product that, once resolved, could have helped the customer overcome their lack of visibility and process. We ran an intellectually dishonest exercise and got the outcome that showed us in the best light. This cognitive bias might be the most common reason Five Whys appear to succeed when they’ve actually failed.
The second issue is refining the problem statement. The most common practice is to present an initial problem statement, have each participant brainstorm their own, and then refine or replace the initial statement based on the team’s feedback. However, the whole framing process is subject to bias. Most of the time when I run a Five Whys, the participants frame the problem from their unique perspectives, often contradicting one another. The team then has to do its best to decide which problem statement to choose–there isn’t room for more than one in this model. Of course, the initial framing determines the possible answers to the subsequent Why questions. Each problem statement could take the team down completely orthogonal paths.
In our outage example, we could have framed the problem in several different ways.
“Employees couldn’t log into their applications.”
“Communication between product components failed.”
“Certificates expired without warning.”
Answering Why
The premise of Five Whys is that asking Why five times in sequence magically leads you to the root cause. Each problem statement above sets the bounds for possible answers to the Why questions. By the time you are five levels deep, you may find you’ve solved A problem, but not necessarily THE problem. During my experiments, I haven’t figured out a consistent way to tell when that is happening.
Each statement frames a legitimate problem. The problems are interrelated and contributed to the outage. The root cause for certificates expiring without warning (the expiration date is embedded in the certificate) is totally different from the root cause for components failing to communicate (the product only supports a single configuration controlling component interactions).
A Five Whys exercise artificially oversimplifies a complex problem.
In our outage example above, that is precisely what happened. Our problem statement framed the analysis to reach the conclusion that the customer was to blame–the root cause–and the solution was out of our hands. They needed to improve their certificate handling process. Tough shit for them, eh?
Their RCA reached a very different conclusion. The product failed because it depended on the certificates and failed to proactively warn them of the impending total outage. In fairness, the logging in our product kind of sucked and didn’t indicate that the communication failure was caused by an expired cert. We eventually improved that logging, and that error jumped out every subsequent time we saw the problem recur.
Every enterprise software product relies on other infrastructure totally out of its control. Yes, your product might be communicating over the network. Still, it’s unreasonable to expect every software product to somehow anticipate a network outage or a misconfiguration that disrupts that product’s operation. The same applies to data storage and server infrastructure.
You can write only so much code to manage these kinds of external dependencies. And, because they are non-functional, this is the first place you cut when your roadmap starts constricting around your dwindling capacity. It’s not right or wrong; it’s just what happens.
Our Five Whys failed to represent this nuance and complexity.
Do a Five Whys for your failed Five Whys?
Do a Five Whys to figure out why your Five Whys didn’t help you find the actual root cause of your complex problem.
Just kidding.
Critics accurately point out that the Five Whys method is too single-threaded to address complex real-world problems. These same critics suggest that fault tree analysis or fishbone diagrams are far more adept at capturing the complexity of most real-world situations.
In our case, the system as a whole failed. Our analysis might have benefited from applying the fault tree methodology, but it seems pretty heavyweight for the nature of our problem. After all, fault tree analysis was developed to keep commercial aircraft aloft and send astronauts to the moon. The fishbone methodology seems more preventive–something to use proactively to determine all the possible causes of communication failures between our product’s components. Helpful, but maybe not the best tool to get us to the outage’s root cause.
I’ll keep experimenting to find a middle ground between a Five Whys and a fault tree analysis.