What is a Root Cause Analysis?
Think for a minute. What usually happens when a serious bug makes it to production?
Someone walks by and asks you to look into it. ‘Find the cause’, they say. So you go in search of the cause, with the goal of fixing it as soon as possible. This is good, but it is far from ideal. Let me explain.
Consider the very real error that led to the immediate destruction of an aircraft before take-off. In an unfortunate turn of events, instead of flipping the switch for raising the flaps, a pilot accidentally flipped the switch for raising the landing gear. Although nobody was injured, the pilot’s mistake inconvenienced customers and led to expensive repairs. The cause of the failure was noted as pilot error. Why wasn’t this the end of it?
Finding the cause
Fix the results of the error without challenging the underlying cause, and the same thing is bound to happen again. And indeed it did. Several pilots made this error before an investigation was eventually ordered.
During a Root Cause Analysis, you go in search of the underlying conditions that led to the error. A Root Cause Analysis assumes that it is the process or the environment that led to the failure, and not the person(s).
When examining the root cause of the pilots’ errors, the investigation team discovered that the switch for raising the landing gear and the switch for changing the position of the flaps were located next to each other, among the vast array of other controls in the cockpit. What’s more, the switches were the same shape, size and colour. If a pilot flipped the wrong one, the results were catastrophic. In the end, the cockpit was redesigned, making it a lot harder to make this mistake.
Preventing future mistakes
When researching a serious bug, we might discover that we failed to account for a particular input. This is the cause of the bug. To minimize the impact on our customers, we fix the cause as soon as possible. But why do we so often stop there?
The root cause could be any number of things, and we haven’t fixed that at all. Perhaps there is no review process in place for new code. Maybe there is, but the team was working on code they had no experience with. Perhaps the team knew the code well, but the specifications were missing vital information. Maybe everyone did all they could, and the error was a simple oversight. It happens.
One thing is sure. If we don’t take the time to look into it, we will never know. It’s not about finding out whose fault it was; it’s about adjusting the underlying processes to better support the people working with them. A simple change in process, such as adding a mandatory review, could prevent future issues from occurring in the first place.
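To make the difference concrete, here is a minimal Python sketch of fixing only the cause. The function name `average_order_value`, the empty-list scenario and the choice to return 0.0 are all hypothetical, invented for illustration; they stand in for whatever input your own bug failed to handle.

```python
def average_order_value(order_totals):
    """Return the mean order total, or 0.0 when there are no orders."""
    # Fixing the cause: in this hypothetical scenario, the earlier version
    # divided by len(order_totals) without checking for an empty list.
    if not order_totals:
        return 0.0
    return sum(order_totals) / len(order_totals)


def test_average_order_value_handles_no_orders():
    # Regression test that pins down the input we failed to account for.
    assert average_order_value([]) == 0.0
```

The guard and the test close this one gap, but nothing in the code records why the empty list was missed in the first place. That is the question a Root Cause Analysis sets out to answer.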
Root Cause Analysis as a form of Retrospective
One thing remains. I’ve noticed that the term ‘Root Cause Analysis’ can be confusing. The name conjures up images of a cause, rather than a solution. On top of that, without explanation it’s not always clear what the real difference between a cause and a root cause is.
Something that is familiar to many of us is the concept of a retrospective. During a retrospective we recognise what went well during the last period so that we can keep doing it. We also look at what could be improved. If needed, we create action points to aid us in our progress. In Agile development, perhaps the term Bug Retrospective is more representative of the process than Root Cause Analysis.
Many of us in the world of Agile have experienced first-hand the benefits of retrospectives. That’s why I’d like to see more of us doing Bug Retrospectives. Next time you have a blocker, why not take a step back and look at what happened? How did it come about? What went well? And what could we change to make this less likely to happen again?
It need not take the whole team or last an hour. Ten minutes with a couple of team members could make all the difference. One thing is certain: taking the time to reflect will not only help you learn, but also improve the quality of your team’s work.
With thanks to Donald Norman for providing the example of the cockpit switches in his book The Design of Everyday Things.