Six Questions for a Quick & Easy Root Cause Analysis

What’s a Root Cause Analysis?

When a customer runs into an issue, my team first solves it. Then we take a closer look at how the issue occurred, to see how we can prevent the same thing from happening again. That closer look is the Root Cause Analysis.

How do I do one?

A Root Cause Analysis is otherwise known as a bug retrospective. There are many models and frameworks available, like the Five Whys or Fishbone Diagrams. But at its core, a Root Cause Analysis is very straightforward. You just want to know two things:

What happened?
How can we prevent this from happening again?

To help you get started, I want to share my guide to a Quick and Easy Root Cause Analysis.

Root Cause Analysis: The Quick & Easy Method

Simply sit down with a couple of teammates and answer the following six questions. It’s that easy!

A: What Happened?

What was the timeline?
How many customers did this affect? How many reported it?
In which issue or story did you introduce the bug?
What was the Cause/Root Cause?

B: How can we prevent this from happening again?

What can we do now so that if it happens again we’d spot it immediately?
What could we do differently next time to prevent this from happening at all?

Quick & Easy RCA: An Example

Here’s an example that I did with my team recently. Let me walk you through the six questions.

A: What Happened?

1. What was the timeline?

When was the bug noticed, reported, picked up, fixed? Who was involved?

In our case, one of our mobile features was malfunctioning. This was reported by three large customers over a couple of days. Two days later we had picked it up and a week and a half later the fix was live.

2. How many customers did this affect? How many reported it?

How many incidents or other internal issues were registered? How many customers had the problem?

We had only three incidents, but many more customers were affected. Either they didn’t notice, or they did but didn’t report it.

3. In which issue or story did you introduce the bug?

Was it something we expected? Why or why not?

We narrowed it down to a few potential stories in which we changed the backend interaction with the database. We didn’t expect these components to interact like this.

4. What was the Cause/Root Cause?

What went wrong? In which component? Note that we’re looking for a technical cause, or for a lack of knowledge or experience. The goal is NOT to blame someone specifically.

In our case the malfunction was caused by a hanging database call. Requests that the backend made after that call failed. The broken feature didn’t make any database queries itself, which is why we didn’t notice the malfunction at the time.

B: How can we prevent this from happening again?

5. What can we do now so that if it happens again we’d spot it immediately?

You could add logging or metrics to try to spot your bug sooner, or perhaps add it to your list of error-prone components for manual testing.
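
To make the logging idea concrete, here is a minimal sketch in Python. It is an illustration only, not code from our product: the decorator name warn_if_slow, the logger name and the fetch_settings function are all hypothetical. It logs a warning whenever a wrapped call takes longer than a threshold you choose.

```python
import functools
import logging
import time

logger = logging.getLogger("slow_calls")  # hypothetical logger name


def warn_if_slow(threshold_seconds):
    """Log a warning when the wrapped call takes longer than threshold_seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised.
                elapsed = time.monotonic() - start
                if elapsed > threshold_seconds:
                    logger.warning(
                        "%s took %.3fs (threshold %.3fs)",
                        func.__name__, elapsed, threshold_seconds,
                    )
        return wrapper
    return decorator


@warn_if_slow(threshold_seconds=0.05)
def fetch_settings():
    # Hypothetical stand-in for a slow database call.
    time.sleep(0.1)
    return {"feature_enabled": True}
```

Keep in mind that this only reports a slow call once it has finished. A call that never returns would never be logged this way, so a sketch like this works best next to a timeout.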

We created extra low-level automated tests, to try to catch similar issues should they occur.
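
I can’t share our real tests here, but a low-level test for a hanging call could look roughly like the sketch below. Everything in it is an assumption made for illustration: FakeHangingDatabase, query_with_timeout and the timeout values are hypothetical names and numbers, and the timeout is implemented with Python’s standard concurrent.futures module rather than with anything from our own backend.

```python
import concurrent.futures
import threading


class FakeHangingDatabase:
    """Hypothetical test double: query() blocks until release is set."""

    def __init__(self):
        self.release = threading.Event()

    def query(self):
        # Simulates a hanging call; gives up waiting after 5 seconds.
        self.release.wait(timeout=5)
        return "rows"


def query_with_timeout(db, timeout_seconds):
    """Return the query result, or None if it takes longer than timeout_seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(db.query)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        # Do not wait for the worker, otherwise the caller would hang too.
        executor.shutdown(wait=False)


def test_hanging_query_does_not_block_caller():
    db = FakeHangingDatabase()
    try:
        result = query_with_timeout(db, timeout_seconds=0.1)
    finally:
        db.release.set()  # let the background thread finish
    assert result is None
```

The test is written as a plain function with an assert, so a runner like pytest would pick it up, and you can also call it directly.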

6. What could we do differently next time to prevent this from happening at all?

Can we change our process to better support us? How can we share/reuse our new knowledge?

We concluded that because the buggy feature is critical for many of our customers, we would cover it with an end-to-end test.

Conclusion

Answering these six questions took us less than half an hour. We can use the insight we gain during RCAs to sharpen our bug prevention and up our quality. If you want to do the same thing with your team, I’d be curious to learn how it went!

About the author: Hollis Hazel

Senior Software Tester at TOPdesk. Working on Agile projects during the transition from a monolith to a service architecture. Helping teams and testers with training, advice and inspiration so that they can do their job well, impress our customers and have fun.