#2 Leading Teams to Learn from Incidents
Learn through honest, blameless reflection on lessons learned and on how complex systems fail, and discover insights and techniques for facilitating a post-mortem review.
I'm not sure how many of you have had the experience of participating in incident investigations in a production environment—those unexpected events that can completely halt the service you're responsible for, or in more severe cases, impact an entire area or the entire business.
Here are some articles on software incidents that offer thoughtful and inspiring ideas on how we should approach incident management.
The VOID
The first article describes an initiative called the Verica Open Incident Database, a.k.a. The VOID, which makes reports of software-related incidents publicly available. The article offers thoughts like:
Software outages and incidents aren’t going to magically stop. We can’t make our systems “flawless” or systematically map out all the potential faults.
The purpose of an (effective) internal write-up is to represent the richest understanding of the event for the broadest possible audience of internal, hands-on staff so that they may benefit from the experience.
This type of software failure is unexpected and surprising. The past event’s action items may or may not prevent the next one. (Even if they do, teams likely won’t notice events that don’t happen.)
Read the full article to understand some key findings on how sharing information about incidents helped to reduce the number of fatal crashes in U.S. aviation.
Tip: Coming soon: The VOID 2024 Report. Join Courtney Nash for an in-depth walkthrough of the new findings in an exclusive webinar on Wednesday, February 21st.
How Complex Systems Fail
The second article is How Complex Systems Fail. Richard Cook wrote it 22 years ago, distilling almost 20 years of research on resilience engineering in complex systems, and like good wine it has aged well.
Here are a few ideas about it:
Complex systems contain changing mixtures of failures latent within them.
The complexity of these systems makes it impossible for them to run without multiple flaws being present. Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident.
Complex systems run in degraded mode.
Complex systems run as broken systems. After-accident reviews nearly always note that the system has a history of prior ‘proto-accidents’ that nearly generated catastrophe.
Hindsight biases post-accident assessments of human performance.
This means that ex post facto accident analysis of human performance is inaccurate. Knowledge of the outcome poisons the ability of after-accident observers to recreate the view that practitioners had of those same factors before the accident.
Read the full article for all 18 observations on how complex systems fail.
Learning from incidents: how things went right
There are many outdated ideas about how operational work on services should be handled; there are better ways for our teams to learn from incidents and to recognize the role humans play in them.
Nick Stenning covered this in his talk Learning from incidents: how things went right; here are some quotes from it:
Human error is just a label.
A common trap is attribution to human error: this is a symptom, not the cause. Humans do make mistakes, but system design, organizational context, and personal context affect when, how, and with what impact.
Human error is a label that causes us to stop investigation at precisely the moment we're about to discover something interesting about our system.
Another trap is counterfactual reasoning: telling a story about events that did not happen in order to explain events that did. It usually comes with phrases like "should have", "could have", "failed to", and "did not". Time spent talking about things that didn't happen is time not spent trying to understand how what happened, happened.
Move beyond "The problem wasn't detected in canary" and get to "How was it detected?" and "How effective is a canary usually at detecting this kind of problem?"
Run a facilitated post-incident review
Facilitators should ideally not have been involved in the response to the incident.
Pick interesting incidents, not big scary ones. Start with one a month.
Ask Better Questions
Language matters: prefer "how" over "why".
Ask how things went right
Don't stop at understanding what went wrong: ask about how we returned the system to a satisfactory state.
Keep review and planning meetings separate
Keep discussion of future mitigations out of the post-incident review; otherwise, it will distract from the purpose of that meeting.
That's all for this week. I will share more insights on incidents and outages soon.
I would also like to thank all the subscribers for supporting this initiative, and I hope it's as helpful to you as it is to me.
See you next week!