Takeaways from John Allspaw's Talk at Velocity 2011

The title of this talk by John Allspaw (@allspaw), Advanced Postmortem Fu and Human Error 101, was intriguing, but it didn’t really do the talk justice.

John’s talk broke through all of the BS of “which tools do what” and got to the core of the challenge all of us in the #DevOps community face: we’re all human, we’re going to make mistakes, and our success or failure will largely be governed by the strength of the culture we’ve assembled around ourselves.

How popular are these get togethers? Standing room only at #velocityconf. So please excuse my “side angle” slide photos… it’s the only spot I managed to claim as my own!

So, without further ado, my big takeaways from @allspaw’s talk.

Crisis Patterns

There is a flow of events that can be found when looking back at a problem. It looks something like this, all leading up to the final event: the post-mortem.

The problem starts, and from there, the states that unfold are:

  1. Detection
  2. Evaluation
  3. Response
  4. Stable
  5. Confirmation
  6. All Clear, followed by a Post Mortem
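To make the flow concrete, here is a minimal Python sketch of my own (not something John showed) that records when each state was reached and reports how long each step took. The `Phase` enum, the `phase_durations` helper, and every timestamp are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta
from enum import Enum


class Phase(Enum):
    """Incident states, in the order listed above."""
    DETECTION = 1
    EVALUATION = 2
    RESPONSE = 3
    STABLE = 4
    CONFIRMATION = 5
    ALL_CLEAR = 6


def phase_durations(started, reached):
    """Return the time taken to reach each phase from the previous mark.

    `started` is when the problem began; `reached` maps each Phase to the
    moment it was reached. Phases must be reached in the order defined above.
    """
    durations = {}
    previous = started
    for phase in Phase:  # Enum iteration follows definition order
        moment = reached[phase]
        if moment < previous:
            raise ValueError(f"{phase.name} recorded before the prior phase")
        durations[phase] = moment - previous
        previous = moment
    return durations


# Hypothetical incident: these timestamps are invented for illustration.
start = datetime(2011, 6, 15, 14, 0)
timeline = {
    Phase.DETECTION: start + timedelta(minutes=20),
    Phase.EVALUATION: start + timedelta(minutes=30),
    Phase.RESPONSE: start + timedelta(minutes=45),
    Phase.STABLE: start + timedelta(minutes=120),
    Phase.CONFIRMATION: start + timedelta(minutes=135),
    Phase.ALL_CLEAR: start + timedelta(minutes=150),
}
durations = phase_durations(start, timeline)
slowest = max(durations, key=durations.get)
print(slowest.name, durations[slowest])
```

In this made-up timeline the longest stretch is the 75 minutes between starting the response and reaching a stable state, which is the kind of gap worth bringing to the post-mortem.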

And on top of these states over time, you know where your stress levels fall:

If your monitoring sucks, and for some period of time the problem goes undetected, your stress curve gets compressed quite a bit, but the flow of events is still the same.

Where your systems, tools, teams, culture, and ultimately your entire organization get put to the test is when you’ve detected a problem, evaluated it, and are in the process of responding… and the fix takes a long time. These are the most stressful of all scenarios, and the ones we all strive to avoid.

So, what can be done during Post Mortem to better prepare ourselves for the future?

Post Mortem

Folks have written at length on techniques of root cause analysis for identifying “what went wrong?”

Five Whys: For each answer to “Why?”, ask another “Why?”

Swiss Cheese: What I can only describe as “Allspaw’s Swiss Cheese”, building a mental model around layers of protection against failure (the cheese) and the complex interactions that can lead to failure (the holes in the cheese).

What, you don’t believe me?

Many more efforts (both commercial and academic) have spent significant time on techniques and categorizations, but have delivered limited real-world value. So, what is valuable?

There Is No Root Cause

What is valuable is stepping back and understanding that there is no single root cause!

These are complex systems. They have many, many interacting components. Not the least of which are the human beings that have created and are responsible for these systems. The most challenging problems are almost always systemic in origin.

It’s not just a “web server” that failed… it’s that a feature on the roadmap that was supposed to go out yesterday actually went out today (due to miscommunication from a poor dashboard design and the fact that the DBA’s car broke down yesterday so folks made a decision to delay) and news of the feature was picked up by TechCrunch sending HUGE amounts of traffic to the site and putting HUGE strain on the servers which were underprovisioned (running low on that last round of funding…) and no one was watching for this because… my favorite part… the entire ops team was at Velocity.

Problems of systemic origin do not benefit from a traditional root cause analysis. Even when one is applied, most folks conclude the problem was “Human Error”, but what value is that conclusion? It’s not like people come into work planning on taking down the web site! Nobody comes into work with the intention of doing a bad job. The only real solution to these systemic problems is as complex as the cause: it comes down to the people, and the culture those people have built. So, if you end your root cause analysis with “human error”, you have to dig further. You have to look at the culture that led to that failure.

What Can We Do To Improve Our Culture?

A lot. A couple of ideas from John:

  • A Failure Gone Wrong is a Success: Rather than evaluating your failures and trying to figure out what went wrong, evaluate your successes and figure out what went right. Maybe you did 100 code pushes, and 6 caused problems. If you only focus on the failures, you have 6 sets of data to evaluate. But if you switch the question around, and try to figure out what went right in the other 94 code pushes, you’ll open yourself up to many more opportunities for insight.
  • A Just Culture: A just culture balances accountability with learning. No room for malice.
  • Near Misses: Post mortem discussions are extremely important, and most people realize that. What’s often missed are those “near disaster” moments. Many lessons can be learned by investigating what almost went horribly wrong, and we need to have a culture that values honesty and humility to send out an email like this:

 

  • Pre Mortem: What’s better than a post-mortem is a pre-mortem! Discussing what COULD go wrong before it does. Communication is key.
  • Effective Organizational Structure: People can only be held accountable for the things that they’ve been given both the responsibility AND the authority for.

Another fantastic talk by @allspaw!