Blameless Postmortems and Incident Analysis for SRE Teams
Umegbewe Great Nwebedu
August 13, 2023

The longer you spend building infrastructure, the quicker you realize that stuff breaking is just part of the job. Still, as engineers, we spend an enormous amount of time anticipating and architecting systems to handle failure, be it through disaster recovery or fault tolerance strategies. 

Having been on SRE teams of varying sizes, I've seen postmortems and incident response handled differently at every single one, and doing them without blame is an even more interesting topic. In this post, I will attempt to put down some of my learnings from incident response and from cultivating a blameless postmortem culture.

Why postmortems anyway?

Postmortems have become synonymous with SRE culture nowadays; however, their roots lie in medical practice, where doctors write up a report on a deceased patient. The concept carries over nicely to software: when an application fails, you want to know why and how. Postmortems provide a good retrospective on an incident and on how it can be prevented in the future.

Incident analysis seeks to understand the more technical details of what happened, the precise timings, and how long an incident spanned. But this raises the question that shaped this article: if incidents don't happen in isolation, can postmortems really be blameless?

Blameless postmortems

Popularized by the folks at Google, the idea behind blameless postmortems is to reflect on an incident's root cause, timeline, solution, and the steps to prevent it in the future. Switching back to the earlier question: in my experience, finger-pointing is very much cultural. When teams do not see an incident as a collective responsibility, it breeds an environment where individuals are unable to speak up and take ownership.

As you might have gathered, the problem with blameless postmortems is not in the technicality of the incident but in the humans around it. So, how does one build a culture where people can freely bring issues to light, and more importantly, how can you translate that to postmortems?

Creating the right environment

Before talking about blameless postmortems, it's worth asking: what kind of environment actually makes them possible?

There are a few ways to approach building the right environment, and it starts at the top. As a leader, you set the tone for how your team responds to incidents. If you're quick to ask "who did this?" instead of "what happened?", you're signaling that blame matters more than learning. Suddenly everyone's more concerned with covering their tracks than fixing the problem.

The right environment is one where taking ownership isn't career suicide. I've seen and heard stories of teams where admitting you pushed the breaking change means getting thrown under the bus at the next all-hands. Compare that to teams where "yeah, my migration script took down prod" is met with "what'd we learn?" and "how do we avoid this in the future?"

When engineers can own their mistakes, they're also more likely to speak up when they spot potential issues.

How does the environment factor into postmortems?

I like to see blameless postmortems as an offspring of the right environment. When your team is able to go after issues rather than each other, it becomes second nature to talk about incidents without pointing fingers.

In a healthy environment, blameless isn't a rule you enforce; it's just how things work. Nobody has to remind anyone to "keep it blameless", because why would you blame someone for a systemic issue?

Tenets of a solid postmortem

Having established the why behind a blameless culture, let's dig into the how. After reading and writing my fair share of postmortems, I’ve noticed that the good ones tend to follow a few core principles.

Stick to the facts, debate your choices later

Surprisingly, this is harder than it sounds. When you're in the thick of writing up an incident that kept you up until 3am, it's tempting to justify every decision you made: "We chose to restart the database because…". The postmortem isn't the place to defend your 2am logic; it's the place to document what actually happened.

The fly.io example does this beautifully: they don't say "we had to scale up because we thought it would fix the problem"; they say "The team attempts to scale up to accommodate the load."

Timelines are your friend

Good postmortems live and die by their timelines. Not because your manager is going to nitpick whether the incident lasted 47 or 52 minutes, but because timelines force you to think chronologically.

When you're building that timeline, resist the urge to compress events. "Around 3pm things started going sideways" isn't nearly as useful as "15:23 - first alerts fired, 15:27 - on-call engineer acknowledged, 15:35 - incident escalated to team lead." The precision matters because it helps future responders understand how quickly things can escalate.
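One nice side effect of keeping timestamps precise is that useful durations fall out for free. Here's a quick sketch in Python (purely illustrative, using the hypothetical timeline entries above) of treating the timeline as structured data so numbers like time-to-acknowledge don't have to be eyeballed:

```python
from datetime import datetime

# Hypothetical timeline entries, matching the example format above (HH:MM).
timeline = [
    ("15:23", "first alerts fired"),
    ("15:27", "on-call engineer acknowledged"),
    ("15:35", "incident escalated to team lead"),
]

def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Time from first alert to acknowledgement, and to escalation.
time_to_ack = minutes_between(timeline[0][0], timeline[1][0])
time_to_escalate = minutes_between(timeline[0][0], timeline[2][0])

print(f"Time to acknowledge: {time_to_ack} min")    # 4 min
print(f"Time to escalate: {time_to_escalate} min")  # 12 min
```

Nothing about this requires tooling; the point is simply that "15:23" supports this kind of arithmetic, and "around 3pm" doesn't.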

Context is everything

It's easy to jump straight into the technical details without explaining why the system was in that state to begin with. Was this during a planned migration? Had there been recent changes? Was it Black Friday, and traffic was 10x normal?

Context doesn't just help readers understand the incident; it helps your team, and possibly future you, understand what was happening at the time and gives a better idea of the state of things. Some of the best postmortems I've read spend almost as much time on the "what was happening before" as they do on the "what went wrong."

Speak with one voice

This is probably the trickiest part when multiple teams are involved. When you’re writing across team boundaries, establish upfront who’s driving the narrative and make sure everyone reviews before it goes out. Nothing kills the vibe faster than one team throwing another under the bus in a public postmortem.

Closing thoughts

Incident response will always be a challenging effort: everything is on the line, and it's all too easy to throw someone under the bus in the process. In my experience, the best places I have worked enabled engineers to speak freely and own up to their mistakes, because mistakes are inevitable.
