Practical Postmortems at Etsy
Author: Daniel Schauenberg
The days of running a web site on a single computer are basically over. Any basic architecture of a modern site consists of a myriad of systems interacting with each other. In the best case you start out with a web frontend and a database representing your initial feature. And then you add things: features, billing, a background queue, another handful of servers, another database, image uploads, etc. And of course you hire more people to also work on all of this. And this is the point when you realize you work in a complex socio-technical system. You did from the beginning, but it starts to get a little more obvious now. As interactions between components become more complex, the understanding of how the whole system works shrinks, and you start to witness emergent behaviour that you weren’t aware of before. Behaviour that can’t be explained by just looking at the single components but is a result of the interplay between them. Things start to break a lot more often.
Traditionally it has never been great to have things break. You are suddenly confronted with a situation you thought wasn’t possible. Because if you had thought it possible, you would have put guard rails and protections in place for it. Everybody is caught by surprise. You want the business to run, you want the site to be up. You are working hard to fix the problem, while your amygdala is hijacking your brain. Maybe people are running around in the office or are asking questions in your chat system. How could this have happened? Why didn’t we have tests for the code path that broke? Why didn’t you know about that failure case? When is the site going to be back up? Eventually everything will recover and your web application is humming along nicely again. Time to talk about what happened.
In what has been called the “Old View” – the traditional way of handling the aftermath of such an outage – we would now come together and yell at the person who was maintaining the system that broke. Or the one who wrote the code. Or the one working on fixing it. Surely if they had just done a better job, we wouldn’t have had that outage. This usually ends in a very unproductive meeting where afterwards everybody feels worse. And on top of that you didn’t even find out what really happened.
This is why in the “New View” of approaching systems safety – the foundation of what is now commonly known as “blameless postmortems” – we take a different route. The fundamental difference is that we don’t stop at “human error” as the reason why something broke. Humans don’t generally come to work to do a bad job. Nobody sets out to bring the site down when they come into the office in the morning. The fundamental assumption we must make when we go into a blameless postmortem debriefing is that whatever decisions people made or whatever actions people took, they made sense at the time. They believed they were making something better: fixing something, deploying a change that was flagged off in production, deleting a file that wasn’t referenced anymore. If we had a time machine and could go back to ask that person if their change would break the site, they would tell us “no way”. Otherwise they would not be deploying that change. Being able to point to that change in the debriefing as the root cause of the outage is a function of hindsight and the encompassing hindsight bias at play. The Austrian physicist and philosopher Ernst Mach said in 1905:
knowledge and error flow from the same mental sources; only success can tell one from the other.
This means that any action you take can be judged as a success or a failure, depending on the outcome. Focusing on the action itself and the human as the perpetrator doesn’t give us any advantage in learning how the incident came to be. Even more so, a person feeling like they are being judged will not readily talk about all the influences that went into the decisions they made. They will try to get out of that meeting as fast as possible.
This is why we focus not on the action itself – which is most often the most prominent thing people point to as the cause – but on exploring the conditions and context that influenced decisions and actions. After all there is no root cause. We are trying to reconstruct the past as closely as possible to what really happened. A challenge that is only made harder by the human brain’s tendency to misremember things. If one person can make what seems, with hindsight, to be a mistake, then anyone could have made it. And so we are at a point where we can punish the person pushing the button, and the next person who does it, and the next person, and the next person. Or we can try to find out why it made sense at the time, how the surrounding system encouraged or at least didn’t warn about impending problems and what we can do to better support the next person pushing the button. We can have an open and welcoming exchange where we treat the person who supposedly broke the site as the actor closest to the action and thus most knowledgeable about the surprise we just uncovered. This is one of the biggest opportunities we have to learn more about how our socio-technical system behaves in reality and not just in theory.
Blameless Postmortems at Etsy
At Etsy we strive to make postmortems as open and welcoming as possible. This means there are no restrictions on who can attend (save for the limit of people that fit into a conference room, and it’s not rare that we have standing-room-only postmortems). There are only rules around the minimum number of people who need to be there. As this is a learning event, it doesn’t make sense to talk about what happened if the people most knowledgeable about the past aren’t in the room. So everyone who worked on fixing the outage, helped communicate, got paged, or contributed otherwise to how the situation went down needs to be there. This is to ensure we learn the most we can from what happened, and that we get the most accurate timeline of events possible.
Read full article: infoq.com/articles/postmortems-etsy