Until we have a tool that can automate this process, we need to capture the following data manually after any major incident that causes downtime. The main goal of writing a post-mortem is to capture the timeline of events and the impact of the incident so that both can be presented in a subsequent review meeting.
Writing the post-mortem/retrospective
Timeline: This will constitute the majority of the post-mortem. Start by including important changes in incident status or impact to customers and any major actions taken by responders, engineers, or subject matter experts. Additionally, for each item, include a data source or metric (such as a DataDog graph, tweets showing customer impact, etc.).
Analysis: A simple summary of what happened. This should capture the underlying cause of the incident, how many customers were affected, and the overall impact on customers (e.g., what functionality was degraded or affected).
Action items: List the actions that were identified and undertaken during the incident, as well as any necessary follow-up tasks. These action items should be captured in the post-mortem so that they can be assigned later on.
External messaging: Assuming this was a major incident, draft the external messaging to customers, recapping some of the details above.
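As a rough illustration, the four sections above could be captured in a small structure that also flags what is still missing before the report is sent out. This is only a sketch: the class names, field names, and the `missing_sections` and `entries_without_source` helpers are hypothetical and are not part of Freshdesk or any existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    timestamp: datetime
    description: str
    data_source: str = ""  # e.g. a link to a graph or a customer report


@dataclass
class PostMortem:
    title: str
    major_incident: bool = False
    timeline: List[TimelineEntry] = field(default_factory=list)
    analysis: str = ""
    action_items: List[str] = field(default_factory=list)
    external_messaging: str = ""


def missing_sections(pm: PostMortem) -> List[str]:
    """Return the names of sections that are still empty."""
    missing = []
    if not pm.timeline:
        missing.append("timeline")
    if not pm.analysis.strip():
        missing.append("analysis")
    if not pm.action_items:
        missing.append("action items")
    # External messaging is only expected for major incidents.
    if pm.major_incident and not pm.external_messaging.strip():
        missing.append("external messaging")
    return missing


def entries_without_source(pm: PostMortem) -> List[TimelineEntry]:
    """Return timeline entries that lack a data source or metric."""
    return [entry for entry in pm.timeline if not entry.data_source.strip()]


# Illustrative draft with made-up events.
draft = PostMortem(title="Example outage", major_incident=True)
draft.timeline.append(
    TimelineEntry(datetime(2024, 1, 1, 9, 0), "Error rate alert fired",
                  data_source="DataDog graph (link)")
)
draft.timeline.append(
    TimelineEntry(datetime(2024, 1, 1, 9, 20), "Rollback started")
)

print(missing_sections(draft))
print([e.description for e in entries_without_source(draft)])
```

A check like this could be run before the report goes out so that reviewers receive a complete draft.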
Reviewing the post-mortem/retrospective
Once you’ve filled out the post-mortem template, send it out to all parties ahead of the post-mortem meeting. Key stakeholders to invite to the meeting include the Incident Leader, any technical service owners, and the key responders, engineers, or subject matter experts involved in the incident response. Invite all attendees to leave comments or make edits to the report, especially to the timeline portion. Regardless of its length, the post-mortem review meeting should focus on the following:
Alignment on the timeline. Quickly recap and review the timeline and ensure that everyone is on the same page.
Discussion of how the problem could have been caught. Capture any new action items along the way.
Discussion of customer impact and the external messaging, if needed.
Review and assignment of action items, along with ETAs.
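One way to make the last agenda item concrete is to list every action item that still lacks an owner or an ETA before the meeting ends. The sketch below assumes a hypothetical `ActionItem` record; the names, owners, and dates are placeholders, not data from a real incident.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    eta: Optional[date] = None


def unassigned(items: List[ActionItem]) -> List[ActionItem]:
    """Return action items that still lack an owner or an ETA."""
    return [item for item in items if item.owner is None or item.eta is None]


# Placeholder action items for illustration only.
items = [
    ActionItem("Add alert on queue depth", owner="alice", eta=date(2024, 7, 1)),
    ActionItem("Document failover steps", owner="bob"),
    ActionItem("Review retry policy"),
]

for item in unassigned(items):
    print(f"Needs owner/ETA: {item.description}")
```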
Publishing the post-mortem/retrospective
Once you’ve completed the post-mortem review meeting, there’s one final but important step you have to take: publishing the post-mortem. Distribute the post-mortem as an internal communication, typically via email, to all relevant stakeholders, describing the results and key learnings and providing a link to the full report (which can reside here in Freshdesk).
Tracking post-mortems/retrospectives
After some months of having a well-structured post-mortem process in place, we should have a list of post-mortem documents, ideally tracked in a wiki or another searchable tool (like Freshdesk). Why does this matter?
There are many benefits to having a detailed, searchable collection of post-mortems:
1. A list of post-mortems serves as a major incident log that can be used to inform future incident response. The next time you are in the heat of a major incident, the information you need may not be at hand. Having an easily searchable record of past incidents allows you to quickly look at similar cases and even reference specific graphs or data points. No more digging for old information in new places.
2. Post-mortems can help align the whole business by giving everyone access to the same information about an incident, which is a benefit no matter the size of the company. Once the post-mortem is published, the information within it can be used by many departments for a variety of purposes. For example, Sales can consult post-mortems when customers or prospects ask about a past incident; having a log of these incidents puts the key messaging and details at the Sales team’s fingertips. Similarly, Finance can consult a post-mortem to evaluate the impact on the customer in case credits need to be issued for a service degradation.
3. Post-mortems provide a business case for technical reinvestment. Having a rich post-mortem log allows engineering team leads to more easily see which parts of the technical architecture might need reinvestment. A pattern of similar or repeated incidents with the same root cause can point to the need for larger architectural changes. Post-mortems contain the data an engineering manager needs to get buy-in and alignment from Product counterparts, as well as from other teams that may need to spend time fixing issues in the longer term. Post-mortems are a great way to bring awareness to these issues and quantify them in business terms.
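To illustrate the first and third benefits, the sketch below searches a small in-memory log by keyword and counts root causes that recur. It assumes each post-mortem has been reduced to a hypothetical `PostMortemRecord` with a short root-cause tag; the records are invented examples, and a real log would live in the wiki or ticketing tool rather than in code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PostMortemRecord:
    title: str
    root_cause: str  # short tag chosen by the author (hypothetical field)
    summary: str


def search(records: List[PostMortemRecord], keyword: str) -> List[PostMortemRecord]:
    """Return records whose title or summary mentions the keyword."""
    needle = keyword.lower()
    return [
        r for r in records
        if needle in r.title.lower() or needle in r.summary.lower()
    ]


def repeated_root_causes(records: List[PostMortemRecord],
                         min_count: int = 2) -> Dict[str, int]:
    """Return root causes that appear at least min_count times."""
    counts = Counter(r.root_cause for r in records)
    return {cause: n for cause, n in counts.items() if n >= min_count}


# Invented records for illustration only.
log = [
    PostMortemRecord("Checkout latency spike", "database connection pool",
                     "Pool exhausted under load; checkout requests timed out."),
    PostMortemRecord("Login failures", "expired certificate",
                     "TLS certificate expired on the auth service."),
    PostMortemRecord("Report export outage", "database connection pool",
                     "Long-running exports held connections; pool exhausted."),
]

print([r.title for r in search(log, "pool")])
print(repeated_root_causes(log))
```

A root cause that keeps reappearing in this kind of count is the sort of evidence that can support a request for larger architectural work.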
While it may seem like creating post-mortems as documentation takes a lot of time and investment, in reality the effort is quite minimal compared to the time and money that are lost when companies remain mired in major tech debt or disorganized incident response processes. I look back on my time spent putting out fires on call, and I think about how different things would have been if we had recognized the value of post-mortem documentation sooner. We could have saved many hours of lost sleep, fostered a culture of continuous learning, delivered better software, and saved customers a whole lot of pain.