Incident postmortems: customer communication

Incidents happen. The question is whether or not we’re learning from them. There are a bunch of postmortem resources collected here to help teams maximize the learning and service reliability improvements they can gain from an incident.

However, there’s a separate question about how to communicate about incidents with customers. This definitely involves communications during the incident, but I’m especially interested in customer-facing communications after an incident.

These seem to be the key questions customers need answers to:

What happened?

This needs to be clear, brief, and factually oriented. The customer needs to trust that the company/service provider understands what went wrong.

Don’t be wishy-washy and don’t try to give mitigating context. People can tell when you’re dancing around problems or making excuses, and that’s counterproductive to rebuilding trust¹ with customers.

A tick-tock of events can help eliminate subjectivity and improve clarity while keeping this section brief.

How did the company’s normal safeguards fail to prevent the problem before it escalated into an incident?

We should all be aware that there’s absolutely no place for blame here. Customers don’t care that Kerry² deployed bad code or that your service provider failed. They want to know that you understand the system and context that allowed Kerry to make a bad deploy or that allowed an infrastructure failure to interrupt the service.

If you have to admit that you had no automated testing that could detect the fault before it was deployed and no monitoring that could detect it after the deploy, be honest and straightforward about it. If your service has a giant single point of failure (SPOF), don’t try to hide it.

What has the company done (or is doing) to ensure that this problem and similar problems never happen again?

You may have regrets, but this isn’t the time or place for them. And, though you should consider apologizing³ to your customers for the incident, this absolutely needs to be about what you’ve done or are doing to prevent the problem from ever recurring.

If you’ve clearly identified the what and the how above, those should point to areas that need attention, even if the changes needed there may not be easy. This is truly where the magic happens, since the lessons you take from a failure are demonstrated in the changes you make in response to it.

¹ Whatever the incident was, it was probably a violation of the customer’s trust in the reliability and sincerity of your company’s claims about the service, as well as the company’s competence to live up to those claims. You’ll find more about the reliability+competence+sincerity trust model here, including advice from corporate educators on how to rebuild trust. This Forbes article takes the same trust model a bit further. ↩︎
² A fictional name with an intentionally ambiguous gender. Extra: I selected that name from this list. ↩︎
³ It’s also worth reading up on the role of apologies in relationships with customers. Economist Benjamin Ho has been studying apologies for a while, and was recently the coauthor of a major field study on apologies. The authors of that paper wrote:

An apology can temporarily restore a customer’s loyalty after an adverse outcome. However, an apology acts as a promise that the adverse outcome was due to unexpected external factors, and that the customer should therefore expect better outcomes in the future. When those higher expectations go unmet, the firm’s reputation suffers more than if no apology had been tendered at all. Apologies should therefore be used sparingly and ideally only after unexpectedly bad outcomes that are unlikely to repeat again in the near future.

Emphasis mine, and I’m really trying to highlight the implicit promise in an apology. I don’t agree with the “unexpected external factors” part, at least not for online services. Sophisticated customers know that infrastructure failures do happen, and they expect us to build systems that are resilient to those failures.

CliffsNotes version of the academic research here: this was discussed on the Freakonomics Radio podcast (download MP3). ↩︎
