
Fastly crippled much of the Web, then issued the perfect apology

Fastly runs a CDN — a content delivery network that makes delivery of Web pages faster globally. Due to a bug, major Fastly clients including Amazon, Reddit, Spotify, eBay, and Pinterest were unable to serve pages to customers for about an hour Tuesday. Fastly apologized — and its apology is a clinic in how to apologize for a technical problem.

Here’s Fastly’s full blog post published Tuesday.

Summary of June 8 outage

Published June 8, 2021

Nick Rockwell, Senior Vice President of Engineering and Infrastructure

We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change. We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal.

This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.

What happened?

On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.

Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.

Here’s a timeline of the day’s activity (all times are in UTC):

09:47 Initial onset of global disruption
09:48 Global disruption identified by Fastly monitoring
09:58 Status post is published
10:27 Fastly Engineering identified the customer configuration
10:36 Impacted services began to recover
11:00 Majority of services recovered
12:35 Incident mitigated
12:44 Status post resolved
17:25 Bug fix deployment began

Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 17:25.

Where do we go from here? 

In the short term:

  • We’re deploying the bug fix across our network as quickly and safely as possible.
  • We are conducting a complete post mortem of the processes and practices we followed during this incident.
  • We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes.
  • We’ll evaluate ways to improve our remediation time.

We have been — and will continue to — innovate and invest in fundamental changes to the safety of our underlying platforms. Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up. We’ll continue to update our community as we make progress toward this goal.

Conclusion

Even though there were specific conditions that triggered this outage, we should have anticipated it. We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support. Customers should always feel free to email support@fastly.com for more information.

Why this is the perfect apology

You can’t control when and how problems with your company affect customers. What you can control is what you do when there is a problem. Communication with customers should be fast, clear, contrite, and not overly defensive. Fastly’s blog post is practically a template for how to do it right. Take note of these elements of the post:

  • The title is clear and neutral — it’s about the outage.
  • The lede describes what happened in neutral terms (“We experienced a global outage”) and how quickly Fastly fixed it: “Within 49 minutes, 95% of our network was operating as normal.”
  • There is a clear apology. It doesn’t evade who is responsible, and it is directed to those who were harmed, Fastly’s customers and their customers’ customers: “We’re truly sorry for the impact to our customers and everyone who relies on them.” It neither minimizes the issue nor blubbers; it’s just a sincere apology.
  • There is a detailed description of what happened and in what order, starting with the deployment that introduced the bug and then the way a single customer’s valid configuration change triggered it for customers globally. Notice that there is no attempt to avoid blame: “We began a software deployment that introduced a bug.”
  • The post includes a public plan describing follow-up, including how to avoid similar bugs in the future and speed recovery.
  • It ends with taking responsibility and outreach to customers: “[W]e should have anticipated it. We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support.”

If you’re going to screw up, this is how to do it

Practice writing apologies and post mortems like this.

It will give your customers confidence in your worst moments.

 


4 Comments

  1. I’ve worked in coding and systems integrations since the early 90s. No matter how long you test or what data you test with, there is always something that’s unforeseen or unexpected when the code hits production and you unleash it on non-technical users. Not even counting on hackers and villains, but just normal people poking around.

    Junior programmers used to whine about how hard I was on ensuring their code had adequate error processing and logical functions. A common complaint was “why should I do that? no one would ever do that in real life!” I’d respond that they need to think of everything, including what happens if someone drops coffee on their keyboard, because we all know that NEVER happens in the real world.

  2. Josh, you’ve trained me to scour my writing for the passive voice, and I don’t think Fastly’s apology contains a single passive sentence. In fact, “we” is the subject in almost every sentence. Well done.

    I also like how, twice, Fastly stresses that the customer who caused the outage did nothing wrong. It was a valid configuration change. Not their fault. Our fault.

  3. Agreed. That was an excellent apology for all the reasons you listed, and it was written clearly.