Outage [SEPT 29]

Hey all,

I wanted to give an update on the outage that happened today, what we know, and what we’re planning to do next.

As a reminder, we have been working on replacing our load balancer system (see the context in this post for more details). We have had several issues over the past couple of weeks, but today’s outage was different: it was not caused by us increasing traffic to the new codepath. In fact, we have been at 100% since last Friday (see Fede’s post here).

What happened today? We are working on improving observability on the balancer service so that we can better support the system. Today, we deployed a seemingly innocuous change that we had run through tests we believed were sufficient for a safe deployment. However, the tests missed an edge case, and the deployment caused a full-blown outage. We apologize for any inconvenience this may have caused.

What are we doing going forward? We are still doing a root cause analysis of what was broken in that change, and we’re pausing deployments to Balancer until we can safely reproduce the issue in a lower environment and fix the bug. Although we tested the change, the testing failed to detect the problem, likely due to a difference between production and our test environment. Once we figure out why, we’ll update the testing procedure to fix the root issue.

What went well? As per my previous post, we committed to doing the following:

  • Commitment: Improving our alerting systems so that we detect degraded performance through our internal systems rather than hearing about it from our users, and can mitigate issues earlier

    • Today: We were alerted right away, 10+ engineers jumped on a call, and we reverted and updated as soon as possible (~16 minutes of downtime).
  • Commitment: Communicating more proactively with users, including better management of status page updates

    • Today: We updated the status page right away, and were in communication with the forum.