Outage [SEPT 29]

tj-bubble · September 29, 2025, 11:08pm

Hey all,

I wanted to give an update on the outage that happened today, what we know, and what we’re planning to do next.

As a reminder, we have been working on replacing our load balancer system (see the context in this post for more details). We have had several issues over the past couple of weeks, but today’s outage was different from those: it was not caused by us increasing traffic to the new codepath. In fact, we’ve been at 100% on the new codepath since last Friday (see Fede’s post here).

What happened today? We are working on improving observability on the balancer service so that we can better support the system. Today, we were making a seemingly innocuous change, which we ran through tests that we believed were sufficient for a safe deployment. However, the tests missed an edge case, and this deployment ended up causing a full-blown outage. We apologize for any inconvenience this may have caused.

What are we doing going forward? We are still doing a root cause analysis on what was broken in that change, and we’re pausing deployments to Balancer until we can safely reproduce the issue in a lower environment and fix the bug. Although we tested the change, the testing failed to detect the problem, likely due to a difference between production and our test environment. Once we figure out why, we’ll update the testing procedure to fix the root issue.

What went well? As per my previous post, we committed to doing the following:

Commitment: Improving our alerting systems so we detect degraded performance from our internal systems rather than our users, and can mitigate issues earlier
- Today: We were alerted right away, 10+ engineers jumped on a call, and we reverted and updated as soon as possible (~16 minutes downtime).
Commitment: Communicating more proactively with users, including better management of status page updates
- Today: We updated the status page right away, and were in communication with the forum.
