Hey all, I want to share an update on this week's outages on Monday and today.
Context
In recent weeks, and historically, Bubble has had a number of SEV-1 outages that are caused by our request routing architecture, which requires us to manually keep a number of different sets of servers in sync. When they are out of sync, the effect ranges from degraded performance to the complete loss of access to Bubble.
The Platform team is working on upgrading the systems that route traffic from users to Bubble servers. Today, this routing depends on a set of custom servers we call Balancer. Balancer is responsible for:
- Serving certificates to allow for encrypted communication
- Identifying which cluster an application is running on, which involves communication with the database
- Routing traffic directly to the next available server in that cluster (a rough sketch of this flow follows this list)
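To make those responsibilities concrete, here's a minimal, purely illustrative sketch with in-memory stand-ins for the certificate store, the application database, and the per-cluster server pools. None of these names come from Bubble's codebase, and the real Balancer is a set of custom servers rather than a single Python function.

```python
# Purely illustrative sketch of the three responsibilities listed above.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Request:
    hostname: str
    app_id: str

# In-memory stand-ins for the certificate store, the application database,
# and the per-cluster server pools (all hypothetical).
CERTS = {"myapp.example.com": "<pem certificate for myapp.example.com>"}
APP_DB = {"myapp": "cluster-a"}  # which cluster each application runs on
CLUSTERS = {"cluster-a": cycle(["server-1", "server-2", "server-3"])}

def route(request: Request) -> tuple[str, str]:
    cert = CERTS[request.hostname]    # 1. serve the certificate for encrypted communication
    cluster = APP_DB[request.app_id]  # 2. identify the app's cluster (a database call in reality)
    server = next(CLUSTERS[cluster])  # 3. route to the next available server in that cluster
    return server, cert

print(route(Request("myapp.example.com", "myapp")))
```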
To simplify our infrastructure and de-risk operations that are not core competencies of the team, we’re reworking all aspects of Balancer.
As part of this effort – and relevant to the performance issues seen this week – we’re simplifying Balancer’s routing to send traffic to two AWS load balancers that themselves route traffic to available Bubble servers. This work will eliminate our reliance on Balancer’s custom routing algorithm and position us to eventually remove Balancer altogether, with all traffic running directly through AWS infrastructure.
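Conceptually, the percentage-based split works like the sketch below: a single tunable number decides what share of incoming requests take the new AWS pathway, and the rest continue through Balancer's existing routing algorithm. The function and names are illustrative assumptions, not Bubble's actual code.

```python
# Minimal sketch of percentage-based traffic splitting (illustrative only).
import random

AWS_TRAFFIC_PERCENT = 40  # e.g. the 5 -> 10 -> 20 -> 40 -> 80 ramp described below

def choose_pathway() -> str:
    """Send roughly AWS_TRAFFIC_PERCENT% of requests to the new AWS load
    balancers and the rest through Balancer's existing routing algorithm."""
    if random.random() * 100 < AWS_TRAFFIC_PERCENT:
        return "aws-load-balancer"
    return "legacy-balancer-routing"

# Rough check of the split over many simulated requests.
sample = [choose_pathway() for _ in range(10_000)]
print(sample.count("aws-load-balancer") / len(sample))  # ~0.40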
Sept 15, 2025
After routing 5%, 10%, 20%, and 40% of requests to the new AWS load balancers with no change in performance, we moved the rate up once more, to 80%, at 11:35 AM. After a period of stability, the editor began to experience degraded performance, which users reported in the forum. Once we determined that the timing correlated most closely with the Balancer change, we confirmed it with metrics and deployed a reversion so that 100% of requests went back through the old pathway in Balancer. This fixed the problem.
We investigated the incident over the next few days, with Bubble's Platform team working alongside engineers at AWS to try to determine the root cause. While we found some leads, nothing definitive could be established from the data the incident gave us. We implemented the fixes that seemed most likely to help.
The team then added as much logging and as many metrics as possible to gather more information and to disambiguate some of the data we were already collecting. At that point we returned to sending 40% of requests through the AWS infrastructure, to collect data and see whether whatever had happened could be fully debugged.
We then increased traffic to 80% again to gather more data. All metrics were monitored closely, and the degraded performance did not recur.
Sept 19, 2025
Unfortunately, the 80% traffic state was left on overnight, since we had concluded that the effect on performance was fixed. We were still routing 80% of requests through the AWS load balancers when load increased to the point that the performance degradation was felt again.
What’s Next
The work on Balancer will continue, and we’re approaching it with extra care to avoid repeating the kinds of incidents seen recently. To make this work more reliable, we’re focusing on both project-specific improvements and broader operational changes.
Project-specific changes
- Working more closely with AWS and enabling all the data sources they provide (such as logs of all communication)
- Implementing a way to adjust the rate at which requests are routed more quickly than by deploying the service (a rough sketch follows this list)
- Running at higher request rates only when closely monitored, ensuring any problems are detected and fixed immediately
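As an example of the second item, one way to adjust the routing rate without a deploy is to read the percentage from a runtime configuration source and cache it briefly. The sketch below uses AWS SSM Parameter Store via boto3 purely as an illustration; the parameter name, the cache TTL, and the choice of Parameter Store itself are assumptions, not a description of our actual implementation.

```python
# Hypothetical sketch: read the routing percentage from AWS SSM Parameter Store
# with a short cache, so the value can change without redeploying the service.
import time
import boto3

ssm = boto3.client("ssm")

_PARAM_NAME = "/routing/aws-traffic-percent"  # hypothetical parameter name
_CACHE_TTL_SECONDS = 30
_cached_value = 0.0
_cached_at = 0.0

def aws_traffic_percent() -> float:
    """Return the current routing percentage, refreshing from SSM at most
    every _CACHE_TTL_SECONDS so a change takes effect within ~30 seconds."""
    global _cached_value, _cached_at
    now = time.monotonic()
    if now - _cached_at > _CACHE_TTL_SECONDS:
        response = ssm.get_parameter(Name=_PARAM_NAME)
        _cached_value = float(response["Parameter"]["Value"])
        _cached_at = now
    return _cached_value
```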
Operational changes
- Improving our alerting systems so we detect degraded performance from our internal systems rather than from our users, and can mitigate issues earlier
- Communicating more proactively with users, including better management of status page updates