Hey all, I want to share an update on this week's outages on Monday and today.
Context
In recent weeks, and historically, Bubble has had a number of SEV-1 outages that are caused by our request routing architecture, which requires us to manually keep a number of different sets of servers in sync. When they are out of sync, the effect ranges from degraded performance to the complete loss of access to Bubble.
The Platform team is working on upgrading the systems that route traffic from users to Bubble servers. Today, this routing depends on a set of custom servers we call Balancer. Balancer is responsible for:
- Serving certificates to allow for encrypted communication
- Identifying which cluster an application is running on, which involves communication with the database
- Routing traffic directly to the next available server in that cluster (a rough sketch of this flow follows this list)
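To make those responsibilities concrete, here's a minimal, purely illustrative sketch with in-memory stand-ins for the certificate store, the application database, and the per-cluster server pools. None of these names come from Bubble's codebase, and the real Balancer is a set of custom servers rather than a single Python function.

```python
# Purely illustrative sketch of the three responsibilities listed above.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Request:
    hostname: str
    app_id: str

# In-memory stand-ins for the certificate store, the application database,
# and the per-cluster server pools (all hypothetical).
CERTS = {"myapp.example.com": "<pem certificate for myapp.example.com>"}
APP_DB = {"myapp": "cluster-a"}  # which cluster each application runs on
CLUSTERS = {"cluster-a": cycle(["server-1", "server-2", "server-3"])}

def route(request: Request) -> tuple[str, str]:
    cert = CERTS[request.hostname]    # 1. serve the certificate for encrypted communication
    cluster = APP_DB[request.app_id]  # 2. identify the app's cluster (a database call in reality)
    server = next(CLUSTERS[cluster])  # 3. route to the next available server in that cluster
    return server, cert

print(route(Request("myapp.example.com", "myapp")))
```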
To simplify our infrastructure and de-risk operations that are not core competencies of the team, we’re reworking all aspects of Balancer.
As part of this effort – and relevant to the performance issues seen this week – we’re simplifying Balancer’s routing to send traffic to two AWS load balancers that themselves route traffic to available Bubble servers. This work will eliminate our reliance on Balancer’s custom routing algorithm and position us to eventually remove Balancer altogether, with all traffic running directly through AWS infrastructure.
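Conceptually, the percentage-based split works like the sketch below: a single tunable number decides what share of incoming requests take the new AWS pathway, and the rest continue through Balancer's existing routing algorithm. The function and names are illustrative assumptions, not Bubble's actual code.

```python
# Minimal sketch of percentage-based traffic splitting (illustrative only).
import random

AWS_TRAFFIC_PERCENT = 40  # e.g. the 5 -> 10 -> 20 -> 40 -> 80 ramp described below

def choose_pathway() -> str:
    """Send roughly AWS_TRAFFIC_PERCENT% of requests to the new AWS load
    balancers and the rest through Balancer's existing routing algorithm."""
    if random.random() * 100 < AWS_TRAFFIC_PERCENT:
        return "aws-load-balancer"
    return "legacy-balancer-routing"

# Rough check of the split over many simulated requests.
sample = [choose_pathway() for _ in range(10_000)]
print(sample.count("aws-load-balancer") / len(sample))  # ~0.40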
Sept 15, 2025
After routing 5%, 10%, 20%, and 40% of requests to the new AWS load balancers with no change in performance, we moved the rate up once more, to 80%, at 11:35 AM. After a period of stability, the editor began to experience degraded performance, which users reported in the forum. Once we determined that the timing correlated most closely with the Balancer change, we confirmed it with metrics and deployed a reversion so that 100% of requests went back through the old pathway in Balancer. This fixed the problem.
We investigated the incident over the next few days, with Bubble's Platform team working alongside engineers at AWS to try to determine the root cause. While we found some leads, nothing definitive could be established from the data the incident gave us. We implemented the fixes that seemed most likely to help.
The team then added as much logging and as many metrics as possible to gather more information and to disambiguate some of the data we were already collecting. At that point we returned to sending 40% of requests through the AWS infrastructure, to collect data and see whether whatever had happened could be fully debugged.
We then increased traffic to 80% again to gather more data. All metrics were monitored closely, and the degraded performance did not recur.
Sept 19, 2025
Unfortunately, the 80% traffic state was left on overnight, since we had concluded that the effect on performance was fixed. We were still routing 80% of requests through the AWS load balancers when load increased to the point that the performance degradation was felt again.
What’s Next
The work on Balancer will continue, and we’re approaching it with extra care to avoid repeating the kinds of incidents seen recently. To make this work more reliable, we’re focusing on both project-specific improvements and broader operational changes.
Project-specific changes
- Working more closely with AWS and enabling all the data sources they provide (such as logs of all communication)
- Implementing a way to adjust the rate at which requests are routed more quickly than by deploying the service (a rough sketch follows this list)
- Running at higher request rates only when closely monitored, ensuring any problems are detected and fixed immediately
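As an example of the second item, one way to adjust the routing rate without a deploy is to read the percentage from a runtime configuration source and cache it briefly. The sketch below uses AWS SSM Parameter Store via boto3 purely as an illustration; the parameter name, the cache TTL, and the choice of Parameter Store itself are assumptions, not a description of our actual implementation.

```python
# Hypothetical sketch: read the routing percentage from AWS SSM Parameter Store
# with a short cache, so the value can change without redeploying the service.
import time
import boto3

ssm = boto3.client("ssm")

_PARAM_NAME = "/routing/aws-traffic-percent"  # hypothetical parameter name
_CACHE_TTL_SECONDS = 30
_cached_value = 0.0
_cached_at = 0.0

def aws_traffic_percent() -> float:
    """Return the current routing percentage, refreshing from SSM at most
    every _CACHE_TTL_SECONDS so a change takes effect within ~30 seconds."""
    global _cached_value, _cached_at
    now = time.monotonic()
    if now - _cached_at > _CACHE_TTL_SECONDS:
        response = ssm.get_parameter(Name=_PARAM_NAME)
        _cached_value = float(response["Parameter"]["Value"])
        _cached_at = now
    return _cached_value
```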
Operational changes
- Improving our alerting systems so we detect degraded performance from our internal systems rather than from our users, and can mitigate issues earlier
- Communicating more proactively with users, including better management of status page updates