[Update and Post-Mortem] A step toward improving platform stability

Hi everyone,

TJ from Bubble’s platform team here. You might’ve seen me post a few times last month…

Today, however, I want to share an exciting update: we now have 100% of our traffic running through our dramatically simplified Balancer service. While the rollout encountered challenges, including temporary outages, things are now running smoothly. This is a significant milestone in our continued work to reduce future risk and strengthen Bubble’s infrastructure.

If you followed along during those outages, you might have seen my brief explanation of Balancer and the details in the monthly community update. Now I want to share more about what we did (including a post-mortem), why it matters, and what’s next.

How Balancer works

For those of you who didn’t see the earlier forum responses, “Balancer” is a custom service we built to route traffic from users to Bubble servers. It handled four main responsibilities:

  • Serving SSL certificates to allow for encrypted communication

  • Identifying which cluster an application was running on (which involved database lookups)

  • Maintaining a list of the active Bubble servers, their states, and the load they were under

  • Routing traffic directly to the next available server in that cluster
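Putting those four responsibilities together, the legacy request flow looked roughly like this. This is a hypothetical sketch for illustration only; the function and data-structure names are mine, not Bubble’s actual code:

```python
# Hypothetical sketch of the legacy Balancer's request flow (illustrative
# names, not Bubble's real code). It combines all four responsibilities:
# serving SSL certificates, cluster lookup, server-state tracking, and
# routing directly to a chosen server.

def handle_request(request, cert_store, cluster_db, server_registry):
    """Route one incoming request the way the legacy Balancer did."""
    # 1. Serve the right SSL certificate for the requested hostname.
    cert = cert_store[request["hostname"]]

    # 2. Database lookup: which cluster is this app running on?
    cluster = cluster_db[request["app_id"]]

    # 3. Consult the maintained list of active servers and their load,
    #    keeping only the healthy ones.
    candidates = [s for s in server_registry[cluster] if s["healthy"]]

    # 4. Route directly to the least-loaded server in that cluster.
    target = min(candidates, key=lambda s: s["load"])
    return {"cert": cert, "server": target["host"]}
```

Note how much custom state this one service has to keep correct: certificates, a cluster database, and a live view of every server’s health and load. Any of those going stale is a potential outage, which is the risk described below.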

While it works, Balancer has become a source of complexity and risk. When things go wrong with Balancer, the effects range from degraded performance to complete loss of access to Bubble, so it was important to make changes.

What we changed

Over the last few weeks, we have been working on re-routing traffic directly to AWS infrastructure by removing three of the Balancer’s core responsibilities:

  • Balancer no longer serves custom SSL certificates.

  • It no longer knows about individual Bubble servers.

  • It now routes traffic to AWS load balancers, which handle the routing to available servers using industry-standard algorithms.

This eliminates our reliance on Balancer’s custom routing algorithm and positions us to eventually remove the Balancer service altogether and run all traffic through AWS infrastructure directly.
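Under those changes, the same request flow collapses to a single lookup. Again a hedged sketch with illustrative names, not Bubble’s actual implementation:

```python
# Sketch of the simplified flow: Balancer no longer serves certificates or
# tracks individual servers. It only maps an app to its cluster's AWS load
# balancer; the ALB terminates SSL and picks a healthy server itself.
# All names here are illustrative assumptions.

def handle_request_simplified(request, cluster_db, alb_for_cluster):
    """Forward a request to the AWS load balancer for the app's cluster."""
    cluster = cluster_db[request["app_id"]]  # the one remaining lookup
    return alb_for_cluster[cluster]          # AWS handles the rest
```

The design choice is to shrink the custom code’s surface area: server health, load tracking, and certificate serving all move to managed AWS components, leaving only the app-to-cluster mapping in Bubble’s hands.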

We also simplified the logic that updates which Bubble servers are currently serving traffic. This process had been prone to timing issues in our Continuous Deployment pipeline, which had caused outages in the past.

Finally, we improved our ability to test Balancer, allowing for smoother development as we do further work to optimize our traffic routing infrastructure.

Post-mortem: What happened and how we fixed it

As part of this rollout, however, we encountered some challenges, including the following outages. We recognize how frustrating these issues were, and want to apologize again for the disruption. Below are more details:

  • September 15: In order to roll out fundamental changes to how our custom load balancer works, we stepped up traffic through the replacement pathway incrementally, from 5% to 40%, with no issues. At 80%, connections began to time out or close, leading to page load issues. We restored service by rolling back while we investigated the root cause. You can read more about this incident here.

  • September 25: Requests were returned with a “you hit balancer” response, a regression that was immediately reverted. The root cause was a misunderstanding of how certain configuration parameters were populated: empty values caused requests to be sent to a testing response.

  • September 29: An issue occurred during what should have been a routine update. In order to monitor load balancer performance, we made changes to how metrics and logs were gathered. During this work, a new dependency broke the library that handles connections, but only under certain circumstances that our tests could not catch. Our other internal systems alerted us to the traffic reduction and we reverted the change. We then removed the broken package and performed additional testing before a successful update.
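The incremental ramp-up described in the September 15 incident (5% → 40% → 80%) can be sketched as a weighted per-request coin flip. This is an illustration of the general technique, not Bubble’s actual rollout code:

```python
# Sketch of percentage-based traffic migration: each request is routed to
# the new pathway with probability new_traffic_pct / 100, so the rollout
# can be stepped up gradually and instantly rolled back to 0 when errors
# appear. Illustrative only -- not Bubble's actual implementation.
import random

def choose_pathway(new_traffic_pct, rng=random.random):
    """Return 'new' for roughly new_traffic_pct percent of requests."""
    return "new" if rng() * 100 < new_traffic_pct else "old"
```

The advantage of this pattern is that a rollback is just setting the percentage back to zero; no code needs to be redeployed, which is how problems found at 80% can be contained quickly.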

Benefits for you

As a result of the rollout, the Balancer service is now significantly less likely to be a cause of system-wide failures because:

  • We’ve replaced custom code that had flaws (it would sometimes unfairly overload certain servers) with AWS’s more robust load balancing system.

  • Connections now route directly through low-latency AWS services, removing processing overhead.

  • Deployments are more reliable without Balancer’s entanglement with our main server updates and the timing issues that previously caused outages.

What’s next

This is one step in a larger effort to simplify Bubble’s infrastructure and reduce opportunities for failure.

Our Platform team is working to fully eliminate the Balancer service by moving to AWS infrastructure and DNS-based routing late this year or early Q1. This will lead to even more stability and reliability for our users, since we can use DNS to route requests rather than relying on Balancer to work out which app a request belongs to and where it lives.
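To illustrate the idea behind DNS-based routing (the hostnames and the in-memory “zone” below are entirely made up for this sketch): if each app’s hostname is a CNAME record pointing at its cluster’s load balancer, then ordinary DNS resolution performs the app-to-cluster mapping, and no custom lookup service sits in the request path at all.

```python
# Hypothetical illustration of DNS-based routing. The zone data is a
# stand-in for real CNAME records; none of these hostnames are Bubble's.
ZONE = {
    "app-one.example.com": "cluster-3.lb.example.com",  # CNAME
    "app-two.example.com": "cluster-7.lb.example.com",  # CNAME
}

def resolve(hostname):
    """Follow CNAME records the way a DNS resolver would."""
    while hostname in ZONE:
        hostname = ZONE[hostname]
    return hostname  # the cluster's load balancer endpoint
```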

Thanks for your patience during the recent outages and for the reports you filed that helped us identify and fix issues quickly.

— TJ and the Platform team

Can we expect future updates like this to always be done under a planned “maintenance” window that lasts until the work is fully completed?

Could this be the reason I was not seeing any logs today? Only a few webhooks were in the logs, but nothing else.

Really happy to see some tech debt retired, and looking forward to its full completion. I understand the challenge of keeping a complex system such as Bubble accessible during such operations, and I appreciate the updates! :bubbles: :rocket:

:slightly_smiling_face: :+1:

Apart from this, what’s the reason Bubble has to serve a standard html boilerplate and then do a call to data=something to identify the app and fetch the actual content? That’s 1 second lost.

Can you not route traffic to the correct endpoint directly on entry?

SSR rendering should be a thing too, this could probably be some flag to inform web-socket to not send data to client on changes.

Also, probably 50% of the architecture’s CPU-time cost is due to all those realtime reads that are not always needed. Bubble is not Figma. If the user wants fresh data, they refresh, like in every other popular stack.

A lot of Bubble apps that I have made for clients rely on Bubble updating data in real time. I feel that a page refresh is not realistic for a lot of the apps out there.

@akamarski we could add that if you want to design it this way… you can! just use state…

I think some performance could be gained by adding a checkbox or a toggle for real-time data. For some data types, it’s really important to have auto updating data. For others, it’s not necessary at all. Some apps might not care about it at all.

I am a simple person: I’ve seen degraded performance on my app for 3 days now, status.bubble.io no longer shows page load speed, and measured latency is volatile. Image load times are volatile too. I attribute all of this to this update.

Hey @tj-bubble, can you clarify if the following error is something we should no longer be seeing, or is it still to be expected occasionally until your Balancer service is fully eliminated? (This is actually the first time I’ve seen it in a long while - just happened a couple minutes ago. App is on the main cluster.) :face_with_raised_eyebrow:

It should be opt-in for “blog” style pages.

I recommend you send a bug report to the team if you haven’t already

So, “I tried to load a page” is sufficient as a bug report? Honestly, it’s not worth the back-and-forth time with support who always (and understandably) wants reproducible steps (and I always provide such when I do submit reports). This just doesn’t seem to be a good bug report candidate as far as a good use of my time is concerned. By its very nature, it’s not reproducible.

I asked a simple (and importantly, on-topic) question… Should we no longer be seeing these errors? Surely, @tj-bubble can respond and not simply disappear after dropping the announcement.

Anyway, the timestamp of the event is within a couple minutes of my post, so if any Bubble team members want to correlate it with some logs, they can.

Hey @sudsy, following up on behalf of TJ. I confirmed with him and his team that users might still come across some balancer errors even after our upgraded Balancer work (explained above).
They are confident that error rates aren’t higher (meaning things aren’t getting worse) and will review app error patterns to identify any new points where more observability could help in the future.
