Platform Stability Follow-Up

Hi all,

I’m following up on our previous posts about platform stability to provide more transparency, share the work we’ve done so far to stabilize the platform, and outline our plans going forward.

When we break down the challenges we’ve faced over the last several months, the problems fall into a few categories:

  • DDoS Attacks

  • Database issues and backend rate-limiting

  • Regressions and code bugs

  • Alerting and communications

We’ll walk through each category one by one. Please note, this is a fairly technical post; we’re sharing for those of you who want a deeper understanding of what we are doing to improve the stability of our platform.

DDoS Attacks

In general, Bubble is already highly resilient to DDoS attacks. Over the last three months, we’ve seen at least one attack against a domain we host almost every single day, without any visible impact to our users. This is due in part to Cloudflare’s ability to detect and automatically mitigate attacks, as well as infrastructure hardening that we performed after past incidents.

Even the most robust protections against DDoS attacks are challenged from time to time as attackers evolve their techniques. Last week, on May 6th and 8th, we had extensive user-facing downtime due to a series of attacks that managed to break through our defenses. The attacks were almost certainly carried out by the same actor, and they used a new attack vector we hadn’t seen before, routed through third-party video-streaming websites.

We were able to identify commonalities in the malicious traffic and set up defenses that blocked this new attack vector. In addition, we tightened rate-limiting across our infrastructure, which should make it harder for new forms of attack to break through in the future.
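For the technically curious, here is a rough illustration of the kind of rate-limiting we’re describing. This is a generic token-bucket sketch, not our actual implementation, and the bucket size and refill rate are made-up numbers.

```typescript
// Minimal token-bucket rate limiter (illustrative only; not Bubble's code).
// Each client IP gets a bucket that refills at `refillPerSec` tokens per second
// and holds at most `capacity` tokens. A request is served only if a token is
// available, which caps sustained request rates while tolerating short bursts.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryRemoveToken(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const buckets = new Map<string, TokenBucket>();

// Returns true if the request from `clientIp` should be served, false if it
// should be rejected (for example with HTTP 429).
function allowRequest(clientIp: string): boolean {
  let bucket = buckets.get(clientIp);
  if (!bucket) {
    bucket = new TokenBucket(20 /* burst */, 5 /* steady requests per second */);
    buckets.set(clientIp, bucket);
  }
  return bucket.tryRemoveToken();
}
```

In a real deployment the buckets would live in a shared store such as Redis so that every web server enforces the same limits.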

We also investigated which parts of our infrastructure failed in the face of the attacks, and found two culprits:

  • We found a component in our web servers that consumed an unnecessarily large amount of memory, causing server crashes under heavy load. This caused one of the downtimes induced by the DDoS attacks.

  • We found a hot shard in one of our Redis clusters that receives a disproportionate amount of load under certain forms of heavy traffic, causing it to fall behind on serving requests. This caused the other downtime. (For the technically curious, a generic mitigation sketch follows below.)

We’ve already fixed the first, and are working on the second.
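As a side note for those following along, one common way to relieve a hot key or hot shard, when the data is read-heavy, is to spread reads across several copies of the key so they hash to different shards. The sketch below uses the ioredis client with made-up key names; it illustrates the general pattern, not the specific fix we’re building.

```typescript
import Redis from "ioredis";

// Illustrative only: spread a read-heavy "hot" key across N copies. In a
// Redis Cluster, each replica key hashes to a (likely) different shard, so no
// single shard serves all of the reads. Replica count and names are made up.
const redis = new Redis(); // connects to localhost:6379 by default
const REPLICAS = 8;

// Write the value to every replica key.
async function setHotValue(key: string, value: string): Promise<void> {
  await Promise.all(
    Array.from({ length: REPLICAS }, (_, i) => redis.set(`${key}:replica:${i}`, value))
  );
}

// Read from a randomly chosen replica so the load is distributed.
async function getHotValue(key: string): Promise<string | null> {
  const i = Math.floor(Math.random() * REPLICAS);
  return redis.get(`${key}:replica:${i}`);
}
```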

In addition, some longer-term investments we plan to make in DDoS resiliency include:

  • Splitting the main cluster into multiple shared environments. This will limit the impact of both DDoS attacks and other forms of incidents by confining them to a fraction of our shared-hosting users rather than our entire user base.

  • Improving how quickly our infrastructure can scale up to handle bursts of load.

Database issues and backend rate-limiting

One of the primary challenges of shared infrastructure is ensuring adequate isolation, so that one application’s activity can’t consume the resources needed to serve other applications. While this isn’t a major challenge for MVP builders, landing-page hosts, and other lightweight tools, Bubble hosts serious production-grade applications on its main cluster, including some with hundreds of thousands of users and millions of items in their databases. Our approach to managing this involves three basic strategies:

  • Sharding: We have multiple main cluster databases storing user data, so a malfunctioning application can only ever impact a fraction of our applications.

  • Resource constraints: We monitor how much work each of our applications is doing, and limit how much of our total resources it can consume at any given point in time.

  • Query cancellation: As a backstop to our resource constraints, if a database query takes unexpectedly long, we cancel it to prevent poorly optimized queries from visibly degrading overall database performance.
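To make the query-cancellation backstop concrete, here is a hedged sketch of the general technique using PostgreSQL’s statement_timeout via the node-postgres client. It illustrates the idea, not our production code, and the 5-second limit is an arbitrary example.

```typescript
import { Pool } from "pg";

// Illustrative only: cap how long any single query may run by setting
// statement_timeout on the session before executing application queries.
// PostgreSQL cancels the query and raises an error once the limit is hit.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function runWithTimeout<T>(sql: string, params: unknown[]): Promise<T[]> {
  const client = await pool.connect();
  try {
    // 5000 ms is an arbitrary example limit, not a real Bubble setting.
    await client.query("SET statement_timeout = 5000");
    const result = await client.query(sql, params);
    return result.rows as T[];
  } finally {
    // Reset so the pooled connection does not carry the timeout elsewhere.
    await client.query("SET statement_timeout = 0").catch(() => {});
    client.release();
  }
}
```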

Over the last few months, we’ve had simultaneous challenges with both our second and third strategies, which have led to an increasing number of incidents in which one of our database shards experienced a massive latency spike. During those spikes, applications hosted on that database saw slow data loading or, in severe cases, complete downtime.

On the resource constraints side of things, there have been two main issues:

  • Some of the performance optimizations we’ve made to speed up bulk data manipulation have inadvertently made it possible to execute more concurrent work than our system can safely handle.

  • Every Bubble app is different, and as our community grows and we host an increasingly large number of production-scale apps, we encounter new types of large requests for which we have not yet built limits.

On the query cancellation side, we rely on an outdated piece of technology to run database stored procedures, and a bug in the latest version of that technology blocks our ability to cancel queries that depend on it. We are unable to downgrade to the previous version for various reasons, and so we have been working to eliminate our dependency on it.
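For context on what restoring cancellation can look like in general, one common pattern (sketched below purely as an illustration, not a description of our systems) is an external watchdog that scans pg_stat_activity and cancels queries that have been running too long:

```typescript
import { Pool } from "pg";

// Illustrative watchdog: every 10 seconds, cancel queries that have been
// running for more than 30 seconds. Both thresholds are arbitrary examples.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function cancelLongRunningQueries(): Promise<void> {
  const { rows } = await pool.query(`
    SELECT pid, now() - query_start AS runtime, query
    FROM pg_stat_activity
    WHERE state = 'active'
      AND query_start < now() - interval '30 seconds'
      AND pid <> pg_backend_pid()
  `);
  for (const row of rows) {
    console.warn(`Cancelling pid ${row.pid} after ${row.runtime}: ${row.query}`);
    // pg_cancel_backend sends a cancel signal without killing the connection.
    await pool.query("SELECT pg_cancel_backend($1)", [row.pid]);
  }
}

setInterval(() => {
  cancelLongRunningQueries().catch((err) => console.error("watchdog error", err));
}, 10_000);
```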

We have a team of engineers working full-time to address this set of problems and restore database stability. Some of the things that the team has already accomplished:

  • We’ve removed many sources of unexpectedly long or otherwise expensive queries. While this does not fix the root causes of the problem, it a) prevents recurrence of the exact same outage, helping with short-term stability, and b) often leads to significant user-facing performance improvements for apps that rely on that type of query.

  • We’ve added code to break some large operations into smaller chunks (see the sketch after this list). This allows us to impose stricter limits on operation size, removing large categories of potential problems. We’ve already shipped this for certain types of operations, and we are systematically going through the rest of our operations as well.

  • We are now routing calls to our database through a new service layer that lets us transparently change how we execute them. This took a lot of work over the last two months to set up, and it will enable us to make a lot of these planned changes rapidly, including some of the operation breakup work. More importantly, it will enable us to migrate off the outdated stored procedure technology and restore our ability to cancel queries as needed.
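Here is the operation-chunking sketch referenced above. It shows the general shape of the idea; updateItems is a hypothetical stand-in for whatever actually performs the write, and the chunk size is an arbitrary example.

```typescript
// Illustrative only: process a large list of item ids in fixed-size chunks so
// each database operation stays small and predictable.
async function updateInChunks(
  itemIds: string[],
  updateItems: (chunk: string[]) => Promise<void>,
  chunkSize = 500 // arbitrary example size
): Promise<void> {
  for (let i = 0; i < itemIds.length; i += chunkSize) {
    const chunk = itemIds.slice(i, i + chunkSize);
    await updateItems(chunk); // smaller statements are easier to limit and cancel
  }
}
```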

We are now very close to making our most-heavily used stored procedures cancellable, which should provide a significant improvement in our database reliability. We expect some of this work to make it into production over the next week.

We are also very close to fixing the loopholes that allow too much work to begin in parallel. We’ve already deployed some of the prerequisite code and expect to finish this within the next one to two weeks.
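To give a flavor of what limiting parallel work can look like, here is a generic per-application semaphore sketch. Again, this is an illustration of the pattern rather than our actual code, and the limit of 4 concurrent heavy operations is a made-up number.

```typescript
// Illustrative per-app concurrency limiter: at most `limit` heavy operations
// per application run at once; extra work waits its turn in a FIFO queue.
class AppSemaphore {
  private running = 0;
  private waiting: Array<() => void> = [];

  constructor(private limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }

  private acquire(): Promise<void> {
    if (this.running < this.limit) {
      this.running += 1;
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => this.waiting.push(resolve));
  }

  private release(): void {
    const next = this.waiting.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter; `running` is unchanged
    } else {
      this.running -= 1;
    }
  }
}

// One semaphore per application id; the limit of 4 is an arbitrary example.
const semaphores = new Map<string, AppSemaphore>();

function heavyWork(appId: string, task: () => Promise<void>): Promise<void> {
  let sem = semaphores.get(appId);
  if (!sem) {
    sem = new AppSemaphore(4);
    semaphores.set(appId, sem);
  }
  return sem.run(task);
}
```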

These changes should get us back to a stable baseline on the database front. Longer term, we have plans to continue to improve main cluster isolation:

  • At the database layer, we are working to move a lot of our queries out of the database and into purpose-built systems designed to make those queries fully isolated, lightning-fast, and scalable to hundreds of millions of items.

  • At the workflow layer, we plan to implement an event-driven architecture that would allow us to seamlessly switch between different applications in a scalable and perfectly sandboxed way.
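The event-driven work is still at the planning stage, but for those who want a mental model, the rough shape (a hypothetical sketch with made-up names, not a design we’ve committed to) is a queue per application drained independently, so a flood of events from one app can’t starve another:

```typescript
// Hypothetical sketch of event-driven workflow dispatch with per-app queues.
type WorkflowEvent = { appId: string; workflow: string; payload: unknown };

const queues = new Map<string, WorkflowEvent[]>();

function publish(event: WorkflowEvent): void {
  const queue = queues.get(event.appId) ?? [];
  queue.push(event);
  queues.set(event.appId, queue);
}

// Round-robin over apps so every application makes progress each tick.
async function drainOnce(handle: (e: WorkflowEvent) => Promise<void>): Promise<void> {
  for (const [appId, queue] of queues) {
    const event = queue.shift();
    if (event) {
      await handle(event).catch((err) => console.error(`workflow failed for ${appId}`, err));
    }
  }
}
```

In practice this would sit on top of a managed queueing system rather than in-process maps, but the isolation idea is the same.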

Regressions and code bugs

While most of our customer-facing downtime has been caused by changes in how Bubble is used, we have also had some incidents caused by changes we made to our own systems. The most notable recent example was the hour-long outage on April 24th. This was an especially severe incident because the performance bug that was introduced caused our caching layer to fail, and our mechanism for reverting the change relied on the caching layer being available. That meant we had to do manual work to get our systems back online, which extended what would otherwise have been a brief outage into a full hour.

This outage was an exception: for the most part, the impact of any accidental bugs is limited to a subset of customer apps rather than the entire main cluster. We limit the breadth and severity of impact through a combination of automated testing, feature flags, progressive rollouts, and the Immediate vs. Scheduled clusters. That said, the rate of regressions we ship is higher than we would like it to be, and we see a number of opportunities to further reduce this source of instability.
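To illustrate the progressive-rollout piece of that, here is a generic sketch (not our internal tooling) of enabling a change for a deterministic percentage of apps by hashing the app id, so each app sees consistent behavior while the percentage is ramped up:

```typescript
import { createHash } from "crypto";

// Illustrative progressive rollout: a flag is enabled for a deterministic
// percentage of applications. Hashing the app id means each app consistently
// sees the same behavior while the percentage is ramped from 1% to 100%.
function isEnabled(flagName: string, appId: string, rolloutPercent: number): boolean {
  const digest = createHash("sha256").update(`${flagName}:${appId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0..99
  return bucket < rolloutPercent;
}

// Example: roll a hypothetical new query path out to 10% of apps first.
// const useNewQueryPath = isEnabled("new-query-path", app.id, 10);
```

Ramping the percentage up over several days gives monitoring a chance to catch problems while they affect only a small slice of apps.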

The measures we are currently taking include:

  • To specifically address the April 24th outage, we are working on a rollback system that will function correctly even if other parts of our infrastructure are having issues, so that if this same situation were to happen again, we’d be able to get back online as soon as we identified the problem (a generic sketch of this idea follows this list). This work is in flight, and we are aiming to complete it within two weeks.

  • While we work to uplevel our overall stability, our engineering management team is meeting daily to go over all planned releases, assess risk, and make sure we have appropriate testing and mitigation plans in place.

  • This is not a new practice, but we are reinforcing our commitment with the team: whenever we cause a regression, we write an automated test that would have kept the bug from reaching production, and we write tests for all new development we do.
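Here is the generic rollback-pointer sketch referenced in the first bullet. The idea is to keep the current and previous release identifiers in a store that depends on nothing else, so flipping back never requires the caching layer or database to be healthy. The file name and helpers below are made up for illustration.

```typescript
import { promises as fs } from "fs";

// Hypothetical rollback pointer kept in a plain local file, deliberately
// independent of the caching layer and database. "releases.json" is assumed
// to already exist with a { current, previous } object.
const POINTER_FILE = "releases.json";

type ReleasePointer = { current: string; previous: string };

async function recordDeploy(newVersion: string): Promise<void> {
  const data: ReleasePointer = JSON.parse(await fs.readFile(POINTER_FILE, "utf8"));
  await fs.writeFile(
    POINTER_FILE,
    JSON.stringify({ current: newVersion, previous: data.current }, null, 2)
  );
}

// Rolling back only swaps the pointer; no other system needs to be healthy.
async function rollback(): Promise<string> {
  const data: ReleasePointer = JSON.parse(await fs.readFile(POINTER_FILE, "utf8"));
  await fs.writeFile(
    POINTER_FILE,
    JSON.stringify({ current: data.previous, previous: data.current }, null, 2)
  );
  return data.previous;
}
```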

Longer-term measures we plan to take include:

  • Continuing to improve our planning and testing practices to reduce the risk of each code change we make. We’ve recently added a number of experienced engineering managers to our team, and they are identifying opportunities to improve our day-to-day development practices.

  • Building out load testing as part of our automated testing capabilities, including making our QA environment more closely resemble production (see the sketch after this list). The April 24th outage could have been prevented if our automated tests had run in an environment with realistic production load, which would have surfaced the bug; in our current QA environment, the bug didn’t cause enough impact to be noticeable.

  • Making it possible for main-cluster customers to choose when new Bubble releases are applied to their apps, similar to the capability we offer on dedicated hosting today. This requires significant infrastructure work and needs to wait until a number of other improvements on our roadmap are complete, but it’s the north star we are heading for.
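And here is the load-testing sketch referenced above: a minimal example that fires bursts of concurrent requests and reports latency percentiles. Real load testing would use dedicated tooling and production-shaped traffic; the URL below is a placeholder.

```typescript
// Minimal load-test sketch: send `concurrency * rounds` requests at an
// endpoint and report p50/p95 latency. Requires Node 18+ for global fetch.
async function loadTest(url: string, concurrency = 50, rounds = 10): Promise<void> {
  const latencies: number[] = [];
  for (let round = 0; round < rounds; round++) {
    await Promise.all(
      Array.from({ length: concurrency }, async () => {
        const start = Date.now();
        await fetch(url);
        latencies.push(Date.now() - start);
      })
    );
  }
  latencies.sort((a, b) => a - b);
  const p = (q: number) => latencies[Math.floor(q * (latencies.length - 1))];
  console.log(`p50=${p(0.5)}ms p95=${p(0.95)}ms over ${latencies.length} requests`);
}

// loadTest("https://example.com/health");
```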

Alerting and Communications

One of the areas in which we’ve made a lot of improvements over the last six months is reliably detecting and notifying our community when there are issues with our platform. In the past, we’ve had situations where customers were experiencing extended issues with no communication from our team about what was going on. I’m pleased to say that this is now a rare occurrence: Between our support team and our automated systems, we typically have an engineer responding within 3–5 minutes of an outage’s start, and we reliably put some indication up on our status page that we are aware of an issue. We are continually adjusting the sensitivity of our alert systems, and while there have been one or two cases over the last month where we responded more slowly than we target, we rapidly fixed the alerting in those cases. Overall, we feel good about our ability to notice and respond to problems.

However, while we’ve improved the consistency of our response and communication, we have a lot of room to work on the quality of our updates. For example, this incident on our status page is unfortunately typical of a lot of our incident comms. For the record, that particular incident was a bug with our billing systems that prevented customers from subscribing to a plan with us, and had no impact on running applications. But our update gave no indication that that’s what was going on, and someone reading our status page would be justified in concluding that we were suffering unexpected downtime or some other severe problem.

We know that being able to understand what’s going on with our systems is important to you all: It helps you know what to expect and plan accordingly. We know that new customers to Bubble often check our status page, and that it’s frustrating for agencies and freelancers when their prospects see a messy status page with lots of unexplained short incidents.

Short-term, we are working to improve our incident communications by overhauling the way we let you know what is going on when there’s an issue, both in terms of the process we follow and the tools we use. We’re currently evaluating different tooling, and we aim to make a decision and roll it out to the team over the next month.

Longer-term, we want to continue improving our alerting. There are still classes of bugs and issues that we rely on bug reports to detect, and we want to move toward a world where we catch these problems automatically: ideally before deploy, and if they make it into production, immediately afterward, rather than relying on human reporting.

Wrapping up

We are committed to being transparent about the challenges we face as a platform and what we are doing to address them. We know how impactful Bubble’s stability is to your businesses and livelihoods, and we owe it to you to be direct about how things are going and what we are doing about it. Each investment in reliability, scalability, and performance we make for one customer compounds for all our customers, and we are working to outpace traditional software development in terms of what we can offer as a hosting platform. We are on the cutting edge of what is possible with no-code, both in terms of the complexity of apps and the number and size of customers. Being on that cutting edge brings unique challenges, but also unique opportunities to help our customers outperform their peers. We know your success is our success, and we will continue to push hard to make Bubble stable, scalable, and powerful.

Back to work,

Josh and Emmanuel

73 Likes

Thank you for being proactive and highly detailed with this response instead of waiting for the next outage. :raised_hands: I’m sure it’s been a hell of a last couple weeks for the whole team.

20 Likes

Thanks a lot for a good update! Hopefully you will be able to tackle this soon. Sounds like quite a difficult challenge!

4 Likes

The Bubble community really needed a post like this to restore confidence in the platform.

10 Likes

Thanks for sharing this info with us.

Indeed, there have been many performance improvements in the last year.

4 Likes

Thank you for actively working on this. I imagine some new errors and downtimes will happen as you make these necessary changes but hopefully it’s for the better long term.

2 Likes

It’s a lot more reassuring for me to read technical explanations like these. As a customer and a community member who still sees a bright future in Bubble, I always want to know what happens behind the curtain.

Though I’m a bit concerned about the systems in place for resource limiting. I understand the need for them in terms of stability, but I hope this doesn’t become a chokepoint for main-cluster apps in the future (speed is definitely great right now). The upgrade to WU was also supposed to allow us more server resources in our non-dedicated apps.

9 Likes

@josh

Since you are diversifying by creating new instances, is there any chance one of those new instances could be in Australia?

3 Likes

@josh It’s amazing, thanks for sharing it. Recently @fede.bubble helped us out, making everything clear and sharing some of our frustrations with Bubble’s team.

I would like to suggest an idea. Maybe it’s possible, maybe not:

What if, like the undeletable “index”, “reset_pw”, and “404” pages, you allowed us to build a page like “under maintenance” (or something similar) that would automatically be shown to our users whenever Bubble is offline?

It would be useful because, when Bubble is offline, our users don’t understand what’s happening and my support center suddenly becomes crowded.

That way, we could use our own language, our own design, and our own way of communicating that something isn’t right.

Once again, thanks for letting us know about it.

26 Likes

@josh Thanks for the highly detailed and transparent update. I cannot overstate how valuable it is to have this sort of update, and I would encourage you to do them earlier and more regularly. Folks paying for development and potential investors may not be technically savvy (although many are), but most know how to use Google and ChatGPT, so they can confirm fairly quickly that the recent performance is not what one should expect or accept for an extended period of time. Your post provides invaluable ammunition to convince them that the recent instability is a temporary blip and that you have a clear line of sight on how to fix things.

I applaud your decision to split up the main cluster into multiple shared environments. I am convinced that the increased costs associated with this move will prove to be a very wise investment in the long run. Hopefully the challenges of managing multiple clusters will not create another huge headache for you and your team.

Best of luck to you and your team with the implementation of what sound like very challenging changes!

2 Likes

Thank you so much for your transparency. It really means a lot. I appreciate having an answer for the outages my customers faced just two days after our launch! Good luck, it sounds like you know the issues and are working to fix them. Thank you Bubble Team!

1 Like

It’s always good to have this sort of communication from you or Emmanuel.
It reassures us as developers, but also our clients, who ultimately are the ones who decide which no-code platform they want to use for their projects.

If they are reassured about Bubble’s ability to create a brighter future, they’ll continue to root for Bubble.

3 Likes

Hi, I don’t know if it’s related to a bug fix or an update, but I’ve never seen my Bubble application be so fast… Hopefully it continues!

6 Likes

Same here!

2 Likes

I’ve been preaching this functionality for quite some time. This would be a nice feature.

3 Likes

Do you agree that today it’s back to slow loading? I think it was just a short-lived miracle yesterday :sweat_smile:

7 Likes

If you are experiencing issues with Bubble today, please share here: Bubble having some issues today? May 20