Hi all,
Right now, I see my most important job as ensuring improvements in Bubble’s stability and reliability, so I’m planning to post updates on platform stability once a week for the next month. This is a continuation of the update I posted last Friday.
As anyone who’s worked with engineers to build and scale a web application knows, bugs and downtime come with the territory. When you’re programming that application traditionally, your engineers have to constantly reinvent the wheel, dealing with the same challenges and outages over and over again. One of Bubble’s main value propositions is that when we make an improvement for one customer, whether in robustness, performance, or scalability, that improvement then benefits the rest of our customers. The idea is that we should be able to give you instant scale, robustness, and performance without the millions of dollars you’d otherwise spend building them yourself.
Our platform is hosting more customers, at more scale, than we ever have in the past. But our stability lately has lagged behind our scale, and it doesn’t matter how feature-rich, performant, or flexible Bubble is if we can’t provide a consistent experience. We have some catch-up to do here, and we are actively working to make that happen.
I firmly believe that we are going through a temporary rough patch and that we’ll be on the other side of it in the near future. The goal of these weekly updates is to give you transparency into our efforts: We know transparency is the only basis we can offer for sticking with us through this rockiness and coming out the other side.
Key improvements we’ve made since our last update
Our biggest focus over the last week has been closing loopholes in our rate limiting. Rate-limiting gaps have been a factor in almost all of the database-related downtime we’ve experienced over the last month, so we put them at the top of our list to address. I’m pleased to report that we’ve finished our work on these loopholes: We fixed some bugs in our current implementation, and we added a brand-new set of controls as a secondary safety layer.
Since we rolled out those changes, we’ve tested extensively and have not been able to find holes in our defenses. I won’t declare mission accomplished until these changes have been live for several months without similar incidents occurring, but we are now ready to focus on other opportunities to improve stability.
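To illustrate what we mean by a secondary safety layer in general terms, here is a minimal sketch of layered rate limiting: a per-app limit handles the normal case, and a coarser global cap backstops anything that slips past it. This is a hypothetical, simplified example, not Bubble’s actual implementation, and the limits and window size are made-up values.

```python
import time
from collections import defaultdict

# Illustrative limits only; not Bubble's real thresholds.
PER_APP_LIMIT = 100      # requests allowed per app per window
GLOBAL_LIMIT = 10_000    # requests allowed across all apps per window
WINDOW_SECONDS = 60

class LayeredRateLimiter:
    """Fixed-window limiter with a per-app layer and a global backstop.

    The per-app layer handles the normal case; the global layer exists so
    that a gap in the first layer can't take the whole system down.
    """

    def __init__(self):
        self.window_start = time.monotonic()
        self.per_app_counts = defaultdict(int)
        self.global_count = 0

    def _maybe_reset_window(self):
        now = time.monotonic()
        if now - self.window_start >= WINDOW_SECONDS:
            self.window_start = now
            self.per_app_counts.clear()
            self.global_count = 0

    def allow(self, app_id: str) -> bool:
        self._maybe_reset_window()
        if self.per_app_counts[app_id] >= PER_APP_LIMIT:
            return False  # primary layer: per-app limit
        if self.global_count >= GLOBAL_LIMIT:
            return False  # secondary layer: global safety cap
        self.per_app_counts[app_id] += 1
        self.global_count += 1
        return True

# Usage: call limiter.allow(app_id) before doing expensive database work.
```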
Our secondary focus has been moving off the outdated stored procedure system that I mentioned in my last update. We’ve also made great progress here: One of Bubble’s key database operations is nearly ready to move off that outdated system in production, and we expect to move two other key operations off next week. At that point, there will still be some less critical, lower-risk operations to address, but we’ll have mitigated the vast majority of the risk here.
We’ve also significantly increased the free memory in our redis caching cluster. This work has been ongoing since April, and the project wrapped up this week. Unfortunately, the final rollout was rocky and caused the redis incident on Monday (which I’ll discuss more below). That said, our redis cluster is in a much more stable place today, and while there are some follow-up projects we’d like to do, we now have plenty of headroom to continue scaling.
Speaking of headroom, we’ve added a meaningful amount of infrastructure capacity to the database that stores applications (we call it “appserver” internally). Its previous, more limited capacity caused some of the issues two weeks ago, as well as the issues we experienced Tuesday. More on that below as well. I’m pleased to report that appserver is now considerably more stable, and we are working on a long-term plan to ensure we can continue scaling it.
Other initiatives we are actively working on:
- The system for instant rollbacks I mentioned in my previous update is code complete. We are now testing and validating that it works as intended, targeting deployment next week.
- We are vetting vendors to help improve our incident communications system.
- As mentioned in my previous update, during the DDoS attacks two weeks ago, we identified a codepath that puts load on one particular redis shard. We prioritized the other initiatives listed above because we think the risk of an immediate recurrence is low, but we plan to pick this back up later in June.
Incidents since my last update
I want to continue to be transparent about incidents and downtime that we experience and inform you about how they’re affecting project prioritization in real time. Since my last update, we have had three incidents published to our status page:
Monday, May 20
Degraded performance on some apps. We saw intermittent slowdowns and downtime (roughly every 10–20 minutes) affecting many main-cluster applications over a period of four hours. The cause of this incident was the redis memory improvement work mentioned above. We ran into undocumented behavior in redis replication where, under certain circumstances, freeing up large quantities of memory over a short time period can lead to a crash loop: During the loop, redis periodically tries to replicate the changes to its backup server, but the replication fails, causing a server crash and failover. This problem only occurs at scale, with the extremely high volumes of data we were attempting to clean up, which is why we did not anticipate it or catch it in testing. We resolved the problem by temporarily pausing the replication process. In terms of future roadmap implications, we think this problem is very unlikely to recur because it was specific to the large data cleanup we did. Our main takeaway is to perform large data operations in a staggered, spaced-out manner to minimize the risk of running into this kind of unknown unknown.
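To make that takeaway concrete, here’s a minimal sketch of a staggered cleanup: keys are freed in small batches with pauses in between so replication can keep up, rather than in one enormous burst. It’s written against redis-py, and the key pattern, batch size, and pause length are illustrative assumptions rather than the actual values or tooling we use.

```python
import time

import redis

# Hypothetical connection details; not our actual cluster configuration.
r = redis.Redis(host="localhost", port=6379)

BATCH_SIZE = 500        # keys to free per batch (illustrative)
PAUSE_SECONDS = 2.0     # pause between batches so replication can keep up

def staggered_cleanup(pattern: str) -> int:
    """Delete keys matching `pattern` in small, spaced-out batches.

    UNLINK frees memory asynchronously on the server, and the pauses give
    replication a chance to catch up between batches instead of forcing
    one enormous burst of freed memory.
    """
    deleted = 0
    batch = []
    for key in r.scan_iter(match=pattern, count=BATCH_SIZE):
        batch.append(key)
        if len(batch) >= BATCH_SIZE:
            deleted += r.unlink(*batch)
            batch.clear()
            time.sleep(PAUSE_SECONDS)
    if batch:
        deleted += r.unlink(*batch)
    return deleted

# Example: clean up an (illustrative) family of stale cache keys.
# staggered_cleanup("stale-cache:*")
```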
Tuesday, May 21
Issues with main Bubble cluster. The symptoms of this issue were very similar to the above: intermittent slowdowns and downtime affecting main-cluster apps. This time, however, the root cause was completely different: Appserver, mentioned above, hit disk-usage limits and started experiencing intermittent latency spikes. We knew that there were potential stability risks with appserver, as it was implicated in some of the downtime we experienced on May 13th, so we had been investigating its performance and health. However, we had a limited number of engineers focusing on it because we felt that getting the rate-limiting loopholes fixed was higher priority and likely to increase appserver’s stability in general.
In hindsight, I’m not sure I would have made the same prioritization call, and I take responsibility for the miss. That said, when problems began on Tuesday, the team swung into gear and implemented fixes on a number of levels: temporarily disabling high-load queries (including certain very large merges in our version control system, which we re-enabled Wednesday morning), scaling out the infrastructure for the database, and shepherding some background maintenance operations through the system. We are working on better monitoring to give us a clearer danger signal going forward and a medium-term plan for continuing to scale the system as the number of apps hosted on Bubble grows.
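As one example of the kind of clearer danger signal we have in mind, a simple periodic disk-headroom check can warn well before a database host hits its limits. The sketch below uses made-up thresholds and a placeholder data directory; it illustrates the idea rather than describing our actual monitoring stack.

```python
import shutil

# Illustrative values; not our real thresholds or data directory.
DATA_DIR = "/var/lib/database"
WARN_AT = 0.70   # warn when 70% of the disk is used
PAGE_AT = 0.85   # page the on-call engineer at 85%

def check_disk_headroom(path: str = DATA_DIR) -> str:
    """Return 'ok', 'warn', or 'page' based on how full the disk is."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction >= PAGE_AT:
        return "page"
    if used_fraction >= WARN_AT:
        return "warn"
    return "ok"

# A scheduler (cron, etc.) would run this periodically and route
# 'warn'/'page' results into the alerting system of choice.
```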
Wednesday, May 22
Unexpected behavior with API CORS. For a period of about 40 minutes, API calls to Bubble apps that rely on CORS (cross-origin resource sharing) settings were failing. Most apps do not depend on this feature and were unaffected. The outage was the result of a bug that we accidentally shipped as part of an important workstream to make our API layer easier for us to maintain, test, and monitor. The affected feature was built many years ago and was not documented or tested in our code. The existence of features like these is part of the motivation for this project, and we are proceeding as carefully as we can to safely identify, document, and test them: We regret that we missed this case.
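For readers less familiar with CORS: browsers only allow a cross-origin API call from a web page if the server answers with the right Access-Control-* headers, typically including a response to a preflight OPTIONS request. The sketch below is a generic, standard-library illustration of that handshake; the allowed origin and methods are placeholders, and it does not reflect how Bubble’s API layer is actually built.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_ORIGIN = "https://example-frontend.com"  # placeholder origin

class ApiHandler(BaseHTTPRequestHandler):
    def _cors_headers(self):
        # If these headers are missing, browsers block the cross-origin
        # call even though the API itself is healthy, which is what a
        # CORS regression looks like from the outside.
        self.send_header("Access-Control-Allow-Origin", ALLOWED_ORIGIN)
        self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Preflight request: the browser asks permission before the real call.
        self.send_response(204)
        self._cors_headers()
        self.end_headers()

    def do_GET(self):
        self.send_response(200)
        self._cors_headers()
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"status": "ok"}')

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ApiHandler).serve_forever()
```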
Looking forward
Right now, our focus is on fixing the tactical root causes of all recent infrastructure incidents as fast as possible to create a dramatic short-term improvement to overall platform stability. Once we are caught up on our short-term queue of work, we plan to shift focus toward longer-term projects that make our infrastructure fundamentally more resilient, such as:
- Multiple shared environments, so that infrastructure incidents are contained to a small subset of our users, and so that we can easily scale out as the number of people using Bubble grows (see the sketch after this list)
- Transformation of our primary user database to allow for multiple backends, which we believe will make manipulating and querying data on Bubble blazing fast, up to scales that a single postgres instance can’t match
- Building out a robust QA environment with real production data, to allow testing infrastructure changes more safely
- De-coupling the Bubble runtime code from our infrastructure code, which would allow us to run different apps on different versions of our codebase even on shared clusters
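To give a flavor of the first item above, here is a purely hypothetical sketch of assigning each app to one of several independent shared environments by hashing its ID, so that an incident in one environment only touches the apps that live there. The environment names are placeholders, and a real design would likely use consistent hashing so that adding an environment doesn’t move existing apps.

```python
import hashlib

# Placeholder environment names; a real topology would look different.
ENVIRONMENTS = ["env-a", "env-b", "env-c", "env-d"]

def environment_for_app(app_id: str) -> str:
    """Deterministically map an app to one shared environment.

    Because the mapping is stable, an infrastructure incident in one
    environment only affects the subset of apps hashed into it. (A real
    design would prefer consistent hashing so that adding environments
    doesn't reshuffle existing apps.)
    """
    digest = hashlib.sha256(app_id.encode("utf-8")).hexdigest()
    return ENVIRONMENTS[int(digest, 16) % len(ENVIRONMENTS)]

# Example:
# environment_for_app("my-bubble-app")  # -> e.g. "env-b"
```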
These projects will not only increase stability, but also move Bubble closer to our vision: out-of-the-box performance, scalability, security, and robustness that founders would otherwise have to invest tens of millions of dollars to match.
We are committed to getting there ASAP, and we appreciate all your support.
— Josh and Emmanuel