Page data not updating

Hello, Bubble Community! :wave: My name is Payam (pronounced pie um, he/him), and I am the new Director of Engineering for Platform at Bubble. My organization has a key role in helping scale our services and improve their quality. I bring a deep background in service reliability spanning several industries, including work at internet-scale companies. I’m honored to have this role because I truly believe in Bubble’s mission.

First, I appreciate you sharing with us both the pain that this recent outage has caused for you and your businesses and the helpful details about how it manifested. I acknowledge that it may have felt like your pain was not being heard. I regret that our goal of effective communication was not met in this case. In addition to taking steps to improve that communication, I’d like to share some more details about the nature of the issue you experienced.

One of the big advantages of Bubble’s offering is that you don’t have to worry about the infrastructure that runs your applications. Here we fill a need similar to Platform-as-a-Service (PaaS) offerings from major cloud providers, such as AWS Lambda or Google App Engine, but in a way that is seamless and transparent to our users. Providing a comparable level of coverage is an ambitious goal that we pursue zealously.

One of the key challenges of maintaining stability in a large PaaS like ours is handling unpredictable usage and load from customers. In my three weeks here, I’ve been surprised and delighted by the ways people are using our product. We employ a variety of techniques to manage this load reliably, such as queuing incoming requests, automatically retrying in the event of failure, pooling resources to absorb sudden increases in load, and flushing requests that threaten the stability of the system. And while we offer dedicated infrastructure for customers who want it, we still maintain some dependencies on Bubble components that are global, such as our notifier system.
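
To make those techniques concrete, here is a simplified, purely illustrative sketch (the names, limits, and structure are made up for this post, not taken from our codebase) of request queuing with load shedding and automatic retries:

```python
import random
import time
from queue import Full, Queue

# Hypothetical sketch of the kinds of techniques described above; the names
# and limits are illustrative, not Bubble's actual code or values.

class TransientError(Exception):
    """Stand-in for a retryable failure such as a timeout or connection reset."""

MAX_QUEUE_DEPTH = 1000   # beyond this depth, new requests are flushed (shed)
MAX_RETRIES = 3          # automatic retries in the event of transient failure

request_queue = Queue(maxsize=MAX_QUEUE_DEPTH)

def enqueue_request(request):
    """Queue incoming work; flush it if the queue is full, protecting stability."""
    try:
        request_queue.put_nowait(request)
        return True
    except Full:
        return False  # request shed rather than allowed to destabilize the system

def process_with_retries(request, handler):
    """Retry transient failures with exponential backoff plus jitter."""
    for attempt in range(MAX_RETRIES):
        try:
            return handler(request)
        except TransientError:
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("request failed after retries")
```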

The failure with workloads last night pertains to the latter: the system attempted to cope with a sudden increase in load from a particular application. This led to a global degradation of service, which explains the intermittent successes, and why data would refresh upon reload but not automatically as expected. The severity of the degradation did not cross the threshold we had set to trigger alerts. The tradeoff here is that if alerts are too sensitive, we risk continuously disrupting our teams in a way that prevents us from making Bubble better; if they are too insensitive, we fail to catch system issues in a timely manner.

Once the issue was fully recognized, we needed to determine whether the high load on the system was due to external circumstances, such as a DDoS attack, or internal circumstances, such as bugs in our scaling logic. We created additional monitoring on the fly to help answer this question. We determined that the issue was internal: when we allocate additional resources to cope with increased load, those resources go through an onlining process. For example, they have to connect to a database, discover their workloads, resolve other system dependencies, and so on. In this way, the onlining of additional resources itself created a new scaling problem.
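
As a rough illustration of why this becomes a problem (again a made-up sketch, not our actual code), every newly provisioned instance runs startup work against the same shared services, so bringing many instances online at once turns the new capacity into its own load spike:

```python
# Hypothetical sketch of the onlining steps a newly provisioned instance
# performs; the stub functions stand in for real startup work.

def connect_to_database():
    return "db-connection"                 # in reality: open connections to a shared database

def discover_workloads(db):
    return ["workload-a", "workload-b"]    # in reality: query for the work this instance should own

def resolve_dependencies(workloads):
    return {"notifier": "ok"}              # in reality: look up other shared system dependencies

def online_instance(instance_id):
    """Startup work every new instance must finish before it can serve traffic."""
    db = connect_to_database()
    workloads = discover_workloads(db)
    deps = resolve_dependencies(workloads)
    return (instance_id, workloads, deps)

# Bringing many instances online at the same moment multiplies this startup
# work against the same shared services -- the new capacity is itself a spike.
fleet = [online_instance(i) for i in range(500)]
```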

To stabilize the system, we took a series of steps that methodically reduced the rate at which additional resources came online, provisioning them gradually over time instead of all at once to meet the sudden demand. This class of problem is one experienced mainly by the largest technology companies in the world: Facebook(1) and Robinhood(2) are among the companies that have experienced significant outages in the past resulting from the thundering herd problem. The idea is that in responding to a stampede of user requests, the system creates a stampede of its own, which poses its own problems.
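
Here is a simplified sketch of that mitigation, reusing the hypothetical online_instance() from the sketch above: pacing how quickly new resources come online, with a little jitter, turns one large spike of startup work into a steady trickle that shared dependencies can absorb. The rate and jitter values are illustrative only.

```python
import random
import time

def online_instance(instance_id):
    return instance_id          # stub; see the earlier onlining sketch

INSTANCES_PER_SECOND = 5        # cap on how quickly new resources come online

def scale_up_gradually(instance_ids):
    """Bring instances online over time instead of all at once."""
    online = []
    for instance_id in instance_ids:
        online.append(online_instance(instance_id))
        # Pace the provisioning and add jitter so instances don't synchronize.
        time.sleep(1.0 / INSTANCES_PER_SECOND + random.uniform(0, 0.1))
    return online

fleet = scale_up_gradually(range(20))
```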

Having addressed the thundering herd problem, we have several actions remaining:

  1. Like many modern tech companies, Bubble uses SLOs (Service Level Objectives) to monitor whether our users are able to do the things they need to do, and whether components of the system are meeting the success rates we expect. We are actively working on improving the extent and tuning of our SLOs so that in the future we can catch this kind of problem as soon as it starts affecting users (see the sketch after this list for the general idea).

  2. The thundering herd problem has a known set of causes, and we have identified the key areas we believe caused it in this case. We have made some enhancements to prevent it from happening in the future.

  3. We are currently evaluating multiple candidate approaches to make the affected subsystem much more resilient to sudden increases in load, and to provide greater isolation between Bubble applications so that high load from one application cannot spill into the rest of the system.
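
To give a feel for item 1 (a minimal sketch with made-up numbers, not our actual objectives or tooling): an SLO-based alert compares the error budget burned in a measurement window against a threshold, and tuning that threshold is exactly the sensitivity tradeoff described earlier.

```python
# Minimal sketch of SLO-based alerting; the target and counts are illustrative.
SLO_TARGET = 0.999   # e.g. 99.9% of requests in a window should succeed

def error_budget_burned(successes, total):
    """Fraction of the window's error budget consumed by failures."""
    if total == 0:
        return 0.0
    allowed_failures = (1 - SLO_TARGET) * total
    actual_failures = total - successes
    return actual_failures / allowed_failures if allowed_failures else float("inf")

def should_alert(successes, total, burn_threshold=1.0):
    """Raising burn_threshold makes alerts less sensitive (fewer disruptions);
    lowering it catches degradations sooner -- the tradeoff described above."""
    return error_budget_burned(successes, total) > burn_threshold

# Example: 100,000 requests in the window, 99.95% of them succeeded.
print(should_alert(successes=99_950, total=100_000))   # False at the default threshold
```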

My vision for Bubble includes both identifying system issues before they start to impact customers and creating mechanisms that automatically heal them. The approach we plan to take is one where we are consistently making gains in our service reliability. I’m blessed to have an incredibly talented team who shares this vision and is working tirelessly to achieve it.

Thank you for the trust that you’ve placed in us. I am eager to get to know the community more.

Best,
Payam
