Page data not updating

Seems to have been resolved for me just now.

This issue has been happening for 15 or so hours now. It would be nice for Bubble to have a system in place that alerts someone and updates the status as soon as a problem is first detected. It's all well and good swinging into action once it is morning in the US, but that doesn't undo the hours of downtime during the US night.

1 Like

Help, this is not acceptable. What is going on???

Not here… I am on the Growth and Professional plans

All my RGs are not working, and this is causing all kinds of problems in my database

How can I know whether this is being handled or looked at by the team, and when it is solved?

@nabil2… you can get info in this ongoing thread as well as here.

2 Likes

My Bubble version-test has stopped working… can't implement a fix :frowning:

Anyone have a solution? The only way to get updates seems to be using the API Connector and backend workflows :confused:

The silence from the Bubble team during a debilitating outage like this is… deafening.

5 Likes

Really don't get this…

I only got this:

Hello,

Thank you for reaching out about your concern regarding your workflows. I would like to let you know that our team has received and taken a preliminary look into this.

To best assist you, I am escalating your report to our Tier 2 support team. This team works Monday-Friday, 9 AM - 6 PM EST. You will hear back from us as promptly as possible, with an update or any additional questions from our end.

I appreciate your patience here – given the nature of your concern, this is truly the best team to get you the most thorough answer.

FYI…

1 Like

The real question now is: are they actually solving the problem, or just alleviating it until it happens again, like before?

I feel like this kind of problem without any solutions in sight makes us developers lose credibility and trust from our clients.

I am going to stop relying on real-time updating in all the critical parts of my app… 3rd time in like a month

1 Like

[image: meme captioned "accident" / "bubble engineers"]

Not criticising the engineers, btw. I have no doubt it's the fault of bad resource allocation and pushing people to work on new features without making sure that what we already have works.

1 Like

Well, at least this gave me the motivation to explore Xano, so I know how data updates will behave on any given day…

FYI…

resolved for 7 days

2 Likes

Hello, Bubble Community! :wave: My name is Payam (pronounced pie um, he/him), and I am the new Director of Engineering for Platform at Bubble. My organization has a key role in helping scale our services and improve their quality. I bring a deep background in service reliability that has spanned several industries, including internet-scale companies. I'm honored to have this role because I truly believe in Bubble's mission.

First, I appreciate you sharing with us both the pain that this recent outage has caused for you and your businesses, and the helpful details about how it manifested. I acknowledge that it may have felt that your pains were not being heard, and I regret that our goal of effective communication was not met in this case. In addition to taking steps to improve that communication, I'd like to share with you some more details about the nature of the issue that was experienced.

One of the big advantages of Bubble's offering is that you don't have to worry about the infrastructure that runs your applications. Here we fill a similar need as other Platform-as-a-Service (PaaS) offerings from major cloud providers, such as AWS Lambda or Google App Engine, but in a way that is seamless and transparent to our users. Providing a comparable level of coverage to theirs is an ambitious goal that we approach zealously.

One of the key challenges of maintaining system stability in a large PaaS service like ours is handling unpredictable usage and load from customers. In my three weeks here, I've been surprised and delighted by the ways people are using our product. We employ a variety of techniques to manage this load reliably, such as queuing request volume, automatically retrying in the event of failure, pooling resources to address sudden increases in load, and flushing requests that threaten the stability of the system. And while we offer dedicated infrastructure for customers who want it, we still maintain some dependencies on Bubble components that are global, such as our notifier system.
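For readers who want a concrete picture of two of these techniques, here is a minimal Python sketch of automatic retries with exponential backoff plus jitter, and of dropping requests when a backlog threatens stability (what the paragraph above calls flushing). The names and limits are assumptions for illustration; this is not Bubble's actual implementation.

```python
import random
import time
from collections import deque

# Hypothetical illustration of two load-management techniques named above:
# retries with exponential backoff + jitter, and shedding excess load.
# Not Bubble's code; queue depth and delays are invented numbers.

MAX_QUEUE_DEPTH = 1_000          # beyond this, shed rather than destabilize
request_queue: deque = deque()   # pending work waiting for a worker

def enqueue_or_shed(request) -> bool:
    """Queue the request, or reject it if the backlog threatens stability."""
    if len(request_queue) >= MAX_QUEUE_DEPTH:
        return False             # shed: fail fast instead of melting down
    request_queue.append(request)
    return True

def call_with_retries(operation, attempts: int = 5, base_delay: float = 0.1):
    """Retry a flaky operation with exponential backoff and random jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Jitter spreads retries out so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

The jitter is the important part: without it, every failed client retries at the same instant and simply recreates the original spike.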

The nature of the failure with workloads last night pertains to the latter: the system attempted to cope with a sudden increase in load from a particular application. This led to a global degradation of service, which explains why some requests succeeded intermittently, and why data would refresh upon reload but not automatically as expected. The severity of the degradation did not cross the threshold we had set to trigger alerts. The tradeoff here is that if alerts are too sensitive, we risk continuously disrupting our teams in a way that prevents us from making Bubble better; if they are too insensitive, we fail to catch system issues in a timely manner.
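To make that sensitivity tradeoff concrete, here is a hypothetical sketch of a fixed-threshold alert. The 20% figure and the function name are invented; the point is that a degradation that breaks auto-updates for many users can sit well below such a threshold and never page anyone.

```python
# Invented numbers, for illustration of the alert-threshold tradeoff only.

ALERT_ERROR_RATE = 0.20   # page on-call only above 20% failed requests

def should_page(failed: int, total: int) -> bool:
    """Fire an alert only when the error rate crosses the fixed threshold."""
    return total > 0 and (failed / total) > ALERT_ERROR_RATE

# A degradation failing, say, 8% of requests never crosses the line,
# so nobody is paged even though many users are affected:
print(should_page(failed=80, total=1000))   # False -> silent degradation
print(should_page(failed=300, total=1000))  # True  -> on-call is paged
```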

Once the issue was fully recognized, we needed to determine whether the high load on the system was due to external circumstances, such as a DDoS attack, or internal circumstances, such as bugs in our scaling logic. We created additional monitoring on the fly to help us answer this question. We determined that the issue was internal: when we allocate additional resources to cope with increased load, those resources go through an onlining process. For example, they have to connect to a database, discover their workloads, resolve other system dependencies, and so on. In this way, the onlining of additional resources itself created a new scaling problem.
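One common way to keep such an onlining process from stampeding shared dependencies is to add random jitter before each new worker touches them. The sketch below is generic and uses placeholder function names passed in as arguments; it is not a description of Bubble's internals.

```python
import random
import time

# Generic illustration of jittered onlining: each new worker waits a random
# amount of time before touching shared dependencies, so a burst of new
# capacity does not hit the database and service discovery all at once.
# Function names are placeholders, not Bubble internals.

MAX_STARTUP_JITTER_S = 30.0

def online_worker(connect_db, discover_workloads, resolve_dependencies):
    """Bring one worker online, spreading dependency calls out in time."""
    time.sleep(random.uniform(0, MAX_STARTUP_JITTER_S))  # de-synchronize startups
    db = connect_db()                 # shared resource: connect gently
    work = discover_workloads(db)     # find the apps this worker will serve
    resolve_dependencies(work)        # resolve other system dependencies
    return work
```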

To stabilize the system, we took a series of steps that methodically reduced the rate at which additional resources came online, provisioning them over time instead of all at once to meet the all-at-once demand. This class of problem is one experienced mainly by the largest consumers of technology in the world: Facebook(1) and Robinhood(2) are among the companies that have experienced significant outages in the past resulting from the thundering herd problem. The idea is that in order to respond to a stampede of user requests, the system creates its own stampede, which poses its own problems.
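The remediation described here, bringing capacity up gradually rather than all at once, can be pictured as a simple batched rollout loop. The batch size and pause below are invented numbers, and the sketch assumes a hypothetical `start_one_worker` callable; it is only an illustration of the idea.

```python
import time

# Sketch of provisioning capacity in small, spaced-out batches instead of
# launching everything at once, a standard way to avoid creating a second
# stampede while recovering from the first. Parameters are invented.

def provision_gradually(total_needed: int, start_one_worker,
                        batch_size: int = 5, pause_s: float = 10.0):
    """Bring `total_needed` workers online in small, spaced-out batches."""
    started = 0
    while started < total_needed:
        batch = min(batch_size, total_needed - started)
        for _ in range(batch):
            start_one_worker()          # each worker then onlines with jitter
        started += batch
        if started < total_needed:
            time.sleep(pause_s)         # let shared dependencies absorb the batch
```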

Having addressed the thundering herd problem, we have several actions remaining:

  1. Like many modern tech companies, Bubble uses SLOs (Service Level Objectives) as a tool to monitor whether our users are able to do the things they need to do, and whether components of the system are achieving the success rate we expect. We are actively working on improving the coverage and tuning of our SLOs so that in the future we can catch this kind of problem as soon as it starts affecting users (a rough sketch of this style of alerting follows this list).

  2. The thundering herd problem has a known set of causes, and we have identified the key areas we believe caused it in this case. We have made some enhancements to prevent it from happening in the future.

  3. We are currently evaluating several candidate approaches to make the particular subsystem affected much more resilient to sudden increases in load, as well as to provide greater isolation between Bubble applications so that high load from one application cannot spill into the rest of the system (a simple sketch of that kind of isolation also follows this list).
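As referenced in item 1, one common way to make SLO-based alerting both sensitive and quiet is a multi-window burn-rate check. The sketch below is a generic illustration with an assumed 99.9% target and a textbook threshold, not Bubble's monitoring configuration.

```python
# Hypothetical multi-window burn-rate check on an availability SLO.
# Target, threshold, and window semantics are illustrative only.

SLO_TARGET = 0.999   # 99.9% of requests should succeed

def burn_rate(failed: int, total: int) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - SLO_TARGET)

def should_alert(short_failed: int, short_total: int,
                 long_failed: int, long_total: int,
                 threshold: float = 14.4) -> bool:
    """Alert only when both a short and a long window are burning budget fast,
    which stays sensitive to real incidents but quiet on brief blips."""
    return (burn_rate(short_failed, short_total) > threshold and
            burn_rate(long_failed, long_total) > threshold)
```

And as referenced in item 3, per-application isolation is often implemented as a bulkhead: each application gets its own concurrency cap so one app's spike cannot exhaust shared capacity. The limit and names below are hypothetical.

```python
import threading

# Sketch of a per-application bulkhead: admit a request only if that app is
# under its own concurrency cap. Limit and names are made up for illustration.

PER_APP_LIMIT = 50
_app_slots: dict[str, threading.Semaphore] = {}
_lock = threading.Lock()

def try_acquire_slot(app_id: str) -> bool:
    """Admit the request only if this app is under its own concurrency cap."""
    with _lock:
        sem = _app_slots.setdefault(app_id, threading.Semaphore(PER_APP_LIMIT))
    return sem.acquire(blocking=False)

def release_slot(app_id: str) -> None:
    """Return the slot once the request finishes."""
    _app_slots[app_id].release()
```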

My vision for Bubble includes both identifying system issues before they start to impact customers and creating mechanisms to automatically heal them. The approach we plan to take is one where we are consistently making gains in our service reliability. I'm blessed to have an incredibly talented team who shares this vision and is working tirelessly to achieve it.

Thank you for the trust that you've placed in us. I am eager to get to know the community more.

Best,
Payam

20 Likes