Hello, Bubble Community! My name is Payam (pronounced pie um, he/him), and I am the new Director of Engineering for Platform at Bubble. My organization plays a key role in helping scale our services and improve their quality. I bring a deep background in service reliability spanning several industries, including work at internet-scale companies. I’m honored to have this role because I truly believe in Bubble’s mission.
First, I appreciate you sharing both the pain this recent outage has caused for you and your businesses and the helpful details about how it manifested. I acknowledge that it may have felt like your concerns were not being heard, and I regret that we did not meet our goal of effective communication in this case. In addition to taking steps to improve that communication, I’d like to share some more details about the nature of the issue you experienced.
One of the big advantages of Bubble’s offering is that you don’t have to worry about the infrastructure that runs your applications. Here we fill a similar need as other Platform-as-a-Service (PaaS) offerings from major cloud providers, such as AWS Lambda or Google App Engine, but in a way that is seamless and transparent to our users. Providing a comparable level of coverage to them is an ambitious goal that we approach zealously.
One of the key challenges of maintaining system stability in a large PaaS service like ours is handling unpredictable usage and load from customers. In my three weeks here, I’ve been surprised and delighted by the ways people are using our product. We employ a variety of techniques to manage this load reliably, such as queuing request volume, automatically retrying in the event of failure, pooling resources to absorb sudden increases in load, and flushing requests that threaten the stability of the system. And while we offer dedicated infrastructure for customers who want it, we still maintain some dependencies on Bubble components that are global, such as our notifier system.
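To make that a little more concrete, here is a deliberately simplified sketch of how queuing, retries, and shedding fit together. This is not our actual code; the names, limits, and error type below are invented purely for illustration:

```python
import queue
import random
import time

MAX_QUEUE_DEPTH = 1000   # illustrative bound on buffered request volume
MAX_RETRIES = 3          # illustrative retry budget per request


class TransientError(Exception):
    """Stand-in for a retryable failure (timeouts, dropped connections)."""


request_queue = queue.Queue(maxsize=MAX_QUEUE_DEPTH)


def enqueue_request(request):
    """Accept a request, or shed it if the system is already saturated."""
    try:
        request_queue.put_nowait(request)
        return True
    except queue.Full:
        # "Flushing"/shedding: refuse new work rather than destabilize
        # the whole system for everyone.
        return False


def handle_with_retries(handler, request):
    """Run a handler, retrying transient failures with jittered backoff."""
    for attempt in range(MAX_RETRIES):
        try:
            return handler(request)
        except TransientError:
            if attempt == MAX_RETRIES - 1:
                raise
            # Jitter spreads retries out so they don't arrive in lockstep.
            time.sleep(2 ** attempt + random.random())
```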
The failure with workloads last night pertains to the latter: the system attempted to cope with a sudden increase in load from a particular application. This led to a global degradation of service, which explains the intermittent successes, and why data would refresh upon reload but not automatically as expected. The level of degradation did not meet the threshold we had set to trigger alerts. The tradeoff here is that if the alerts are too sensitive, we risk continuously disrupting our teams in a way that prevents us from making Bubble better. On the other hand, if our alerts are too insensitive, we fail to catch system issues in a timely manner.
Once the issue was fully recognized, we needed to determine whether the high load on the system was due to external circumstances, such as a DDoS attack, or internal circumstances, such as bugs in our scaling logic. We then created additional monitoring on the fly to help us answer this question. We determined that the issue was internal: when we allocate additional resources to cope with increased load, those resources go through an onlining process. For example, they have to connect to a database, discover their workloads, resolve and discover other system dependencies, and so on. In this way, the onlining of the additional resources itself created a new scaling problem.
To stabilize the system, we took a series of steps that methodically reduced the rate at which additional resources came online, provisioning them over time instead of all at once to meet the all-at-once demand. This class of problem is one experienced mainly by the largest technology companies in the world: Facebook(1) and Robinhood(2) are among the companies that have experienced significant outages in the past resulting from the thundering herd problem. The idea is that in responding to a stampede of user requests, the system creates its own stampede, which poses its own problems.
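In rough pseudocode, the mitigation looks something like the sketch below. This is not our production code; the rates and names are purely illustrative of the idea of pacing and jittering the onlining of new capacity:

```python
import random
import time

ONLINE_RATE_PER_SECOND = 5   # illustrative cap on how fast capacity is admitted
MAX_JITTER_SECONDS = 2.0     # random spread so workers don't start in lockstep


def online_resources(pending_workers, bring_online):
    """Bring new workers online gradually rather than all at once."""
    interval = 1.0 / ONLINE_RATE_PER_SECOND
    for worker in pending_workers:
        # Pacing plus jitter prevents a synchronized burst of startup work
        # (database connections, workload discovery, dependency resolution)
        # from hitting shared components like the notifier all at once.
        time.sleep(interval + random.uniform(0, MAX_JITTER_SECONDS))
        bring_online(worker)
```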
Having addressed the thundering herd problem, we have several actions remaining:
- Like many modern tech companies, Bubble uses SLOs (Service Level Objectives) as a tool to monitor whether our users are able to do the things they need to do, and whether components of the system are achieving the success rate we expect. We are actively working on improving the coverage and tuning of our SLOs so that in the future we can catch a problem like this as soon as it starts affecting users (see the sketch after this list).
- The thundering herd problem has a known set of causes, and we have identified the key areas we believe caused it in this case. We have made some enhancements to prevent it from happening in the future.
- We are currently evaluating multiple candidate approaches for making the particular subsystem affected much more resilient to sudden increases in load, as well as providing greater isolation between Bubble applications so that high load from one application cannot spill into the rest of the system.
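To illustrate what the SLO tuning in the first item means in practice, here is a simplified sketch of alerting on how quickly an error budget is being burned rather than on a single coarse failure threshold. The numbers are invented and are not our real SLOs:

```python
# Illustrative sketch (invented numbers) of SLO burn-rate alerting: a partial
# degradation is caught because it consumes the error budget too quickly,
# even if it never crosses a "most requests are failing" style threshold.

SLO_TARGET = 0.999             # e.g. 99.9% of real-time updates delivered
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail per window


def burn_rate(failed, total):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    if total == 0:
        return 0.0
    return (failed / total) / ERROR_BUDGET


def should_page(failed, total, threshold=10.0):
    """Page the on-call engineer when the budget is burning 10x too fast."""
    return burn_rate(failed, total) >= threshold


# Example: 0.5% of updates failing burns the budget at roughly 5x, which is
# visible long before a coarse "majority of requests failing" alarm fires.
assert should_page(failed=50, total=10_000) is False
assert should_page(failed=150, total=10_000) is True
```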
My vision for Bubble includes both identifying system issues before they start to impact customers and creating mechanisms to automatically heal them. The approach we plan to take is one of consistently making gains in our service reliability. I’m blessed to have an incredibly talented team who shares this vision and is working tirelessly to achieve it.
Thank you for the trust that you’ve placed in us. I am eager to get to know the community more.
Best,
Payam
Welcome to the forum, @payam.azadi, you’re getting thrown in at the deep end! Thank you for a helpful post mortem.
Are there plans for a Bubble SLA which is probably overdue? It feels like one of those things where taking action by implementing an SLA at least shows that Bubble (and you personally) are confident in your abilities to deliver reliable service. If not, it’s nothing more than words with no substance.
I knew it, someone’s app broke Bubble again. Alright own up now so I can make a voodoo doll.
The funny part (to me, at least) is that I almost nuked Payam’s account because the system flagged his first post, and I have seen a couple of instances in the past of people trying to impersonate Bubble employees. Fortunately, I did some due diligence before turning my key, but yeah, my dumb ass almost blinked Bubble’s new Director of Engineering for Platform out of existence here in the forum before he even got started.
Anyway, welcome to the party, Payam!
Haha, well, I just granted Payam admin status, so going forward it should be more obvious he’s on the team! Thanks for your vigilance as always, @mikeloc
I’m very excited to welcome Payam – I view improving Bubble’s performance, reliability, and robustness as our most important business priority right now, and I see Payam as an amazing partner who three weeks in has already started transforming the way we think about managing our platform. Outages like this are really painful – they have a huge impact on the success of our customers and trust in Bubble’s platform – and I see a lot of opportunity for us to prevent them more proactively as well as respond to the ones that do happen more efficiently. I’m very much looking forward to seeing our continued evolution here!
All well and good, but it doesn’t change the fact that the bug first surfaced at about 8:30 PM EDT, and after messaging support we got a reply that it had been escalated and would be looked at between 9 and 5 EDT.
What steps are being put in place so that when something like this happens ‘out of hours’ for the US timezone, it can be rectified, so we aren’t just twiddling our thumbs, losing a full day of development, and serving a degraded app to a full day of users in our own timezone? It simply doesn’t showcase a reliable or global outlook for Bubble.
Maybe it was @jerseyikes accidentally taking down the main cluster again by scheduling 10,000,000,000 API workflows
Fortunately, my mistake came 5 days ago 🥹
Respectful response. My team and I appreciate your thoroughness and the explicit priority of reliability going forward from you and Josh.
Of course, what matters in the end is action. Still, we are happy you addressed this and hope to see the changes you are making create substantial improvements quickly.
Thanks for responding. So we should assume the real time might not always be real time?
@payam.azadi @josh
Thanks for the clarification on what happened and what’s being done to fix it.
As has been requested countless times before, can we please add a “Force RG to refresh” action? This not only helps us users handle outages like this in our apps, but also helps us make sure our apps reload data in environments where Bubble’s real-time updates have never worked (specifically mobile Safari, which has a cache system that breaks real-time after pages are inactive for 10+ minutes).
Previously, there was a workaround where by changing a parameter in the search using a state, RGs could be forced to refresh. This workaround has recently stopped working, probably due to performance improvements.
The long and the short of this is that you guys can’t and never will be able to guarantee us that real-time update functionality is going to work at all times and in all environments. This is understandable considering the complexity, but it necessitates a means of manually forcing a refresh when required.
We’ve needed this for years, why not knock it out now?
+1, seems like very little dev time to add.
It’s not like anyone will abuse it with how expensive WUs are, compared to the old pricing system where one might add a force refresh every few seconds just in case…
Exactly. I believe that was the original pushback on it. Now, with how WU works, we should have the option to use more WU to force refreshes when desired.
Here is the page on the ideaboard concerning this feature: https://bubble.io/ideaboard?idea=1616909173076x895689600085000200
I would also like an answer as to what Bubble is doing, if anything, to manage outages that happen outside the US East timezone. It was super frustrating to lose a day of development and have to deal with support issues related to this for businesses primarily in Asia, and if there are no learnings from this and no ability to cater to a global audience, it really puts into question Bubble’s feasibility for my projects going forward. Sweeping it all under the carpet now that it is fixed is all well and good, but what happens the next time this happens at 9 PM EST? A 12-hour outage before the problem even appears on the status page again?
That’s nice to hear. But can your vision also include the following:
- communicating with users when such issues occur
- supporting customers when such issues occur and
- empowering users to be able to do something when such issues occur?
Because while I understand that in tech products there will be some bugs or scaling issues or some unforeseen circumstances causing problems, the troubling parts are these:
- The status page was showing total green. There was no acknowledgment of the issues.
- As @lizzie highlighted, the support team works only during certain hours and not in other timezones. Until the point comes where you can assure us that problems will occur only in your timezone and not in others’, I think there must be support available during your non-business hours and days.
- As has been highlighted several times in the past, the biggest issue we have is that we feel helpless. When customers shout at us, when our internal teams shout at us, we just can’t do anything: we can’t tell them we are working on fixing the issue, communicate a timeline, assess whether it is our problem or a problem at Bubble’s end, or post a message about what exactly is going on and how it can be worked around. Right now all we can say is that we have raised a ticket with Bubble, but they will wake up only in their morning. What can be done to make us feel a bit empowered in such cases?
- As @aj11 and others have said countless times, give us a way to “force refresh” RGs and other data.
- Every now and then these issues keep occurring, and one of you comes and says that you are taking it seriously, have made some band-aid fixes, and will try to resolve the issues. But the issues recur (and I understand that recurrence may not be entirely avoidable), and you do not address the problems that we highlight. Also, after the fire has been put out, there is no later update on whether long-term fixes have been made.
Please just help us feel more confident about building serious applications on this platform, where we put in our reputations and sweat on the assumption that we can build real applications and businesses with it. Please don’t let it be just a tool for MVPs, hobby projects, or static websites, where it is okay to have downtime with no assurance of when things will come back up.
Seems we’re not worthy of an answer on this. I’m sure we’ll get a monthly update on how many new hires there were, how the beta testing of the table element is going, and how many support tickets were handled, but no actual substance on the status page being wrong for 12+ hours and the lack of meaningful support or status overwatch outside of the EST timezone.
Following up on some of the questions and concerns –
Re: the “force refresh” request –
… that’s a great point – that was our main hesitation with this before, but we’d likely be comfortable releasing it on the new WU plans. I can’t promise we’ll prioritize it immediately, because given the way our client-side caching works, the implementation of this feature isn’t completely trivial, and we’re trying to carve out space in our roadmaps for more reliability, observability, and bug fixing. But I will highlight that feedback to the team because it’s definitely worth reconsidering.
Here’s what we currently provide:
- We have a 24/7 on-call engineering rotation, so that there’s always someone who can be paged in the event of an emergency
- We have an automated alerting system that will wake the on-call engineer up if necessary in the event of an emergency
- We have 24/7 tier 1 customer support. While they generally aren’t able to diagnose issues themselves, one of their responsibilities is to monitor incoming bug reports for patterns indicative of an infrastructure emergency, and wake up the on-call engineer as needed.
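To give a rough sense of the kind of signal that pattern-monitoring looks for, here is a toy sketch. In reality this relies heavily on human judgment; the numbers and names below are invented and not a description of our actual tooling:

```python
from collections import Counter

# Toy example: flag an unusual cluster of similar bug reports and page the
# on-call engineer. Real triage is done by people; this only illustrates the
# "many similar reports in a short window" signal.

REPORTS_IN_WINDOW_THRESHOLD = 5   # invented number


def check_for_incident_pattern(recent_reports, page_oncall):
    """recent_reports: list of (category, description) tuples from the last hour."""
    counts = Counter(category for category, _ in recent_reports)
    for category, count in counts.items():
        if count >= REPORTS_IN_WINDOW_THRESHOLD:
            page_oncall(f"{count} reports of '{category}' in the last hour")
```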
This isn’t perfect:
- Our automated alerting has gaps, and we’re still calibrating thresholds; as @payam.azadi mentioned above, in this case, the amount of degradation did not meet the threshold that would set off the alert.
- It can be difficult for our 24/7 support teams to discern patterns when an issue only affects a smaller fraction of our user base. While Bubble has many apps that rely on real-time updates for mission-critical functionality, for the vast majority of apps it is a nice-to-have feature, so an outage doesn’t result in as many bug reports as an outage affecting a more universally depended-on feature.
This isn’t an excuse – I want us to get to a state ASAP where we don’t miss a major feature degradation like this one. Part of the reason I’m excited to welcome Payam to the team is his previous experience at companies with world-class operations, and I expect to see significant improvements to our observability and alerting over the coming months.
Would be great to have some idea of what we can expect from real-time updates. But judging from your post, it should always work <3
It seems like the first thing that should be done is to adjust the sensitivity of the alerts. It was acknowledged that the sensitivity errs way too far on the side of caution, yet no immediate change to reduce this kind of occurrence has been mentioned. Crank that sensitivity way up and ping your engineers early and often. Or automate the response to run a series of tests and then alert back. So many options here to mitigate the downtime.
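For example, something along the lines of the sketch below would go a long way. This is just an illustration with made-up names and numbers, not a claim about how Bubble’s systems actually work:

```python
# Toy sketch of an automated probe that exercises the real-time path end to
# end and escalates on failure, instead of waiting for user reports overnight.

MAX_FAILURES_BEFORE_PAGE = 2   # made-up number


def safe_run(check):
    """Run one end-to-end check, treating any exception as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def synthetic_probe(checks, page_oncall):
    """Run each named check; page the on-call engineer if too many fail."""
    failures = [name for name, check in checks.items() if not safe_run(check)]
    if len(failures) >= MAX_FAILURES_BEFORE_PAGE:
        page_oncall(f"Real-time probes failing: {', '.join(failures)}")


# Stand-in checks: a real probe would write a record, wait for the push,
# and verify the data refreshed without a page reload.
if __name__ == "__main__":
    checks = {
        "write_record": lambda: True,
        "receive_realtime_push": lambda: False,
        "reload_and_verify": lambda: False,
    }
    synthetic_probe(checks, page_oncall=print)
```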
Oh and I think Bubble owes at least one of their clients for 9 lost paying subscribers. Or at least some subscription free months, maybe one per user?