Cluster Issues and Bubble Stability

I greatly value Bubble’s transparency policy, and I applaud it! However, it’s important to emphasize that publicizing failures should not reduce the focus on resolving them.

Recurring cluster problems are becoming increasingly frequent and closely spaced. On the Bubble status page, on some days incidents are “opened and closed” within barely 5 minutes. And look at how many times this has happened today.

In follow-ups like this one: “Our systems are functional and we are closing out this incident.” Posted 3 hours ago. May 08, 2024 - 13:19 EDT.

Bubble quickly states that everything is okay. But it isn’t! Today, as of right now (5:00 PM), activity has been practically at a standstill!

Bubble Team:
Could you please please share with all of us the reality of what is happening?

  • Are there failures in cluster protection? Are there human errors?
  • What is being done to correct these failures and prevent them more effectively?
  • What is being done to increase the time between failures?
  • Are there no redundant clusters or environments that would allow a brief restart without affecting our apps?

The status shown on the status page does not match reality. It would be great if you could make an official announcement on this topic. We, your customers and entrepreneurs, are fearful! Who will explain this to our end users? Will they understand us the same way we have to understand Bubble?

I appreciate the follow-up in advance.

1 Like

Thank you, I’ll make sure to pass this to the team.

1 Like

Hello @fede.bubble
In this case, I thank you. You Bubblers are always kind.

PS. I have also submitted a separate ticket today to the support email, although I know this is not isolated to my app but affects everyone. Posting here as well is so that all colleagues can access any feedback and we can “know what steps” to take with our own customers.

1 Like

Or maybe you can provide some official information. We paid for this service; we need to know.

1 Like

We’ll share more once the team has had time to do a post-mortem, but essentially it’s the same as what I said in Bubble Down May 8th: a bad actor out there decided to disrupt Bubble services this week (starting Monday), so we are suffering all these outages as the team works to patch and mitigate the incidents.

3 Likes

Thanks Fede, but I mean, this has been happening for at least 3-4 weeks. I’m concerned about what happens next week, next month, etc.

@mmendezgracia Thanks for adding that.

Exactly that! This is the main purpose of this post: to understand what is actually being done, because critical failures are recurring frequently and in quick succession. We cannot and will not be able to maintain a good experience for our apps’ end users with something like this happening.

Hey @rafaelfernandes, following up here. I got some remarks from Josh:

  • Over the last month, we’ve had some downtime due to human error, but the majority of downtime, including all the downtime this week, has been due to failures in cluster protection.

  • The way our status page works is that it is connected to the tool we use to alert engineers that an emergency response is needed. In turn, that tool is hooked up to a bunch of automated alerting that detects abnormalities. (We can also trigger it manually, and do so if we see a spike in related bug reports over a short period of time.) When you see a brief message on our status page, followed by it being marked as resolved, that generally means that a problem occurred, our alerting went off, but our systems were able to self-recover before a human had a chance to do anything. When that happens, we follow up after the fact to understand what went wrong and implement a fix. (A rough illustrative sketch of this auto-open/auto-resolve pattern follows after this list.)

  • Sometimes, a major issue manifests as our alerting alternating between being stable (as we or our systems bring the problem under control temporarily) and firing (as things change because we haven’t fully fixed the issue). In that case, it can create some noise on our status page. We typically try to have a human post a message explaining what is going on, but usually the first responder is an engineer focused on understanding the problem and sometimes it takes us time to get a user-understandable explanation up. We are discussing this process internally and may make changes or improvements here because we know it can be frustrating and scary.
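To illustrate why incidents can appear and then be marked resolved within minutes, here is a minimal, purely hypothetical sketch in Python of an auto-open/auto-resolve monitoring loop. The threshold, polling interval, and every function name below are invented for illustration and do not come from Bubble’s actual tooling.

```python
# Hypothetical sketch (not Bubble's actual code): how an automated alerting
# pipeline can open a status-page incident and then auto-resolve it minutes
# later when systems self-recover before a human responds.

import random
import time
from typing import Optional

ERROR_RATE_THRESHOLD = 0.05   # assumed abnormality threshold
CHECK_INTERVAL_SECONDS = 30   # assumed polling interval


def current_error_rate() -> float:
    """Stand-in for a real metric query (e.g. a 5xx ratio from a load balancer);
    here we simply simulate occasional spikes."""
    return random.choice([0.01, 0.01, 0.01, 0.12])


def open_status_incident(message: str) -> str:
    """Placeholder for a status-page API call that opens a public incident."""
    print(f"[status page] OPEN: {message}")
    return "incident-1"


def resolve_status_incident(incident_id: str) -> None:
    """Placeholder for a status-page API call that resolves an incident."""
    print(f"[status page] RESOLVED: {incident_id}")


def monitor_loop() -> None:
    open_incident: Optional[str] = None
    while True:
        unhealthy = current_error_rate() > ERROR_RATE_THRESHOLD

        if unhealthy and open_incident is None:
            # Abnormality detected: page engineers and open a public incident.
            open_incident = open_status_incident("Elevated error rates detected")
        elif not unhealthy and open_incident is not None:
            # Systems recovered (possibly on their own, before a human acted):
            # the incident closes minutes after it opened, which is why short
            # "opened and closed" entries appear on the status page.
            resolve_status_incident(open_incident)
            open_incident = None

        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    monitor_loop()
```

If the underlying problem keeps coming back, this same loop reopens a new incident each time the metric crosses the threshold again, producing the “flapping” pattern of repeated open/resolve entries described above.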

5 Likes

This topic was automatically closed after 14 days. New replies are no longer allowed.