Something is down?

It appears to be working normally.

There is a way to opt out of updates, but it’s only available on higher tiers. I agree with your points about watching the forum after a code push, though.

Also, are there any code reviews before the code is pushed to production?

I think the only way to flag this to Bubble as a serious issue right now is to submit a large volume of the same type of bug report (which is why these forum posts are made, I think: to see if others are having issues and, if so, to post a bug report). Then their success team escalates.

Also — does doing @ Bubble still ping the internal team?

My question there is how long it actually takes to escalate. Although I wasn’t directly affected, I would love to see how this is going to be prevented in the future @josh! It’d also be great to get a postmortem.

1 Like

I mean, to be fair, since users have full control (just about) over their app, it’d be almost impossible to find all the edge cases. We’re just a little more prone to errors because, at least in the last year, there have been so many updates. The standards and best practices have changed versus someone who has joined Bubble recently. That’s my opinion at least.

Could there be improvements? Yes. I’d say if you notice issues, check the releases page from Bubble as well. You can tie recent releases to bugs if you notice them. Releases | Bubble

Yeah, I know there’s an option for higher tiers; that’s why I specified shared clusters.

They can’t use us as beta testers.

2 Likes

Agreed! Looking forward to the postmortem

3 Likes

I also do admit that the instability is sorta getting old.

I do get that updates are being pushed at a rate that internal processes maybe aren’t keeping up with, but maybe there should be a pause of at least 1 week to sort out those internal processes before continuing to ship potentially broken, buggy code.

3 Likes

Can we please have an explanation as to why the Status page is all green today even though the main cluster had issues for more than 1h?

How can we explain to users of our apps why their sessions weren’t working correctly when no issue was reported?

What’s the point of subscribing to the Status page if its metrics do not represent the true state?

Many thanks.

7 Likes

@josh Has there been any postmortem on this issue? What is Bubble doing to prevent this from happening again? Also, my biggest frustration was that after checking the status page, which was all green, I thought the issue was on my end. So I started to dig everywhere until I (fortunately) came across this post on the forum.

Some follow-up here:

In terms of updating the status page, after the last round of changes we made following my post here, this is our process:

  • We declare an incident whenever:
    • Our automated monitoring detects changes in metrics or system unavailability that indicate a severe issue
    • Our customer support team gets 3 or more bug reports pointing towards something having recently broken in our system. Because bug reports are spread out across different members of the team, we have a process for tagging them and looking for patterns to detect if incoming reports are related (we get a constant stream of reports about long-standing issues, corner cases, and bugs that turn out to be on the user’s end, that we have to filter through to determine if something is actually wrong).
  • Declaring an incident does two things:
    • It automatically posts a message to our status page that we’re investigating
    • It pages an on-call engineer to ensure someone begins investigation
  • The on-call engineer then keeps the status page up-to-date with the progress of their investigation
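As a rough illustration of the declaration rules above, here is a minimal sketch in Python. It is not Bubble's actual tooling: the `IncidentTracker` class and the `post_status` and `page_on_call` hooks are hypothetical names invented for the example. Only the threshold of 3 related reports and the two effects of declaring an incident come from the description above.

```python
from dataclasses import dataclass
from typing import Callable

REPORT_THRESHOLD = 3  # "3 or more bug reports" from the process described above


@dataclass
class IncidentTracker:
    """Hypothetical model of the incident process; not Bubble's real system."""

    post_status: Callable[[str], None]   # assumed hook: writes to the status page
    page_on_call: Callable[[str], None]  # assumed hook: pages the on-call engineer
    active: bool = False

    def declare(self, reason: str) -> bool:
        """Declare an incident unless one is already ongoing."""
        if self.active:
            return False
        self.active = True
        # Declaring an incident does two things:
        self.post_status(f"Investigating: {reason}")
        self.page_on_call(reason)
        return True

    def on_monitoring_alert(self, severe: bool, description: str) -> bool:
        """Trigger 1: automated monitoring indicates a severe issue."""
        return severe and self.declare(description)

    def on_tagged_reports(self, tag: str, count: int) -> bool:
        """Trigger 2: enough related bug reports share the same tag."""
        if count < REPORT_THRESHOLD:
            return False
        return self.declare(f"{count} related reports tagged '{tag}'")
```

In this sketch, `declare` is a no-op when an incident is already active, so it is safe for anyone to call it without first checking who else is aware of the problem.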

What happened this time around:

We shipped a bug that caused privacy rules using the ‘contains’ operator to fail, causing data protected by those rules to be hidden. Using ‘contains’ in privacy rules is relatively uncommon: over the course of the incident, we received 10 total bug reports about it. Those reports started gradually trickling in around the time this forum thread got started.

Because of the relatively low volume and slow trickle (compared to other incidents where we’ve received 100s of bug reports), it took about 45 minutes for our support team to notice the pattern and that we had exceeded the 3 related bug report threshold. By the time they noticed, the engineering team who had shipped the bug was already aware of the issue and working on a rollback because one of the team members noticed this forum thread.

This hit an ambiguity in our process: our support team was unsure if they were supposed to declare an incident given that engineering was already aware of and responding to the situation. Meanwhile, our engineering team wasn’t updating the status page because they weren’t officially on-call and weren’t aware that this was an ‘incident’ that needed communication. We are fixing both ambiguities going forward:

  • We are clarifying with the support team that they should always declare an incident if there isn’t currently one ongoing, regardless of whether engineering is already aware of and working on the issue
  • We are educating the engineering and product teams on how to declare incidents and making sure they know it’s important to do so in a situation like this even if support hasn’t officially declared one.

In terms of the technical root cause of the bug and how it was shipped to production, the issue was with a low-level component of our query evaluation system, and the engineers working on it weren’t aware that it could have implications for the ‘contains’ operator in the context of privacy rules, so didn’t carefully test that use case. Our automated tests do provide coverage for the ‘contains’ operator in general, and do test privacy rules, but we didn’t have a test that specifically covers the combination of using ‘contains’ inside a privacy rule. We’re writing more automated tests to make sure there’s good coverage here.

One of the challenges we face in general, which is why our reliability isn’t where I would like it to be, is that Bubble’s features can be combined in an exponential variety of ways, and it’s sometimes possible for there to be bugs that only manifest for certain combinations of features. Our strategy here is to dramatically ramp up the investment in test creation and coverage, which we’re spending time on this quarter, but to be honest I expect getting to full coverage to be a journey, not a quick hit, because of the myriad of ways our users can combine features.
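To make the coverage gap concrete, here is a small Python sketch of how each feature can be tested on its own while a specific combination stays untested. The operator names, context names, and the `existing_tests` set are illustrative assumptions, not Bubble's real test suite.

```python
from itertools import product

# Illustrative feature lists; the real set of operators and contexts is far larger.
operators = ["contains", "is", "is in"]
contexts = ["search constraint", "privacy rule", "conditional"]

# Hypothetical existing tests: every operator and every context appears at
# least once, but only 3 of the 9 possible combinations are exercised.
existing_tests = {
    ("contains", "search constraint"),
    ("is", "privacy rule"),
    ("is in", "conditional"),
}


def missing_pairs(operators, contexts, tests):
    """Return every (operator, context) pair that no test covers."""
    return [pair for pair in product(operators, contexts) if pair not in tests]


gaps = missing_pairs(operators, contexts, existing_tests)
for operator, context in gaps:
    print(f"untested: '{operator}' inside a {context}")
```

Even this check only looks at pairs. With three or more features interacting, the number of combinations grows much faster, which is why full coverage is a long-running effort rather than a one-off task.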

Finally, to the point about opting out of automatic updates on shared clusters, this is something we would love to provide, but because of the nature of shared servers, it’s a big technical investment that requires substantial changes to how we generate, package, and host applications. We are working in this direction but it’s far enough off that I can’t give an ETA. I will point out, though, that we do offer the Scheduled tier at a lower price point than Dedicated: it’s available on the Growth plan and above. While the Scheduled tier doesn’t offer the same level of control as Dedicated (we still push updates automatically, but we do it on a time delay), it means you are only exposed to bugs that we don’t notice until the next day. In this case, because we fixed the bug in less than an hour from the first reports, apps on the Scheduled tier were not impacted. You can change your tier on the Settings → Versions tab of the editor.
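The time-delay reasoning can be sketched in a few lines of Python. The 24-hour default and the function name are assumptions made for illustration; the text above only says that updates reach the Scheduled tier on a delay and that bugs caught before the next day never arrive there.

```python
from datetime import datetime, timedelta
from typing import Optional


def reaches_scheduled_tier(
    shipped_at: datetime,
    rolled_back_at: Optional[datetime],
    delay: timedelta = timedelta(hours=24),  # assumed delay, for illustration only
) -> bool:
    """Return True if a release is still live when the delayed push happens."""
    if rolled_back_at is None:
        # Never rolled back, so the delayed push delivers it.
        return True
    return rolled_back_at >= shipped_at + delay


# Hypothetical timestamps: a release rolled back 50 minutes after shipping.
shipped = datetime(2023, 1, 1, 10, 0)
print(reaches_scheduled_tier(shipped, shipped + timedelta(minutes=50)))
```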

4 Likes

Reading about how ‘contains’ in privacy rules is considered ‘uncommon’ made me chuckle. I wonder if anyone at Bubble actually uses Bubble for work purposes? I find that some of my most pressing issues seem to come from design decisions made by someone who hasn’t had to use the feature themselves.

Here’s a breakdown of how integral the ‘contains’ operator is in our privacy settings:

  • User’s company contains transaction
  • Message contains this user
  • User access contains edit, view, delete
  • Company contains user
  • Thread contains user
  • User’s role contains customer, production, admin
  • Invoice contains user
  • Dozens more…

In my app, these aren’t edge cases; they’re the norm. I find the ‘contains’ operator to be absolutely crucial for basic privacy setups, and it’s hard to imagine building a functional app without it.

5 Likes

Given the complexity of filing a bug report—especially when one doesn’t know what exactly is wrong—I’d suggest a “HEY YOU BROKE SOMETHING” button for the community. If enough clicks accumulate in a specific category, then you could start investigating.

Also, it’s 2023; why not deploy some AI to monitor the forum? If a thread gains 30+ “me too” or “Bubble’s broke” posts within 20 minutes, perhaps it’s worth looking into.
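The threshold idea does not strictly need AI. A minimal sketch in Python, using a plain sliding-window counter instead: the `SpikeDetector` class is a hypothetical name, it assumes posts are recorded in chronological order, and deciding which posts count as “me too” is left out entirely.

```python
from collections import deque
from datetime import datetime, timedelta


class SpikeDetector:
    """Flags when enough matching posts land inside a rolling time window."""

    def __init__(self, threshold: int = 30, window: timedelta = timedelta(minutes=20)):
        self.threshold = threshold
        self.window = window
        self._times = deque()  # timestamps of recent matching posts, oldest first

    def record(self, posted_at: datetime) -> bool:
        """Record one matching post; return True if the threshold is reached.

        Assumes calls arrive in chronological order.
        """
        self._times.append(posted_at)
        cutoff = posted_at - self.window
        # Drop posts that have fallen out of the window.
        while self._times and self._times[0] < cutoff:
            self._times.popleft()
        return len(self._times) >= self.threshold
```

Usage would be one `record` call per matching post, with an alert raised the first time it returns `True`. The 30-post and 20-minute defaults are just the numbers suggested above and would need tuning against real forum traffic.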

8 Likes

When I read that “contains” in privacy rules was uncommon it gave me a fright! “Like have I been doing something wrong…?” :sweat_smile:

5 Likes

That’s what Bubble devs have been asking for all year while suffering from editor bugs…

P.S. we’ve tried to collect bug reporting improvements here:

2 Likes

I still feel the exact same way as after the previous editor issues a month ago:

Echoing this.

2 Likes

I am wondering why notifications similar to those for spiking workload unit consumption can’t be implemented for bugs and issues. This would make the developer debugging process much easier than navigating to the Forum or Status page.

Also, why not update the Status page post-mortem to reflect events that happened? Not only would this allow us developers to follow service disruptions, but in instances where we didn’t witness the bug, we could go back and cross-reference the timestamps of when issues happened.

3 Likes

Exactly, a ‘Down Detector’ style feature on the status page would be incredibly useful. It could help us quickly identify if an issue is platform-wide or just related to our specific app. Seems like a no-brainer to implement.

2 Likes

Is that an excuse or a mea culpa?

@josh as Bubble developers are your true customers, can we vote on what’s more important? I believe test creation and FULL coverage come before any other features or updates. Any Bubblers disagree??

1 Like

@Benjamin_Rodgers great suggestions and here are some more @josh:

  1. Have a dedicated monitor of the forum for the 12-24 hours after ANYTHING is shipped to production;
  2. Trigger an investigation whenever anyone on a list of knowledgeable Bubblers posts a bug;
  3. Ensure the engineering team is “on call” when anything is shipped to production (see quote below).
2 Likes