Something is down?

Some follow-up here:

In terms of updating the status page, after the last round of changes we made following my post here, this is our process:

  • We declare an incident whenever:
    • Our automated monitoring detects changes in metrics or system unavailability that indicate a severe issue
    • Our customer support team gets 3 or more bug reports pointing towards something having recently broken in our system. Because bug reports are spread out across different members of the team, we have a process for tagging them and looking for patterns to detect if incoming reports are related (we get a constant stream of reports about long-standing issues, corner cases, and bugs that turn out to be on the user’s end, that we have to filter through to determine if something is actually wrong).
  • Declaring an incident does two things:
    • It automatically posts a message to our status page that we’re investigating
    • It pages an on-call engineer to ensure someone begins investigation
  • The on-call engineer then keeps the status page up-to-date with the progress of their investigation
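To make the trigger logic concrete, here is a minimal Python sketch of the rules above. It is an illustration only, not our actual tooling: `IncidentTracker`, `should_declare`, and `REPORT_THRESHOLD` are hypothetical names, and the status page and pager are stood in for by plain lists.

```python
from dataclasses import dataclass, field

# From the process above: 3 or more related bug reports trigger an incident.
REPORT_THRESHOLD = 3


def should_declare(monitoring_alert: bool, related_reports: int) -> bool:
    """Return True if either trigger condition from the process is met."""
    return monitoring_alert or related_reports >= REPORT_THRESHOLD


@dataclass
class IncidentTracker:
    """Hypothetical stand-in for the status page and paging system."""
    status_posts: list = field(default_factory=list)
    pages: list = field(default_factory=list)
    ongoing: bool = False

    def declare(self, reason: str) -> bool:
        """Declare an incident unless one is already ongoing."""
        if self.ongoing:
            return False
        self.ongoing = True
        # Declaring does two things: post to the status page, page on-call.
        self.status_posts.append(f"Investigating: {reason}")
        self.pages.append(reason)
        return True
```

In this sketch, `declare` depends only on whether an incident is already open, not on who happens to be aware of the problem.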

What happened this time around:

We shipped a bug that caused privacy rules using the ‘contains’ operator to fail, causing data protected by those rules to be hidden. Using ‘contains’ in privacy rules is relatively uncommon: over the course of the incident, we received 10 total bug reports about it. Those reports started trickling in gradually around the time this forum thread was started.

Because of the relatively low volume and slow trickle (compared to other incidents where we’ve received hundreds of bug reports), it took about 45 minutes for our support team to notice the pattern and realize that we had exceeded the threshold of 3 related bug reports. By the time they noticed, the engineering team that had shipped the bug was already aware of the issue and working on a rollback, because one of the team members had noticed this forum thread.

This hit an ambiguity in our process: our support team was unsure whether they were supposed to declare an incident given that engineering was already aware of and responding to the situation. Meanwhile, our engineering team wasn’t updating the status page because they weren’t officially on call and weren’t aware that this was an ‘incident’ that needed communication. We are fixing both ambiguities going forward:

  • We are clarifying with the support team that they should always declare an incident if there isn’t currently one ongoing, regardless of whether engineering is already aware of and working on the issue
  • We are educating the engineering and product teams on how to declare incidents, and making sure they know it’s important to do so in a situation like this even if support hasn’t officially declared one

In terms of the technical root cause of the bug and how it was shipped to production: the issue was with a low-level component of our query evaluation system. The engineers working on it weren’t aware that it could have implications for the ‘contains’ operator in the context of privacy rules, so they didn’t carefully test that use case. Our automated tests do provide coverage for the ‘contains’ operator in general, and do test privacy rules, but we didn’t have a test that specifically covers the combination of using ‘contains’ inside a privacy rule. We’re writing more automated tests to make sure there’s good coverage here.

One of the challenges we face in general, and a reason our reliability isn’t where I would like it to be, is that Bubble’s features can be combined in an exponentially large number of ways, and some bugs only manifest for certain combinations of features. Our strategy here is to dramatically ramp up our investment in test creation and coverage, which we’re spending time on this quarter. To be honest, though, I expect getting to full coverage to be a journey, not a quick hit, because of the myriad ways our users can combine features.
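As a rough illustration of the coverage gap, here is a small Python sketch that enumerates operator/context pairs and reports which ones have no test. The operator and context names, and the `covered` set, are made up for the example; they are not our real test suite or a full list of Bubble features.

```python
from itertools import product

# Hypothetical feature lists, for illustration only.
OPERATORS = ["contains", "equals", "is_empty"]
CONTEXTS = ["search_constraint", "privacy_rule", "workflow_condition"]


def feature_pairs(operators, contexts):
    """Every (operator, context) combination that could be tested."""
    return list(product(operators, contexts))


def missing_coverage(covered, operators, contexts):
    """Combinations that no existing test exercises."""
    return [pair for pair in feature_pairs(operators, contexts)
            if pair not in covered]


# Each operator and each context is tested somewhere,
# but not every combination of the two.
covered = {
    ("contains", "search_constraint"),
    ("equals", "privacy_rule"),
    ("is_empty", "workflow_condition"),
}
gaps = missing_coverage(covered, OPERATORS, CONTEXTS)
```

Here `gaps` includes `("contains", "privacy_rule")` even though ‘contains’ and privacy rules are each tested on their own. The number of pairs grows multiplicatively with every feature added, and faster still once three or more features interact.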

Finally, to the point about opting out of automatic updates on shared clusters: this is something we would love to provide, but because of the nature of shared servers, it’s a big technical investment that requires substantial changes to how we generate, package, and host applications. We are working in this direction, but it’s far enough off that I can’t give an ETA. I will point out, though, that we do offer the Scheduled tier at a lower price point than Dedicated: it’s available on the Growth plan and above. The Scheduled tier doesn’t offer the same level of control as Dedicated (we still push updates automatically, but we do it on a time delay), but it means you are only exposed to bugs that we don’t notice until the next day. In this case, because we fixed the bug less than an hour after the first reports, apps on the Scheduled tier were not impacted. You can change your tier on the Settings → Versions tab of the editor.
