What’s been going on with Bubble?

Hey @josh,

I was wondering if you could fill us in on what’s been going on with Bubble that’s been causing the downtime?

4 Likes

I get main cluster issue notifications several times a day, on most days. This has been going on for quite a while now.

2 Likes

+1 here!

Thanks

Yep! I see that it may be database issues, but I want to understand what’s being done to keep this from happening in the future.

1 Like

Same here. These “hiccups” are killing me in support tickets from customers.

Yeah, sorry about the issues, all. Will write a longer explanation of what’s been going on later this afternoon. The good news is we think we’ve found the cause, and are testing a fix now.

7 Likes

Okay, fix is live. I’m giving it about an 80% chance of making the problems go away; we have very strong circumstantial evidence that this is the root cause, but not proof.

Backing up a bit:

  • This current batch of issues started over the weekend, late Sat / early Sun ET
  • The symptom was that, roughly every 6 to 24 hours, a query on one of our databases would get “stuck”: it would never finish running, and it wasn’t killable using Postgres’s sysadmin tools (there’s a sketch of what that means right after this list)
  • Having a query that never finishes running is a problem; other things back up behind it, and query performance on the database starts degrading, to the point where it starts registering as transient outages.
  • The only way we’ve found so far to kill the stuck query is restarting the entire database, which also causes a transient outage
  • We currently have our users’ apps on the main cluster split across 6 databases; this happened a couple of times on database #2, and once on database #3. So the majority of our users didn’t have any downtime, but apps on #2 and #3 were affected.
  • This took us a long time to get to the bottom of because there wasn’t any obvious pattern in which queries were getting stuck: they were different queries for different apps. From our vantage point, it appeared that randomly, once in a while, some query would just get into this stuck state
  • What we figured out after this morning’s issue is that there is a pattern: the stuck queries were running on the same database connection immediately after a different query failed with a specific kind of error. This error by itself was unremarkable: it had to do with taking the product of a long list of numbers and getting a number too large to store. For some reason, which we still don’t understand, this seems to corrupt the database process and lead to subsequent queries getting stuck.
  • The fix we just deployed prevents this particular error from occurring, by intercepting the issue (trying to do a calculation that will result in a number that’s too large) upstream and throwing a different, definitely harmless error before we ever get to the point where it was causing the problem (rough sketch below)
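
For the curious, “sysadmin tools” above means the standard pg_stat_activity / pg_cancel_backend / pg_terminate_backend route. Here’s a simplified sketch of that kind of inspection, written against node-postgres purely for illustration rather than being our actual internal tooling:

```typescript
// Simplified sketch (node-postgres), not our real tooling: how you'd normally
// find a long-running query and ask Postgres to kill it.
import { Client } from "pg";

async function inspectStuckQueries(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();

  // List every non-idle backend, longest-running first.
  const { rows } = await client.query(`
    SELECT pid, state, now() - query_start AS runtime, query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC
  `);
  for (const row of rows) {
    console.log(row.pid, row.state, row.runtime, row.query);
  }

  // Normally one of these clears a runaway query (given its pid). In our case
  // the stuck backends didn't respond to either, which is why restarting the
  // whole database was the only thing that worked.
  // await client.query("SELECT pg_cancel_backend($1)", [somePid]);
  // await client.query("SELECT pg_terminate_backend($1)", [somePid]);

  await client.end();
}
```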

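To make that last bullet concrete, here’s a rough sketch of the shape of the guard. The names and the TypeScript are hypothetical and simplified (the real code is internal and does more bookkeeping); the point is just that the magnitude check happens in application code, before anything is sent to the database.

```typescript
// Hypothetical sketch of the guard (not our actual code): check the size of
// the product before doing the calculation, and throw a controlled error
// instead of letting the overflow reach the database.

// Illustrative limit: the largest double-precision float (~1.8e308).
const MAX_MAGNITUDE = Number.MAX_VALUE;

class ProductTooLargeError extends Error {
  constructor() {
    super("Product of list is too large to store");
    this.name = "ProductTooLargeError";
  }
}

function productWithGuard(values: number[]): number {
  if (values.some((v) => v === 0)) return 0; // an exact zero can't overflow

  // Sum logs instead of multiplying, so the check itself can't overflow.
  let logMagnitude = 0;
  for (const v of values) {
    logMagnitude += Math.log(Math.abs(v));
  }
  if (logMagnitude > Math.log(MAX_MAGNITUDE)) {
    // A plain, harmless application error; nothing ever reaches Postgres.
    throw new ProductTooLargeError();
  }
  return values.reduce((acc, v) => acc * v, 1);
}

// Example: 200 values of 1e7 would multiply out to ~1e1400, far past the limit.
// productWithGuard(Array(200).fill(1e7)); // throws ProductTooLargeError
```
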
What we don’t understand is why this problem only started happening over the weekend: the relevant code has been live for years, and we’re not aware of any recent changes that might have introduced the problem, though we’ve been auditing our code for them. It’s possible that apps just started running into this particular error now: we don’t see many instances of it in our logs, and it’s not common for a user to want to take the product of a list of numbers and get a result big enough to exceed the system limits. But because of this uncertainty, we’re only somewhat confident that the fix we deployed actually resolves the issue, since there are aspects of the situation we haven’t figured out yet.

That said, I do think there’s a good chance that this was the correct fix. The only way we’ll know for sure, though, is by waiting a few days and seeing whether the issue happens again.

24 Likes

Thanks Josh!

@josh Have you tried the debugger? :crazy_face:

7 Likes

I love the extreme transparency here. Thank you @josh very much for the detailed explanation. It builds so much trust. I am very grateful!

11 Likes

This topic was automatically closed after 14 days. New replies are no longer allowed.