What’s been going on with Bubble?

johnny · September 8, 2021, 2:30pm

Hey @josh,

I was wondering if you could fill us in on what’s been going on with Bubble that’s causing the downtime?

gnelson · September 8, 2021, 2:33pm

I get main cluster issue notifications several times a day, on most days. This has been going on for quite a while now.

rpetribu · September 8, 2021, 2:36pm

+1 here!

Thanks

johnny · September 8, 2021, 2:49pm

Yep! I see that it may be database issues, but I want to understand what’s being done to keep this from happening in the future.

BrianHenderson · September 8, 2021, 5:25pm

Same here. These “hiccups” are killing me in support tickets from customers.

josh · September 8, 2021, 5:37pm

Yeah, sorry about the issues, all. Will write a longer explanation of what’s been going on later this afternoon. The good news is we think we’ve found the cause, and are testing a fix now.

josh · September 8, 2021, 7:10pm

Okay, fix is live. I’m giving it about an 80% chance of making the problems go away; we have very strong circumstantial evidence that this is the root cause, but not proof.

Backing up a bit:

This current batch of issues started over the weekend, late Sat / early Sun ET.
The symptom was that, roughly every 6 to 24 hours, a query on one of our databases would get “stuck”: it wouldn’t ever finish running, and it wasn’t killable using Postgres’ sysadmin tools.
Having a query that never finishes running is a problem; other things back up behind it, and query performance on the database starts degrading, to the point where it starts registering as transient outages.
The only way we’ve found so far to kill the stuck query is restarting the entire database, which also causes a transient outage.
We currently have our users’ apps on the main cluster split between 6 databases; this happened a couple times on database #2, and once on database #3. So the majority of our users didn’t have any downtime, but apps on #2 and #3 were affected.
This took us a long time to get to the bottom of because there wasn’t any obvious pattern in which queries were getting stuck: they were different queries for different apps. From our vantage point, it appeared that randomly, once in a while, some query would just get into this stuck state.
What we figured out after this morning’s issue is that there is a pattern: the stuck queries were running on the same database connection immediately after a different query failed with a specific kind of error. This error by itself was unremarkable: it had to do with taking the product of a long list of numbers and getting a number too large to store. For some reason, which we still don’t understand, this seems to corrupt the database process and lead to subsequent queries getting stuck.
The fix we just deployed prevents this particular error from occurring by intercepting the issue upstream (an attempt to do a calculation that will result in a number that’s too large) and throwing a different, definitely harmless error before we get to the point where it was causing the problem.
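
As a rough illustration of the kind of upstream guard described above, here is a minimal Python sketch. It is not Bubble's actual code: the names `safe_product` and `ProductTooLargeError` and the limit `MAX_MAGNITUDE` are all hypothetical, and the real system limit is not stated in this thread. The idea is only to check the running product in application code and raise an ordinary error before the calculation ever reaches the database.

```python
import math
from typing import Iterable

# Hypothetical limit chosen for illustration; the real limit is not given in the thread.
MAX_MAGNITUDE = 1e300


class ProductTooLargeError(ValueError):
    """Harmless application-level error raised instead of a database overflow."""


def safe_product(numbers: Iterable[float], limit: float = MAX_MAGNITUDE) -> float:
    """Multiply the numbers, failing early once the running product is too large."""
    result = 1.0
    for n in numbers:
        try:
            result *= float(n)
        except OverflowError as exc:
            # float() raises OverflowError for ints too large to represent.
            raise ProductTooLargeError("product of list is too large to store") from exc
        # Float multiplication overflows silently to inf, so check explicitly.
        if math.isinf(result) or math.isnan(result) or abs(result) > limit:
            raise ProductTooLargeError("product of list is too large to store")
    return result
```

Note that this sketch is deliberately conservative: it rejects a list as soon as any intermediate product exceeds the limit, even if later factors would have brought the result back into range.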

What we don’t understand is why this problem just started happening over the weekend: the relevant code has been live for years, and we’re not aware of any recent changes that might have introduced the problem, though we’ve been auditing our code for them. It’s possible that apps just started running into this particular error now: we don’t see many instances of it in our logs, and it’s not common that a user would want to take the product of a list of numbers and get a result that’s big enough to exceed the system limits. But because of this uncertainty, we’re only somewhat confident that the fix we deployed actually resolves the issue, since there are aspects of the situation we haven’t figured out yet.

That said, I do think there’s a good chance that this was the correct fix. The only way we’ll know for sure, though, is by waiting a few days and seeing if the issue happens again.

johnny · September 8, 2021, 8:47pm

Thanks Josh!

mac2 · September 8, 2021, 9:37pm

@josh Have you tried the debugger?

BrianHenderson · September 9, 2021, 1:08am

I love the extreme transparency here. Thank you @josh very much for the detailed explanation. It builds so much trust. I am very grateful!

system · September 22, 2021, 2:31pm

This topic was automatically closed after 14 days. New replies are no longer allowed.
