Yeah, sorry about the issues, all. Will write a longer explanation of what’s been going on later this afternoon. The good news is we think we’ve found the cause, and are testing a fix now.
Okay, fix is live. I’m giving it about an 80% chance of making the problems go away; we have very strong circumstantial evidence that this is the root cause, but not proof.
Backing up a bit:
This current batch of issues started late Saturday / early Sunday ET.
The symptom was that, roughly every 6 to 24 hours, a query on one of our databases would get “stuck”: it would never finish running, and it couldn’t be killed using Postgres’ sysadmin tools.
Having a query that never finishes running is a problem; other things back up behind it, and query performance on the database starts degrading, to the point where it starts registering as transient outages.
The only way we’ve found so far to kill the stuck query is to restart the entire database, which also causes a transient outage.
We currently have our users’ apps on the main cluster split between 6 databases; this happened a couple times on database #2, and once on database #3. So the majority of our users didn’t have any downtime, but apps on #2 and #3 were affected.
This took us a long time to get to the bottom of because there wasn’t any obvious pattern in which queries were getting stuck: they were different queries for different apps. From our vantage point, it looked like some query would just randomly get into this stuck state once in a while.
What we figured out after this morning’s issue is that there is a pattern: the stuck queries were running on the same database connection immediately after a different query failed with a specific kind of error. This error by itself was unremarkable: it had to do with taking the product of a long list of numbers and getting a number too large to store. For some reason, which we still don’t understand, this seems to corrupt the database process and lead to subsequent queries getting stuck.
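To make that pattern concrete, here’s a rough sketch of the kind of log check that would surface it. This is illustrative only: the `LogEntry` record, the `numeric_overflow` and `stuck` labels, and the per-connection sequence number are all made up for the example and aren’t our actual logging schema.

```python
from dataclasses import dataclass
from itertools import groupby
from operator import attrgetter


@dataclass(frozen=True)
class LogEntry:
    # Hypothetical, simplified log record; not the real schema.
    connection_id: int
    seq: int       # order in which the query ran on its connection
    query: str
    outcome: str   # e.g. "ok", "numeric_overflow", "stuck"


def stuck_after_overflow(entries):
    """Return (failed, stuck) pairs where a stuck query ran on the same
    connection immediately after a query that failed with an overflow."""
    pairs = []
    # Sort so that each connection's queries are contiguous and in order.
    ordered = sorted(entries, key=attrgetter("connection_id", "seq"))
    for _, group in groupby(ordered, key=attrgetter("connection_id")):
        previous = None
        for entry in group:
            if (
                previous is not None
                and previous.outcome == "numeric_overflow"
                and entry.outcome == "stuck"
            ):
                pairs.append((previous, entry))
            previous = entry
    return pairs
```

Grouping by connection is the important part: looked at across all connections at once, the stuck queries have nothing in common, which is why the pattern was hard to spot.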
The fix we just deployed prevents this particular error from occurring: it catches the problematic case (a calculation whose result would be too large to store) upstream and throws a different, definitely harmless error before the query gets to the point where it was causing the problem.
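For the curious, the general idea of an upstream guard like this can be sketched as follows. This is not the code we deployed: the `MAX_DIGITS` limit, the `ProductTooLargeError` name, and the `check_product_fits` helper are all hypothetical, and the real system limit is different from the placeholder used here.

```python
import math

# Hypothetical limit for illustration; not the real system limit.
MAX_DIGITS = 1000


class ProductTooLargeError(ValueError):
    """Harmless application-level error, raised before the database is involved."""


def check_product_fits(values, max_digits=MAX_DIGITS):
    """Raise ProductTooLargeError if the product of `values` would have more
    than `max_digits` digits before the decimal point.

    The magnitude is estimated by summing base-10 logarithms, so the huge
    product itself is never computed. The estimate is approximate for
    products that land right at the limit.
    """
    log_sum = 0.0
    for v in values:
        if v == 0:
            return  # the product is zero, which always fits
        log_sum += math.log10(abs(v))
    estimated_digits = math.floor(log_sum) + 1
    if estimated_digits > max_digits:
        raise ProductTooLargeError(
            f"product would have about {estimated_digits} digits "
            f"(limit is {max_digits})"
        )
```

Calling `check_product_fits([2, 3, 4])` returns quietly, while `check_product_fits([10**600, 10**600])` raises `ProductTooLargeError` without ever running the calculation that used to fail.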
What we don’t understand is why this problem just started happening over the weekend: the relevant code has been live for years, and we’re not aware of any recent changes that might have introduced the problem, though we’ve been auditing our code to look for any. It’s possible that apps just started running into this particular error now: we don’t see many instances of it in our logs, and it’s not common for a user to take the product of a list of numbers and get a result big enough to exceed the system limits. But because of this uncertainty, we’re only somewhat confident that the fix we deployed actually resolves the issue, since there are aspects of the situation we haven’t figured out yet.
That said, I do think there’s a good chance that this was the correct fix. The only way we’ll know for sure, though, is by waiting a few days and seeing if the issue happens again.