Hi all, I want to apologize and give a quick update on the recent outages we’ve been having.
We are fairly sure the outages today, on Feb 7, and on Feb 2 were all caused by the same underlying problem. We believe the issue is a scaling challenge with the Bubble app that manages our own homepage and signup system. This app is load-bearing for a number of our backend processes, so the scaling demands placed on it are more extreme than with most other Bubble applications.
What we’ve been seeing is that certain large queries on that app have been piling up rather than completing, causing the database that hosts that app to run out of memory and crash. Although this database mainly just hosts that app as well as a small number of user apps (which we are in the process of moving to other databases), when that app goes down, it interferes with a lot of our systems and can cause performance issues and problems for other applications.
We thought after the outage two days ago that we had narrowed the problem down to a specific query that was behaving badly, and we disabled that query, which we believed was sufficient to solve the issue. It now looks like there are multiple kinds of queries that can give rise to the problem. We’ve disabled all of the queries we could find, and confirmed that this was in fact enough to stabilize the database during today’s outage.
We are not sure this is the end of the issues – it’s possible there are other queries that could lead to the same problem that we’ve missed. Also, we suspect that the fact this problem occurs with multiple types of queries means that the root cause is ultimately a scaling issue with this app that we’ll need to identify and fix. We are continuing to investigate, as well as taking other measures to lighten the load on the database to prevent recurrence.
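To illustrate the general failure mode described above, where slow queries accumulate until the database runs out of memory, here is a minimal sketch of one common mitigation: capping the number of heavy queries in flight and rejecting the excess immediately rather than letting it queue. This is not Bubble's implementation; the class name `QueryGate`, the limit, and the error message are all hypothetical and chosen only for illustration.

```python
import threading


class QueryGate:
    """Hypothetical admission gate: caps how many heavy queries run at once.

    Requests beyond the cap are rejected right away instead of waiting,
    so a backlog of slow queries cannot build up and exhaust memory.
    """

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, query_fn):
        # Non-blocking acquire: if no slot is free, shed the request.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("too many heavy queries in flight; try again later")
        try:
            return query_fn()
        finally:
            # Always free the slot, even if the query raised.
            self._slots.release()
```

A caller would wrap each expensive query, for example `gate.run(lambda: run_report())`, and treat the `RuntimeError` as a signal to retry later. The trade-off is that some requests fail fast under load, which is usually preferable to the whole database going down.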
As I write this, our systems all currently look stable, so I’m crossing my fingers that the queries we disabled were sufficient to prevent the problem from happening again until we’re 100% sure we’ve fixed the root issue. We’ll keep you posted.
Again, we’re incredibly sorry this is happening – we know Bubble’s uptime is critical for many of the businesses that run on us and that outages like this are very painful and costly for you.
Hi @josh, thank you for your note. I might be misinterpreting what you wrote (I’m not a developer like many other Bubblers), but it looks like the approach is “let’s disable some queries and hope that solves it”. This is fine for handling an emergency, but it is not reassuring for the future. Wouldn’t it be better to review the architecture so that problems are avoided “by design” rather than “by patch”? If there’s a flaw in the design, problems will pop up again. It might take weeks to do due diligence on the design, but it would be an investment in reliability. Thanks
I think that Bubble should create a new workflow/conditional, something like “When Bubble is presenting a partial or major outage” so that we can send our users an email or maybe just allow us to customize the error message that our users see when Bubble is down. This will allow us to inform our users that we know that our app is down and we are working on it (we are not actually working on it but Bubble is).
Since Bubble probably is not able to access our database when there is an outage, for instance, the solution might be to use a backup database to store just our custom error messages presented when Bubble is down.
Something similar to what Bubble does on their status page (https://status.bubble.io/). We should be able to do that for our users as well somehow.
I just submitted this feature request at https://bubble.io/ideaboard, although you have to scroll down a lot to see it because there is a long list of requests.
Yes, 100% agree. That’s why we’re continuing to investigate – we have the patch solution to restore stability, but we want to fully understand the design issue so we can solve the root cause. Two days ago, we believed that the problem was isolated to a single query and was not indicative of a larger design issue; from what we’re seeing today, we think that assessment was wrong, and we plan to make sure the design is robust.
This is a good idea and something we definitely want to do, but it’s also a lot of effort for our team to implement (to make sure it stays up when everything else is down, we can’t reuse our existing infrastructure to build it). So for the short term, we’re focusing our efforts on increasing reliability; longer-term, we’d love to build this.
I like the idea too. I would be happy with an option to redirect to a specified site outside Bubble. I wouldn’t mind creating a simple site outside Bubble for when systems go down.
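As a rough sketch of how such a redirect decision could work, the snippet below tracks consecutive failed health checks and switches to a fallback URL once a threshold is reached. Everything here is an assumption for illustration: Bubble does not offer this feature per the thread, the names `OutageDetector` and `choose_target` are hypothetical, the URLs are placeholders, and the logic would have to run somewhere outside Bubble so that it stays up during an outage.

```python
from dataclasses import dataclass


@dataclass
class OutageDetector:
    """Hypothetical tracker for consecutive failed health checks."""

    failure_threshold: int = 3  # failures in a row before we call it an outage
    consecutive_failures: int = 0

    def record(self, check_ok: bool) -> None:
        # A single success resets the counter; a failure increments it.
        self.consecutive_failures = 0 if check_ok else self.consecutive_failures + 1

    @property
    def is_down(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold


def choose_target(detector: OutageDetector, primary_url: str, fallback_url: str) -> str:
    """Return the URL visitors should be sent to right now."""
    return fallback_url if detector.is_down else primary_url


# Usage with placeholder URLs: three failed checks in a row trigger the fallback.
detector = OutageDetector(failure_threshold=3)
for ok in (False, False, False):
    detector.record(ok)
target = choose_target(detector, "https://app.example.com", "https://status.example.com")
```

Requiring several consecutive failures avoids redirecting users on a single slow response, at the cost of a short delay before the fallback kicks in.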
I think it might be time for Bubble to start offering a way for customers to export an app and host it on whatever infrastructure they want, as Discourse does, for example.
Do the queries that are causing issues, and that you have now disabled, belong to our applications? If so, can we know which queries they are, so that we can remove them ourselves or put a workaround in place? If you just disable them, our applications may be affected, right?