EDIT: renamed thread, adding postmortem at the bottom of the post
Hi all – couple quick updates on the outage:
- We think systems are currently stable, although we are still working to make sure this won’t happen again.
- We’re very sorry about the lack of public response. We had engineers working on the problem, but because of miscommunication and a lack of process clarity, no one who was awake was handling outbound communication to the forum or our status page. The engineers working on it were mostly heads-down trying to solve the problem, which was the correct response given what they had been told to focus on.
- Our current best assessment of the user-facing impact:
- Roughly 10% of our apps experienced performance degradation
- The plugins tab in the editor did not load for a few hours
- We’ll post a longer technical post-mortem once we’re done working on the issue, but it looks like this was a pre-existing scaling issue with our system that got triggered by specific apps increasing their workload in a very specific way.
---------------- Postmortem -----------------
Breaking this into two sections:
- Incident response and communication
- Technical root cause and resolution
Incident response and communication
Timeline
- Issue started at ~11 pm ET on 11/29
- None of our critical alerts went off, but our error logs started showing evidence of problems
- We also started getting bug reports from users
- Our 24/7 support team flagged the problem at ~2 am, but didn’t have a clear escalation channel and no one on the engineering team was alerted
- An engineer noticed the errors at ~6 am, and began working on a solution
- The issue was resolved at ~7 am
- We posted our first public update on the issue at ~8 am
What went wrong
- Our out-of-hours emergency escalation protocol is built around our automated alerting, and we don’t have a clear process for issues that have significant user-facing impact but aren’t severe enough to trigger those alerts
- As a result, our 24/7 support staff did not have a clear escalation path to wake up an engineer who could diagnose and resolve the issue
- The engineer who diagnosed and resolved the issue wasn’t given any guidance on when to manually update our status page
- Our automated alerting does not always go off when performance degrades enough to impact functionality but not so much that Bubble is completely inaccessible
Solutions and next steps
- Train our 24/7 support team on how to escalate to our US-based engineering team after hours
- Train our entire engineering team on incident management protocols, including when to declare a production issue, and who is responsible for external communication
- Hire more cloud engineers (JD coming shortly) to audit our alerting and lower our thresholds for emergency alarms, as well as to provide faster response times to incidents
Technical root cause and resolution
What caused the initial issue
- Very high activity from one app () and, to a lesser extent, a second app (), mainly around automated temporary-user deletion, overwhelmed one of our database shards. Queries on that shard got progressively slower, and some got stuck and never recovered, which left Postgres transaction locks held open. That prevented most other queries on the shard from ever resolving, and the backlog kept building until the problem was addressed.
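For concreteness, this is roughly the kind of diagnostic query involved when looking for stuck sessions and what is blocking them. It’s a simplified sketch using node-postgres; the connection handling and the 60-second threshold are illustrative, not our actual tooling.

```typescript
// Sketch: list long-running queries on a shard and the sessions blocking them.
// The connection string and the 60-second threshold are placeholders.
import { Client } from "pg";

async function findStuckQueries(connectionString: string) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const { rows } = await client.query(`
      SELECT pid,
             state,
             wait_event_type,
             now() - query_start   AS runtime,
             pg_blocking_pids(pid) AS blocked_by,
             left(query, 80)       AS query_preview
      FROM pg_stat_activity
      WHERE state <> 'idle'
        AND now() - query_start > interval '60 seconds'
      ORDER BY now() - query_start DESC
    `);
    for (const row of rows) {
      console.log(row.pid, row.state, String(row.runtime), row.blocked_by, row.query_preview);
    }
    return rows;
  } finally {
    await client.end();
  }
}
```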
What systems were affected?
- The impacted shard had heavily degraded performance (slightly slower queries for the first hour, then progressively fewer and fewer queries ever completed)
- The two apps that originally caused this were offline because they were out of capacity while the automated temporary-user deletion took place (and was stuck)
- Our home page, which uses the impacted database shard, suffered from poor performance on several pages that loaded a lot of data
- ~10% of Bubble apps, which live on that shard, suffered similar performance loss or timeouts
- In the editor, for everyone, the plugins tab did not load
- In the editor, for everyone, it was not possible to create new actions if the app had plugins installed that provided new action types
How was the issue resolved?
- Since a new round of expired-item deletion kept being scheduled and run regularly (as long as the apps kept generating new temp users), a temporary patch was put in place to prevent automated temporary-user deletion on the two apps that were causing the problem (see the kill-switch sketch after this list)
- Since a lot of queries were stuck, we attempted to cancel/terminate the queries in question from the SQL console, which did not work (see the cancel/terminate sketch after this list)
- Ultimately, what resolved it was a restart of the database, which caused ~40 seconds of additional downtime, but the system came back online safely afterwards
- Upon restarting the server, a query stampede to warm up the cache for the query needed to load the editor’s plugins tab caused the problem to persist for ~30 more minutes
- We made some unsuccessful attempts to quickly speed up the underlying query (first with custom SQL indexes, then with a Redis cache layer) so that all Bubble threads could get a warm cache containing the list of plugins and the stampede would be mitigated
- Ultimately, though, within 30 minutes the errors died down as some threads stopped retrying the queries, and the problem went away
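For reference, the “temporary patch” above was essentially a kill switch. A minimal sketch of that kind of guard, assuming a scheduled cleanup job keyed by app ID (the names here are hypothetical, not our actual code):

```typescript
// Sketch of a kill-switch guard inside a scheduled cleanup job.
// EXCLUDED_APP_IDS and cleanupExpiredTempUsers are hypothetical names.
declare function cleanupExpiredTempUsers(appId: string): Promise<void>;

const EXCLUDED_APP_IDS = new Set<string>(["app_a", "app_b"]); // the two affected apps

async function runScheduledTempUserCleanup(appId: string): Promise<void> {
  if (EXCLUDED_APP_IDS.has(appId)) {
    // Incident mitigation: skip the heavy deletion for the apps that triggered the
    // overload so the shard can recover; remove once the deletion logic is lighter.
    console.warn(`Skipping temp-user cleanup for ${appId} (incident mitigation)`);
    return;
  }
  await cleanupExpiredTempUsers(appId);
}
```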
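And the cancel/terminate attempt was along these lines (again a sketch, not a transcript of the console session). pg_cancel_backend asks a backend to abort its current query, and pg_terminate_backend kills the whole session; in this incident neither cleared the wedged sessions, which is why the restart was needed.

```typescript
import { Client } from "pg";

// Sketch: find sessions stuck for more than 5 minutes and try to cancel them.
async function cancelStuckBackends(client: Client): Promise<void> {
  const stuck = await client.query(`
    SELECT pid
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND pid <> pg_backend_pid()
      AND now() - query_start > interval '5 minutes'
  `);

  for (const { pid } of stuck.rows) {
    // Soft cancel (SIGINT): aborts the current query but keeps the session.
    // Note: the return value only says whether the signal was sent, not whether
    // the query actually stopped.
    await client.query("SELECT pg_cancel_backend($1)", [pid]);
  }

  // A real script would re-check pg_stat_activity after a grace period and
  // escalate to pg_terminate_backend(pid) (SIGTERM, kills the session) for
  // anything still stuck. In our case even that did not free the locks.
}
```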
What are the next steps to prevent this issue from happening again?
- Isolate our home page from other apps by moving those apps off of the database shard that contains our home page
- This would have prevented a database issue created by a customer from impacting editor features.
- Make our temporary-user clearing logic much lighter/faster (see the batched-deletion sketch after this list)
- Make pagination more efficient than offsets (see the keyset-pagination sketch after this list)
- Listing ~3k plugin rows should be “basically free”, yet it timed out during the outage, mostly because the queries used offset-based pagination, which makes higher offsets much slower to query; under degraded performance, those pages could not be fetched within the fiber time limit (65 seconds). Because we didn’t cache partial results, we kept retrying from scratch.
- Mitigate the stampede on plugin fetching by only fetching the list once (cache it in Redis, and prevent concurrent fetches; see the Redis single-flight sketch after this list)
- Offer more graceful degradation:
- Actions should still be creatable when plugin info isn’t loaded! If the only missing piece is the “display” name of the installed plugin actions, we should fall back gracefully and still render everything (see the fallback sketch after this list).
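A few sketches of the techniques mentioned above, in the spirit of the next steps rather than as final implementations. First, “lighter/faster” temporary-user clearing mostly means deleting in small batches with short transactions, so locks are held briefly and other queries can interleave (the table and column names are made up for illustration):

```typescript
import { Client } from "pg";

// Sketch: delete expired temporary users in small batches so each statement
// commits quickly and never holds locks for long. Table/column names are illustrative.
async function deleteExpiredTempUsers(client: Client, batchSize = 500): Promise<number> {
  let totalDeleted = 0;
  for (;;) {
    const { rowCount } = await client.query(
      `DELETE FROM temp_users
       WHERE id IN (
         SELECT id FROM temp_users
         WHERE expires_at < now()
         ORDER BY id
         LIMIT $1
       )`,
      [batchSize]
    );
    const deleted = rowCount ?? 0;
    totalDeleted += deleted;
    if (deleted < batchSize) break;               // nothing much left to delete
    await new Promise((r) => setTimeout(r, 100)); // brief pause so other queries can run
  }
  return totalDeleted;
}
```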
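Second, the pagination change. With OFFSET, the database still has to walk and discard every skipped row, so deep pages of a 3k-row list get progressively more expensive; keyset (cursor) pagination seeks directly past the last row seen. A rough sketch of the difference (the plugins table and columns are illustrative):

```typescript
import { Client } from "pg";

// Offset pagination: Postgres scans and discards `offset` rows on every page,
// so deep pages get slower and slower; this is the pattern that timed out.
async function pluginsPageByOffset(client: Client, pageSize: number, offset: number) {
  const { rows } = await client.query(
    "SELECT id, name FROM plugins ORDER BY id LIMIT $1 OFFSET $2",
    [pageSize, offset]
  );
  return rows;
}

// Keyset pagination: remember the last id returned and seek past it via the index,
// so every page costs roughly the same no matter how deep it is.
async function pluginsPageByKeyset(client: Client, pageSize: number, afterId: string) {
  const { rows } = await client.query(
    "SELECT id, name FROM plugins WHERE id > $2 ORDER BY id LIMIT $1",
    [pageSize, afterId]
  );
  return rows;
}
```

The caller passes the last id from the previous page (or an empty string for the first page, assuming text ids).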
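Third, the stampede mitigation: when the plugin list isn’t cached, only one process should go to the database to rebuild it, and everyone else should back off briefly and re-read the cache instead of hammering Postgres. A minimal sketch with a Redis lock using ioredis (the keys, TTLs, and loadPluginListFromDatabase are illustrative, not our actual implementation):

```typescript
import Redis from "ioredis";

const redis = new Redis(); // connection details omitted; illustrative only

const CACHE_KEY = "plugins:list";
const LOCK_KEY = "plugins:list:rebuild-lock";
const CACHE_TTL_SECONDS = 300;

// Hypothetical stand-in for the expensive query that stampeded after the restart.
declare function loadPluginListFromDatabase(): Promise<string>;

export async function getPluginList(): Promise<string | null> {
  const cached = await redis.get(CACHE_KEY);
  if (cached !== null) return cached;

  // SET NX EX: only one caller acquires the lock and rebuilds the cache.
  const gotLock = await redis.set(LOCK_KEY, "1", "EX", 30, "NX");
  if (gotLock === "OK") {
    try {
      const fresh = await loadPluginListFromDatabase();
      await redis.set(CACHE_KEY, fresh, "EX", CACHE_TTL_SECONDS);
      return fresh;
    } finally {
      await redis.del(LOCK_KEY);
    }
  }

  // Everyone else waits briefly and reads the cache rather than querying Postgres.
  await new Promise((r) => setTimeout(r, 250));
  return redis.get(CACHE_KEY);
}
```

A fuller version would also serve a stale copy while rebuilding, but the core idea is that only one fetch hits the database at a time.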
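Finally, the graceful-degradation point: if the only thing missing is a display name, the editor can render a placeholder label rather than refusing to create the action. A trivial sketch (the types here are hypothetical):

```typescript
// Hypothetical shape: the editor knows an action's internal id even when the
// plugin metadata (display names) failed to load.
interface PluginActionInfo {
  internalId: string;
  displayName?: string; // may be missing if the plugins query failed
}

// Fall back to a generic label so the action can still be created and rendered.
function actionLabel(action: PluginActionInfo): string {
  return action.displayName ?? `Plugin action (${action.internalId})`;
}
```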
Final thoughts
Once alerted to the issue, our engineering team did, in my opinion, an excellent job of diagnosing and resolving a very complicated cascading failure. However, the long delay between the incident starting and the team being alerted was completely unacceptable, and it reflects:
- A lack of investment in robust incident management processes
- Underinvestment in our automated alerting
As mentioned in my last monthly forum post, we are making some leadership changes and I am taking over direct leadership of the engineering team. One of my priorities is to invest in better production support, including hiring more people onto our cloud infrastructure team, investing more in maintenance of our alerting, and completely overhauling our incident response policies. I expect this to take on the order of 1 - 2 months to get the fundamentals in place, and another 4 - 5 months for the hiring and investment in alerting to fully pay off. I am very sorry for this incident, and I am committed to upping our game as an organization in how we respond to situations like this.