Incident postmortem from 11/29

EDIT: renamed thread, adding postmortem at the bottom of the post


Hi all – a couple of quick updates on the outage:

  • We think systems are currently stable, although we are still working to make sure this won’t happen again

  • We’re very sorry about the lack of public response. We had some engineers working on the problem, but due to some miscommunications / lack of process clarity, no one currently awake was handling outbound communication to the forum or our status page. The engineers working on it were mostly heads-down trying to solve the problem, which was the correct response given what they had been told to focus on.

  • Our current best assessment of the user-facing impact:

    • Roughly 10% of our apps experienced performance degradation
    • The plugins tab in the editor did not load for a few hours
  • We’ll post a longer technical post-mortem once we’re done working on the issue, but it looks like this was a pre-existing scaling issue with our system that got triggered by specific apps increasing their workload in a very specific way

---------------- Postmortem -----------------

Breaking this into two sections:

  1. Incident response and communication
  2. Technical root cause and resolution

Incident response and communication

Timeline

  • Issue started at ~11 pm ET on 11/29
    • None of our critical alerts went off, but our error logs started showing evidence of problems
    • We also started getting bug reports from users
  • Our 24/7 support team flagged the problem at ~2 am, but didn’t have a clear escalation channel, so no one on the engineering team was alerted
  • An engineer noticed the errors at ~6 am, and began working on a solution
  • The issue was resolved at ~7 am
  • We posted our first public update on the issue at ~8 am

What went wrong

  • Our out-of-hours emergency escalation protocol is based around our automated alerting, and we don’t have a clear protocol for issues that have significant user-facing impact but aren’t severe enough to trigger our automated alerting
    • As a result, our 24/7 support staff did not have a clear escalation path to wake up an engineer able to diagnose the issue and see it through to resolution
    • The engineer who diagnosed and resolved the issue wasn’t given any guidance on when to manually update our status page
  • Our automated alerting does not always go off when performance degrades enough to impact functionality but not so much that Bubble is completely inaccessible
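
To illustrate the gap described above, here is a minimal sketch of a two-tier check that reports a "degraded" state before a full outage. It is not Bubble's alerting code: the `Thresholds` values, the `classify` function, and the choice of p95 latency plus error rate as signals are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List


@dataclass
class Thresholds:
    # Illustrative values only (assumptions, not real production settings).
    degraded_p95_seconds: float = 2.0
    critical_p95_seconds: float = 10.0
    degraded_error_rate: float = 0.02
    critical_error_rate: float = 0.25


def p95(latencies: List[float]) -> float:
    """Return the 95th percentile of a window of request latencies."""
    if len(latencies) < 2:
        return latencies[0] if latencies else 0.0
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(latencies, n=20)[-1]


def classify(latencies: List[float], errors: int, total: int,
             thresholds: Thresholds = Thresholds()) -> str:
    """Classify one monitoring window as 'ok', 'degraded', or 'critical'."""
    error_rate = errors / total if total else 0.0
    latency = p95(latencies)
    if (latency >= thresholds.critical_p95_seconds
            or error_rate >= thresholds.critical_error_rate):
        return "critical"
    if (latency >= thresholds.degraded_p95_seconds
            or error_rate >= thresholds.degraded_error_rate):
        return "degraded"
    return "ok"
```

In a setup like this, a "degraded" result would notify a human even though the service is still reachable, which is the case the existing alerting missed.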

Solutions and next steps

  • Train our 24/7 support team on how to escalate to our US-based engineering team after hours
  • Train our entire engineering team on incident management protocols, including when to declare a production issue, and who is responsible for external communication
  • Hire more cloud engineers (JD coming shortly) to audit our alerting and lower our thresholds for emergency alarms, and to provide faster response times to incidents

Technical root cause and resolution

What caused the initial issue

  • Very high activity from one app and, to a lesser extent, a second app, mainly around automated temporary user deletion, overwhelmed one of our database shards. Queries on that shard became progressively slower, and some got stuck and never recovered. The stuck queries froze some Postgres transaction locks, which prevented most other queries from ever resolving, and the backlog kept building until the problem was addressed.

What systems were affected?

  • Impacted shard had heavily degraded performance (slightly slower queries for the first hour, then progressively fewer and fewer queries ever completed)

  • The two apps that originally caused this were offline because they were out of capacity while the automated temporary user deletion was running (and stuck)

  • Our home page, which uses the impacted database shard, suffered from poor performance on several pages that loaded a lot of data

  • ~10% of Bubble apps, which live on that shard, suffered similar performance loss or timeouts

  • In the editor, for everyone, the plugins tab did not load

  • In the editor, for everyone, it was not possible to create new actions if there were plugins installed on the app that provided new action types

How was the issue resolved?

  • Since a new expired item deletion kept being scheduled and run regularly (so long as the apps generated new temp users), a temporary patch was put in place to prevent automated temporary user deletion on the two apps that were causing this problem

  • Since a lot of queries were stuck, there was an attempt to cancel/terminate the queries in question from the SQL console, which did not work.

  • Ultimately, what resolved it was a restart of the database, which caused 40 seconds of extra downtime, but the system came back online safely afterwards.

  • After the restart, a query stampede from threads warming up their caches, specifically for the query needed to load the editor’s plugins tab, caused the problem to persist for ~30 more minutes.

  • Some attempts were made to quickly speed up the underlying query (first with some custom SQL indexes, then with a Redis cache layer) so that all Bubble threads could get a warm cache containing the list of plugins, which would have mitigated the stampede. These attempts failed.

  • Ultimately though, within 30 minutes the errors died down as some threads stopped retrying the queries, and the problem went away.
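
The stampede described above happens when many threads miss a cold cache at the same moment and each one runs the same expensive query. One common mitigation is "single-flight" loading, where the first caller runs the query and everyone else waits for its result. The sketch below shows the idea in-process with Python threads. It is an illustration only, not Bubble's code: `SingleFlightCache` and its methods are hypothetical names, and a cache shared across servers (for example in Redis) would need a distributed lock rather than a local one.

```python
import threading
from typing import Any, Callable, Dict


class SingleFlightCache:
    """Cache where concurrent misses for one key trigger a single load."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[str, Any] = {}
        self._inflight: Dict[str, threading.Event] = {}

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        with self._lock:
            if key in self._values:
                return self._values[key]
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                # First caller for this key: it will run the loader.
                event = threading.Event()
                self._inflight[key] = event

        if leader:
            try:
                value = loader()
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                # Wake the waiters whether the load succeeded or failed.
                with self._lock:
                    del self._inflight[key]
                event.set()

        # Followers wait for the leader instead of issuing their own query.
        event.wait()
        with self._lock:
            if key in self._values:
                return self._values[key]
        # The leader failed; retry, possibly becoming the new leader.
        return self.get(key, loader)
```

With this shape, twenty threads asking for the plugin list on a cold cache produce one database query instead of twenty.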

What are the next steps to prevent this issue from happening again?

  • Isolate our home page from other apps by moving them off of the database shard that contains our home page

    • This would have prevented a database issue created by a customer from impacting editor features.
  • Make our logic for clearing temporary users much lighter/faster

  • Make pagination more efficient than offsets

    • Listing 3k plugin rows should be “basically free”, yet it timed out during the outage. This was mostly because the queries used offset-based pagination, which makes higher offsets much slower to query; under degraded performance, the fetch failed to complete within the fiber time limit (65 seconds). Because we didn’t cache partial results, we kept retrying from scratch.
  • Mitigate the stampede on plugin fetching by only fetching the list once (cache it in Redis, and prevent concurrent fetching)

  • Offer more graceful degradation:

    • actions should still be creatable when plugin info isn’t loaded! (if the only missing piece is the “display” name of the installed plugin actions, we should fallback to and still render everything)
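
As an illustration of the pagination point in the list above, here is a small sketch contrasting offset-based paging with keyset (seek) paging. It uses an in-memory SQLite table so it can run anywhere. The `plugins` table, its columns, and all function names are hypothetical stand-ins, not Bubble's actual schema or queries, and the sketch assumes a unique, indexed, positive `id` column.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


def make_demo_db(row_count: int = 3000) -> sqlite3.Connection:
    """Build an in-memory table standing in for the plugin list."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE plugins (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO plugins (id, name) VALUES (?, ?)",
        ((i, f"plugin-{i}") for i in range(1, row_count + 1)),
    )
    conn.commit()
    return conn


def fetch_page_offset(conn: sqlite3.Connection, offset: int,
                      page_size: int) -> List[Row]:
    """Offset paging: the database walks past `offset` rows before returning any."""
    return conn.execute(
        "SELECT id, name FROM plugins ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()


def fetch_page_keyset(conn: sqlite3.Connection, after_id: int,
                      page_size: int) -> List[Row]:
    """Keyset paging: seek through the index to the first id above `after_id`."""
    return conn.execute(
        "SELECT id, name FROM plugins WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()


def fetch_all_keyset(conn: sqlite3.Connection,
                     page_size: int = 500) -> Iterator[Row]:
    """Yield every row, resuming each page after the last id seen."""
    last_id = 0  # assumes ids are positive
    while True:
        page = fetch_page_keyset(conn, last_id, page_size)
        if not page:
            return
        yield from page
        last_id = page[-1][0]
```

Because the caller only needs to remember the last `id` it received, a fetch that fails partway through could resume from that point rather than retrying from scratch, provided the pages already fetched are kept.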

Final thoughts

Once alerted to the issue, our engineering team did, in my opinion, a pretty excellent job of diagnosing and resolving a very complicated cascading failure. However, the long delay between the incident starting and the team being alerted was completely unacceptable, and represents:

  • A lack of investment in robust incident management processes
  • Underinvestment in our automated alerting

As mentioned in my last monthly forum post, we are making some leadership changes and I am taking over direct leadership of the engineering team. One of my priorities is to invest in better production support, including hiring more people onto our cloud infrastructure team, investing more in maintenance of our alerting, and completely overhauling our incident response policies. I expect this to take on the order of 1-2 months to get the fundamentals in place, and another 4-5 months for the hiring and investment in alerting to fully pay off. I am very sorry for this incident, and I am committed to upping our game as an organization in how we respond to situations like this.

42 Likes

Hey @josh - Thank you for the update.

Unfortunately I’m still experiencing issues when quickly navigating between tabs in the Database tab. The content under the tabs hides itself and Design, Workflow tab etc are unusable. Only a page refresh solves this issue.

Please watch this short Loom video.

Thanks!

1 Like

@oliviercoolen We’re looking at your Loom video but it may just be happening to your app – if you haven’t already, can you please submit a bug report in case the issue is limited to a small subset of apps?

1 Like

@josh Sure, will do. Thanks for the fast response.

1 Like

It’s ok.
We had a little fun and a rest. Some of us took an afternoon nap, some cleaned the house :joy: and we decided to declare November 30th the Bubble holiday

23 Likes

If you guys want to advertise Bubble as being a solution on which you can build real businesses, you can’t really have situations where there are bugs this big that go hours and hours without a response.

This is especially true in the economic environment as it is now where “platform risk” is a huge thing that people consider when deciding where their stack is built.

I can’t believe any of the engineers that were working on the problem didn’t think to take literally two minutes to post on the forums “Don’t worry, we’re on it”. Or that they should maybe wake the person who should be doing that up to say “Hey, we’ve got a significant problem”.

Events (and especially responses) like this are extremely confidence-eroding.

15 Likes

Hey @josh our production app is live and was working fine, but just now it somehow stopped working: when we open the application, it loads the page and the UI elements get rendered on screen, but as a user we can’t perform any action. We can’t even scroll or touch any element; it’s kind of stuck.

1 Like

@ehsan1, for what it’s worth, Josh is likely going to ask you to submit a bug report so they can take a look at your app. So, you might want to go ahead and do that now.

3 Likes

Agreed that the lack of response for that length of time was very worrying.

Though I would rather not have engineers working on anything else but fixing the issue. As a communications executive in my organization, I am tasked to take a very cautious approach in regard to any outgoing communication with the public. There are designated faces speaking for specific issues. We’re an organization of 100 so I’m sure a company like Bubble will have its own comms management team and protocols.

4 Likes

I had a very specific problem: couldn’t work on adding new actions in the Workflow tab. Now it’s solved. Thank you!

1 Like

We faced the same issue earlier this morning.

1 Like

Hey @josh thanks for the update. Better late than never :grin:

It looks like we are back to normal now :crossed_fingers:

By the way, having a status page is a good way to reassure users, provided it shows outages when they occur. At least we would know that the problem has been identified and that someone must have been alerted. However, we couldn’t see this specific outage on the status page… and the green message saying “All Systems Operational” - when they obviously weren’t - was somewhat disturbing, since it led us to believe that no one at Bubble was aware of it…

My take is that some key services might not be monitored yet. I hope that your postmortem will also address this issue and identify key services that need to be monitored from now on to reflect a more comprehensive overview of Bubble’s realtime “health”.

Thanks

5 Likes

are you comedian hahaha🫡😎

1 Like

No, welcome here: Setting up workflows/plugins - Bug?

1 Like

I’ll second @tbenita’s reply - it is critical for users to know the operational status of the service from a source we can trust.

I recommend that status.bubble.io and @bubblestatus on Twitter need to be updated, within minutes of service degradation, and at ~15 min intervals thereafter, even if it’s “we’re still working on it: next update in X minutes” - that would give us reassurance. If you don’t, then you undermine the integrity of both and we can’t trust them.

For example, on status.bubble.io the 90 day Main Bubble Cluster line is entirely green, yet further down the page it is also reporting several Past Incidents over the same period. And the last update from @bubblestatus was on Oct 25th. I feel I can’t trust either of these sources now and I will have to monitor or clutter the forum with irritating “Does anyone have issue X?” questions.

Personally, I would prefer frequent comms even if it takes someone, not necessarily an engineer, away from working the issue. I need to be able to know whether I’m seeing an error in my dev env, my live app, or Bubble as quickly as possible.

I would also separate these Status Updates from the Forum Announcements Topic - by all means start here (and prevent replies) but create a new Topic for each outage and provide updates there while allowing folk to reply so they can provide you with feedback on what they are seeing.

Please take these in the positive spirit in which they are intended.

Good luck with the post-mortem - I look forward to a summary, and the (communications) improvements you will make going forward.

Ian.

9 Likes

Strongly agreed re: the comments about the importance of keeping our status page updated. The gaps in our alerting and communication processes that led to it not being updated are going to be a big focus of our internal postmortem.

18 Likes

Agreed. Bubble is not a startup anymore. It’s an enterprise solution for many corporates and is advertised as such. There can be no excuse for the lack of communication. It was unprofessional and needs to be admitted as such, otherwise no improvement will be made in future incidents. Consider the loss of revenue for enterprise users. I am not one, but I can only imagine the panic they must have felt. Not cool, Bubble.

3 Likes

I think you mean the Public Postmortem… right? :_)

If you want to enhance credibility, you need to enhance transparency @josh .

1 Like

@josh the problem is that this is not the first time, nor a rare one, that this kind of communication failure has happened and left us helpless.

One not-so-old instance was this one, where the status page was showing all fine, with no communication on the forum and no response to bug reports: None of my repeating groups are loading - #20 by mghatiya

And even that was not the first or a rare instance either.

2 Likes

I am still having issues, currently I can’t view any of my workflows as the workflow page is empty. I think Bubble should update their Status page because currently it says All Systems Operational, which clearly they are not, unless this is just me having this issue.

UPDATE: It’s working again now … was down for about 25 mins.

1 Like