Postmortem discussion thread

georgecollier · April 25, 2024, 10:43pm

Original thread is locked.

Postmortem on today's incident coming soon

Hi all,

Following up on Josh’s post, I’d like to present the post-mortem from the outage we had yesterday.

Impact:
From approximately 3:01 PM Eastern on April 24 until 4:00 PM Eastern, our main Bubble cluster (both immediate and scheduled) was unavailable, resulting in those Bubble applications’ frontends being unreachable, and their backend workflows not executing properly. This incident did not affect customers on Dedicated plans.

What Happened:
Code was deployed which inadvertently created enormous pressure on our caching layer. Within 5 minutes, automated alerting notified us of the problem, and we initiated a rollback process. The rollback process deploys new infrastructure running the previous version of the code. As part of bringing the new infrastructure live, we run health checks before putting it into service. The health check includes a check against our caching layer, which was unhealthy, so the new infrastructure could not pass its health check. The rest of the time was spent manually bringing the new infrastructure down, clearing load on the cache layer, and then bringing the new infrastructure back up again. This was where most of the time during the outage went.

The purpose of the code change was to add additional observability into our services that manage scheduled tasks for apps in our shared environment. The new instrumentation was to ensure we can identify and mitigate any edge cases that otherwise might result in a single app causing system-wide degradation. The implementation of this observability collected data in a way that did not perform well under high load.

The broader context is that we recently came off of a week-long code freeze that was put in place while all Bubble engineers gathered in our New York headquarters for a team offsite. We had measures put in place to make sure that the resumption in code deployments was not a dash that would lead to incidents. Unfortunately these measures did not completely work, as this code and a couple of other errant code changes created an outage and some regressions yesterday.

At the highest level, we are trying to move swiftly to gain ground on known flaws we have in our stability. As we work with a legacy codebase, this sometimes requires us to carefully identify the right balance between speed and risk so that we can move fast but not break too many things. We are working towards converging on a shared understanding of how to strike this balance, and then making sure we are meeting it consistently.

Learnings and action items:

There are several learnings here, and action items for each.

Learning: although our code review process screens for software performance, we do not have load testing continuously running in our test pipelines. This would have caught non-performant code that missed the eyes of human reviewers and passed tests running on a single machine with a single user, by validating it in practice by simulating normal user load.

Action: there are additional improvements to our code review process that could have picked up concerns with this code at review time, which we will make over the next few days. Additionally, we already had building a load testing environment in our roadmap. This incident elevates the priority of delivering it.
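
As a rough illustration of the kind of check a load testing stage in a test pipeline could run, here is a minimal sketch in Python. It is not Bubble's tooling: `run_load`, `p95`, `passes_latency_budget`, and the latency budget are hypothetical names and values chosen for the example. The idea is to call a handler from several simulated users at once and fail the build if the 95th percentile latency exceeds a budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles
from typing import Callable, List


def timed_call(handler: Callable[[], None]) -> float:
    """Run one simulated request and return its latency in seconds."""
    start = time.perf_counter()
    handler()
    return time.perf_counter() - start


def run_load(handler: Callable[[], None], users: int, requests_per_user: int) -> List[float]:
    """Issue requests from `users` concurrent workers and collect latencies."""
    total = users * requests_per_user
    with ThreadPoolExecutor(max_workers=users) as pool:
        return list(pool.map(lambda _: timed_call(handler), range(total)))


def p95(latencies: List[float]) -> float:
    """95th percentile: the last of the 19 cut points that split the data into 20 groups."""
    return quantiles(latencies, n=20)[-1]


def passes_latency_budget(latencies: List[float], budget_seconds: float) -> bool:
    """Hypothetical pipeline gate: pass only if p95 latency is within budget."""
    return p95(latencies) <= budget_seconds
```

A single-user test would only ever exercise `timed_call` once at a time; the point of the sketch is that the gate looks at the latency distribution under concurrency.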

Learning: including dependent services such as caching layers in the health check for our application servers is not serving us, and greatly exacerbated the downtime of this incident, while also making it more complicated to troubleshoot. We had other incidents in the past where this came up, but the benefits of the health check checking dependencies outweighed the risks. That thinking has changed and we now see the risks outweigh the benefits.

Action: in the next few days, review and replace healthchecks for application infrastructure that are dependent on other services. This will make it significantly easier to quickly identify which part of our system is broken, and in many cases, keep some parts of our environment working instead of having nothing working.
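
To make the health check change concrete, here is a minimal sketch in Python of a check that separates a server's own liveness from the state of its dependencies. This is an assumption about what such a check could look like, not Bubble's implementation; `HealthReport`, `check_health`, and `should_receive_traffic` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class HealthReport:
    alive: bool                    # liveness of this server process itself
    dependencies: Dict[str, bool]  # informational only; does not gate traffic


def check_health(
    self_check: Callable[[], bool],
    dependency_checks: Dict[str, Callable[[], bool]],
) -> HealthReport:
    """Decide liveness from the server alone; report dependencies separately."""
    deps: Dict[str, bool] = {}
    for name, probe in dependency_checks.items():
        try:
            deps[name] = bool(probe())
        except Exception:
            # A failing or unreachable dependency is recorded, not propagated.
            deps[name] = False
    return HealthReport(alive=bool(self_check()), dependencies=deps)


def should_receive_traffic(report: HealthReport) -> bool:
    """Only the server's own liveness decides whether it goes into service."""
    return report.alive
```

With this split, an overloaded cache shows up as `dependencies["cache"] == False` in the report, which points at the broken component, while the application server can still be put into service.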

Learning: this whole process of doing a deployment, detecting a flaw, and rolling back, should have been an automated and turn-key process. Instead it required engineers with deep institutional knowledge to troubleshoot live.

Action: investigate over the next few days whether we can tie certain automated alerts (such as the ones that were fired here) to an automatic roll-back process. When paired with changing the health check, this could have reduced the impact of the incident to a few minutes instead of close to one hour.
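
One way to picture tying an alert to an automatic rollback is the sketch below, in Python. It is illustrative only: the function name, the error-rate metric, the threshold, and the `rollback` callback are all assumptions, not details from the incident. It triggers a rollback once a metric stays above a threshold for several consecutive samples after a deploy.

```python
from typing import Callable, Iterable


def watch_deploy(
    error_rates: Iterable[float],
    threshold: float,
    consecutive: int,
    rollback: Callable[[], None],
) -> bool:
    """Call `rollback` once `consecutive` samples in a row exceed `threshold`.

    Returns True if a rollback was triggered, False otherwise.
    """
    streak = 0
    for rate in error_rates:
        # A healthy sample resets the streak, so a single spike does not roll back.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            rollback()
            return True
    return False
```

Requiring several consecutive bad samples is a design choice to avoid rolling back on a momentary blip; the right values would depend on how noisy the real alert signal is.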

Learning: especially on our platform team, almost our entire roadmap is dedicated entirely to improving reliability. That said, we do have reliability incidents day to day, and while we have recently gotten very good at writing post mortems for everything, we have not gotten good yet at nailing down the action items and getting them executed and sharing them.

Action: over the next few days, we are doing a postmortem-and-learning-review-a-rama. All other work will stop until every major regression and outage from the last week has a documented post mortem, in-person learning reviews have produced consensus on the action items to tackle, and those action items are completed. This is a backlog clearing process that I will personally oversee.

Learning: there was an unusually high number of known risky code changes being deployed due to the code freeze. More generally rather than specific to this incident, we should have additional boundaries in the post-code-freeze atmosphere to prevent this class of failures.

Action: for the next code freeze, following the unfreeze, managers will be responsible for vetting and scheduling any code changes that are known to be risky with their team and with their partner teams.

I take responsibility for this outage and I am extremely sorry for the disruption in your business and your day that this has caused. The incident met our current objectives of 15 minutes acknowledgement time 24/7/365, and our <2 hour resolution time, but the nature of the incident leaves a lot of room for improvement. As I stated in my previous post and as I plan to say in an upcoming reply there about reliability more broadly, my goal continues to be getting us to consistency and predictability in our delivery.

Best,

Payam Azadi
Director, Engineering

georgecollier · April 25, 2024, 10:48pm

Haven’t quite a few dedicated plan people said this did affect them in some ways?

I still feel like the Bubble response leaves out any hard commitment to reliability/uptime in any respect, which is really disappointing.

How do you expect us to sell this to customers? When they ask us what happens if Bubble’s down, we have to tell them, hey, tough luck, it’s the cost of using the platform. At least help us out by committing to % uptime or certain SLOs

Still missing:

Solid commitments to reliability and performance (SLA)
Clear timeline and make public your reliability targets
Scheduled releases- being on the shared cluster feels like being guinea pigs for your engineers

Also, don’t lock the threads. You know full well we’re not quiet people

johnny · April 25, 2024, 11:20pm

I believe some dedicated clusters were affected by edge cases like a plugin in testing mode? (I guess a plugin in testing mode relies on the shared cluster? Not sure how that works)

I chatted with another bubbler who works on a dedicated cluster and they had no issues

chris.williamson1996 · April 25, 2024, 11:35pm

They’ve already committed to making reliability a focus, look how that’s going. I’m not sure how much a commitment will mean unless it’s in a SLA that they are legally bound to.

nocodejordan · April 26, 2024, 12:06am

Could the Bubble team share more on their deployment method? Is their shared cluster all just one big cluster that they roll out updates to? Can’t they deploy to a small cluster as a beta tester (maybe starting with Starter plans) before moving on to the more expensive plans?
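
For readers unfamiliar with the idea, the staged rollout suggested here can be sketched in Python as follows. Nothing in the thread confirms how Bubble deploys; the stage names and the `deploy` and `healthy` callbacks are hypothetical. The sketch deploys to one group at a time and stops at the first group that looks unhealthy, so later groups never receive the bad release.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy stage by stage; return the stages that completed successfully."""
    completed: List[str] = []
    for stage in stages:
        deploy(stage)
        if not healthy(stage):
            # Stop here: remaining stages keep running the previous release.
            break
        completed.append(stage)
    return completed
```

For example, with stages `["canary", "starter", "team"]` and a release that only misbehaves once it reaches `"starter"`, the `"team"` stage is never deployed.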

My client was a Team plan (one step below dedicated) and he was hit by this as well…this incident definitely triggered him to think of a backup plan.

Am just wondering why when every outage happens, it causes chaos to every Bubble user…

GH5T · April 26, 2024, 12:25am

I believe this has an issue with new plugin versions as well, as I’m no longer able to retrieve the correct object back on production vs the correct object on testing in an API call. As noted in my prior post.

dbevan · April 26, 2024, 12:51am

Quick question @georgecollier & @chris.williamson1996.

For myself and others reading these types of threads with less than expert knowledge on Bubble’s backend framework/infrastructure, what SHOULD we expect as customers?

Obviously 100% uptime and 100% transparent communication would be ideal but when do you believe issues like this become a deal-breaker for developers and clients?

Interested in both of your thoughts as well as anyone knowledgeable on the subject.

code-escapee · April 26, 2024, 12:59am

They locked the thread!

Res Ipsa Loquitur (“the thing speaks for itself.”)

johnny · April 26, 2024, 1:37am

I think that the only two things in terms of their deployment method are the Immediate Releases and Scheduled Releases, but in this case, both were affected

Good point, wondering the same

This outage was definitely one of the longer ones to directly affect the apps I manage in a while. I feel like a lot of the incidents reported on the status page are reported even if only 1 app is experiencing some type of technical issue that triggers their automations.

mghatiya · April 26, 2024, 3:37am

Frankly, to me that long post is just a lot of mumbo jumbo now. All these mean nothing if they keep coming again and again. I don’t get assured by them now and am immunised.

As he wrote himself, they have become expert at writing postmortem messages rather than fixing things.

stuart8 · April 26, 2024, 4:38am

Seems to be a new trend with Bubble staff writing posts and then either never responding to the community or just locking the threads so nobody can actually respond.

It is clearly a shift in the way that Bubble are operating and a big show of lack of care towards us little people. I understand that apparently a new community guy is starting at the end of the month. They have quite the job ahead to try and reconnect as at the moment, it seems the community is in pieces and more and more people are giving up / moving to other tools.

nocodejordan · April 26, 2024, 4:43am

Great point, I have moved my app to scheduled release. Hope that in future it won’t be affected.

johnny · April 26, 2024, 5:01am

If there’s one thing I’ve definitely learned about Bubble from being a Bubbler for the last 6 years, it’s that they do care and listen. Even if they don’t respond, they’ll read every post.

I do agree that it appears there are a lot of empty promises that we’re not really seeing the results of

boston85719 · April 26, 2024, 5:34am

This is great as we are able to know who is responsible for what and who we should reach out to directly if we have more serious concerns that support are not suitable to handle.

We have direct messaging capability with them. So far, I’ve been impressed with the way the higher ups and more senior people at Bubble continue to be accessible.

Like any relationship, it takes hard work, dedication, open communication (which I believe Bubble is a part of) and an understanding there will be ups and downs, for which both parties need to ride the tides of in order to reach a point where the issues become resolved. Anybody who is married or has a business partner could say the same thing.

I would say, at least they have mastered the first step…Nobody became a Black Belt in Karate without wearing the white belt for a while.

I think we can look at competitors, some of which have published their dedicated uptimes, of which I’ve seen average 99.5% - which Bubble is beating.
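
To put uptime percentages like these in perspective, here is a small worked calculation in Python. The 30-day month is an assumption, and the function names are made up for the example. Under that assumption, 99.5% uptime allows about 216 minutes of downtime per month, and the 59-minute outage described in the postmortem (3:01 PM to 4:00 PM) would on its own correspond to roughly 99.86% uptime for the month, ignoring any other incidents.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a window at a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def uptime_percent(downtime_minutes: float, days: int = 30) -> float:
    """Uptime percentage for a window given the total minutes of downtime."""
    total_minutes = days * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes
```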

If a Live app of mine was experiencing issues, and we already heard from our clients as to what they were, and it was an ‘all hands on deck’ type of situation, my phone would be off until the issue was resolved.

Thanks @payam.azadi for the details from the Postmortem…I think we ALL look forward to NEVER having to read another again

ihsanzainal84 · April 26, 2024, 6:41am

Pretty sure they locked the thread because there are already at least 3 threads, excluding this one, about the same issue, and also moderators like @mikeloc are not currently active.

gaimed · April 26, 2024, 7:05am

I think they responded pretty quickly. Also the post looks like they take downtime seriously.

There will always be downtime in a full stack platform.

psycholabdesign · April 26, 2024, 7:05am

Well, looks like a chill day at NY office.

Though I would suggest at least 30% compensation for this month’s operational costs to all Main Cluster Paid apps. The status page clearly shows a day by day outage.
Thanks.

lindsay_knowcode · April 26, 2024, 7:54am

there are two timeless conversations

we customers insist on ~ 100% reliability and full transparency
we customers demand very cheap

Controls and processes for reliability mean additional costs - that will pass to customers as price increases.

Just saying - you can moan about one or the other but not both (please)

georgecollier · April 26, 2024, 8:32am

They locked the thread because they know what the response would be.

I’m no Bubble infrastructure expert. But you’d expect at least less uptime than the dependencies (AWS/Cloudflare etc). But certainly not as low as the current uptime.

Yes, but not to this extent…

Please, if I could pay 3x as much for a reliable service, I would. I have seen apps on the starter plan that make 100k+ per year; some apps are grossly underpaying (though don’t get me started on WU).

I feel like two days after serious issues people kind of forget how serious they are and how much they affect the business operations facilitated by your apps and are quick to give the benefit of the doubt to Bubble with the ‘they’re trying!’ and ‘no service is perfect!’. Yeah, that’s right, but those services compensate users for reliability, publicly commit to specific SLOs, and have appropriate rollout procedures that stop the continual ‘oh look, Bubble’s testing in production again!’ related problems.

georgecollier · April 26, 2024, 8:36am

You’d think so… but it seems to be a consequence of decisions made years ago that make it too late to change. That’s their problem, rather than ours, but it means it seems unlikely they’ll ever give shared clusters proper release management. So, they just have to test before deploying. But apparently they don’t do load testing? They don’t test the normal, expected, platform-wide consequences of changing certain code.
