August 6, 2024 outage postmortem

Hi all,

I’m Payam (pie um, he/him), director of platform engineering here at Bubble. This is the postmortem for the outage that we had last night.

Reliability and building user trust are Bubble’s #1 priority. I’m very sorry to the community that this incident happened. I know how frustrating it is to lose some of your work. And I want to thank everyone who wrote in last night and worked with me in real time to help sort things out. I will explain what the impact was, why it happened, what we did to fix it, and what’s next for us.

Impact:

  • Between 2:05–5:18 AM ET, the Bubble editor was in a degraded or unavailable state for all shared environments.
  • Changes made in the editor during that time were lost.
  • Four dedicated environments had editor and run mode down for roughly 2.5 hours starting at 12 AM ET.
  • Performance of editor functions was degraded starting around 5:18 AM ET and lasted until about 10 AM ET.
  • Run mode and live delivery of production apps were not impacted on the shared environment.

Why did this happen?
AWS forced an upgrade on a subset of Bubble’s databases to a version that is incompatible with our code, despite Bubble having auto-upgrade disabled. These upgrades caused a series of outages, which we were first alerted to in a dedicated environment.

AWS Support said that the forced upgrade happened because they deprecated this database version. We weren’t informed or given the opportunity to opt out or postpone the process and are pursuing a separate postmortem for this with Amazon.

The upgrade leapt ahead two versions, when a single-version upgrade would have been sufficient and would have prevented an outage.

How did we fix it?
We began manually upgrading all databases to the version one step ahead as fast as we could, which would prevent the errant auto-upgrade.

Unfortunately, the auto-upgrade on our AppServer database, the backend for the editor, began right as we noticed the problem. That auto-upgrade immediately triggered an alert and was our first update on the status page.

To fix AppServer, we had to wait for the forced upgrade to complete, then create a new instance on the supported version based on the most recent snapshot available. This process took a few hours and created an approximately two-hour window of data loss in the editor.

Afterward, we ensured that every database in our system was on the correct version and that auto-upgrade remained off.
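
For readers who want to run a similar check on their own database fleet, here is a minimal sketch of that kind of audit. It is not Bubble's tooling. The field names `DBInstanceIdentifier`, `EngineVersion`, and `AutoMinorVersionUpgrade` follow the shape of the records returned by the RDS DescribeDBInstances API, while the instance names, version numbers, and helper functions are hypothetical and chosen only for illustration.

```python
from typing import Iterable


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '13.15' into (13, 15) for comparison."""
    return tuple(int(part) for part in version.split("."))


def find_noncompliant(instances: Iterable[dict], minimum_version: str) -> list[str]:
    """Return one finding per failed check: version too old, or auto-upgrade on."""
    floor = parse_version(minimum_version)
    findings = []
    for db in instances:
        name = db["DBInstanceIdentifier"]
        # Compare numerically so that 13.9 sorts below 13.15.
        if parse_version(db["EngineVersion"]) < floor:
            findings.append(f"{name}: engine {db['EngineVersion']} is below {minimum_version}")
        if db.get("AutoMinorVersionUpgrade", False):
            findings.append(f"{name}: auto minor version upgrade is enabled")
    return findings


# Hypothetical inventory; in practice these records would come from the cloud API.
fleet = [
    {"DBInstanceIdentifier": "appserver", "EngineVersion": "13.15", "AutoMinorVersionUpgrade": False},
    {"DBInstanceIdentifier": "analytics", "EngineVersion": "13.9", "AutoMinorVersionUpgrade": True},
]

for finding in find_noncompliant(fleet, "13.15"):
    print(finding)
```

As this incident shows, a disabled auto-upgrade flag did not stop a forced upgrade off a deprecated version, so a check like this is most useful when the minimum version is kept in line with the provider's published end-of-support dates.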

As AppServer came back up, we observed performance degradation for editor functions, which was expected for a new database with a cold cache. The impact of that degradation included behavior like unsuccessful merges. We continued to monitor the database, and it steadily returned to a normal state over the next few hours as the cache warmed up.

What’s next?
We are pursuing a separate postmortem with Amazon to find out why we weren’t informed, why the forced upgrade leapt two versions instead of one, and how to make sure an incident like this doesn’t happen again. We’re also considering an upgrade to Amazon Aurora; its point-in-time recovery feature would have alleviated some, but not all, of the data loss in the editor.

Regarding the code incompatibility I’ve mentioned: We’ve been hard at work removing that dependency over the last few months. It is a software library we are deprecating, and we are moving to a new implementation that will improve performance and enable a significant rearchitecting of our entire data plane. This will lead to much more scalability, and improve reliability and performance. Also, the database version we are now on has extended support, so we certainly won’t run into this any time soon.

As we continue to learn and improve, so will our reliability, as it steadily has been in recent months. I’m proud of that work, but I am personally upset that this incident happened, and I feel genuine pain for the users who lost hours of work in the editor. We’ve all experienced that kind of loss, and it hurts. What I can promise is that we are going to continue to get better.

I want to thank each of you for continuing to place your trust in Bubble, and for the way you partnered with me last night to identify and resolve the issue. Community members even reached out to me directly by DM on the forum to offer support, and that was an incredible feeling. We all rise together.

With respect,

Payam Azadi

45 Likes

Tough day. Appreciate the openness. It’s easy to throw stones but these things happen. All the time. To everyone. Most people don’t talk about it.

15 Likes

Was any app data lost? I don’t even know how we would find out if our users lost data due to an issue like this, and I don’t believe Bubble has any way for us to find out so we can alert our users appropriately.

Every single outage like this is screaming for Bubble to create a self-hosted option so we can rely on our own infrastructure and testing, rather than hope Bubble doesn’t have an issue that suddenly takes our apps down. My guess is that the initial work to create this would be significant, but long-term, it would require much less maintenance vs. having to ensure all apps stay online through your own servers/DBs.

A WordPress-like model with a licensing fee (vs. WP’s open source) would work really well and ensure peace of mind for businesses like mine who rely on our Bubble apps for income.

10 Likes

Transparency and timely communication are key to Bubble’s success! Thank you @payam.azadi :v:

4 Likes

+1 at least give us our own regional hosting

It just couldn’t make more sense

6 Likes

Thanks for the update and for waking up in the middle of the night to help resolve this but I don’t believe this quote. I’m sure they’ve been sending emails about this for months.

Seems like this isn’t a new or isolated thing:

RDS Postgres Minor version forced upgrade : r/aws (reddit.com)

Kinda crazy considering Amazon has dependencies that are in roughly 33% of the entire internet.

Glad to see this isn’t really a Bubble-caused issue.

4 Likes

I want to express my gratitude and support. As a developer, I understand that challenges occur, and I appreciate your transparency and dedication. Your efforts reinforce my trust in Bubble.io.

Thank you for your hard work.

2 Likes

I will accept an apology in the form of a flight ticket to Bubblecon. :laughing:

7 Likes

Hey Payam,

Thanks for the detailed explanation. It’s great you’re being open about what happened.

One big issue for us was not knowing about the problem sooner. I checked the status page when things looked off, but it was all green even though the forum was blowing up with posts. A lot of us kept working and lost data because of this.

Could we maybe:

  1. Update the status page right away when something’s wrong?
  2. Add alerts in the editor that pop up based on the status page?

This way, we’d know what’s up quickly and could decide whether to keep working or not.

What do you think? Could something like this work?

Thanks,

7 Likes

@payam.azadi

We had another problem that you did not mention, and it cost us customers. We estimate that we lost $5k - $10k in revenue we expected to make, which we consider far more damaging than a few hours of lost work.

Our live app got stuck updating indefinitely. Users could not log in or do any operation that involves a data write.

I believe this is related to how Bubble was fixed yesterday, and I do not want that to happen again.

$5k-$10k lost in two hours? Sounds like a good business you’re running…

3 Likes

We run an online live course with 50-100 students, and the issue happened in the middle of a session. In the past, these sessions are where students have converted to our longer paid courses.

Our product provides value to a customer in a very short period of time, in real time, so when the site doesn’t run during those sessions, we lose a lot of the revenue we expect…

1 Like

Interesting perspective. I think it’s valid to wonder why “roughly 33% of the entire internet” did not go down last night, or why a random reddit user is aware of a previous forced migration, but a VC-backed company that recently raised a $100M round had no idea about the most recent forced migration. Seems like there should be a guy paid low 6 figures a year whose sole job it is to keep up with this stuff.

3 Likes

To clarify, the 33% number comes from ANY Amazon dependency, which ranges from web hosting to cloud scripts, libraries, and databases. It’s the software’s job to keep things up to date and off deprecated versions or tools.

2 Likes

you should postpone the WU switch another year until you get all this figured out

3 Likes

Lol, how do WUs correlate with anything related to this outage?

4 Likes

I see a lot of people trying to make it seem that Bubble should take the lion’s share of the blame. Bubble has made plenty of bad decisions in the past, but they have also made some great ones. In this case I don’t think Bubble is at fault.

To be fair to Bubble, they have been consistently working to update their codebase to prevent issues related to legacy services. While I can agree that it may have been an oversight by someone at Bubble, even the reddit comments in that post say that Amazon is not kind enough to send constant reminders about the upgrade.

I can say the same thing about Bubble developers who will moan and whine that switching from legacy plans to the WU plans suddenly made their costs explode or that they weren’t prepared, all because they did not do their due diligence for the switch despite the announcements many months ago.

I switched my legacy apps to WU within the first few months after the WU launch. This was after taking time to experiment with WU and optimizing my apps. It was tedious, frustrating and worrisome but I did it nonetheless, just like what Bubble has been trying to do by updating their codebase. I’m now at a much better place when optimizing Bubble apps.

While I agree that Bubble is still lacking in terms of user features for nitty gritty WU optimization (the kind I like), what happened here isn’t a good reason for a delay. I’m very sure that support for legacy apps is holding back Bubble from implementing features and updates to improve Bubble as a whole.

7 Likes

My experience is that Amazon has a clear set of rules for when something runs out of support and needs an upgrade. That’s also my experience with every other cloud hosting provider. Anyone should be able to know 6-12 months ahead whether there might be an issue on the horizon.

2 Likes