Hi all,
I’m Payam (pie um, he/him), director of platform engineering here at Bubble. This is the postmortem for the outage that we had last night.
Reliability and building user trust are Bubble’s #1 priority. I’m very sorry to the community that this incident happened. I know how frustrating it is to lose some of your work. And I want to thank everyone who wrote in last night and worked with me in real time to help sort things out. I will explain what the impact was, why it happened, what we did to fix it, and what’s next for us.
Impact:
- Between 2:05–5:18 AM ET, the Bubble editor was in a degraded or unavailable state for all shared environments.
- Changes made in the editor during that window were lost.
- Four dedicated environments had editor and run mode down for roughly 2.5 hours starting at 12 AM ET.
- Performance of editor functions was degraded from around 5:18 AM ET until about 10 AM ET.
- Run mode and live delivery of production apps were not impacted on the shared environment.
Why did this happen?
AWS forced an upgrade on a subset of Bubble’s databases to a version that is incompatible with our code, even though Bubble had auto-upgrade disabled. These upgrades caused a series of outages, the first of which we were alerted to in a dedicated environment.
AWS Support said that the forced upgrade happened because they deprecated this database version. We weren’t informed or given the opportunity to opt out or postpone the process and are pursuing a separate postmortem for this with Amazon.
The upgrade jumped ahead two versions, when a single-version upgrade would have been sufficient and would have prevented an outage.
How did we fix it?
We began manually upgrading all databases to the version one step ahead as fast as we could, which would prevent the errant auto-upgrade.
Unfortunately, the auto-upgrade on our AppServer database, the backend for the editor, began right as we noticed the problem. That auto-upgrade immediately triggered an alert and was our first update on the status page.
To fix AppServer, we had to wait for the forced upgrade to complete, then create a new instance on the supported version from the most recent snapshot available. This process took a few hours and created an approximately two-hour window of data loss in the editor.
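For readers wondering why restoring from a snapshot produces a data-loss window, here is a minimal Python sketch of the arithmetic: the restore point is the newest snapshot taken before the forced upgrade began, and every write after that snapshot is lost. The function names and all timestamps below are hypothetical, illustrative values, not the real snapshot schedule or incident times.

```python
from datetime import datetime, timedelta

def latest_snapshot_before(snapshots, cutoff):
    """Return the newest snapshot time strictly before `cutoff`, or None."""
    candidates = [s for s in snapshots if s < cutoff]
    return max(candidates) if candidates else None

def data_loss_window(snapshot_time, last_write_time):
    """Writes made after the snapshot are not in the restored instance."""
    return max(last_write_time - snapshot_time, timedelta(0))

# Hypothetical timestamps, chosen only to illustrate the calculation.
snapshots = [
    datetime(2024, 1, 1, 22, 0),
    datetime(2024, 1, 2, 0, 0),
    datetime(2024, 1, 2, 3, 0),  # taken after the upgrade began, so unusable
]
upgrade_started = datetime(2024, 1, 2, 2, 5)
last_write = datetime(2024, 1, 2, 2, 0)

restore_point = latest_snapshot_before(snapshots, upgrade_started)
lost = data_loss_window(restore_point, last_write)
print(f"Restore from {restore_point}, losing {lost} of changes")
```

The gap between snapshots sets the worst-case loss, which is why a shorter snapshot interval or continuous backup shrinks the window.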
Afterward, we ensured that every database in our system was on the correct version and that auto-upgrade remained off.
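A check like this can be automated. The sketch below is a hedged illustration, not the tooling we actually used: it assumes the databases run on Amazon RDS and that each record is a dictionary using the field names from the RDS DescribeDBInstances response (`DBInstanceIdentifier`, `EngineVersion`, `AutoMinorVersionUpgrade`). The function name, instance names, and version strings are hypothetical.

```python
def audit_databases(instances, expected_version):
    """Return (identifier, problem) pairs for instances that need attention."""
    problems = []
    for db in instances:
        name = db["DBInstanceIdentifier"]
        # Flag any instance that is not on the version we expect.
        if db["EngineVersion"] != expected_version:
            problems.append(
                (name, f"on version {db['EngineVersion']}, expected {expected_version}")
            )
        # Flag any instance where automatic upgrades are switched on.
        if db["AutoMinorVersionUpgrade"]:
            problems.append((name, "auto-upgrade is enabled"))
    return problems

# Hypothetical inventory; names and versions are placeholders.
inventory = [
    {"DBInstanceIdentifier": "appserver", "EngineVersion": "X.2",
     "AutoMinorVersionUpgrade": False},
    {"DBInstanceIdentifier": "shared-1", "EngineVersion": "X.1",
     "AutoMinorVersionUpgrade": False},
    {"DBInstanceIdentifier": "dedicated-7", "EngineVersion": "X.2",
     "AutoMinorVersionUpgrade": True},
]

findings = audit_databases(inventory, expected_version="X.2")
for name, problem in findings:
    print(f"{name}: {problem}")
```

An empty result means every instance is on the expected version with auto-upgrade off. Note that, as this incident showed, a disabled auto-upgrade flag does not rule out a provider-forced upgrade of a deprecated version.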
As AppServer came back up, we observed performance degradation for editor functions, which was expected for a new database with a cold cache. The degradation showed up as behavior like unsuccessful merges. We continued to monitor the database, and it steadily returned to its normal state over the next few hours as the cache warmed up.
What’s next?
We are pursuing a separate postmortem with Amazon to find out why we weren’t informed, why the forced upgrade jumped two versions instead of one, and how to make sure an incident like this doesn’t happen again. We’re also considering an upgrade to Amazon Aurora. Its point-in-time recovery feature would have alleviated some, but not all, of the data loss in the editor.
Regarding the code incompatibility I’ve mentioned: We’ve been hard at work removing that dependency over the last few months. It is a software library we are deprecating, and we are moving to a new implementation that will improve performance and enable a significant rearchitecting of our entire data plane. This will lead to much more scalability, and improve reliability and performance. Also, the database version we are now on has extended support, so we certainly won’t run into this any time soon.
As we continue to learn and improve, so will our reliability, as it steadily has been in recent months. I’m proud of that work, but I am personally upset that this incident happened, and I feel genuine pain for the users who lost hours of work in the editor. We’ve all experienced that kind of loss, and it hurts. What I can promise is that we are going to continue to get better.
I want to thank each of you for continuing to place your trust in Bubble, and for the way you partnered with me last night to identify and resolve the issue. Community members even reached out to me directly by DM on the forum to offer support, and that was an incredible feeling. We all rise together.
With respect,
Payam Azadi