Update: Postmortem on April 24 incident

Hi all, I wanted to apologize for the outage earlier today. We had an hour of downtime that affected all main cluster apps (both Immediate and Scheduled), along with minor impact on certain Dedicated functionality (most Dedicated apps were unaffected). We will share a public postmortem in the next few days. At this point, the situation is fully resolved and we understand what happened; the remaining work is to document the exact chain of events, make sure there aren’t any loose ends, and commit to action items.

33 Likes

Hi all,

Following up on Josh’s post, I’d like to present the postmortem for the outage we had yesterday.

Impact:
From approximately 3:01 PM Eastern on April 24 until 4:00 PM Eastern, our main Bubble cluster (both Immediate and Scheduled) was unavailable, leaving those Bubble applications’ frontends unreachable and their backend workflows unable to execute properly. This incident did not affect customers on Dedicated plans.

What Happened:
Code was deployed that inadvertently created enormous pressure on our caching layer. Within five minutes, automated alerting notified us of the problem, and we initiated a rollback. The rollback process deploys new infrastructure running the previous version of the code, and before putting that infrastructure into service we run health checks on it. The health check includes a check against our caching layer, which was unhealthy, so the new instances failed to come into service. Most of the outage time was spent manually bringing the new infrastructure down, clearing load on the cache layer, and then bringing the new infrastructure back up again.
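
To illustrate the coupling described above, here is a simplified sketch (not our actual implementation; the `CacheClient` interface, endpoints, and values are made up for the example) of a deep health check that pings the cache versus a shallow liveness check that only verifies the instance itself:

```typescript
// Hypothetical sketch only: contrasts a shallow liveness probe with a deep
// readiness check that depends on a shared cache. Names and endpoints are
// illustrative, not our real ones.
import { createServer } from "node:http";

interface CacheClient {
  // Resolves if the cache responds within the timeout; rejects otherwise.
  ping(timeoutMs: number): Promise<void>;
}

function startHealthEndpoints(cache: CacheClient) {
  return createServer(async (req, res) => {
    if (req.url === "/livez") {
      // Shallow check: only asserts this instance is up and serving traffic.
      res.writeHead(200).end("ok");
      return;
    }
    if (req.url === "/healthz") {
      // Deep check: also requires the shared cache to be healthy. When the
      // cache is overloaded, every freshly provisioned instance fails this
      // check and never enters service, which is the dynamic that prolonged
      // the rollback here.
      try {
        await cache.ping(500);
        res.writeHead(200).end("ok");
      } catch {
        res.writeHead(503).end("cache unreachable");
      }
      return;
    }
    res.writeHead(404).end();
  }).listen(8080);
}
```

Pointing the load balancer’s readiness probe at the shallow endpoint, and surfacing cache health through monitoring instead, is one way to decouple the two; that is essentially the health check action item listed below.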

The purpose of the code change was to add additional observability into the services that manage scheduled tasks for apps in our shared environment. The new instrumentation was intended to ensure we can identify and mitigate edge cases that might otherwise result in a single app causing system-wide degradation. However, the implementation collected data in a way that did not perform well under high load.
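
As a simplified, hypothetical illustration of this failure mode (not the actual code involved; the `MetricsStore` interface and key names are invented), the sketch below contrasts instrumentation that writes to a shared cache on every scheduled-task event with instrumentation that aggregates locally and flushes in batches:

```typescript
// Hypothetical illustration; the real implementation is not described here.
// The point is how per-event writes to a shared store can create cache
// pressure that scales directly with task volume.
interface MetricsStore {
  increment(key: string, by: number): Promise<void>;
}

// Anti-pattern: one round-trip to the shared store per scheduled-task event.
// Under high load, task volume translates directly into cache traffic.
async function recordTaskRunNaive(store: MetricsStore, appId: string) {
  await store.increment(`scheduled_tasks:${appId}`, 1);
}

// Safer pattern: accumulate counts in process memory and flush periodically,
// so cache traffic scales with the flush interval rather than task volume.
class BatchedTaskCounter {
  private counts = new Map<string, number>();

  constructor(private store: MetricsStore, flushIntervalMs = 10_000) {
    setInterval(() => void this.flush(), flushIntervalMs).unref();
  }

  recordTaskRun(appId: string): void {
    this.counts.set(appId, (this.counts.get(appId) ?? 0) + 1);
  }

  private async flush(): Promise<void> {
    const pending = this.counts;
    this.counts = new Map();
    for (const [appId, count] of pending) {
      await this.store.increment(`scheduled_tasks:${appId}`, count);
    }
  }
}
```

Whatever the real mechanism was, the load-testing learning below is what would catch this class of issue before it reaches production.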

The broader context is that we recently came off a week-long code freeze, put in place while all Bubble engineers gathered at our New York headquarters for a team offsite. We had measures in place to make sure that the resumption of code deployments was not a rush that would lead to incidents. Unfortunately, those measures did not fully work: this change, along with a couple of other errant code changes, caused an outage and some regressions yesterday.

At the highest level, we are trying to move swiftly to gain ground on known flaws in our stability. Because we work in a legacy codebase, this requires us to carefully strike the right balance between speed and risk, so that we can move fast without breaking too many things. We are working toward a shared understanding of how to strike that balance, and toward meeting it consistently.

Learnings and action items:

There are several learnings here, and action items for each.

  • Learning: although our code review process screens for software performance, we do not have load testing running continuously in our test pipelines. Load testing would have caught non-performant code that slipped past human reviewers and passed tests on a single machine with a single user, by validating it under simulated production-level load.
    • Action: there are additional improvements to our code review process that could have caught concerns with this code at review time, and we will make them over the next few days. Additionally, building a load testing environment was already on our roadmap; this incident raises the priority of delivering it (see the sketch after this list).
  • Learning: including dependent services such as the caching layer in the health check for our application servers is not serving us. It greatly exacerbated the downtime in this incident and made it more complicated to troubleshoot. This has come up in past incidents, but at the time we judged that the benefit of health checks verifying dependencies outweighed the risk. That thinking has changed: we now see the risks outweighing the benefits.
    • Action: in the next few days, review and replace health checks for application infrastructure that depend on other services. This will make it significantly easier to quickly identify which part of our system is broken and, in many cases, keep parts of our environment working instead of having nothing working.
  • Learning: the cycle of deploying, detecting a flaw, and rolling back should be an automated, turn-key process. Instead, it required engineers with deep institutional knowledge to troubleshoot live.
    • Action: investigate over the next few days whether we can tie certain automated alerts (such as the ones that fired here) to an automatic rollback process. Paired with the health check change, this could have reduced the impact of this incident from close to an hour to a few minutes.
  • Learning: on our platform team especially, almost the entire roadmap is dedicated to improving reliability. That said, we still have reliability incidents day to day, and while we have recently gotten very good at writing postmortems for everything, we have not yet gotten good at nailing down the action items, executing them, and sharing them.
    • Action: over the next few days, we are doing a postmortem-and-learning-review-a-rama. All other work will stop until every major regression and outage from the last week has a documented postmortem, an in-person learning review has been conducted to reach consensus on the action items to tackle, and those action items are completed. This is a backlog-clearing process that I will personally oversee.
  • Learning: an unusually high number of known-risky code changes were deployed once the code freeze lifted. More generally, beyond this specific incident, we should have additional safeguards in the post-code-freeze period to prevent this class of failure.
    • Action: for the next code freeze, once it lifts, managers will be responsible for vetting and scheduling any code changes that are known to be risky, both with their team and with their partner teams.
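
To make the load-testing action item concrete, here is a minimal sketch of the kind of check that could run in a test pipeline; the staging URL, concurrency, and latency threshold are placeholders rather than our real configuration, and purpose-built tools such as k6 or Locust would do this more robustly:

```typescript
// Minimal pipeline load check (sketch). Requires Node 18+ for global fetch.
// All values below are illustrative placeholders.
const TARGET = process.env.LOAD_TEST_URL ?? "https://staging.example.com/";
const CONCURRENCY = 50;          // simulated parallel users
const REQUESTS_PER_WORKER = 40;  // requests each simulated user issues
const P95_LIMIT_MS = 500;        // fail the build if p95 latency exceeds this

async function worker(latencies: number[]): Promise<void> {
  for (let i = 0; i < REQUESTS_PER_WORKER; i++) {
    const start = performance.now();
    const res = await fetch(TARGET);
    latencies.push(performance.now() - start);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
  }
}

async function main(): Promise<void> {
  const latencies: number[] = [];
  await Promise.all(
    Array.from({ length: CONCURRENCY }, () => worker(latencies))
  );

  latencies.sort((a, b) => a - b);
  const p95 = latencies[Math.floor(latencies.length * 0.95)];
  console.log(`requests=${latencies.length} p95=${p95.toFixed(1)}ms`);

  // Non-performant code that passes single-user tests tends to surface here.
  if (p95 > P95_LIMIT_MS) {
    console.error(`p95 ${p95.toFixed(1)}ms exceeds ${P95_LIMIT_MS}ms limit`);
    process.exit(1);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

The point is not the specific tool but that the check runs against simulated concurrent load and fails the pipeline automatically, rather than relying on reviewers to spot performance problems.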

I take responsibility for this outage, and I am extremely sorry for the disruption this has caused to your business and your day. Our response met our current objectives of a 15-minute acknowledgement time, 24/7/365, and a sub-2-hour resolution time, but the nature of the incident leaves a lot of room for improvement. As I stated in my previous post, and as I plan to say in an upcoming reply there about reliability more broadly, my goal continues to be getting us to consistency and predictability in our delivery.

Best,

Payam Azadi
Director, Engineering

23 Likes

Thanks for the write-up @payam.azadi, but now I have more concerns than before reading this thread.

This incident did not affect customers on Dedicated plans.

I migrated my application to dedicated on Monday SPECIFICALLY to mitigate main cluster outages. ~$3,500/mo is a decent chunk of change just so I can get reliability. My dedicated app went down on Wednesday.

Come to find out, a dedicated app with a plugin in test mode still lives in the main cluster, thus taking the dedicated app down when the main cluster is down. I highly suggest updating the “Welcome to Dedicated” intro docs to state that, as it’s not written anywhere.

I do want to give a shoutout to my Technical Success Specialist, Mirea, who was awesome in responding to me and explaining the root cause.

Second, this comment is extremely concerning:

Learning: an unusually high number of known-risky code changes were deployed once the code freeze lifted. More generally, beyond this specific incident, we should have additional safeguards in the post-code-freeze period to prevent this class of failure.

Why on earth are there ANY risky code changes being deployed? Let alone an unusually high number. I have been a full-time Bubble developer for years, and there’s always been a running joke that Bubble hires interns to write code. Say it ain’t so.

At this point, I beg the Bubble product team to simply not touch anything. Just let the app run. In March we had 99.3% uptime. That is BAD. Please stop making any changes.

16 Likes

This is for a different reason, but Bubble has always said to use non-production apps to test plugins. Whoever is developing the plugin should have known that a tedious part of the job is setting up an app environment for testing, to spare production apps the trouble.

4 Likes

Why are critical code changes being pushed at 3:00pm Eastern???

That is bizarre.

5 Likes

:face_with_diagonal_mouth:

As I’m sure you were not the one that directly caused this outage, respect to you sir for being a great leader and taking the hit :handshake:

6 Likes

Absolutely affected our dedicated instance yesterday. While our apps were available, the editor was not. We took siestas. So thanks for that, but c’mon.

4 Likes

From the perspective of enterprise / dedicated
I don’t know about others’ experiences, but we’ve had nothing but transparency from our Technical Support Specialist (hope I got his title right), and he’s always been very upfront and clear with us about “worst case scenarios” like a main cluster outage. We are aware that should the main cluster have issues, we may run into issues with the editor, including complete unavailability.

Nothing surprised us or caught us off guard. I will say our development team was on an off-site get-together during this last outage, so the editor crashing wasn’t an issue for our team, but we also didn’t hear much from our end users. We had one end user reach out and inform us that the file uploader was throwing an error, but thanks to the knowledge and wisdom shared by our Bubble representative, I can piece together an educated guess on the why without reaching out.

I do think it’s a bit of an overstatement to say that it didn’t affect people on Dedicated, though, because truthfully even dedicated instances still have several dependencies on the main cluster. The biggest plus is that in most cases, any hiccups only affect our development team, not the end users. I’d rather be annoyed and unable to be productive myself than have our end users have that experience. But it’s not fair to say that it didn’t have an effect at all, because it did. A fairer way to put it would be: “Most applications on Dedicated plans remained online, and for those that didn’t, a root cause has been identified.” Obviously I am assuming a bit at the end there, but I feel like that is a better way to present the events on Dedicated applications.

7 Likes

That’s fair, but I am just asking for better communication about the true impact of using a test plugin (especially on a dedicated plan). There’s a big difference between a general warning that my app might be slower if I test my plugins in production and a statement that doing so bridges my app to the main cluster.

When I migrated to dedicated, I received a 13-page guide on the transition; this should have been in there. What other edge cases are there that might take down a dedicated server when the main cluster is down?

4 Likes

I 100% see what you mean; it’s definitely an unexpected issue, and I’m sure there will be a note about that in the next version of that guide.

Anyway, for me that warning has always been enough to leave production apps to production plugins, and use toy apps for testing :man_shrugging:

3 Likes

The question remains the same regardless. Why don’t you guys push code after hours, meaning after 5 PM West Coast time? Why risk so much? How many people have to be on duty to revert the pushes? You guys force us to push our apps after hours for our clients; why don’t you do the same for yours?

Honestly it makes no sense that this happens every other week.

9 Likes

For me this is weird. Supposedly the editors are running on our instances. Seems like the code would be isolated but ‘parently not

Also

Ours (Mirea) is awesome! And super helpful.

5 Likes

I’m reaching out with deep concern about the ongoing disruptions we are experiencing with Bubble.io hosting. The impact of these interruptions on our business operations and client relationships has reached a critical point, and I feel compelled to address this issue publicly.

The assumption of full responsibility by Bubble.io does little to mitigate the actual issues we are facing. The fact is, the frequency of disruptions is costing us customers, damaging our reputation, and affecting our bottom line.

One of the core issues is the lack of a proper testing environment. It appears that Bubble.io is using production environments for testing purposes, resulting in constant disruptions for paying subscribers like us. This practice is unacceptable and directly contributes to the problems we are encountering on a weekly basis.

In my twelve years of running my app on my own servers, I experienced only three service disruptions. Since migrating to Bubble.io, I am now dealing with disruptions almost every week. This level of instability is unsustainable and is putting our business at risk of losing significant clients.

I have had numerous meetings with upset customers who are frustrated with the continuous issues we are facing. These disruptions not only affect our operations but also erode the trust and confidence of our clients in our services.

I urge Bubble.io to take immediate action to rectify these issues. Whether it involves restructuring testing procedures, enhancing communication regarding maintenance and updates, or implementing stricter quality control measures, something needs to change.

As a paying customer, I expect a level of service that is reliable and consistent. If these disruptions persist without significant improvement, I may have no choice but to explore alternative hosting options or revert to previous technologies, which is a decision I would prefer not to make.

I hope that by bringing attention to these issues, we can work together to find solutions and ensure a more stable hosting environment for all Bubble.io users.

Thank you for your understanding and prompt attention to this matter.

11 Likes

Appreciate the postmortem and all your hard work. I would ask, please, for a lengthy period in which no risky code is pushed, so I can rebuild trust and reputation for long enough to build a user base that will let me move to a dedicated server :pray:

2 Likes

agreed

1 Like

Hey Bek, this is a great point. We’re working to improve transparency and set clearer expectations about dedicated servers, and the Welcome to Dedicated packet is our first step toward that! It will be expanded as we continue to grow, but for now, here’s a quick reference:

Dedicated instances run on dedicated application and database compute, which is isolated from other Bubble applications. Dedicated instances do share Bubble’s overall infrastructure, and can be impacted by issues that occur in other parts of our infrastructure stack. The main dependencies Dedicated instances have on shared infrastructure are:

  • Our DNS, CDN, and networking layer
  • Real-time notifications
  • Some aspects of using the Bubble editor, including logging in, searching for, and installing plugins
  • Logs, as displayed on the “Server Logs” tab of the editor
  • Metrics, as displayed on the “App Metrics” tab of the editor
  • Accessing test plugins in runmode

Thank you for your feedback here, and for choosing dedicated! If you have questions about any of the dependencies I listed above, please reach out to your technical point of contact.

9 Likes

Dear Founder,

The ongoing disruptions and instability plaguing Bubble.io’s hosting services are no longer tolerable. As a paying customer, I demand immediate and decisive action to rectify these issues. The current state of affairs is damaging our business and reputation, and it’s imperative that you take responsibility and implement effective solutions. Our trust and investment in Bubble.io demand a reliable hosting environment, and anything less is unacceptable. I urge you to prioritize this matter and demonstrate a commitment to delivering the quality of service we expect and deserve.

4 Likes