Update: Postmortem on April 24 incident

Hi all,

Following up on Josh’s post, I’d like to share the postmortem for the outage we had yesterday.

Impact:
From approximately 3:01 PM Eastern on April 24 until 4:00 PM Eastern, our main Bubble cluster (both immediate and scheduled workflows) was unavailable, resulting in those Bubble applications’ frontends being unreachable and their backend workflows not executing properly. This incident did not affect customers on Dedicated plans.

What Happened:
Code was deployed that inadvertently put enormous pressure on our caching layer. Within 5 minutes, automated alerting notified us of the problem, and we initiated a rollback. Our rollback process deploys new infrastructure running the previous version of the code, and as part of bringing that infrastructure live, we run health checks before putting it into service. The health check includes a check against our caching layer, which was unhealthy, so the new infrastructure could not pass it. Most of the outage was then spent manually bringing the new infrastructure down, clearing load on the caching layer, and bringing the new infrastructure back up again.

The purpose of the code change was to add additional observability into the services that manage scheduled tasks for apps in our shared environment. The new instrumentation is meant to ensure we can identify and mitigate edge cases that might otherwise result in a single app causing system-wide degradation. Unfortunately, the implementation collected data in a way that did not perform well under high load.
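
To make the failure mode a little more concrete: the sketch below is purely illustrative. It is not our actual code, and every name in it is made up, but it shows the general shape of the problem, where instrumentation that makes one synchronous write to a shared cache per event behaves very differently under load than instrumentation that counts locally and flushes periodically.

```typescript
// Purely illustrative sketch of the general anti-pattern, not Bubble's actual code.
// The interface, class, and key names below are all made up.

interface CacheClient {
  incrBy(key: string, amount: number): Promise<number>;
}

// Anti-pattern: one cache round trip per scheduled-task event. Under high event
// volume this multiplies traffic to the shared caching layer.
async function recordTaskRunPerEvent(cache: CacheClient, appId: string): Promise<void> {
  await cache.incrBy(`scheduled-task-runs:${appId}`, 1); // hot path now depends on cache throughput
}

// Safer shape: count in memory and flush on an interval, so the cache sees a
// bounded number of writes no matter how many events fire.
class TaskRunRecorder {
  private counts = new Map<string, number>();

  constructor(private cache: CacheClient, flushIntervalMs = 10_000) {
    setInterval(() => void this.flush(), flushIntervalMs);
  }

  record(appId: string): void {
    this.counts.set(appId, (this.counts.get(appId) ?? 0) + 1); // no I/O on the hot path
  }

  private async flush(): Promise<void> {
    const snapshot = this.counts;
    this.counts = new Map();
    for (const [appId, count] of snapshot) {
      await this.cache.incrBy(`scheduled-task-runs:${appId}`, count);
    }
  }
}
```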

The broader context is that we recently came off a week-long code freeze that was put in place while all Bubble engineers gathered at our New York headquarters for a team offsite. We had measures in place to make sure that the resumption of code deployments was not a rush that would lead to incidents. Unfortunately, these measures did not fully work: this change, along with a couple of other errant code changes, caused an outage and some regressions yesterday.

At the highest level, we are trying to move swiftly to gain ground on known flaws in our stability. Because we work with a legacy codebase, this requires us to carefully strike the right balance between speed and risk, so that we can move fast without breaking too many things. We are working towards a shared understanding of how to strike that balance, and then making sure we meet it consistently.

Learnings and action items:

There are several learnings here, and action items for each.

  • Learning: although our code review process screens for software performance, we do not have load testing running continuously in our test pipelines. Continuous load testing would have caught non-performant code that slipped past human reviewers and passed tests run on a single machine with a single user, by validating it under simulated normal user load.
    • Action: there are additional improvements to our code review process that could have caught concerns with this code at review time, which we will make over the next few days. Additionally, building a load testing environment was already on our roadmap; this incident elevates the priority of delivering it. (A sketch of the kind of check we have in mind follows this list.)
  • Learning: including dependent services such as our caching layer in the health check for our application servers is not serving us. It greatly exacerbated the downtime in this incident and made it more complicated to troubleshoot. We have had other incidents in the past where this came up, but at the time we judged the benefits of a health check that also verifies dependencies to outweigh the risks. That thinking has changed: we now see the risks outweighing the benefits.
    • Action: in the next few days, review and replace health checks for application infrastructure that depend on other services (see the health check sketch after this list). This will make it significantly easier to quickly identify which part of our system is broken, and in many cases keep parts of our environment working instead of having nothing working.
  • Learning: the whole cycle of deploying, detecting a flaw, and rolling back should be an automated, turnkey process. Instead, it required engineers with deep institutional knowledge to troubleshoot live.
    • Action: investigate over the next few days whether we can tie certain automated alerts (such as the ones that fired here) to an automatic rollback process (see the last sketch after this list). Paired with the health check change, this could have reduced the impact of this incident to a few minutes instead of close to an hour.
  • Learning: on our platform team especially, almost the entire roadmap is dedicated to improving reliability. That said, we still have reliability incidents day to day, and while we have recently gotten very good at writing postmortems for everything, we have not yet gotten good at nailing down the action items, executing them, and sharing them.
    • Action: over the next few days, we are doing a postmortem-and-learning-review-a-rama. All other work will stop until every major regression and outage from the last week has a documented postmortem, in-person learning reviews have been conducted that build consensus on which action items to tackle, and those action items are completed. This is a backlog-clearing process that I will personally oversee.
  • Learning: there was an unusually high number of known-risky code changes being deployed because of the code freeze. More generally, and beyond this specific incident, we should have additional guardrails in place after a code freeze to prevent this class of failure.
    • Action: after the next code freeze, once deployments resume, managers will be responsible for vetting and scheduling any code changes known to be risky with their team and with their partner teams.
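
To illustrate the kind of continuous load check we have in mind for the first item above, here is a minimal smoke-style load test a pipeline could run. This is a sketch only: the URL, concurrency, and latency budget are placeholders rather than our real configuration.

```typescript
// Minimal load-smoke sketch (hypothetical URL and thresholds), runnable on Node 18+.
// Fires batches of concurrent requests and fails the build if p95 latency blows a budget.

const TARGET_URL = process.env.LOAD_TEST_URL ?? "http://localhost:3000/health"; // placeholder
const CONCURRENCY = 50;    // simulated simultaneous users (placeholder)
const BATCHES = 20;        // total requests = CONCURRENCY * BATCHES
const P95_BUDGET_MS = 500; // placeholder latency budget

async function timedRequest(): Promise<number> {
  const start = performance.now();
  const res = await fetch(TARGET_URL);
  await res.arrayBuffer(); // drain the body so timing covers the full response
  return performance.now() - start;
}

async function main(): Promise<void> {
  const latencies: number[] = [];
  for (let batch = 0; batch < BATCHES; batch++) {
    const results = await Promise.all(
      Array.from({ length: CONCURRENCY }, () => timedRequest()),
    );
    latencies.push(...results);
  }
  latencies.sort((a, b) => a - b);
  const p95 = latencies[Math.floor(latencies.length * 0.95)];
  console.log(`p95 latency: ${p95.toFixed(1)}ms over ${latencies.length} requests`);
  if (p95 > P95_BUDGET_MS) {
    console.error(`FAIL: p95 exceeds the ${P95_BUDGET_MS}ms budget`);
    process.exit(1); // non-zero exit fails the pipeline step
  }
}

void main();
```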
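
For the health check change in the second item, the direction we are leaning towards looks roughly like the sketch below, assuming an Express-style service; the route names and the cache client are placeholders, not our actual implementation. The idea is that the check used to gate traffic only asserts the server itself can respond, while dependency status is reported on a separate endpoint for observability rather than for routing decisions.

```typescript
// Sketch of the proposed split between a shallow traffic-gating check and a
// deep dependency check. Route names and the cache client are placeholders.
import express from "express";

interface CacheClient {
  ping(): Promise<void>; // resolves if the cache answers, rejects otherwise
}

export function registerHealthRoutes(app: express.Express, cache: CacheClient): void {
  // Shallow check: used by load balancers and deploy automation to gate traffic.
  // It only asserts that this process can accept and answer requests, so an
  // unhealthy cache no longer blocks replacement infrastructure from taking traffic.
  app.get("/healthz", (_req, res) => {
    res.status(200).json({ status: "ok" });
  });

  // Deep check: reports dependency status for dashboards and alerting, but is
  // deliberately not wired into traffic-routing decisions.
  app.get("/healthz/dependencies", async (_req, res) => {
    try {
      await cache.ping();
      res.status(200).json({ cache: "ok" });
    } catch {
      res.status(503).json({ cache: "unreachable" });
    }
  });
}
```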
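
Finally, for the alert-driven rollback we are investigating in the third item, the wiring could look something like the last sketch below. This is the most speculative of the three: the webhook payload, alert names, and rollback command are hypothetical stand-ins for our real alerting and deploy tooling, and a production version would need guardrails such as rate limiting, a human override, and audit logging.

```typescript
// Hypothetical sketch of wiring an alerting webhook to an automated rollback.
// The payload shape, alert names, and "rollback.sh" are placeholders.
import express from "express";
import { execFile } from "node:child_process";

const app = express();
app.use(express.json());

const ROLLBACK_ALERTS = new Set(["cache-pressure-critical", "error-rate-critical"]); // placeholders
let rollbackInFlight = false; // crude guard so repeated alerts don't stack rollbacks

app.post("/alert-webhook", (req, res) => {
  const alertName: string | undefined = req.body?.alertName;
  if (!alertName || !ROLLBACK_ALERTS.has(alertName) || rollbackInFlight) {
    res.status(202).json({ action: "none" });
    return;
  }
  rollbackInFlight = true;
  // "rollback.sh" stands in for whatever deploy tooling actually performs the rollback.
  execFile("./rollback.sh", ["--to", "previous"], (err) => {
    rollbackInFlight = false;
    console.log(err ? `rollback failed: ${err.message}` : "rollback triggered by alert");
  });
  res.status(202).json({ action: "rollback-started", alertName });
});

app.listen(8080);
```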

I take responsibility for this outage, and I am extremely sorry for the disruption to your business and your day that it caused. The response met our current objectives of a 15-minute acknowledgement time 24/7/365 and a <2-hour resolution time, but the nature of the incident leaves a lot of room for improvement. As I stated in my previous post, and as I plan to elaborate on in an upcoming reply there about reliability more broadly, my goal continues to be getting us to consistency and predictability in our delivery.

Best,

Payam Azadi
Director, Engineering
