3/22/2023 Incident Postmortem

Hi all,

I wanted to follow up on the extended downtime we had on Wednesday with our postmortem and reflections. We take uptime extremely seriously: we know that for everyone who depends on us as the primary tech stack for their business, downtime can be extremely disruptive and costly, and it can undermine the trust you have built with your own users. Speaking for our team, we all feel personally responsible when something like this happens, and we want to make sure we are constantly learning and improving to drive reliability.

Impact

From approximately 10:02 am to 12:20 pm EDT on March 22, our main Bubble cluster (both Immediate and Scheduled) was sporadically unavailable; many apps either did not load at all or suffered severe performance degradation. Many workflows scheduled during that period either did not run on time or failed with errors. A minority of applications continued to have page-loading issues until 1 pm.

During this time, customers on Dedicated plans experienced issues editing their apps, including trouble logging in to their editors and problems loading various editor features (primarily plugins). However, apps running on Dedicated were not impacted.

The Bubble forum was down for part of this time as well.

What Happened

The incident was caused by a distributed denial-of-service (DDoS) attack targeting multiple pages on bubble.io. We do not know the motive for the attack. Our general policy is to assume that we need to be prepared, at any time, to defend against DDoS attacks targeting our own website as well as our users’ websites. We do not attempt to engage with or investigate the responsible party.

During the peak of the attack, we received 35X our normal traffic volume. We were unable to scale our infrastructure in time to a level that could handle that traffic. This led to a series of failures that caused our internal traffic routing to cut in and out over the course of the incident and left our primary Bubble servers unable to fulfill requests.

The Dedicated impact occurred because Dedicated servers contact the main cluster to:

  • Authenticate users logging in (the “sign in with Bubble” button)
  • Fetch various pieces of metadata, including the list of available plugins

The timeline of the event is as follows:

  • 10:02 am EDT: DDoS attack begins, traffic volume starts ramping up
  • 10:04 am: We start seeing alerts, an engineer begins investigating
  • 10:09 am: CloudFlare’s automated systems notify us that we are under attack, and they enable automatic mitigations on their end
  • 10:15 am: Our full incident response team is working on the issue
  • 10:15–11:45 am: We are investigating the issue and attempting various mitigations, some of which temporarily restore service
  • 11:45 am: We put a mitigation in place that stabilizes the main cluster from an infrastructure perspective, although various features (including loading any assets requiring CORS) remain broken
  • 12:20 pm: Main cluster fully restored
  • 12:53 pm: We see customer bug reports about lingering issues caused by various caches being in bad states
  • 1:00 pm: We finish clearing caches and fully restore service

Things We Learned

In the past, we have successfully repelled DDoS attacks with no or minimal user impact via a combination of CloudFlare’s automated DDoS mitigation functionality and a layer of defenses we built in-house that sits between CloudFlare and the rest of our infrastructure.

This attack was able to penetrate our defenses because it was an unusually large and well-executed attack. DDoS attacks work by attempting to look like normal traffic (as if your regular users had simply increased the volume of their activity); DDoS defense involves differentiating between the attacking traffic and normal traffic so that you can block or gate the former without impairing the latter. There are various levels of aggressiveness of defense: to simplify a complex subject, you can err on the side of only blocking traffic that you are sure is bad, to avoid blocking your actual users, or you can err on the side of impeding your actual users to make sure you block all the bad traffic.
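
To make that tradeoff concrete, here is a deliberately simplified sketch. It is illustrative only; the signals, names, and thresholds are hypothetical and are not our actual defense logic or CloudFlare’s. Each request gets a suspicion score, and a single threshold controls how aggressive the filtering is:

    # Illustrative only: a toy request filter showing the aggressiveness
    # tradeoff described above. The signals and thresholds are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Request:
        ip_request_rate: float      # recent requests/sec from this IP
        has_browser_headers: bool   # plausible User-Agent, Accept headers, etc.
        passed_js_challenge: bool   # completed a client-side challenge

    def suspicion_score(req: Request) -> float:
        """Higher score = more likely to be attack traffic."""
        score = 0.0
        if req.ip_request_rate > 50:
            score += 0.5
        if not req.has_browser_headers:
            score += 0.3
        if not req.passed_js_challenge:
            score += 0.2
        return score

    def should_block(req: Request, threshold: float) -> bool:
        # threshold = 0.9: conservative, only block traffic we are sure is bad,
        #                  at the risk of letting attack traffic through
        # threshold = 0.3: aggressive, block most of the attack, at the risk of
        #                  challenging or blocking some legitimate users
        return suspicion_score(req) >= threshold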

CloudFlare’s default configuration attempts to strike a balance between those concerns and is weighted somewhat conservatively to avoid interrupting legitimate traffic. However, it provides a rich set of tools that can be used to customize your posture, both in advance of an attack to take advantage of what you know about your own typical traffic patterns, and during the attack itself to dynamically respond to what the attackers are doing and limit the extent of the impact.
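
As one concrete example of adjusting your posture during an attack, CloudFlare exposes a zone-level “security level” setting that can be raised to “I’m Under Attack” mode through its API, which serves a JavaScript challenge to visitors before letting requests through. The sketch below assumes a zone ID and API token supplied via environment variables; treat it as a starting point and verify the endpoint and payload against CloudFlare’s current documentation rather than relying on it verbatim.

    # Sketch: raising a zone's security level to "I'm Under Attack" mode via
    # CloudFlare's v4 API. The zone ID and token are placeholders; confirm the
    # endpoint and payload against current CloudFlare documentation.
    import os
    import requests

    zone_id = os.environ["CLOUDFLARE_ZONE_ID"]      # placeholder
    api_token = os.environ["CLOUDFLARE_API_TOKEN"]  # placeholder

    resp = requests.patch(
        f"https://api.cloudflare.com/client/v4/zones/{zone_id}/settings/security_level",
        headers={"Authorization": f"Bearer {api_token}"},
        json={"value": "under_attack"},  # serves a JS challenge to visitors
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())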

Based on the success of past DDoS mitigations, we had not invested extensively in modifying CloudFlare’s default settings to fit our use case, and our engineering team was not particularly familiar with all the capabilities that CloudFlare offers. Wednesday’s attack was able to get through CloudFlare’s default filtering, which only blocked a fraction of the traffic. As a team, we were working hard to find the most effective approach to mitigate the attack, which included a combination of:

  • Attempting to scale out our infrastructure on an emergency basis
  • Fixing the failures caused by our infrastructure being overwhelmed
  • Modifying our in-house-built defenses to better respond to the attack
  • Learning about CloudFlare’s tooling and adjusting our configuration

Our main technical takeaway from the incident is that our time during the attack would have been spent most effectively on top-of-stack defenses, i.e., blocking the attack at the CloudFlare layer before it reached our infrastructure. Fully familiarizing ourselves with CloudFlare’s tooling in advance, as well as setting up some configuration upfront based on what we already know about our usage, would have made our response much faster and potentially prevented any user-facing impact.

From a communication and incident management standpoint, another takeaway was that we did not use status.bubble.io effectively. We have tooling that updates status.bubble.io automatically whenever we detect that Bubble is down, which we built years ago when there was typically at most one engineer (often myself) responding to an issue. However, in a situation like Wednesday’s, where Bubble is sporadically unavailable, it results in status page spam (and text message spam for everyone who subscribes to it). Meanwhile, although our engineering team was investigating within two minutes of the attack’s start, we did not post a human-written message to our status page until we were 29 minutes into the incident. We plan to disable the automated alerts and instead build a consistent process for the engineering team to keep status.bubble.io updated.
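
To illustrate why the flapping produced spam, and one possible middle ground while we build that process, here is a rough sketch of debouncing automated “down” posts so that a sporadically failing service generates a single notification rather than a stream of them. The function names are hypothetical and this is not our actual tooling.

    # Sketch: debounce automated status-page "down" posts so a flapping outage
    # posts one incident instead of many. post_status_update() is a hypothetical
    # stand-in for whatever API the status-page provider exposes.
    import time

    COOLDOWN_SECONDS = 15 * 60  # suppress repeat "down" posts for 15 minutes
    _last_down_post = 0.0

    def post_status_update(message: str) -> None:
        # Placeholder: a real implementation would call the status page's API.
        print(f"[status.bubble.io] {message}")

    def report_health(is_healthy: bool) -> None:
        """Called by monitoring on every health-check cycle."""
        global _last_down_post
        now = time.time()
        if not is_healthy and now - _last_down_post > COOLDOWN_SECONDS:
            post_status_update("We are investigating elevated error rates.")
            _last_down_post = now
        # Recovery messages are left to a human, so a brief blip during a longer
        # incident does not prematurely announce that things are resolved.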

Action Items

We are taking the following short-term actions in response to the incident:

  • Adjusting our CloudFlare configuration to better reflect our typical traffic patterns
  • Training our team on CloudFlare’s suite of capabilities
  • Building a DDoS incident runbook so that we can execute a prepared plan if there is a future attack that gets through our automated defenses
  • Training the team on when to update status.bubble.io and running through simulated incidents to make sure everyone knows their role

Additionally, the following long-term roadmap items would have reduced the impact of the DDoS (all of these were already on our roadmap prior to the incident):

  • Segregating our main cluster into multiple shared hosting environments, which will limit the impact of any DDoS to a subset of customers and enable us to scale our infrastructure to handle much higher traffic volumes
  • Moving our bubble.io homepage to an isolated environment to keep it separated from user apps
  • Moving editor authentication and metadata out of the main cluster to avoid customers on Dedicated plans experiencing issues accessing the editor in a situation like this

We are extremely sorry for the impact that Wednesday’s events had on our community. We know how much Bubble’s stability matters to all of you, and are committed to investing in reliability to make Bubble a world-class hosting platform. We appreciate the support we have received from our community over the last two days!

91 Likes

Thanks for the transparency, @josh!

7 Likes

I’m grateful for the transparency and thoroughness of your explanation, @josh. Thank you!

5 Likes

Thanks, it was really cool that you’ve gone into so much detail. DDoS is such a lame attack. I think Bubble has handled this great. Thanks again!

4 Likes

Keep up the tough work and we appreciate the info.

1 Like

That’s a well-explained postmortem. Thanks, Team Bubble, for getting things back up in time.

2 Likes

Hey @josh,

Thanks for the update and for letting us know what happened during the extended downtime on Wednesday. We appreciate your transparency and the team’s efforts to make sure it won’t happen again in the future.

It’s unfortunate that the DDoS attack was able to get through the default settings, but we’re glad to hear that you’ve already taken steps to modify the configurations and familiarize yourselves with CloudFlare’s tooling for future incidents.

Also, thank you for highlighting the importance of top-of-stack defenses and using status.bubble.io effectively. We’ll keep those in mind moving forward.

2 Likes

Thanks for the post @josh. Really interesting analysis.

1 Like

Your postmortem gives me confidence. Reassuring to know it’s your hands on the steering wheel.

5 Likes

@josh so does this mean that any scheduled workflows (e.g. recursive WFs) in this timeframe would not have run, therefore ending the loop? If yes, ideally scheduled workflows would be paused when the main cluster is down, if at all possible, or there would be some kind of notification process.

5 Likes

Even though it sucks that there was an attack, I think it was for the better. You guys were able to find flaws or weak points in your process/infrastructure. Thank you for the transparency; glad to see we have good people in control of this!

3 Likes

Really appreciate the thoughtful postmortem, it’s clear you took this very seriously and your reflections and action items will certainly inspire growing confidence in the scaling and preparedness of the platform. Thanks @josh !

1 Like

Cheers Josh, is there a way to expose a white-label version of the Bubble status page so we can inform our own users of the current situation?

4 Likes

@recouk31 Have you experimented with using the available webhook?

4 Likes

I haven’t, thanks for pointing it out!

2 Likes

Just keep in mind if Bubble is down you can’t rely on a Bubble app to notify your users… :laughing:

6 Likes

sounds good guys thank you

1 Like

Hi @josh :clap: :clap: :clap:
I usually say: problems and incidents, we all have them. If they can be avoided or mitigated, great. However, acting with transparency and humility in sharing the relevant facts and follow-up actions is what differentiates an ethical company from those that have lost their sense of responsibility and empathy.

Your positioning is sensational, and I believe your attitude gives us a more solid sense that we can trust and continue with Bubble. So congratulations!

One thing to consider: it’s really important that these become lessons learned. And I mean, no investment is too much when we talk about security, IT, governance, and customer care.

P.S.: May you all always remember that, here on the other side, there are many entrepreneurs betting the future of our companies and employees on Bubble. We do not want to be wrong in choosing Bubble as a partner. And, just as important, remember that we are in many countries, not just the USA. So please don’t forget us; take care of us.

Thank you and regards :wink: :+1:

3 Likes

Hey @recouk31,

Check this out:

Works for Instatus too

2 Likes

Johnny, how are you always the first to respond to these things!? Your timing is amazing, man!

3 Likes