Hi all,
I wanted to follow up on Wednesday's extended downtime with our postmortem and reflections. We take uptime extremely seriously: for everyone who depends on Bubble as the primary tech stack for their business, downtime is disruptive and costly, and it can undermine the trust you have built with your own users. Speaking for our team, we all feel personally responsible when something like this happens, and we want to make sure we are constantly learning and improving to drive reliability.
Impact
From approximately 10:02 am to 12:20 pm EDT on March 22, our main Bubble cluster (both Immediate and Scheduled) was sporadically unavailable; many apps either did not load at all or suffered severe performance degradation. Many workflows scheduled during that period either did not run on time or failed with errors. A minority of applications continued to have page-loading issues until 1 pm.
During this time, customers on Dedicated plans experienced issues editing their apps, including trouble logging in to their editors and problems loading various editor features (primarily plugins). However, apps running on Dedicated were not impacted.
The Bubble forum was down for part of this time as well.
What Happened
The incident was caused by a distributed denial-of-service (DDoS) attack targeting multiple pages on bubble.io. We do not know the motive for the attack. Our general policy is to assume that we need to be prepared, at any time, to defend against DDoS attacks targeting both our own website and our users' websites. We do not attempt to engage with or investigate the responsible party.
During the peak of the attack, we received 35x our normal traffic volume. We were unable to scale our infrastructure quickly enough to a level that could handle that load. This led to a series of failures that caused our internal traffic routing to cut in and out over the course of the incident and left our primary Bubble servers unable to fulfill requests.
The Dedicated impact occurred because Dedicated servers contact the main cluster to:
- Authenticate users logging in (the “sign in with Bubble” button)
- Fetch various pieces of metadata, including the list of available plugins
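To make that dependency concrete, here is a minimal sketch of the call path. The endpoint names and types are hypothetical, purely for illustration; the point is that the editor's sign-in and plugin-list requests terminate on the main cluster, so when the main cluster is degraded those editor features fail even though the Dedicated app servers themselves stay healthy.

```typescript
// Hypothetical illustration of the Dedicated-editor dependency on the main cluster.
// Endpoint paths and shapes are assumptions for this sketch, not Bubble's real API.

const MAIN_CLUSTER = "https://bubble.io"; // the cluster that was degraded during the incident

interface PluginMetadata {
  id: string;
  name: string;
  version: string;
}

// "Sign in with Bubble": the Dedicated editor validates the user's session
// against the main cluster, so editor login fails when the main cluster is unreachable.
async function authenticateEditorUser(sessionToken: string): Promise<boolean> {
  const res = await fetch(`${MAIN_CLUSTER}/api/editor/auth`, {
    method: "POST",
    headers: { Authorization: `Bearer ${sessionToken}` },
  });
  return res.ok;
}

// The list of available plugins is also fetched from the main cluster,
// which is why plugin loading broke in Dedicated editors during the incident.
async function fetchAvailablePlugins(): Promise<PluginMetadata[]> {
  const res = await fetch(`${MAIN_CLUSTER}/api/editor/plugins`);
  if (!res.ok) throw new Error(`Main cluster unavailable: ${res.status}`);
  return (await res.json()) as PluginMetadata[];
}
```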
The timeline of the event is as follows:
- 10:02 am EDT: DDoS attack begins, traffic volume starts ramping up
- 10:04 am: We start seeing alerts, an engineer begins investigating
- 10:09 am: Cloudflare's automated systems notify us that we are under attack, and they enable automatic mitigations on their end
- 10:15 am: Our full incident response team is working on the issue
- 10:15–11:45 am: We are investigating the issue and attempting various mitigations, some of which temporarily restore service
- 11:45 am: We put a mitigation in place that stabilizes the main cluster from an infrastructure perspective, although various features (including loading any assets requiring CORS) remain broken
- 12:20 pm: Main cluster fully restored
- 12:53 pm: We see customer bug reports about lingering issues caused by various caches being in bad states
- 1:00 pm: We finish clearing caches and fully restore service
Things We Learned
In the past, we have successfully repelled DDoS attacks with little or no user impact via a combination of Cloudflare's automated DDoS mitigation functionality and a layer of in-house defenses that sits between Cloudflare and the rest of our infrastructure.
This attack penetrated our defenses because it was unusually large and well executed. DDoS attacks work by attempting to look like normal traffic (as if your regular users had simply increased the volume of their activity); DDoS defense involves differentiating the attacking traffic from normal traffic so that you can block or gate the former without impairing the latter. There are various levels of defensive aggressiveness: to simplify a complex subject, you can err on the side of only blocking traffic you are sure is bad, to avoid blocking your actual users, or you can err on the side of impeding your actual users to make sure you block all the bad traffic.
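As a toy illustration of that tradeoff (not a description of our actual defenses, and with invented scoring heuristics and thresholds), you can think of a defense as scoring each request and challenging anything above a threshold:

```typescript
// Toy model of the aggressiveness tradeoff described above; the scoring
// heuristics and thresholds are invented for illustration only.

interface IncomingRequest {
  ip: string;
  userAgent: string;
  requestsInLastMinute: number; // observed rate for this client
}

type Verdict = "allow" | "challenge";

// Assign a rough "suspicion" score: higher means more likely to be attack traffic.
function suspicionScore(req: IncomingRequest): number {
  let score = 0;
  if (req.requestsInLastMinute > 100) score += 50; // far above a typical user's rate
  if (req.userAgent.trim() === "") score += 30;    // many attack tools send no user agent
  return score;
}

// The threshold encodes the posture: a conservative defense (high threshold)
// only challenges traffic it is quite sure is bad; an aggressive defense
// (low threshold) stops more of the attack but also challenges real users.
function decide(req: IncomingRequest, threshold: number): Verdict {
  return suspicionScore(req) >= threshold ? "challenge" : "allow";
}

const normalUser: IncomingRequest = { ip: "203.0.113.7", userAgent: "Mozilla/5.0", requestsInLastMinute: 12 };
console.log(decide(normalUser, 80)); // conservative posture: "allow"
console.log(decide(normalUser, 10)); // aggressive posture: still "allow" here, but borderline clients start getting challenged
```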
Cloudflare's default configuration attempts to strike a balance between those concerns and is weighted somewhat conservatively to avoid interrupting legitimate traffic. However, it provides a rich set of tools for customizing your posture, both in advance of an attack, to take advantage of what you know about your own typical traffic patterns, and during the attack itself, to respond dynamically to what the attackers are doing and limit the extent of the impact.
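For example, Cloudflare's custom rules let you express that kind of site-specific posture in its rules language. The sketch below shows the general shape of such a rule; the expression and action are generic illustrations, not rules we actually run:

```typescript
// Illustrative only: the shape of a Cloudflare custom rule, with its condition
// written in Cloudflare's rules language. This is a generic example of tightening
// posture for expensive endpoints, not Bubble's actual configuration.
const exampleCustomRule = {
  description: "Challenge risky traffic hitting expensive endpoints",
  // Combines a threat-score field with a path match; tune to your own traffic patterns.
  expression: '(cf.threat_score gt 20 and http.request.uri.path contains "/api/")',
  // A managed challenge lets Cloudflare pick the least intrusive check for the client.
  action: "managed_challenge",
};
```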
Based on the success of past DDoS mitigations, we had not invested extensively in tailoring Cloudflare's default settings to our use case, and our engineering team was not particularly familiar with all the capabilities Cloudflare offers. Wednesday's attack got through Cloudflare's default filtering, which blocked only a fraction of the traffic. As a team, we worked hard to find the most effective way to mitigate the attack, which involved a combination of:
- Attempting to scale out our infrastructure on an emergency basis
- Fixing the failures caused by our infrastructure being overwhelmed
- Modifying our in-house-built defenses to better respond to the attack
- Learning about Cloudflare's tooling and adjusting our configuration
Our main technical takeaway from the incident is that our time would have been spent most effectively on top-of-stack defenses: fully familiarizing ourselves with Cloudflare's tooling in advance, and setting up some configuration upfront based on what we already know about our usage, would have made our response much faster and might have prevented any user-facing impact.
From a communication and incident management standpoint, another takeaway was that we did not use status.bubble.io effectively. We have tooling that updates status.bubble.io automatically whenever we detect that Bubble is down, which we built years ago when there was typically at most one engineer (often me) responding to an issue. In a situation like Wednesday's, however, where Bubble is sporadically unavailable, that automation results in status page spam (and text message spam for everyone who subscribes to it). Meanwhile, although our engineering team was investigating within two minutes of the attack's start, we did not post a human-written message to our status page until we were 29 minutes into the incident. We plan to disable the automated alerts and instead build a consistent process for the engineering team to keep status.bubble.io updated.
Action Items
We are taking the following short-term actions in response to the incident:
- Adjusting our Cloudflare configuration to better reflect our typical traffic patterns
- Training our team on Cloudflare's suite of capabilities
- Building a DDoS incident runbook so that we can execute a prepared plan if a future attack gets through our automated defenses (a sketch of one such scripted step follows this list)
- Training the team on when to update status.bubble.io and running through simulated incidents to make sure everyone knows their role
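As an illustration of the kind of step such a runbook might script, here is a minimal sketch that toggles a Cloudflare zone into "I'm Under Attack" mode via the zone settings API. The zone ID and token handling are placeholders, and this is a sketch of the idea rather than our actual tooling:

```typescript
// Sketch of a scripted runbook step: flip a Cloudflare zone's security level
// to "under_attack" via the zone settings API. Credentials come from the
// environment in this example; error handling is kept minimal for brevity.

const CF_API = "https://api.cloudflare.com/client/v4";

async function setSecurityLevel(
  zoneId: string,
  token: string,
  level: "medium" | "high" | "under_attack"
): Promise<void> {
  const res = await fetch(`${CF_API}/zones/${zoneId}/settings/security_level`, {
    method: "PATCH",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ value: level }),
  });
  if (!res.ok) {
    throw new Error(`Failed to update security level: ${res.status} ${await res.text()}`);
  }
}

// During an attack, the runbook step would be roughly:
// await setSecurityLevel(process.env.CF_ZONE_ID!, process.env.CF_API_TOKEN!, "under_attack");
```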
Additionally, the following long-term projects would have reduced the impact of the DDoS (all of them were already on our roadmap prior to the incident):
- Segregating our main cluster into multiple shared hosting environments, which will contain the impact of any DDoS to a subset of customers and enable us to scale our infrastructure to handle much higher traffic volumes
- Moving our bubble.io homepage to an isolated environment to keep it separated from user apps
- Moving editor authentication and metadata out of the main cluster, so that customers on Dedicated plans do not experience issues accessing the editor in a situation like this
We are extremely sorry for the impact that Wednesday’s events had on our community. We know how much Bubble’s stability matters to all of you, and are committed to investing in reliability to make Bubble a world-class hosting platform. We appreciate the support we have received from our community over the last two days!