Hi all,
As I promised earlier, here’s the postmortem on the “Insert Dynamic Data” bug from yesterday, September 19.
Before I go there, some broader thoughts.
A lot of the frustration we’re hearing is due not just to the incident itself, but also to the pattern of reliability problems with our platform this year. This incident touches on two patterns in particular that we as a team are frustrated with and are working to change:
- Bugs and glitches in the editor that make the experience of editing Bubble applications painful and unreliable
- Sustained bugs and outages where you only hear from the Bubble team after the fact, and are left wondering for hours what’s going on
We’ve talked about both patterns in the past and some of what we’re working on to solve them, but we know that hearing us say “hey, we’re doing all this work behind the scenes” doesn’t help much when you can’t see tangible progress and still get burned.
We are fed up with this too. We get up in the morning trying to make Bubble better for our users – yeah, I know, some of you probably disagree with that sentiment right now! – but that’s what we’re working towards, and when the result is outages, bugs, and frustration, it’s undercutting everything we put our hearts into doing.
I want to let you know that we are going to step up our game on both fronts – editor bugs, and incident management – in a big way.
On the editor side of things, as I mentioned on the forum a couple weeks ago, we are pivoting our roadmap for the rest of the year towards tooling to support editor reliability: better tests, better monitoring, and fixing some of the underlying technical issues that make bugs common by moving us towards a componentized design system. Our VP of Product, Allen, is going to post a follow-up reply sharing more about what we are doing here.
On the incident management side of things, what I’m realizing from yesterday’s postmortem, and from the last couple of issues where we were too slow to respond, is that we have been defining severity 1 incidents – the “drop everything and fix it, wake people up, sound the fire alarms” kind – too narrowly.
Some backstory on that – not all that long ago, I was the only person in the company on call to be woken up in the middle of the night in an emergency, supported by a couple of very senior engineers who worked basically around the clock to keep our systems up. That came out of our bootstrapped roots as a company, our relatively small team size, and system complexity that makes it challenging to get new engineers to the point where they can support our infrastructure.
That mode of operating leads to an extreme triage mentality: if we only have so many people, working so many hours, we have to carefully pick our battles, and if that means it takes us an extra couple hours to respond to a slightly-less-urgent issue, so be it.
Over the last year, we’ve changed that: we now have an on-call rotation across a big chunk of the engineering team, we’ve hired a lot more people including some senior leadership, and we’re investing in building a team that collectively owns and supports our systems.
But our incident management processes still carry some cultural lag from that era. For instance, we have not been considering an editor bug a severity 1 issue, because it doesn’t break running user apps – which, as you will see below, is related to how things went down yesterday. And for issues that do break running apps, we don’t always consider them severity 1 if the number of incoming bug reports isn’t that high, which was a factor in the recent real-time updates outage that we were too slow to respond to.
This lag is not aligned with the reality of what Bubble is: a global platform, used across all timezones, for mission-critical applications where both running apps and editing apps are very important to people’s economic success.
So, we’re going to change it. I am discussing the implementation details with the team now, but over the next 1-2 weeks, I want us to be in a place where:
- We default to escalating issues: if an issue falls in a gray area where we have enough bug reports that it is pretty clear we broke something or made something worse, the whole Bubble team is trained to push the big red “Houston, we have a problem” button
- We define severity 1 problems broadly, not narrowly
- Any issue that gets escalated and confirmed as a severity 1 problem makes it to our status page
We have the team for this now, and I think it is very realistic for us to get to that world fast. The incidents we’ve mishandled over the last year have almost universally been gray-area issues where the team knew something was wrong but wasn’t sure how loudly to bang the “hey, we need to fix this and get a status update out to users now!!” drum. So we are going to make it very clear exactly what everyone should do in a situation like that going forward, so that we err on the side of over-response rather than under-response.
Okay. That’s the big picture update. There’s also a bunch of tactical things we want to fix based on what happened yesterday, so for transparency, here’s a postmortem on what happened:
———
User Impact
For approximately 14 hours, the “Insert Dynamic Data” button in the editor was broken. This is a major feature, and the breakage significantly impaired our users’ ability to edit their apps during this time period. We did not communicate about the issue publicly until it was resolved.
This issue did not impact users on the Scheduled tier or on Dedicated clusters.
Root Cause
We deployed a bug fix to address some glitches in editor interactions. This fix changed behavior that the “Insert Dynamic Data” button relied on, causing the button to become unusable. Because this was a bug fix, it was not deployed behind a feature flag, and instead immediately impacted production behavior.
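For readers less familiar with feature flags: new feature work normally ships behind a gate like the rough sketch below, so a regression can be switched off without a new deploy. This is a generic, hypothetical illustration – the names and flag store are made up, not Bubble’s actual implementation. A bug fix without such a gate replaces the existing code path directly, which is why this one hit all editors at once.

```typescript
// Hypothetical sketch of feature-flag gating; names and the flag store are
// illustrative only, not Bubble's actual implementation.
const enabledFlags = new Set<string>(); // e.g. populated per app from a rollout config

function isFlagEnabled(flag: string): boolean {
  return enabledFlags.has(flag);
}

function existingInteractionBehavior(): void {
  console.log("existing editor interaction path");
}

function newInteractionBehavior(): void {
  console.log("new editor interaction path, rolled out gradually");
}

// Feature work goes through a gate, so it can be turned off without a deploy.
// A fix shipped without a gate replaces the existing path for everyone at once.
export function handleEditorInteraction(): void {
  if (isFlagEnabled("new-editor-interaction")) {
    newInteractionBehavior();
  } else {
    existingInteractionBehavior();
  }
}
```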
We have automated tests that exercise the “Insert Dynamic Data” button, but those tests work by activating the button’s JavaScript click handler directly rather than simulating a real browser click, so they did not catch the problem.
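To make that gap concrete, here is a rough sketch of the difference, using Playwright purely as a stand-in test framework – the framework choice, URL, and selectors are illustrative assumptions, not our actual test suite or the editor’s real DOM. A handler-level click invokes the button’s JavaScript directly and can pass even if the button is hidden or covered; a browser-level click dispatches real pointer events and fails in those cases.

```typescript
import { test, expect } from "@playwright/test";

// Illustrative only: the URL and selectors here are hypothetical.
test("Insert Dynamic Data opens the dynamic data picker", async ({ page }) => {
  await page.goto("https://example.com/editor");

  // Handler-level "click": calls the JS click handler directly, even if the
  // button is invisible, detached, or covered by another element:
  //   await page.$eval("#insert-dynamic-data", el => (el as HTMLElement).click());

  // Browser-level click: checks visibility and position and dispatches real
  // pointer events, so it catches a button that is visually or structurally broken.
  await page.click("#insert-dynamic-data");

  await expect(page.locator("#dynamic-data-picker")).toBeVisible();
});
```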
We became aware of the bug via user-submitted bug reports, identified the issue, and attempted to revert the code that introduced the problem.
However, due to an Amazon Web Services incident occurring during the same timeframe, our ability to deploy new servers to production was broken. Based on the initial AWS ETA for a fix, we decided to wait out the outage rather than attempt a riskier operation to route around it and deploy servers anyway. The outage ended up taking longer than anticipated, and we deployed the fix when a Europe-based team member came online early the next morning.
Based on our team’s severity rubric, because this was an editor issue that broke some but not all editor functionality, it was classified as a severity 2 issue, not severity 1. As a result, communication about the issue was limited to the team working on it rather than the broader engineering organization, and no one flagged the need to communicate about it on our status page or forum. The lower classification also contributed to the decision to let the issue persist overnight instead of calling for an all-hands-on-deck effort to fix it as soon as possible.
Timeline
(All times are in EDT, although we had team members responding across multiple timezones)
- Sept 18 3:27 pm – We deploy the code that caused the issue
- 4:21 pm – First user bug report comes in
- 4:51 pm – Based on further bug reports, we escalate this to a high-priority issue
- 6:01 pm – We identify the cause of the issue
- 6:26 pm – We attempt to deploy a fix
- 7:58 pm – We realize that the deployment failed because of the ongoing AWS issue
- Sept 19 2:08 am – AWS reports their issue is resolved
- 3:00 am – A Europe-based team member becomes aware of the problem, but doesn’t have the full context on the situation
- 5:00 am – Team members in Europe reconstruct the situation and identify that we still need to deploy the fix
- 5:17 am – Fix goes live
Action Items
- Better alerting around issues with our automated deployments. Deployment failures should be treated as severity 1 issues since they put us at risk of incidents like this one
- Broaden our definition of severity 1 issues to make sure regressions with broad-based impact on our user base are always included
- Define a simple process for everyone in engineering and customer success to escalate severity 1 issues
- Ensure severity 1 issues always involve updates to our status page
- Update our automated tests to use browser-level clicks instead of just simulating a click via JavaScript. This is a large project to do across the board, but we will prioritize tests for critical features, including “Insert Dynamic Data”.
- Increase QA scrutiny for non-feature-flagged fixes that touch code used across multiple parts of the editor