Incident Postmortem and Updates on Reliability

Hi all,

As I promised earlier, here’s the postmortem on the “Insert Dynamic Data” bug from yesterday, September 19.

Before I go there, some broader thoughts.

A lot of the frustration we’re hearing is due not just to the incident itself, but also to the pattern of reliability problems with our platform this year. This incident touches on two patterns in particular that we are frustrated with as a team and are working to change:

  • Bugs and glitches in the editor, making the experience of editing Bubble applications painful and unreliable
  • Sustained bugs and outages where you only hear from the Bubble team after the fact, and are left wondering for hours what’s going on

We’ve talked about both patterns in the past and some of what we’re working on to solve them, but we know that hearing us say “hey, we’re doing all this work behind the scenes” doesn’t help much when you can’t see tangible progress and still get burned.

We are fed up with this too. We get up in the morning trying to make Bubble better for our users – yeah, I know, some of you probably disagree with that sentiment right now! – but that’s what we’re working towards, and when the result is outages, bugs, and frustration, it’s undercutting everything we put our hearts into doing.

I want to let you know that we are going to step up our game on both fronts – editor bugs, and incident management – in a big way.

On the editor side of things, as I mentioned on the forum a couple weeks ago, we are pivoting our roadmap for the rest of the year towards tooling to support editor reliability: better tests, better monitoring, and fixing some of the underlying technical issues that make bugs common by moving us towards a componentized design system. Our VP of Product, Allen, is going to post a follow-up reply sharing more about what we are doing here.

On the incident management side of things, what I’m realizing from yesterday’s postmortem and the last couple of issues where we were too slow to respond is that we have been defining severity 1, “drop everything and fix it, wake people up, sound the fire alarms” incidents too narrowly.

Some backstory on that – not all that long ago, I was the only person in the company on call to be woken up in the middle of the night in an emergency, supported by a couple of very senior engineers who worked basically around the clock to keep our systems up. That came out of our bootstrapped roots as a company, our relatively small team size, and the complexity of our systems, which makes it challenging to get new engineers to the point where they can support our infrastructure.

That mode of operating leads to an extreme triage mentality: if we only have so many people, working so many hours, we have to carefully pick our battles, and if that means it takes us an extra couple hours to respond to a slightly-less-urgent issue, so be it.

Over the last year, we’ve changed that: we now have an on-call rotation across a big chunk of the engineering team, we’ve hired a lot more people including some senior leadership, and we’re investing in building a team that collectively owns and supports our systems.

But we still have some cultural lag in the way we architected our incident management processes. For instance, we have not been considering an editor bug a severity 1 issue because it doesn’t break running user apps, which, as you will see below, is related to how things went down yesterday. And for issues that do break running apps, we don’t always consider them severity 1 if the number of incoming bug reports isn’t that high, which contributed to how slowly we responded to the recent real-time updates outage.

This lag is not aligned with the reality of what Bubble is: a global platform, used across all timezones, for mission-critical applications where both running apps and editing apps are very important to people’s economic success.

So, we’re going to change it. I am discussing the implementation details with the team now, but over the next 1-2 weeks, I want us to be in a place where:

  • We default to escalating issues: if an issue falls in a gray area where we have enough bug reports that it is pretty clear we broke something or something changed for the worse, the whole Bubble team is trained to push the big red “Houston, we have a problem” button
  • We define severity 1 problems broadly, not narrowly
  • Any issue that gets escalated and confirmed as a severity 1 problem makes it to our status page

We have the team for this now and I think it is very realistic for us to get to that world fast. The incidents we’ve mishandled over the last year have almost universally been gray-area issues where the team knew something was wrong but wasn’t sure how loudly to bang the “hey, we need to fix this and get a status update out to users now!!” drum. So we are going to make it very clear exactly what everyone should do in a situation like that going forward, so that we err on the side of over-response rather than under-response.

Okay. That’s the big picture update. There’s also a bunch of tactical things we want to fix based on what happened yesterday, so for transparency, here’s a postmortem on what happened:

———

User Impact

For approximately 14 hours, the “Insert Dynamic Data” button in the editor was broken. This is a major feature, and the breakage significantly impaired our users’ ability to edit their apps during this time period. We did not communicate about the issue publicly until it was resolved.

This issue did not impact users on the Scheduled tier or on Dedicated clusters.

Root Cause

We deployed a bug fix to address some glitches in editor interactions. This fix changed behavior that the “Insert Dynamic Data” button relied on, causing the button to become unusable. Because this was a bug fix, it was not deployed behind a feature flag, and instead immediately impacted production behavior.
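For context on what a flag would have bought us, here is a minimal sketch of how a change like this could be gated. The flag client, flag name, and function names are hypothetical placeholders, not our actual tooling:

```typescript
// Hypothetical feature-flag gate; the flag client, flag name, and editor code
// paths below are illustrative placeholders only.
type FlagClient = { isEnabled: (flag: string, userId: string) => boolean };

function handleEditorFieldFocus(flags: FlagClient, userId: string) {
  if (flags.isEnabled("editor-input-focus-fix-2023-09", userId)) {
    return applyNewFocusBehavior(); // the bug-fix code path, rolled out gradually
  }
  return applyLegacyFocusBehavior(); // known-good path we can fall back to instantly
}

// Placeholders standing in for the real editor code paths.
function applyNewFocusBehavior() { /* ... */ }
function applyLegacyFocusBehavior() { /* ... */ }
```

With a gate like this, the flag could have been turned off the moment reports came in, without needing a deploy at all, which matters when deploys themselves are unavailable (as they were during the AWS incident described below).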

We have automated tests that exercise the “Insert Dynamic Data” button, but those tests work by activating the JavaScript click handler for the button rather than actually simulating a click, so they did not catch the problem.
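To make that gap concrete, here is a rough sketch of what a browser-level test for this button might look like, using Playwright as an example framework; the URL and selectors are placeholders, not Bubble’s real ones:

```typescript
// Hypothetical Playwright spec; the editor URL and data-test selectors are illustrative.
import { test, expect } from "@playwright/test";

test("Insert Dynamic Data opens the expression picker", async ({ page }) => {
  await page.goto("https://example.test/editor"); // placeholder editor URL

  // A real, hit-tested click: this fails if the button is hidden, covered,
  // disabled, or its event wiring is broken, whereas invoking the JS click
  // handler directly bypasses all of that.
  await page.locator('[data-test="insert-dynamic-data"]').click();

  await expect(page.locator('[data-test="expression-picker"]')).toBeVisible();
});
```

A test along these lines would have failed as soon as the click stopped working in a real browser, even though the handler itself was still callable.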

We became aware of the bug via user-submitted bug reports, identified the issue, and attempted to revert the code that introduced the problem.

However, due to an Amazon Web Services incident occurring during the same timeframe, our ability to deploy new servers to production was broken. Based on the initial AWS ETA for a fix, we decided to wait out the outage rather than doing a risky operation to route around the issue and deploy servers anyway. The outage ended up taking longer than anticipated, and we deployed the fix when a Europe-based team member came online early the next morning.

Based on our team’s severity rubric, because this was an editor issue that broke some but not all editor functionality, it got classified as a severity 2 issue, not severity 1. As a result, communication about the issue was limited to the team working on it rather than the broader engineering organization, and no one flagged the need to communicate about it on our status page or forum. The lower classification also contributed to the decision to let the issue persist overnight instead of calling for an all-hands-on-deck effort to get it fixed as soon as possible.
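To illustrate the kind of change we’re making to the rubric, here is a simplified sketch; the categories and thresholds are hypothetical and only meant to show the “default to severity 1” direction, not our actual internal tooling:

```typescript
// Hypothetical severity classifier; categories and thresholds are illustrative only.
type Incident = {
  breaksRunningApps: boolean;
  breaksCoreEditorWorkflow: boolean; // e.g. "Insert Dynamic Data" being unusable
  confirmedUserReports: number;
};

function classify(incident: Incident): "sev1" | "sev2" {
  // Broad definition: anything that broadly blocks building OR running apps is sev1,
  // and gray-area cases with clear user impact default to sev1 rather than sev2.
  if (incident.breaksRunningApps || incident.breaksCoreEditorWorkflow) return "sev1";
  if (incident.confirmedUserReports >= 5) return "sev1"; // err on over-response
  return "sev2";
}
```

The exact rules will differ, but the key shift is that editor-breaking issues and gray-area cases land on the severity 1 side by default.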

Timeline

(All times are in EDT, although we had team members responding across multiple timezones)

  • Sept 18 3:27 pm – We deploy the code that caused the issue
  • 4:21 pm – First user bug report comes in
  • 4:51 pm – Based on further bug reports, we escalate this to a high-priority issue
  • 6:01 pm – We identify the cause of the issue
  • 6:26 pm – We attempt to deploy a fix
  • 7:58 pm – We realize that the deployment failed because of the ongoing AWS issue
  • Sept 19 2:08 am – AWS reports their issue is resolved
  • 3:00 am – A Europe-based team member becomes aware of the problem, but doesn’t have the full context on the situation
  • 5:00 am – Team members in Europe reconstruct the situation and identify that we still need to deploy the fix
  • 5:17 am – Fix goes live

Action Items

  • Better alerting around issues with our automated deployments. Deployment failures should be treated as severity 1 issues, since they put us at risk of incidents like this one (see the rough sketch after this list)
  • Broaden our definition of severity 1 issues to make sure broad-based impact on our user base from regressions is always included
  • Define a simple process for everyone in engineering and customer success to escalate severity 1 issues
  • Ensure severity 1 issues always involve updates to our status page
  • Update our automated tests to test browser-level clicks instead of just simulating clicks via JavaScript. This is a large project to do across the board, but we will prioritize tests for critical features, including “Insert Dynamic Data”.
  • Increase QA scrutiny for non-feature-flagged fixes that touch code used across multiple parts of the editor
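On the first item, as a rough illustration only (the deploy API and paging hook below are hypothetical stand-ins, not our actual pipeline), the idea is simply that a failed or stuck production deploy should page someone instead of sitting silent:

```typescript
// Hypothetical deploy-health check; `getLatestDeploy` and `pageOnCall` stand in
// for whatever deployment API and paging integration are actually in use.
type Deploy = { id: string; status: "succeeded" | "failed" | "in_progress"; startedAt: Date };

async function checkDeployHealth(
  getLatestDeploy: () => Promise<Deploy>,
  pageOnCall: (msg: string) => Promise<void>
) {
  const deploy = await getLatestDeploy();
  const ageMinutes = (Date.now() - deploy.startedAt.getTime()) / 60_000;

  // Treat a failed deploy, or one stuck in progress, as a severity 1 signal:
  // it means we cannot ship fixes, which compounds any other ongoing incident.
  if (deploy.status === "failed" || (deploy.status === "in_progress" && ageMinutes > 30)) {
    await pageOnCall(`Deploy ${deploy.id} is ${deploy.status} after ${Math.round(ageMinutes)} min`);
  }
}
```

With something like this in place, the gap between attempting the fix at 6:26 pm and realizing at 7:58 pm that it hadn’t gone out would have been much shorter.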

Hi all,

Allen here from the Bubble Product team. I wanted to chime in on this conversation in light of the dynamic expression bug last night, as well as editor reliability issues in general.

Summary: We don’t view the current state of product reliability as satisfactory. Starting in Q4, we are prioritizing investments to tackle tech debt in areas that will improve product reliability going forward (along with a couple other benefits), and in doing so, will slow down the roadmap of user-facing features.

It is very clear - to you all and also to the Bubble team - that product reliability, especially with the editor, has been sub-par. Here, I’m using the term “reliability” to encompass performance issues, bugginess, and anything that makes it harder for users to do the work they want to do. We know that this disrupts your ability to build on Bubble. And, we are painfully aware that even as we push to improve the overall product, these kinds of reliability issues erode trust in Bubble as a production-grade solution. I’m writing today to share a bit about how we’re thinking about the situation, from the lens of the head of Product.

The Product team is constantly thinking about what our priorities should be and how we can make the biggest impact on the product toward our goals for users and the company. We are not really a “move fast and break things” kind of company - I can firmly say that is not the mindset of our Product + Engineering org. But in spite of the precautions the team is currently taking as we do our work, they are clearly not enough, and there are still reliability issues.

Folks on the Product + Engineering teams here are grappling with this tension: on the one hand, we have big, exciting dreams of how we want to improve the product, and on the other, we are working with a product that’s been around for 10+ years and has an enormous surface area. There are individual features of Bubble that are by themselves incredibly complex, and they interact with other highly complex features within the tightly integrated bundle that is Bubble. And on top of that, a lot of the code is, at this point, old - in other words, we have a lot of “tech debt”.

So, we have to continuously try to find the right balance in this tension, and often the exciting things we want to do force us to wade through very tricky patches of tech debt. In fact, the bigger spikes in reliability issues that have happened in the recent past have related to fundamental, deeper improvements we’ve made to the core product just to lay the foundation for more exciting changes we want to make. For example…

  • Several months ago, we gave the design canvas (what you interact with on the Design tab) a major upgrade behind the scenes. This helped improve performance of the editor and laid a much more modern foundation we can build on top of, but at the cost of a decent number of bugs that harmed reliability.
  • A similar story is playing out today with the expression composer. We want to make some improvements to this very core piece of editor functionality - new features like letting users insert segments in the middle of an expression, showing “parentheses” to communicate order of operations, etc. - but these require essentially rebuilding the entire expression composer from scratch behind the scenes. As we’ve gone through the pre-launch stages (e.g. beta) of the new expression composer, we’ve been addressing a significant list of bugs.
  • However, last night’s bug with dynamic expressions was actually not related to the expression composer overhaul, but to fairly deep-down code around how input fields in the editor work. We happened upon an issue there in the course of the project relating to breakpoints, and when we tried to fix that underlying bug, the fix caused the bug you saw with inserting dynamic expressions.

Last night’s bug was a particularly unfortunate one, compounded by an AWS outage at the time that prevented us from deploying new code yesterday evening. Though that AWS outage cleared up by mid-evening, it was still a few hours before somebody on the team could deploy the fix we had been trying to deploy, which meant that the bug continued to basically block work on editing apps and cause a huge amount of frustration for users. We’re doing a retro on how we can improve going forward, touching not just on reliability and bugginess but also on how to better communicate with the user base in situations like these.

I also wanted to zoom out a bit and share how we’re planning to tackle the broader issue of editor reliability. In short, we’re ramping up our investment in paying down tech debt starting in Q4 (i.e. starting in October), and our teams are currently planning their roadmaps around this push. There are many flavors of tech debt projects, but these projects share a certain set of goals: reduce the likelihood of deploying bugs, help engineers fix bugs faster, allow engineers to move both faster and more safely with their code changes, improve performance of different parts of Bubble (e.g. page load performance, data operation performance), and modernize key parts of our technology that are overdue for some attention. The planning for this Q4 shift started several weeks ago, but as the incident last night shows, it is exactly the direction we need to move in.

How does this all help with a situation like last night’s? For example, we’re planning to do projects like:

  • Implementing a new automated test framework for editor code (our current one is old and clunky)
  • Writing a lot more automated tests using that new framework (we don’t have great automated test coverage today)
  • Implementing a tool to help us monitor editor performance and spot worsening performance (otherwise known as “observability”; we don’t easily have this kind of visibility today - see the rough sketch after this list), etc.
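As a rough sketch of the observability idea (the metric endpoint and names below are placeholders, not a description of what we will actually build), the goal is to time the editor interactions users feel and report them somewhere we can alert on regressions:

```typescript
// Hypothetical editor performance instrumentation; the endpoint and metric
// names are placeholders, not real Bubble infrastructure.
function timeEditorAction<T>(name: string, action: () => T): T {
  const start = performance.now();
  try {
    return action();
  } finally {
    const durationMs = performance.now() - start;
    // Ship the measurement to a metrics backend so dashboards and alerts can
    // catch regressions (e.g. an editor action suddenly taking 3x longer).
    navigator.sendBeacon(
      "https://metrics.example.test/editor-timings",
      JSON.stringify({ name, durationMs })
    );
  }
}
```

Wrapping an action like opening the dynamic data picker in a helper like this would let us spot a regression from a dashboard, rather than waiting for bug reports.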

Short of completely stopping all code development on the product, these are the kinds of projects that improve product reliability at a more foundational level.

The tradeoff is that these projects mean we will have less bandwidth for other work, such as user-facing features. At this point, that is a tradeoff we’re willing to make. In aggregate, we’re treating Q4 as a quarter of shoring up the foundations, so the majority of our Product + Engineering resources will be on tech debt reduction projects like these. Going into 2024, we’ll assess how much progress we’ve made and plan accordingly, but we’ll most likely continue investing significantly in reliability-improving projects.

The whole team - including Engineers, Product Managers, and Designers - is aware of and in favor of this tech debt push. We’re excited about it because we all recognize that reliability is a priority, and also because, frankly, as we’ve tried to push for bigger product improvements, we’ve been stumbling over ourselves quite a bit by creating these reliability issues - so we see the reliability improvements as an important foundation for bigger and better user-facing pushes in the future.

That’s an introduction to how we’re thinking about and addressing the recent reliability issues. We will be doing a lot of work to improve here, and we’re aiming for the same thing the user base wants: a dependable product that you can use to build your businesses. Thanks for your attention and your continued support as part of the Bubble community - we’re working hard to do better by you all.

Thanks,
Allen


Good to see that it looks like a change of process in handling the severity of these things will happen. I cannot fathom how someone knew that this dynamic expression thing was broken and thought it was totally OK to clock off for the evening and fix it in the morning. That is astoundingly ignorant of what Bubble is and who the users are.

Appreciate the detailed response guys.

I think a lot of us here suspected there was a tech debt problem, and I at least am very much in favour of fewer new features in exchange for reliability.

This is a challenge that many bootstrapped companies will eventually face if they get big enough, so on one level it’s a great testament to your success. But you definitely can’t leave this any longer!


Increasing observability, I think, will improve the boldness of your actions, no? That would enable aggressive positive improvement?

As is so often the case, it all sounds great. The Bubble Tech Debt Battle of Q4 2023 will be interesting to see unfold.

Proof is in the pudding, fellas…


Thanks guys. I just don’t understand. Why not create an alpha environment with users who opt in to test and work with alpha software, where you can test all of this?


Yes! I would much rather have a stable platform of what Bubble is today (the user-facing features are pretty much enough, although there are loads of small quality-of-life improvements that could be made) than an unstable platform that ships new user-facing features that fall flat on deployment because of bugginess or a lack of important functions (thinking of the table element?).


1st. Thank you.
2nd. Bubble has enabled a lot of us (no-coders) to step into the software dev industry, or even into startups.
3rd. Our careers are FUNDAMENTALLY built on Bubble.io.

Stability of Bubble is just about the TOP priority among most of us.
Performance comes 2nd.
Price would be 3rd.


It’s great that Bubble is finally taking responsibility for it.

That’s a great move.
As many of us have been posting on the forum and on Twitter throughout this year - stability of existing features is much more critical than new features.

Just want to clarify the “will slow down the roadmap of user-facing features” point. By user-facing features, should we understand the editor only, so that other features like improving bulk operations will still be developed according to their roadmap?

Second this. No need to roll out new features and non-critical bug fixes to all apps at once.


In terms of what slows down to focus on tech debt + reliability - editor features will definitely slow down, and bulk operations / data performance will likely slow down a bit as well.

Over time, we’ve been trying out different ways of getting features tested. We have tried flavors of running alphas, closed betas, and experimental features. A challenge we’ve had with these approaches is that either not enough users use the alternate environments seriously enough, and/or it’s tough for them to work on their production apps there. That means the testing is much more limited. Often the majority of bugs involve more power-user use cases. We’re continuing to try different ideas here, though!


Got it!

Good luck with a tough challenge 🙂

This is great news!

Thank you @josh for the detailed, honest, and well-written summary. Onwards!

Hey Josh,

thank you for your detailed response, but I suppose it comes back to one super simple concept, and that is core values. I looked at yours here: https://bubble.io/values and nothing really jumps off the page in terms of how you treat your customers.

In situations like the one that occurred the other day, one needs to sit back, reflect on your culture and values, and ask the question, “are we living and breathing our values?” With nothing there that actually speaks to the role your customer plays in your organisation, I would say that you don’t actually have anything to reflect on. Fixing your values would be my first step in fixing the problem, and the next step would be living them. Take a look at Atlassian’s core values, in particular “Don’t #@!% the customer”: Customers are our lifeblood. Without happy customers, we’re doomed. So considering the customer perspective - collectively, not just a handful - comes first. As a longtime customer of Atlassian, as Bubble is too (https://status.bubble.io/ is powered by Atlassian’s Statuspage product), I can honestly say that I do not get #@!%ed by Atlassian, and I am sure you could probably say the same. On the other hand, I feel that I am getting repeatedly #@!%ed by Bubble.

Fix your values and everything else will take care of itself.


@josh @allenyang Would it be possible to bundle any updates to the editor with regular engine version updates that are opt-in upgrades through Settings? I feel like that should make most of the problems we’ve been seeing lately more manageable on our side as Bubble users.
