Incident Postmortem and Updates on Reliability

allenyang · September 20, 2023, 12:18am

Hi all,

Allen here from the Bubble Product team. I wanted to chime in on this conversation in light of the dynamic expression bug last night as well as editor reliability issues in general.

Summary: We don’t view the current state of product reliability as satisfactory. Starting in Q4, we are prioritizing investments to tackle tech debt in areas that will improve product reliability going forward (along with a couple of other benefits), and in doing so, we will slow down the roadmap of user-facing features.

It is very clear - to you all and also to the Bubble team - that product reliability, especially with the editor, has been sub-par. Here, I’m using the term “reliability” to encompass performance issues, bugginess, and anything that makes it harder for users to do the work they want to do. We know that this disrupts your ability to build on Bubble. And we are painfully aware that even as we push to improve the overall product, these kinds of reliability issues erode trust in Bubble as a production-grade solution. I’m writing today to share a bit about how we’re thinking about the situation, through the lens of the head of Product.

The Product team is constantly thinking about what our priorities should be and how we can make the most impact on the product in service of our goals for users and the company. We are not really a “move fast and break things” kind of company - I can firmly say that is not the mindset our Product + Engineering org has. But the precautions the team currently takes as we do our work are clearly not enough, and there are still reliability issues.

Folks on the Product + Engineering teams here are grappling with this tension: on the one hand, we have big, exciting dreams of how we want to improve the product, and on the other, we are working with a product that’s been around for 10+ years and has an enormous surface area. There are individual features of Bubble that are by themselves incredibly complex, and they interact with other highly complex features within the tightly-integrated bundle that is Bubble. And on top of that, a lot of the code is, at this point, old - in other words, we have a lot of “tech debt”.

So, we have to continuously try to find the right balance in this tension, and often the exciting things we want to do force us to wade through very tricky patches of tech debt. In fact, the bigger spikes in reliability issues that have happened in the recent past have related to fundamental, deeper improvements we’ve made to the core product just to lay the foundation for more exciting changes we want to make. For example…

Several months ago, we gave the design canvas (what you interact with on the Design tab) a major upgrade behind the scenes. This helped improve performance of the editor and laid a much more modern foundation which we can build on top of, but at the cost of a decent number of bugs that harmed reliability.
A similar story is playing out today with the expression composer. We want to make some improvements to this very core piece of editor functionality - new features like letting users insert segments in the middle of an expression, showing “parentheses” to communicate order of operations, etc. - but these require essentially rebuilding the entire expression composer from scratch behind the scenes. As we’ve gone through pre-launch stages (e.g. beta) of the new expression composer, we’ve been addressing a significant list of bugs.
However, last night’s bug with dynamic expressions was actually not related to the expression composer overhaul, but to pretty deep-down code around how input fields in the editor work. We came across a bug in that code in the course of the project relating to breakpoints, and our attempt to fix it caused the bug you saw with inserting dynamic expressions.

Last night’s bug was a particularly unfortunate one, and it was compounded by an AWS outage that prevented us from deploying new code yesterday evening. Though the AWS outage cleared up by mid-evening, it was still a few hours before somebody on the team could deploy the fix, which meant that the bug continued to basically block work on editing apps and cause a huge amount of frustration for users. We’re doing a retro on how we can improve going forward, touching on not just reliability and bugginess but also on how to better communicate with the user base around situations like these.

I also wanted to zoom out a bit and share how we’re planning to tackle the broader issue of editor reliability. In short, we’re ramping up our investment in paying down tech debt starting in Q4 (i.e. starting October), and our teams are currently planning their roadmaps around this push. There are many flavors of tech debt projects, but these projects share a certain set of goals: reduce the likelihood of deploying bugs, help engineers fix bugs faster, allow engineers to move faster AND more safely with their code changes, improve performance of different parts of Bubble (e.g. page load performance, data operation performance), and modernize key parts of our technology that are overdue for some attention. Planning for this Q4 shift began several weeks ago, and the incident last night only underscores that this is the direction we need to move in.

How does this all help with a situation like last night’s? For example, we’re planning to do projects like:

Implementing a new automated test framework for editor code (our current one is old and clunky)
Writing a lot more automated tests using that new framework (we don’t have great automated test coverage today)
Implementing a tool to help us monitor editor performance and spot worsening performance (otherwise known as “observability”; we don’t easily have this kind of visibility today), etc.

Short of completely stopping all code development on the product, these are the kinds of projects that improve product reliability at a more foundational level.

The tradeoff is that these projects mean we will have less bandwidth for other work, such as user-facing features. At this point, it is a tradeoff we’re willing to make. In aggregate, we’re treating Q4 as a quarter of shoring up the foundations, so the majority of our Product + Engineering resources will be on tech debt reduction projects like these. Going into 2024, we’ll assess how much progress we’ve made and plan accordingly, but we’ll most likely continue investing significantly in reliability-improving projects.

The whole team - including Engineers, Product Managers and Designers - is aware of and in favor of this tech debt push. We’re excited about it because we all recognize that reliability is a priority, and also because, frankly, as we’ve tried to push for bigger product improvements, we’ve been stumbling over ourselves quite a bit by creating these reliability issues - so we see the reliability improvements as an important foundation for bigger and better user-facing pushes in the future.

That’s an introduction to how we’re thinking about and addressing the recent reliability issues. We will be doing a lot of work to improve here, and we’re aiming for the same thing the user base wants: a dependable product that you can use to build your businesses. Thanks for your attention and your continued support as part of the Bubble community - we’re working hard to do better by you all.

Thanks,
Allen
