Updates on Bubble Quality and Reliability

Dear Bubble community,

If you haven’t met me yet, my name is Payam (pie-um, he/him), Platform Engineering Director for Bubble. I have the privilege of leading an amazingly diverse, talented, and mission-driven team of engineers responsible for all of Bubble’s engineering outside of the editor and ecosystem. This includes the entire technical “platform” that Bubble is built on top of, including infrastructure, cloud engineering, databases, internal systems, and APIs. I also oversee system performance and reliability across all Bubble engineering teams.

I was hired six months ago to help transform our reliability and scalability, and to shorten the time it takes any app to grow from MVP to IPO. This half-year has been an incredible experience, and I’m planning to share more about it with you soon.

For now, I want to highlight some of the great work the Platform team and the broader Bubble team have done over the last few months, and how it has improved our reliability and scalability.

First, I want to thank you for continuing to place your trust in Bubble and giving us grace as we transform our operations. I’ve been keeping a close eye on the forum, connecting with our extraordinarily user-centric Product, Success, Engineering, and Product Marketing teams, meeting you at BubbleCon, and trying to learn everything I can about where we’re succeeding and where we have work to do.

Based on what I’ve learned, it’s evident to me that what you, as builders and entrepreneurs, want most is consistency. Making the experience on Bubble more consistent and predictable is exactly what I hope to deliver, and I’m excited to share some serious leaps we’ve already made toward that goal.

  • As of three weeks ago, 100% of our engineering teams have people on-call 24/7/365 for the product surface areas that they own. Before, there were only five engineers on-call from the Platform team for all Bubble systems.
  • In October, prompted by the long Notifier outage, we expanded our definition of “emergency” to include degradations in Bubble’s performance.
  • 100% of our emergency incidents are now tracked in our alerting tool, PagerDuty. By my estimate, fewer than half were tracked there in October. Plus, 100% of these incidents now sync automatically to our StatusPage.
  • We redefined our triage definitions, which govern the response-time and resolution objectives for all user issues that reach Engineering from the Success team. In most cases, this means significantly faster response times for each class of issue, urgent support for a wider range of users, better prioritization of the engineering team’s resources against the product roadmap, and streamlined internal handoffs and automations.
  • We have fully onboarded a brand-new test automation tool to replace our existing tooling. The old tooling was difficult to learn and use, proved brittle and unreliable in practice, and lacked transparency. Our new tooling is reliable, easy to use for all teams, and transparent. We’re in the process of migrating tests, and all new tests will be written in the new framework.
  • As a rule, most new product development will ship with automated test coverage. This will greatly reduce the number of bugs introduced by new development.
  • There are also numerous performance, reliability and scalability improvements, with more to come!

Here are some of the things we’re working on right now, with updates to come within the next few months:

  • A huge investment in the reliability of our cloud and database (more to come on this soon).
  • Empowering all Bubble teams with better observability tools they can use to build more SLOs (Service Level Objectives) for their product areas, which will expand the range of areas where we are alerted whenever a degraded user experience is detected.
  • Additional StatusPage components that show in more detail exactly which parts of Bubble are having trouble (e.g., workflows, editor, real-time notifications, and more).
  • New technology to make local development match production more closely, which will prevent classes of errors arising out of differences between how code runs on engineers’ machines and how it runs in production.
  • New manual and automated test environments, which together will catch more classes of bugs before they reach production.
  • A big investment in our internal and external observability and monitoring to help us prioritize our engineering roadmap for maximum value, and publishing thresholds where we expect (and can prove) Bubble to be reliable for 90% of use cases or more.

This is all the work of a village.

I did not have to sell and poke and prod and persuade anybody at any level from any team into coming with me on these projects. Everybody wanted to do it, and stepped up in a big way, and that’s how we were able to get so much done on a compressed timeline. We all know these are the things we need to do in order to build on your trust and demonstrate we’re ready for primetime.

I’ve led reliability efforts at an array of companies in the past, and I’ve never seen projects like this materialize this fast before. This experience has given me even more confidence that I made the right choice in coming to Bubble. It’s an amazing product built by an incredible team, and I can’t wait to help make it more consistent and put it in the hands of more builders.

What I’ve described here is just a slice of the work that’s been going on. There are many more exciting things the Platform team can’t wait to tell you about that we think you’re going to love. Platform has also been partnering very closely with our Mobile and AI teams to ensure the success of the new product releases we promised at the last BubbleCon.

Thank you for reading, for your continued support, and for your continued passion and engagement on all things Bubble. You’ll be hearing from me again very soon!

Cheers,

Payam

We appreciate that Bubble recognises that platform stability, reliability, and performance are undeniably the most important things, and that at this stage everything else comes second. Innovation is great, but if our apps go down it doesn’t matter.
I imagine a lot of people will be relieved by this post. Let’s see if reliability improves this year; we all hope it does.

We like transparency; however, please recognise that the status page has multiple functions. Yes, it lets us developers know whether the Bubble team is aware of an issue we are experiencing, but it is also often sent to prospective clients who are investigating moving their business to Bubble. If they see the status page lit up like a Christmas tree, it will make closing deals very hard and decrease Bubble’s adoption rate.
Shortly after the status page was changed to show all issues, it was tweaked to stop showing every one of them, because it was making outsiders view Bubble as an unreliable platform (see this post by @adamhholmes and @josh’s response). It would be amazing if your reliability efforts are already so effective that we won’t run into the same problem again, but I would err on the side of caution.

Looking forward to this! Great work

It’s good to see that something has changed since the few big issues (downtime) at the end of last year.

:+1:

Thanks for sharing, and for the transparency. Looking forward to all the improvements!

Thank you for your post and all the hard work of the team (village!). Posts like these really do help reassure.

Nice sentiments, but probably best not to use too much AI to write this stuff. I wish folk would use their own language; it’s always less fluffy.

Not everybody has the literary chops of @sudsy

:clap:

Thanks @payam.azadi for the work that you and the team are doing

@payam.azadi any updates on the recent frequency of downtime and issues? I see so many people asking and complaining on the forums, yet no reply from the Bubble team, which is really concerning for those planning to launch soon, and even more so for those who are already live with clients on demand.

Thanks

+1 here.

We had a stressful morning wondering whether our sales presentation could go ahead.

We heard above how reliability was at the forefront of your mind, and then we had 2 major outages within a week, without any response.

@payam.azadi

Just wanted to reach out and see if Bubble was ready to share anything on the recent outages and reliability issues over the past couple of weeks?

Here we are, down again!

We could die waiting here… Of the hundreds of outages we’ve had over the years (we’ve been Bubbling since 2019), I could point to 1 or 2 post-mortems… for the rest, we are left in the dark.

A commitment to reliability (99.5%+) would be a huge relief for a lot of us.

A financial commitment, not a verbal one.

Frankly, even 99.5 isn’t good enough. That’s where we are now.

I meant 99.95% :slight_smile:

This is a good model from AWS: AWS Service Level Agreements

Bubble Uptime
98.10% Nov
98.63% Dec
99.80% Jan
99.64% Feb
99.31% Mar
99.74% Apr
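To make the 99.5% vs 99.95% discussion above more concrete, the monthly figures can be turned into rough minutes of downtime. This is a minimal Python sketch that takes the posted percentages at face value and assumes every month is exactly 30 days (43,200 minutes), so it ignores 31-day months and leap years; `downtime_minutes` and `months_meeting` are hypothetical helper names for illustration only.

```python
# Rough conversion of the posted monthly uptime figures into downtime minutes.
# ASSUMPTION: every month is treated as exactly 30 days; real months vary.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

# Uptime percentages exactly as posted in the thread.
UPTIME_PCT = {
    "Nov": 98.10,
    "Dec": 98.63,
    "Jan": 99.80,
    "Feb": 99.64,
    "Mar": 99.31,
    "Apr": 99.74,
}


def downtime_minutes(uptime_pct: float, minutes: int = MINUTES_PER_MONTH) -> float:
    """Minutes of downtime implied by an uptime percentage over `minutes`."""
    return (100.0 - uptime_pct) / 100.0 * minutes


def months_meeting(target_pct: float) -> list[str]:
    """Months whose posted uptime is at or above `target_pct`."""
    return [month for month, pct in UPTIME_PCT.items() if pct >= target_pct]


if __name__ == "__main__":
    for month, pct in UPTIME_PCT.items():
        print(f"{month}: {pct:.2f}% uptime ~ {downtime_minutes(pct):.0f} min down")
    for target in (99.5, 99.95):
        budget = downtime_minutes(target)
        print(f"{target}% target allows ~{budget:.1f} min/month; "
              f"met in: {months_meeting(target) or 'none'}")
```

Under the 30-day assumption, a 99.5% target allows about 216 minutes of downtime per month and 99.95% only about 21.6 minutes. By the posted numbers, Jan, Feb, and Apr clear 99.5%, and no month clears 99.95%.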

Strangely, Salesforce only guarantees 98% uptime.

Seems to be moving up?
