Stability Weekly Update May 31

Hi all,

For those of you who may have missed my prior stability update, I plan to report weekly on our progress toward improving Bubble’s uptime and reliability for the next few weeks, since it is our top priority as a company.

Currently, there’s an ongoing issue with our free SendGrid integration: see this post for a solution. This free integration is intended for testing purposes only, and we strongly recommend that anyone who relies on SendGrid in production use their own API key following the provided instructions. More details below.

While this week brought a few incidents with vendors we use, including the SendGrid issues, I’m happy to report that there have been no major Bubble system failures since my prior update. One week without a problem is a low bar to aim for, but it’s a good start, and we’ll celebrate the wins where we find them! Plus, this needed respite has allowed us to make fast progress on our ongoing stability initiatives.

Key improvements we’ve made since our last update

Last week, I reported on our work to close loopholes in the number of background tasks that can be run simultaneously. This week, our fix was put to the test: There was an instance where an app kicked off almost 70,000 data-trigger workflows simultaneously. In the past, this would have resulted in performance degradation or downtime for other applications. This time, our new code worked and the issue was fully contained, with zero impact on other apps.
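To make the idea of capping simultaneous background tasks concrete, here is a minimal sketch of a per-app concurrency limit. It is not Bubble's implementation: the class name `PerAppLimiter`, the helper `admit`, and the cap of 100 are all hypothetical, chosen only for illustration.

```python
import threading
from collections import defaultdict


class PerAppLimiter:
    """Caps how many background tasks each app may run at the same time.

    Hypothetical illustration only; not Bubble's actual code.
    """

    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self._running = defaultdict(int)  # app_id -> tasks currently running
        self._lock = threading.Lock()

    def try_start(self, app_id: str) -> bool:
        """Reserve a slot for app_id; return False if the app is at its cap."""
        with self._lock:
            if self._running[app_id] >= self.max_concurrent:
                return False
            self._running[app_id] += 1
            return True

    def finish(self, app_id: str) -> None:
        """Release a slot when one of the app's tasks completes."""
        with self._lock:
            if self._running[app_id] > 0:
                self._running[app_id] -= 1


def admit(limiter: PerAppLimiter, app_id: str, requested: int):
    """Return (started, deferred) counts for a burst of task requests."""
    started = sum(1 for _ in range(requested) if limiter.try_start(app_id))
    return started, requested - started
```

With a cap of 100, a burst of 70,000 requests from one app would start 100 tasks and defer the rest, while a second app asking for 5 tasks would still get all 5. Because the counter is kept per app, one app hitting its cap does not consume slots belonging to another.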

Another piece of progress to report: We’ve fully stabilized appserver, the database that stores applications. All system metrics are healthy, and we’re wrapping up a push to build early-warning observability so that if it starts getting unhealthy again in the future, we’ll see it ahead of time and take action before it results in user-facing downtime. We’ll also continue to assess our strategy for scaling appserver going forward.

As mentioned in a previous update, we’ve been working on the ability to instantly roll back to an earlier version of our code. This should speed up our time to recovery and help us avoid issues like the one we had a few weeks ago, when we were unable to revert because of problems with our caching layers. The code for this is now live! We plan to test it and train the team on using it over the next week.

Also as mentioned, we’re transitioning away from the outdated stored procedure technology that’s interfering with our ability to cancel long-running queries. We are rolling out the first major piece of this today, and we’ll be continuing to move more and more of our queries over to the new system over the next month.
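For readers unfamiliar with what cancelling a long-running query looks like in practice, here is a small sketch using SQLite's progress handler from the Python standard library. This is a stand-in technique, not the system described above: the function name `run_with_time_budget` and the time budget are assumptions made for the example.

```python
import sqlite3
import time


def run_with_time_budget(conn, sql, budget_seconds):
    """Run sql, cancelling it if it exceeds budget_seconds.

    Returns the rows, or None if the query was cancelled. Hypothetical
    helper for illustration; it treats any database error as a cancel.
    """
    deadline = time.monotonic() + budget_seconds

    def over_budget():
        # A non-zero return value tells SQLite to abort the running query.
        return 1 if time.monotonic() > deadline else 0

    # Call over_budget roughly every 1000 SQLite virtual machine steps.
    conn.set_progress_handler(over_budget, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError:
        return None
    finally:
        # Remove the handler so later queries are not affected.
        conn.set_progress_handler(None, 0)


# A deliberately slow query: counts through a very long recursive sequence.
SLOW_QUERY = """
WITH RECURSIVE c(n) AS (
    SELECT 1 UNION ALL SELECT n + 1 FROM c WHERE n < 1000000000
)
SELECT count(*) FROM c
"""
```

Calling `run_with_time_budget(sqlite3.connect(":memory:"), SLOW_QUERY, 0.05)` gives up on the slow query once the budget runs out instead of letting it keep consuming database resources, while a quick query such as `SELECT 1` completes normally.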

Finally, we’ve been doing a lot of work to make queries to our user databases more efficient, which should have the double benefit of speeding up individual app performance and giving us more scaling headroom across the entire system. Some things we’ve achieved on this front:

  • We now store extremely large scheduled tasks (large in terms of the amount of data passed to the scheduled workflow) in a separate system from the one used for most scheduled tasks. Having moved these outlier data payloads out of our main databases, we’re now using only half as much of our total database network bandwidth.

  • We also now store extremely large database fields in a separate system. These fields are usually caused by a user storing a file directly in the database as base64-encoded text instead of as a file in S3 (don’t do this; it hurts the performance of your app). The change has resulted in much faster queries to fetch those database items and a much smaller total load on our databases.

  • We’re almost done overhauling an internal system that optimizes storage for apps that haven’t been accessed recently. This will greatly reduce database load, which should lead to improved performance.
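To illustrate why base64-encoded files in database fields are costly, here is a short sketch comparing a record that embeds a file as base64 text with one that only stores a link to the file. The record layout, the field names `data` and `file_url`, and the example URL are hypothetical; the only fact relied on is that base64 encodes every 3 bytes as 4 characters.

```python
import base64
import json


def record_with_inline_file(raw: bytes) -> dict:
    """Hypothetical record that embeds the file as base64 text."""
    return {"name": "report.pdf", "data": base64.b64encode(raw).decode("ascii")}


def record_with_reference(url: str) -> dict:
    """Hypothetical record that stores only a link to the uploaded file."""
    return {"name": "report.pdf", "file_url": url}


def serialized_size(record: dict) -> int:
    """Bytes needed to send the record as JSON, e.g. in a query result."""
    return len(json.dumps(record).encode("utf-8"))


def base64_overhead(raw: bytes) -> float:
    """Ratio of base64 text length to the original file size."""
    return len(base64.b64encode(raw)) / len(raw)
```

For a 3,000,000-byte file, `base64_overhead` comes out to about 1.33, so the inline record carries roughly 4 MB of text every time the item is fetched, whereas the reference record (for example with a placeholder URL like `https://example.com/files/report.pdf`) stays well under a kilobyte regardless of file size.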

Incidents since our last update

As mentioned above, our systems have been stable since our last update, but there have been a handful of smaller incidents worth discussing.

UPDATE SINCE POSTING: the SendGrid outage has been resolved
SendGrid outage for users without an API key (ongoing). We encourage all apps that use our “Send Email,” “Reset Password,” and other email-sending actions to provide their own SendGrid API key to ensure email deliverability. That said, to help accelerate development and let users try out Bubble without creating a SendGrid account, we do allow you to send a limited number of emails using a shared key that we manage on behalf of our users. Because this key is often used for test emails, some of them sent to non-existent addresses, it occasionally gets flagged by SendGrid’s internal systems. This outage happened because SendGrid disabled our shared key, and we’re working with them to re-enable it. This isn’t the first time we’ve run into this issue, so we recommend that any serious production email sending be done via your own API key.

Sidenote: We posted a “We are investigating reports of issues with our systems” message to our status page Thursday, May 30 at 2:15 AM ET. It turned out to be additional bug reports related to the above SendGrid issue, so we deleted that message.

Support / ticket creation down. The vendor we use to manage support tickets had an outage, which resulted in users being unable to reach us via our normal support channels. This situation is now fully resolved. This is the first time we’ve experienced an issue with this vendor, and we are monitoring and following up with them about the incident to ensure this doesn’t become a common occurrence.

In addition to the above incidents, there were several emergency responses by our engineering team that did not have widespread user impact. We responded to two issues that only impacted one customer each, as well as an early warning indicator for one of our systems that we addressed before it resulted in user-visible problems.

Looking forward

We are already in a much better place in terms of stability than we were a couple of weeks ago. There are always unknown unknowns, and we still have a lot of medium-term and long-term work on our roadmap to make our infrastructure fully scalable, isolated, and blazing fast, so we plan to continue pushing hard on stability and reliability as a main business objective. We also want to continue driving down the number of emergency responses we make as a company, even ones that don’t have widespread user impact: the fewer fire drills we have, the more attention we can give to preventative maintenance and to avoiding problems in the first place.

Thanks for all the support, and we will continue to keep you posted.

— Josh and Emmanuel

Love the updates! Thank you @josh

This is very cool - I’ve had to repair a couple of apps where the dev had done this and it really makes things slow (the query is so large you can’t even open it in devtools)…

Thanks for the update, looking forward to more reliability improvements.

Thank you, appreciate the hard work

Thanks a lot! I love all the positive vibes. Let’s hope this year we can find 90 days without downtime!!

Been a great week as a Bubble user :clinking_glasses: My ‘intercom’ has been user error only, and it’s been a joy to be an app owner. Thank you @josh and team for all the hard work

Thank you Josh / Whole Bubble Team. As mentioned before, I think my trust that you are the guys to keep the dream alive still seems right.

Thank you for keeping this weekly update going, guys

Thanks for this update and the first improvements!
Without stability, the rest is nothing!! Therefore, very much appreciated that you @josh take this matter on personally.
Please keep up these weekly reports and your dedication & work to make Bubble the most reliable low-code platform in the market!! A big THX!!

Thank you, this is very encouraging.

Honestly, the last few days I’ve been deep in building my app and must have uttered the words “god I love bubble” to myself a dozen times! It’s really life-changing and has impacted my mental health dramatically. Very grateful to you guys, your team and the whole community. :heart_eyes:

HAHAHA…Bubble needs a system like AA where they get coins for 30 days, 60 days, 90 days, 6 months, 1 Year.

Was that my team? :grimacing:

Sounds like us :rofl:

Thank you @josh

Looking forward to building more with Bubble!!!

@josh
Would you mind allowing Bubble users on the lower-tier plans to purchase the option for a static IP?

I have come across this approach (seeing ultra large text fields … “justifiably”) in client apps … glad to get reassurance that this is not good practice.

It’s a specific rich text editor plugin that does it, I can’t remember which. I know Quill does it under some configurations, but there’s definitely a more widely used one that saves the base64 rather than the URL.

Based on the Bubble manual, the only limitation is a maximum of 50 recipients per email, which really isn’t a limitation at all for magic links, reset passwords, and confirmation emails, which will never exceed that amount.

So … what is the actual limit? (if there is one) Are users notified of the limit when they start with the Bubble SendGrid account? Are there notifications when the limit has been reached / exceeded?

This is great. @josh can you provide the cutoff as to how many characters are considered an “extremely large” field that would be stored in a separate system?