Stability Weekly Update June 7

Hi all,

Stability is our top priority as a company, and I have been reporting weekly on our progress. If you missed it, you can read last week’s update here.

This week brought the tail end of the SendGrid incident from last week as well as two regressions caused by errors by our team, which I’ll discuss more below. That said, this is the second week in a row without any outages due to scaling pressure or infrastructure load, which reflects the successful hardening work the team has done over the last month.

Key improvements we’ve made since our last update

As mentioned last week, we’ve stabilized our appserver database in the wake of the incidents with it in May. This week, we completed two more follow-up projects to make sure it stays stable:

  • We overhauled our monitoring dashboard to give us early warning of any future problems so we can intercept them before they cause downtime.

  • We made some behind-the-scenes improvements to our version control merge algorithm. As you may remember, one way we stabilized the database was by temporarily blocking large merges, since they are one of the most costly operations we run on appserver. To prevent them from becoming an issue in the future, we changed the cap on total memory consumption of large merges and built in automatic safety controls so that a runaway merge can’t do too much system damage.

Another workstream I mentioned in my previous update was transitioning away from the outdated stored procedure technology that’s interfering with our ability to cancel long-running queries. We attempted to roll a major chunk of this work out on Monday, but we had to revert it because it led to data-loading bugs, discussed in detail in the next section.

We originally kicked this workstream off as part of a two-pronged push to fix stability issues with the databases that store each app’s data. The other prong, which has already been successfully deployed, was tightening our rate-limiting controls around running expensive workloads. We believe either prong will likely be sufficient on its own to prevent future database-related downtime like what we recently experienced, but we decided to do both to achieve a truly robust system with extra insurance.

We kicked off both projects with extreme urgency since we didn’t know at the time which one would take longer, and we wanted to stabilize our systems ASAP. Now that the rate-limiting prong is complete, the tradeoff between speed and caution has shifted, as demonstrated by Monday’s incident. So we plan to continue with this workstream, but we are going to slow down even further and build in even more safeguards and controls first.

Incidents since our last update

SendGrid

First, I want to close the loop on the SendGrid outage for users without an API key. As of Monday, SendGrid has restored our free email sending service. That said, we are not yet satisfied with their guarantees that this won’t happen again in the future. We’re working with SendGrid to discuss what measures we can put in place to ensure we get advance notice of impending issues, as well as exploring backups or alternatives to SendGrid. We continue to recommend that users relying on our free shared email sending in production switch to their own SendGrid API key, since the shared sending is meant for testing and early-development purposes.

Short downtime on Thursday

On Thursday, we had just over a minute of downtime after we used our new instant rollback capability mentioned in previous updates. The problem resolved itself before we were able to update our status page. We had activated the rollback to resolve a regression that was impacting a very small number of customers. The failover was slower under production load than it was in our test environments, which caused a window where we had no active servers to handle traffic. We believe this problem is likely fixable, but we are putting use of the tool on pause until we can fully evaluate and fix it.

Monday’s regression

A more serious issue occurred on Monday. As I mentioned above, we attempted to roll out some of the work that would transition us off of our outdated stored procedure system. Specifically, we tried to release an overhauled batch data loader, which is the code that handles bulk-loading from the database in situations where an app already knows the object IDs it needs (usually because they are referenced in a field on an already-loaded object), but needs to fetch the data for those objects. The new batch loader had a bug where under certain relatively rare conditions, it would return objects in a different order than the code using that loader expected. This resulted in the wrong objects being attached to IDs. Compounding the problem, we cache this data, so even after we returned to our old loader, apps were still seeing incorrect data coming out of our cache for a period of time after the reversion.
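For readers who want a concrete picture of this class of bug, here is a minimal Python sketch. It is purely illustrative and is not Bubble's actual loader: `ROWS`, `fetch_rows`, `attach_by_position`, and `attach_by_id` are hypothetical names, and the sorted return order simply stands in for "a different order than the caller expected". It contrasts pairing results with requested IDs by position against matching each row by its own ID.

```python
from typing import Dict, List

# Hypothetical in-memory "database": object ID -> stored row.
ROWS = {
    "a1": {"id": "a1", "name": "Alice"},
    "b2": {"id": "b2", "name": "Bob"},
    "c3": {"id": "c3", "name": "Carol"},
}


def fetch_rows(ids: List[str]) -> List[dict]:
    """Hypothetical batch fetch that does NOT preserve request order.

    Sorting by ID stands in for whatever order a backend happens to use.
    """
    return [ROWS[i] for i in sorted(ids) if i in ROWS]


def attach_by_position(ids: List[str]) -> Dict[str, dict]:
    """Fragile: assumes the i-th row belongs to the i-th requested ID."""
    return dict(zip(ids, fetch_rows(ids)))


def attach_by_id(ids: List[str]) -> Dict[str, dict]:
    """Safer: match each row to its request using the row's own ID."""
    by_id = {row["id"]: row for row in fetch_rows(ids)}
    missing = [i for i in ids if i not in by_id]
    if missing:
        raise KeyError(f"rows not returned for ids: {missing}")
    return {i: by_id[i] for i in ids}
```

With `["c3", "a1", "b2"]` as the request, `attach_by_position` attaches Alice's row to `c3`, while `attach_by_id` keeps every ID paired with its own row. When the request happens to already be in the backend's order, both functions agree, which is why a bug of this kind can stay hidden except under relatively rare conditions.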

The impact of this incident varied by app, since each app uses data differently. While most impacted users were fully restored once all the incorrect data was flushed out of our cache, some apps ran workflows during the incident that saved the incorrect data elsewhere in their database. If you are still seeing incorrect data as a result of this incident and don’t know how to find and fix it, please reach out to us via our bug report tool (unfortunately, we do not have a way of fixing it automatically).

We view this as a very serious issue, since data integrity is even more foundational and important than uptime. Fortunately, we’d taken precautions that worked to limit the impact of this incident, mainly that we’d rolled it out to a fraction of our traffic rather than to all of it. Because this bug only occurred in certain conditions, the net result was that only a very small fraction of our total data loads were impacted, and most apps that we host did not experience problems. However, for the apps that were affected, this was extremely disruptive, and we are determined to strengthen our data integrity protections going forward.
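As an illustration of rolling a change out to a fraction of traffic, here is a minimal Python sketch of deterministic percentage bucketing. This is an assumption about one common technique, not a description of how the rollout was actually split; `in_rollout`, the feature name, and the use of an app ID as the key are all hypothetical.

```python
import hashlib


def in_rollout(app_id: str, feature: str, percent: float) -> bool:
    """Return True if this app falls inside the rollout percentage.

    Hashing (feature, app_id) gives each app a stable bucket in [0, 10000),
    so the same app always gets the same answer for a given feature, and
    raising the percentage only ever adds apps to the rollout.
    """
    digest = hashlib.sha256(f"{feature}:{app_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100  # e.g. percent=5.0 covers buckets 0..499
```

A call such as `in_rollout("my-app", "new-batch-loader", 5.0)` would then decide which code path serves that app. Because the bucket is stable, an app that saw the new code path keeps seeing it until the percentage is lowered, which makes it easier to identify which apps were exposed.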

Our postmortem of the issue identified many opportunities where we could have rolled this change out in a safer manner and prevented the bug from occurring. The overall root cause, in our determination, was moving too fast: As I mentioned, we fast-tracked this project because of the urgency around stabilizing our systems, and we tried to make ambitious changes to extremely sensitive parts of our system on a short timeline. While we did put a number of precautions in place and tested the new data loader on many apps, our precautions were oriented around preventing performance degradations and downtime — instead, given the code we were touching, we should have realized that data integrity issues were an even bigger concern.

As I mentioned above, we’re going to hold off on rolling this out again, and instead focus on how to safely make these kinds of foundational system changes to the way we manage and load data. Our databases are stable and protected enough that this code doesn’t need to be deployed ASAP, and it makes more sense for us to do it deliberately rather than quickly. My main personal takeaway from this incident is that it’s precisely the highest urgency, most important changes that require the closest technical scrutiny: while it’s important for our team to react quickly and do what we need to do to protect our systems, we can’t risk data integrity in favor of uptime.
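One safeguard sometimes used for changes like this is a shadow comparison: keep serving results from the existing code while running the new code on the side and recording any differences. The Python sketch below is only an illustration of that idea under stated assumptions. The post does not say which safeguards will be adopted, and `shadow_load`, `Loader`, and the `mismatches` list are hypothetical names.

```python
from typing import Callable, Dict, List

# A loader takes requested IDs and returns a mapping of ID -> object.
Loader = Callable[[List[str]], Dict[str, dict]]


def shadow_load(ids: List[str], old_loader: Loader, new_loader: Loader,
                mismatches: List[str]) -> Dict[str, dict]:
    """Serve results from old_loader; run new_loader only for comparison."""
    trusted = old_loader(ids)
    try:
        candidate = new_loader(ids)
    except Exception as exc:  # a failing candidate must never affect users
        mismatches.append(f"new loader raised {exc!r}")
        return trusted
    for obj_id in ids:
        if candidate.get(obj_id) != trusted.get(obj_id):
            mismatches.append(obj_id)  # record the ID for later investigation
    return trusted
```

The design choice here is that the new loader can never change what users see: it is checked for correctness, not just speed, before any traffic depends on it. The cost is doing every load twice while the comparison runs.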

The one silver lining is that we collected some production performance data on the new batch loader before we realized there was a problem, and the numbers look fantastic: We saw 50% speed improvements compared to the old system across the board. For an app that loads a lot of data, that can translate into very visible user-facing changes. So, while we are going to take our time on getting this change back out there, we do see it as a major future improvement, both to system stability and to performance.

One final note: I occasionally see questions on the forum about why we make changes to our systems during ET daytime hours. We have users all over the world, working in all timezones, so there is no “good” time for Bubble to be down. We generally roll out new code five to ten times a day, because we believe (and there’s a lot of empirical industry evidence to support this) that lots of frequent, small updates are safer overall than infrequent, big updates. Plus, most of our team is US-based, and we believe it’s safer to make changes when we are near peak staffing, because if something does go wrong, we have more people available to catch the issue and work to get it resolved. As it happened, that proved true with this particular incident: Team members who were not actively working on that particular project jumped in to help resolve it quickly, but they would not have been as immediately available had it happened outside their normal working hours.

Looking forward

In the spirit of learning from this week’s events, we are beginning to shift from a short-term, “work as fast as possible to stabilize our systems” mode, to a medium-term, “make smart investments to improve stability while moving us toward our long-term desired infrastructure” mode. While the hustle was appropriate given the state of our systems a few weeks ago, we’ve already implemented many of the key short-term measures, and we believe that shifting toward a longer time horizon is more likely to drive meaningful value for our community while preventing more self-inflicted wounds.

That doesn’t mean stability is any less of a priority, but it does mean we’re focusing less on “what can we do this week” and more on “where do we want our systems to be in three months, six months, or nine months.” We believe that making foundational investments in the reliability and performance of Bubble’s hosting will have resounding impact for the success of our customers and for Bubble’s own growth as a platform.

I plan to write at least one more of these weekly updates. At some point this month, we will likely discontinue them and shift to using our monthly community update as the primary vehicle for keeping you apprised of the progress we are making toward our stability goals. We will keep you posted!

Thanks,

— Josh and Emmanuel

Thank you @josh - Great update. Lots of recent challenges but this is a great symbol of the effort from the Bubble team to deliver a great product.

Would be great if the following things could be fixed:

  1. Allow the SendGrid API even if no custom domain is attached. Sometimes we already need that during development, before production.

  2. Move the “redirect all traffic to x” option to the domain settings. It’s currently under email settings (which doesn’t make sense) and prevents development when you attach a domain but the DNS records have not propagated yet.

I love these posts! I understand full well that you need to phase out these weekly mails on stability, but as a guy who has been dealing with issues like these for many years, I thoroughly appreciate the detail and depth of the reports.

I hope that these technical “deep dives” can be added further down the road when new bigger changes to the platform are announced (and are warranted). It really adds depth to the secret sauce that is otherwise hidden from us on purpose of building nocode! :nerd_face::call_me_hand:

Can we get a timestamp of when this issue started and ended, so we can verify our data between these times?

Better yet, can you notify us if any of our apps were in the small list of those that were potentially affected by this data issue?

@josh I have and I’m still waiting for help! And I was the first to post on the forum about it and created a bug report within minutes of confirming I wasn’t going mad. I have data that was corrupted and calculating incorrectly and that’s only the data I know about so far. It’s small at the moment and I hope it stays that way, but you can imagine my fear that it’s much bigger and I’ll find more as time goes on.

I’m currently sifting through data from 3rd June and comparing it to data that I have backed up using @lindsay_knowcode PlanB, because Bubble wasn’t able to provide me with a backup of my non-corrupted data.

Sounds dramatic but the anxiety I’ve had since June 3rd is some I’ve never experienced before.

I’m generally a happy bubbler…

But this data issue is something I never expected. I’m lost for words this time and my faith is wearing thin…as much as I bloody love bubble.

UPDATE: I’ve heard from the support team since posting :raised_hands: Looking forward to this being resolved

12:00PM to 6:00PM EDT on June 3rd
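For anyone checking exported records against this window, here is a small Python sketch. It makes several assumptions: the year is not stated in this thread, so it is a required parameter; each record is a hypothetical dict with a timezone-aware `modified` datetime; and EDT is treated as a fixed UTC-4 offset.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

EDT = timezone(timedelta(hours=-4))  # Eastern Daylight Time is UTC-4


def incident_window_utc(year: int) -> Tuple[datetime, datetime]:
    """12:00 PM to 6:00 PM EDT on June 3 of the given year, converted to UTC."""
    start = datetime(year, 6, 3, 12, 0, tzinfo=EDT)
    end = datetime(year, 6, 3, 18, 0, tzinfo=EDT)
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc)


def modified_in_window(records: Iterable[dict], year: int) -> List[dict]:
    """Keep records whose timezone-aware 'modified' time falls in the window."""
    start, end = incident_window_utc(year)
    return [r for r in records if start <= r["modified"] <= end]
```

In UTC the window runs from 16:00 to 22:00 on June 3. Since the original post notes that apps kept seeing incorrect cached data for a period after the reversion, it may be worth widening the end of the window when reviewing your own data.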

Is there a world where we can get a heads up on this so we can prepare our users…and ourselves?

The data issue on June 3 was hell really.

I had so many clients and their customers blowing up my email with angry and confused emails. All the bubble apps were unusable for several hours that day until the issue was fixed.

I spent a few hours trying to figure out what was going on thinking I had miffed something up majorly on my end. Alas, another bubble issue completely out of my hands. I also checked the bubble issue logger page but no issue was reported there - I did not think to check the forum as to me that’s not where I feel I should look.

I suggest you add an “urgent alert” feature into the builder of bubble for devs to notify us of when there are major issues like this. Sending an email to all bubble admins would also have been a great step…

What would be even better would be to have a “bubble made another mistake” default page in the builder so we can customize it to our brand and every time bubble makes a major mistake like this (which seems to be almost monthly now) we can show the “something broke” page until it’s fixed. This would allow better communication to the end user as well as the admins…

Instead there was radio silence and that caused wasted time and resources and needless concerns. These forum posts are great but not everyone is on the forum everyday.

You can subscribe to email / SMS on the status page though of course that only works when an issue is reported

I recommend having a maintenance mode built into your site using DB that you can just toggle if there’s a serious issue like this. Of course, won’t work if you’re not awake!

It could work if bubble would send webhooks with error codes and we could handle those codes in our backend however we want (if it’s not completely down).

I’ll start with the caveat that I’m a non-technical founder of a bubble built product. With that said, I can’t explain how frustrating it is that this celebration of uptime is included in the preamble to a massive incident that cost me at least one major customer.

Our users are demanding assurances this type of unforced error won’t happen again. How can I give them that, Josh? Because I promise you they didn’t really care about improved bulk uploading speeds when their invoices went haywire.

We are a B2B SaaS business, so I haven’t had a chance to compute the amount of revenue we lost because our customer “can’t trust their own invoices”. We’ll be cleaning this mess up for at least the rest of the month to get all the invoices straight, which is needle in a haystack type work.

I’d like to suggest another KPI to add to the Bubble Status page (which has been bookmarked on my browser for several months due to how often I have to revert back to it). If Bubble is going to be celebrating uptime when data has been compromised, can you add a “% of confidence in data” display? Because uptime means absolutely nothing if customers can’t trust their own data.

No mention of the breaking change to the way front-end requests to the back-end are handled, which broke a lot of flows for a lot of people. It was frustrating to have our support inbox blow up due to something outside of our control, with no mention of it anywhere. Of course it doesn’t go down as a big stability thing, but I would argue that when it is a breaking change for apps using that method, it is kind of a big deal. Yet there have been no mentions, no acknowledgement on any of the posts regarding it, no apologies, and no care for those affected.

This is possible on https://status.bubble.io/

Has anyone else had any experience with this…

The incorrect data being saved on my app actually isn’t related to June 3rd incident :face_exhaling: as support have found a record of this being saved wrong back in April…I’m not sure if this is better or worse :persevere:

After a week of back and forth, support haven’t yet been able to provide a solution for how to fix this issue for my user, so I’m coming here in the hope someone can help me :pray:

The data for a particular ‘service’ is going through the log correctly but then saving to the database incorrectly:

Unit total 60, quantity 1, total 60

The total is then saving in the database as 140, when this should obviously be 60*1=60.

I’ve deleted the service and recreated and it’s still doing it.

Caching was the latest reason support gave me…but it can’t be caching as I’ve tested from my computers and also these are created via an enquiry form so it’s being replicated from other users devices too.

This only seems to be happening to one user and I literally have no idea how I can resolve this. Any help greatly appreciated

Are you sure there’s not another log that modified it from 60 to 140? A trigger, or just another Make changes to a thing/list of things that is including this object correctly but by mistake?

I’m 1000% sure. This enquiry form has been live for over 18 months, with tens of thousands of things created since, and this is an anomaly.