Hi all,
Stability is our top priority as a company, and I have been reporting weekly on our progress. If you missed it, you can read last week’s update here.
This week brought the tail end of last week’s SendGrid incident, as well as two regressions caused by errors on our team’s part, which I’ll discuss more below. That said, this is the second week in a row without any outages due to scaling pressure or infrastructure load, which reflects the successful hardening work the team has done over the last month.
Key improvements we’ve made since our last update
As mentioned last week, we’ve stabilized our appserver database in the wake of the incidents with it in May. This week, we completed two more follow-up projects to make sure it stays stable:
- We overhauled our monitoring dashboard to give us early warning of any future problems so we can intercept them before they cause downtime.
- We made some behind-the-scenes improvements to our version control merge algorithm. As you may remember, one way we stabilized the database was by temporarily blocking large merges, since they are one of the most costly operations we run on appserver. To prevent them from becoming an issue in the future, we changed the cap on total memory consumption of large merges and built in automatic safety controls so that a runaway merge can’t do too much system damage.
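As a rough sketch of the kind of safeguard described above, a merge can charge its allocations against a fixed memory budget and abort itself when it exceeds the cap. The cap value, names, and accounting scheme below are all invented for illustration; this is not Bubble’s actual merge code.

```python
# Hypothetical sketch: the cap value and all names are invented for
# illustration, not taken from Bubble's real merge implementation.

MERGE_MEMORY_CAP_BYTES = 2 * 1024**3  # assumed 2 GiB budget per merge


class MergeAborted(Exception):
    """Raised when a merge exceeds its memory budget."""


class MergeBudget:
    """Tracks the bytes a merge has consumed and aborts it past the cap."""

    def __init__(self, cap_bytes: int = MERGE_MEMORY_CAP_BYTES):
        self.cap_bytes = cap_bytes
        self.used_bytes = 0

    def charge(self, nbytes: int) -> None:
        """Record an allocation; abort the merge rather than the server."""
        self.used_bytes += nbytes
        if self.used_bytes > self.cap_bytes:
            raise MergeAborted(
                f"merge used {self.used_bytes} bytes (cap {self.cap_bytes})"
            )
```

The point of a safety control like this is that a runaway merge fails fast and cleanly, instead of exhausting memory shared with everything else running on appserver.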
Another workstream I mentioned in my previous update was transitioning away from the outdated stored procedure technology that’s interfering with our ability to cancel long-running queries. We attempted to roll a major chunk of this work out on Monday, but we had to revert it because it led to data-loading bugs, discussed in detail in the next section.
We originally kicked this workstream off as part of a two-pronged push to fix stability issues with the databases that store each app’s data. The other prong, which has already been successfully deployed, was tightening our rate-limiting controls around expensive workloads. We believe either prong on its own would likely be sufficient to prevent future database-related downtime like what we recently experienced, but we decided to do both for extra insurance and a truly robust system.
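For the curious, rate-limiting expensive workloads is commonly done with a token bucket: each expensive query spends a token, and tokens refill at a fixed rate, so bursts are allowed but sustained overload is not. The sketch below is a generic illustration with invented numbers, not Bubble’s implementation.

```python
import time


class TokenBucket:
    """Generic token-bucket rate limiter; parameters are illustrative only."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity            # maximum burst size
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; otherwise reject the work."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.refill_per_sec,
        )
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Under a scheme like this, an app that exceeds its budget has its expensive query rejected (or queued) instead of being allowed to overload a shared database.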
We kicked off both projects with extreme urgency, since we didn’t know at the time which one would take longer and we wanted to stabilize our systems ASAP. Now that the rate-limiting prong is complete, the tradeoff between speed and caution has shifted, as demonstrated by Monday’s incident. So we plan to continue with this workstream, but we are going to slow down even further and build in even more safeguards and controls first.
Incidents since our last update
SendGrid
First, I want to close the loop on the SendGrid outage for users without an API key. As of Monday, SendGrid has restored our free email sending service. That said, we are not yet satisfied with their guarantees that this won’t happen again. We’re working with SendGrid to discuss what measures we can put in place to ensure we get advance notice of impending issues, as well as exploring backups or alternatives to SendGrid. We continue to recommend that users relying on our free shared email sending in production switch to their own SendGrid API key, since the shared sending is meant for testing and early-development purposes.
Short downtime on Thursday
On Thursday, we had just over a minute of downtime after using our new instant rollback capability, mentioned in previous updates. The problem resolved itself before we were able to update our status page. We had activated the rollback to resolve a regression that was impacting a very small number of customers. The failover was slower under production load than it had been in our test environments, which left a window where we had no active servers handling traffic. We believe this problem is likely fixable, but we are pausing use of the tool until we can fully evaluate and fix it.
Monday’s regression
A more serious issue occurred on Monday. As I mentioned above, we attempted to roll out some of the work that would transition us off of our outdated stored procedure system. Specifically, we tried to release an overhauled batch data loader, which is the code that handles bulk-loading from the database in situations where an app already knows the object IDs it needs (usually because they are referenced in a field on an already-loaded object), but needs to fetch the data for those objects. The new batch loader had a bug where under certain relatively rare conditions, it would return objects in a different order than the code using that loader expected. This resulted in the wrong objects being attached to IDs. Compounding the problem, we cache this data, so even after we returned to our old loader, apps were still seeing incorrect data coming out of our cache for a period of time after the reversion.
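To illustrate the general class of bug (with invented names and toy data, not our actual loader code): a bulk database fetch typically makes no ordering guarantee, so code that attaches results to IDs by position can silently mis-attach objects, while keying each row by its own ID is order-independent.

```python
# Toy illustration of the ordering hazard; all names and data are invented.

def fetch_rows(ids):
    """Stand-in for a bulk fetch that returns rows in arbitrary order."""
    rows = [{"id": i, "name": f"object-{i}"} for i in ids]
    rows.reverse()  # any permutation is legal for an unordered fetch
    return rows


def attach_positionally(ids, rows):
    """BUGGY: assumes rows come back in the same order as the ids."""
    return {i: row for i, row in zip(ids, rows)}


def attach_by_id(ids, rows):
    """SAFE: key each row by its own id, so result order never matters."""
    by_id = {row["id"]: row for row in rows}
    return {i: by_id[i] for i in ids}
```

With `ids = [1, 2, 3]`, the positional version attaches `object-3` to ID 1, while the keyed version is correct no matter what order the database returns rows in.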
The impact of this incident varied by app, since each app uses data differently. While most impacted users were fully restored once all the incorrect data was flushed out of our cache, some apps ran workflows during the incident that saved the incorrect data elsewhere in their database. If you are still seeing incorrect data as a result of this incident and don’t know how to find and fix it, please reach out to us via our bug report tool (unfortunately, we do not have a way of fixing it automatically).
We view this as a very serious issue, since data integrity is even more foundational and important than uptime. Fortunately, we’d taken precautions that worked to limit the impact of this incident, mainly that we’d rolled it out to a fraction of our traffic rather than to all of it. Because this bug only occurred in certain conditions, the net result was that only a very small fraction of our total data loads were impacted, and most apps that we host did not experience problems. However, for the apps that were affected, this was extremely disruptive, and we are determined to strengthen our data integrity protections going forward.
Our postmortem of the issue identified many opportunities where we could have rolled this change out in a safer manner and prevented the bug from occurring. The overall root cause, in our determination, was moving too fast: As I mentioned, we fast-tracked this project because of the urgency around stabilizing our systems, and we tried to make ambitious changes to extremely sensitive parts of our system on a short timeline. While we did put a number of precautions in place and tested the new data loader on many apps, our precautions were oriented around preventing performance degradations and downtime — instead, given the code we were touching, we should have realized that data integrity issues were an even bigger concern.
As I mentioned above, we’re going to hold off on rolling this out again, and instead focus on how to safely make these kinds of foundational system changes to the way we manage and load data. Our databases are stable and protected enough that this code doesn’t need to be deployed ASAP, and it makes more sense for us to do it deliberately rather than quickly. My main personal takeaway from this incident is that it’s precisely the highest urgency, most important changes that require the closest technical scrutiny: while it’s important for our team to react quickly and do what we need to do to protect our systems, we can’t risk data integrity in favor of uptime.
The one silver lining is that we collected some production performance data on the new batch loader before we realized there was a problem, and the numbers look fantastic: we saw 50% speed improvements compared to the old system across the board. For an app that loads a lot of data, that can translate into very visible user-facing changes. So, while we are going to take our time getting this change back out there, we do see it as a major future improvement, both to system stability and to performance.
One final note: I occasionally see questions on the forum about why we make changes to our systems during ET daytime hours. We have users all over the world, working in all timezones, so there is no “good” time for Bubble to be down. We generally roll out new code five to ten times a day, because we believe (and a lot of empirical industry evidence supports) that many frequent, small updates are safer overall than infrequent, big ones. Plus, most of our team is US-based, and we believe it’s safer to make changes when we are near peak staffing, because if something does go wrong, we have more people available to catch the issue and work to get it resolved. That proved true with this particular incident: team members who were not actively working on the project jumped in to help resolve it quickly, and they would not have been as immediately available had it happened outside their normal working hours.
Looking forward
In the spirit of learning from this week’s events, we are beginning to shift from a short-term, “work as fast as possible to stabilize our systems” mode, to a medium-term, “make smart investments to improve stability while moving us toward our long-term desired infrastructure” mode. While the hustle was appropriate given the state of our systems a few weeks ago, we’ve already implemented many of the key short-term measures, and we believe that shifting toward a longer time horizon is more likely to drive meaningful value for our community while preventing more self-inflicted wounds.
That doesn’t mean stability is any less of a priority, but it does mean we’re shifting our focus from “what can we do this week” to “where do we want our systems to be in three months, six months, or nine months.” We believe that making foundational investments in the reliability and performance of Bubble’s hosting will have a lasting impact on the success of our customers and on Bubble’s own growth as a platform.
I plan to write at least one more of these weekly updates. At some point this month, we will likely discontinue them and shift to using our monthly community update as the primary vehicle for keeping you apprised of the progress we are making toward our stability goals. We will keep you posted!
Thanks,
— Josh and Emmanuel