Okay, we’ve found the root cause of the issue, and I personally screwed this one up. I was doing some security auditing work yesterday and rotated one of our credentials, and when I pushed the new one out, I missed deploying it to the scheduled cluster. The broken credential caused a number of downstream issues, including both bugs described in the above post. When we automatically deployed the latest version of our code to the scheduled cluster at 9 am this morning, that deployment fixed the problem.
We’ll be doing an internal postmortem of this incident, but some quick thoughts on how we plan to prevent it from happening again:
This is the second time we’ve had an issue on the scheduled cluster because we did operational work on the main cluster and forgot to also update the scheduled cluster. This is something we can fix via automation, which we plan to build.
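To make the idea concrete, here is a minimal sketch of what such automation might look like. This is purely illustrative and not Bubble’s actual tooling; the cluster names and the `deploy` callback are assumptions. The point is that a rollout targets every known cluster in one operation and fails loudly if any cluster is missed, so a human can no longer forget one.

```python
# Illustrative sketch only (not Bubble's real infrastructure): push a rotated
# credential to every known cluster in a single pass, and refuse to succeed
# silently if any cluster was not updated.

CLUSTERS = ["main", "scheduled"]  # assumed cluster names for illustration


def rollout_credential(secret: str, deploy) -> dict:
    """Deploy `secret` to every cluster via the `deploy(cluster, secret)`
    callback; raise if any cluster did not receive it."""
    results = {cluster: deploy(cluster, secret) for cluster in CLUSTERS}
    missing = [cluster for cluster, ok in results.items() if not ok]
    if missing:
        raise RuntimeError(f"credential not deployed to: {missing}")
    return results


if __name__ == "__main__":
    # Fake deploy callback standing in for real infrastructure calls.
    pushed = []
    rollout_credential("new-secret", lambda c, s: pushed.append(c) or True)
    print(sorted(pushed))  # every cluster updated in one pass
```

The key design choice is that the list of clusters lives in one place in code, rather than in an operator’s head.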
I occasionally do operational work myself instead of relying on the rest of the engineering team, in situations where my access levels make it easier for me to do it. However, this is risky: when I do side-channel work instead of going through our normal processes, we don’t necessarily have the same level of controls and double-checks. This is a standard growing pain for startups (founders used to making direct changes to infrastructure have to figure out how to transition that work to the team), and I should have been more genre-savvy here. I’m sorry, this was dumb of me, and I’m going to work with the team to be more careful going forward.
Thank you for the detailed explanation and for your personal emails as well.
Aside from the automation fix, is there anything that could be shown on the Bubble Status page in the future to indicate code-related issues, such as an increase in bug reports within a short time frame on specific clusters, which would alert you when such a spike is detected?
This would give end users confidence that their reports are acknowledged and that the issue is indeed affecting other users in that time period and cluster.
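The signal described above could be computed along these lines. This is a hypothetical sketch with assumed data shapes (timestamped reports tagged with a cluster name), not Bubble’s real reporting pipeline: bucket incoming bug reports by cluster and 15-minute window, and flag any bucket whose count exceeds a threshold as a candidate status-page notice.

```python
# Hypothetical sketch (assumed data shapes, not Bubble's real system): group
# bug reports into per-cluster 15-minute windows and flag unusually busy ones.
from collections import Counter

WINDOW_SECONDS = 15 * 60  # 15-minute buckets


def flag_report_spikes(reports, threshold=10):
    """`reports` is an iterable of (timestamp_seconds, cluster) pairs.

    Returns the set of (cluster, window_start) buckets whose report count
    exceeds `threshold`, i.e. candidates for a status-page notice.
    """
    buckets = Counter(
        (cluster, ts - ts % WINDOW_SECONDS) for ts, cluster in reports
    )
    return {bucket for bucket, count in buckets.items() if count > threshold}
```

For example, twelve reports against the scheduled cluster in one window would flag that (cluster, window) pair, while a single report against another cluster would not.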
Your mea culpa, while refreshing, doesn’t address the core issues above. Namely:
There is no 24/7 app support.
No way to pay for better support.
I’ve been watching people blow their stack and get extremely worked up for 6 years now. Despite hundreds, maybe thousands, of conversations about this, support, and Bubble-side app breakage, is still seen by many forum users as very substandard.
At some point you need to stop talking about this, stop apologizing, and just fix it.
Very good points. Thinking about this a bit more: while it’s bad that I broke things, the biggest issue here is the response time between the breakage and this getting escalated as an emergency.
On the metrics front, yes, we’ll look into whether we could have automatically detected this from either our logs or from the volume of bug reports.
On the team coverage front, we’ve gone from 30 people at the beginning of the summer to 55 people today, so we are definitely staffing up quickly. (We have to train everyone who comes on board, so doubling the team size is very aggressive growth.) We just hired our first Europe-based success person, who starts soon, to begin getting better timezone coverage; we’d like to hire more, and we’d encourage people to share the JD from our jobs page. 24/7 support is in the cards, but we need team members across time zones first, so that there’s someone at their desk monitoring who can make the call on whether it’s worth waking engineers. We do have automated alerts that will wake someone, but they tend to catch only complete downtime, not issues like this where some apps are hitting errors. That said, I think we can tune them to do a better job of catching things like this.
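The tuning described above, catching partial breakage rather than only complete downtime, might look something like this. This is a minimal sketch under assumptions (per-cluster error counts sampled in fixed windows), not Bubble’s actual monitoring: compare the current window’s error count to a rolling baseline, with an absolute floor so background noise doesn’t page anyone.

```python
# Illustrative sketch, not Bubble's real alerting: fire when a cluster's
# error count in the current window is anomalously high versus its recent
# baseline, so "some apps are hitting errors" triggers a page, not just
# total downtime.
from statistics import mean


def should_alert(history, current, multiplier=3.0, floor=20):
    """Return True if `current` errors exceed both an absolute floor and a
    multiple of the rolling baseline computed from `history` (a list of
    per-window error counts)."""
    baseline = mean(history) if history else 0.0
    return current >= floor and current > multiplier * baseline
```

With a baseline of single-digit errors per window, a jump to 40 errors would alert, while a jump to 10 would stay below the floor and be treated as noise.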
Thank you @josh for that explanation and transparency. Really appreciated it. Many of us using Bubble are startups so we can definitely relate.
To the other point: what would have made this incident a lot better is if we had gotten early notification from Bubble that this was indeed a known Bubble issue, and reassurance that Bubble was looking into it as a priority (even on a weekend, given its severity).
In my case I spent hours trying to find out why: looking at DNS tests, Mac vs PC, Chrome vs Safari, Wi-Fi issues, clearing cookies, monitoring, trying to find entries in the logs, etc., and I got my developers involved over the weekend to confirm.
We are reliant on Bubble to operate, as that’s ultimately how we earn trust with our own clients. So if there is any way we can help you and the Bubble team with input, user feedback, or anything else, please don’t hesitate to reach out. We all want this to be successful!
@josh Can I second the vote to double down on your customer support? Right now I’m having trouble getting resolutions during working hours, let alone out of hours. As a production customer, I haven’t been able to safely deploy my app for over two weeks due to a simple bug, yet I haven’t been able to get a resolution despite multiple emails. Often there’s a quick initial response, but for anything that isn’t resolved immediately or is slightly complex, the thread falls into the ether and becomes difficult to chase up and resolve. To me this suggests both resource and process issues internally. I appreciate the growing pains you’re having, but I’d suggest over-investing the funding you’ve received in this area to cover the shortfall in the short term, as it feels like the wheels are falling off for some of us, and that makes it hard for scaling companies like ours to have confidence in getting critical issues resolved.
Thank you for the prompt reply! I’ve been able to confirm that other users are experiencing this deploy-to-live error as well, and our engineering team is now actively investigating the behavior. We’ll get in touch directly once this is resolved; let us know if you have any other questions in the meantime!
Saturday morning at 10:16 am. I’ve had the “Sorry, we ran into a temporary bug and can’t complete your request. We’ll fix it as soon as we can; please try again in a bit!” message about 10 times this morning.
@stephane you haven’t been able to deploy to live with outstanding issues for a long time. If you did manage to do that in the past, then you found a bug that has since been fixed. You need to fix your issues and then deploy to live.