Okay, we’ve found the root cause of the issue. I personally screwed this one up this time. I was doing some security auditing work yesterday and rotated one of our credentials, and when I was pushing the new ones out, I missed deploying them to the scheduled cluster. The broken credential caused a number of downstream issues, including both bugs described in the above post. When we automatically deployed the latest version of our code at 9 am this morning to the scheduled cluster, this automatically fixed the problem.
We’ll be doing an internal postmortem of this incident, but some quick thoughts on how we plan to prevent it from happening again:
- This is the second occasion where we’ve had an issue on the scheduled cluster because we did some operational work on the main cluster and didn’t remember to also update the scheduled cluster. This is something we can fix via automation, which we plan to build
- I occasionally do some operational work myself instead of relying on the rest of the engineering team in situations where I have access levels that would make it easier for me to do it. However, this is risky, because when I do side-channel work instead of going through our normal processes, we don’t necessarily have the same levels of controls and double-checks. This is a standard growing pain for startups (founders used to making direct changes to infrastructure having to figure out how to transition that to the team), and I should have been more genre-savvy here. I’m sorry, this was dumb of me and I’m going to work with the team to be more careful going forward.