Hi guys,
Sorry about the issues. The brief downtimes you’ve been seeing are caused by a single point of failure in our infrastructure that’s under increasing load. The issue is that many of our main-cluster apps have their custom domains pointed to the IP address 54.69.164.32, and we can only have a single server at a time connected to that address.
We’re working on a project to have custom domains pointed to CNAME records that will allow us to dynamically spread the load out and avoid this issue altogether. We also plan to route traffic to apps through CloudFlare for increased performance and reliability. Unfortunately, we’re starting to come up against our limits prior to getting this into place (I estimate we’re still at least a month away from rolling this out).
In the interim, we’re setting up some additional IP addresses to help reduce the load and stop the outages. Unfortunately, we have no way of rolling this out without you guys making a change on your end. So:
A) If you have not noticed brief downtimes over the last few days, no need to do anything.
B) If you have noticed, or are still noticing, issues, check whether you have A records set with your domain registrar pointing your domain (and www.yourdomain) to 54.69.164.32. If not, no need to do anything.
C) If you do have those records set, please flip a coin to pick either 54.69.164.32 or 54.68.12.205, and point your A records at whichever address you land on. If you land on 54.69.164.32, there is nothing to change. (If we end up with an uneven distribution, we’ll follow up with people by email and ask them to switch in a more systematic manner; for now, please just pick randomly.)
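
If it helps, steps B and C can be roughly sketched in Python. This is only an illustration, not an official tool: the function names are made up for this example, and the lookup goes through your local resolver, so it shows what your domain currently resolves to rather than what is literally configured at your registrar (cached or CNAME-followed results can differ).

```python
import random
import socket

OLD_IP = "54.69.164.32"   # the shared address mentioned in this post
NEW_IP = "54.68.12.205"   # the additional address mentioned in this post


def resolve_a_records(hostname):
    """Return the IPv4 addresses the hostname currently resolves to.

    Uses the local resolver, so results may be cached. Returns an
    empty list if the lookup fails.
    """
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return []
    return addresses


def needs_change(addresses):
    """Step B: True if any resolved address is the overloaded one."""
    return OLD_IP in addresses


def pick_address(rng=random):
    """Step C: the 'coin flip' between the two addresses."""
    return rng.choice([OLD_IP, NEW_IP])


def check(hostname):
    """Hypothetical helper: print a suggestion for one hostname."""
    addresses = resolve_a_records(hostname)
    if needs_change(addresses):
        print(f"{hostname}: points at {OLD_IP}; suggested A record: {pick_address()}")
    else:
        print(f"{hostname}: not pointing at {OLD_IP}; nothing to do")
```

To use it, you would call `check("yourdomain.com")` and `check("www.yourdomain.com")` with your own domain substituted in, then update the A records at your registrar by hand if the suggestion differs from what you have now.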
Sorry for the inconvenience, and thanks for the help while we get to a better long-term solution for this issue.
(Also, see @keith’s helpful note above about alerting and downtime in general).