At approximately 15:14 PDT (22:14 UTC) on 2020-09-11, most HTTPS services on Render started returning a page indicating the service had been suspended by its owner. We rolled out a fix that started bringing services back online at 16:00 PDT, and all HTTPS responses from the platform were back to normal by 16:21 PDT.
Because the suspension page was incorrectly served with a 200 response code, intermediate proxies cached the suspended content, and a small number of static sites continued to return the suspension message until 14:11 PDT on 2020-09-12.
This outage was a difficult time for all our customers. We did not uphold the reliability standards you expect of us, or those we've set for ourselves. To everyone affected by the outage, we're sorry.
The seriousness of this incident cannot be overstated, and we're doing everything we can to learn from it and make Render's platform much more reliable going forward.
The outage was caused when we ran a script to migrate a customer's services across teams. The script had a bug that incorrectly updated the suspension state of existing services in our database, causing Render's serving infrastructure to return a suspension notice. All services were running as expected on the backend, and no customer data was compromised at any time.
The issue was fixed by returning the database to the state prior to running the script.
Ultimately, this was a process and systems failure exacerbated by the lack of automatic mitigations. It forced us to take a hard look at how we operate Render and find ways to prevent an incident like this from ever happening again.
We have removed the ability to manually run scripts on production infrastructure. While this feature allowed us to move faster in certain situations, it traded safety and reliability for velocity and ultimately led to this outage. We are conducting a thorough review of all our engineering practices to find ways to improve operational safety, especially as we expand the capabilities of the platform.
We are implementing code changes to protect against unintended batch updates to the database. This will make the platform safer and minimize the chances of human error leading to an outage.
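As an illustration only, one common way to guard against unintended batch updates is to cap the number of rows a single statement may change and roll back if the cap is exceeded. The sketch below is not Render's implementation: the `guarded_update` helper, the `BatchUpdateTooLarge` exception, the `services` table, and its columns are all hypothetical, and SQLite stands in for whatever database is actually used.

```python
import sqlite3


class BatchUpdateTooLarge(Exception):
    """Raised when a statement touches more rows than the caller allowed."""


def guarded_update(conn, sql, params, max_rows):
    """Run one UPDATE and commit only if it changed at most max_rows rows.

    Hypothetical helper: if the statement touches more rows than expected,
    the transaction is rolled back and no rows are changed.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount > max_rows:
            raise BatchUpdateTooLarge(
                f"statement changed {cur.rowcount} rows, limit is {max_rows}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        # Undo the uncommitted change before surfacing the error.
        conn.rollback()
        raise


def make_demo_db():
    """Build an in-memory table with made-up columns for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE services ("
        "id INTEGER PRIMARY KEY, team_id INTEGER, suspended INTEGER)"
    )
    conn.executemany(
        "INSERT INTO services (team_id, suspended) VALUES (?, 0)",
        [(1,), (1,), (2,), (3,)],
    )
    conn.commit()
    return conn
```

With a guard like this, an update scoped to one team succeeds, while a statement that is missing its `WHERE` clause raises `BatchUpdateTooLarge` and leaves the table untouched. The trade-off is that the caller has to state up front how many rows it expects to change.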
We have changed the HTTP code returned for suspended services to 503 so the content is never cached and monitoring infrastructure (both ours and our customers') can detect and alert on suspensions.
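To make the status code change concrete, here is a minimal sketch of how a suspension response could be built. Only the 503 status comes from the change described above; the `suspended_response` function, the message text, and the explicit `Cache-Control: no-store` header are assumptions added for illustration.

```python
from http import HTTPStatus


def suspended_response(service_name):
    """Build (status, headers, body) for a suspended service.

    Hypothetical helper: the body text and headers are illustrative.
    """
    body = (
        f"Service {service_name} has been suspended by its owner.\n"
    ).encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        # Assumption: an explicit directive so no intermediary stores the page,
        # in addition to 503 not being cacheable by default.
        "Cache-Control": "no-store",
    }
    # 503 lets monitoring treat a suspension as a failure rather than a success.
    return HTTPStatus.SERVICE_UNAVAILABLE, headers, body
```

A health check that expects a 2xx status would fail against this response, which is what allows monitoring to alert on suspensions.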
Finally, we're working on automating much of the mitigation work involved in this incident, and increasing redundancy in our systems to minimize the duration and impact of outages in the future.