Service Outage
Incident Report for Render
Postmortem

What Happened

At approximately 15:14 PDT (22:14 UTC) on 2020-09-11, most HTTPS services on Render started returning a page indicating the service had been suspended by its owner. We rolled out a fix that started bringing services back online at 16:00 PDT, and all HTTPS responses from the platform were back to normal by 16:21 PDT.

Because the suspension page was incorrectly served with a 200 response code, intermediate proxies cached the suspended content, and a small number of static sites were still returning a suspended message until 14:11 PDT on 2020-09-12.

This outage was a difficult time for all our customers. We did not uphold the reliability standards you expect of us, or those we've set for ourselves. To everyone affected by the outage, we're sorry.

The seriousness of this incident cannot be overstated, and we're doing everything we can to learn from it and make Render's platform much more reliable going forward.

How It Happened

The outage was caused when we ran a script to migrate a customer's services across teams. The script had a bug that incorrectly updated the suspension state of existing services in our database, causing Render's serving infrastructure to return a suspension notice. All services were running as expected on the backend, and no customer data was compromised at any time.

The issue was fixed by returning the database to the state prior to running the script.

Ultimately, this was a process and systems failure exacerbated by the lack of automatic mitigations. It forced us to take a hard look at how we operate Render and find ways to prevent an incident like this from ever happening again.

Why It Won't Happen Again

We have removed the ability to manually run scripts on production infrastructure. While this feature allowed us to move faster in certain situations, it traded safety and reliability for velocity and ultimately led to this outage. We are conducting a thorough review of all our engineering practices to find ways to improve operational safety, especially as we expand the capabilities of the platform.

We are implementing code changes to protect against unintended batch updates to the database. This will make the platform safer and minimize the chances of human error leading to an outage.
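One way such a safeguard could look is sketched below. This is an illustration only, not Render's actual change: the `guarded_update` helper, the `BatchUpdateTooLarge` exception, the `max_rows` parameter, and the `services` table layout are all hypothetical. The idea is that the caller declares how many rows an update is expected to touch, and the transaction is rolled back if the statement affects more than that.

```python
import sqlite3


class BatchUpdateTooLarge(Exception):
    """Raised when an UPDATE touches more rows than the caller declared."""


def guarded_update(conn, sql, params, max_rows):
    """Run an UPDATE and commit only if it affected at most max_rows rows.

    Hypothetical helper: on any failure, including exceeding the limit,
    the open transaction is rolled back so no rows are changed.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount > max_rows:
            raise BatchUpdateTooLarge(
                f"statement affected {cur.rowcount} rows, limit is {max_rows}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        conn.rollback()
        raise


def make_demo_db():
    """Build an in-memory database with a hypothetical services table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE services (id INTEGER PRIMARY KEY, owner TEXT, suspended INTEGER)"
    )
    conn.executemany(
        "INSERT INTO services (owner, suspended) VALUES (?, 0)",
        [("team-a",), ("team-b",), ("team-c",)],
    )
    conn.commit()
    return conn
```

With this in place, a statement that is missing its WHERE clause (for example `UPDATE services SET suspended = 1` called with `max_rows=1`) raises `BatchUpdateTooLarge` and leaves every row unchanged, while the same update scoped to a single owner goes through.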

We have changed the HTTP code returned for suspended services to 503 so the content is never cached and monitoring infrastructure (both ours and our customers') can detect and alert on suspensions.
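As a rough sketch of what serving a suspension page this way can look like, the handler below returns 503 using Python's standard `http.server` module. It is not Render's serving code. The `suspended_response` function, the `SuspendedHandler` class, and the body text are hypothetical, and the explicit `Cache-Control: no-store` header is an added assumption that the report does not mention.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical body text; the real suspension page is not shown in this report.
SUSPENDED_BODY = b"This service has been suspended by its owner.\n"


def suspended_response():
    """Return (status, headers, body) for a suspended service."""
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(SUSPENDED_BODY)),
        # Assumption: also tell caches explicitly not to store the response.
        "Cache-Control": "no-store",
    }
    return 503, headers, SUSPENDED_BODY


class SuspendedHandler(BaseHTTPRequestHandler):
    """Answers every GET with the suspension response."""

    def do_GET(self):
        status, headers, body = suspended_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, the handler can be passed to `http.server.HTTPServer` bound to a local address. A non-2xx status also means that a simple uptime check looking for a successful response will fail, which is what lets monitoring notice a suspension.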

Finally, we're working on automating much of the mitigation work involved in this incident, and increasing redundancy in our systems to minimize the duration and impact of outages in the future.

If you have additional questions or comments related to this incident, please let us know at render.com/chat or by email at support@render.com.

Posted Sep 16, 2020 - 20:36 UTC

Resolved
All sites are back to normal. We are incredibly sorry for the interruption to your workloads and understand how frustrating this can be. We will email all affected users and publish a post-mortem next week.
Posted Sep 12, 2020 - 02:41 UTC
Update
Except for static sites, all services should be operational.

Most static sites should be operational, but some may not due to caching. We're working on cache eviction.
Posted Sep 12, 2020 - 00:52 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 11, 2020 - 23:54 UTC
Update
All web services are back up. Static sites are coming back up now.
Posted Sep 11, 2020 - 23:21 UTC
Identified
The issue has been identified and a fix is being implemented. Changes to customer services after 2020-09-11 15:10-07:00 may be reverted.
Posted Sep 11, 2020 - 22:49 UTC
Investigating
We are currently investigating customer services which have been suspended.
Posted Sep 11, 2020 - 22:38 UTC
This incident affected: Static Sites, Web Services, Cron Jobs, Background Workers, Render Dashboard, and Render Website.