Downtime for some services in Oregon
Incident Report for Render
Postmortem

Summary

Many services in the Oregon region experienced failures with web requests, builds, deploys, and cron jobs between 22:02 UTC and 22:53 UTC on December 8, 2022. A routine maintenance operation failed, causing downtime for part of Render's control plane in Oregon.

Root Cause

As part of routine maintenance to increase the capacity of Render's Oregon control plane, the compute hosts backing the control plane were replaced with larger instances. The control plane is highly available, so taking individual instances out of the pool does not normally impact service availability.

In this case, the new hosts backing the control plane came up successfully, but the components running on them became unhealthy during startup, largely due to resource constraints. Upgrading the control plane is a manual process. We failed to notice the errors in these components and terminated the older, healthy hosts, which resulted in control plane downtime.

We fixed the issue by increasing the size of the new control plane hosts, which gave the control plane components the resources needed to start up successfully and resume normal service.

Mitigations

  • Control plane maintenance is a manual process that requires diligence on the part of engineers to ensure that everything is healthy as they replace old hosts. Any manual process is prone to human error, and automating it is often the best way to avoid mistakes. In addition to automating the upgrade process, we are going to add critical health checks and alerts to identify failures sooner and halt the process automatically.
  • Our web serving infrastructure is designed to continue functioning even with control plane downtime. However, this incident exposed a previously hidden dependency on control plane DNS components. We will remove the dependency to ensure web serving isn't affected by control plane downtime.
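
The health-gated replacement described in the first mitigation could look roughly like the sketch below. It is illustrative only and is not Render's tooling: `Host`, `provision`, `is_healthy`, `checks_required`, and `UpgradeHalted` are all hypothetical names. The idea is that an old host is terminated only after its replacement passes several consecutive health checks, and the whole process halts on the first failure.

```python
from dataclasses import dataclass
from typing import Callable, List


class UpgradeHalted(Exception):
    """Raised when a replacement host fails its health gate."""


@dataclass
class Host:
    # Hypothetical stand-in for a control plane compute host.
    name: str
    terminated: bool = False


def replace_hosts(
    old_hosts: List[Host],
    provision: Callable[[Host], Host],
    is_healthy: Callable[[Host], bool],
    checks_required: int = 3,
) -> List[Host]:
    """Replace hosts one at a time, halting on the first unhealthy replacement.

    `provision` and `is_healthy` are assumed callbacks supplied by the caller;
    a real system would wrap its own provisioning and health-check logic.
    """
    new_hosts: List[Host] = []
    for old in old_hosts:
        new = provision(old)
        # Require several consecutive passing checks before touching the old host.
        for _ in range(checks_required):
            if not is_healthy(new):
                # Halt here: the old, healthy host is left running.
                raise UpgradeHalted(
                    f"{new.name} is unhealthy; {old.name} left running"
                )
        # Only now is it safe to remove the old host from the pool.
        old.terminated = True
        new_hosts.append(new)
    return new_hosts
```

With a gate like this, a replacement whose components fail during startup stops the rollout while the remaining old hosts are still serving, instead of relying on an engineer to notice the errors. A production version would also wait between checks and raise an alert when it halts.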
Posted Dec 09, 2022 - 00:43 UTC

Resolved
This incident has been resolved. We will be following up with an RCA.
Posted Dec 09, 2022 - 00:05 UTC
Monitoring
We are seeing signs of recovery and are currently monitoring the issue.
Posted Dec 08, 2022 - 22:45 UTC
Update
We are continuing to investigate this issue.
Posted Dec 08, 2022 - 22:32 UTC
Investigating
We are currently seeing degraded performance with builds, deploys, cron jobs, and external database connections for some services in the Oregon region. Some services in the Oregon region are also unavailable. We are currently investigating the issue.
Posted Dec 08, 2022 - 22:29 UTC
This incident affected: Oregon (Web Services, Cron Jobs, Web Services - Free Tier, Builds and Deploys).