Many services in the Oregon region experienced failures with web requests, builds, deploys, and cron jobs between 22:02 UTC and 22:53 UTC on December 8, 2022. A routine maintenance event failed, creating downtime for part of Render's control plane in Oregon.
As part of routine maintenance to increase the capacity of Render's Oregon control plane, the compute hosts backing the control plane were replaced with larger instances. The control plane is highly available, so taking individual instances out of the pool does not impact service availability.
In this case, the new hosts backing the control plane came up successfully, but the components running on them became unhealthy during startup, largely due to resource constraints. Upgrading the control plane is a manual process, and we failed to notice the errors in these components before we terminated the older, healthy hosts. With no healthy hosts left in the pool, the control plane went down.
We fixed the issue by increasing the size of the new control plane hosts, which gave the control plane components the resources needed to start up successfully and resume normal service.
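The failure mode described above, terminating the older hosts before confirming that the components on the new hosts were healthy, can be illustrated with a pre-termination health gate. The following Python is a minimal sketch and not Render's actual tooling; `Host`, `is_healthy`, and `required_healthy` are hypothetical names introduced for illustration, and the health check itself is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Host:
    """Hypothetical stand-in for a control plane host."""
    name: str


def unhealthy_hosts(
    new_hosts: Iterable[Host],
    is_healthy: Callable[[Host], bool],
) -> List[Host]:
    """Return the new hosts whose components fail the supplied health check."""
    return [host for host in new_hosts if not is_healthy(host)]


def safe_to_terminate_old_hosts(
    new_hosts: Iterable[Host],
    is_healthy: Callable[[Host], bool],
    required_healthy: int,
) -> bool:
    """Allow termination of old hosts only if enough new hosts are healthy.

    `required_healthy` is an assumed threshold: the number of healthy new
    hosts needed to carry the control plane on their own.
    """
    healthy_count = sum(1 for host in new_hosts if is_healthy(host))
    return healthy_count >= required_healthy


if __name__ == "__main__":
    new_hosts = [Host("new-1"), Host("new-2"), Host("new-3")]
    # Simulated health results; a real check would query the components.
    status = {"new-1": True, "new-2": False, "new-3": True}

    def check(host: Host) -> bool:
        return status[host.name]

    if safe_to_terminate_old_hosts(new_hosts, check, required_healthy=3):
        print("New hosts healthy; old hosts can be terminated.")
    else:
        failing = [host.name for host in unhealthy_hosts(new_hosts, check)]
        print("Keep old hosts; unhealthy new hosts:", failing)
```

A gate of this kind turns the manual "did anyone notice errors?" step into an explicit check that blocks termination while any required component is still unhealthy.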