Issues with triggering deploys and new resource creation in Oregon

Incident Report for Render

Postmortem

Summary

From 2023-02-27 19:04 to 19:36 UTC, one of our Oregon clusters suffered a cluster-wide outage. Symptoms included failures to build and deploy services, to create new services or data stores, and to autoscale. The outage was caused by a misplaced YAML key in the etcd configuration.

Root Cause

During an upgrade of critical infrastructure, the YAML configuration for etcd, a core component, was invalid. As a result, etcd was unable to start after the upgrade. The failure of this crucial component quickly propagated through the entire cluster.

Mitigations

We're implementing validations that will run (and fail loudly) as part of future upgrades.
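
As an illustration only, a pre-upgrade check of this kind could look like the sketch below. It is not our actual validation tooling. It assumes the configuration has already been parsed into nested dictionaries, and the schema, key names, and function names are hypothetical placeholders rather than real etcd settings.

```python
from __future__ import annotations

from typing import Any

# Hypothetical schema: each allowed key maps to None (a plain value)
# or to a nested schema (a mapping is expected at that key).
EXAMPLE_SCHEMA: dict[str, Any] = {
    "name": None,
    "storage": {"path": None},
    "tls": {"cert": None, "key": None},
}


def find_misplaced_keys(
    config: dict[str, Any], schema: dict[str, Any], path: str = ""
) -> list[str]:
    """Return one message per key that is not allowed where it appears."""
    errors: list[str] = []
    for key, value in config.items():
        location = f"{path}.{key}" if path else key
        if key not in schema:
            # The key may be valid elsewhere, but not at this nesting level.
            errors.append(f"unexpected key at {location}")
            continue
        sub_schema = schema[key]
        if sub_schema is not None:
            if isinstance(value, dict):
                errors.extend(find_misplaced_keys(value, sub_schema, location))
            else:
                errors.append(f"expected a mapping at {location}")
    return errors


def validate_or_fail(config: dict[str, Any], schema: dict[str, Any]) -> None:
    """Fail loudly (raise) before the upgrade proceeds if the config is invalid."""
    errors = find_misplaced_keys(config, schema)
    if errors:
        raise ValueError("invalid configuration: " + "; ".join(errors))
```

In this sketch, a key such as `cert` placed at the top level instead of under `tls` would be reported, and `validate_or_fail` would raise before any upgrade step runs, rather than letting the component fail at startup.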

Posted Mar 13, 2023 - 21:42 UTC

Resolved

We are aware of issues with triggering deploys and new service and datastore creation affecting some users in the Oregon region. We have mitigated the issue and are currently monitoring.

Posted Feb 27, 2023 - 19:30 UTC