From 2023-02-27 19:04 to 19:36 UTC, one of our Oregon clusters suffered a cluster-wide outage, seen in a variety of symptoms: failure to build & deploy services, failure to create new services or data stores, failure to autoscale, etc. The cause of the outage was due to a misplaced YAML key in the configuration for etcd.
During an upgrade of critical infrastructure, we had an invalid YAML configuration for etcd, a core component. As a result, after performing the upgrade, etcd was unable to start up. Failure in this crucial part of our infrastructure quickly propagated through the entire cluster.
We're implementing validations that will run (and fail loudly) as part of future upgrades.