Between 19:00 UTC and 20:30 UTC on 2023-03-30, one of our Oregon clusters experienced an outage that affected builds and deploys, creation of new services and datastores, and both inbound and outbound network connectivity.
After performing routine maintenance on our etcd cluster, we noticed that something was severely wrong as each etcd member started back up. We determined that we were affected by a combination of two problems: a data corruption bug in the etcd version we were running, and poor disk performance that caused the etcd leader to occasionally fall behind its followers. Because etcd was in an inconsistent state, all cluster operations ground to a halt, and it took the team approximately 90 minutes to restore the cluster to a healthy state.