Render uses AWS Elastic Block Store (EBS) as a storage provider in several of our clusters, and we were affected by an issue with the service:
"Between 5:10 AM and 8:40 AM PDT, we experienced a performance degradation for a small number of EBS volumes in a single Availability Zone (euc1-az3) in the EU-CENTRAL-1 Region."
The issue resulted in downtime for some Postgres and Redis services, and degradation in platform behaviors such as autoscaling, builds, deploys, and metrics.
An estimated 3,000 managed Postgres and Redis instances experienced downtime, and likely many more experienced degraded performance.
Additionally, the Render platform in Frankfurt experienced general degradation in areas including, but not limited to, autoscaling, builds, deploys, and metrics.
Render's automated alerting detected ongoing issues in one of our Frankfurt clusters at 12:40 UTC. We responded by opening a public incident. As part of the response, we identified that some of our hosts were experiencing severely degraded EBS performance.
Render uses AWS EBS to power storage volumes for all managed Postgres services in the Frankfurt region. During the outage, EBS volumes became detached from their hosts. Our system automatically attempted to recover by attaching each affected volume to a new host, but these attempts initially failed because the EBS attach/detach API was itself degraded. The attach operations eventually succeeded, and by 13:51 UTC, service was restored to all managed Postgres instances.
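To make the recovery path concrete, the core operation is an EC2 attach call wrapped in a retry loop. The sketch below uses boto3 and is illustrative only, not Render's actual recovery code; the volume ID, instance ID, device name, and retry parameters are hypothetical.

```python
import time

import boto3
from botocore.exceptions import ClientError

# Region and identifiers below are placeholders for illustration.
ec2 = boto3.client("ec2", region_name="eu-central-1")

def attach_with_retry(volume_id: str, instance_id: str, device: str,
                      max_attempts: int = 30, base_delay: float = 5.0) -> None:
    """Attach an EBS volume to a host, retrying while the attach API is degraded."""
    for attempt in range(1, max_attempts + 1):
        try:
            ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
        except ClientError as err:
            # During the incident, attach calls failed because the attach/detach
            # API itself was degraded; back off and try again.
            delay = min(base_delay * attempt, 60.0)
            print(f"attempt {attempt} failed ({err.response['Error']['Code']}); retrying in {delay:.0f}s")
            time.sleep(delay)
            continue
        # Wait until EC2 reports the volume as in-use (attached) before
        # handing it back to the database workload.
        ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
        return
    raise RuntimeError(f"could not attach {volume_id} after {max_attempts} attempts")

# Hypothetical identifiers; in practice these come from the orchestration layer.
attach_with_retry("vol-0123456789abcdef0", "i-0123456789abcdef0", "/dev/xvdf")
```

In normal operation the attach call succeeds on the first try; during the incident, retries could not make progress until the EBS control plane recovered.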
EBS also provides the storage for our etcd cluster, which runs as part of the Kubernetes cluster. Because etcd's consensus algorithm requires every member to write to disk on every proposal, high disk latency for even one member can slow down the entire etcd cluster. Our Kubernetes API server, which regularly transacts with etcd, became slow to process the requests that support behaviors like autoscaling, builds, deploys, and metrics. After identifying the failure mode, at 13:25 UTC we removed the unhealthy etcd member from the cluster, which allowed most of the degraded behaviors to recover.
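For readers unfamiliar with the operation, removing a slow etcd member looks roughly like the sketch below, which shells out to etcdctl. The endpoint list and unhealthy endpoint are placeholders, TLS flags are omitted, and JSON output fields can vary between etcdctl versions, so treat this as an outline of the operation rather than the exact commands we ran.

```python
import json
import subprocess

# Placeholder endpoints; real clusters also need --cacert/--cert/--key flags.
ENDPOINTS = "https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379"

def etcdctl(*args: str) -> str:
    """Run an etcdctl v3 command and return its stdout."""
    cmd = ["etcdctl", "--endpoints", ENDPOINTS, *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# The unhealthy endpoint would be identified from monitoring or
# `etcdctl endpoint health`; this value is a placeholder.
unhealthy_endpoint = "https://etcd-2:2379"

# Map the endpoint back to its member ID. `member remove` expects the ID in hex.
members = json.loads(etcdctl("member", "list", "--write-out", "json"))["members"]
for member in members:
    if unhealthy_endpoint in member.get("clientURLs", []):
        member_id = format(member["ID"], "x")
        # Removing the member stops the rest of the cluster from waiting on its slow disk.
        etcdctl("member", "remove", member_id)
        print(f"removed member {member.get('name', member_id)}")
        break
```

Once the underlying volume is healthy again, the node can be rejoined with `etcdctl member add`.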