Services hosted in Frankfurt are unreachable
Incident Report for Render


Render uses AWS Elastic Block Store (EBS) as a storage provider in several of our clusters, and we were affected by an issue with the service, which AWS described as follows:

Between 5:10 AM and 8:40 AM PDT, we experienced a performance degradation for a small number of EBS volumes in a single Availability Zone (euc1-az3) in the EU-CENTRAL-1 Region.

The issue resulted in downtime for some managed Postgres and Redis services, and degraded platform behaviors such as autoscaling, builds, deploys, and metrics.


An estimated 3,000 managed Postgres and Redis instances experienced downtime, and still more experienced degraded performance.

Additionally, the Render platform in Frankfurt experienced general degradation in areas including, but not limited to, autoscaling, builds, deploys, and metrics.

Root Cause

Render's automated alerting detected ongoing issues in one of our Frankfurt clusters at 12:40 UTC. We responded by opening a public incident. As part of the response, we identified that some of our hosts were experiencing severely degraded EBS performance.

Render uses AWS EBS to power storage volumes for all managed Postgres services in the Frankfurt region. Due to the outage, EBS volumes became detached from hosts. Our system automatically attempted to recover by attaching each affected volume to a new host, but these attempts initially failed because the attach/detach API was also degraded. Eventually these attach operations succeeded; by 13:51 UTC, service was restored to all managed Postgres instances.
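The recovery loop described above can be sketched as a generic retry with exponential backoff. This is a hypothetical illustration, not Render's actual code: `attach_fn`, `AttachUnavailableError`, and the simulated `flaky_attach` below are stand-ins for a real cloud attach call (such as the EBS attach API) that raises errors while the API is degraded.

```python
import time

class AttachUnavailableError(Exception):
    """Raised by the (hypothetical) attach API while it is degraded."""

def reattach_with_backoff(attach_fn, volume_id, host_id,
                          max_attempts=5, base_delay=0.01):
    """Retry a volume attach with exponential backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return attach_fn(volume_id, host_id)
        except AttachUnavailableError:
            if attempt == max_attempts - 1:
                raise  # API still degraded after all attempts
            time.sleep(base_delay * 2 ** attempt)

# Simulated attach API that fails twice, then recovers.
calls = {"n": 0}
def flaky_attach(volume_id, host_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise AttachUnavailableError
    return f"{volume_id} attached to {host_id}"

print(reattach_with_backoff(flaky_attach, "vol-1234", "host-a"))
# → vol-1234 attached to host-a
```

As in the incident, the loop keeps retrying through the degraded window and succeeds once the API recovers.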

EBS also provides the storage for our etcd cluster, which runs as part of the Kubernetes cluster. Because etcd's consensus algorithm requires every member to write to disk on every proposal, high disk latency for just one member can slow down the entire etcd cluster. Our Kubernetes API server, which regularly transacts with etcd, became slow to process the requests that support behaviors like autoscaling, builds, deploys, and metrics. After identifying the failure mode, at 13:25 UTC we removed the unhealthy etcd member from the cluster. This allowed most of the degraded behaviors to recover.
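Why one slow disk can stall the whole cluster can be seen in a toy model of Raft-style commit latency (a simplified sketch, not etcd's implementation): a proposal commits once the leader has fsynced it and a majority of members have acknowledged their own fsyncs, so a member with a slow disk dominates commit latency whenever it is the leader or otherwise on the quorum path.

```python
def commit_latency_ms(member_fsync_ms, leader):
    """Toy model: a proposal commits when a quorum of members
    (including the leader) have fsynced it to disk."""
    quorum = len(member_fsync_ms) // 2 + 1
    # Commit waits for the quorum-th fastest fsync, and can never
    # complete before the leader's own fsync finishes.
    return max(sorted(member_fsync_ms)[quorum - 1], member_fsync_ms[leader])

healthy  = [2.0, 2.0, 2.0]    # all members on healthy disks
degraded = [500.0, 2.0, 2.0]  # member 0 on a degraded EBS volume

print(commit_latency_ms(healthy, leader=0))   # 2.0
print(commit_latency_ms(degraded, leader=0))  # 500.0 when the slow member leads
print(commit_latency_ms(degraded, leader=1))  # 2.0 if quorum can avoid it
```

Real etcd behavior is more involved (follower lag, leader elections, backend commits), but the model shows why removing the member with the degraded volume restored cluster performance.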


Follow-Up Actions

  • We have been in contact with the AWS team to discuss near- and long-term mitigations.
  • We have added alerting on Render's end targeted at identifying when etcd is experiencing poor disk performance. This will help us more quickly identify and remove a slow member from the cluster.
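The new alert could resemble a standard Prometheus rule over etcd's built-in WAL fsync latency histogram (`etcd_disk_wal_fsync_duration_seconds`). This is a hypothetical sketch, not Render's actual configuration; the rule name, threshold, and labels are illustrative.

```yaml
# Hypothetical Prometheus alerting rule: page when any etcd member's
# 99th-percentile WAL fsync latency stays above 100ms for 5 minutes.
groups:
  - name: etcd-disk
    rules:
      - alert: EtcdSlowDisk
        expr: >
          histogram_quantile(0.99,
            rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
          > 0.1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "etcd member {{ $labels.instance }} has slow WAL fsyncs"
```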
Posted Sep 18, 2023 - 23:49 UTC

This incident has been resolved.
Posted Sep 07, 2023 - 13:59 UTC
We've implemented the necessary changes and are seeing progressive recovery for services and databases. We're monitoring.
Posted Sep 07, 2023 - 13:50 UTC
We have identified the degraded component and are working to restore service.
Posted Sep 07, 2023 - 13:23 UTC
We're seeing that services in the Frankfurt region are unreachable; we're investigating.
Posted Sep 07, 2023 - 12:56 UTC
This incident affected: Render Dashboard, Static Sites and Frankfurt (Web Services, Builds and Deploys, PostgreSQL, Redis, Web Services - Free Tier).