On May 6th, 2021, Render's systems in the Frankfurt region experienced a major failure. Between 15:50 UTC and 23:38 UTC, all Frankfurt services except static sites were unavailable. The incident started when an internal TLS certificate expired, and our attempts to replace the certificate led to a series of cascading failures that required us to recreate the entire control plane that manages workloads in the Frankfurt region.
We'd like to reiterate our commitment to all our customers in the Frankfurt region; we are taking this incident and its analysis extremely seriously. We have consequently added several mitigations to the Render platform. These mitigations will prevent us from experiencing similar outages in the future and increase the overall reliability of our platform. The technical details are outlined below.
We would like to extend our deepest apologies to everyone affected by the outage. We have given all affected users a discount for their Render bill for May 2021 and sent out an email with more details. If you were affected by the outage but did not receive this email, please reach out to firstname.lastname@example.org.
From 15:50 UTC to 23:38 UTC, all scheduled builds, deploys, and cron jobs in Render's Frankfurt region failed, and all services and databases were unavailable. In some cases, users were unable to view databases in the Render dashboard.
Static sites were not affected during this time, and there was no loss of data for any services or databases. All Frankfurt services were restored to full functionality following the incident.
All times are in UTC and are on May 6th, 2021 unless noted.
The incident began when an internal root TLS certificate authority expired. We immediately renewed this certificate authority and signed new certificates to be used throughout our system. While these certificates were propagating, several of our internal DNS servers, which are used to route customer requests, crashed. This caused our networking infrastructure to fail, and users were no longer able to access Render services through the public internet.
We attempted to bring back the DNS and networking layer, but could not because the certificate propagation had left our infrastructure in an inconsistent state. We explored several failover strategies, but had limited options since we did not want to risk the data integrity of customer databases and services with disks. To avoid any loss of data, we recreated several infrastructure components that manage workloads in the Frankfurt region. This step took the most time, and we have already addressed several issues to make this kind of recovery faster and more automated.
After the restoration, our DNS and networking layer was able to serve user traffic. By 22:50 UTC, user services began coming back online, and by 23:38 UTC, all services and databases hosted in the Frankfurt region were operational. There was no loss or corruption of customer data.
Reliability is our top priority, and we remain committed to the highest standards of service across all our regions. Even before the incident, multiple Render engineers were working on increasing the resiliency of the platform; the outage highlighted items that we've since prioritized and completed to help us prevent similar incidents in the future:
Automated operational tasks surrounding certificates
Improved isolation between infrastructure components
Improved monitoring and alerting
Improved recovery and failover procedures
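As an illustration of what certificate monitoring and alerting can look like in general, the sketch below classifies a certificate by how much time remains before its expiry timestamp. This is a minimal example and not Render's actual tooling: the function names, the status labels, and the 30-day warning window are all assumptions chosen for illustration. It relies only on Python's standard library, using `ssl.cert_time_to_seconds` to parse the `notAfter` format returned by `ssl.SSLSocket.getpeercert()`.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Hypothetical warning threshold; pick a window that leaves room to rotate safely.
WARN_WINDOW = timedelta(days=30)


def time_remaining(not_after: str, now: datetime) -> timedelta:
    """Return the time left before a certificate's notAfter timestamp.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan  1 00:00:00 2030 GMT". `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return expires - now


def expiry_status(
    not_after: str, now: datetime, warn_window: timedelta = WARN_WINDOW
) -> str:
    """Classify a certificate as "expired", "expiring_soon", or "ok"."""
    remaining = time_remaining(not_after, now)
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= warn_window:
        return "expiring_soon"
    return "ok"
```

A scheduled job could call `expiry_status(not_after, datetime.now(timezone.utc))` for every internal certificate, including root certificate authorities, and raise an alert on anything that is not `"ok"`.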