On September 19, 2021, between 8:42 UTC and 13:30 UTC, all web services, databases, and cron jobs in the Frankfurt region were unavailable due to a platform outage. Private services and background workers experienced intermittently degraded service during this period.
The incident started when a configuration change meant for an internal environment was inadvertently applied to the Frankfurt production environment. This caused a critical set of networking resources to become unavailable, and the failure state required our engineers to recreate these resources manually.
Reliability remains our top priority and we remain committed to the highest standards of service across all Render regions. We are rigorous about incident analysis, and as a result of this incident, we have both planned and completed mitigations to the Render platform to prevent similar outages in the future. These are detailed in the Mitigations section below.
Since uptime is the top business and technical priority at Render, we are giving all affected customers a credit on their Render bill for September 2021. We've sent out an email with details; if you were affected by the outage but did not receive this email, please reach out to firstname.lastname@example.org.
From 8:42 UTC to 13:30 UTC, all web services, databases, and cron jobs in the Frankfurt region were unavailable.
While private services and background workers were largely up, some of them ran into issues connecting to other Render services and external URLs. Some service logs may also be missing from this period.
Static sites, which are global, continued to be available during the incident.
All Frankfurt services were restored to full functionality following the incident.
All times are in UTC on September 19th, 2021.
The incident began on September 19th, 2021 at 8:42 UTC when a configuration change meant for an internal environment was applied to the Frankfurt production environment. This resulted in the destruction of some internal load balancers, which impacted internal service communication as well as external connectivity to web services and databases.
We quickly recreated the missing load balancers, but Render control plane services were still unable to communicate with each other using the new infrastructure. In order to resolve the issue, several pieces of our infrastructure needed to be manually recreated or reconfigured. This was the primary issue that prolonged the incident. We began by repairing our Frankfurt control plane, which was necessary to automate repairs of the rest of our infrastructure in Frankfurt. We then repaired the connectivity of all customer workloads in the region, and as a result customer services started coming back online.
By 13:30 UTC, all customer services, databases, and background workers were back online, and cron jobs were being scheduled normally. We continued to monitor the system to ensure normal operation for all Render and customer services.
The incident highlighted key improvements we need to make to our systems and processes to prevent a prolonged outage. We have already completed some of them and are in the process of finishing others.
[In Progress] Improved failover procedures
[In Progress] Improved monitoring and automation to update Render's status page as soon as possible following an incident