Downtime for EU services and databases

Incident Report for Render

Postmortem

Root Cause Analysis for Render incident - September 19th

Summary

On September 19, 2021, between 8:42 UTC and 13:30 UTC, all web services, databases, and cron jobs in the Frankfurt region were unavailable due to a platform outage. Private services and background workers experienced intermittently degraded service during this period.

The incident started when a configuration change meant for an internal environment was inadvertently applied to the Frankfurt production environment. This caused a critical set of networking resources to become unavailable, and the failure state required our engineers to recreate these resources manually.

Reliability remains our top priority and we remain committed to the highest standards of service across all Render regions. We are rigorous about incident analysis, and as a result of this incident, we have both planned and completed mitigations to the Render platform to prevent similar outages in the future. These are detailed in the Mitigations section below.

Since uptime is the top business and technical priority at Render, we are giving all affected customers a credit on their Render bill for September 2021. We've sent out an email with details; if you were affected by the outage but did not receive this email, please reach out to support@render.com.

Impact

From 8:42 UTC to 13:30 UTC, all web services, databases, and cron jobs in the Frankfurt region were unavailable.

While private services and background workers were largely up, some of them ran into issues connecting to other Render services and external URLs. Some service logs may also be missing from this period.

Static sites, which are global, continued to be available during the incident.

All Frankfurt services were restored to full functionality following the incident.

Timeline

All times are in UTC on September 19th, 2021.

08:42 Configuration meant for an internal environment is inadvertently applied to the production Frankfurt environment, deleting core infrastructure components in the Frankfurt region.
08:47 Incident impact is confirmed and the incident is escalated internally.
09:30 Core infrastructure components are recreated and engineers begin repairing internal systems.
11:05 Render control systems come online. Engineers begin repairing networking for customer services.
12:13 External connectivity re-established to some customer services.
12:55 External connectivity re-established to some databases.
13:10 Majority of Render Frankfurt infrastructure back online. A few customer services and databases still experiencing issues.
13:30 Engineers confirm all customer services and databases are back online and continue to monitor the Frankfurt region.
13:39 Engineers confirm all Render functionality is back online.

Root Cause

The incident began on September 19th, 2021 at 8:42 UTC when a configuration change meant for an internal environment was applied to the Frankfurt production environment. This resulted in the destruction of some internal load balancers, which impacted internal service communication as well as external connectivity to web services and databases.

We quickly recreated the missing load balancers, but Render control plane services were still unable to communicate with each other using the new infrastructure. In order to resolve the issue, several pieces of our infrastructure needed to be manually recreated or reconfigured. This was the primary issue that prolonged the incident. We began by repairing our Frankfurt control plane, which was necessary to automate repairs of the rest of our infrastructure in Frankfurt. We then repaired the connectivity of all customer workloads in the region, and as a result customer services started coming back online.

By 13:30 UTC, all customer services, databases, and background workers were back online, and cron jobs were being scheduled normally. We continued to monitor the system to ensure normal operation for all Render and customer services.

Mitigations

The incident highlighted key improvements we need to make to our systems and processes to prevent a prolonged outage. We have already completed some of them and are in the process of finishing others.

We've implemented full isolation between the various Render environments we run in Frankfurt. It is now impossible for changes meant for non-production environments to affect production.
We've improved redundancy for core infrastructure components across multiple availability zones.
[In Progress] Improved failover procedures
- We are working on automated recreation of certain infrastructure components in the event of failure. This will reduce downtime in future incidents.
[In Progress] Improved monitoring and automation to update Render's status page as soon as possible following an incident.
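
As an illustration of the first mitigation above, here is a minimal sketch of a guard that refuses to apply a change when its intended environment does not match the active one. This is not Render's actual implementation. The names `ConfigChange`, `apply_change`, `EnvironmentMismatchError`, and the environment labels are all hypothetical. A check like this would most likely complement stronger isolation, such as separate credentials per environment, rather than replace it.

```python
from dataclasses import dataclass


class EnvironmentMismatchError(Exception):
    """Raised when a change targets a different environment than the active one."""


@dataclass(frozen=True)
class ConfigChange:
    # Hypothetical record describing a change and the environment it was written for.
    name: str
    target_env: str  # e.g. "internal" or "frankfurt-production" (illustrative labels)


def apply_change(change: ConfigChange, active_env: str) -> str:
    """Apply a change only if its intended environment matches the active one."""
    if change.target_env != active_env:
        raise EnvironmentMismatchError(
            f"{change.name!r} targets {change.target_env!r}, "
            f"but the active environment is {active_env!r}"
        )
    # A real system would call its provisioning tooling here; this sketch
    # only reports what would have been applied.
    return f"applied {change.name} to {active_env}"
```

With this guard, a change tagged for an internal environment raises an error when run against production instead of modifying it.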

Posted Sep 24, 2021 - 21:30 UTC

Resolved

This incident has been resolved. We will conduct a thorough incident review and will post an update soon.

Posted Sep 19, 2021 - 13:39 UTC

Monitoring

All services and databases in the Frankfurt region are back online and we are monitoring for any further issues.

Posted Sep 19, 2021 - 13:30 UTC

Update

We have restored some databases and continue to restore more web services. We continue to work on fixing all databases and all web services in the Frankfurt region.

Posted Sep 19, 2021 - 13:12 UTC

Update

We continue to work on fixing databases and all web services in the Frankfurt region.

Posted Sep 19, 2021 - 12:59 UTC

Update

We have restored some web services in the Frankfurt region. We continue to work on fixing databases and all web services in the Frankfurt region.

Posted Sep 19, 2021 - 12:32 UTC

Update

We continue to work on resolving networking problems in the Frankfurt region.

Posted Sep 19, 2021 - 12:14 UTC

Update

We continue to work on resolving networking problems in the Frankfurt region.

Posted Sep 19, 2021 - 11:44 UTC

Update

We continue to work on resolving networking problems in the Frankfurt region.

Posted Sep 19, 2021 - 11:18 UTC

Update

Users may be unable to connect to their databases and web services in the Frankfurt region. Affected web services may display "Error 1016" or "server IP address could not be found" error messages when visited in the browser.

We continue to work on resolving networking problems in the Frankfurt region.

Posted Sep 19, 2021 - 10:41 UTC

Update

We are continuing to make progress on a fix for this issue.

Posted Sep 19, 2021 - 10:05 UTC

Update

Users may be unable to connect to their databases and web services in the Frankfurt region. We continue to work on resolving the problem.

Posted Sep 19, 2021 - 09:38 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Sep 19, 2021 - 09:14 UTC

Investigating

We are currently investigating this issue.

Posted Sep 19, 2021 - 09:02 UTC