New deploys for dynamic apps failing in Frankfurt

Incident Report for Render

Postmortem

Root Cause Analysis for Frankfurt Outage on 2021-05-06

Summary

On May 6th 2021, Render's systems in the Frankfurt region experienced a major failure. Between 15:50 UTC and 23:38 UTC, all Frankfurt services except static sites were unavailable. The incident started when an internal TLS certificate expired, and our attempts at replacing the certificate led to a series of cascading failures which required us to recreate the entire control plane that manages workloads in the Frankfurt region.

We'd like to reiterate our commitment to all our customers in the Frankfurt region; we are taking this incident and its analysis extremely seriously. We have consequently added several mitigations to the Render platform. These mitigations will prevent us from experiencing similar outages in the future and increase the overall reliability of our platform. The technical details are outlined below.

We would like to extend our deepest apologies to everyone affected by the outage. We have given all affected users a discount for their Render bill for May 2021 and sent out an email with more details. If you were affected by the outage but did not receive this email, please reach out to support@render.com.

Impact

From 15:50 UTC to 23:38 UTC, all scheduled builds, deploys, and cron jobs in Render's Frankfurt region failed, and all services and databases were unavailable. In some cases, users were unable to view databases in the Render dashboard.

Static sites were not affected during this time, and there was no loss of data for any services or databases. All Frankfurt services were restored to full functionality following the incident.

Timeline

All times are in UTC and are on May 6th, 2021 unless noted.

13:05 First customers report failures in builds and deploys in the Frankfurt region.
13:21 Render team posts the incident on status.render.com.
15:10 TLS certificates are repaired but their propagation affects other infrastructure components.
15:50 The outage spreads; all services in the Frankfurt region are unavailable.
16:00 Incident is further escalated internally; engineers continue to investigate failures and attempt resolutions aimed at restoring service without data loss.
21:56 Engineers come up with a successful strategy to restore services without data loss and begin implementing it.
22:10 Engineers apply a partial fix and some services come back online.
22:50 The fix is validated and rolled out across all Frankfurt infrastructure; services start becoming available again.
23:39 Team updates status page to "Monitoring" as all Frankfurt services are available at this point.
May 7th, 00:25 Team updates status page to "Resolved".

Root Cause

The incident began when the certificate of an internal root TLS certificate authority expired. We immediately renewed this certificate authority and signed new certificates to be used throughout our system. While these certificates were propagating, several of our internal DNS servers, which are used to route customer requests, crashed. This caused our networking infrastructure to fail, and users were no longer able to access Render services through the public internet.

We attempted to bring back the DNS and networking layer, but could not because the certificate propagation had left our infrastructure in an inconsistent state. We explored several failover strategies, but had limited options since we did not want to damage the data integrity of customer databases and services with disks. To avoid any loss of data, we recreated several infrastructure components that manage workloads in the Frankfurt region. This step took the most time, and we have already addressed several issues to make such recovery quicker and more automated.

After the restoration, our DNS and networking layer was able to serve user traffic. By 22:50 UTC, user services began coming back online, and by 23:39 UTC, all services and databases hosted in the Frankfurt region were operational. There was no loss or corruption of customer data.

Mitigations

Reliability is our top priority and we remain committed to the highest standards of service across all our regions. Even before the incident, multiple Render engineers were working on increasing the resiliency of the platform; the outage highlighted items that we've prioritized and completed to ensure we can prevent similar incidents in the future:

Automated operational tasks surrounding certificates
- The system can handle certificate renewal and propagation in a programmatic and correct manner.
Improved isolation between infrastructure components
- We've eliminated unnecessary dependencies within our infrastructure and made core components more resilient against cascading failures.
Improved monitoring and alerting
- We've improved the observability of the system so that similar issues are detected well in advance and pass through several checkpoints before they can affect production users.
Improved recovery and failover procedures
- While the mitigations we have taken will prevent such an incident from recurring, we've also operationalized recovery from several different types of failures. This will drastically improve our time to recovery for a broad range of incidents.
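
To illustrate the kind of early warning described under "Improved monitoring and alerting", here is a minimal Python sketch of a certificate expiry check. It is not Render's actual tooling: the function names, the 30-day and 7-day thresholds, and the example date are all assumptions chosen for illustration. It relies only on the standard library's ssl.cert_time_to_seconds, which parses the notAfter string format returned by ssl.SSLSocket.getpeercert().

```python
import ssl
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days left before a certificate expires.

    `not_after` is a notAfter string in the format used by
    ssl.SSLSocket.getpeercert(), e.g. "Jan  1 00:00:00 2030 GMT".
    `now` must be a timezone-aware datetime.
    """
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - now) / timedelta(days=1)


def alert_level(days_left: float, warn_days: float = 30, critical_days: float = 7) -> str:
    """Map remaining validity to a hypothetical alert level.

    The thresholds are illustrative assumptions, not values from the report.
    """
    if days_left <= 0:
        return "expired"
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical expiry date, checked against the current time.
    remaining = days_until_expiry(
        "Jan  1 00:00:00 2030 GMT", datetime.now(timezone.utc)
    )
    print(f"{remaining:.1f} days left: {alert_level(remaining)}")
```

A check like this could run on a schedule against every internal certificate, including root certificate authorities, so that an approaching expiry raises a warning weeks ahead rather than surfacing as failed deploys.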

Posted May 27, 2021 - 17:42 UTC

Resolved

The incident has been resolved and all EU services are back online. If you are noticing any issues in your services, please feel free to contact us directly at support@render.com. To all users who were affected, we are incredibly sorry. We will share a root cause analysis when we have finished conducting our internal investigation.

Posted May 07, 2021 - 00:25 UTC

Monitoring

We have fixed the underlying problem and services are coming back up. We expect a full restoration shortly, and the team will continue to actively monitor the situation.

Posted May 06, 2021 - 23:39 UTC

Update

We have applied some fixes that have brought a few EU services back online; however, the incident is still ongoing, and we are still working on safely bringing the rest of the EU infrastructure back up. We will provide more updates as soon as they are available.

Posted May 06, 2021 - 23:09 UTC

Update

The incident is still ongoing. Engineers are still working on resolving the issue, which is still affecting our EU infrastructure. The team is fully dedicated to getting EU services running and will continue to provide updates as they come.

Posted May 06, 2021 - 22:00 UTC

Update

Posted May 06, 2021 - 21:09 UTC

Update

The incident is still ongoing. Engineers are making progress restoring some EU control plane functionality and are now working on resolving networking issues. We will provide an ETA for a fix as soon as possible.

Posted May 06, 2021 - 19:57 UTC

Update

We are continuing to work on a fix for this issue.

Posted May 06, 2021 - 18:56 UTC

Update

We are continuing to work on a fix for this issue.

Posted May 06, 2021 - 18:54 UTC

Update

Engineers are continuing to work on restoring control plane functionality. Updates will be posted as we continue to make progress.

Posted May 06, 2021 - 18:30 UTC

Update

Incident is ongoing. The issue was triggered by certificates expiring and has led to a wider outage in our Frankfurt control plane. We are working on restoring control plane functionality and will have an ETA soon.

Posted May 06, 2021 - 17:45 UTC

Update

The incident is ongoing. The EU region is experiencing a major outage and users are unable to connect to services. We are working on an ETA; updates will be posted as we make progress.

Posted May 06, 2021 - 16:50 UTC

Update

Services and databases in EU are also experiencing intermittent failures.

Posted May 06, 2021 - 15:43 UTC

Update

Builds and deploys are still failing, and we are seeing some issues with services not responding. Engineers are continuing to work on resolving the issues.

Posted May 06, 2021 - 15:40 UTC

Update

Some existing web services in Frankfurt are returning timeouts. We continue to investigate.

Posted May 06, 2021 - 15:09 UTC

Identified

New deploys for dynamic apps are failing in Frankfurt. We have identified the issue and are working to resolve it.

Posted May 06, 2021 - 13:21 UTC

This incident affected: Render Dashboard.