Frankfurt builds, deploys, databases not working

Incident Report for Render

Postmortem

Root Cause Analysis for Render incident - Jan 11th 2022

Summary

Builds, deploys, cron jobs, and service and database routing were all unavailable in the Frankfurt region due to a failure in our control plane for this region. Builds, deploys and cron jobs were unavailable between 21:38 UTC Jan 11th and 01:04 UTC Jan 12th. Web service, private service and database routing were unavailable between 22:09 UTC Jan 11th and 00:29 UTC Jan 12th. Background workers were not affected during this period, and databases had no data loss.

We apologize to everyone impacted by the outage. We are currently building out multiple mitigations to the platform following the incident.

Impact

New builds, deploys, and cron job executions were unavailable between 21:38 UTC Jan 11th and 01:04 UTC Jan 12th. Web service, private service and database routing were unavailable between 22:09 UTC Jan 11th and 00:29 UTC Jan 12th. Background workers were not affected during this period.
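
For readers who want the length of each impact window, the times above can be turned into durations. The snippet below is a minimal sketch that does this arithmetic. The function name `outage_minutes` and the variable names are illustrative and not part of any Render tooling.

```python
from datetime import datetime, timezone


def outage_minutes(start: str, end: str) -> int:
    """Return the whole minutes between two 'YYYY-MM-DD HH:MM' UTC timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    started = datetime.strptime(start, fmt).replace(tzinfo=timezone.utc)
    ended = datetime.strptime(end, fmt).replace(tzinfo=timezone.utc)
    return int((ended - started).total_seconds() // 60)


# Windows taken from the Impact section of this report.
builds_minutes = outage_minutes("2022-01-11 21:38", "2022-01-12 01:04")
routing_minutes = outage_minutes("2022-01-11 22:09", "2022-01-12 00:29")

print(f"Builds, deploys, cron jobs: {builds_minutes // 60}h {builds_minutes % 60}m")
print(f"Service and database routing: {routing_minutes // 60}h {routing_minutes % 60}m")
```

Under these timestamps, builds, deploys and cron jobs were unavailable for 3 hours 26 minutes, and service and database routing for 2 hours 20 minutes.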

Root Cause

At 21:25 UTC, engineers were paged due to the failure of one of the components in the Frankfurt control plane. We immediately began investigating and identified that the component was exhausting all resources available to it. The issue quickly cascaded to the redundant instances of the control plane. At this time, the control plane was unable to handle requests, resulting in a customer-facing impact of failed builds, deploys, and cron jobs.

We have built the components responsible for handling requests to web services so that they continue to function during control plane outages. However, an attempted mitigation for the resource exhaustion required a restart of the control plane components, which were also running the services responsible for handling DNS lookups for customer web services. The mitigation did not immediately fix the issue, so the DNS services were unable to restart after the machines were brought back online. At this point, requests to web services and databases began failing.

The issue was resolved once all control plane instances were relaunched on larger hardware.

Mitigations

Planned

Re-architect the control plane so that core services are well-isolated and right-sized. In particular, we will be running the distributed data store and internal DNS on dedicated hardware.
Alert earlier on approaching resource limits for control plane services.
Model and load test our control plane components to accurately size and autoscale the services.
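
To illustrate the second item, alerting before a limit is reached can combine a static utilization threshold with a projection of when the limit will be hit at the current growth rate. The following is a minimal sketch of that idea only. It does not describe Render's monitoring stack: the `Alert` type, the `check_resource` function, the thresholds, and the 30-minute horizon are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Alert:
    level: str    # "warn" or "page" (hypothetical severity names)
    message: str


def check_resource(
    samples: Sequence[Tuple[float, float]],  # (unix_seconds, usage), oldest first
    limit: float,
    warn_ratio: float = 0.7,    # assumed warning threshold
    page_ratio: float = 0.9,    # assumed paging threshold
    horizon_s: float = 1800.0,  # assumed look-ahead window: 30 minutes
) -> Optional[Alert]:
    """Return an alert if usage is near the limit or projected to reach it soon."""
    if not samples:
        return None
    t_last, used = samples[-1]
    ratio = used / limit
    if ratio >= page_ratio:
        return Alert("page", f"usage at {ratio:.0%} of limit")

    # Linear projection from the oldest to the newest sample.
    t_first, used_first = samples[0]
    elapsed = t_last - t_first
    if elapsed > 0:
        rate = (used - used_first) / elapsed
        if rate > 0:
            seconds_left = (limit - used) / rate
            if seconds_left <= horizon_s:
                return Alert("warn", f"limit projected in {seconds_left:.0f}s")

    if ratio >= warn_ratio:
        return Alert("warn", f"usage at {ratio:.0%} of limit")
    return None


# Usage rising from 50 to 65 of 100 units in 10 minutes: below the static
# warning threshold, but projected to hit the limit inside the horizon.
alert = check_resource([(0.0, 50.0), (600.0, 65.0)], limit=100.0)
print(alert)
```

The projection is what provides the earlier signal here: a component that is still below the static threshold but growing quickly raises a warning before it exhausts its resources. A linear fit over two points is deliberately simple, and a real system would likely smooth over more samples.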

Posted Jan 14, 2022 - 23:12 UTC

Resolved

This incident has been resolved.

Posted Jan 12, 2022 - 01:52 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 12, 2022 - 00:55 UTC

Identified

We have identified the issue affecting our Frankfurt services and are working on a fix together with the underlying service provider.

Posted Jan 12, 2022 - 00:20 UTC

Update

We are continuing to investigate this issue.

Posted Jan 11, 2022 - 23:22 UTC

Update

We are continuing to investigate this issue. Some Frankfurt databases are also experiencing connectivity issues.

Posted Jan 11, 2022 - 22:31 UTC

Investigating

We are currently investigating this issue.

Posted Jan 11, 2022 - 21:57 UTC