Builds, deploys, cron jobs, and service and database routing were all unavailable in the Frankfurt region due to a failure in that region's control plane. Builds, deploys, and cron jobs were unavailable from 21:38 UTC on Jan 11th to 01:04 UTC on Jan 12th. Web service, private service, and database routing was unavailable from 22:09 UTC on Jan 11th to 00:29 UTC on Jan 12th. Background workers were not affected during this period, and no database data was lost.
We apologize to everyone impacted by the outage. We are currently building out multiple mitigations to the platform following the incident.
Creating new builds and deploys and executing cron jobs were unavailable from 21:38 UTC on Jan 11th to 01:04 UTC on Jan 12th. Web service, private service, and database routing was unavailable from 22:09 UTC on Jan 11th to 00:29 UTC on Jan 12th. Background workers were not affected during this period.
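For readers who want the length of each impact window, the following is a minimal sketch that computes it from the timestamps above. The year is a placeholder chosen only so the timestamps can be constructed (the report does not state one), and the helper names `utc` and `window_length` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Placeholder year: the report does not state one, and the durations
# computed below do not depend on it.
ASSUMED_YEAR = 2001


def utc(day: int, hour: int, minute: int) -> datetime:
    """Build a UTC timestamp in January of the assumed year."""
    return datetime(ASSUMED_YEAR, 1, day, hour, minute, tzinfo=timezone.utc)


def window_length(start: datetime, end: datetime) -> timedelta:
    """Return the length of an impact window."""
    return end - start


# Builds, deploys, and cron jobs: 21:38 UTC Jan 11th to 01:04 UTC Jan 12th.
builds_deploys_cron = window_length(utc(11, 21, 38), utc(12, 1, 4))

# Web service, private service, and database routing:
# 22:09 UTC Jan 11th to 00:29 UTC Jan 12th.
routing = window_length(utc(11, 22, 9), utc(12, 0, 29))

print(builds_deploys_cron)  # 3:26:00
print(routing)              # 2:20:00
```

By these timestamps, builds, deploys, and cron jobs were unavailable for 3 hours 26 minutes, and routing was unavailable for 2 hours 20 minutes.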
At 21:25 UTC, engineers were paged because one of the components in the Frankfurt control plane had failed. We immediately began investigating and identified that the component was exhausting all resources available to it. The issue quickly cascaded to the redundant instances of the control plane. At this point, the control plane was unable to handle requests, resulting in customer-facing failures of builds, deploys, and cron jobs.
The components responsible for handling requests to web services are built to continue functioning during control plane outages. However, an attempted mitigation for the resource exhaustion required a restart of the control plane components, which were also running the services responsible for handling DNS lookups for customer web services. The mitigation did not immediately fix the issue, so the DNS services were unable to restart after the machines were brought back online. At this point, requests to web services and databases began failing.
The issue was resolved once all control plane instances were relaunched on larger hardware.