Reports of request failures in Ohio
Incident Report for Render
Postmortem

Summary

On March 1, 2023, we upgraded critical infrastructure in the Ohio region, which overwrote the region's DNS server settings. As a result, the DNS servers were unable to serve requests, and DNS lookups failed for all services in the region. We updated the affected servers 18 minutes later, and service was restored. The impact lasted from 17:50 to 18:08 UTC.

Root Cause

The upgrade of critical infrastructure in Ohio reverted the resources allocated to the DNS servers in that region to a lower level. When the DNS servers restarted, the combination of increased load from an empty cache and the reduced resources caused them to enter a crash loop. Many of our internal services rely on DNS to function, so a cascade of failures in other components followed.

We increased the resources available to the DNS servers, which allowed them to come back online. Service was restored in all other components shortly afterwards.

Mitigations

  • We have changed the steps involved in updating the related components to avoid overwriting the DNS server settings.
  • We have updated our upgrade process to include provisioning a separate set of backup DNS servers, so that DNS stays available if the primary servers fail during an upgrade. The updated process has since been used successfully in a subsequent upgrade.
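
The first mitigation could, for example, be backed by an automated check that compares the live DNS server settings with those an upgrade would apply, and flags any reduction before the upgrade runs. The following is a minimal sketch of that idea, not a description of our actual tooling; the function name, setting names, and values are all hypothetical.

```python
from typing import Dict, List


def find_resource_regressions(
    current: Dict[str, float], proposed: Dict[str, float]
) -> List[str]:
    """Return one message per setting that the upgrade would lower or drop."""
    problems = []
    for name, current_value in sorted(current.items()):
        if name not in proposed:
            problems.append(f"{name}: missing from upgrade (currently {current_value})")
        elif proposed[name] < current_value:
            problems.append(f"{name}: {current_value} -> {proposed[name]}")
    return problems


# Hypothetical values: tuned live settings versus the defaults an upgrade would apply.
live_dns_resources = {"cpu_cores": 2.0, "memory_mib": 1024.0}
upgrade_defaults = {"cpu_cores": 0.5, "memory_mib": 256.0}

regressions = find_resource_regressions(live_dns_resources, upgrade_defaults)
if regressions:
    print("Upgrade would reduce DNS server resources:")
    for line in regressions:
        print("  " + line)
```

A check of this kind would only catch settings that are lowered or removed; it would not tell whether the remaining resources are enough to handle a cold cache after a restart.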
Posted Mar 13, 2023 - 21:40 UTC

Resolved
This incident has been resolved.
Posted Mar 01, 2023 - 18:16 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 01, 2023 - 18:12 UTC
Investigating
We are investigating reports of request failures.
Posted Mar 01, 2023 - 18:05 UTC
This incident affected: Ohio (Web Services).