Websites and APIs on Render are unavailable due to Cloudflare network errors
Incident Report for Render
Postmortem

Summary

Render uses Cloudflare for DDoS protection on all web services and as a CDN for static sites. A configuration change made by Cloudflare triggered a bug in Cloudflare's systems, causing all Render web services and static sites to respond with HTTP 500 errors between 2023-08-11 4:00 UTC and 2023-08-11 5:26 UTC. Service was restored after Cloudflare reverted the change.

Impact

Between 4:00 UTC and 5:26 UTC on 2023-08-11, all HTTP traffic to web services and static sites hosted on Render received a Cloudflare Internal Server Error (HTTP status code 500). Traffic to render.com and dashboard.render.com was also impacted.

Root Cause

Render uses Cloudflare for DDoS protection for all HTTP services, and as a CDN for static sites. All requests to Render are therefore proxied through the closest Cloudflare data center. At 2023-08-11 4:00 UTC, Cloudflare added a rule to place an interstitial page in front of a specific Render-hosted website in response to a verified report of phishing. The application of this rule triggered a bug in the Cloudflare request processing engine, which resulted in all Render services, our dashboard, and render.com returning HTTP 500 errors.

Render's automated alerting brought the issue to our attention at 2023-08-11 4:01 UTC. We escalated the issue to Cloudflare and opened a public incident at 4:05 UTC. Given the severity and breadth of the outage, we coordinated with several members of the Cloudflare team while investigating options to restore service as quickly as possible. One option was to temporarily bypass Cloudflare's network and serve requests directly from Render's origin servers. We deployed a working solution for dashboard.render.com at 5:03 UTC; it was manual and specific to the dashboard.render.com subdomain. We also began investigating ways to roll this approach out more broadly, which would have required customers to update their DNS records. Cloudflare identified the issue and rolled back the triggering rule at 5:26 UTC.
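A bypass was worth exploring because the errors came from the proxy layer while the origin servers were still able to serve requests. The sketch below illustrates that distinction by classifying a pair of probe results. It is a hypothetical illustration, not Render's tooling: the function name, the labels, and the assumption that an origin can be probed directly are all invented for this example.

```python
def classify_failure(edge_status: int, origin_status: int) -> str:
    """Classify where a failure sits.

    edge_status:   HTTP status observed when requesting through the proxy.
    origin_status: HTTP status observed when probing the origin directly
                   (assumes such a direct probe path exists).
    """
    if origin_status >= 500:
        # The origin itself is failing; bypassing the proxy would not help.
        return "origin"
    if edge_status >= 500:
        # Origin is healthy but the proxied path fails: the pattern
        # described in this incident.
        return "proxy"
    return "healthy"


if __name__ == "__main__":
    print(classify_failure(edge_status=500, origin_status=200))
```

Only the "proxy" case suggests that routing around the proxy could restore service; an "origin" result points to a problem that a bypass would not fix.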

Mitigations

Completed

  • Over the last week, Render's engineering team worked closely with Cloudflare support and engineering teams to discuss near-term mitigations and longer-term solutions.
  • Cloudflare rolled back the original change and added validation to prevent this bug in the future.

Planned

  • Render will automate and test mechanisms to temporarily route HTTP requests directly to Render in similarly extreme cases.
  • Render will investigate provider redundancy options to enable faster failure recovery.
  • In their RCA, Cloudflare shared additional recommended mitigations for their systems. We will continue to work together to put these in place.
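
As a rough illustration of what an automated direct-routing mechanism might need to decide, the sketch below switches to a bypass only after several consecutive proxy-layer failures and switches back only after several consecutive healthy probes. This is a minimal sketch under stated assumptions, not Render's planned implementation: the class name, thresholds, and the boolean probe input are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BypassController:
    """Hypothetical hysteresis switch for routing around a failing proxy."""

    trip_after: int = 3     # consecutive proxy-layer failures before bypassing
    restore_after: int = 5  # consecutive healthy probes before restoring
    bypassed: bool = False
    _failures: int = 0
    _successes: int = 0

    def observe(self, proxy_layer_failed: bool) -> bool:
        """Record one probe result and return whether bypass is active."""
        if proxy_layer_failed:
            self._failures += 1
            self._successes = 0
            if not self.bypassed and self._failures >= self.trip_after:
                self.bypassed = True
        else:
            self._successes += 1
            self._failures = 0
            if self.bypassed and self._successes >= self.restore_after:
                self.bypassed = False
        return self.bypassed


if __name__ == "__main__":
    controller = BypassController(trip_after=2, restore_after=2)
    for failed in (True, True, False, False):
        print(failed, controller.observe(failed))
```

Requiring consecutive results in both directions avoids flapping on a single bad probe. For the restore path to work, probes through the proxy would have to continue while the bypass is active.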
Posted Aug 18, 2023 - 23:01 UTC

Resolved
Service has been fully restored.
Posted Aug 11, 2023 - 05:49 UTC
Monitoring
Cloudflare has implemented a fix and we're monitoring.
Posted Aug 11, 2023 - 05:37 UTC
Identified
We are working with the Cloudflare team to determine a fix. We're separately also looking into ways you can bypass Cloudflare for your Render services.
Posted Aug 11, 2023 - 05:12 UTC
Update
We're working with the Cloudflare team to restore access. You can also follow https://www.cloudflarestatus.com for updates.
Posted Aug 11, 2023 - 04:33 UTC
Update
We are continuing to investigate this issue.
Posted Aug 11, 2023 - 04:07 UTC
Investigating
We are in touch with the Cloudflare team and investigating.
Posted Aug 11, 2023 - 04:06 UTC
This incident affected: Render Dashboard, Render Website, Render API, and Static Sites.