Elevated 502 errors in Oregon
Incident Report for Render
Postmortem

Render RCA - 2023-12-11

Summary

Starting at 2023-12-11 20:33 UTC, a Render customer in the Oregon region was subjected to a Distributed Denial of Service (DDoS) botnet attack. Render uses Cloudflare for DDoS protection, and Cloudflare immediately began blocking some, but not all, of the malicious traffic. The remaining traffic overloaded some components of our infrastructure. The attack ended 15 minutes later, and within 8 minutes of that all of our systems had self-recovered and resumed serving traffic as normal.

Impact

This outage impacted infrastructure that serves roughly a quarter of our customers in the Oregon region. The web services for these customers were unable to receive traffic, and requests to them returned 5xx HTTP errors. The total degraded service window was 23 minutes (2023-12-11 20:33 UTC to 2023-12-11 20:56 UTC), and the acute period, when no traffic was being served, lasted 11 minutes (2023-12-11 20:39 UTC to 2023-12-11 20:50 UTC).
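
As a cross-check of the windows quoted above, the durations can be recomputed from the stated timestamps. This is a minimal sketch, not part of Render's tooling; the helper names `utc` and `minutes_between` are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on 2023-12-11 as a timezone-aware UTC datetime."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2023, 12, 11, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM UTC times on the incident day."""
    return int((utc(end) - utc(start)).total_seconds() // 60)


# Windows as stated in the Impact section.
degraded_minutes = minutes_between("20:33", "20:56")
acute_minutes = minutes_between("20:39", "20:50")

# The Summary gives a 15-minute attack followed by up to 8 minutes of recovery.
summary_minutes = 15 + 8

print(degraded_minutes, acute_minutes, summary_minutes)
```

The degraded window computed from the timestamps matches the sum of the attack and recovery durations given in the Summary.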

Root Cause

Our Cloudflare configuration did not fully mitigate a DDoS attack against one of our customers. Additionally, parts of our infrastructure were not able to handle the dramatically increased load.

Mitigations

In the short term, we leveraged Cloudflare's filtering tool to guard against the resumption of the same attack.

We are also working to improve our system's responsiveness to this class of problem generally. Namely, we will be:

  • Working with Cloudflare to better leverage their security tools so that our combined systems can respond more adaptively to attacks of this kind.
  • Improving our infrastructure to be more resilient to these scenarios.
Posted Dec 19, 2023 - 00:42 UTC

Resolved
This incident has been resolved.
Posted Dec 11, 2023 - 21:33 UTC
Monitoring
Services in Oregon were partially degraded from 20:36 to 20:54 UTC. We’ve identified the cause of the issue and have taken steps to mitigate it.
Posted Dec 11, 2023 - 21:15 UTC
Investigating
We're investigating reports that services in Oregon are serving 502 Bad Gateway errors.
Posted Dec 11, 2023 - 20:55 UTC
This incident affected: Oregon (Web Services, Cron Jobs, Background Workers, PostgreSQL, Redis, Web Services - Free Tier, Builds and Deploys, Autoscaling, Metrics/Logs).