Increased 404s in Oregon (Web Services) and Static Sites

Incident Report for Render

Postmortem

Summary

As an infrastructure provider, we have no higher obligation than delivering a reliable platform that lets our customers build and scale their applications with confidence. We invest heavily in keeping our platform reliable and secure, including in the routing layer that handles billions of HTTP requests every day.

On November 5, 2025, we inadvertently rolled back a performance improvement that was gated behind a feature flag. This led to disruption in the form of intermittent 404s for some web services and static sites deployed to the Oregon region.

We have fully identified the sequence of events that led to this outage and are taking steps to prevent it from recurring.

Impact

There were two periods during which some customers hosting web services and static sites in the Oregon region experienced a partial outage with intermittent 404s. The first period lasted from 10:39 AM to 11:25 AM PST. During this period, two Render clusters had slightly degraded service: one cluster returned a negligible number of 404 responses, and the other returned 404 responses for approximately 10% of requests.

The second period lasted from 11:59 AM to 12:34 PM PST and saw more significant service degradation. During this period, about 50% of all requests to services in the affected cluster received a 404 response.

All newly created services in these clusters were affected and received 404 responses during the incident. Updates to existing services were also slow to propagate. Free tier services that were recently deployed or waking from sleep were also affected.

Root Cause

Render's routing service depends on a metadata service to receive information about the user services it routes traffic to. When the routing service first starts and upon occasional reconnection, it will request and receive a large volume of data from the metadata service.
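As a rough illustration of this startup behavior, the sketch below models a routing instance that applies a full snapshot from the metadata service and then answers lookups from it. All names here (`RoutingTable`, `apply_snapshot`) are hypothetical, not Render's actual code.

```python
# Hypothetical sketch of the startup/reconnect sync described above.
# Names are illustrative; this is not Render's implementation.

class RoutingTable:
    def __init__(self):
        self.routes = {}     # hostname -> backend service
        self.loaded = False  # True once an initial snapshot has been applied

    def apply_snapshot(self, snapshot):
        # On startup or reconnection, the full route set is transferred at
        # once, which is why this step can move a large volume of data.
        self.routes = dict(snapshot)
        self.loaded = True

    def lookup(self, hostname):
        # A hostname missing from the table produces a 404 at the routing layer.
        return self.routes.get(hostname)
```

In this model, an instance that never receives a snapshot has an empty table, so every lookup misses, which is the failure mode described below.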

Earlier in 2025, we successfully deployed a memory optimization related to data transfer between the metadata and routing services using a feature flag. In late October, we removed the flag from code and redeployed, but we didn't redeploy the metadata service, which still depended on the flag.

On November 5th, we cleaned up unreferenced feature flags from our system. This caused the metadata service to revert to its less efficient data transfer method, leading to memory exhaustion and crashes.
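The failure mode here is a general one with feature-flag systems: when a flag definition is deleted, callers that still evaluate it typically fall back to a default value. The sketch below illustrates that pattern with hypothetical names (`FlagClient`, `efficient-metadata-transfer`); it is an assumption-laden analogy, not Render's flag system.

```python
# Illustrative sketch (not Render's code) of how deleting a flag definition
# can silently revert behavior: unknown flags evaluate to a default.

class FlagClient:
    def __init__(self, flags):
        self.flags = flags

    def is_enabled(self, name, default=False):
        # A flag deleted from the flag system is simply absent, so the
        # caller gets the default -- here, the old code path.
        return self.flags.get(name, default)

def transfer_mode(flags):
    # Per the report, the metadata service still consulted the flag
    # after it had been removed from the other service's code.
    if flags.is_enabled("efficient-metadata-transfer"):
        return "optimized"  # memory-efficient transfer path
    return "legacy"         # older path that exhausted memory under load

flags_before = FlagClient({"efficient-metadata-transfer": True})
flags_after = FlagClient({})  # flag cleaned up during routine hygiene
```

With the flag present, `transfer_mode` returns the optimized path; once the flag is deleted, the same call silently returns the legacy path, with no error raised anywhere.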

Our routing service is designed to handle metadata service outages and continue serving traffic based on its last known state. However, newly created routing instances that could not load their initial state were nevertheless sent requests, and they responded with 404 errors.

During the first period of impact, the metadata service was crashing in two of our clusters, and only a small fraction of routing service instances were impacted.

During the second period of impact, we saw a large increase in HTTP requests for services in the affected cluster. This triggered scale-ups of the routing service, and every newly created instance returned 404 errors.

Mitigations

Completed

  • Increased memory available to the metadata service (this has since been reverted)
  • Temporarily re-enabled the feature flag to support more efficient data transfer between the routing and metadata services (this has since been removed)
  • Redeployed the metadata service so it no longer relies on the feature flag
  • Enhanced our monitoring of the metadata service to alert us of this particular failure mode

Planned

  • Improve our feature flag hygiene practice to prevent the removal of a feature flag while it is still being evaluated
  • Prevent the routing service from receiving traffic if it never successfully loaded state from the metadata service
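One way the second planned safeguard is commonly implemented is a readiness gate: an instance reports itself unready until it has loaded initial state, so the load balancer never sends it traffic. The sketch below is a minimal model of that idea, using hypothetical names; Render's actual mechanism may differ.

```python
# Hypothetical readiness gate matching the planned mitigation: an instance
# that never loaded state should not receive traffic.

class RoutingInstance:
    def __init__(self):
        self.state_loaded = False

    def load_initial_state(self, metadata_available):
        # Mark the instance ready only once a snapshot was actually received.
        if metadata_available:
            self.state_loaded = True
        return self.state_loaded

    def ready(self):
        # A readiness probe (e.g. a Kubernetes readinessProbe) would call
        # this; returning False keeps traffic away instead of serving 404s.
        return self.state_loaded
```

Under this scheme, an instance started while the metadata service is down simply never enters the serving pool, rather than answering requests with 404s.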
Posted Nov 18, 2025 - 20:02 UTC

Resolved

This incident has been resolved.
Posted Nov 05, 2025 - 21:52 UTC

Update

We are continuing to monitor for any further issues.
Posted Nov 05, 2025 - 21:03 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Nov 05, 2025 - 20:48 UTC

Update

We are continuing to work on a fix for this issue.
Posted Nov 05, 2025 - 20:21 UTC

Identified

We have identified continuing issues in Oregon. A fix is being worked on.
Posted Nov 05, 2025 - 20:08 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Nov 05, 2025 - 19:58 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Nov 05, 2025 - 19:24 UTC

Investigating

We are currently investigating the issue.
Posted Nov 05, 2025 - 19:19 UTC
This incident affected: Oregon (Web Services) and Static Sites.