Beginning at 10:03 PST on December 3, 2024, Render's routing service was unable to reach newly deployed user services, resulting in 404 errors for end users. Some routing service instances also restarted automatically, which abruptly terminated HTTP connections and reduced capacity for all web traffic. The root cause was expiring TLS certificates on internal Render components, which created inconsistent internal state for Render's routing service.
Beginning at 10:24 PST, the affected certificates were refreshed and the routing service was restarted. The routing service was fully recovered by 10:37 PST.
Impact 1. Starting at 10:03 PST, many services that were deployed during this period experienced full downtime. Clients of those services received 404 errors with the header no-server.
Impact 2. Starting at 10:08 PST, the routing service started abruptly terminating connections, but was otherwise able to continue serving traffic normally.
Recovery. By 10:37 PST, all routing service instances were reconnected to the metadata service and full service was restored.
The Render HTTP routing service uses an in-memory metadata cache to route traffic to user services. It relies on the Render metadata service for updates to this cache when changes are made to user services.
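The relationship between the cache and routing can be illustrated with a minimal sketch. The names (`RouteCache`, `apply_update`, `lookup`), the hostnames, and the addresses are hypothetical and are not Render's actual implementation. The sketch only shows why a service that never made it into the cache cannot be routed, even if the service itself is healthy.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RouteCache:
    """Hypothetical in-memory map from hostname to backend address."""
    backends: Dict[str, str] = field(default_factory=dict)

    def apply_update(self, hostname: str, backend: Optional[str]) -> None:
        # An update from the metadata service either sets or removes a route.
        if backend is None:
            self.backends.pop(hostname, None)
        else:
            self.backends[hostname] = backend

    def lookup(self, hostname: str) -> Optional[str]:
        # Lookups are served purely from memory. A miss (None) is the case
        # a router would surface to the client as a 404.
        return self.backends.get(hostname)


cache = RouteCache()
cache.apply_update("app.example.com", "10.0.0.5:8080")
known = cache.lookup("app.example.com")
unknown = cache.lookup("new.example.com")  # update never arrived
```

In this model, services already in the cache keep working while the metadata service is unreachable, and only services whose updates were missed are affected.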
This incident was triggered when certificates for this metadata service expired. The certificates were previously refreshed on restarts, but as the metadata service has stabilized, we have been redeploying it less frequently. As a result, the certificates reached their expiry without a restart to refresh them.
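The timing problem can be made concrete with a small sketch. The certificate lifetime and restart intervals below are illustrative assumptions only, since the report does not state the real values, and `expires_before_restart` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def expires_before_restart(issued_at: datetime, lifetime: timedelta,
                           next_restart: datetime) -> bool:
    """True if a certificate issued at the last restart lapses before the next one."""
    return issued_at + lifetime < next_restart


# Illustrative values only; the real lifetime and deploy cadence are not stated.
last_restart = datetime(2024, 1, 1, tzinfo=timezone.utc)
lifetime = timedelta(days=90)

# Frequent redeploys refresh the certificate well before it lapses.
frequent = expires_before_restart(
    last_restart, lifetime, last_restart + timedelta(days=14))

# Once the gap between restarts exceeds the lifetime, the certificate expires.
infrequent = expires_before_restart(
    last_restart, lifetime, last_restart + timedelta(days=120))
```

Refresh-on-restart is safe only while the longest gap between restarts stays below the certificate lifetime, which is an implicit coupling that is easy to break as a service matures.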
Although the system is designed to continue serving traffic when the metadata service is unavailable, it did not account for partial connectivity failures. The expired certificates caused such a failure: updates for newly deployed services were only partially processed, and the routing service reconciled to an inconsistent state that was unable to route traffic to those services.
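One way to picture a partially processed update is sketched below. The two-part update shape (a host mapping plus a list of endpoints) is an assumption made for illustration, and `apply_partial`, `resolve`, and `inconsistent_hosts` are hypothetical names. The report does not describe the real data model.

```python
from typing import Dict, List, Optional

# Hypothetical update shape: a new service needs both a host mapping and endpoints.
hosts: Dict[str, str] = {}            # hostname -> service id
endpoints: Dict[str, List[str]] = {}  # service id -> backend addresses


def apply_partial(hostname: str, service_id: str,
                  addrs: Optional[List[str]]) -> None:
    # Applies whatever arrived. If the endpoint half was lost, the host is
    # known to the cache but has no backend, which is an inconsistent state.
    hosts[hostname] = service_id
    if addrs is not None:
        endpoints[service_id] = addrs


def resolve(hostname: str) -> Optional[str]:
    # Returns a backend address, or None when the request cannot be routed.
    service_id = hosts.get(hostname)
    addrs = endpoints.get(service_id, []) if service_id else []
    return addrs[0] if addrs else None


def inconsistent_hosts() -> List[str]:
    # Hosts that are present in the cache but have nowhere to send traffic.
    return [h for h, sid in hosts.items() if not endpoints.get(sid)]


apply_partial("old.example.com", "svc-1", ["10.0.0.5:8080"])
apply_partial("new.example.com", "svc-2", None)  # endpoint half never arrived
```

Under these assumptions, a total outage of the metadata service would leave the cache stale but internally consistent, whereas a partial failure produces entries that look valid yet cannot be served.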
To fail fast, the routing service is designed to crash and restart after several minutes of stale data in order to resolve any client-side connectivity issues. In this incident, these restarts did not solve the issue, and long-lived connections and in-flight requests to the restarting instances were abruptly terminated.
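The fail-fast behavior described above can be sketched as a staleness watchdog. This is a minimal illustration, not Render's code: the class name, the 300-second threshold, and the injectable clock are all assumptions.

```python
import time
from typing import Callable


class StalenessWatchdog:
    """Hypothetical fail-fast check: signal a restart when updates stop arriving."""

    def __init__(self, max_stale_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_stale_seconds = max_stale_seconds
        self.clock = clock
        self.last_update = clock()

    def record_update(self) -> None:
        # Called whenever a metadata update is successfully processed.
        self.last_update = self.clock()

    def should_restart(self) -> bool:
        # True once the cache has gone longer than the threshold without an update.
        return self.clock() - self.last_update > self.max_stale_seconds


# A fake clock keeps the example deterministic and instant.
now = [0.0]
watchdog = StalenessWatchdog(max_stale_seconds=300, clock=lambda: now[0])

now[0] = 120.0
fresh = watchdog.should_restart()   # within the threshold

now[0] = 301.0
stale = watchdog.should_restart()   # threshold exceeded, restart would trigger
```

A restart of this kind helps only when the fault lives in the restarting process. When the cause is external, as with the expired certificates here, the new process encounters the same failure, and the restart's main effect is to drop the connections the old process was holding.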