Partial service disruption for web services and static sites
Incident Report for Render
Postmortem

Summary

Beginning at 10:03 PST on December 3, 2024, Render's routing service was unable to route traffic to newly deployed user services, resulting in 404 errors for end users of those services. Some routing service instances also restarted automatically, which abruptly terminated HTTP connections and reduced capacity for all web traffic. The root cause was expiring TLS certificates on internal Render components, which left Render's routing service with inconsistent internal state.

Beginning at 10:24 PST, the affected certificates were refreshed and the routing service was restarted; the routing service was fully recovered by 10:37 PST.

Impact

Impact 1. Starting at 10:03 PST, many services deployed during the incident window experienced full downtime. Clients of those services received 404 errors with the no-server header.

Impact 2. Starting at 10:08 PST, routing service instances began restarting and abruptly terminating connections, but the routing service was otherwise able to continue serving traffic normally.

Recovery. By 10:37 PST, all routing service instances were reconnected to the metadata service and full service was restored.

Timeline (PST)

  • 10:03 - Certificates in some clusters begin to expire, resulting in Impact 1.
  • 10:08 - Some routing service instances begin to restart, resulting in Impact 2.
  • 10:15 - An internal Render web service becomes unavailable after a deploy and an internal incident is opened.
  • 10:18 - Render engineers are paged because the routing service has stopped receiving updates in some clusters.
  • 10:20 - Render engineers identify that the routing service is failing to connect to the metadata service.
  • 10:24 - Render engineers restart the metadata service to refresh its mTLS certificate; the routing service begins to recover.
  • 10:37 - Restarts are completed and routing services in all clusters are recovered.

Root Cause

The Render HTTP routing service uses an in-memory metadata cache to route traffic to user services. It relies on the Render metadata service for updates to this cache when changes are made to user services.
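
For context, the sketch below shows the general shape of this pattern. The names and types (Route, RouteCache, watchMetadata) are illustrative assumptions rather than Render's actual code: the router answers every request from an in-memory map and depends on a long-lived update stream from the metadata service to keep that map current.

```go
package main

import (
	"fmt"
	"sync"
)

// Route holds the cached metadata needed to proxy requests for one hostname.
type Route struct {
	Host    string
	Backend string
}

// RouteCache is the router's in-memory view of user services.
type RouteCache struct {
	mu     sync.RWMutex
	routes map[string]Route
}

// Lookup runs on the hot path for every request; a cache miss is what an
// end user ultimately sees as a 404.
func (c *RouteCache) Lookup(host string) (Route, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	r, ok := c.routes[host]
	return r, ok
}

// watchMetadata applies updates pushed by the metadata service. If this
// stream stops delivering updates, existing routes keep working but newly
// deployed services never become routable.
func watchMetadata(c *RouteCache, updates <-chan Route) {
	for r := range updates {
		c.mu.Lock()
		c.routes[r.Host] = r
		c.mu.Unlock()
	}
}

func main() {
	cache := &RouteCache{routes: make(map[string]Route)}
	updates := make(chan Route, 1)
	updates <- Route{Host: "myapp.onrender.com", Backend: "10.0.0.12:8080"} // hypothetical values
	close(updates)
	watchMetadata(cache, updates)

	if r, ok := cache.Lookup("myapp.onrender.com"); ok {
		fmt.Println("proxying to", r.Backend)
	}
}
```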

This incident was triggered when TLS certificates for the metadata service expired. These certificates had previously been refreshed on each restart, but as the metadata service has stabilized, we have been redeploying it less frequently, so the certificates aged past their expiration.

Although the system is designed to continue serving traffic when the metadata service is entirely unavailable, it did not account for a partial connectivity failure. The expired certificates caused exactly that: updates for newly deployed services were only partially processed, and the cache reconciled to an inconsistent state that could not route traffic to those services.
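
The following is a purely illustrative sketch of how a partially processed update can reconcile to an unroutable entry. The split into separate "service" and "endpoint" updates is an assumption made for the example and does not describe Render's actual data model.

```go
package main

import "fmt"

type cacheEntry struct {
	backends []string
}

// routeTable maps hostnames to cached entries.
type routeTable map[string]*cacheEntry

// applyService records that a host exists; traffic only becomes routable
// once applyEndpoints has also run for the same host.
func (t routeTable) applyService(host string) {
	if _, ok := t[host]; !ok {
		t[host] = &cacheEntry{}
	}
}

// applyEndpoints attaches backend addresses to an existing entry.
func (t routeTable) applyEndpoints(host string, backends []string) {
	if e, ok := t[host]; ok {
		e.backends = backends
	}
}

// route picks a backend for a host. An entry with no backends is the
// inconsistent state end users saw: a 404 with the no-server header.
func (t routeTable) route(host string) (string, error) {
	e, ok := t[host]
	if !ok || len(e.backends) == 0 {
		return "", fmt.Errorf("no-server")
	}
	return e.backends[0], nil
}

func main() {
	t := routeTable{}
	// One half of a new deploy's metadata arrives over a healthy connection...
	t.applyService("myapp.onrender.com")
	// ...but the other half is lost on a connection whose certificate expired,
	// so the entry reconciles to a state that exists yet cannot route.
	if _, err := t.route("myapp.onrender.com"); err != nil {
		fmt.Println("routing failed:", err) // routing failed: no-server
	}
}
```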

In an attempt to fail fast, the routing service is designed to crash and restart after several minutes of stale data in order to resolve any client-side connectivity issues. Because the failure in this incident was on the server side (an expired metadata service certificate), these restarts did not solve the issue, and long-lived connections and in-flight requests to those instances were abruptly terminated.
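
As an illustration of this fail-fast behavior (the structure and thresholds below are assumptions, not Render's implementation), a staleness watchdog of this kind exits the process once metadata has gone unrefreshed for too long, relying on the orchestrator to restart it with fresh connections:

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// stalenessWatchdog tracks when metadata was last refreshed.
type stalenessWatchdog struct {
	lastUpdate atomic.Int64 // unix nanoseconds of the most recent update
	maxStale   time.Duration
}

func newStalenessWatchdog(maxStale time.Duration) *stalenessWatchdog {
	w := &stalenessWatchdog{maxStale: maxStale}
	w.markUpdate()
	return w
}

// markUpdate is called each time a metadata update is applied.
func (w *stalenessWatchdog) markUpdate() {
	w.lastUpdate.Store(time.Now().UnixNano())
}

// run checks freshness periodically and exits the process once the cached
// data is too stale, so the orchestrator restarts it with new connections.
func (w *stalenessWatchdog) run(checkEvery time.Duration) {
	for range time.Tick(checkEvery) {
		stale := time.Since(time.Unix(0, w.lastUpdate.Load()))
		if stale > w.maxStale {
			log.Fatalf("metadata stale for %s; exiting so the orchestrator restarts us", stale)
		}
	}
}

func main() {
	// Compressed timescale for the demo; the real service tolerates several
	// minutes of staleness before restarting.
	w := newStalenessWatchdog(2 * time.Second)
	go w.run(500 * time.Millisecond)
	time.Sleep(5 * time.Second) // no updates arrive, so run() terminates the process
}
```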

Mitigations

Completed

  • Restarted all metadata services to refresh their certificates.

Planned

  • Automatically refresh metadata service TLS certificates (see the sketch after this list).
  • Update our alert on missing metadata updates to fire sooner.
  • Add an alert monitoring reachability of the metadata service.
  • Increase how long the HTTP routing service tolerates stale metadata before it intentionally restarts.
  • Update the routing service metadata cache logic to handle this mixed connectivity state correctly.
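
As one example of what automatic certificate refresh could look like, the sketch below uses Go's tls.Config.GetCertificate hook to reload a certificate from disk on every handshake, so a renewed certificate takes effect without a redeploy. The file paths and server setup are hypothetical; Render's metadata service may refresh its certificates differently.

```go
package main

import (
	"crypto/tls"
	"net/http"
)

func main() {
	// Hypothetical paths where a renewal process (for example, a
	// cert-manager style sidecar) writes refreshed certificates.
	const certFile, keyFile = "/etc/certs/tls.crt", "/etc/certs/tls.key"

	srv := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			// Re-read the key pair on each handshake so a certificate
			// renewed in place takes effect without restarting the service.
			GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
				cert, err := tls.LoadX509KeyPair(certFile, keyFile)
				if err != nil {
					return nil, err
				}
				return &cert, nil
			},
		},
	}

	// Empty file arguments are allowed because TLSConfig.GetCertificate is set.
	if err := srv.ListenAndServeTLS("", ""); err != nil {
		panic(err)
	}
}
```

In production one would typically cache the parsed certificate and reload it only when the files change; the per-handshake reload here simply keeps the example short.
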
Posted Dec 06, 2024 - 00:49 UTC

Resolved
This incident has been resolved.
Posted Dec 03, 2024 - 19:33 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 03, 2024 - 18:38 UTC
Update
We are continuing to investigate this issue.
Posted Dec 03, 2024 - 18:37 UTC
Investigating
We are currently investigating this issue.
Posted Dec 03, 2024 - 18:25 UTC
This incident affected: Oregon (Web Services), Frankfurt (Web Services), Ohio (Web Services), Singapore (Web Services), Virginia (Web Services), and Static Sites.