Reports of server unhealthy events
Incident Report for Render
Postmortem

Summary

On October 11, 2022 between 16:20 and 17:20 UTC, services and cron jobs in our Oregon region had an elevated rate of errors. This was caused by a routine maintenance operation which was intended to have no impact on user services. However, resource contention resulted in sustained 5xx HTTP response codes. We also showed cryptic errors in the Dashboard and in emails to users.

Impact

Some servers in our Oregon region experienced elevated errors between 16:20 and 17:20 UTC on October 11, 2022.

Some users received cryptic error emails from Render that stated "Exited while running an internal process. Please contact us if you see this again".

Root Cause

At 16:30 UTC on October 11, Render began a routine maintenance operation to update deployment manifests for stateless servers and cron jobs. This operation affected about 5,000 servers and, as expected, caused a redeploy of each affected service. Because we have zero-downtime deploys, we expected this operation to proceed at a gradual pace with back-pressure and ultimately have no user impact. Unfortunately, the resource contention caused by the large number of redeploys led to downtime.

For each service deploy, the following operations happen:

  1. We download the build artifacts of the service.
  2. We invoke the user's start command and then wait for the user's service to pass health checks.
  3. We add the new instance to the load balancer, so traffic starts routing to both old and new instances.
  4. In parallel:
    a. We update the load balancer's list of IP addresses to remove old instances.
    b. Our systems wait up to 60 seconds, then begin scaling down old instances.
  5. Now traffic is being routed only to new instances, completing the deployment.
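
To make the sequence above concrete, here is a minimal Python sketch of steps 1 through 3. Every name in it (download_artifacts, start_and_await_healthy, add_to_load_balancer) is a hypothetical placeholder rather than Render's actual code, and the retry loop in step 1 reflects the behavior described in the next paragraph, where artifact-download failures lead to retries instead of downtime.

import time

# Hypothetical placeholders; the real operations are internal to Render.
def download_artifacts(service):
    return f"{service}-build-artifacts"

def start_and_await_healthy(service, artifacts):
    return f"{service}-new-instance"

def add_to_load_balancer(instance):
    print(f"{instance} added; traffic now reaches both old and new instances")

def deploy_new_instance(service, max_attempts=3):
    """Steps 1-3 of a single service deploy."""
    # Step 1: download build artifacts, retrying on transient failures.
    # A failure here only delays the deploy; old, healthy instances keep serving.
    artifacts = None
    for attempt in range(1, max_attempts + 1):
        try:
            artifacts = download_artifacts(service)
            break
        except OSError:
            time.sleep(attempt)  # simple backoff before retrying
    if artifacts is None:
        raise RuntimeError("could not download build artifacts")

    # Step 2: invoke the user's start command and wait for health checks to pass.
    instance = start_and_await_healthy(service, artifacts)

    # Step 3: add the new instance to the load balancer. Removing old instances
    # (step 4) happens separately, in parallel.
    add_to_load_balancer(instance)
    return instance

if __name__ == "__main__":
    deploy_new_instance("example-service")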

Network contention during this large number of redeploys caused step 1 to fail intermittently. These errors appeared in the Render Dashboard, and in emails to users, as "Exited while running an internal process. Please contact us if you see this again". Though alarming, this error alone does not indicate any downtime: failing at this stage results in retries rather than shifting traffic from healthy instances to unhealthy ones. A concurrent but unrelated incident affecting GitHub DNS resolution contributed to the elevated deployment failures.

The elevated number of deployments meant our system could not keep up with the updates to add new IP addresses and remove old ones from the load balancer's list of ready instances (the first part of step 4). These updates exceeded our 60-second timeout, so old instances began shutting down before they had been removed from the load balancer. As a result, a subset of requests was routed to unhealthy instances, and end users saw 500 or 503 errors. These errors occurred between 16:34 and 16:38 UTC, at which point the system caught up with all of the redeploys.
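
The race can be modeled with a simple, hypothetical calculation: step 4b terminates old instances after a fixed 60-second grace period, while step 4a completes only once the backlog of load-balancer updates drains. The function name and lag values below are illustrative, not taken from our systems.

SCALE_DOWN_GRACE = 60  # seconds: step 4b waits this long, then terminates old instances

def old_instance_gets_traffic_after_termination(lb_update_lag):
    """Return True if an old instance is terminated while still in rotation.

    lb_update_lag: seconds for step 4a to remove the old instance's IP from
    the load balancer's ready list, given the backlog of pending updates.
    """
    termination_time = SCALE_DOWN_GRACE   # step 4b fires regardless of step 4a's progress
    removal_time = lb_update_lag          # step 4a completes only once the backlog drains
    # If the IP is still in the ready list when the instance shuts down, the
    # load balancer keeps routing some requests to it, producing 500/503 errors.
    return removal_time > termination_time

print(old_instance_gets_traffic_after_termination(lb_update_lag=15))   # False: normal operation
print(old_instance_gets_traffic_after_termination(lb_update_lag=240))  # True: mass-redeploy backlog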

Even after the rest of the system recovered, one machine running user services continued to experience elevated faults and slow response times due to contention until 18:30 UTC.

Incident response

At 17:53 UTC we opened an internal incident for elevated user reports of "Exited with status 1 while running an internal process" errors. We immediately associated these reports with the maintenance work we'd completed earlier. At 18:05 UTC we created a public incident with the message:

“We are currently investigating elevated rates of server unhealthy events.”

Because we'd observed that the load balancers had already recovered by 16:38 UTC, we did not take any immediate action.

Given our understanding of the system, we initially had no cause to suspect user-facing downtime. We believed the impact was limited to slow deploys and the cryptic error messages we showed to users. So at 19:35 UTC, we resolved the incident with the following message:

“Due to an infrastructure update, some users in the Oregon region may have seen an increase of Server Unhealthy events showing "Exited with status 1 while running an internal process. Please contact us if you see this again." This should not have resulted in any actual downtime for services.”

However, we continued to dig into what had happened and found evidence of user-facing errors; we also heard from users that they had experienced downtime. So we updated the incident resolution to say:

“Due to an infrastructure update, some users in the Oregon region experienced elevated error rates and request failures between 16:20 and 17:20 UTC.”

Planned remediations

  • Throttle the number of simultaneous deploys we launch during maintenance operations (a sketch of one possible approach follows this list).
  • Improve our alerting for elevated user-facing errors.
  • Improve transparency of error messages we display. Failures to begin a deploy should not give a user the impression that their service is currently down.
  • Update our incident comms policy to include confirming whether there was downtime.
  • Re-evaluate our scheduling policy for routine and non-routine maintenance.
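
As one possible shape for the first remediation above, the sketch below caps concurrent redeploys with a bounded worker pool. The cap value and the redeploy placeholder are hypothetical; this is an illustration of the throttling idea, not Render's implementation.

from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_REDEPLOYS = 50  # hypothetical cap; a real limit would be tuned to capacity

def redeploy(service):
    """Placeholder for the per-service redeploy described under Root Cause."""
    ...

def run_maintenance(services):
    # A bounded pool ensures a manifest update across thousands of services
    # never runs more than MAX_CONCURRENT_REDEPLOYS redeploys at once, keeping
    # artifact downloads and load-balancer updates within capacity.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REDEPLOYS) as pool:
        list(pool.map(redeploy, services))

if __name__ == "__main__":
    run_maintenance([f"service-{n}" for n in range(5000)])
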
Posted Oct 21, 2022 - 20:40 UTC

Resolved
Due to an infrastructure update, some users in the Oregon region experienced elevated error rates and request failures between 16:20 and 17:20 UTC.
Posted Oct 11, 2022 - 19:35 UTC
Investigating
We are currently investigating elevated rates of server unhealthy events.
Posted Oct 11, 2022 - 18:05 UTC
This incident affected: Oregon (Web Services).