Some services returning 503s in US region

Postmortem: 503 errors in Oregon (2020-12-06)

Between 13:34 and 17:33 PST on Sunday, December 6, 2020, many Render services returned 503 errors at an elevated rate. We are very sorry for the impact this had on our customers and their businesses.

We have since made improvements to our platform that would have prevented this incident and will make similar incidents less likely going forward. We are actively working on additional improvements: reliability is and always will be a top priority for Render.

What Happened

At 13:34 PST, a customer API hosted in our Oregon region started holding onto incoming requests until they timed out with an error. The load balancing algorithm that manages HTTP requests across our platform is designed to handle this failure mode by rejecting traffic to the failing service, which prevents TCP connection exhaustion that could affect other services on a given load balancer. The service in question had one of the highest rates of incoming requests in the Oregon region, and its failure exposed a bug in the load balancing algorithm that led to intermittent rejection of requests made to other services as well. These requests were rejected with a 503 (Service Unavailable) response.
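
To illustrate the isolation this kind of algorithm is meant to provide, here is a minimal sketch in Go. It is not our implementation: the thresholds, the keying on the request Host header, and the forwardToBackend stub are all hypothetical. The point is that shedding is scoped per service, so a backend that hangs until timeout should only produce 503s for its own hostname.

    // Minimal sketch (not Render's code) of per-service request shedding:
    // each backend hostname gets its own breaker, so a backend that hangs
    // until timeout only produces 503s for its own traffic.
    package main

    import (
        "net/http"
        "sync"
        "time"
    )

    // breaker tracks consecutive timeouts for one backend service.
    type breaker struct {
        mu        sync.Mutex
        failures  int
        openUntil time.Time
    }

    const (
        failureThreshold = 5                // hypothetical: timeouts before shedding starts
        openDuration     = 30 * time.Second // hypothetical: how long to shed traffic
    )

    // allow reports whether a request to this backend should be attempted.
    func (b *breaker) allow() bool {
        b.mu.Lock()
        defer b.mu.Unlock()
        return time.Now().After(b.openUntil)
    }

    // record updates the breaker after a request completes.
    func (b *breaker) record(timedOut bool) {
        b.mu.Lock()
        defer b.mu.Unlock()
        if !timedOut {
            b.failures = 0
            return
        }
        b.failures++
        if b.failures >= failureThreshold {
            b.openUntil = time.Now().Add(openDuration)
            b.failures = 0
        }
    }

    var (
        mu       sync.Mutex
        breakers = map[string]*breaker{} // keyed by request Host: one breaker per service
    )

    func breakerFor(host string) *breaker {
        mu.Lock()
        defer mu.Unlock()
        if breakers[host] == nil {
            breakers[host] = &breaker{}
        }
        return breakers[host]
    }

    func handle(w http.ResponseWriter, r *http.Request) {
        b := breakerFor(r.Host)
        if !b.allow() {
            // Shed load for this service only; other hostnames keep flowing.
            http.Error(w, "service unavailable", http.StatusServiceUnavailable)
            return
        }
        b.record(forwardToBackend(w, r))
    }

    // forwardToBackend stands in for the real proxying logic; it reports
    // whether the upstream request timed out.
    func forwardToBackend(w http.ResponseWriter, r *http.Request) bool {
        w.WriteHeader(http.StatusOK)
        return false
    }

    func main() {
        http.HandleFunc("/", handle)
        http.ListenAndServe(":8080", nil)
    }

The incident amounted to this per-service scoping breaking down: rejections intended for the failing backend were intermittently applied to requests bound for other services sharing the same load balancer.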

Our monitoring system immediately notified us of the issue and we took steps to mitigate the impact until the root cause was identified and fixed.

Why It Won't Happen Again

A fix has been rolled out across all our load balancers to prevent incidents like this from happening again. We are working on additional mechanisms to increase isolation between services in Render's routing and load balancing layers. We are also increasing test coverage for our load balancing code, and introducing additional synthetic failure states in our continuous integration test suites to increase platform resilience during unexpected events.
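
As an example of the kind of synthetic failure state being added to CI, the sketch below is written against Go's standard httptest package. It is a simplified illustration rather than our actual test suite: a toy host-based router stands in for the production routing layer under test, one fake backend hangs until the client times out, and the test asserts that a healthy service routed by the same layer still responds.

    package routing_test

    import (
        "net/http"
        "net/http/httptest"
        "net/http/httputil"
        "net/url"
        "testing"
        "time"
    )

    // TestHangingBackendDoesNotStarveHealthyBackend injects a synthetic
    // failure (a backend that holds requests until the client times out)
    // behind a toy host-based router, then checks that a second, healthy
    // backend routed by the same layer still answers.
    func TestHangingBackendDoesNotStarveHealthyBackend(t *testing.T) {
        hanging := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(2 * time.Second) // synthetic failure: never answer in time
        }))
        defer hanging.Close()

        healthy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
        }))
        defer healthy.Close()

        // Toy routing layer: picks a backend from the Host header. A real
        // test would stand up the production load balancer here instead.
        backends := map[string]*httputil.ReverseProxy{
            "hanging.test": httputil.NewSingleHostReverseProxy(mustParse(t, hanging.URL)),
            "healthy.test": httputil.NewSingleHostReverseProxy(mustParse(t, healthy.URL)),
        }
        router := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if proxy, ok := backends[r.Host]; ok {
                proxy.ServeHTTP(w, r)
                return
            }
            http.NotFound(w, r)
        }))
        defer router.Close()

        client := &http.Client{Timeout: 300 * time.Millisecond}

        // Saturate the failing service through the router; timeouts are expected.
        for i := 0; i < 20; i++ {
            go func() {
                req, _ := http.NewRequest("GET", router.URL, nil)
                req.Host = "hanging.test"
                if resp, err := client.Do(req); err == nil {
                    resp.Body.Close()
                }
            }()
        }

        // A request to the healthy service through the same router must succeed.
        req, _ := http.NewRequest("GET", router.URL, nil)
        req.Host = "healthy.test"
        resp, err := client.Do(req)
        if err != nil {
            t.Fatalf("healthy service unreachable: %v", err)
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            t.Fatalf("healthy service returned %d, want 200", resp.StatusCode)
        }
    }

    func mustParse(t *testing.T, raw string) *url.URL {
        t.Helper()
        u, err := url.Parse(raw)
        if err != nil {
            t.Fatal(err)
        }
        return u
    }

Tests of this shape run against the real routing layer let us catch isolation regressions before they reach production.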

Our team holds itself to the highest standards when it comes to reliability. We know we failed to meet that standard in this case, and let you down. Still, we firmly believe that we will meet it going forward with these changes in place and a renewed focus on chaos engineering. We are grateful for your continued trust in us.

If you have questions, concerns, or comments related to this incident, please reach out to us at support@render.com.

Posted Dec 09, 2020 - 18:32 UTC

Resolved
This incident has been resolved. Please notify us if you experience an increased rate of service issues.
Posted Dec 07, 2020 - 02:19 UTC
Monitoring
A fix has been implemented and we're monitoring the results.
Posted Dec 07, 2020 - 01:53 UTC
Identified
Connectivity error rates have risen again. We are continuing to investigate.
Posted Dec 07, 2020 - 00:59 UTC
Monitoring
We have addressed the apparent issue with our upstream load balancers, and connectivity appears to be restored. We are continuing to monitor the incident.
Posted Dec 06, 2020 - 23:58 UTC
Identified
We have identified that some of our upstream load balancers are having trouble connecting to the rest of our infrastructure.
Posted Dec 06, 2020 - 23:40 UTC
Investigating
We are currently investigating this issue.
Posted Dec 06, 2020 - 21:57 UTC
This incident affected: Render Dashboard.