Intermittent 503 errors for services
Incident Report for Render
Postmortem

What Happened

At approximately 6:51 AM PT on 2021-08-12, an unexpectedly large spike in inbound traffic caused many instances in Render's load balancing layer to become unavailable. The source of this traffic spike, which reached 650x Render's usual peak traffic, was very likely malicious. To visitors, this surfaced as an increase in 503 responses returned for customer sites. Render's team was alerted immediately and began investigating, manually scaling up our load balancer beyond what our autoscaling systems had already initiated.
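
As a rough illustration (not a description of our actual tooling), a manual scale-up of this kind amounts to raising the replica count of the proxy fleet above whatever the autoscaler had already requested. The sketch below assumes a Kubernetes-style deployment; the names edge-proxy and networking and the replica count are hypothetical.

```python
# Illustrative only: this post does not describe Render's internal tooling.
# Assumes a Kubernetes-style setup; the deployment name, namespace, and
# replica count below are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

# Bump the proxy fleet's replica count above what the autoscaler requested.
apps.patch_namespaced_deployment_scale(
    name="edge-proxy",
    namespace="networking",
    body={"spec": {"replicas": 60}},
)
```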

At approximately 7:02 AM PT, our load balancer became healthy and resumed serving requests, but we noticed an increase in 502 responses, which indicated a problem reaching user services from the load balancer. We identified that a core networking component responsible for DNS was now failing, likely due to the initial traffic spike. We increased resource allocations for this component, enabling it to resume serving requests. All errors were resolved by 7:24 AM PT.
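
For context, "increased resource allocations" for a workload like this typically means raising its CPU and memory requests and limits so it can absorb the extra load. This post does not name the DNS component or the platform it runs on, so the following is a minimal sketch assuming a Kubernetes-style deployment, with hypothetical names and values.

```python
# Illustrative only: component name, namespace, and resource values are
# hypothetical; the report does not specify Render's internal setup.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Raise CPU/memory requests and limits on the DNS workload so it can keep
# serving lookups under the additional load generated by the traffic spike.
apps.patch_namespaced_deployment(
    name="dns-resolver",
    namespace="networking",
    body={
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "dns-resolver",
                        "resources": {
                            "requests": {"cpu": "2", "memory": "4Gi"},
                            "limits": {"cpu": "4", "memory": "8Gi"},
                        },
                    }]
                }
            }
        }
    },
)
```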

What we are doing to prevent it from happening again

  • We reviewed and updated resources allocated to the networking component that failed.
  • We are working on improved scaling strategies for our load balancing layer; a rough sketch of one such change follows this list.
  • We have been investing in infrastructure to mitigate external attacks, including working with leading vendors in the space. Changes that were already in progress at the time of this incident, and were completed earlier this week, will prevent this particular incident from recurring. Unfortunately, the timing meant we missed preventing this outage by a narrow margin.
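
As an illustration of the kind of scaling-strategy change mentioned above: one common approach is to keep a larger warm floor of instances and to scale out earlier by lowering the autoscaler's utilization target. This post does not say how our load balancing layer is autoscaled, so the sketch below assumes a Kubernetes HorizontalPodAutoscaler, with hypothetical names and thresholds.

```python
# Illustrative only: the report does not describe Render's autoscaling stack.
# The names and thresholds below are hypothetical.
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

# Keep a larger warm minimum of proxy instances and scale out earlier
# (at lower CPU utilization) so sudden traffic spikes are absorbed sooner.
autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    name="edge-proxy",
    namespace="networking",
    body={
        "spec": {
            "minReplicas": 20,
            "maxReplicas": 200,
            "targetCPUUtilizationPercentage": 40,
        }
    },
)
```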

We are incredibly sorry for the impact this outage had on our customers and, of course, their customers. Reliability remains the top priority at Render, and we are confident in our ability to prevent similar incidents in the future.

Posted Aug 20, 2021 - 21:52 UTC

Resolved
We are seeing a return to regular success rates. We will post a postmortem after investigating the root cause of the issue further.
Posted Aug 12, 2021 - 14:52 UTC
Monitoring
We are no longer observing a high error rate
Posted Aug 12, 2021 - 14:44 UTC
Update
We've implemented another fix that seems to be resolving many of the networking issues
Posted Aug 12, 2021 - 14:24 UTC
Investigating
We're investigating some ongoing connection failures
Posted Aug 12, 2021 - 14:12 UTC
Monitoring
We've implemented a fix and are seeing errors going down
Posted Aug 12, 2021 - 14:04 UTC
Investigating
We're investigating intermittent error responses for services
Posted Aug 12, 2021 - 13:54 UTC