Issues deploying services

Incident Report for Render

Postmortem

Summary

Builds and deploys experienced an elevated failure rate across all regions, beginning at 05:30 UTC on 2022-07-08. The incident was caused by an inefficient database query that was introduced into the code path of event listeners responsible for deploying new versions of Render services in response to successful builds. After a spike in activity at 05:30 UTC, the event listeners were unable to process events at the rate they were being produced. The issue was resolved by rolling back to a version of our code base that did not include the inefficient database calls. The deploy success rate returned to normal levels at 15:30 UTC.

Root Cause

The incident was caused by an inefficient database query that was introduced into the code path of event listeners that are responsible for deploying new versions of user services in response to successful builds. The inefficient query was also executed when creating new services and triggering manual deploys, so users may have experienced high latency when performing these operations via the Render dashboard and API.
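
To illustrate why a slower event listener turns into failed or stalled deploys, the sketch below models a queue whose depth grows whenever events arrive faster than they are processed. It is a simplified illustration, not Render's implementation: the function name and all rates are hypothetical, and real listeners do not process events in fixed per-minute batches.

```python
def backlog_over_time(arrival_rate, service_rate, minutes, initial_backlog=0):
    """Return the queue depth at the end of each minute.

    arrival_rate and service_rate are events per minute. Both are
    hypothetical inputs chosen for illustration only.
    """
    depth = initial_backlog
    history = []
    for _ in range(minutes):
        # Depth rises by whatever the listeners could not process this minute,
        # and can never drop below zero.
        depth = max(0, depth + arrival_rate - service_rate)
        history.append(depth)
    return history


# Listeners keep up: throughput exceeds the arrival rate, so no backlog forms.
healthy = backlog_over_time(arrival_rate=100, service_rate=120, minutes=5)

# A slow query lowers throughput below the arrival rate: the backlog grows
# every minute until throughput is restored (for example, by a rollback).
degraded = backlog_over_time(arrival_rate=100, service_rate=60, minutes=5)
```

In this model the backlog does not recover on its own while the service rate stays below the arrival rate, which is consistent with the failure rate staying elevated until the rollback.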

The impact of the incident was exacerbated by the length of time it took to identify the root cause of the event handler latency. The event handler is responsible for coordinating the interactions of a number of components with many potential failure modes. Engineers spent a significant amount of time identifying the specific component that was failing so they could apply the appropriate mitigation. This incident exposed gaps in our instrumentation and monitoring that will be addressed as a part of the incident mitigations so we can identify and address these issues before they impact customers in the future.

Mitigations

We have significantly improved the instrumentation of the event handler that is responsible for initiating user deploys via tracing. This will allow us to quickly identify the source of performance issues if they occur in the future. We will be performing an audit of all other critical functions of Render's platform to ensure that they have sufficient instrumentation and alerting.
We will improve monitoring and alerting for database metrics. We already alert when our database experiences CPU or memory exhaustion, but we do not have sufficient alerting for the performance of individual queries. There were metrics available to us that would have helped us understand the root cause more quickly had they been noticed earlier. We will improve our alerting coverage to ensure similar issues are caught more quickly in the future.
This issue was brought to our attention by users rather than via automated alerting, slowing down our response times. This is a failure that we take very seriously. We will ensure that we have the appropriate monitoring in place to proactively alert us if users experience deploy failures in the future.
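
As a rough sketch of the kind of instrumentation described above, the example below times each stage of an event handler and reports which stages exceeded a latency threshold. It is an assumption-laden illustration rather than the tracing Render deployed: `StageTimer`, the stage names, the handler signature, and the 0.5 second threshold are all hypothetical, and a production system would more likely export spans to a tracing backend than keep timings in memory.

```python
import time
from contextlib import contextmanager

SLOW_THRESHOLD_SECONDS = 0.5  # hypothetical threshold, for illustration only


class StageTimer:
    """Accumulates wall-clock time per named stage of a handler."""

    def __init__(self, threshold, clock=time.perf_counter):
        self.threshold = threshold
        self.clock = clock  # injectable so the timer can be tested deterministically
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised an exception.
            elapsed = self.clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slow_stages(self):
        """Return the names of stages whose total time exceeded the threshold."""
        return sorted(n for n, s in self.durations.items() if s > self.threshold)


def handle_build_succeeded(timer, load_build, query_services, start_deploy):
    """Hypothetical handler: each step is a caller-supplied function."""
    with timer.stage("load_build"):
        build = load_build()
    with timer.stage("query_services"):
        services = query_services(build)
    with timer.stage("start_deploy"):
        return start_deploy(services)
```

Wrapping each component separately is the point of the design: when the handler as a whole is slow, `slow_stages()` names the component responsible, such as a single database query, instead of leaving engineers to rule out each component by hand.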

Posted Jul 19, 2022 - 21:30 UTC

Resolved

Deploys and builds are back to normal. We're separately triggering manual deploys for older builds. If you see a stuck build, please trigger a manual deploy, which will cancel the older build and create a new one.

Posted Jul 08, 2022 - 20:34 UTC

Update

Builds and deploys in all regions are back to normal.

Posted Jul 08, 2022 - 16:19 UTC

Monitoring

New deploys in Oregon, Ohio, and Singapore are going through. We're still looking into Frankfurt.

Posted Jul 08, 2022 - 15:56 UTC

Update

We've applied a hotfix and new manual deploys are going through. We're still working on getting back to normal.

Posted Jul 08, 2022 - 14:26 UTC

Update

Engineers are continuing to investigate the issue around deploys failing or stalling partway through.

Posted Jul 08, 2022 - 11:12 UTC

Update

Engineers are continuing to investigate the issue around deployments. This is affecting both manual and automatic deploys.

Posted Jul 08, 2022 - 09:58 UTC

Investigating

We're investigating issues around deploying services.

Posted Jul 08, 2022 - 08:25 UTC

This incident affected: Oregon (Builds and Deploys), Frankfurt (Builds and Deploys), Singapore (Builds and Deploys), and Ohio (Builds and Deploys).