Builds and deploys experienced an elevated failure rate across all regions beginning at 05:30 UTC on 2022-07-08. The incident was caused by an inefficient database query introduced into the code path of the event listeners responsible for deploying new versions of Render services in response to successful builds. After a spike in activity at 05:30 UTC, the event listeners were unable to process events at the rate they were being produced. The issue was resolved by rolling back to a version of our code base that did not include the inefficient database calls. The deploy success rate returned to normal levels at 15:30 UTC.
The incident was caused by an inefficient database query introduced into the code path of the event listeners that deploy new versions of user services in response to successful builds. The same query was also executed when creating new services and triggering manual deploys, so users may have experienced high latency when performing these operations via the Render dashboard and API.
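To illustrate why a slower code path in the event listeners leads to sustained failures rather than a brief slowdown, here is a minimal sketch of how a backlog grows once events arrive faster than they can be processed. The function name `backlog_over_time` and all of the rates are hypothetical and chosen only for illustration; they are not measurements from this incident.

```python
def backlog_over_time(arrival_rate, service_rate, minutes, initial=0):
    """Return the queue depth at the end of each minute.

    arrival_rate and service_rate are events per minute. Both values
    are hypothetical inputs, not figures from the incident.
    """
    depth = initial
    history = []
    for _ in range(minutes):
        # The queue can never go below empty, even with spare capacity.
        depth = max(0, depth + arrival_rate - service_rate)
        history.append(depth)
    return history


# Listeners keep up: capacity exceeds the arrival rate, so no backlog forms.
healthy = backlog_over_time(arrival_rate=100, service_rate=120, minutes=5)

# Listeners slowed by an expensive query: the backlog grows every minute.
degraded = backlog_over_time(arrival_rate=100, service_rate=60, minutes=5)

print(healthy)
print(degraded)
```

In this simplified model the backlog keeps growing for as long as the processing rate stays below the arrival rate, which is consistent with a rollback that restores the faster code path allowing the listeners to catch up.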
The impact of the incident was exacerbated by the time it took to identify the root cause of the event handler latency. The event handler coordinates the interactions of a number of components with many potential failure modes. Engineers spent a significant amount of time identifying the specific component that was failing so they could apply the appropriate mitigation. This incident exposed gaps in our instrumentation and monitoring, which we will address as part of the incident mitigations so that we can identify and resolve these issues before they impact customers in the future.
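As an illustration of the kind of instrumentation that can shorten this search, the sketch below times each stage of an event handler separately so that the slow component stands out. It is a sketch under stated assumptions, not Render's implementation: the handler `handle_build_succeeded`, the stage names, and the `time.sleep` calls standing in for real work are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Stage name -> list of observed durations in seconds. A real system would
# export these to a metrics backend; a dict keeps the sketch self-contained.
stage_durations = defaultdict(list)


@contextmanager
def timed_stage(name):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_durations[name].append(time.perf_counter() - start)


def slowest_stage(durations):
    """Return the stage name with the largest total recorded time."""
    totals = {name: sum(values) for name, values in durations.items()}
    return max(totals, key=totals.get)


def handle_build_succeeded(event):
    """Hypothetical handler; the sleeps stand in for real component calls."""
    with timed_stage("load_service"):
        time.sleep(0.05)   # stand-in for an expensive database query
    with timed_stage("create_deploy"):
        time.sleep(0.001)  # stand-in for a cheap operation
    with timed_stage("notify"):
        pass               # stand-in for a no-op step


handle_build_succeeded({"build_id": "example"})
print(slowest_stage(stage_durations))
```

With per-stage timings like these, an alert on the latency of a single stage can point directly at the failing component instead of at the handler as a whole.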