Platform Outages
Incident Report for Render
Postmortem

Summary

On March 26, 2024, at 16:07 UTC, services on Render's platform were disrupted after an unintended restart of all customer and system workloads. No data was lost, and static sites and services not dependent on PostgreSQL, Redis, or Disks recovered within 20 minutes. However, recovery of managed data stores and other services with attached disks took significantly longer due to the scale of the event and underlying rate limits associated with moving disks between machines. Full functionality across all regions was restored at 20:00 UTC.

Most PostgreSQL databases, Redis instances, and services with attached disks saw much longer recovery times due to system-level throttling and rate limits that weren't designed for an event of this nature and scale. Centralized logging and metrics services were also slow to recover due to these limits. We increased the underlying rate limits after discovering the root cause of the delay, and took other actions to improve recovery times. However, even with these mitigations, the scale of the event delayed complete recovery of paid services until 18:45 UTC and free services until 20:00 UTC.

We are incredibly sorry for the extended disruption we caused for many customers. Platform reliability has always been our top priority as a company, and we let you down.

We have implemented measures to prevent an incident of this scale and nature from happening again, and we are prioritizing further prevention and mitigation measures to improve platform resilience.

Timeline

All times in UTC on 2024-03-26.

  • 16:07 - An unintended change causes a restart of all customer and system workloads.
  • 16:07 - Render engineers are paged.
  • 16:09 - We open an internal critical incident to investigate.
  • 16:19 - We identify the source of the restart and disable it to prevent further restarts.
  • 16:19 - We open a public incident on https://status.render.com and continue to investigate and mitigate.
  • 16:21 - All static sites, stateless services in all regions, and all services hosted on GCP are restored.
  • 17:48 - All paid services are restored in Singapore.
  • 17:53 - All paid services are restored in Ohio.
  • 18:34 - All paid services are restored in Oregon.
  • 18:45 - All paid services are restored in Frankfurt.
  • 19:40 - Logs are restored in all regions.
  • 20:00 - All free services are restored in all regions.

What happened

On March 26, 2024, at 16:07 UTC, a faulty code change caused a restart of all workloads on our platform. This change was put behind a feature flag and tested manually and automatically in Render's development and staging environments, but a combination of issues ultimately led to the bug making it to production:

  • The testing infrastructure for the code change was inconsistent across production and non-production environments.
  • The change was feature-flagged, but a subtle bug in the feature-flagging code prevented the faulty code from running in non-production environments and surfacing sooner.
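To illustrate the shape of this failure mode, here is a minimal, purely hypothetical sketch (the names and logic are our illustration, not Render's actual code) of a flag check whose environment condition inverts the intended behavior, so the guarded code never runs where tests would catch it:

```python
# Hypothetical illustration of an inverted feature-flag condition.
# Function names, environments, and flag names are assumptions.

def is_enabled(flag: str, env: str, enabled_flags: set[str]) -> bool:
    # Intended: run the flagged code wherever the flag is explicitly enabled.
    # Bug: the extra env check silently disables the flag outside production,
    # so the faulty code path is never exercised in staging or development.
    return flag in enabled_flags and env == "production"

# The guarded code never executes in staging, even with the flag on...
assert is_enabled("new-restart-logic", "staging", {"new-restart-logic"}) is False
# ...but does execute in production once the flag is enabled there.
assert is_enabled("new-restart-logic", "production", {"new-restart-logic"}) is True
```

A bug like this is doubly dangerous: it not only ships the faulty code, it also hides the faulty code from every pre-production environment.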

Our systems paged our engineers as soon as the incident started, and we opened an internal incident to investigate. We declared a public incident 12 minutes after the initial report. We quickly identified the faulty code and disabled it to prevent additional workload restarts.

While many services without attached disks recovered within minutes, the components responsible for restarting services with attached disks (PostgreSQL, Redis, and services with explicitly attached disks) were severely overloaded by the unprecedented scale of the event, significantly increasing recovery times for many of these services.

When services restart, they are often transparently moved to a different host. When a service with an attached disk (including managed PostgreSQL and Redis instances) is restarted and moved to a different host, our systems detach the disk from the source host and attach it to the target host. In isolation, a single detach-attach operation takes, at most, a few minutes. However, hundreds of thousands of services with disks were restarted simultaneously during the incident, overloading the systems responsible for moving disks between machines and significantly slowing down our ability to restore service.
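A back-of-envelope calculation shows why a mass restart stretches fast individual operations into hours of backlog. The numbers below are illustrative assumptions, not Render's actual volumes or limits:

```python
# Illustrative sketch: when disk moves are throttled to a fixed rate,
# a mass restart creates a backlog that drains linearly over time.
# The rate and backlog figures are assumptions for illustration only.

def drain_hours(pending_moves: int, moves_per_minute: int) -> float:
    """Hours to clear a backlog of detach-attach operations at a fixed rate limit."""
    return pending_moves / moves_per_minute / 60

# 100,000 simultaneous disk moves at 1,000 moves per minute take ~1.7 hours
# to drain, even though each individual move completes within minutes.
assert round(drain_hours(100_000, 1_000), 1) == 1.7
```

This is why raising the rate limit directly shortens time to recovery: the drain time scales inversely with the permitted throughput.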

As we worked to expedite the recovery process, we discovered and quickly increased default rate limits for the detach-attach process. We also noticed throttling in an underlying infrastructure provider and worked with the provider to increase rate limits across all impacted regions. We increased these limits to the maximum values possible without creating further instability in our systems. We also prioritized paid service recovery by temporarily suspending free PostgreSQL instances during the incident. These changes enabled considerably faster recovery of impacted services; however, full recovery took longer due to the overwhelming volume of the restarts.
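The prioritization step above can be sketched as ordering the recovery queue by tier. This is a simplified illustration under assumed tier names, not Render's actual scheduler:

```python
# Hypothetical sketch: recover paid workloads before free ones by
# sorting the recovery queue by tier. Tier names are assumptions.

def recovery_order(services: list[dict]) -> list[str]:
    """Return service names with paid-tier services ahead of free-tier ones."""
    tier_rank = {"paid": 0, "free": 1}
    # sorted() is stable, so services within a tier keep their queue order.
    return [s["name"] for s in sorted(services, key=lambda s: tier_rank[s["tier"]])]

queue = [
    {"name": "free-db-1", "tier": "free"},
    {"name": "prod-db-1", "tier": "paid"},
    {"name": "prod-db-2", "tier": "paid"},
]
assert recovery_order(queue) == ["prod-db-1", "prod-db-2", "free-db-1"]
```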

Some monitoring and logging systems that rely on attached disks were also unavailable during the incident, leading to gaps in some service metrics and logs.

Mitigations

This incident was the most severe and widespread outage in Render's history, and it surfaced multiple opportunities for us to further improve platform reliability and minimize time to recovery. These improvements are listed below and are being implemented with the highest priority.

Increased disk management rate limits

As discussed above, we increased multiple rate limits in the systems responsible for moving disks between machines. While we are confident in our ability to prevent similar incidents in the future, we are also now equipped to recover much faster than before.

Ensure consistency between production and non-production systems

Our investigation found subtle differences in testing infrastructure between our production and non-production environments. We are working to standardize and improve our testing processes to prevent similar incidents going forward.

Improve our disk management infrastructure

In addition to increased rate limits, we are also making code changes to the components that manage disks. Specifically, we will rely more on batched operations, increasing disk management throughput by an order of magnitude.
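The throughput gain from batching can be sketched as follows; the function names and batch size are assumptions for illustration, not the actual disk-management API:

```python
# Hypothetical sketch: submitting disk moves in batches instead of one
# API call per disk. Names and batch size are illustrative assumptions.
from typing import Callable

def move_disks_batched(disk_ids: list[str],
                       submit_batch: Callable[[list[str]], None],
                       batch_size: int = 50) -> int:
    """Submit disk-move requests in batches; return the number of API calls made."""
    calls = 0
    for i in range(0, len(disk_ids), batch_size):
        submit_batch(disk_ids[i:i + batch_size])
        calls += 1
    return calls

# 1,000 disks: 1,000 per-disk calls collapse into 20 batched calls,
# a 50x reduction in calls against any per-call rate limit.
assert move_disks_batched([f"disk-{n}" for n in range(1000)], lambda b: None) == 20
```

Against a per-call rate limit, throughput scales roughly with the batch size, which is where the order-of-magnitude improvement comes from.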

Restrict permissions for control plane components

The control plane code that triggered the incident only needed to interact with a small subset of system resources and should not have had permission to restart existing customer services and data stores. We are adding system-level restrictions so that only necessary control plane components and services can interact with customer services.

Improve incident communication

Our investigation uncovered multiple gaps in incident communications. It took 12 minutes from the start of the incident to update Render's public status page; while our engineers were working to collect enough information to provide a meaningful update, we should have opened a public incident sooner.

In our initial update, we incorrectly used the 'Degraded Performance' status instead of 'Partial Outage' or 'Full Outage'. As a result, individual component statuses did not reflect the severity of the incident until our next update 22 minutes later.

We understand the critical importance of timely and accurate updates during incidents; we are working on automation and improving our incident response processes to ensure that our status page and other communication channels are updated as soon as relevant information becomes available.

Posted Mar 29, 2024 - 22:44 UTC

Resolved
Everything is operating normally.

We're deeply sorry for the outage; we will soon follow up with a detailed incident report, including mitigations and prevention measures.
Posted Mar 26, 2024 - 20:02 UTC
Monitoring
All services have recovered. We're continuing to monitor isolated cases.
Posted Mar 26, 2024 - 19:58 UTC
Update
All paid services are now operating normally. We're working to restore availability for free tier services.
Posted Mar 26, 2024 - 19:28 UTC
Update
Nearly all services have recovered across all regions. We're working towards 100% recovery for all paid services, and will start bringing back free services next.
Posted Mar 26, 2024 - 18:56 UTC
Update
Singapore and Ohio have now fully recovered.

We have accelerated recovery for the remaining Oregon and Frankfurt databases and services with disks. Free-tier databases will remain unavailable until further notice as we prioritize recovery for paid services.
Posted Mar 26, 2024 - 18:23 UTC
Update
Services in Singapore have recovered fully. Over the last 15 minutes, we have seen the majority of PostgreSQL databases, Redis instances, and services with attached disks recover, and we continue to observe others recovering. Engineers are working on improving recovery times for these services. We aim for full recovery for all paid services before 12 PM PT.
Posted Mar 26, 2024 - 17:58 UTC
Update
Due to the scope of the incident, we need to intentionally and sequentially recover PostgreSQL and Redis functionality across the fleet. We're actively working towards full recovery and collaborating with upstream providers to speed things up.
Posted Mar 26, 2024 - 17:29 UTC
Update
Data Services (Postgres/Redis) and services with attached disks are taking additional work to recover. Application Services that connect to those data services will experience failures or degradation as a result.
Posted Mar 26, 2024 - 16:45 UTC
Update
Many services recovered automatically. Engineering is continuing to identify still-affected services and mitigate issues as necessary.
Posted Mar 26, 2024 - 16:28 UTC
Update
We are continuing to work on a fix for this issue.
Posted Mar 26, 2024 - 16:20 UTC
Identified
We are encountering a broad range of outages across the Render Platform affecting connections and services. Engineering is working on mitigating the cause of these issues and identifying unaffected components.
Posted Mar 26, 2024 - 16:19 UTC
This incident affected: Oregon (Web Services, Cron Jobs, Background Workers, Builds and Deploys, PostgreSQL, Redis, Web Services - Free Tier, Autoscaling, Metrics/Logs), Frankfurt (Web Services, Cron Jobs, Background Workers, Builds and Deploys, PostgreSQL, Redis, Web Services - Free Tier, Autoscaling, Metrics/Logs), Ohio (Web Services, Cron Jobs, Background Workers, Builds and Deploys, PostgreSQL, Redis, Web Services - Free Tier, Autoscaling, Metrics/Logs), and Singapore (Web Services, Cron Jobs, Background Workers, Builds and Deploys, PostgreSQL, Redis, Web Services - Free Tier, Autoscaling, Metrics/Logs).