Database Unavailability in Frankfurt
Incident Report for Render
Postmortem

Summary

A node provided by our external cloud provider had a hardware issue and failed its status check on May 23, 2022, at 08:30 AM UTC. The on-call engineer was automatically paged at 08:57 AM UTC. The engineer found that database instances were unable to start on a healthy node because their disks were still attached to the failed node. The engineer manually terminated the failed node around 09:27 AM UTC, which allowed the disks to be attached to healthy nodes. All database instances came back online by 09:35 AM UTC.

Impact

All user Postgres databases on the node had a complete outage from 08:30 AM UTC to 09:35 AM UTC (up to 1 hour and 5 minutes of downtime).

Root Cause

A node provided by our external cloud provider had a hardware issue and failed its status check on May 23, 2022, at 08:30 AM UTC.

Mitigations

Hardware issues are rare but expected. Our mitigations focus on faster detection and a shorter time to resolve.

The alert that paged the on-call engineer fires after databases have been stuck in a single state while starting up for at least 15 minutes. It paged the on-call engineer after 27 minutes of downtime. We are working on an alert that fires sooner.
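As a rough illustration of the alert condition described above, the check can be sketched as follows. This is not our production alerting code: the names DbInstance and stuck_instances, the state strings, and the sample timestamps are all hypothetical, and treating every state other than "running" as a startup state is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DbInstance:
    """Hypothetical view of a database instance as seen by a monitor."""
    name: str
    state: str             # assumed values: "starting", "running", ...
    state_since: datetime  # when the instance entered its current state


def stuck_instances(instances, now, threshold=timedelta(minutes=15)):
    """Return instances that have stayed in one non-running state
    for at least `threshold` (15 minutes matches the alert above)."""
    return [
        inst for inst in instances
        if inst.state != "running" and now - inst.state_since >= threshold
    ]


# Made-up sample data, only to show the behaviour of the check.
now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    DbInstance("db-a", "starting", now - timedelta(minutes=20)),
    DbInstance("db-b", "starting", now - timedelta(minutes=5)),
    DbInstance("db-c", "running", now - timedelta(hours=3)),
]
print([inst.name for inst in stuck_instances(sample, now)])  # ['db-a']
```

Passing a smaller threshold to the same function makes the check fire sooner, at the cost of more pages for instances that are merely slow to start.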

The engineer took corrective action 30 minutes after being paged. We are working on making the correct action easier and faster to identify, and will eventually automate it to reduce the time to resolve.

We created a StatusPage incident 19 minutes after being paged (46 minutes into the outage). As a result, our status page incorrectly reflected 19 minutes of partial downtime. We cannot edit the incident details, but we have manually updated the uptime to correctly reflect the 1 hour and 5 minutes of total downtime. We will review our incident procedures to ensure our status page reflects our reliability as accurately as possible.
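The intervals quoted in this report can be cross-checked from the timestamps it gives. The sketch below is only a worked calculation: the variable names are illustrative, and the node termination time is approximate ("around 09:27 AM UTC").

```python
from datetime import datetime, timezone


def utc(hour, minute):
    """Timestamp on May 23, 2022 (UTC), the day of the incident."""
    return datetime(2022, 5, 23, hour, minute, tzinfo=timezone.utc)


def minutes_between(start, end):
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps as stated in this report.
node_failed = utc(8, 30)
paged = utc(8, 57)
status_posted = utc(9, 16)
node_terminated = utc(9, 27)  # approximate
recovered = utc(9, 35)

timeline = {
    "failure to page": minutes_between(node_failed, paged),
    "page to status page incident": minutes_between(paged, status_posted),
    "failure to status page incident": minutes_between(node_failed, status_posted),
    "page to node termination": minutes_between(paged, node_terminated),
    "status page incident to recovery": minutes_between(status_posted, recovered),
    "total downtime": minutes_between(node_failed, recovered),
}

for label, value in timeline.items():
    print(f"{label}: {value} min")
```

The status page incident covered only the span from its creation to recovery, which is why it showed 19 minutes rather than the full 65 minutes of downtime.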

We are also working on High Availability (HA) features for Postgres, which should be available around Q4 2022. Once the feature is released, users who opt in will experience minimal downtime if a hardware issue occurs on the node hosting their primary.

Posted May 26, 2022 - 19:57 UTC

Resolved
Issues with Postgres database availability in Frankfurt have now been resolved.
Posted May 23, 2022 - 09:35 UTC
Investigating
We've received reports of issues with Postgres database availability in Frankfurt. Engineers are investigating.
Posted May 23, 2022 - 09:16 UTC
This incident affected: Frankfurt (PostgreSQL).