On April 21st, 2023, Point-in-time Recovery (PITR) was released to eligible new databases. Backups taken for PITR had an incorrect networking configuration that caused the backup job to register itself with the routing layer as an eligible IP for database traffic.
New connections to PITR-enabled databases would fail 50% of the time during the window when PITR backups were taken, because they would be routed to the backup job, which did not have a database running.
50% of new connections to PITR-enabled databases would fail if initiated while a PITR backup job was running. The time it takes to create a backup varies from 15 to 45 minutes, depending on the size of the database. Existing connections to databases (such as those held open through connection pooling) were unaffected.
Render Postgres databases can be connected to via a stable internal service name (e.g. dpg-123abc). This service name is managed by our cluster and ensures that changes to the database, such as a change in IP address, are reflected in the routing configuration for the database.
PITR takes daily backups of the database to ensure we always have an accurate restore point available when a restore is requested. This backup job was configured with a networking label configuration similar to the database's. Our routing layer picked up this configuration and added the running backup job as an additional IP address behind the service name.
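The pickup described above can be sketched as label-based endpoint selection, in the style of a Kubernetes Service selector. This is a hypothetical illustration; the function, pod structure, labels, and IPs are ours, not Render's actual configuration.

```python
# Sketch: a routing layer that selects endpoints by matching labels will
# include ANY pod carrying the matching labels -- including a backup job.

def eligible_endpoints(selector: dict, pods: list) -> list:
    """Return IPs of pods whose labels match every key/value in the selector."""
    return [
        pod["ip"]
        for pod in pods
        if all(pod["labels"].get(k) == v for k, v in selector.items())
    ]

pods = [
    {"ip": "10.0.0.5", "labels": {"app": "dpg-123abc", "role": "database"}},
    # The backup job launched with labels similar enough to match as well:
    {"ip": "10.0.0.9", "labels": {"app": "dpg-123abc", "role": "database"}},
]

# The selector for dpg-123abc matches BOTH pods, so roughly half of new
# connections are routed to the backup job, which accepts no queries.
print(eligible_endpoints({"app": "dpg-123abc"}, pods))
# -> ['10.0.0.5', '10.0.0.9']
```

With two matching endpoints and round-robin-style routing, a new connection lands on the backup job about half the time, which matches the observed 50% failure rate.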
Because the backup job was not a real database, new connections to the service name would fail whenever they were routed to its IP address.
Backups were disabled for affected databases until we could reschedule them to run during low-traffic hours. When we re-enabled and rescheduled backups, a backup job was immediately initiated, causing downtime during business hours.
Our scheduling system for asynchronous jobs, such as backups, immediately initiates a job outside its scheduled window if a recent scheduled run was missed.
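This "catch-up" behavior can be sketched as follows. The function and values are hypothetical, chosen only to illustrate why re-enabling backups triggered an immediate run rather than waiting for the next low-traffic window.

```python
from datetime import datetime, timedelta

def next_run(now: datetime, last_completed: datetime, interval: timedelta) -> datetime:
    """Sketch of a scheduler with missed-run catch-up (illustrative, not Render's code)."""
    if now - last_completed >= interval:
        # A scheduled run was missed while the job was disabled:
        # fire immediately instead of waiting for the next window.
        return now
    return last_completed + interval

# Backups were disabled for two days, so on re-enable during business
# hours the scheduler starts a backup right away.
now = datetime(2023, 4, 24, 14, 0)   # 2 p.m., business hours
last = datetime(2023, 4, 22, 3, 0)   # last backup before backups were disabled
print(next_run(now, last, timedelta(days=1)))
# -> 2023-04-24 14:00:00
```

(Kubernetes CronJobs exhibit a similar behavior: a job past its deadline can be started as soon as the controller sees the missed schedule.)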
We had not seen an issue like this before, so identifying the root cause took time. We observed changes in memory behavior, specifically in the active page cache, and attempted to reproduce that behavior, but we initially failed to account for other aspects of the outage, in particular that new connections were failing.
Going forward, we plan to use open-source benchmarking tools that give us more visibility into our system, so we can debug more efficiently.
The network label configuration for the backup job has been updated so that it is no longer picked up by our routing layer as an eligible service.
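Conceptually, the fix amounts to making the routing selector require a label the backup job does not carry. A minimal sketch, with hypothetical labels standing in for the real configuration:

```python
# Illustrative only: the selector now requires role=database, which the
# backup job's labels do not satisfy, so it is excluded from routing.

def matches(selector: dict, labels: dict) -> bool:
    """True if labels satisfy every key/value required by the selector."""
    return all(labels.get(k) == v for k, v in selector.items())

database_pod = {"app": "dpg-123abc", "role": "database"}
backup_job   = {"app": "dpg-123abc", "role": "backup"}

selector = {"app": "dpg-123abc", "role": "database"}

print(matches(selector, database_pod))  # -> True
print(matches(selector, backup_job))    # -> False
```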
During this incident, some of our earlier communication indicated that problems were resolved before the fixes were fully complete on our end, which led to repeated issues for some users. We have tightened our process for sharing incident updates to address this.