Postmortem: 14 hours of PostgreSQL disk-full

By 👤 DANIEL SAMSON, 🤖 CO-AUTHORED-BY: CLAUDE OPUS 4.7 <NOREPLY@ANTHROPIC.COM> · 2026-04-22

#databases #Kubernetes #postgres #postmortem

Production was down for fourteen hours. No data was lost, but samson.media returned a 500 to every visitor from 05:18 to 19:26 UTC. The fix that should have taken twenty minutes took most of a day, because three unrelated latent issues in the storage stack compounded one another. This is the writeup.

The proximate cause

PostgreSQL ran out of disk on its 10 GiB volume. CloudNativePG did exactly the right thing — it refused to start the primary into a disk-full state and tried to fail over — but the standby refused to start for the same reason. With no instance willing to accept writes, every request that touched the database returned a 500.

The real cause: a logs table with no retention

The Laravel app was logging to a public.logs table. Forty-four million rows. 9.2 GB. Around 1.5 million rows a day, roughly 310 MB of growth a day, for thirty days, until the disk was full. The table had ten indexes — index overhead nearly equalled the data itself. Append-only, no pruning, no partitioning, no cap. A well-known Laravel anti-pattern that nobody caught in review.
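The growth figures above are simple enough to check by hand, but a few lines of Python make the trajectory explicit. This is a minimal sketch that assumes linear growth; the function names are illustrative and not taken from any tooling used during the incident.

```python
def projected_growth_gb(growth_gb_per_day: float, days: int) -> float:
    """Total growth after `days`, assuming a constant daily rate."""
    return growth_gb_per_day * days


def days_until_full(capacity_gb: float, used_gb: float, growth_gb_per_day: float) -> float:
    """Days left before the volume fills, assuming linear growth."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth_gb_per_day must be positive")
    return max(capacity_gb - used_gb, 0.0) / growth_gb_per_day


# Figures from the incident: roughly 310 MB (0.310 GB) of growth a day for thirty days.
total = projected_growth_gb(0.310, 30)  # roughly 9.3 GB, in line with the ~9.2 GB table
```

Run against a new table, `days_until_full` shows roughly how far ahead the deadline could have been seen, long before the disk filled.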

Why a 20-minute fix took 14 hours

The fix was supposed to be: bump the volume from 10 GiB to 30 GiB and move on. It wasn't, because three pre-existing storage issues each needed separate diagnosis:

Longhorn's over-provisioning defaults rejected the expansion despite ~70 GB of real free space on each disk. The scheduling maths (scheduled + requested ≤ (max − reserved) × overProv%) said no. I had to raise over-provisioning to 150% before it would accept the resize.
A stale iscsid PID cached by Longhorn's instance-manager. It resolves the host's iscsid PID once at startup and caches it; the host's iscsid had been restarted at some point, so every volume expansion failed silently with nsenter: cannot open /host/proc/<old-pid>/ns/mnt. The fix was to delete the instance-manager pods so they re-resolved the PID.
CloudNativePG's "Not enough disk space" phase refuses to recreate any pod until every PVC in the cluster is enlarged. So fixing one volume produced no visible progress: it was silently waiting on the other two.
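The over-provisioning check quoted in the first item above can be sketched in a few lines of Python. The function name and all disk figures below are hypothetical, chosen only to show how a disk with plenty of real free space can still fail the inequality. The sketch treats "requested" as the size increase, which is an assumption about Longhorn's accounting rather than something the incident confirmed.

```python
def expansion_fits(
    scheduled_gib: float,
    requested_gib: float,
    max_gib: float,
    reserved_gib: float,
    over_provisioning_pct: float,
) -> bool:
    """Evaluate: scheduled + requested <= (max - reserved) * overProv%."""
    schedulable_gib = (max_gib - reserved_gib) * over_provisioning_pct / 100
    return scheduled_gib + requested_gib <= schedulable_gib


# Hypothetical disk: 100 GiB total, 30 GiB reserved, 75 GiB already scheduled.
# Growing one replica from 10 GiB to 30 GiB asks for 20 GiB more.
before = expansion_fits(75, 20, 100, 30, over_provisioning_pct=100)  # rejected
after = expansion_fits(75, 20, 100, 30, over_provisioning_pct=150)   # accepted
```

The check counts scheduled (nominal) volume sizes rather than bytes actually written, which is why real free space on the disk does not help.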

The bit that actually hurt: no alerting

The database climbed to 99% utilisation over thirty days and nothing fired. No warning at 70%, no page at 85%. We found out it was full when users found out, via a 500. A slow leak is invisible right up until it's a cliff.
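A capacity check with the two thresholds mentioned above does not need to be elaborate. The following is a minimal sketch, not the alerting that was eventually deployed: the names `WARN_AT`, `PAGE_AT`, `classify`, and `check_path` are illustrative, and in a real cluster the same rule would more likely live in the monitoring stack than in a script.

```python
import shutil

WARN_AT = 0.70  # warn at 70% utilisation
PAGE_AT = 0.85  # page at 85% utilisation


def classify(used_bytes: int, total_bytes: int) -> str:
    """Map a utilisation ratio to 'ok', 'warn', or 'page'."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= PAGE_AT:
        return "page"
    if ratio >= WARN_AT:
        return "warn"
    return "ok"


def check_path(path: str) -> str:
    """Classify the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify(usage.used, usage.total)
```

At 99% utilisation `classify` returns "page"; the point is that it would have returned "warn" weeks earlier.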

What went well

GitOps made the storage bump a one-line, auditable pull request that Fleet applied in seconds. CloudNativePG's conservative safety phases meant there was never any risk of data corruption, even as the incident dragged on. The daily backups were intact the entire time — we always had a clean fallback we never needed to use.

Lessons

Never log to your primary database without a retention policy. (That one earned its own post.)
Alert on capacity, not just liveness. Liveness checks are green right up to the moment the disk fills.
Silent retry loops are the enemy. Both Longhorn and the CSI resizer buried the real error behind generic timeout messages for hours.
Know your storage layer's failure modes before 3am, not during it.
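As an illustration of the first lesson, here is a minimal sketch of a batched retention job. It uses Python's built-in sqlite3 module as a stand-in so the example is self-contained; the `created_at` column, the 30-day window, and the batch size are assumptions, not the schema or policy of the real public.logs table. On PostgreSQL a plain DELETE generally frees space for reuse inside the table rather than returning it to the filesystem, so dropping old partitions may be the better fit there.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def prune_logs(
    conn: sqlite3.Connection,
    retention_days: int,
    now: datetime,
    batch_size: int = 10_000,
) -> int:
    """Delete rows older than the retention window in small batches.

    Assumes created_at holds ISO 8601 strings in a single timezone,
    so that string comparison matches chronological order.
    """
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    deleted = 0
    while True:
        cur = conn.execute(
            "DELETE FROM logs WHERE id IN "
            "(SELECT id FROM logs WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # keep each batch in its own short transaction
        if cur.rowcount == 0:
            break
        deleted += cur.rowcount
    return deleted
```

Batching keeps each transaction short, so the job does not hold long locks on a table the application is still appending to.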

Fourteen hours for a full disk. Embarrassing in hindsight, which is exactly what a good postmortem is for.