Silo 03

Infra Silo

A resilient PostgreSQL deployment on Kubernetes with operational runbooks executed through kubectl.

Requirements

  • Run PostgreSQL with high availability and persistent storage.
  • Automate daily backups with tested restore procedures.
  • Provide clear kubectl-based runbooks for incident response.

Stack

  • PostgreSQL
  • Kubernetes
  • kubectl
  • Prometheus
  • Grafana

Architecture

Workflow

  1. The primary pod handles writes while the replica pod receives WAL through streaming replication.
  2. A scheduled CronJob takes daily backups and uploads them to object storage.
  3. Health monitors raise alerts on replication lag and pod health.
  4. The runbook operator executes controlled failover steps via kubectl.
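
The failover step in the workflow can be sketched as a small Python helper that only builds the kubectl invocations, so an operator can review them before running anything. The namespace `db`, the pod names `pg-0` and `pg-1`, the `role=primary` label read by a write Service, and the `postgres` database user are all assumptions made for illustration; none of them come from the deployment described here. `pg_promote()` requires PostgreSQL 12 or later.

```python
import shlex
from typing import List


def failover_plan(namespace: str, old_primary: str, new_primary: str) -> List[List[str]]:
    """Return the kubectl invocations for a label-based failover, in order.

    Assumes (hypothetically) that a write Service selects pods labelled
    role=primary. Nothing is executed here; the caller decides how to run it.
    """
    kubectl = ["kubectl", "-n", namespace]
    return [
        # 1. Stop routing writes to the old primary by removing its role label.
        kubectl + ["label", "pod", old_primary, "role-"],
        # 2. Promote the replica from inside its pod.
        kubectl + ["exec", new_primary, "--", "psql", "-U", "postgres",
                   "-Atc", "SELECT pg_promote();"],
        # 3. Point the write Service at the promoted pod.
        kubectl + ["label", "pod", new_primary, "role=primary", "--overwrite"],
    ]


if __name__ == "__main__":
    # Print the plan as copy-pasteable shell commands for review.
    for cmd in failover_plan("db", "pg-0", "pg-1"):
        print(shlex.join(cmd))
```

Removing the label only changes Service routing. It does not stop the old primary from accepting direct connections, so a real runbook would likely need an additional fencing step before promotion.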

Diagram

flowchart LR
      A[App Services] --> B[Postgres Primary Pod]
      B --> C[Replica Pod]
      B --> D[Persistent Volume]
      C --> E[Read Traffic]
      B --> F[Backup CronJob]
      F --> G[Object Storage]
      H[kubectl Runbooks] --> B
      H --> C

Kubernetes Postgres Topology

A custom topology sketch from the operator's perspective.

Figure labels: Kubernetes Cluster, Postgres Primary, Replica Pod, Backup Job, Object Storage.

Tradeoffs

  • The Kubernetes control plane adds complexity compared with a single-host database.
  • Operational resilience improves at the cost of deeper platform knowledge.
  • Backup/restore automation requires periodic fire-drill validation.
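
One way a restore fire drill could be scored is sketched below. It assumes the backup job records per-table row counts in a manifest at snapshot time, and that the drill collects the same counts from the restored database. That manifest is a hypothetical convention, not something this deployment is known to produce, and the table names are placeholders.

```python
from typing import Dict, List


def restore_mismatches(manifest: Dict[str, int], restored: Dict[str, int]) -> List[str]:
    """Compare row counts recorded at backup time with counts after a restore.

    Returns a human-readable list of problems; an empty list means the
    restored database matches the manifest for every recorded table.
    """
    problems = []
    for table, expected in sorted(manifest.items()):
        actual = restored.get(table)
        if actual is None:
            problems.append(f"{table}: missing after restore")
        elif actual != expected:
            problems.append(f"{table}: expected {expected} rows, found {actual}")
    return problems


if __name__ == "__main__":
    # Placeholder data for a drill in which one table came back short.
    recorded = {"orders": 1200, "users": 85}
    after_restore = {"orders": 1180, "users": 85}
    for line in restore_mismatches(recorded, after_restore):
        print(line)
```

Row counts are a coarse signal. They catch missing or truncated tables but not corrupted values, so a drill may also want checksums or application-level smoke tests.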

Risks / Failure Modes

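  • A misconfigured storage class (for example, a Delete reclaim policy on the database volume) can undermine durability guarantees.
  • Failover automation can cause split-brain if the old primary is not fenced before a replica is promoted.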
  • Alert noise can hide meaningful signals during incidents.
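
A minimal sketch of a split-brain guardrail follows, assuming the runbook first collects each pod's state. The `PodState` fields and the notion of a pod being "fenced" are hypothetical names introduced here. `in_recovery` is meant to hold the result of PostgreSQL's `pg_is_in_recovery()`, which returns true on a replica.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodState:
    name: str
    reachable: bool    # did the health probe get an answer?
    in_recovery: bool  # result of SELECT pg_is_in_recovery(); True means replica
    fenced: bool       # operator has confirmed the pod can no longer take writes


def safe_to_promote(candidate: str, pods: List[PodState]) -> bool:
    """Allow promotion only when no other pod could still act as a primary."""
    found = False
    for pod in pods:
        if pod.name == candidate:
            # The candidate must be a healthy, unfenced replica.
            if not (pod.reachable and pod.in_recovery and not pod.fenced):
                return False
            found = True
            continue
        if pod.fenced:
            continue
        # An unreachable pod may still be accepting writes, and a reachable
        # pod that is not in recovery is already a primary.
        if not pod.reachable or not pod.in_recovery:
            return False
    return found
```

The check is deliberately conservative: an unreachable old primary blocks promotion until someone marks it fenced, trading a slower failover for a lower chance of two writable primaries.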

Outcomes