author     doc <doc@filenotfound.org>  2025-06-30 20:06:28 +0000
committer  doc <doc@filenotfound.org>  2025-06-30 20:06:28 +0000
commit     717fcb9c81d2bc3cc7a84a3ebea6572d7ff0f5cf (patch)
tree       7cbd6a8d5046409a82b22d34b01aac93b3e24818 /incident_response.md
parent     8368ff389ec596dee6212ebeb85e01c638364fb3 (diff)

    uploading documentation (HEAD, master)

Diffstat (limited to 'incident_response.md'):
 incident_response.md | 128 +++++++++++++
 1 file changed, 128 insertions(+), 0 deletions(-)
diff --git a/incident_response.md b/incident_response.md
new file mode 100644
index 0000000..412f671
--- /dev/null
+++ b/incident_response.md
@@ -0,0 +1,128 @@
+# ⚠ïļ Incident Response Checklists for Common Failures
+
+These checklists standardize responses to common failures in your infrastructure, so recovery steps stay consistent and less stressful during downtime.
+
+---
+
+## 🔌 Node Reboot or Power Loss
+
+- [ ] Verify ZFS pools are imported: `zpool status`
+- [ ] Check all ZFS mounts: `mount | grep /mnt`
+- [ ] Confirm Proxmox VM auto-start behavior
+- [ ] Validate system services: PostgreSQL, Mastodon, MinIO, etc.
+- [ ] Run `genesis-tools/healthcheck.sh` or an equivalent post-boot check (see the sketch below)
+
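+A minimal sketch of what a post-reboot check along these lines might look like; the mount path and service unit names are assumptions, not the contents of the actual `genesis-tools/healthcheck.sh`:
+
+```bash
+#!/usr/bin/env bash
+# Post-reboot sanity check (sketch) -- adjust paths and unit names to the environment.
+set -u
+
+echo "== ZFS pools =="
+zpool status -x                      # prints "all pools are healthy" when OK
+
+echo "== Mounts =="
+mount | grep /mnt || echo "WARN: no /mnt mounts found"
+
+echo "== Key services =="
+for svc in postgresql mastodon-web minio; do   # assumed unit names
+    systemctl is-active --quiet "$svc" && echo "OK   $svc" || echo "FAIL $svc"
+done
+```
+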
+---
+
+## 🐘 PostgreSQL Database Failure
+
+- [ ] Ping cluster VIP
+- [ ] Check replication lag: `pg_stat_replication`
+- [ ] Inspect ClusterControl / Patroni node status
+- [ ] Verify HAProxy is routing to the correct primary (see the sketch below)
+- [ ] If failover occurred, verify application connections
+
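+If the VIP answers, a quick way to confirm it is pointed at a writable primary is `pg_is_in_recovery()`; a sketch, with a placeholder address and credentials:
+
+```bash
+# 'f' means the node behind the VIP is the primary; 't' means a replica is answering.
+psql -h 10.0.0.50 -U postgres -d postgres -tAc "SELECT pg_is_in_recovery();"
+
+# Replication overview as seen from the primary
+psql -h 10.0.0.50 -U postgres -d postgres \
+  -c "SELECT client_addr, state, sync_state, replay_lsn FROM pg_stat_replication;"
+```
+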
+---
+
+## 🌐 Network Drop or Routing Issue
+
+- [ ] Check interface status: `ip a`, `nmcli`
+- [ ] Ping gateway and internal/external hosts
+- [ ] Test inter-VM connectivity
+- [ ] Inspect HAProxy or Keepalived logs for failover triggers
+- [ ] Validate DNS and NTP services are accessible
+
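+The first few checks can be scripted; a sketch with placeholder gateway, resolver, and test hosts:
+
+```bash
+ip -br a                              # interface and address state at a glance
+ping -c 3 192.168.1.1                 # gateway (placeholder)
+ping -c 3 1.1.1.1                     # external reachability
+dig +short example.com @192.168.1.1   # DNS through the local resolver
+chronyc tracking                      # or `ntpq -p`, depending on the NTP daemon
+```
+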
+---
+
+## ðŸ“Ķ Object Storage Outage (MinIO / rclone)
+
+- [ ] Confirm rclone mounts: `mount | grep rclone`
+- [ ] View VFS cache stats: `rclone rc vfs/stats`
+- [ ] Verify MinIO service and disk health
+- [ ] Check cache disk space: `df -h`
+- [ ] Restart rclone mounts if needed (see the sketch below)
+
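+A sketch of the rclone checks in one place; the cache path and systemd unit name are assumptions, and `rclone rc` only works if the mount was started with `--rc`:
+
+```bash
+mount | grep rclone                        # is the FUSE mount still present?
+rclone rc vfs/stats                        # VFS cache statistics
+df -h /var/cache/rclone                    # assumed cache disk location
+systemctl restart rclone-mount.service     # hypothetical unit name for the mount
+```
+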
+---
+
+## 🧠 Split Brain in PostgreSQL Cluster (ClusterControl)
+
+### Symptoms:
+- Two nodes think they're primary
+- WAL timelines diverge
+- Errors in ClusterControl or inconsistent data in applications
+
+### Immediate Actions:
+- [ ] Use `pg_controldata` to verify cluster state and timeline on both nodes (see the sketch after this list)
+- [ ] Temporarily pause failover automation
+- [ ] Identify the true primary (most recent WAL position, longest uptime, etc.)
+- [ ] Stop the false primary immediately: `systemctl stop postgresql`
+
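+A sketch of the `pg_controldata` comparison, run on each node; the data directory path is a placeholder:
+
+```bash
+sudo -u postgres pg_controldata /var/lib/postgresql/16/main \
+  | grep -E "cluster state|TimeLine|checkpoint location"
+# A node reporting "in production" with the higher timeline/checkpoint is usually
+# the true primary; two nodes both reporting "in production" confirms the split.
+```
+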
+### Fix the Broken Replica:
+- [ ] Rebuild broken node:
+ ```bash
+ # Run with PostgreSQL stopped and the old data directory moved aside; -R writes
+ # the standby configuration so the rebuilt node rejoins as a replica.
+ pg_basebackup -h <true-primary> -D /var/lib/postgresql/XX/main -U replication -P -R --wal-method=stream
+ ```
+- [ ] Restart replication and confirm sync
+
+### Post-Mortem:
+- [ ] Audit any split writes for data integrity
+- [ ] Review Keepalived/HAProxy fencing logic
+- [ ] Add dual-primary alerts with `pg_is_in_recovery()` checks (see the sketch below)
+- [ ] Document findings and update HA policies
+
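+A sketch of such a dual-primary check, suitable for cron or an existing monitor; node addresses are placeholders:
+
+```bash
+primaries=0
+for host in 10.0.0.11 10.0.0.12 10.0.0.13; do
+    r=$(psql -h "$host" -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null)
+    [ "$r" = "f" ] && primaries=$((primaries + 1))
+done
+# More than one 'f' means two nodes accept writes -- page someone.
+[ "$primaries" -gt 1 ] && echo "ALERT: $primaries nodes claim to be primary"
+```
+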
+---
+
+## 🐘 PostgreSQL Replication Lag / Sync Delay
+
+- [ ] Query replication status:
+ ```sql
+ SELECT client_addr, state, sync_state, sent_lsn, write_lsn, flush_lsn, replay_lsn FROM pg_stat_replication;
+ ```
+- [ ] Compare sent vs. replay LSNs to gauge how far behind the replica is (see the byte-count query below)
+- [ ] Check for disk I/O, CPU, or network bottlenecks
+- [ ] Ensure WAL retention and streaming are healthy
+- [ ] Restart the replica or the sync service if needed
+
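+To turn the LSN comparison into a byte count, `pg_wal_lsn_diff()` (PostgreSQL 10+) can be run on the primary:
+
+```sql
+-- Approximate lag per replica, in bytes, measured on the primary
+SELECT client_addr,
+       state,
+       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
+FROM pg_stat_replication;
+```
+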
+---
+
+## ðŸŠĶ MinIO Bucket Inaccessibility or Failure
+
+- [ ] Run `mc admin info local` to check node status
+- [ ] Confirm MinIO access credentials/environment
+- [ ] Check rclone and MinIO logs
+- [ ] Restart MinIO service: `systemctl restart minio`
+- [ ] Check storage backend health/mounts
+
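+A sketch of the MinIO triage steps, assuming an `mc` alias named `local` and a systemd-managed install; the data mount path is a placeholder:
+
+```bash
+mc admin info local                        # node, drive, and uptime status
+journalctl -u minio --since "1 hour ago"   # recent service logs
+df -h /mnt/minio-data                      # assumed data mount
+systemctl restart minio                    # only after the checks above
+```
+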
+---
+
+## 🐗 Dockerized Service Crash (e.g., AzuraCast)
+
+- [ ] Inspect containers: `docker ps -a`
+- [ ] View logs: `docker logs <container>`
+- [ ] Check disk space: `df -h`
+- [ ] Restart with Docker or Compose:
+ ```bash
+ # Restart a single container
+ docker restart <container>
+ # Or recreate the whole stack (newer installs use `docker compose` instead of `docker-compose`)
+ docker-compose down && docker-compose up -d
+ ```
+
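+Before restarting, it can help to see what exited and whether Docker was expected to bring it back on its own; a sketch using `azuracast` as a placeholder container name:
+
+```bash
+docker ps -a --filter status=exited                                 # what stopped, and when
+docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' azuracast    # e.g. "unless-stopped"
+docker logs --tail 100 azuracast                                    # last output before the crash
+```
+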
+---
+
+## 🔒 Fail2Ban or Genesis Shield Alert Triggered
+
+- [ ] Tail logs:
+ ```bash
+ journalctl -u fail2ban
+ tail -f /var/log/fail2ban.log
+ ```
+- [ ] Inspect logs for false positives
+- [ ] Unban IP if needed:
+ ```bash
+ fail2ban-client set <jail> unbanip <ip>
+ ```
+- [ ] Notify via Mastodon/Telegram alert system
+- [ ] Tune jail thresholds or IP exemptions (see the `fail2ban-client` example below)
+
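+For the tuning step, jail settings can be inspected and adjusted at runtime with `fail2ban-client` (changes made this way do not persist across restarts); `sshd` is an assumed jail name and the network range is a placeholder:
+
+```bash
+fail2ban-client status sshd                        # current bans and filter hits
+fail2ban-client get sshd maxretry                  # current threshold
+fail2ban-client set sshd addignoreip 10.0.0.0/24   # exempt a trusted range
+```
+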
+---
+
+> ✅ Store these in a Gitea wiki or `/root/checklists/` for quick access under pressure.