author     doc <doc@filenotfound.org>  2025-06-30 20:06:28 +0000
committer  doc <doc@filenotfound.org>  2025-06-30 20:06:28 +0000
commit     717fcb9c81d2bc3cc7a84a3ebea6572d7ff0f5cf (patch)
tree       7cbd6a8d5046409a82b22d34b01aac93b3e24818 /incident_response.md
parent     8368ff389ec596dee6212ebeb85e01c638364fb3 (diff)

    uploading documentation (HEAD, master)

Diffstat (limited to 'incident_response.md'):
 incident_response.md | 128 +++++++++++++
 1 file changed, 128 insertions(+), 0 deletions(-)
diff --git a/incident_response.md b/incident_response.md
new file mode 100644
index 0000000..412f671
--- /dev/null
+++ b/incident_response.md
@@ -0,0 +1,128 @@
+# ⚠ïļ Incident Response Checklists for Common Failures
+
+These checklists standardize responses to common failures in your infrastructure, so recovery steps stay consistent and less stressful during downtime.
+
+---
+
+## 🔌 Node Reboot or Power Loss
+
+- [ ] Verify ZFS pools are imported: `zpool status`
+- [ ] Check all ZFS mounts: `mount | grep /mnt`
+- [ ] Confirm Proxmox VM auto-start behavior
+- [ ] Validate system services: PostgreSQL, Mastodon, MinIO, etc.
+- [ ] Run `genesis-tools/healthcheck.sh` or an equivalent post-boot check (see the sketch below)
+
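+A minimal sketch of what a post-reboot check along these lines might look like; the mount path and service unit names are assumptions, not the contents of the actual `genesis-tools/healthcheck.sh`:
+
+```bash
+#!/usr/bin/env bash
+# Post-reboot sanity check (sketch) -- adjust paths and unit names to the environment.
+set -u
+
+echo "== ZFS pools =="
+zpool status -x                      # prints "all pools are healthy" when OK
+
+echo "== Mounts =="
+mount | grep /mnt || echo "WARN: no /mnt mounts found"
+
+echo "== Key services =="
+for svc in postgresql mastodon-web minio; do   # assumed unit names
+    systemctl is-active --quiet "$svc" && echo "OK   $svc" || echo "FAIL $svc"
+done
+```
+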
+---
+
+## 🐘 PostgreSQL Database Failure
+
+- [ ] Ping cluster VIP
+- [ ] Check replication lag: `pg_stat_replication`
+- [ ] Inspect ClusterControl / Patroni node status
+- [ ] Verify HAProxy is routing to the correct primary (see the sketch below)
+- [ ] If failover occurred, verify application connections
+
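+If the VIP answers, a quick way to confirm it is pointed at a writable primary is `pg_is_in_recovery()`; a sketch, with a placeholder address and credentials:
+
+```bash
+# 'f' means the node behind the VIP is the primary; 't' means a replica is answering.
+psql -h 10.0.0.50 -U postgres -d postgres -tAc "SELECT pg_is_in_recovery();"
+
+# Replication overview as seen from the primary
+psql -h 10.0.0.50 -U postgres -d postgres \
+  -c "SELECT client_addr, state, sync_state, replay_lsn FROM pg_stat_replication;"
+```
+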
+---
+
+## 🌐 Network Drop or Routing Issue
+
+- [ ] Check interface status: `ip a`, `nmcli`
+- [ ] Ping gateway and internal/external hosts
+- [ ] Test inter-VM connectivity
+- [ ] Inspect HAProxy or Keepalived logs for failover triggers
+- [ ] Validate DNS and NTP services are accessible
+
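+The first few checks can be scripted; a sketch with placeholder gateway, resolver, and test hosts:
+
+```bash
+ip -br a                              # interface and address state at a glance
+ping -c 3 192.168.1.1                 # gateway (placeholder)
+ping -c 3 1.1.1.1                     # external reachability
+dig +short example.com @192.168.1.1   # DNS through the local resolver
+chronyc tracking                      # or `ntpq -p`, depending on the NTP daemon
+```
+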
+---
+
+## ðŸ“Ķ Object Storage Outage (MinIO / rclone)
+
+- [ ] Confirm rclone mounts: `mount | grep rclone`
+- [ ] View VFS cache stats: `rclone rc vfs/stats`
+- [ ] Verify MinIO service and disk health
+- [ ] Check cache disk space: `df -h`
+- [ ] Restart rclone mounts if needed (see the sketch below)
+
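+A sketch of the rclone checks in one place; the cache path and systemd unit name are assumptions, and `rclone rc` only works if the mount was started with `--rc`:
+
+```bash
+mount | grep rclone                        # is the FUSE mount still present?
+rclone rc vfs/stats                        # VFS cache statistics
+df -h /var/cache/rclone                    # assumed cache disk location
+systemctl restart rclone-mount.service     # hypothetical unit name for the mount
+```
+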
+---
+
+## 🧠 Split Brain in PostgreSQL Cluster (ClusterControl)
+
+### Symptoms:
+- Two nodes think they're primary
+- WAL timelines diverge
+- Errors in ClusterControl or inconsistent data in applications
+
+### Immediate Actions:
+- [ ] Use `pg_controldata` to verify cluster state and timeline on both nodes (see the sketch after this list)
+- [ ] Temporarily pause failover automation
+- [ ] Identify the true primary (most recent WAL position, longest uptime, etc.)
+- [ ] Stop the false primary immediately: `systemctl stop postgresql`
+
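+A sketch of the `pg_controldata` comparison, run on each node; the data directory path is a placeholder:
+
+```bash
+sudo -u postgres pg_controldata /var/lib/postgresql/16/main \
+  | grep -E "cluster state|TimeLine|checkpoint location"
+# A node reporting "in production" with the higher timeline/checkpoint is usually
+# the true primary; two nodes both reporting "in production" confirms the split.
+```
+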
+### Fix the Broken Replica:
+- [ ] Rebuild broken node:
+ ```bash
+ # Run with PostgreSQL stopped and the old data directory moved aside; -R writes
+ # the standby configuration so the rebuilt node rejoins as a replica.
+ pg_basebackup -h <true-primary> -D /var/lib/postgresql/XX/main -U replication -P -R --wal-method=stream
+ ```
+- [ ] Restart replication and confirm sync
+
+### Post-Mortem:
+- [ ] Audit any split writes for data integrity
+- [ ] Review Keepalived/HAProxy fencing logic
+- [ ] Add dual-primary alerts with `pg_is_in_recovery()` checks (see the sketch below)
+- [ ] Document findings and update HA policies
+
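+A sketch of such a dual-primary check, suitable for cron or an existing monitor; node addresses are placeholders:
+
+```bash
+primaries=0
+for host in 10.0.0.11 10.0.0.12 10.0.0.13; do
+    r=$(psql -h "$host" -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null)
+    [ "$r" = "f" ] && primaries=$((primaries + 1))
+done
+# More than one 'f' means two nodes accept writes -- page someone.
+[ "$primaries" -gt 1 ] && echo "ALERT: $primaries nodes claim to be primary"
+```
+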
+---
+
+## 🐘 PostgreSQL Replication Lag / Sync Delay
+
+- [ ] Query replication status:
+ ```sql
+ SELECT client_addr, state, sync_state, sent_lsn, write_lsn, flush_lsn, replay_lsn FROM pg_stat_replication;
+ ```
+- [ ] Compare sent vs. replay LSNs to gauge how far behind the replica is (see the byte-count query below)
+- [ ] Check for disk I/O, CPU, or network bottlenecks
+- [ ] Ensure WAL retention and streaming are healthy
+- [ ] Restart the replica or the sync service if needed
+
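+To turn the LSN comparison into a byte count, `pg_wal_lsn_diff()` (PostgreSQL 10+) can be run on the primary:
+
+```sql
+-- Approximate lag per replica, in bytes, measured on the primary
+SELECT client_addr,
+       state,
+       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
+FROM pg_stat_replication;
+```
+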
+---
+
+## ðŸŠĶ MinIO Bucket Inaccessibility or Failure
+
+- [ ] Run `mc admin info local` to check node status
+- [ ] Confirm MinIO access credentials/environment
+- [ ] Check rclone and MinIO logs
+- [ ] Restart MinIO service: `systemctl restart minio`
+- [ ] Check storage backend health/mounts
+
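+A sketch of the MinIO triage steps, assuming an `mc` alias named `local` and a systemd-managed install; the data mount path is a placeholder:
+
+```bash
+mc admin info local                        # node, drive, and uptime status
+journalctl -u minio --since "1 hour ago"   # recent service logs
+df -h /mnt/minio-data                      # assumed data mount
+systemctl restart minio                    # only after the checks above
+```
+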
+---
+
+## 🐗 Dockerized Service Crash (e.g., AzuraCast)
+
+- [ ] Inspect containers: `docker ps -a`
+- [ ] View logs: `docker logs <container>`
+- [ ] Check disk space: `df -h`
+- [ ] Restart with Docker or Compose:
+ ```bash
+ # Restart a single container
+ docker restart <container>
+ # Or recreate the whole stack (newer installs use `docker compose` instead of `docker-compose`)
+ docker-compose down && docker-compose up -d
+ ```
+
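+Before restarting, it can help to see what exited and whether Docker was expected to bring it back on its own; a sketch using `azuracast` as a placeholder container name:
+
+```bash
+docker ps -a --filter status=exited                                 # what stopped, and when
+docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' azuracast    # e.g. "unless-stopped"
+docker logs --tail 100 azuracast                                    # last output before the crash
+```
+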
+---
+
+## 🔒 Fail2Ban or Genesis Shield Alert Triggered
+
+- [ ] Tail logs:
+ ```bash
+ journalctl -u fail2ban
+ tail -f /var/log/fail2ban.log
+ ```
+- [ ] Inspect logs for false positives
+- [ ] Unban IP if needed:
+ ```bash
+ fail2ban-client set <jail> unbanip <ip>
+ ```
+- [ ] Notify via Mastodon/Telegram alert system
+- [ ] Tune jail thresholds or IP exemptions (see the `fail2ban-client` example below)
+
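+For the tuning step, jail settings can be inspected and adjusted at runtime with `fail2ban-client` (changes made this way do not persist across restarts); `sshd` is an assumed jail name and the network range is a placeholder:
+
+```bash
+fail2ban-client status sshd                        # current bans and filter hits
+fail2ban-client get sshd maxretry                  # current threshold
+fail2ban-client set sshd addignoreip 10.0.0.0/24   # exempt a trusted range
+```
+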
+---
+
+> ✅ Store these in a Gitea wiki or `/root/checklists/` for quick access under pressure.