# Incident Response Checklists for Common Failures
These checklists standardize incident response and reduce stress during infrastructure downtime.
---
## Node Reboot or Power Loss
- [ ] Verify ZFS pools are imported: `zpool status`
- [ ] Check all ZFS mounts: `zfs mount` (or `mount | grep /mnt` if datasets are mounted under `/mnt`)
- [ ] Confirm Proxmox VM auto-start behavior
- [ ] Validate system services: PostgreSQL, Mastodon, MinIO, etc.
- [ ] Run `genesis-tools/healthcheck.sh` or equivalent
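
The pool check at the top of this list can be scripted. The following is a minimal sketch, assuming the output of `zpool list -H -o name,health` (tab-separated, no header) is piped in. The function name `unhealthy_pools` is illustrative and is not part of `genesis-tools`.

```bash
# Print "<pool>: <state>" for every pool whose health is not ONLINE.
# Reads `zpool list -H -o name,health` output (name<TAB>health) on stdin.
# Exit status is 0 when every pool is ONLINE, 1 otherwise.
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2; bad = 1 } END { exit bad + 0 }'
}

# Possible wiring on the node (hypothetical):
#   zpool list -H -o name,health | unhealthy_pools || echo "pool needs attention"
```

Reading from stdin keeps the parsing separate from the `zpool` call, so the function can be exercised with captured output before it is trusted during an incident.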
---
## PostgreSQL Database Failure
- [ ] Ping cluster VIP
- [ ] Check replication lag: query the `pg_stat_replication` view on the primary
- [ ] Inspect ClusterControl / Patroni node status
- [ ] Verify HAProxy is routing to correct primary
- [ ] If failover occurred, verify application connections
---
## Network Drop or Routing Issue
- [ ] Check interface status: `ip a`, `nmcli`
- [ ] Ping gateway and internal/external hosts
- [ ] Test inter-VM connectivity
- [ ] Inspect HAProxy or Keepalived logs for failover triggers
- [ ] Validate DNS and NTP services are accessible
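
The ping steps above can be run as one sweep. This is a rough sketch, assuming Linux `ping` from iputils (`-W` is the reply timeout in seconds). The `PROBE` variable and the host names in the usage comment are placeholders introduced for illustration.

```bash
# Probe each host given as an argument and print the ones that do not answer.
# PROBE is the per-host command; the default sends one ICMP echo with a 2 s timeout.
unreachable_hosts() {
    for host in "$@"; do
        ${PROBE:-ping -c 1 -W 2} "$host" >/dev/null 2>&1 || echo "$host"
    done
}

# Example order: gateway first, then an internal host, then an external one.
#   unreachable_hosts 192.0.2.1 db1.internal example.org
```

Listing hosts from nearest to farthest makes the first name printed a hint at where the path breaks.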
---
## Object Storage Outage (MinIO / rclone)
- [ ] Confirm rclone mounts: `mount | grep rclone`
- [ ] View VFS cache stats: `rclone rc vfs/stats` (requires the mount to be started with `--rc`)
- [ ] Verify MinIO service and disk health
- [ ] Check cache disk space: `df -h`
- [ ] Restart rclone mounts if needed
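
The cache disk check can be turned into a threshold test. The following is a sketch, assuming `df -P` (POSIX output format) is piped in. The 90% default and the cache path in the usage comment are assumptions, not values from this setup.

```bash
# Print the usage percentage (digits only) from `df -P <path>` output on stdin.
# -P keeps each filesystem on one line, so field 5 of line 2 is the capacity.
df_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is at or above the threshold (default 90, an assumed value).
cache_disk_full() {
    pct=$(df_used_pct)
    [ "${pct:-0}" -ge "${1:-90}" ]
}

# Example wiring; /var/cache/rclone is a placeholder path:
#   df -P /var/cache/rclone | cache_disk_full 90 && echo "cache disk nearly full"
```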
---
## Split Brain in PostgreSQL Cluster (ClusterControl)
### Symptoms:
- Two nodes think they're primary
- WAL timelines diverge
- Errors in ClusterControl, or inconsistent data in apps
### Immediate Actions:
- [ ] Use `pg_controldata` to verify state and timeline on both nodes
- [ ] Temporarily pause failover automation
- [ ] Identify true primary (most recent WAL, longest uptime, etc.)
- [ ] Stop false primary immediately: `systemctl stop postgresql`
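
To compare both nodes quickly, the two relevant fields of `pg_controldata` can be pulled out. This is a minimal sketch, assuming the standard English field labels. The function name `controldata_summary` is illustrative.

```bash
# Extract cluster state and timeline from `pg_controldata <datadir>` output on stdin.
# Prints "<state>|<timeline>", for example "in production|3".
controldata_summary() {
    awk -F ': *' '
        /^Database cluster state:/         { state = $2 }
        /^Latest checkpoint.s TimeLineID:/ { tli = $2 }
        END { print state "|" tli }
    '
}

# Run on each node (XX stands for the installed major version):
#   pg_controldata /var/lib/postgresql/XX/main | controldata_summary
```

A running primary reports `in production` and a standby reports `in archive recovery`, so two nodes that both print `in production` confirm the split.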
### Fix the Broken Replica:
- [ ] Rebuild broken node:
```bash
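# Run as the postgres user into an empty data directory; -R writes standby.signal and primary_conninfo (PostgreSQL 12+)
pg_basebackup -h <true-primary> -D /var/lib/postgresql/XX/main -U replication -P -R --wal-method=stream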
```
- [ ] Restart replication and confirm sync
### Post-Mortem:
- [ ] Audit any split writes for data integrity
- [ ] Review Keepalived/HAProxy fencing logic
- [ ] Add dual-primary alerts with `pg_is_in_recovery()` checks
- [ ] Document findings and update HA policies
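
The dual-primary alert can be based on a simple count. The following is a sketch, assuming each node is asked `SELECT pg_is_in_recovery()` and the answers are collected as `<host> <t|f>` lines. The host names and the `postgres` user in the usage comment are placeholders.

```bash
# Count nodes that report themselves as primary.
# Input on stdin: one "<host> <t|f>" line per node, where the second field is
# the result of `SELECT pg_is_in_recovery()` (f = primary, t = standby).
count_primaries() {
    awk '$2 == "f" { n++ } END { print n + 0 }'
}

# Example collection loop (host names are placeholders):
#   for h in db1 db2 db3; do
#       echo "$h $(psql -h "$h" -U postgres -Atc 'SELECT pg_is_in_recovery()')"
#   done | count_primaries
```

Any result greater than 1 should raise an alert. A result of 0 means no node is accepting writes, which is also worth alerting on.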
---
## PostgreSQL Replication Lag / Sync Delay
- [ ] Query replication status:
```sql
SELECT client_addr, state, sync_state, sent_lsn, write_lsn, flush_lsn, replay_lsn FROM pg_stat_replication;
```
- [ ] Compare LSNs for lag
- [ ] Check for disk I/O, CPU, or network bottlenecks
- [ ] Ensure WAL retention and streaming are healthy
- [ ] Restart replica or sync service if needed
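
Comparing LSNs by eye is error-prone because each one is a pair of hexadecimal numbers (`high/low`, where the high part counts 4 GiB units). The sketch below converts them to byte positions and subtracts. The function names are illustrative.

```bash
# Convert a PostgreSQL LSN ("hi/lo", both hexadecimal) to an absolute byte position.
lsn_to_bytes() {
    hi=${1%/*}
    lo=${1#*/}
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Byte lag between two LSNs, e.g. sent_lsn and replay_lsn from pg_stat_replication.
lsn_lag_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Example: lsn_lag_bytes 0/3000148 0/3000060   # prints 232
```

On PostgreSQL 10 and later, `pg_wal_lsn_diff(sent_lsn, replay_lsn)` computes the same difference on the server side.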
---
## MinIO Bucket Inaccessibility or Failure
- [ ] Run `mc admin info local` to check node status
- [ ] Confirm MinIO access credentials/environment
- [ ] Check rclone and MinIO logs
- [ ] Restart MinIO service: `systemctl restart minio`
- [ ] Check storage backend health/mounts
---
## Dockerized Service Crash (e.g., AzuraCast)
- [ ] Inspect containers: `docker ps -a`
- [ ] View logs: `docker logs <container>`
- [ ] Check disk space: `df -h`
- [ ] Restart with Docker or Compose:
```bash
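# Restart a single container
docker restart <container>
# Or recreate the whole Compose stack (run from the directory containing docker-compose.yml)
docker-compose down && docker-compose up -d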
```
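
Before restarting anything, it helps to know which containers actually failed. This is a rough sketch, assuming the `docker ps` output format shown in the comment. The function name `failed_containers` is illustrative.

```bash
# Print "<name> <exit code>" for each container that exited with a non-zero code.
# Input on stdin: lines from
#   docker ps -a --filter status=exited --format '{{.Names}} {{.Status}}'
# for example "azuracast_web Exited (137) 2 minutes ago".
failed_containers() {
    sed -n 's/^\([^ ]*\) Exited (\([0-9][0-9]*\)).*/\1 \2/p' | awk '$2 != 0'
}
```

An exit code of 137 means the process received SIGKILL (128 + 9), which is often, though not always, the kernel's out-of-memory killer.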
---
## Fail2Ban or Genesis Shield Alert Triggered
- [ ] Tail logs:
```bash
journalctl -u fail2ban
tail -f /var/log/fail2ban.log
```
- [ ] Inspect logs for false positives
- [ ] Unban IP if needed:
```bash
fail2ban-client set <jail> unbanip <ip>
```
- [ ] Notify via Mastodon/Telegram alert system
- [ ] Tune jail thresholds or IP exemptions
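
To review bans for false positives, the banned addresses can be extracted one per line. The following is a sketch, assuming the usual `fail2ban-client status <jail>` output with a `Banned IP list:` line. The function name `banned_ips` and the `sshd` jail in the usage comment are illustrative.

```bash
# Print the banned IPs, one per line, from `fail2ban-client status <jail>` output on stdin.
banned_ips() {
    sed -n 's/.*Banned IP list:[[:space:]]*//p' | tr -s ' \t' '\n' | sed '/^$/d'
}

# Example:
#   fail2ban-client status sshd | banned_ips
```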
---
> Store these checklists in a Gitea wiki or `/root/checklists/` for quick access under pressure.