The Incident Log: January 31, 2017

[23:00] *** Topic: DB Replication Lag | Status: 🔴 Critical
[23:05] *** tired_sysadmin has joined
[23:10] <tired_sysadmin> Replication is stuck again. The secondary node (db2) is refusing to sync.
[23:11] <tired_sysadmin> I’m going to wipe the data directory on db2 and let it pull a fresh copy from master.
[23:12] <tired_sysadmin> rm -rf /var/opt/gitlab/postgresql/data
[23:12] <tired_sysadmin> Weird, it’s taking a while. That directory should be basically empty; deleting it should be instant.
[23:13] <helper_dev> Hey, why did the website just go 500?
[23:13] <tired_sysadmin>
[23:13] <tired_sysadmin> I’m looking at my terminal prompt.
[23:14] <tired_sysadmin> It says root@db1.
[23:14] <helper_dev> db1 is Prod. You are deleting Prod.
[23:15] <tired_sysadmin> CTRL+C CTRL+C CTRL+C
[23:15] <tired_sysadmin> Okay, I stopped it. How much is left?
[23:16] <helper_dev> Checking… The directory is 4.5KB.
[23:16] <tired_sysadmin> We had 300GB of data.
[23:17] <helper_dev> Okay, don’t panic. We have 5 different backup mechanisms. Let’s check S3.
[23:20] <helper_dev> S3 bucket is empty. The backup script has been failing silently since version 8.1.
[23:21] <tired_sysadmin> Check the Azure disk snapshots.
[23:22] <helper_dev> Not enabled for the database servers.
[23:23] <tired_sysadmin> …LVM snapshots?
[23:24] <helper_dev> We only take them every 24 hours, but one happened to be taken about 6 hours ago. So we’ve lost roughly 6 hours of data.
[23:25] <tired_sysadmin> I am going to live stream the restoration on YouTube so people don’t kill us.
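Two guard rails would have changed the 23:12 line. The first is a pre-flight check before any destructive command: refuse to run unless you are on the host you think you are on, and unless the target looks the way a broken replica's data directory should look. Below is a minimal sketch in shell; the host name, path, and size threshold are illustrative assumptions, not details from the incident.

    #!/usr/bin/env bash
    # Hypothetical pre-flight guard for rebuilding a replica.
    # EXPECTED_HOST, DATA_DIR and the ~1 GB threshold are placeholders.
    set -euo pipefail

    EXPECTED_HOST="db2"                               # the replica we intend to rebuild
    DATA_DIR="/var/opt/gitlab/postgresql/data"

    # Guard 1: are we on the box we think we are on?
    if [[ "$(hostname -s)" != "$EXPECTED_HOST" ]]; then
        echo "Refusing to run: this is $(hostname -s), not $EXPECTED_HOST" >&2
        exit 1
    fi

    # Guard 2: a stuck replica's data directory should be nearly empty.
    # If it holds more than ~1 GB, assume we are pointed at the wrong host.
    used_kb=$(du -sk "$DATA_DIR" | cut -f1)
    if (( used_kb > 1048576 )); then
        echo "Refusing to run: $DATA_DIR holds ${used_kb} KB" >&2
        exit 1
    fi

    rm -rf "${DATA_DIR:?}"    # :? aborts if DATA_DIR is unset or empty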
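The second guard rail is for the backups that turned out not to exist: a backup job is not a backup until something independently checks that it produced one. Here is a minimal sketch of a freshness check, assuming the dumps land in an S3 bucket; the bucket name and the 24-hour threshold are placeholders, and the timestamp parsing assumes GNU date.

    #!/usr/bin/env bash
    # Hypothetical backup freshness check: exit non-zero (and say why)
    # if the newest object in the bucket is missing or too old.
    set -euo pipefail

    BUCKET="s3://example-db-backups"   # placeholder bucket
    MAX_AGE_HOURS=24

    # `aws s3 ls --recursive` prints "YYYY-MM-DD HH:MM:SS size key",
    # so a plain sort leaves the newest object on the last line.
    newest=$(aws s3 ls "$BUCKET" --recursive | sort | tail -n 1)
    if [[ -z "$newest" ]]; then
        echo "CRITICAL: no backups found in $BUCKET" >&2
        exit 2
    fi

    newest_epoch=$(date -d "$(echo "$newest" | awk '{print $1, $2}')" +%s)
    age_hours=$(( ( $(date +%s) - newest_epoch ) / 3600 ))

    if (( age_hours > MAX_AGE_HOURS )); then
        echo "CRITICAL: newest backup is ${age_hours}h old" >&2
        exit 2
    fi
    echo "OK: newest backup is ${age_hours}h old"

Wire a check like this into cron or a monitoring agent so the failure shows up somewhere a human actually looks, rather than in an empty bucket at 23:20.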

Postmortem of database outage of January 31