The Incident Log: January 31, 2017

[23:00] *** Topic: DB Replication Lag | Status: 🔴 Critical
[23:05] *** tired_sysadmin has joined
[23:10] <tired_sysadmin> Replication is stuck again. The secondary node (db2) is refusing to sync.
[23:11] <tired_sysadmin> I’m going to wipe the data directory on db2 and let it pull a fresh copy from master.
[23:12] <tired_sysadmin> rm -rf /var/opt/gitlab/postgresql/data
[23:12] <tired_sysadmin> Weird, it’s taking a while. That directory should be basically empty; deleting it should be instant.
[23:13] <helper_dev> Hey, why did the website just go 500?
[23:13] <tired_sysadmin>
[23:13] <tired_sysadmin> I’m looking at my terminal prompt.
[23:14] <tired_sysadmin> It says root@db1.
[23:14] <helper_dev> db1 is Prod. You are deleting Prod.
[23:15] <tired_sysadmin> CTRL+C CTRL+C CTRL+C
[23:15] <tired_sysadmin> Okay, I stopped it. How much is left?
[23:16] <helper_dev> Checking… The directory is 4.5KB.
[23:16] <tired_sysadmin> We had 300GB of data.
[23:17] <helper_dev> Okay, don’t panic. We have 5 different backup mechanisms. Let’s check S3.
[23:20] <helper_dev> S3 bucket is empty. The backup script has been failing silently since version 8.1.
[23:21] <tired_sysadmin> Check the Azure disk snapshots.
[23:22] <helper_dev> Not enabled for the database servers.
[23:23] <tired_sysadmin> …LVM snapshots?
[23:24] <helper_dev> We only take them every 24 hours, but one happened to be taken about 6 hours ago. So we’ve lost roughly 6 hours of data.
[23:25] <tired_sysadmin> I am going to live stream the restoration on YouTube so people don’t kill us.
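Two guard rails would have changed the 23:12 line. The first is a pre-flight check before any destructive command: refuse to run unless you are on the host you think you are on, and unless the target looks the way a broken replica's data directory should look. Below is a minimal sketch in shell; the host name, path, and size threshold are illustrative assumptions, not details from the incident.

    #!/usr/bin/env bash
    # Hypothetical pre-flight guard for rebuilding a replica.
    # EXPECTED_HOST, DATA_DIR and the ~1 GB threshold are placeholders.
    set -euo pipefail

    EXPECTED_HOST="db2"                               # the replica we intend to rebuild
    DATA_DIR="/var/opt/gitlab/postgresql/data"

    # Guard 1: are we on the box we think we are on?
    if [[ "$(hostname -s)" != "$EXPECTED_HOST" ]]; then
        echo "Refusing to run: this is $(hostname -s), not $EXPECTED_HOST" >&2
        exit 1
    fi

    # Guard 2: a stuck replica's data directory should be nearly empty.
    # If it holds more than ~1 GB, assume we are pointed at the wrong host.
    used_kb=$(du -sk "$DATA_DIR" | cut -f1)
    if (( used_kb > 1048576 )); then
        echo "Refusing to run: $DATA_DIR holds ${used_kb} KB" >&2
        exit 1
    fi

    rm -rf "${DATA_DIR:?}"    # :? aborts if DATA_DIR is unset or empty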
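The second guard rail is for the backups that turned out not to exist: a backup job is not a backup until something independently checks that it produced one. Here is a minimal sketch of a freshness check, assuming the dumps land in an S3 bucket; the bucket name and the 24-hour threshold are placeholders, and the timestamp parsing assumes GNU date.

    #!/usr/bin/env bash
    # Hypothetical backup freshness check: exit non-zero (and say why)
    # if the newest object in the bucket is missing or too old.
    set -euo pipefail

    BUCKET="s3://example-db-backups"   # placeholder bucket
    MAX_AGE_HOURS=24

    # `aws s3 ls --recursive` prints "YYYY-MM-DD HH:MM:SS size key",
    # so a plain sort leaves the newest object on the last line.
    newest=$(aws s3 ls "$BUCKET" --recursive | sort | tail -n 1)
    if [[ -z "$newest" ]]; then
        echo "CRITICAL: no backups found in $BUCKET" >&2
        exit 2
    fi

    newest_epoch=$(date -d "$(echo "$newest" | awk '{print $1, $2}')" +%s)
    age_hours=$(( ( $(date +%s) - newest_epoch ) / 3600 ))

    if (( age_hours > MAX_AGE_HOURS )); then
        echo "CRITICAL: newest backup is ${age_hours}h old" >&2
        exit 2
    fi
    echo "OK: newest backup is ${age_hours}h old"

Wire a check like this into cron or a monitoring agent so the failure shows up somewhere a human actually looks, rather than in an empty bucket at 23:20.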

Postmortem of database outage of January 31