TL;DR: We are a big boy site with big boy problems that cost money to adequately deal with. If you want to support the site, get Brave, get an Uphold account, and regularly contribute BAT. It's as good as cash. It feeds me and pays the bills.
Problem
The server's hard drive failed.
Timeline
At about 8pm Moscow time, the server shut off. No one pinged me, so when I naturally F5'd the site at 9pm Moscow time and saw it wasn't responding, I started diagnosing. I determined pretty quickly that the server was off and booted it. The web server VM came up read-only and nothing worked. I confirmed the disk was mounted read-only, ran a check on it, and hit an error. I rebooted the server and found that parts of the disk were corrupt. I checked those and hoped that only a few random attachments were lost, because those are easy to restore. I did restore the attachments, but the database failed to boot and basically said it was fucked. I checked the database backups, which run at 12-hour intervals, and saw the last one had completed 11 hours and 30 minutes before the site went down. The import process took 2 hours. The site came back up at about 3:30am Moscow time.
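For anyone curious what "the disk was in read-only mode" actually looks like: when the kernel hits I/O errors it remounts the filesystem read-only, and you can see that flag straight out of /proc/mounts. This is just a minimal sketch of that kind of check, assuming a Linux guest; the mount point is illustrative, not our actual layout.
[CODE]
# Minimal sketch: detect whether a mount point has been remounted read-only.
# Assumes a Linux system with /proc/mounts; the mount point used below is illustrative.

def is_mounted_read_only(mount_point: str) -> bool:
    """Return True if the given mount point carries the 'ro' flag in /proc/mounts."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype, options, *_ = line.split()
            if mountpoint == mount_point:
                # Mount options are comma-separated, e.g. "rw,relatime" or "ro,relatime"
                return "ro" in options.split(",")
    raise ValueError(f"{mount_point} not found in /proc/mounts")

if __name__ == "__main__":
    # Example: check the root filesystem of the web server VM.
    print("read-only:", is_mounted_read_only("/"))
[/CODE]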
Remedies going forward
All services that are not the actual KF have been moved off the KF's server onto new VPSs. In particular, our in-house analytics suite is being reinstalled on a different machine. The analytics DB is actually bigger than the KF's, so this should roughly halve the strain on the server and the disk -- I hope.
I'm also going to start up replication, because now that the DB is this big, SQL file backups are no longer feasible as the primary backup system. The dumps themselves take many minutes to complete, which can create restoration issues where the data doesn't match up. Replication will run on existing hardware I own.
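To put numbers on why dump-based backups hurt: with a 12-hour cadence, the worst-case data loss is roughly the whole interval, which is about what we ate this time (11.5 hours of posts). The sketch below is the kind of sanity check that gap calls for while replication gets set up; the backup directory and interval are assumptions for illustration, not our real setup.
[CODE]
# Minimal sketch: report how old the newest SQL dump is, i.e. the current restore-point gap.
# BACKUP_DIR and INTERVAL_HOURS are hypothetical values for illustration only.
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/db")   # hypothetical dump location
INTERVAL_HOURS = 12                    # dump cadence described in the post

def newest_dump_age_hours(backup_dir: Path) -> float:
    """Return the age in hours of the most recent *.sql dump in backup_dir."""
    dumps = sorted(backup_dir.glob("*.sql"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        raise FileNotFoundError(f"no .sql dumps found in {backup_dir}")
    return (time.time() - dumps[-1].stat().st_mtime) / 3600

if __name__ == "__main__":
    age = newest_dump_age_hours(BACKUP_DIR)
    # If the server dies right now, roughly this many hours of data are gone.
    print(f"latest dump is {age:.1f} hours old (worst case approaches {INTERVAL_HOURS}h)")
[/CODE]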
My big fear right now is that the main server is actually fucked. If the disk corrupts again, I have to replace the corrupted disks and build a new RAID array. Our installation is now several terabytes, so that's not cheap; it's to the tune of hundreds of dollars that I don't have. So fingers crossed it's just some random fluke.