I've written a journal about this, but since it looks like we're not gonna be back for a while, I'm moving this here.
Around noon on Sunday the server started spamming the syslog with SATA exceptions before becoming unresponsive and eventually going down completely. The errors indicated a problem with either the SATA controller or the hard disk itself.
We're currently waiting for a KVM so we can access the server and maybe get a snapshot of the server state from before it went down, or at least figure out what exactly is wrong.
In the meantime Msh100 is taking care of getting new hard disks to the datacenter so we can set up RAID on our box and prevent this from happening again in the future. However, depending on the damage, it's probably gonna take until Tuesday or Wednesday to resolve this. We sure hope to be back up and fully functional for the upcoming SAGE LAN this weekend.
The worst-case scenario would be an unreadable hard disk. In that case we would need to rebuild the site from a backup. We do nightly backups of the important stuff, but that does not include user-uploaded images like avatars and team logos. Those backups are the last resort and we hoped we'd never need them, so rebuilding from them is a pain in the ass.
We've set up this fancy fallback page for now and we'll try to cover some matches manually.
I'll keep this newspost updated, but don't expect too much; progress is slow at the moment. In any case, this is the most serious issue we've ever had to face.
Update: We managed to get a snapshot and move it to another host. The current plan is to temporarily get it running on another box until our own server is fixed. So we might be online tonight (27.6), pretty much fully functional. In the meantime our server will hopefully get fixed up with RAID and we'll move back over before the LAN to resume normal operation. Looks like we got lucky.
Update 2: While I'm writing this we're getting two new HDDs and a new case that can hold more of them. A quick integrity check of the emergency backup showed that everything should be in order, so we can restore the exact state of the server from when it went down. Like the crash never happened.
Update 3: We're back for good. On our own server already, with RAID! Not a single byte of data was lost and everything is running smoothly.
Thanks to YCN's msh100 and his immense effort, this could be handled relatively quickly.