The Perl NOC: June 2007

Sunday, June 10, 2007

~All is well

Almost everything is back up. Apologies for the super-extended outage. I suppose it's bound to happen every few years, but we'll get to work on making sure this particular failure won't happen again.

Robert's been nudging the MySQL databases, and no data should be lost. (A few databases did disappear, but we should be able to find a backup or otherwise restore each of them from exported data.) Cross your fingers for us!

We're still working on making sure all the RT data is in good shape before putting it online, but it should be back within the next few days.

Let us know at ask@ and robert@ if you see anything strange in the next few days. (Or, once RT is back, at webmaster@ ...)

Thank you for your patience!

Friday, June 8, 2007

RAID went boom part 2

The system is still recovering from the failure. Details are long and boring, but computers suck. :-)

Ironically we've just been talking (again) about getting the services running on the failed server made redundant, but haven't actually done it yet. It's our biggest SPOF by far. Grrh.

Right now the ETA is "in the morning" (PST). Robert is sleeping and will check on it when he gets up. I'll keep an eye on it for a little bit longer (and then go to bed because tomorrow it's my birthday and hopefully I'll have to eat a big cake or something!)

Something Go Boom

Something's broken on our main fileserver. It's related to our disk array. We're working on fixing it. More news as it happens. Lots of stuff may be broken until we get it back up.

Update 7:10pm: We've had a disk failure, and the RAID volume got unhappy. We don't think we've lost any important data, but we are verifying. Things are going slowly because it takes a while to check half a terabyte of data.

Update 11:40pm: This is taking a while to resync and fsck. (Plus we accidentally aborted it halfway through.) We're going to bed; we'll try to have services back in the morning.