It has not been a good weekend for our datacenter. It all started on Saturday, when Los Angeles experienced record breaking temperatures. (I spent the afternoon outside, and it was a scorcher... 110 degrees plus.) There was a power failure in the building that hosts our datacenter...
the backup power kicked in, but failed after a while once things got too hot. Our machines lost power briefly, and all but one came back up. Because Murphy's law always takes effect when you least want it to - the machine that didn't come back up was the one that hosts all the perl.org mailing lists. The datacenter personnel attempted to reset it, but as they were dealing with many other customers (much more important than us - I don't blame them), they didn't have time to hook a monitor up to our system and see what was going on. So, at 11pm, I drove down to the datacenter to find that all it wanted was "press F1 to continue". Further diagnosis showed that the BIOS battery was dead, and the case-open sensor kept tripping. Even if the BIOS was set not to prompt, it would "conveniently" forget that fact. (Did I mention that I was leaving for Portland for OSCON in the morning?)
Today, we received a note that we might lose power due to some emergency maintenance the building was going to perform to repair electrical damage caused by yesterday's outage. So, instead of having to deal with fscking and rapid power loss, we shut down all of the systems. Several hours later, we attempted to turn them back on - but only 50% came up! The datacenter staff helped reset the rest, and gave the ornery list box from above the "F1" treatment. Everything is back up and happy now.
I know that several other companies hosted in the same building lost power, and not just in our datacenter. One, a large perl shop, is still down -- going on six hours. With a larger deployment, their main concern is heat dissipation - they need to wait for things to cool down before powering back up. I'm very happy with our hosting arrangements - they've been very helpful with getting boxes reset - and I know things are worse for them than they are for us.
This weekend has identified some weaknesses in our architecture, and we're going to be working over the next few months to solve them. While it doesn't make sense for us to have a fully distributed system, we could definitely use more redundancy in some core systems. We'll probably be posting here with an updated "wish list" soon.
Fingers crossed that the rest of this week goes smoothly. It's no fun having to deal with a datacenter from hundreds of miles away.