Today was a beautiful day in Los Angeles, the kind with clear blue skies, almost 70 degrees F, and the rainy weather almost forgotten...
We went to the datacenter tonight to investigate the failed hardware, and it does appear that the motherboard (or power supply crossbar bus) has failed entirely. The box is totally unresponsive. A new motherboard is on order (it should arrive Thursday), and the affected machine has been moved from the datacenter to my garage for easier access. The plan is to try the new motherboard and hope it fixes the problem. Plan B is to remove the drives from the machine and connect them to another machine to extract the data. (This is complicated by the number of drives in the machine and by the lack of spare machines with a PCI-X slot.)
At this time, we still believe no data has been lost and that we'll be able to safely retrieve all of it; it just might take a few days.
Currently unavailable services (different from last time):
- rt.perl.org (we've got the data safe and sound, but our code customizations are on that machine)
- historical cpan-testers data
- *.pm.org websites hosted by us
This is our first "really big" outage in nearly a decade (only because that's about as far back as our memory goes), and we've definitely got some plans to ensure that it's the only "really big" outage in the next 10 years.