Most days, everything runs smoothly. The monitoring system stays green, and you push out new OS patches and watch the automated patch management apply them without error. The power stays on, the bits keep flowing, support tickets are closed quickly, and all is good.
On other days, however, forces seem to conspire against you. This week we've got one of those days brewing. Our datacenter discovered that they had oversold and underprovisioned power in the room where our networking equipment lives. They've decided to bring down the six power circuits that keep all of our networking equipment running so that they can replace two Power Distribution Units (PDUs) and prevent a possible failure.
All of the planning we put into preventing these issues almost feels superfluous. We have redundant routers, switches, and firewalls at that layer. We have redundant power in these devices. Even the power supplies are on switches that go to PDUs at opposite ends of the colo room. We were foiled by the fact that our provider decided they needed to take both PDUs down at the same time.
So, we get to implement contingency plans. In a careful rush, we've lowered the TTLs of all of our customers' DNS names. We provisioned a server in another datacenter to be a mail relay, to make sure no email gets lost. We've also set up a webserver at that site, and are redirecting customer sites to it via DNS, to let our customers' customers know that they haven't dropped off the face of the earth.
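The TTL change is the part with a deadline: resolvers that cached a record just before we lowered the TTL keep it for the full old TTL, so the lower value has to be published at least that long before the outage. Here is a minimal sketch of that arithmetic. The 24-hour old TTL and the one-hour safety margin are assumed values for illustration, not our actual settings, and `latest_ttl_change` is a made-up helper name.

```python
from datetime import datetime, timedelta


def latest_ttl_change(outage_start, old_ttl_seconds, safety_margin_seconds=0):
    """Return the latest time the lowered TTL can be published.

    A resolver that cached the record just before the change may keep it
    for the full old TTL, so the change must land at least that long
    (plus any safety margin) before the outage begins.
    """
    return outage_start - timedelta(seconds=old_ttl_seconds + safety_margin_seconds)


# Assumed values: a 24-hour old TTL and a 1-hour margin.
outage_start = datetime(2006, 11, 3, 21, 0)
deadline = latest_ttl_change(outage_start, old_ttl_seconds=86400,
                             safety_margin_seconds=3600)
print("Lower the TTLs no later than", deadline)
```

Once the lowered TTL is in place everywhere, the later switch to the outside webserver and mail relay should propagate in roughly the new TTL, assuming resolvers honor it.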
Then we send out the notification to our customers, and plan for a very long night. Here's the notification:
To: Customer
Subject: BitPusher Services maintenance notification
————————————————————————
ESTIMATED START TIME: 11/03/2006 21:00 PST
ESTIMATED END TIME: 11/04/2006 01:00 PST
SERVICES AFFECTED: All
TYPE OF WORK: Power work
PURPOSE OF WORK: Emergency power repair.
IMPACT OF WORK: All network services will be unavailable during the
repair work.
Our colo provider, 365 Main, alerted us that they need to perform emergency work on their power system tomorrow night. All of our equipment is on redundant power feeds to two separate Power Distribution Units, but unfortunately 365 Main will be bringing down both PDUs that we are connected to.
We are taking measures to minimize the effect of this outage. We have set up a webserver and mail relay on an outside network, and will be redirecting website hits and e-mail traffic to this location for the duration of the outage. If you have a specific maintenance page that you would like us to post, please e-mail it to support@bitpusher.com as soon as possible.