Sometimes the best laid plans of mice and men ….
November 3rd, 2006 by mhalligan
Most days, everything runs smoothly. The monitoring system stays green, you push out new OS patches and watch the automated patch management apply them without error. The power stays on, the bits keep flowing, support tickets are closed quickly, and all is good.
On other days, however, forces seem to conspire against you. This week we’ve got one of those days brewing. Our datacenter discovered that they had oversold and underprovisioned power in the room where our networking equipment is. They’ve decided that they need to bring down the 6 power circuits that keep all of our networking equipment running, because they need to replace two Power Distribution Units to prevent a possible failure.
All of the planning we put into preventing these issues almost feels superfluous. We have redundant routers, switches, and firewalls at that layer. We have redundant power in these devices. Even the power supplies are on switches that go to PDUs at opposite ends of the colo room. And yet we are foiled by the fact that our provider decided they needed to take both PDUs down at the same time.
So, we get to implement contingency plans. In a careful rush, we’ve lowered the TTLs of all of our customers’ DNS names. We provisioned a server in another datacenter to be a mail relay, to make sure no email gets lost. We’ve also set up a webserver at that site, and are redirecting customer sites to this server via DNS, to let our customers’ customers know that they haven’t dropped off the face of the earth.
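For anyone planning a similar DNS cutover, the timing matters: resolvers may keep serving a cached record until its old TTL runs out, so the lowered TTL has to be published at least one old-TTL ahead of the switch. Here is a minimal sketch of that arithmetic in Python. The 24-hour old TTL and the 5-minute new TTL are assumed values for illustration, not the ones we actually used, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zone for the PST times quoted in the notification.
PST = timezone(timedelta(hours=-8), "PST")


def latest_ttl_change(cutover: datetime, old_ttl_seconds: int) -> datetime:
    """Latest moment to publish the lowered TTL.

    A resolver that cached the record just before the change may keep it
    for the full old TTL, so the change must land at least that long
    before the cutover.
    """
    return cutover - timedelta(seconds=old_ttl_seconds)


def worst_case_stale_window(new_ttl_seconds: int) -> timedelta:
    """Longest a well-behaved resolver can keep serving the old address
    after the records are repointed, once the lowered TTL is in effect."""
    return timedelta(seconds=new_ttl_seconds)


# Outage start taken from the notification; TTL values are assumptions.
outage_start = datetime(2006, 11, 3, 21, 0, tzinfo=PST)
deadline = latest_ttl_change(outage_start, old_ttl_seconds=86400)
stale = worst_case_stale_window(new_ttl_seconds=300)

print("Lower TTLs no later than:", deadline.isoformat())
print("Worst-case stale window after repointing:", stale)
```

Resolvers that ignore TTLs can hold on to the old address for longer than this, so treat the result as a lower bound on lead time rather than a guarantee.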
Then we send out the notification to our customers, and plan for a very long night. Here’s the notification:
To: Customer
Subject: BitPusher Services maintenance notification:
————————————————————————
ESTIMATED START TIME: 11/03/2006 21:00 PST
ESTIMATED END TIME: 11/04/2006 01:00:00 PST
SERVICES AFFECTED: All
TYPE OF WORK: Power work
PURPOSE OF WORK: Emergency power repair.
IMPACT OF WORK: All network services will be unavailable during the repair work.
Our colo provider, 365 Main, alerted us that they need to perform emergency work on their power system tomorrow night. All of our equipment is on redundant power feeds to two separate Power Distribution Units, but unfortunately 365 will be bringing down both PDUs that we are connected to.
We are taking measures to minimize the effect of this outage. We have set up a webserver and mail relay on an outside network, and will be redirecting website hits and e-mail traffic to this location for the duration of the outage. If you have a specific maintenance page that you would like us to post, please e-mail it to support@bitpusher.com as soon as possible.

November 29th, 2006 at 1:34 pm
That so totally sucks. Perhaps you want redundant providers. It’s just the same at the shared, virtual hosting level too. The provider just does whatever they feel like and gives you a funny look if you complain. If you continue to complain, they make it very clear you are not welcome, and suddenly you have to move or lose your sites’ data and customer contact point.
OK, so it’s not the same, but similar.
November 29th, 2006 at 1:43 pm
Grins.. As we’ve found out, you need to take action in the beginning when your provider isn’t giving you the service that you deserve, or it’ll just continue. The datacenter we’ve been in for 3 years now has just been plagued with problems, the least of which were technical.
Our contract there comes up for renewal in March, and rather than endure these problems, we’ve moved to a new datacenter (http://blog.bitpusher.com/2006/11/22/opening-our-new-datacenter/).
Moving forward, I hope to also open another datacenter next year. We’re going to move towards a multi-site model, so that we are redundant on everything including real estate.