A few weeks ago we had a really ugly situation on our hands. Monitive was originally built about four years ago, and since day one, a main requirement was redundancy: if things went really wrong, the system had to keep running, delivering on the promise of 24/7 uptime monitoring to its customers.
We’d even joke about it: if a virus or some weird apocalyptic scenario wiped out every human on Earth, Monitive’s uptime monitoring service would keep sending alerts as if nothing serious had ever happened. Because even in the case of a disaster, some deployment of Monitive would automatically come up and resume normal operations.
After a ton of research, a few setbacks and many cups, er… tanks of coffee, there it was. Our main servers were backed up by a couple of stand-by servers, with real-time data replication and a custom-made heartbeat-like system that would automatically power up monitoring from the failover location as soon as the main server was considered dead. It really felt like our responsibility to make sure customers knew when their service was down, even while ours was dead in the water. When the main servers went down, the presentation website went down with them, but monitoring and alerting continued from a geographically different location.
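For the curious, the core of a heartbeat-based failover like the one described above can be sketched in a few lines. This is a minimal, hypothetical Python sketch, not Monitive’s actual implementation; the class name, the timeout value, and the injected clock are all illustrative assumptions.

```python
import time

# Seconds of silence from the primary before the standby assumes it is dead.
# Purely illustrative; real systems tune this against false positives.
HEARTBEAT_TIMEOUT = 30

class FailoverWatcher:
    """Runs on the standby server; promotes it when the primary goes quiet."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.time):
        self.timeout = timeout
        self.clock = clock            # injectable clock makes this testable
        self.last_heartbeat = clock() # treat startup as a fresh heartbeat
        self.active = False           # True once this node has taken over

    def record_heartbeat(self):
        """Called whenever a heartbeat arrives from the primary."""
        self.last_heartbeat = self.clock()
        self.active = False           # primary is alive; stand down

    def check(self):
        """Promote this standby if the primary has been silent too long."""
        if self.clock() - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active
```

The key design choice is that the standby decides on its own, based only on silence, which is exactly why a simultaneous failure of both machines (as happened here) defeats it: there is no third node left to notice the silence.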
And everything went peachy for years. The feeling that nothing could go really wrong, disaster-level wrong, was the most peaceful feeling in the world, business-wise.
Until the impossible happened: both the primary and the failover servers failed, within a matter of hours. We were in shock. And then the race started: recover all the valid data from both servers and from the automatic backups we had set up, and bring up a new server to handle all the monitoring. I don’t need to say how Defcon 1 that was. We suffered downtime and partial availability for some time; some accounts had monitors that weren’t being monitored. Those were a really busy couple of days, I must admit.
Of course, when I realized that the ugliest thing had happened, I emailed all our customers to let them know we had some serious issues. I thought it was the fair thing to do. After all, there were several periods when monitoring wasn’t working at all, and since our customers rely on us to let them know when they have problems with their sites, it only seemed fair to let them know when our monitoring system had run into trouble of its own.
I was surprised to get positive replies to my email: customers supported us and even offered to help us recover from this massive storage malfunction. That really lifted our morale, since I had been expecting reproach and angry remonstrations. After all, it wasn’t their fault that our system broke.
A few days later, everything was up and running on a new, more powerful server, with a lot of lessons learned. It was a very interesting experience and, to be honest, I don’t know if I would have learned so much if I hadn’t experienced it firsthand.
So, how do you handle storage breakdown in your system?