A few weeks ago one of our main servers had a major hard-drive failure. Although the drive was in a RAID array, things didn’t go so smoothly and the server started stalling for minutes every now and then. For us, providing a high-quality Software-as-a-Service means “it’s always available”, because customers depend on us to alert them when their websites and services are in trouble. If Monitive goes down, it’s “DEFCON 1” for us, because outages are not being logged or reported. This doesn’t happen very often, and you’ll see why below.
SaaS (Software-as-a-Service) comes with a whole new set of expectations for customers. Customers used to pay to download a useful piece of software, install it on their computer and use it at any time. This is exactly what they expect when the software moves into the cloud. If they’re preparing for a meeting and their spreadsheets are not available for whatever reason, disappointment sets in.
For Monitive in particular, our customers have even higher expectations, because an uptime monitoring service runs and does its job even when the customer is not actually using it. It’s usually setup-and-forget, so an outage on our side is bad not only because customers cannot see stats and charts, but also because they are not notified when there are problems with the services they’re monitoring.
Luckily for us, not only did our customers not notice the incident, but the whole monitoring service kept on going even while the server was completely down. This is because, since day one, the whole system was built to handle this gracefully.
The Must: “Disaster Recovery”
Disaster Recovery is a system setup that allows an online business to recover from a major outage, such as when the whole datacenter goes off the grid. This goes beyond RAID, clustering, load-balancing and Heartbeat failover setups, because those only work as long as at least a critical part of the system is still up and running. If all the servers are down, your service is down, and customers start complaining.
Every respectable business should have a form of disaster recovery, since subscriptions and payments are based on some user agreement. You can’t tell your customers “Hey, our hard-drive got fried, all your data is gone. We have some backup from three months ago, we’ll let you know when stuff works again”. This statement would go out just before bankruptcy.
Proper disaster recovery requires an automatic backup of data and configuration files, so that one can quickly restart the online service from a different location.
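As a rough illustration of what “automatic backup of data and configuration files” can look like, here is a minimal Python sketch that copies only new or changed files to a backup directory. It is not Monitive’s actual tooling; the function names and the hash-comparison approach are assumptions made for the example, and a real setup would ship the files to another location rather than a local folder.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def incremental_backup(source: Path, target: Path) -> list[str]:
    """Copy files from source to target only if they are new or changed.

    Returns the relative paths that were copied, sorted alphabetically.
    Hypothetical helper for illustration; target must not live inside source.
    """
    copied = []
    for src in (p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = target / rel
        if dst.exists() and file_digest(dst) == file_digest(src):
            continue  # unchanged since the last run, nothing to do
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(rel.as_posix())
    return sorted(copied)
```

Run from a scheduler (cron, for instance), a second call right after the first copies nothing, which is what keeps frequent backups cheap.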
Monitive’s Automatic Failover Setup
Knowing that servers do fail, usually sooner than expected, we built Monitive a bit differently, from the ground up. We have two “copies” of the main servers, in two different locations in the world. One is designated ‘master’ and the other ‘failover’. By using a custom heartbeat mechanism, they both “know” when the other is down.
Data is shared via multi-master and replica set configurations, and incremental backups are created automatically. Both contribute to our high-availability promise to our customers.
Whenever the ‘master’ goes down, for whatever reason, the ‘failover’ server automatically “kicks in”, continuing to execute the monitoring checks, process data, and send alerts and reports. We can also make this switch manually, at will. This allows us to execute maintenance tasks, replace hard drives, roll out updates and pretty much anything else, without breaking our promise to deliver highly reliable services.
So we could take our time replacing the faulty hard drive and testing things out, without any stress or complaints from Monitive’s users.
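The heartbeat idea can be sketched in a few lines of Python. This is only an illustration of the general technique, seen from the ‘failover’ side, and not Monitive’s custom mechanism: the class name, the timeout value and the method names are all hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class FailoverNode:
    """The 'failover' half of a master/failover pair (illustrative only)."""

    timeout: float               # seconds of silence before the master counts as down
    last_heartbeat: float = 0.0  # when the master was last heard from
    forced: bool = False         # True after a manual switch
    active: bool = False         # True while this node runs the monitoring checks

    def record_heartbeat(self, now: float) -> None:
        """Called whenever a heartbeat from the master arrives."""
        self.last_heartbeat = now

    def master_is_down(self, now: float) -> bool:
        return now - self.last_heartbeat > self.timeout

    def manual_switch(self, enabled: bool) -> None:
        """Force this node to take over, e.g. during planned maintenance."""
        self.forced = enabled

    def tick(self, now: float) -> bool:
        """Decide whether this node should be doing the work right now."""
        self.active = self.forced or self.master_is_down(now)
        return self.active
```

In a real deployment `now` would come from something like `time.monotonic()`, and `tick` would run on a short interval. Once the master’s heartbeats resume and no manual switch is in effect, the failover node steps back.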
Backup! Backup! Backup!
“Never say never”. Even though we thought it was all but impossible for both ‘master’ and ‘failover’ to fry along with our data, we decided to make sure that this wouldn’t be the end of us. We have a third backup location where all the important data is stored, so even if a meteor shower wiped both locations off the face of the Earth, the service could be brought back up in just a few hours, from a third location on a different continent.
Backup is not just for when some server fails. Backup also saves you from human error. Having an n-tier multi-master setup or MongoDB replica sets spread across hundreds of servers doesn’t save you from a misplaced “DROP DATABASE” query. One wrong statement and precious information is wiped out across your whole network.
This is where snapshots, delayed slaves and backups will help you keep a long and successful business. Technically speaking.
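To see why a delayed copy helps against a wrong statement, here is a small Python simulation. It is a toy model, not how any particular database implements delayed replication: the `DelayedReplica` class and its operations are invented for the example. The replica receives every operation immediately but applies it only after a fixed delay, which leaves a window to throw away a mistake before it lands.

```python
from collections import deque


class DelayedReplica:
    """Toy replica that applies operations only after `delay` seconds."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self.pending = deque()  # queued (timestamp, op, key, value) tuples
        self.data = {}          # the replica's applied state

    def receive(self, timestamp: float, op: str, key=None, value=None) -> None:
        """Queue an operation replicated from the primary."""
        self.pending.append((timestamp, op, key, value))

    def apply_due(self, now: float) -> None:
        """Apply every queued operation that is at least `delay` seconds old."""
        while self.pending and self.pending[0][0] + self.delay <= now:
            _, op, key, value = self.pending.popleft()
            if op == "set":
                self.data[key] = value
            elif op == "drop":
                self.data.clear()  # stands in for a destructive statement

    def discard_pending(self) -> int:
        """Drop everything not yet applied; returns how many ops were discarded."""
        count = len(self.pending)
        self.pending.clear()
        return count


replica = DelayedReplica(delay=3600)        # one hour behind the primary
replica.receive(0, "set", "customers", 42)
replica.receive(4000, "drop")               # the accidental wipe
replica.apply_due(4100)                     # the drop is not old enough yet
replica.discard_pending()                   # caught in time: throw it away
```

After these calls the replica still holds the `customers` entry, while an undelayed replica would have been wiped along with the primary. The trade-off is that the delayed copy is always out of date by up to the delay, so it complements regular replicas and backups rather than replacing them.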
It’s never too late…
…to start working on your reliability. The bigger the system is, the harder it will be to set up mechanisms to keep it up and healthy. Don’t wait until the s**t hits the fan. Start today!
So how do you handle outages?