2 Things that went very wrong today
This morning we launched 22 new monitoring locations. The actual launch went smoothly. Until it didn't.
I don’t know which of these two is worse, but I value transparency and honesty, so here is what happened. And what we’re doing so that it doesn’t happen in the future.
1 - Communication Failure
Adding new monitoring locations should have been much better communicated. Many customers have our IPs in their whitelist to allow monitoring from the public domain. We should have sent an email to all customers, at least one week in advance. That email should also have contained the list of IPs.
What we did instead was send a newsletter just one day before. But this is not the worst part. What I didn’t realize is that not all customers have subscribed to our updates newsletter, so they had no way of knowing that this update was happening.
What are we doing to prevent this in the future
As I’m writing this, we’re crafting a way to easily send important announcements to all customers, regardless of their opt-in to our newsletter.
Secondly, all monitoring node updates will be communicated with at least 7 days’ notice. I deeply hope we never, ever do this again.
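The two rules above can be sketched roughly as follows. This is only an illustration, not our actual mailing code: the `Customer` type, the `newsletter_opt_in` field, and both function names are hypothetical, and the dates are made up.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Customer:
    # Hypothetical customer record, for illustration only.
    email: str
    newsletter_opt_in: bool


def recipients(customers, service_announcement):
    """Return the email addresses a message should go to.

    Service announcements (such as new monitoring IPs) go to every
    customer; regular newsletters go only to those who opted in.
    """
    if service_announcement:
        return [c.email for c in customers]
    return [c.email for c in customers if c.newsletter_opt_in]


def latest_announcement_date(launch, notice_days=7):
    """Last day an announcement can go out and still give full notice."""
    return launch - timedelta(days=notice_days)


customers = [
    Customer("a@example.com", newsletter_opt_in=True),
    Customer("b@example.com", newsletter_opt_in=False),
]
everyone = recipients(customers, service_announcement=True)
subscribers = recipients(customers, service_announcement=False)
deadline = latest_announcement_date(date(2024, 1, 15))
```

The point of splitting on `service_announcement` is that the opt-in flag should never decide whether a customer hears about a change that can break their monitoring.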
2 - Alerting Failure
After the new nodes went live, many customers started getting outage alerts, at such a rate that our Twilio funds were depleted faster than we could top them up. On top of that, Twilio's auto-top-up feature also failed because the credit card on file was out of funds.
As a result, 285 alerts (SMS and phone calls) were not sent. People were not notified that their websites were down.
What are we doing to prevent this in the future
First, we’ve increased the top-up threshold tenfold, so that the top-up happens at least one week before we’d run out of funds. Second, we’re changing the payment method to a card that always has funds.
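To show the arithmetic behind "at least one week before we'd run out", here is a minimal sketch. It does not call the Twilio API, the function names are hypothetical, and the spend figures are invented for the example.

```python
def topup_threshold(daily_spend, lead_days=7):
    """Balance at or below which a top-up should trigger.

    If we spend `daily_spend` per day, topping up once the balance
    reaches this value leaves about `lead_days` days of funds.
    """
    if daily_spend < 0 or lead_days < 0:
        raise ValueError("daily_spend and lead_days must be non-negative")
    return daily_spend * lead_days


def days_of_runway(balance, daily_spend):
    """How many days the current balance lasts at the given spend rate."""
    if daily_spend <= 0:
        return float("inf")
    return balance / daily_spend


def needs_topup(balance, daily_spend, lead_days=7):
    """True once the balance has dropped to the top-up threshold."""
    return balance <= topup_threshold(daily_spend, lead_days)


# Hypothetical numbers: 12.0 per day spent on SMS and calls.
threshold = topup_threshold(12.0)
runway = days_of_runway(84.0, 12.0)
```

One caveat with this kind of calculation: an alert storm like today's raises the daily spend well above the average, so the rate plugged in should be closer to a worst-case day than a typical one.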
Final thoughts
I apologize for what happened. I feel bad because all this could have been avoided with a bit of careful planning. I guess that, in my excitement about the monitoring network upgrade, I got distracted and neglected the proper steps to do this right.
– Photo by Sarah Kilian on Unsplash