Eventually, things will go south. Usually when you’re not looking.
That happened to our service yesterday morning. For some reason, the data storage server ran out of disk space without us noticing, and it stopped recording new information such as latency, incidents, and alerts. You’ll probably see this as a gap in the hourly availability chart:
Our failover server had not been properly set up again after some maintenance work (the Meltdown and Spectre kernel updates), so nothing else was in place to record the data.
So I have to admit that, sadly, the data from that period is gone for good. It’s not something to be proud of, but we’ve always been truthful with our users, so it is only fair that we share the bad news as well as the good.
The API was also affected: every data-writing operation failed during the outage.
Since then, we’ve restored the balance within our infrastructure, including setting up the failover server properly, so that this doesn’t happen again in the (near, and hopefully far) foreseeable future.
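For anyone who wants to guard against the same kind of surprise, a free-space check can be very small. The sketch below is only an illustration, not our actual monitoring code: the function names `used_fraction` and `disk_is_low`, the 90% threshold, and the checked path are all assumptions made for the example.

```python
import shutil


def used_fraction(total: int, used: int) -> float:
    """Return the used share of a volume as a value between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total


def disk_is_low(path: str = "/", max_used: float = 0.9) -> bool:
    """Return True when the volume holding `path` is at or above `max_used`.

    The 0.9 default is an arbitrary example threshold, not a recommendation.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return used_fraction(usage.total, usage.used) >= max_used


if __name__ == "__main__":
    # Hypothetical usage: run this on a schedule and alert on the warning.
    if disk_is_low("/"):
        print("WARNING: disk usage is above the configured threshold")
```

Run on a schedule (cron, for example) and wired to whatever alerting you already have, a check like this would raise a warning before writes start failing rather than after.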