My First Devops Failure and Why I'm Leaving Digital Ocean

This past week, I had my first ever true devops failure. I lost almost a month of production data for a student group website I manage, and like most devops failures, it was my own fault as the user.

What happened

Our student group advisor recently left the university, and during the transition to the new advisor, her student group debit cards were cancelled. I'd actually chosen to use her card for more stability, because the student treasurer changed so often that their cards never lasted for more than a few months. My first failure was not realizing her cards would be cancelled, and the impact that would have on our server provider, Digital Ocean. I don't regularly monitor our group email account, nor do I regularly need to log into Digital Ocean. So when our payment methods began to fail, I missed all notices (and there were multiple) until our account was suspended and the droplet turned off.

Once I was notified that the website was down, I immediately tried to rectify it by restarting the Docker container holding the website, but my SSH connection to the droplet was timing out. So I logged into Digital Ocean, where I was greeted by a notice that the account was suspended. The account had been suspended at 8am with the following message:

Please note that you must resolve the outstanding balance within the next week to restore access to your Droplets. After that time has passed, your Droplets will have been destroyed. We will not be able to recover destroyed Droplets.

I fixed the failing payment methods and paid the outstanding balance in its entirety by 6pm, about 10 hours after the account was suspended.

After paying the outstanding balance, I anticipated I could restart the server and everything would go back to normal. Except the droplet was missing. I immediately wrote up a ticket to support, asking how I could recover access to my droplet, given I had paid the outstanding balance before the droplets were due to be destroyed. Eight hours later, I got a reply.

Your account was suspended today for non-payment after we made several attempts to contact you about your balance. Our policy is to remove customer data after two weeks of delinquency but we accidentally ran an account purge action on your Droplets today. This means that your data has been removed permanently.

Digital Ocean accidentally purged my data, despite their policy against doing so... and it looks like I wasn't the only one affected, nor was I the only one to receive the cookie-cutter response. Digital Ocean refunded the outstanding balance to the payment card, but only about 12 hours after I asked for it, which I did because I had seen they'd done so for other customers.

Recovery

With the server entirely destroyed, I was forced to restore from backup. Luckily, I do take backups of the server. But this is my second failure - my backups were manual and sparse at best. My policy was to take a backup whenever I performed maintenance. Unfortunately I had done maintenance the night before, but failed to take a backup. This meant my last backup was from noon on May 25th, nearly a month old. I immediately restored from the old backup, and the student group is using external records to restore what data we can recover.

In the future

At the very least I can learn from this mistake. I've set up a script to remotely and regularly back up the database. I'm still workshopping the script to be a little more secure; pull requests welcome. I'll also be setting regular reminders to check the status of accounts whose email I don't regularly monitor. I can't do anything about the volatility of payment methods for student groups, but I can try to mitigate the risks.
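
For anyone wanting to do something similar, here is a minimal sketch of the rotation side of such a setup. It is not my actual script. It assumes a database dump file has already been produced and pulled down by some other step, and every name in it (store_backup, prune_backups, newest_backup_age, the "db" prefix, the ".dump" extension) is hypothetical. It saves each dump under a timestamped name, keeps only the newest few, and reports how old the newest backup is so a stale one gets noticed.

```python
import shutil
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Fixed-width UTC timestamp, so file names sort in chronological order.
STAMP = "%Y%m%dT%H%M%SZ"


def store_backup(dump_file, backup_dir, when=None, prefix="db"):
    """Copy an existing dump into backup_dir under a timestamped name."""
    when = when or datetime.now(timezone.utc)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{prefix}-{when.strftime(STAMP)}.dump"
    shutil.copy2(dump_file, target)
    return target


def prune_backups(backup_dir, keep=14, prefix="db"):
    """Delete all but the newest `keep` backups; return the removed paths."""
    backups = sorted(Path(backup_dir).glob(f"{prefix}-*.dump"))
    stale = backups[:-keep] if keep > 0 else backups
    for path in stale:
        path.unlink()
    return stale


def newest_backup_age(backup_dir, now=None, prefix="db"):
    """Return the age of the newest backup, or None if there are none."""
    backups = sorted(Path(backup_dir).glob(f"{prefix}-*.dump"))
    if not backups:
        return None
    stamp = backups[-1].stem[len(prefix) + 1:]
    taken = datetime.strptime(stamp, STAMP).replace(tzinfo=timezone.utc)
    return (now or datetime.now(timezone.utc)) - taken
```

Run on a schedule from a machine other than the server itself, something like this would have left me with a backup at most a day old instead of nearly a month. An age check such as newest_backup_age(backup_dir) > timedelta(days=2) is one way to trigger an alert when backups silently stop.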

Conclusion

Backups are important. It's also important to choose your hosting provider wisely. Hopefully you can choose one that won't "accidentally run a purge action on your account". I consider this unacceptable from a hosting provider. I understand that it is foremost my fault for letting the account get suspended, and my fault for not taking backups - but I expect my hosting provider to follow their own policy. This was also a failure for Digital Ocean in standards and customer service. In the near future I will be transitioning all servers I maintain or oversee off Digital Ocean, across a number of student groups and personal droplets. I'll also be advocating against Digital Ocean in the future.