Hosting disaster recovery

Kit Allen 27th of February 2019

This post can be read in conjunction with our Hosting Charges Explained story from last year – KA.

The scenario

A couple of weekends ago, I had my 21st* birthday rudely interrupted by a call from a client on our emergency out-of-hours support number. Their site appeared to be non-functional, and they wondered if we could take a look.

I logged in to the relevant server from home and found that their site was not displaying. Reaching out to our support team, I arranged for a backup to be put in place and checked over as a short-term fix. For completeness’ sake, I also logged in to the server’s control panel to double-check a couple of other accounts.

They, too, were non-functional. The whole server was down.

Uh-oh.

How we reacted

We sent out an emergency message to our support and development teams and arranged to meet at the Creatomatic office immediately. Within 15 minutes we had two senior team members on-site and three more working remotely to rectify the issue. We established that this was indeed a ‘worst-case scenario’ – the entire server was inaccessible, and effectively a write-off.

Half an hour after the issue was reported, we had a list of all clients affected; a plan of action in place; a team member managing client communications to keep affected users updated; and two more team members preparing individual accounts for transfer to a ready-and-waiting backup server.

Working from the most recent clean backup, we connected to our backup server – which exists solely as insurance for just this type of scenario – and restored all accounts and data. We then worked through the list of accounts to repoint domain DNS to the backup server. We were relatively lucky in that the majority of affected domains had already been migrated to our managed service, allowing them to be batch updated; these sites were up and running again within 90 minutes.

Finally, we established the nature of the issue and immediately put steps in place to ensure that it could not affect any other live servers, before notifying all affected clients of the problem and our response.

The entire process – from first notification to full restoration – took under four hours, with a technician remaining on call over the rest of the weekend to monitor the server and support channels as an additional precaution.

Why we’re telling you this

The point of this story is partly a vainglorious boast about how well Team Creatomatic reacted to a crisis under pressure. Less flippantly, it is also meant to highlight the exceptional level of service that Creatomatic hosting clients receive. It’s episodes like this that illustrate how seriously we take our responsibility to keep our clients’ websites running at all times.