If you are responsible for the data in your business, whether it be your job or your business, then it is your a$$ that is on the line should disaster strike.
Much has been written about business continuity and disaster recovery, but it is fair to say that in many – if not most – organisations it is way down the list of concerns, as it is perceived to be a low risk and therefore not worth the likely expenditure. However, you may be surprised, even shocked, at how a common hardware failure, in the wrong combination, can trigger a disaster scenario; and if you are not prepared for it the consequences can be catastrophic for your business, and maybe even terminal for your job!
We provide disaster recovery services from remote sites into our data centre utilising VMware and Zerto replication software. We (or should I say an organisation we work with) have recently experienced just such a run of bad luck. While obviously redacting the name of the organisation involved, we feel that sharing this experience will serve to highlight just how far up the ladder of importance BC/DR should be in a world where IT is not just part of your business, it IS your business!
Company X had taken a considerable amount of time to be convinced that having an offsite DR solution would be beneficial to their business. The point had been driven home by an application failure, after which we were informed they stood to lose 6 figures per DAY in revenue from such an outage. This in itself, we thought, would add some urgency to the remediation, but alas things are not always that straightforward.
Unfortunately there then followed several weeks of contract negotiation in which the lawyers looked to get their pound of flesh, nit-picking over contract clauses that had little material impact – as far as we were concerned anyway. In fact some clauses were introduced that we felt actually benefitted us – go figure.
In any case the contract was duly signed with a live date of September 1st and we started the process of building the DR site and installing Zerto at the client site ready to replicate data. Indeed some test replication of data had already been completed in the first week after contract signature to prove the links and processes.
On the evening of August 7th three disks in the customer SAN failed simultaneously. It appears to have been a data error that threw the disks offline rather than a full hardware failure, but the effect was the total destruction of 1 of the 4 LUNs on the SAN. This LUN contained a number of virtual servers, including, unfortunately, the backup systems and the vCenter server, as well as mail and other critical systems such as their primary DB server.
So, three disks and the entire site was out. Due to the capacity issues on site (another story) there was little room to recover everything, even if the backup system could be recovered. The backup data was safely ensconced on a different piece of storage, but as it was a dedupe store, without the backup software it was not coming back any time soon. The backup setup was designed for specific data recovery, not disaster recovery.
First piece of luck: using Zerto we had replicated their main DB server to our site, and this contained not only much of their critical business data but also the vCenter database. Their application server was still up at the original site, so we managed to get their main app back online utilising a VPN between sites and a DNS change, just in time for a very important submission deadline the following day. Had the DB not been available offsite that deadline would have been missed – lesson 1.
Second piece of luck: we had also replicated their print server, which served a dozen remote depots and was configured with 80+ printers. Again all that was needed was to configure a new set of VPNs and make a DNS change, and everybody could print without having to wait for the print server to be recovered – lesson 2.
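For anyone wanting to confirm that a DNS change like the ones above has actually taken effect, here is a minimal sketch in Python. It is not the tooling we used on the day; the hostnames and addresses are hypothetical placeholders (the addresses come from the ranges reserved for documentation), and it only checks what the local resolver returns, so cached records elsewhere may lag behind.

```python
import socket


def points_at_dr(hostname, dr_address, resolve=socket.gethostbyname):
    """Return True if hostname currently resolves to the DR site address.

    `resolve` defaults to the standard library resolver; it is a parameter
    only so the check can be exercised without touching real DNS.
    """
    try:
        return resolve(hostname) == dr_address
    except OSError:
        # socket.gaierror is a subclass of OSError: treat "cannot resolve"
        # as "not yet cut over" rather than crashing the check.
        return False


def pending_cutovers(records, resolve=socket.gethostbyname):
    """List the hostnames that do not yet resolve to their DR address.

    `records` maps hostname -> expected DR address (hypothetical values).
    """
    return [
        host
        for host, dr_address in records.items()
        if not points_at_dr(host, dr_address, resolve)
    ]


# Hypothetical names and addresses, for illustration only.
EXAMPLE_RECORDS = {
    "db.companyx.example": "203.0.113.10",
    "print.companyx.example": "203.0.113.20",
}
```

Calling `pending_cutovers(EXAMPLE_RECORDS)` returns the names still pointing somewhere other than the DR site; an empty list suggests the change has reached the resolver you are testing from.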
We were then free to rebuild the backup server, connect to the data repository and recover vCenter to make life a little easier managing what was left of their main site. Here again the job was made so much easier because the vCenter database was already up and running – lesson 3.
The primary lesson, though, was that had the DR been in place on August 1st, all we would have had to do was enact a failover and the whole site would have been up and running just a few DNS changes later. They would have had no system outage, as the failure occurred overnight, and the ongoing problems recovering their 600GB mail server would not have been an issue. How much the nit-picking lawyers cost them, along with the time taken to act even once a decision had been made, is not likely ever to be truly calculated. All because of 3 lousy spinning disks; not a fire, not an earthquake, not an outbreak of war.
So the next time you consider disaster planning, perhaps you should consider what will actually put you in a disaster situation. You may realise that it is far more likely to happen than you assumed, which makes mitigation a much more realistic prospect.
It is not so much "can you afford it?" as "can you afford to ignore it?", although in the above case the implementation costs are likely to be a fraction of the resulting cost of the 3 disk barf – even including the extra out-of-contract work we had to do to get them back from the brink.
For more info on how we can help you plan your VMware BC/DR with Zerto replication software, email firstname.lastname@example.org today.