Hanu Kommalapati (Microsoft) and I wrote this with tremendous assistance from Jason Roth (Microsoft). It fills a key gap in Microsoft’s documentation story with respect to Azure.

This paper focuses on high availability for applications running in Windows Azure. An overall strategy for high availability also includes disaster recovery (DR). Planning for failures and disasters in the Cloud requires you to recognize failures quickly and implement a strategy that matches your tolerance for application downtime. You also have to consider how much data loss the application can tolerate, without adverse business consequences, while it is being restored.

When we ask customers if they are prepared for temporary and large-scale failures, most say they are. However, before you answer that question for yourself, ask: does your company rehearse these failures? Do you test the recovery of databases to ensure you have the correct processes in place? Probably not. That’s because successful DR starts with lots of planning and architecting to implement these processes. Like many other non-functional requirements, such as security, disaster recovery rarely gets the up-front analysis and time allocation it requires. Also, most customers don’t have the budget for geographically distributed datacenters with redundant capacity. Consequently, even mission-critical applications are frequently excluded from proper DR planning.

Cloud platforms, such as Windows Azure, provide geographically dispersed datacenters around the world. These platforms also provide capabilities that support availability and enable a variety of DR scenarios. Now, every mission-critical Cloud application can be given due consideration for disaster proofing. Windows Azure has resiliency and DR built into many of its services. These platform features must be studied carefully and supplemented with application strategies.

The whitepaper outlines the architecture steps needed to disaster-proof a Windows Azure deployment so that the larger business continuity process can be implemented. A business continuity plan is a roadmap for continuing operations under adverse conditions. The adverse condition could be a technology failure, such as a downed service, or a natural disaster, such as a storm or power outage. Application resiliency for disasters is only a subset of the larger DR process as described in this NIST document: Contingency Planning Guide for Information Technology Systems.

You can find the whitepaper at this MSDN location:
http://msdn.microsoft.com/en-us/library/dn251004.aspx
