This is the second substantive module in my new Microsoft Azure Administration mini-series, hosted primarily on the Aditi Technologies blog, which parallels my just-released Pluralsight course entitled “Microsoft Azure Administration New Features (March 2014)”.
Traffic Manager enables you to improve the availability of your critical applications by monitoring your hosted Azure services. It provides automatic routing to alternative replica sites by applying an intelligent policy engine to the DNS queries on your domain names, based upon your choice of three load balancing methods: performance, failover, and round robin. We’ll talk about all three of these in more detail shortly.
Azure allows you to run cloud services in datacenters located around the world. Traffic Manager can manage your traffic in different ways based upon what routing emphasis you tell it to use. It can improve the responsiveness of your applications and content delivery times by directing end-users to the cloud service that is closest to them (in terms of network latency). Or when one cloud service is brought down, perhaps for maintenance, Traffic Manager will route user traffic to the other available cloud services that you define in the Traffic Manager profile. This helps you to maintain and upgrade your services without downtime. Or, if you have no particular concern about a down server or about performance, you can use simple round-robin distribution to balance the load evenly among two or more nodes in the configuration.
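To make the three routing emphases concrete, here is a minimal Python sketch of how each one might choose among endpoints. This is only an illustration of the selection logic described above, not the Traffic Manager API: the `Endpoint` class, the function names, the `example.net` host names, and the latency and priority numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str          # hypothetical DNS name of a cloud service
    healthy: bool      # outcome of the most recent health probe
    latency_ms: float  # illustrative latency as seen from the caller
    priority: int      # failover order; lower number is preferred


def healthy_only(endpoints):
    """Keep only the endpoints that passed their last health probe."""
    return [e for e in endpoints if e.healthy]


def pick_performance(endpoints):
    """Performance: the healthy endpoint with the lowest latency."""
    candidates = healthy_only(endpoints)
    return min(candidates, key=lambda e: e.latency_ms) if candidates else None


def pick_failover(endpoints):
    """Failover: the healthy endpoint that is first in priority order."""
    candidates = healthy_only(endpoints)
    return min(candidates, key=lambda e: e.priority) if candidates else None


class RoundRobin:
    """Round robin: rotate evenly through the healthy endpoints."""

    def __init__(self):
        self._next = 0

    def pick(self, endpoints):
        candidates = healthy_only(endpoints)
        if not candidates:
            return None
        chosen = candidates[self._next % len(candidates)]
        self._next += 1
        return chosen


if __name__ == "__main__":
    endpoints = [
        Endpoint("app-us.example.net", True, 120.0, 1),
        Endpoint("app-eu.example.net", True, 40.0, 2),
        Endpoint("app-asia.example.net", False, 20.0, 3),
    ]
    print("performance:", pick_performance(endpoints).name)
    print("failover:   ", pick_failover(endpoints).name)
    rr = RoundRobin()
    print("round robin:", [rr.pick(endpoints).name for _ in range(3)])
```

Note that all three policies skip the unhealthy endpoint, even though it has the lowest latency. Monitoring comes first, and the routing method only decides among the services that are still up.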
So this gives you a cursory introduction to Azure Traffic Manager (ATM). But for you to be able to leave this module with more than just a few conversational facts about ATM, we first need to go back to school and talk about high availability and disaster recovery (HA/DR) in the Cloud. Without that, you will not be able to apply the ATM concepts to your real-life solutions. Skipping this would be like visiting France without learning the French words for a few basics, such as food, bathroom, wine, and airport. So let’s first talk about HA/DR and see how it applies to Cloud architectures. Only with this basic understanding of Cloud HA/DR will you be able to use ATM optimally for your solution.
High Availability \ Disaster Recovery
I want to park on this slide for a few minutes because I think it’s really important to have some idea of what it means for solutions to be highly available and to recover from a disaster. The reason I say that is when I ask customers if they are prepared for temporary and large-scale failures, most say they are. However, before you answer that question for yourself, does your company rehearse these failures? Do you test the recovery of databases to ensure you have the correct processes in place? Chances are you do not. That’s because successful DR starts with lots of planning and architecting to implement these processes. Just like many other non-functional requirements, such as security, disaster recovery rarely gets the up-front analysis and time allocation required. Also, most customers don’t have the budget for geographically distributed datacenters with redundant capacity. Consequently, even mission critical applications are frequently excluded from proper DR planning.
Azure provides geographically dispersed datacenters around the world. The platform also provides capabilities that support availability and a variety of DR scenarios. Now, every mission critical Cloud application can be given due consideration for disaster proofing. Windows Azure has resiliency and DR built into many of its services. These platform features must be studied carefully and supplemented with application strategies.
Note that a full discussion of HA/DR in Azure would take us a few hours to work through. But in the next few slides I will touch on some of the factors to make you aware of them. You also need to have an awareness of your HA/DR strategy to properly use the Azure Traffic Manager to meet the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements of your application.
The recovery time objective is the maximum amount of time allocated for restoring application functionality. This is based on business requirements and is related to the importance of the application. Critical business applications require a low RTO.
The recovery point objective is the acceptable time window of lost data due to the recovery process. For example, if the RPO is one hour, then the data must be completely backed up or replicated at least every hour. Once the application is brought up in an alternate datacenter, the backup data could be missing up to an hour of data. Like RTO, critical applications target a much smaller RPO.
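The arithmetic behind these two objectives is simple enough to sketch in a few lines of Python. This is a hedged illustration only: the function names and the sample timestamps are made up for the example, and the one-hour RPO mirrors the example in the text.

```python
from datetime import datetime, timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """In the worst case a failure hits just before the next backup,
    so up to one full backup interval of data can be lost."""
    return backup_interval <= rpo


def data_lost(last_backup: datetime, failure: datetime) -> timedelta:
    """Window of data missing after restoring from the last backup."""
    return failure - last_backup


def meets_rto(failure: datetime, restored: datetime, rto: timedelta) -> bool:
    """True if functionality was restored within the allowed time."""
    return restored - failure <= rto


if __name__ == "__main__":
    rpo = timedelta(hours=1)
    rto = timedelta(hours=4)  # hypothetical business requirement

    # Backing up every 30 minutes satisfies a one-hour RPO; every 2 hours does not.
    print(meets_rpo(timedelta(minutes=30), rpo))
    print(meets_rpo(timedelta(hours=2), rpo))

    last_backup = datetime(2014, 3, 1, 9, 0)
    failure = datetime(2014, 3, 1, 9, 45)
    restored = datetime(2014, 3, 1, 12, 15)
    print(data_lost(last_backup, failure))     # data missing after recovery
    print(meets_rto(failure, restored, rto))   # recovered in time?
```

The point of the sketch is that RPO constrains how often you back up or replicate, while RTO constrains how long the recovery process itself may take.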
Key Factors – The implementation of the application needs to factor in the probability of a capability outage. It also needs to consider the impact the outage will have on the application from the business perspective before diving deep into the implementation strategies. Without due consideration of the business impact and the probability of hitting the risk condition, the implementation can be expensive and potentially unnecessary. The determining factors of RTO, RPO, and budget help you outline a strategy that works for you and your applications. Most likely, not all applications in your portfolio will be treated the same with respect to HA/DR, because of cost.
High Availability – A highly available cloud application implements strategies to absorb outages of its dependencies, such as the managed services offered by the cloud platform. In spite of possible failures of the Cloud platform capabilities, this approach permits the application to continue to exhibit the expected functional and non-functional systemic characteristics as defined by the designers. A highly available application absorbs fluctuations in availability, load, and temporary failures in the dependent services and hardware. The application continues to operate at an acceptable user and systemic response level, as defined by business requirements or application service level agreements.
Consider an automotive analogy for high availability. Even quality parts and superior engineering do not prevent occasional failures. For example, when your car gets a flat tire, the car still runs, but it is operating with degraded functionality. If you planned for this potential occurrence, you can use one of those thin-rimmed spare tires until you reach a repair shop. Although the spare tire does not permit fast speeds, you can still operate the vehicle until the tire is replaced. In the same way, a cloud service that plans for potential loss of capabilities can prevent a relatively minor problem from bringing down the entire application. This is true even if the cloud service must run with degraded functionality.
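In code, the "spare tire" often takes the form of a fallback path. The Python sketch below shows one possible way to do this, assuming a hypothetical dependency that raises `ConnectionError` when it is unavailable. The function name, the `fetch_live` callable, and the dictionary cache are all illustrative and not part of any Azure SDK.

```python
def get_with_fallback(key, fetch_live, cache):
    """Return (value, degraded).

    Try the live dependency first. If it is unavailable, serve the
    last known value from the cache and flag the response as degraded
    instead of failing the whole request.
    """
    try:
        value = fetch_live(key)
        cache[key] = value           # remember the latest good value
        return value, False
    except ConnectionError:
        # The spare tire: older data, but the application keeps running.
        return cache.get(key), True


if __name__ == "__main__":
    cache = {}

    def live_ok(key):
        return f"fresh data for {key}"

    def live_down(key):
        raise ConnectionError("dependency unavailable")

    print(get_with_fallback("orders", live_ok, cache))
    print(get_with_fallback("orders", live_down, cache))
```

The caller can use the `degraded` flag to decide what to tell the user, for example by showing a notice that the data may be slightly out of date.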
Disaster Recovery – Unlike the temporary failure management for high availability, disaster recovery (DR) revolves around the catastrophic loss of application functionality. For example, consider the scenario where one or more datacenters go down. In this case you need to have a plan to run your application or access your data outside of the affected datacenter. Execution of this plan revolves around the people, processes, and supporting applications that allow the system to function. The level of functionality for the service during a disaster is determined by the business and technology owners who define its disaster operational mode. That can take many forms, from completely unavailable to partially available (degraded functionality or delayed processing) to fully available.
A cloud deployment might cease to function due to a systemic outage of the dependent services or the underlying infrastructure. Under such conditions, a business continuity plan triggers the disaster recovery (DR) process. This process typically involves both operations personnel and automated procedures in order to reactivate the application at a functioning datacenter. This requires the transfer of application users, data, and services to the new datacenter. This involves the use of backup media or ongoing replication.
Consider the previous analogy that compared high availability to the ability to recover from a flat tire through the use of a spare. By contrast, disaster recovery involves the steps taken after a car crash where the car is no longer operational. In that case, the best solution is to find an efficient way to change cars, perhaps by calling a travel service or a friend. In this scenario, there is likely going to be a longer delay in getting back on the road as well as more complexity in repairing and returning to the original vehicle. In the same way, disaster recovery to another datacenter is a complex task that typically involves some downtime and potential loss of data. This is why the two terms defined earlier, recovery time objective and recovery point objective, are central to understanding and evaluating disaster recovery strategies.
If you go into my course on Pluralsight (http://pluralsight.com/training/courses/TableOfContents?courseName=microsoft-azure-administration-new-features) you will find a lot more information on Microsoft Azure Traffic Manager and be able to watch videos of how to use these concepts. You will also learn how to use Azure Traffic Manager to load balance incoming traffic across multiple hosted Windows Azure Cloud services and Web sites, whether they’re running in the same datacenter or across different datacenters around the world. By effectively managing traffic, you can ensure high performance, availability, and resiliency of your applications. You will see some basic ATM concepts, then spend a good amount of time understanding High Availability/Disaster Recovery in the Cloud and how you need to plan intentionally for that. We will also examine the three algorithms you can use for load balancing: failover, round robin, and performance. In addition, the course includes a wealth of information and demos on the Azure Scheduler, Azure Recovery Services, Azure Management Services, Azure BizTalk Services, HDInsight, and improvements to Azure Storage and Web Sites. Hope to see you there!