Building Global Web Applications With the Windows Azure Platform – Dynamic Work Allocation and Scale out
May 17, 2011
Today I would like to finish the discussion on ‘understanding capacity’ for my ‘Building Global Web Applications With the Windows Azure Platform’ series, by talking about the holy grail of cloud capacity management: Dynamic work allocation and scale out.
The basic idea is simple: keep all roles at full utilization before scaling out.
To make optimal use of the capacity that you’re renting from your cloud provider, you could design your system in such a way that it is aware of its own usage patterns and acts upon them. For example, if role 3 is running too many CPU intensive jobs and role 1 has excess capacity, the system could decide to move some CPU intensive workloads off of role 3 and onto role 1. It repeats these steps for all workload types and tries to keep every role below 80% of its capacity; only when that is no longer possible does it decide to scale out.
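To make the idea a bit more concrete, here is a minimal sketch of such a rebalancing step in Python. It is not taken from any real system: the `Role` class, the `rebalance` function and the greedy "smallest workload first" strategy are all assumptions of mine for illustration, and it pretends that every workload has a known, fixed CPU share.

```python
from dataclasses import dataclass, field

THRESHOLD = 0.80  # the 80% target mentioned above


@dataclass
class Role:
    """Hypothetical model of a role: a name plus the workloads it hosts."""
    name: str
    workloads: dict = field(default_factory=dict)  # workload name -> CPU share (0.0 to 1.0)

    @property
    def load(self) -> float:
        return sum(self.workloads.values())


def rebalance(roles):
    """Move workloads from roles above THRESHOLD to the least loaded role.

    Returns (moves, scale_out): the list of (workload, source, target) moves,
    and a flag that is True when moving workloads around was not enough to
    get every role under the threshold.
    """
    moves = []
    for source in sorted(roles, key=lambda r: r.load, reverse=True):
        # Greedy assumption: try the smallest workloads first.
        for name, cost in sorted(source.workloads.items(), key=lambda kv: kv[1]):
            if source.load <= THRESHOLD:
                break
            target = min((r for r in roles if r is not source),
                         key=lambda r: r.load, default=None)
            if target is None or target.load + cost > THRESHOLD:
                continue  # no other role can absorb this workload
            target.workloads[name] = source.workloads.pop(name)
            moves.append((name, source.name, target.name))
    scale_out = any(r.load > THRESHOLD for r in roles)
    return moves, scale_out
```

With role 1 at 20%, role 2 at 50% and role 3 at 95%, this would move one workload from role 3 to role 1 and report that no scale out is needed. Only when no role can absorb the excess does `scale_out` come back as `True`.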
Turns out though that implementing this is not so straightforward…
First of all you need to be able to move workloads around at runtime. Every web and worker role needs to be designed in such a way that it can dynamically load workloads from some medium and start executing them. But it also needs to be able to unload them again. In effect your web or worker role becomes nothing more than an agent that administers the workloads on the machine instead of executing them itself.
In the .NET environment this means that you need to start managing separate appdomains or processes for each workload. Here you can find a sample where I implemented a worker role that can load other workloads dynamically from blob storage into a separate appdomain, in response to a command that you can send from a console application. This sort of proves that moving workloads around is technically possible.
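The agent idea can be sketched in a few lines. Note that this is not the .NET sample mentioned above: it is a Python illustration that swaps appdomains for operating system processes and blob storage for a plain string of source code. The `WorkloadAgent` class and its `load`, `unload` and `handle` methods are hypothetical names.

```python
import subprocess
import sys


class WorkloadAgent:
    """Hypothetical agent: it administers workloads but never runs them in-process."""

    def __init__(self):
        self._running = {}  # workload name -> subprocess.Popen

    def load(self, name, source):
        """Start the workload's source code in its own Python process."""
        if name in self._running:
            raise ValueError(f"workload {name!r} is already loaded")
        self._running[name] = subprocess.Popen([sys.executable, "-c", source])

    def unload(self, name, timeout=5.0):
        """Stop the workload's process and forget about it."""
        process = self._running.pop(name)
        process.terminate()
        try:
            process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()  # the workload ignored the polite request
            process.wait()

    def loaded(self):
        """Names of the workloads whose process is still alive."""
        return sorted(n for n, p in self._running.items() if p.poll() is None)

    def handle(self, command, name, source=None):
        """Dispatch a command, such as one sent from a console application."""
        if command == "load":
            self.load(name, source)
        elif command == "unload":
            self.unload(name)
        else:
            raise ValueError(f"unknown command {command!r}")
```

Calling `agent.handle("load", "sleeper", "import time; time.sleep(30)")` starts the workload in a separate process, and `agent.handle("unload", "sleeper")` stops it again. A real agent would also need to fetch the workload from storage and deal with crashes, which I left out here.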
Even though it is technically quite feasible to move workloads around, the hardest part is the business logic that decides which workloads should be moved, when, and where to. You need to take quite a few things into account!
- Every workload consumes a certain amount of CPU, memory and bandwidth, but these metrics cannot be derived from traditional monitoring information, as that only shows overall usage. So you need to define and compute additional metrics for each individual workload in order to know what the impact of moving that specific workload would be.
- Workloads tend to be rather temporal as well, so heavy CPU usage right now does not mean the workload will consume the same amount in 5 seconds. Simply moving workloads around when you detect a problem is not going to cut it.
- In other words, you need to find ways to accurately predict future usage based on past metrics and user supplied information.
- You need to ensure a workload is moved well before it would actually start consuming resources, as moving the workload itself takes time as well.
- The same problems repeat themselves on the target side: the utilization of the role you would move the workload to is in continuous flux as well.
- I’m only touching the tip of the iceberg here, there is much more to it…
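To give an impression of what "predicting future usage based on past metrics" could look like, here is a small Python sketch. It uses double exponential smoothing (Holt's linear method), which is just one possible technique that I picked for illustration. The `WorkloadForecast` class, the `should_move` function and the smoothing factors are assumptions, and the sketch ignores memory, bandwidth and user supplied information entirely.

```python
from dataclasses import dataclass


@dataclass
class WorkloadForecast:
    """Tracks one workload's CPU share and extrapolates it a few samples ahead."""

    alpha: float = 0.5  # smoothing factor for the level (assumed value)
    beta: float = 0.3   # smoothing factor for the trend (assumed value)
    level: float = 0.0
    trend: float = 0.0
    samples: int = 0

    def observe(self, cpu: float) -> None:
        """Feed one per-workload CPU measurement (0.0 to 1.0)."""
        if self.samples == 0:
            self.level = cpu
        else:
            previous = self.level
            self.level = self.alpha * cpu + (1 - self.alpha) * (previous + self.trend)
            self.trend = self.beta * (self.level - previous) + (1 - self.beta) * self.trend
        self.samples += 1

    def predict(self, steps_ahead: int) -> float:
        """Linear extrapolation of the smoothed usage, clamped to 0.0..1.0."""
        return min(1.0, max(0.0, self.level + steps_ahead * self.trend))


def should_move(forecast, role_headroom, move_duration_steps):
    """Move now if the workload is predicted to outgrow the role's headroom
    by the time a move started now would complete."""
    return forecast.predict(move_duration_steps) > role_headroom
```

The point of `move_duration_steps` is the one made above: because moving a workload takes time, the decision has to be based on where the usage is heading, not on where it is right now. The same forecast would have to be made for the target role before it is chosen.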
Lots of hard work… but in time you will have to go through it. Please keep in mind that this is the way most utility companies make their (enormous amounts of) money: by continuously looking for more accurate ways to use and resell excess capacity.
Alright, now that you understand the concept of capacity and how it can help you to keep your costs down, it is time to move to the next section of this series: how to make your application globally available.