Building Global Web Applications With the Windows Azure Platform – Understanding capacity
May 9, 2011
In this second installment of the ‘Building Global Web Applications’ series, I would like to discuss the concept of ‘capacity’, because I feel that few people realise it is the secret of the utility model, the business model behind cloud computing.
I often hear, and say, that cloud computing is about ‘pay for use’. But this is completely true for only a few resources; for many it means ‘pay for what you could potentially use’, in other words the capacity of a certain resource. Let’s have a look at the pricing table of Windows Azure compute instances as an example:
| Compute Instance Size | CPU | Memory | Instance Storage | I/O Performance | Cost per hour |
|---|---|---|---|---|---|
| Extra Small | 1.0 GHz | 768 MB | 20 GB | Low (5 Mbps) | $0.05 |
| Small | 1.6 GHz | 1.75 GB | 225 GB | Moderate (100 Mbps) | $0.12 |
| Medium | 2 x 1.6 GHz | 3.5 GB | 490 GB | High (200 Mbps) | $0.24 |
| Large | 4 x 1.6 GHz | 7 GB | 1,000 GB | High (400 Mbps) | $0.48 |
| Extra Large | 8 x 1.6 GHz | 14 GB | 2,040 GB | High (800 Mbps) | $0.96 |
When you look at this table, you can see that every Windows Azure role has a ‘capacity’ in terms of CPU, memory, local disk space and I/O (which here actually means bandwidth). In other words, the extra small instance has the potential to run through roughly 1 billion clock cycles per second, hold 768 MB of data in memory, cache 20 GB of data on disk and transfer 5 megabits of data per second.
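To get a feel for what the bandwidth column means in practice, here is a minimal sketch that converts it into transfer time per page. The bandwidth figures come from the table above; the 500,000-byte page weight is purely an assumption for illustration, not a measurement of my test site, and the function name is my own.

```python
# Bandwidth per instance size in megabits per second, taken from the pricing table.
BANDWIDTH_MBPS = {
    "Extra Small": 5,
    "Small": 100,
    "Medium": 200,
    "Large": 400,
    "Extra Large": 800,
}

# Hypothetical page weight (HTML plus images) in bytes; an assumption, not a measured value.
PAGE_BYTES = 500_000


def transfer_seconds(payload_bytes: int, mbps: float) -> float:
    """Seconds needed to push payload_bytes through a link of mbps megabits per second."""
    bits = payload_bytes * 8
    return bits / (mbps * 1_000_000)


if __name__ == "__main__":
    for size, mbps in BANDWIDTH_MBPS.items():
        per_page = transfer_seconds(PAGE_BYTES, mbps)
        # Upper bound only: it ignores latency, protocol overhead and throttling behaviour.
        print(f"{size}: {per_page:.3f} s per page, at most {1 / per_page:.1f} pages/s")
```

Under these assumptions the link of an extra small instance is saturated by only a handful of page views per second, long before the CPU has anything serious to do.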
When serving web pages, your role will start showing a decline in performance as soon as any one of these four capacities is completely utilised. When this happens you might be tempted to either scale up or scale out in order to increase the number of users you can handle. To be honest, this might not be the best idea, because at the same time you are wasting part of the three other capacities of your instance, and every instance you add only adds to that waste.
Last time, I showed you a load test on a single extra small instance that showed signs of running out of capacity when there were more than 30 concurrent users on it. But when monitoring the instance I noticed that neither memory, CPU nor local disk space was a problem. Only 10% of the CPU was utilised; 82% of the memory was utilised, but most of this was by the OS itself; and there was an abundance of free disk space. So the bottleneck must have been the bandwidth…
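The reasoning above can be sketched as a small helper that picks the most utilised capacity and reports how much of the others sits idle. The CPU and memory figures are the ones I observed; the disk and bandwidth figures are placeholders I made up for the example, since I did not measure them directly, and `find_bottleneck` is just a name for this sketch.

```python
from __future__ import annotations


def find_bottleneck(utilisation: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Return the most utilised resource and the unused fraction of every other resource."""
    bottleneck = max(utilisation, key=utilisation.get)
    idle = {
        name: round(1.0 - used, 2)
        for name, used in utilisation.items()
        if name != bottleneck
    }
    return bottleneck, idle


# Fractions of capacity in use on the extra small instance during the load test.
observed = {
    "cpu": 0.10,        # observed
    "memory": 0.82,     # observed, mostly the OS itself
    "disk": 0.05,       # assumption: "an abundance of free disk space"
    "bandwidth": 0.95,  # assumption: inferred, not measured
}

if __name__ == "__main__":
    resource, idle = find_bottleneck(observed)
    print(f"Bottleneck: {resource}")
    for name, fraction in idle.items():
        print(f"  {name}: {fraction:.0%} of capacity unused")
```

Scaling out multiplies the idle fractions along with the bottlenecked one, which is exactly the waste I am trying to avoid.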
Let’s analyse a request and see whether or not this is true. Luckily, Load Impact also has a page analyser that shows you which parts of a page take how much time. As you can see from the results below, most of the time is spent waiting for the first byte of several images (represented by the green bar) and waiting for the download of the larger image (represented by the blue bar). These are all clear indicators of the low I/O performance of an extra small role.
Now in order to increase the utilisation of other capacity types in our role, as well as increase the number of users we can handle, we should remove this bottleneck.
Offloading the static images, which don’t require computation or memory anyway, to another medium such as blob storage or the CDN is one of the prime options. This allows the machine to handle more requests for dynamic pages and thus increases the utilisation of both CPU and memory.
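As a rough illustration of what offloading means on the page side, the sketch below rewrites image paths so that browsers fetch them from blob storage instead of from the web role. The storage account name `myaccount`, the container name `static` and the helper `asset_url` are all hypothetical; only the general `https://<account>.blob.core.windows.net/<container>/<blob>` URL shape is the real one, and the site from my load test was not necessarily built this way.

```python
from urllib.parse import urljoin

# Hypothetical storage account and container; replace with your own.
STATIC_BASE_URL = "https://myaccount.blob.core.windows.net/static/"

# File types that need no computation on the web role.
STATIC_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")


def asset_url(path: str, base_url: str = STATIC_BASE_URL) -> str:
    """Point static images at the offloaded store; leave every other path untouched."""
    if path.lower().endswith(STATIC_EXTENSIONS):
        return urljoin(base_url, path.lstrip("/"))
    return path


if __name__ == "__main__":
    for path in ("/images/banner.jpg", "/products/list"):
        print(path, "->", asset_url(path))
```

Switching to the CDN would then only be a matter of changing the base URL, while dynamic pages keep being served by the role itself.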
Next time we will see what exactly the impact is of offloading images to either blob storage or the CDN and how this compares to scaling out…