Improving throughput with NServiceBus on Windows Azure

One of the things that has always bothered me personally on the ‘NServiceBus – Azure queue storage’ relationship is throughput, the amount of messages that I could transfer from one role to the other per second was rather limited.

This is mainly due to the fact that windows azure storage throttles you at the http level, every queue only accepts 500 http requests per second and will queue up the remaining requests. Given that you need 3 requests per message, you can see that throughput is quite limited, you can transfer less than a hundred messages per second. (Sending role performing 1 post request, receiving role performing 1 get and 1 delete request)

One of the first things that you can do to increase throughput is using the SendMessages() operation on the unicast bus.This operation will group all messages passed into it into 1 single message and send it across the wire. Mind that queue storage also limits message size to 8KB, so in effect you can achieve a maximum improvement of factor 10, given that you have reasonable small messages and use binary formatting.

Secondly I’ve added support to the queue for reading in batches, using the GetMessages operation on the cloud queue client. By default the queue reads 10 messages at a time, but you can use a new configuration setting called BatchSize to control the amount of messages to be read. Mind that the BatchSize setting also influences the MessageInvisibleTime, as I multiply this number by the batchsize to define how long the messages have to stay invisible as overall process time may now take longer.

In the future I may consider even more improvements to increase throughput of queue storage. Like for example using multiple queues at a time to overcome the 500 requests per second limit. But as Rinat Abdullin already pointed out to me on twitter this might have grave consequences on both overall latency and costs. So before I continue with this improvement I have a question for you, do you think this additional latency and costs are warranted?

But even then, there is another throttle in place at the storage account level, which limites all storage operation requests to 5000 requests per second (this includes table storage and blob storage requests), in order to work around this limit you can specify a separate connection string for every destination queue using the following format “queuename@connectionstring”.

Building Global Web Applications With the Windows Azure Platform – Dynamic Work Allocation and Scale out

Today I would like to finish the discussion on ‘understanding capacity’ for my ‘Building Global Web Applications With the Windows Azure Platform’ series, by talking about the holy grail of cloud capacity management: Dynamic work allocation and scale out.

The basic idea is simple, keep all roles at full utilization before scaling out:

To make optimal use of the capacity that you’re renting from your cloud provider you could design your system in such a way that it is aware of it’s own usage patterns and acts upon these patterns. For example, if role 3 is running to many cpu intensive jobs and role 1 has excess capacity, it could decide to move some cpu intensive workloads off of role 3 to role 1. The system repeats these steps for all workload types and tries to maintain a balance below 80% overall capacity before deciding to scale out.

Turns out though that implementating this is not so straight forward…

First of all you need to be able to move workloads around at runtime. Every web and worker role needs to be designed in such a way that it can dynamically load workloads from some medium, and start executing it. But it also needs to be able to unload the workload, in effect your web or worker role becomes nothing more than an agent that is able to administer the workloads on the machine instead of executing them itself.

In the .net environment this means that you need to start managing separate appdomains or processes for each workload. Here you can find a sample where I implemented a worker role that can load other workloads dynamically from blob storage into a separate appdomain in response to a command that you can send from a console application. This sort of proves that moving workloads around should be technically possible.

Even though it is technically quite feasible to move workloads around, the hardest part is the business logic that decides what workloads should be moved, when and where to. You need to take quite a few things into account!

  • Every workload consumes a certain amount of cpu, memory and bandwith, but these metrics cannot be derived from traditional monitoring information as that only shows overall usage. So you need to define and compute additional metrics for each individual workload in order to know what the impact of moving that specific workload would be.
  • Workloads tend to be rather temporal as well, so a heavy cpu usage right now, does not mean it will consume the same amount in 5 seconds. So just simply moving workloads around when you detect a problem is not going to cut it.
  • In other words, you need to find ways to accurately predict future usage based on past metrics and user supplied information.
  • You need to ensure a workload is moved well before it actually would start consuming resources as moving the workload itself takes time as well.
  • These same problems repeat themselves on the target side, where you would move the workload to as that role’s utilization is in continuous flux as well.
  • I’m only touching the tip of the iceberg here, there is even much more to it…

Lot’s of hard work… but in time you will have to go through it. Please keep in mind that this is the way most utility companies make their (enormous amounts of) money, by continuously looking for more accurate ways to use and resell excess capacity.

Alright, now that you understand the concept of capacity and how it can help you to keep your costs down. It is time to move to the next section of this series: how to make your application globally available.

Building Global Web Applications With the Windows Azure Platform – Monitoring

In the fourth installment of the series on building global web applications I want to dive a bit deeper into monitoring your instances, as measuring and monitoring is key to efficient capacity management. The goal of capacity management should be to optimally use the instances that you have, ideally all aspects of your instances are utilised for about 80% before you decide to pay more and scale out.

Windows azure offers a wide range of capabilities when it comes to monitoring, by means of the WAD (Windows Azure Diagnostics) service, which can be configured to expose all kinds of information about your instances, including event logs, trace logs, IIS logs, performance counters and many more. The WAD can be configured both from code as by means of a configuration file that can be included in your deployment. See for more details on this configuration file.

Personally I prefer using the configuration file for anything that is not specific to my code, like machine level performance counters, but I do use code for things like trace logs. To enable a specific performance counter on all your instances, specify it in the performance counters, including the rate at which the counter should be collected.

<PerformanceCounters bufferQuotaInMB="512" scheduledTransferPeriod="PT1M">
    <PerformanceCounterConfiguration counterSpecifier="\Processor(_Total)\% Processor Time" sampleRate="PT5S" />
    <PerformanceCounterConfiguration counterSpecifier="\Memory\% Committed Bytes In Use" sampleRate="PT5S" />

Note that I only collect processor time and memory consumption from the instances, bandwidth throttling is performed at the network level, not the instance level, so you cannot collect any valuable data for this metric.

The diagnostics manager will transfer this information to your storage account, that you specified in your service configuration file under the key Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString, at the rate mentioned in the ScheduledTransferPeriod property of the PerformanceCounters element.

Now, I admit, today the Windows Azure management tooling offered by MS is a bit lacking in terms of visualising diagnostics and monitoring information. But there is a third party product, Diagnostics Manager by Cerebrata, that covers this gap very well. Here you can see how Diagnostics Manager visualises the memory and cpu usage in my instance.

Note, the consumption rates are very low now, only 20% of memory and just a few percent cpu is effectively used at the time of measurement. this is because I upscaled to a small web role in the mean time and wasn’t executing any tests when monitoring the instance.

So, now that you know how to monitor your instances efficiently it is time to start filling up the free capacity that is sitting idle in your machines, but that is for next time when I will discuss the holy grail of capacity management: dynamically work load allocation.

Building Global Web Applications With the Windows Azure Platform – Offloading static content to blob storage or the CDN

In this third post on building global web applications, I will show you what the impact of offloading images to blob storage or the CDN is in contrast to scaling out to an additional instance. Remember from the first post in this series that I had an extra small instance that started to show signs of fatigue as soon as more than 30 people came over to visit at once. Let’s see how this will improve by simply moving the static content.

In a first stage I’ve moved all images over to blob storage and ran the original test again, resulting in a nice scale up in terms of number of users the single instance can handle. Notice that the increase in users has nearly no impact on our role.  I lost about 50ms in minimum response time though, in comparison to the initial test, but I would happy to pay that price in order to handle more users. If you need faster repsonse times than the ones delivered by blob storage, you really should consider enabling the CDN.

And I’ll prove it with this second test: I enabled the CDN for my storage account, a CDN (or Content Delivery Network) brings files to a datacenter closer to the surfer, resulting in a much better overall experience when visiting your site. As you can see in the following test result, the page response times decrease dramatically, down to 30 percent:

But I can hear you think, what if I would have scaled out instead? If you compare the above results to the test results of simply scaling out to 2 extra small instances, you can see that 2 instances only moved the tipping point from 30 users to 50 users, just doubling the number of users we can handle. While offloading the images gives us a way more serious increase for a much lower cost ($0.01 per 10.000 requests).

Note that the most probable next bottleneck will become memory, as most of the 768 MB’s are being used by the operating system already. To be honest I do not consider extra small instances good candidates for deploying web roles on, as they are pretty limited in 2 important  aspects for serving content, bandwidth and memory. I do consider them ideal for hosting worker roles though, as they have quite a lot of cpu relative to the other resources and their price.

For web roles, intended to serve rather static content, I default to small instances as they have about 1GB of useable memory and 20 times the bandwith of  an extra small role for only little more than twice the price. Still the bandwidth is not excessive, so you still want to offload your images to blob storage and the CDN.

Please remember, managing the capacity of your roles is the secret to benefitting from the cloud. Ideally you manage to use each resource for 80% without ever hitting the limit… Another smart thing to do, is to host background work loads on the same machine as the web role to use the cpu cycles that are often not required when serving relatively static content.

Next time, we’ll have a look at how to intelligently monitor your instances which is a prerequisite to being able to manage the capacity of your roles…

Building Global Web Applications With the Windows Azure Platform – Understanding capacity

In this second installment of the ‘Building Global Web Applications series’, I would like to discuss the concept of ‘Capacity’ as I feel that only few people understand that it is the secret of the utility model, the business model behind cloud computing.

I hear, and tell, very often that cloud computing is about ‘pay for use’. But only for a few resources this is actually completely true, for many it means ‘pay for what you could potentially use’, aka the capacity of a certain resource. Let’s have a look at the pricing table of windows azure compute instances as an example:

Compute Instance Size CPU Memory Instance Storage I/O Performance Cost per hour
Extra Small 1.0 GHz 768 MB 20 GB Low (5 Mbps) $0.05
Small 1.6 GHz 1.75 GB 225 GB Moderate (100 Mbps) $0.12
Medium 2 x 1.6 GHz 3.5 GB 490 GB High (200 Mbps) $0.24
Large 4 x 1.6 GHz 7 GB 1,000 GB High (400 Mbps) $0.48
Extra Large 8 x 1.6 GHz 14 GB 2,040 GB High (800 Mbps) $0.96

When you look at this table, you can see that every windows azure role has a ‘capacity’ in terms of cpu, memory, local disk space and I/O (which actually means bandwidth), in other words the extra small instance has a potential to perform roughly 1 billion instructions per second, store 768 MB of data in memory, cache 20 GB of data on disk and transfer 5 Megabits of data per second.

When serving web pages, your role will start showing a decline in performance when either one of these 4 capacities is completely utilised. When this happens you might be tempted to either scale up or scale out in order to increase the number of users you can handle, but to be honest, this might not be the best idea, because at the same time you’re also wasting part of the 3 other capacities of your instance.

Last time, I showed you a load test on a single extra small instance, that showed signs of running out of capacity when there were more than 30 concurrent users on it. But when monitoring the instance I noticed that neither, memory, cpu nor local disk space were a problem. Only 10% of the cpu was utilitised, 82% of the memory was utilised but most of this was by the OS itself and there was an abundance of free disk space. So the bottle neck must have been the bandwith…

Let’s analyse a request and see whether or not this is true, luckily loadimpact also has a page analyser that shows you which parts of a page take how much time… as you can see from the results below, most of the time is spent on waiting for the first byte of several images (which is represented by the green bar) and waiting for the download of the larger image (represented by the blue bar). All clear indicators of the low i/o performance of an extra small role.

Now in order to increase the utilisation of other capacity types in our role, as well as increase the number of users we can handle, we should remove this bottleneck.

Ofloading the static images, that don’t require computation or memory anyway, to another medium such as blob storage or the CDN is one of the prime options.  This allows the machine to handle more requests for dynamic pages and thus increases the utilisation of both cpu and memory.

Next time we will see what exactly the impact is of offloading images to either blob storage or the CDN and how this compares to scaling out…

Building Global Web Applications With the Windows Azure Platform – Introduction

I don’t know if you noticed, probably not, but I’ve put some content again on This content will serve as a starting point to a new series that I’m writing. In this series I will discuss, step by step, what it takes to build global, highly scalable, highly available, high density and cheap web applications with the windows azure platform.

In this first stage I’ve just built a simple web application, using MVC, with some fairly static content: a razor layout page, a static content body, a css file, some images, nothing fancy… All of this is configured in an extra small webrole uploaded to with only 1 instance. ( is mapped to this address using a CNAME record at my DNS provider).

The general idea behind this series is to build on top of this basic sample, with more functionality and more windows azure features, and try out how the application will behave in the real world, in terms of performance, scalability, availability and so on. In order to achieve this we need to be able to simulate some real life load on our application, so I signed up at which allows me to setup load tests with up to 5000 simulated users.

In a very first test I will ramp up to 50 concurrent users and see if this miniature application can handle it. 50 concurrent users means about a 1000 visits per hour (given that the average stay time is about 3 minutes), or 24000 visitors per day, this should definitly do for my simple site at this stage…

Note: If you want to derive the average number of concurrent users currently on your site, you can use the following formula: concurrent_users = (hourly_visits * time_on_site_in_seconds) / 3600

Now let’s have a look at the results:

1 Extra Small Instance - 50 Concurrent users

As you can see, the site survived the onslaught, albeit barely. There is a significant decline in performance when the number of concurrent users increases over 30, and an almost 300% increase in response time once we reach 50 concurrent users. I’m quite sure the site would actually break down if we increased the numbers only a bit more.

Breaking down is a subjective concept on the web, it does not mean that the webserver actually crashes, it means that the users go away from the page they were intending to visit. This graph shows the average load times for all of the users. Average really means that 50% of the requests took more than the amount of time displayed in the graph. Personally I consider a site broken if it’s requests take more than 3 seconds to load on average, which means 50% of it’s users had to wait more than 3 seconds before they got a response (which they won’t do anymore).

So what can we do to handle this problem? How can we serve more people? We either scale up or scale out, right?

If this is your first reaction, you didn’t get the utility model yet…

Utilities, like windows azure, are all about optimal use of capacity. Windows azure roles offer an array of different kinds of capacity (compute, memory, bandwidth) and I bet that not all of these are optimally used yet and only one of them is the bottleneck…

Next time we will look into this capacity topic a bit further and see how we can get some more juice out of this instance without having to pay a lot more…