Azure Tip – Demystifying Table Service Exceptions

A quick tip for when you’re trying to identify exceptions thrown by the Windows Azure storage services. Usually these look something like ‘The remote server returned an error: (400) Bad Request.’

Which states, well… not much, just that you did something bad. Now how do you go about identifying what happened?

First of all, make sure you catch the exception at its origin, even if that is not in your code. You can do this by enabling Visual Studio’s break-on-throw for System.Net.WebException (Debug > Exceptions, tick ‘Thrown’ for this exception type).

Now you can use the Immediate Window to extract the response body from the http response by issuing the following command:

new System.IO.StreamReader(((System.Net.WebException) $exception).Response.GetResponseStream()).ReadToEnd()

The response you get back contains an error message indicating what is wrong. In my case one of the values I pass in is out of range (sadly it does not say which one).

"<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>\r\n<error xmlns=\"http://schemas.microsoft.com/ado/2007/08/dataservices/metadata\">\r\n <code>OutOfRangeInput</code>\r\n <message xml:lang=\"en-US\">One of the request inputs is out of range.\nRequestId:ba01ae00-6736-4118-ab7f-2793a8504595\nTime:2011-07-28T12:36:21.2970257Z</message>\r\n</error>"

Obviously there are other errors as well; you can find a list of these error codes here: http://msdn.microsoft.com/en-us/library/dd179438.aspx

But the one I got is pretty common; the following docs can help you identify which of the inputs is wrong: http://msdn.microsoft.com/en-us/library/dd179338.aspx

May this post save you some time 🙂

Overcoming message size limits on the Windows Azure Platform with NServiceBus

When using any of the Windows Azure queuing mechanisms for communication between your web and worker roles, you will quickly run into their size limits for some very common use cases.

Consider for example a very traditional use case, where you allow your users to upload a picture and you want to resize it into various thumbnail formats for use throughout your website. You do not want the resizing to be done on the web role: if you do it synchronously the user will be waiting for the result, and if you do it asynchronously the web role will be spending its resources on something other than serving users. So you most likely want to offload this workload to a worker role, allowing the web role to happily continue to serve customers.

Sending this image as part of a message through the traditional queuing mechanisms to a worker is not easy to do. It cannot easily be implemented by means of queue storage, as that mechanism is limited to 8 KB messages, nor by means of AppFabric queues, as they can only handle messages up to 256 KB, and as you know image sizes far exceed these limits.

To work around these limitations you could perform the following steps (roughly sketched in code after the list):

  1. Upload the image to blob storage
  2. Send a message, passing in the image’s Uri and metadata, to the workers requesting a resize.
  3. The workers download the image blob based on the provided Uri and perform the resizing operation
  4. When all workers are done, clean up the original blob
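
A minimal sketch of the first two steps against the StorageClient library (Microsoft.WindowsAzure and Microsoft.WindowsAzure.StorageClient); the queue name, metadata format and local variables are just illustrative placeholders:

// step 1: upload the image to blob storage
var account = CloudStorageAccount.Parse(connectionString);
var container = account.CreateCloudBlobClient().GetContainerReference("images");
container.CreateIfNotExist();

var blob = container.GetBlockBlobReference(Guid.NewGuid().ToString());
blob.UploadByteArray(imageBytes);

// step 2: the queue message itself only carries the blob's Uri and some metadata
var queue = account.CreateCloudQueueClient().GetQueueReference("resize-requests");
queue.CreateIfNotExist();
queue.AddMessage(new CloudQueueMessage(blob.Uri + "|" + fileName + "|" + contentType));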

This all sounds pretty straightforward, until you try to do it; then you run into quite a lot of issues. Among others:

To avoid long latency and timeouts you want to upload and/or download very large images in parallel blocks. But how many blocks should you upload at once and how large should these blocks be? How do you maintain byte order while uploading in parallel? What if uploading one of the blocks fails?

To avoid paying too much for storage you want to remove the large images again. But when do you remove the original blob? How do you actually know that all workers have successfully processed the resizing request? Turns out you can’t know this; due to the built-in retry mechanisms in the queue, for example, the message may reappear at a later time.

Now the good news is, I’ve gone through the ordeal of solving these questions and implemented this capability for Windows Azure into NServiceBus. It is known as the databus, and on Windows Azure it uses blob storage to store the images or other large properties (FYI: on-premises it is implemented using a file share).

How to use the databus

When using the regular Azure host, the databus is not enabled by default. In order to turn it on you need to request custom initialization and call the AzureDataBus method on the configuration instance.

internal class SetupDataBus : IWantCustomInitialization
{
     public void Init()
     {
         Configure.Instance.AzureDataBus();
     }
}

As the databus is implemented using blob storage, you do need to provide a connection string to your storage account in order to make it work properly (it will point to development storage if you do not provide this setting).

<Setting name="AzureDatabusConfig.ConnectionString" value="DefaultEndpointsProtocol=https;AccountName={yourAccountName};AccountKey={yourAccountKey} />

Alright, now that the databus has been set up, using it is pretty simple. All you need to do is specify which of the message properties are too large to be sent in a regular message; this is done by wrapping the property type in the DataBusProperty<T> type. Every property of this type will be serialized independently and stored as a BlockBlob in blob storage.

Furthermore you need to specify how long the associated blobs are allowed to stay alive in blob storage. As I said before, there is no way of knowing when all the workers are done processing the messages, therefore the best approach to avoid flooding your storage account is to provide a background cleanup task that will remove the blobs after a certain time frame. This time frame is specified using the TimeToBeReceived attribute, which must be specified on every message that exposes databus properties.

In my image resizing example, I created an ImageUploaded event that has an Image property of type DataBusProperty<byte[]> which contains the bytes of the uploaded image. Furthermore it contains some metadata like the original filename and content type. The TimeToBeReceived value has been set to an hour, assuming that the message will be processed within an hour.

[TimeToBeReceived("01:00:00")]
public class ImageUploaded : IMessage
{
    public Guid Id { get; set; }
    public string FileName { get; set; }
    public string ContentType { get; set; }
    public DataBusProperty<byte[]> Image { get; set; }
}

That’s it; besides this configuration there is no difference from a regular message handler. It will appear as if the message has been sent to the worker as a whole, all the complexity of sending the image through blob storage is completely hidden from you.

public class CreateSmallThumbnail : IHandleMessages<ImageUploaded>
{
    private readonly IBus bus;

    public CreateSmallThumbnail(IBus bus)
    {
        this.bus = bus;
    }

    public void Handle(ImageUploaded message)
    {
        var thumb = new ThumbNailCreator().CreateThumbnail(message.Image.Value, 50, 50);

        var uri = new ThumbNailStore().Store(thumb, "small-" + message.FileName, message.ContentType);

        var response = bus.CreateInstance<ThumbNailCreated>(x => { x.ThumbNailUrl = uri; x.Size = Size.Small; });
        bus.Reply(response);
    }
}

Controlling the databus

In order to control the behavior of the databus, I’ve provided you with some optional configuration settings.

AzureDataBusConfig.BlockSize allows you to control the size in bytes of each uploaded block. The default setting is 4MB, which is also the maximum value.

AzureDataBusConfig.NumberOfIOThreads allows you to set the number of threads that will upload blocks in parallel. The default is 5.

AzureDataBusConfig.MaxRetries allows you to specify how many times the databus will try to upload a block before giving up and failing the send. The default is 5 times.

AzureDataBusConfig.Container specifies the container in blob storage to use for storing the message parts. By default this container is named ‘databus’; note that it will be created automatically for you.

AzureDataBusConfig.BasePath allows you to add a base path to each blob in the container. By default there is no base path and all blobs are put directly in the container. Note that using paths in blob storage is purely a naming convention; it has no other effects, as blob storage is actually a pretty flat store.
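
Assuming these settings follow the same pattern as the connection string setting shown earlier (the exact setting names below are my assumption, not taken from the docs), overriding the defaults in the service configuration might look like this:

<Setting name="AzureDataBusConfig.BlockSize" value="4194304" />
<Setting name="AzureDataBusConfig.NumberOfIOThreads" value="5" />
<Setting name="AzureDataBusConfig.MaxRetries" value="5" />
<Setting name="AzureDataBusConfig.Container" value="databus" />
<Setting name="AzureDataBusConfig.BasePath" value="images" />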

Wrapup

With the databus, your messages are no longer limited in size. Or at least the limit has become so big that you probably don’t care about it anymore: in theory you can use 200 GB per property and 100 TB per message. The real limit now is the amount of memory available on the machine generating or receiving the message; you cannot exceed that for the time being… Furthermore you need to keep latency in mind: uploading a multi-megabyte or gigabyte file takes a while, even from the cloud.

That’s it for today, please give the databus a try and let me know if you encounter any issues. You can find the sample used in this article in the samples repository, it’s named AzureThumbnailCreator and shows how you can create thumbnails of various sizes (small, medium, large) from a single uploaded image using background workers.

Have fun with it…

Improving throughput with NServiceBus on Windows Azure

One of the things that has always bothered me personally about the ‘NServiceBus – Azure queue storage’ relationship is throughput: the number of messages that I could transfer from one role to the other per second was rather limited.

This is mainly because Windows Azure storage throttles you at the http level: every queue only accepts 500 http requests per second and will queue up the remaining requests. Given that you need 3 requests per message (the sending role performs 1 post request, the receiving role 1 get and 1 delete request), you can see that throughput is quite limited; in practice you can transfer less than a hundred messages per second.

One of the first things that you can do to increase throughput is using the SendMessages() operation on the unicast bus. This operation will group all messages passed into it into 1 single message and send it across the wire. Mind that queue storage also limits message size to 8 KB, so in effect you can achieve a maximum improvement of a factor of 10, given that you have reasonably small messages and use binary formatting.
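
From application code this grouping happens when you hand several messages to the bus in a single Send call; a rough sketch, where the message type and the injected Bus field are hypothetical placeholders:

// all three messages end up grouped into a single transport message on the wire
Bus.Send(new OrderLineAdded { OrderId = orderId, LineNumber = 1 },
         new OrderLineAdded { OrderId = orderId, LineNumber = 2 },
         new OrderLineAdded { OrderId = orderId, LineNumber = 3 });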

Secondly I’ve added support to the queue for reading in batches, using the GetMessages operation on the cloud queue client. By default the queue reads 10 messages at a time, but you can use a new configuration setting called BatchSize to control the number of messages to be read. Mind that the BatchSize setting also influences the MessageInvisibleTime, as I multiply that value by the batch size to define how long the messages have to stay invisible, since overall processing may now take longer.
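
Assuming the setting lives alongside the other queue settings in the service configuration (the exact setting name and location are an assumption on my part), increasing the batch size might look like this:

<!-- read 20 messages per GetMessages call instead of the default 10 -->
<Setting name="BatchSize" value="20" />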

In the future I may consider even more improvements to increase the throughput of queue storage, like using multiple queues at a time to overcome the 500 requests per second limit. But as Rinat Abdullin already pointed out to me on twitter, this might have grave consequences for both overall latency and costs. So before I continue with this improvement I have a question for you: do you think the additional latency and costs are warranted?

But even then, there is another throttle in place at the storage account level, which limits all storage operation requests to 5000 requests per second (this includes table storage and blob storage requests). In order to work around this limit you can specify a separate connection string for every destination queue using the following format: “queuename@connectionstring”.
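
For example, if the destination is configured through a message endpoint mapping, a dedicated storage account for that queue could be specified along these lines (the message assembly and account values are placeholders; the point is the queuename@connectionstring format):

<UnicastBusConfig>
  <MessageEndpointMappings>
    <add Messages="MyMessages" Endpoint="orders@DefaultEndpointsProtocol=https;AccountName={otherAccount};AccountKey={otherKey}" />
  </MessageEndpointMappings>
</UnicastBusConfig>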

Building Global Web Applications With the Windows Azure Platform – Dynamic Work Allocation and Scale out

Today I would like to finish the discussion on ‘understanding capacity’ for my ‘Building Global Web Applications With the Windows Azure Platform’ series, by talking about the holy grail of cloud capacity management: Dynamic work allocation and scale out.

The basic idea is simple, keep all roles at full utilization before scaling out:

To make optimal use of the capacity that you’re renting from your cloud provider, you could design your system in such a way that it is aware of its own usage patterns and acts upon them. For example, if role 3 is running too many cpu intensive jobs and role 1 has excess capacity, it could decide to move some cpu intensive workloads off of role 3 to role 1. The system repeats these steps for all workload types and tries to maintain a balance below 80% overall capacity before deciding to scale out.

Turns out though that implementing this is not so straightforward…

First of all you need to be able to move workloads around at runtime. Every web and worker role needs to be designed in such a way that it can dynamically load workloads from some medium and start executing them. But it also needs to be able to unload a workload again; in effect your web or worker role becomes nothing more than an agent that administers the workloads on the machine instead of executing them itself.

In the .NET environment this means that you need to start managing separate appdomains or processes for each workload. Here you can find a sample where I implemented a worker role that can load other workloads dynamically from blob storage into a separate appdomain, in response to a command that you can send from a console application. This sort of proves that moving workloads around is technically possible.
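
The core of such an agent is plain appdomain management; a minimal sketch, assuming the workload assembly has already been downloaded from blob storage to a local folder and exposes a hypothetical IWorkload entry point (the assembly and type names are placeholders, and the implementing type must derive from MarshalByRefObject):

// load the workload into its own appdomain so it can be unloaded later
var setup = new AppDomainSetup { ApplicationBase = localWorkloadFolder };
var domain = AppDomain.CreateDomain("workload-" + workloadName, null, setup);

var workload = (IWorkload)domain.CreateInstanceAndUnwrap(
    "MyCompany.Workloads",          // assembly name, hypothetical
    "MyCompany.Workloads.Resizer"); // type name, hypothetical

workload.Start();

// later, when the balancer decides to move the workload elsewhere
workload.Stop();
AppDomain.Unload(domain);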

Even though it is technically quite feasible to move workloads around, the hardest part is the business logic that decides what workloads should be moved, when and where to. You need to take quite a few things into account!

  • Every workload consumes a certain amount of cpu, memory and bandwidth, but these metrics cannot be derived from traditional monitoring information as that only shows overall usage. So you need to define and compute additional metrics for each individual workload in order to know what the impact of moving that specific workload would be.
  • Workloads tend to be rather temporal as well; heavy cpu usage right now does not mean a workload will consume the same amount in 5 seconds. So simply moving workloads around when you detect a problem is not going to cut it.
  • In other words, you need to find ways to accurately predict future usage based on past metrics and user supplied information.
  • You need to ensure a workload is moved well before it would actually start consuming resources, as moving the workload itself takes time as well.
  • The same problems repeat themselves on the target side, where you would move the workload to, as that role’s utilization is in continuous flux as well.
  • I’m only touching the tip of the iceberg here; there is much more to it…

Lots of hard work… but in time you will have to go through it. Please keep in mind that this is the way most utility companies make their (enormous amounts of) money: by continuously looking for more accurate ways to use and resell excess capacity.

Alright, now that you understand the concept of capacity and how it can help you keep your costs down, it is time to move to the next section of this series: how to make your application globally available.

Building Global Web Applications With the Windows Azure Platform – Monitoring

In the fourth installment of the series on building global web applications I want to dive a bit deeper into monitoring your instances, as measuring and monitoring is key to efficient capacity management. The goal of capacity management should be to optimally use the instances that you have; ideally all aspects of your instances are utilised at about 80% before you decide to pay more and scale out.

Windows Azure offers a wide range of capabilities when it comes to monitoring, by means of the WAD (Windows Azure Diagnostics) service, which can be configured to expose all kinds of information about your instances, including event logs, trace logs, IIS logs, performance counters and many more. WAD can be configured both from code and by means of a configuration file that can be included in your deployment. See http://msdn.microsoft.com/en-us/library/gg604918.aspx for more details on this configuration file.

Personally I prefer using the configuration file for anything that is not specific to my code, like machine level performance counters, but I do use code for things like trace logs. To enable a specific performance counter on all your instances, specify it in the PerformanceCounters element, including the rate at which the counter should be collected.

<PerformanceCounters bufferQuotaInMB="512" scheduledTransferPeriod="PT1M">
    <PerformanceCounterConfiguration counterSpecifier="\Processor(_Total)\% Processor Time" sampleRate="PT5S" />
    <PerformanceCounterConfiguration counterSpecifier="\Memory\% Committed Bytes In Use" sampleRate="PT5S" />
</PerformanceCounters>

Note that I only collect processor time and memory consumption from the instances; bandwidth throttling is performed at the network level, not the instance level, so you cannot collect any valuable data for that metric.

The diagnostics monitor will transfer this information to the storage account that you specified in your service configuration file under the key Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString, at the rate specified in the scheduledTransferPeriod attribute of the PerformanceCounters element.
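
For the pieces that are configured from code, such as the trace logs mentioned earlier, a rough sketch using the WAD API (Microsoft.WindowsAzure.Diagnostics) could look like this in the role’s OnStart; the log level and transfer period are just illustrative values:

var config = DiagnosticMonitor.GetDefaultInitialConfiguration();

// ship the trace logs to table storage every minute
config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Information;
config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);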

Now, I admit, today the Windows Azure management tooling offered by MS is a bit lacking in terms of visualising diagnostics and monitoring information. But there is a third party product, Diagnostics Manager by Cerebrata, that covers this gap very well. Here you can see how Diagnostics Manager visualises the memory and cpu usage in my instance.

Note, the consumption rates are very low now: only 20% of memory and just a few percent of cpu are effectively used at the time of measurement. This is because I upscaled to a small web role in the meantime and wasn’t executing any tests while monitoring the instance.

So, now that you know how to monitor your instances efficiently, it is time to start filling up the free capacity that is sitting idle in your machines. But that is for next time, when I will discuss the holy grail of capacity management: dynamic workload allocation.

Building Global Web Applications With the Windows Azure Platform – Offloading static content to blob storage or the CDN

In this third post on building global web applications, I will show you what the impact of offloading images to blob storage or the CDN is in contrast to scaling out to an additional instance. Remember from the first post in this series that I had an extra small instance that started to show signs of fatigue as soon as more than 30 people came over to visit at once. Let’s see how this will improve by simply moving the static content.

In a first stage I’ve moved all images over to blob storage and ran the original test again, resulting in a nice increase in the number of users the single instance can handle. Notice that the increase in users has nearly no impact on our role. I lost about 50ms in minimum response time though, in comparison to the initial test, but I would happily pay that price in order to handle more users. If you need faster response times than the ones delivered by blob storage, you really should consider enabling the CDN.

And I’ll prove it with this second test: I enabled the CDN for my storage account. A CDN (or Content Delivery Network) brings files to a datacenter closer to the surfer, resulting in a much better overall experience when visiting your site. As you can see in the following test result, the page response times decrease dramatically, down to 30 percent:

But I can hear you think: what if I had scaled out instead? If you compare the above results to the test results of simply scaling out to 2 extra small instances, you can see that 2 instances only moved the tipping point from 30 users to 50 users, just doubling the number of users we can handle, while offloading the images gives us a far more serious increase for a much lower cost ($0.01 per 10,000 requests).

Note that the most probable next bottleneck will be memory, as most of the 768 MB is being used by the operating system already. To be honest I do not consider extra small instances good candidates for deploying web roles on, as they are pretty limited in 2 important aspects for serving content: bandwidth and memory. I do consider them ideal for hosting worker roles though, as they have quite a lot of cpu relative to their other resources and their price.

For web roles, intended to serve rather static content, I default to small instances as they have about 1 GB of usable memory and 20 times the bandwidth of an extra small role for only a little more than twice the price. Still the bandwidth is not excessive, so you still want to offload your images to blob storage and the CDN.

Please remember, managing the capacity of your roles is the secret to benefiting from the cloud. Ideally you manage to use each resource at 80% without ever hitting the limit… Another smart thing to do is to host background workloads on the same machine as the web role, to use the cpu cycles that are often not required when serving relatively static content.

Next time, we’ll have a look at how to intelligently monitor your instances which is a prerequisite to being able to manage the capacity of your roles…

Building Global Web Applications With the Windows Azure Platform – Understanding capacity

In this second installment of the ‘Building Global Web Applications series’, I would like to discuss the concept of ‘Capacity’, as I feel that only a few people understand that it is the secret of the utility model, the business model behind cloud computing.

I hear, and tell, very often that cloud computing is about ‘pay for use’. But only for a few resources is this actually completely true; for many it means ‘pay for what you could potentially use’, aka the capacity of a certain resource. Let’s have a look at the pricing table of Windows Azure compute instances as an example:

Compute Instance Size | CPU         | Memory  | Instance Storage | I/O Performance     | Cost per hour
Extra Small           | 1.0 GHz     | 768 MB  | 20 GB            | Low (5 Mbps)        | $0.05
Small                 | 1.6 GHz     | 1.75 GB | 225 GB           | Moderate (100 Mbps) | $0.12
Medium                | 2 x 1.6 GHz | 3.5 GB  | 490 GB           | High (200 Mbps)     | $0.24
Large                 | 4 x 1.6 GHz | 7 GB    | 1,000 GB         | High (400 Mbps)     | $0.48
Extra Large           | 8 x 1.6 GHz | 14 GB   | 2,040 GB         | High (800 Mbps)     | $0.96

When you look at this table, you can see that every Windows Azure role has a ‘capacity’ in terms of cpu, memory, local disk space and I/O (which actually means bandwidth). In other words, the extra small instance has the potential to perform roughly 1 billion instructions per second, store 768 MB of data in memory, cache 20 GB of data on disk and transfer 5 megabits of data per second.

When serving web pages, your role will start showing a decline in performance when any one of these 4 capacities is completely utilised. When this happens you might be tempted to either scale up or scale out in order to increase the number of users you can handle, but to be honest, this might not be the best idea, because at the same time you’re also wasting part of the 3 other capacities of your instance.

Last time, I showed you a load test on a single extra small instance that showed signs of running out of capacity once there were more than 30 concurrent users on it. But when monitoring the instance I noticed that neither memory, cpu nor local disk space was a problem. Only 10% of the cpu was utilised, 82% of the memory was utilised but most of that was by the OS itself, and there was an abundance of free disk space. So the bottleneck must have been the bandwidth…

Let’s analyse a request and see whether or not this is true. Luckily loadimpact also has a page analyser that shows you which parts of a page take how much time… As you can see from the results below, most of the time is spent waiting for the first byte of several images (represented by the green bar) and waiting for the download of the larger image (represented by the blue bar). All clear indicators of the low i/o performance of an extra small role.

Now in order to increase the utilisation of other capacity types in our role, as well as increase the number of users we can handle, we should remove this bottleneck.

Offloading the static images, which don’t require computation or memory anyway, to another medium such as blob storage or the CDN is one of the prime options. This allows the machine to handle more requests for dynamic pages and thus increases the utilisation of both cpu and memory.

Next time we will see what exactly the impact is of offloading images to either blob storage or the CDN and how this compares to scaling out…

Building Global Web Applications With the Windows Azure Platform – Introduction

I don’t know if you noticed, probably not, but I’ve put some content again on http://www.goeleven.com. This content will serve as a starting point for a new series that I’m writing. In this series I will discuss, step by step, what it takes to build global, highly scalable, highly available, high density and cheap web applications with the Windows Azure platform.

In this first stage I’ve just built a simple web application, using ASP.NET MVC, with some fairly static content: a razor layout page, a static content body, a css file, some images, nothing fancy… All of this is deployed in an extra small web role uploaded to http://goeleven-eu.cloudapp.net with only 1 instance (http://www.goeleven.com is mapped to this address using a CNAME record at my DNS provider).

The general idea behind this series is to build on top of this basic sample, with more functionality and more Windows Azure features, and try out how the application behaves in the real world, in terms of performance, scalability, availability and so on. In order to achieve this we need to be able to simulate some real-life load on our application, so I signed up at http://loadimpact.com, which allows me to set up load tests with up to 5000 simulated users.

In a very first test I will ramp up to 50 concurrent users and see if this miniature application can handle it. 50 concurrent users means about 1000 visits per hour (given that the average stay time is about 3 minutes), or 24000 visitors per day; this should definitely do for my simple site at this stage…

Note: If you want to derive the average number of concurrent users currently on your site, you can use the following formula: concurrent_users = (hourly_visits * time_on_site_in_seconds) / 3600
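
Plugging in the numbers from the test above: (1000 hourly visits * 180 seconds on site) / 3600 = 50 concurrent users, which is exactly the load this first test ramps up to.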

Now let’s have a look at the results:

1 Extra Small Instance - 50 Concurrent users

As you can see, the site survived the onslaught, albeit barely. There is a significant decline in performance when the number of concurrent users increases over 30, and an almost 300% increase in response time once we reach 50 concurrent users. I’m quite sure the site would actually break down if we increased the numbers only a bit more.

Breaking down is a subjective concept on the web; it does not mean that the webserver actually crashes, it means that users navigate away from the page they were intending to visit. This graph shows the average load times for all of the users, where average really means that 50% of the requests took more than the amount of time displayed in the graph. Personally I consider a site broken if its requests take more than 3 seconds to load on average, which means 50% of its users had to wait more than 3 seconds before they got a response (which they won’t do anymore).

So what can we do to handle this problem? How can we serve more people? We either scale up or scale out, right?

If this is your first reaction, you didn’t get the utility model yet…

Utilities, like Windows Azure, are all about optimal use of capacity. Windows Azure roles offer an array of different kinds of capacity (compute, memory, bandwidth) and I bet that not all of these are optimally used yet; only one of them is the bottleneck…

Next time we will look into this capacity topic a bit further and see how we can get some more juice out of this instance without having to pay a lot more…

NServiceBus on Sql Azure, sometimes stuff just works

It just works

Microsoft has put a tremendous amount of effort into Sql Azure to ensure that it works the same way as Sql Server on premises does. The fruits of all this labor are now reaped by us: in order to persist NServiceBus data, such as subscriptions and sagas, on Sql Azure all you have to do is change your connection string.

To show this off, I’ve included a version of the Starbucks sample in the trunk of NServiceBus which has been configured to store its information on Sql Azure. You may recall this sample from my previous article, so I’m not going to explain the scenario again; I’ll just highlight the differences.

Configuring the sample

In order to configure NServiceBus for Sql Azure, you configure it as you would for Sql Server, using the DBSubscriptionStorage and NHibernateSagaPersister configuration settings for subscriptions and sagas respectively:

Configure.With()
    .Log4Net()
    .StructureMapBuilder(ObjectFactory.Container)

    .AzureConfigurationSource()
    .AzureMessageQueue().JsonSerializer()
    .DBSubcriptionStorage()
    .Sagas().NHibernateSagaPersister().NHibernateUnitOfWork()

    .UnicastBus()
    .LoadMessageHandlers()
    .IsTransactional(true)
    .CreateBus()
    .Start();

In your application configuration file you also need to include the NHibernate connection details using the DBSubscriptionStorageConfig and NHibernateSagaPersisterConfig configuration sections.

<section name="DBSubscriptionStorageConfig"
         type="NServiceBus.Config.DBSubscriptionStorageConfig, NServiceBus.Core" />
<section name="NHibernateSagaPersisterConfig"
         type="NServiceBus.Config.NHibernateSagaPersisterConfig, NServiceBus.Core" />

Furthermore you need to provide the details for your NHibernate connection, like the connection provider, the driver, the sql dialect and finally your connection string. Be sure to format your connection string using the format and naming conventions recognized by Sql Azure.

<DBSubscriptionStorageConfig>
  <NHibernateProperties>
    <add Key="connection.provider"
          Value="NHibernate.Connection.DriverConnectionProvider"/>
    <add Key="connection.driver_class"
          Value="NHibernate.Driver.SqlClientDriver"/>
    <add Key="connection.connection_string"
          Value="Server=tcp:[yourserver].database.windows.net;Database=NServiceBus;User ID=[accountname]@[yourserver];Password=[accountpassword];Trusted_Connection=False;Encrypt=True;"/>
    <add Key="dialect"
          Value="NHibernate.Dialect.MsSql2005Dialect"/>
  </NHibernateProperties>
</DBSubscriptionStorageConfig>

<NHibernateSagaPersisterConfig>
  <NHibernateProperties>
    <add Key="connection.provider"
          Value="NHibernate.Connection.DriverConnectionProvider"/>
    <add Key="connection.driver_class"
          Value="NHibernate.Driver.SqlClientDriver"/>
    <add Key="connection.connection_string"
          Value="Server=tcp:[yourserver].database.windows.net;Database=NServiceBus;User ID=[accountname]@[yourserver];Password=[accountpassword];Trusted_Connection=False;Encrypt=True;"/>
    <add Key="dialect"
          Value="NHibernate.Dialect.MsSql2005Dialect"/>
  </NHibernateProperties>
</NHibernateSagaPersisterConfig>

Alright, that was it. Validate your other settings, like the Azure storage account used as a transport mechanism, and hit F5: it just works!

The Saga pattern on Azure with NServiceBus

The Saga Pattern

As you probably know by now, I’ve been adding support for Windows Azure to the NServiceBus framework, mostly because I believe NServiceBus has the most comprehensive set of communication pattern implementations available for doing development on the Windows Azure Platform. And one of these is the Saga pattern! Before I show you how easy it is to get sagas working on Windows Azure, let’s take some time to discuss the pattern itself, as odds are you have never heard of it before.

In essence a saga is an object that keeps track of the state of a conversation between an object and its different partners. By combining multiple sagas, one representing each partner in the conversation, you can fulfill the role of an orchestration and overcome some of the challenges faced by workflow implementations in a cloud environment such as Windows Azure.

Indeed, the saga is a very viable alternative to a workflow. I’m not going into a debate on which style is better; I feel they both deserve a spot in my toolcase and can perfectly well be used in conjunction in a solution. The most important difference though is that sagas are more responsive in nature: they act on incoming messages and are rarely in charge of the entire conversation. Because of this responsive and typically asynchronous nature of the communication between sagas, they do not depend on a distributed transaction. This is quite useful in the cloud, as distributed transactions simply don’t exist there.

Configuring saga support in NServiceBus

Alright, now how easy is it to persist sagas to Azure’s Table Storage? As with anything in NServiceBus, you just configure it by calling a few methods at configuration time. The Sagas method enables saga support, and specifying that you want to persist them to Azure Table Storage is done by means of the AzureSagaPersister method.

Configure.With()
                .Log4Net()
                .StructureMapBuilder(ObjectFactory.Container)

                .AzureConfigurationSource()
                .AzureMessageQueue().JsonSerializer()
                .AzureSubcriptionStorage()
                .Sagas().AzureSagaPersister().NHibernateUnitOfWork()

                .UnicastBus()
                     .LoadMessageHandlers()
                     .IsTransactional(true)
                .CreateBus()
                .Start();

Of course you also need to specify a connection string representing your storage account in the web.config (or service configuration file if hosted in Azure). Another option that you can enable is schema creation, which is recommended for Azure table storage as it ensures that all table names exist (table storage doesn’t have any other schema information).

<AzureSagaPersisterConfig ConnectionString="UseDevelopmentStorage=true" CreateSchema="true" />

The Starbucks sample

If you want to see sagas in action on Azure, you can have a look at an azurified version of the Starbucks example. If you’ve never heard of the Starbucks example: it was originally defined by Ayende on his blog and has become the default example for demonstrating the use of sagas to implement an orchestration between humans and services. The implementation can be found in the NServiceBus trunk.

The scenario goes as follows:

I want to order a venti hot chocolate:

  • I (client) ask the cashier (external facing service) for a venti hot chocolate.
  • The cashier informs the barista (internal service) that a new order for venti hot chocolate has arrived.
  • The barista starts preparing the drink.
  • The cashier starts the payment process (credit card auth, or just counting change).
  • When the cashier finishes the payment process, he tells the barista that the payment process is done.
  • I (client) move to the outgoing coffee stand, and wait for my coffee.
  • The barista finishes preparing the drink, checks that the payment for this order was done, and serves the drink.
  • I pick up my drink, and the whole thing is finished.

The implementation of the workflow consists of a saga for each of the Cashier and the Barista. There is no saga for the customer, as we humans tend to perform our communication orchestration ourselves. I’ll show you the code of the Cashier saga to explain how a saga is implemented; you can find the full source in the NServiceBus trunk of course.

First of all, we need to give our saga a memory: a data structure used to contain the state of the conversation. This data structure must implement the IContainSagaData interface to fulfill the identification requirements, such as an Id and a representation of the originator. Furthermore we add some properties representing the type and size of the drink, the amount to be paid and the name of the customer, so we can get the drink back to him or her.

public class CashierSagaData : IContainSagaData
{
    public virtual Guid Id { get; set; }
    public virtual String Originator { get; set; }
    public virtual String OriginalMessageId { get; set; }

    public virtual Double Amount { get; set; }
    public virtual String CustomerName { get; set; }
    public virtual String Drink { get; set; }
    public virtual DrinkSize DrinkSize { get; set; }
    public virtual Guid OrderId { get; set; }
}

The implementation of the saga itself requires you to specify the data class as the type parameter that closes the Saga<T> base class. Furthermore you need to specify for which messages the saga will take part in the conversation. One specific message will mark the start of the conversation, in this example a new order, which is indicated by the IAmStartedByMessages interface. Other messages are handled with the traditional IHandleMessages interface.

public class CashierSaga : Saga<CashierSagaData>,
                            IAmStartedByMessages<NewOrderMessage>,
                            IHandleMessages<PaymentMessage>
{
    private readonly IStarbucksCashierView _view;

    public CashierSaga()
    {}

    public CashierSaga(IStarbucksCashierView view)
    {
        _view = view;
    }

We also need to specify when a saga instance should be loaded from persistence to play its part in the conversation. Many orchestration frameworks would require you to pass the orchestration ID along with all of the messages, but NServiceBus has opted to implement this the other way around: we map existing message information to the saga’s identifier using the ConfigureHowToFindSaga and ConfigureMapping methods.

    public override void ConfigureHowToFindSaga()
    {
        ConfigureMapping<NewOrderMessage>(s => s.OrderId, m => m.OrderId);
        ConfigureMapping<PaymentMessage>(s => s.OrderId, m => m.OrderId);
    }

Defining the orchestration of the conversation is done by means of the individual message handlers, which update the saga’s data and publish, or reply with, other messages through the bus. The end of the conversation is marked by the MarkAsComplete method, which will in fact remove the saga from the conversation. The code below shows you how the cashier communicates with both the customer and the barista.

        
    public void Handle(NewOrderMessage message)
    {
        _view.NewOrder(new NewOrderView(message));

        Data.Drink = message.Drink;
        Data.DrinkSize = message.DrinkSize;
        Data.OrderId = message.OrderId;
        Data.CustomerName = message.CustomerName;
        Data.Amount = CalculateAmountAccordingTo(message.DrinkSize);

        Bus.Publish(new PrepareOrderMessage(Data.CustomerName, Data.Drink, Data.DrinkSize, Data.OrderId));
        Bus.Reply(new PaymentRequestMessage(Data.Amount, message.CustomerName, message.OrderId));
    }

    public void Handle(PaymentMessage message)
    {
        if(message.Amount >= Data.Amount)
        {
            var viewData = new ReceivedFullPaymentView(Data.CustomerName, Data.Drink, Data.DrinkSize);
            _view.ReceivedFullPayment(viewData);

            Bus.Publish(new PaymentCompleteMessage(Data.OrderId));
        }
        else if(message.Amount == 0)
        {
            var viewData = new CustomerRefusesToPayView(Data.CustomerName, Data.Amount, Data.Drink, Data.DrinkSize);
            _view.CustomerRefusesToPay(viewData);
        }

        MarkAsComplete();
    }

    private static Double CalculateAmountAccordingTo(DrinkSize size)
    {
        switch(size)
        {
            case DrinkSize.Tall:
                return 3.25;
            case DrinkSize.Grande:
                return 4.00;
            case DrinkSize.Venti:
                return 4.75;
            default:
                throw new InvalidOperationException(String.Format("Size '{0}' does not compute!", size));
        }
    }
}

Hope this example shows you how you can implement sagas, as an alternative to workflow orchestrations, on the Windows Azure platform. But before I leave you, there is still one thing to remember:

One thing to remember

This implementation uses Azure Table storage under the hood, and Azure Table storage only supports a small number of data types, as specified in the corresponding MSDN article. It speaks for itself that you can only use compatible .NET types in your saga’s state. If you need more data types, you will have to consider SQL Azure as your data store, which I will show you how to do in a next post…