HDInsight bootcamp resources

Last weekend we did the second HDInsight bootcamp for @ukwaug. It was a good meetup and I think people got a lot out of it. One attendee said to me at the end.

“I can’t believe I got up and running with a cluster and map-reduce job in half an hour in C#. It took me weeks and an armada of experts to use EMR (Elastic Map Reduce) even though I’ve been using Java for the past few years”

That’s true though. HDInsight was meant as an easy route for programmers and .NET developers especially and those that are lazy enough to want their infrastructure to just be there (like me).

Anyway, thanks to the guys at bbits for giving us 50,000 data points from reports from their LoveCleanStreets application at http://www.lovecleanstreets.org/ it was great to see people coming up with innovative uses of the data and finding trends. In the future we’ll look to do a more advanced workshop we can enrich with other sources.

The Map-Reduce labs can be found here

The PIG labs can be found here

The dataset can be found here

Last, but not least, people have asked for the slides which can be found here (Map-Reduce anyway)

Thanks to @mjunwin, @isaac_abraham and @ds_ldn for some great presentations on the day. I’ll be updating this post in the coming days to include their 3 presentations.

Thanks to Microsoft as well for sponsoring (and Elastacloud who are picking up the delta on the bootcamp series!)

Tagged with: , , ,
Posted in Azure REST APIs, Big Data, HDInsight

A better way to submit Map-Reduce jobs with #HDInsight

Thanks to Maxim from the HDInsight team for this one.

Elastacloud has been using HDInsight since the start. Primarily @andybareweb although the skills have been transferred throughout the team between projects. I thought I’d take this opportunity to blog about another way of submitting jobs.

Let’s start with a few assumptions first. People will use HDInsight so that they can write Map-Reduce in .NET. There you go. I’m not saying that there are no other reasons to use HDInsight. We’re currently working on projects with heavy use of PIG and HIVE but the real differentiating factor is the use of Map-Reduce in C# or better still F#. Having spent some years working on Java projects and spent a brief time over summer working on Python I would rather sit in my own vomit than begin a Hadoop project with either of these. Throwing away programmatic terseness, type safety, fantastic abstractions and other nuggets in favour of a first class unit test testing framework, a strongly SDK for streaming and the ability to use lambdas and expressions on demand. There’s no contest for me. Maybe this is why we have such a good record of completing Hadoop projects. Thanks in part to Microsoft.

I wanted to write this because I’ve swapped using the strongly typed SDK in our current project in favour of using the raw job submission for a Map-Reduce. Without some insider knowledge this is fairly difficult to extrapolate so I thought I’d save everyone the trouble of getting stuck given that the only examples are based on wordcount (yes we do sometimes have to do more than count words in jobs- although TF-IDF is becoming something commonplace).

A hadoop job signature may look like this:

What happens here is that when you connect to the cluster the SDK will automatically delete the output folder so that HDInsight can rewrite to that location with incident. It’s possible to have many jobs and have to manage a very difficult system which can get somewhat cluttered. Everytime you submit a job the HDInsight SDK will reflect on the assemblies that you’re using with the contained types and then upload these to a location in ASV (or WASB – Windows Azure Storage Blob – as it’s now called) and upload them to a new folder under a specified user folder under the name of a GUID which in a contained dotnetcli folder. A lot going on! This means that everytime you connect to the cluster before you submit a job you can have a potentially large interaction with Blob Storage. Imagine that you’re doing this several hundred times! The time will add up quite quickly. If you could take s step back and manage this process yourself so that the software only needed to be uploaded once in some circumstances that would be better. Right?

When the reflection occurs in the SDK the following will appear on the command line window that submits to the cluster via the Hadoop service Templeton:

Output folder exists.. deleting.
File dependencies to include with job:
[Auto-detected] C:\Projects\Elastacloud\bin\Debug\Elastacloud.ClusterManagement.vshost.exe
[Auto-detected] C:\Projects\Elastacloud\bin\Debug\Elastacloud.ClusterManagement.exe
[Auto-detected] C:\Projects\Elastacloud\bin\Debug\Elastacloud.Types.dll
[Auto-detected] C:\Projects\Elastacloud\bin\Debug\Microsoft.Hadoop.MapReduce.dll
[Auto-detected] C:\Projects\Elastacloud\bin\Debug\Elastacloud.AzureManagement.Fluent.dll
[Auto-detected] C:\Projects\Elastacloud\bin\Debug\Microsoft.Hadoop.Client.dll
[Auto-detected] C:\Projects\Elastacloud\bin\Debug\Microsoft.Hadoop.WebClient.dll
[Auto-detected] C:\Projects\Elastacloud\bin\Debug\Newtonsoft.Json.dll

In this case Elastacloud.ClusterManagement submits the job to the cluster using the SDK methods but Elastacloud.Types.dll contains the types. Making this seperation of concerns is important as HDInsight will not be able to reflect on the .exe and offer a BadImageFormatException. It’s worth noting this since a lot of people have not realised that the data nodes have now been upgraded to include .NET 4.5 and slip back to using 4.0 since they think this will fix the problem (until recently the cluster nodes only supported .NET 4.0).

So we can create a bunch of StreamingJobParameters like so:

Where the Mapper and Reducer are kicked off from .exe drivers in the MRLib directory. We set our own status folder as needed as well which will contain the outputs from stderr, stdout and stdin.

We can add the files to the definition now. Important to note that these files should be copied there first so we could either copy them one time or emulate the HadoopJob and copy them to a new folder everytime.

This is the most important part:

There you go. A bunch of MSFT environment variables that you’ll have to populate to make this work and identify the correct types. It’s important to point out that a number of additional arguments can be added to this which may change the bahaviour of the job such as limiting or increasing the numbers of mapper or reducers.

Once this is done the job can be submitted. To submit this job now we can use the following:

ReadClusterInformation will be used to get the cluster information from config or the REST API and help populate the ClusterName and credentials if needs be.

JobClient can then be used to submit the job to the cluster.

Unlike the SDK which will wait on the job we’ll need to build this code in ourselves.

That way we can block until the job completes. Whilst the HadoopJob SDK offers a great abstraction and takes care of a lot of heavy lifting if you want to control some of the elements of job submission to Templeton then this approach may work better.

Have fun!

Tagged with: , , , ,
Posted in Azure REST APIs, Big Data, HDInsight

#codevember finally bring Fluent Management fix

I finally updated fluent management as part of my contribution to #codevember commits. Done about 3 this weekend but plan to do some more on the Storm-Service Bus Spout in the next few days. Determined to do one commit a day now!

I’ve updated the MobileServiceClient so that the service doesn’t fail calling an old API. The REST API moved from Webspaces to a single site over summer and so one of the commands to get the scale and tier failed. I didn’t have a chance to fix until now (thanks to Danfer Lopez for the prompt!) but it should be okay. Given #codevember I’m going to hold back a nuget release and add a scale update method, a use an existing database method, plugin into the scheduler and plugin a custom API and google key and APNS script and cert. I should have enough time this month for all of that :-)


Tagged with: , ,
Posted in Azure REST APIs, Service Management API


#codevember sounds cool. Thanks to Rich Astbury (@richorama) for bringing it to my attention.


I’ll extend my new service bus spout in Storm and look at adding things like message properties. Small steps everyday throughout November with the aim of making Windows Azure a more hospitable place for Storm.

Tagged with: ,
Posted in Linux Virtual Machines

Using Storm on Windows Azure

It’s been a long time coming but I’ve started on a new project! I parked fluent management at some point this summer. It was a great learning experience because understanding the Service Management API means you understand Windows Azure. That exploration stood me in very good stead. I was blown away with the new library our friend and colleague Brady Gaster @bradygaster has released for Microsoft which makes fluent management look like a young fetus. That said, I’ll maintain fluent since it has a userbase and a few bits and pieces which are really interesting, VM cluster management, WasabiWeb and mobile services integration.

So .. Moving onto greener pastures. Elastacloud are intent on bringing many of things that currently work well on AWS and other clouds to Windows Azure. In this blog post I want to detail our most recent project.

For those of you that don’t know … A few years ago Twitter bought a company called Backtype where Nathan Marz developed a real time data transformation framework called Storm. We’ve been looking at Storm for about 6 months now but just recently had the opportunity to put together an architecture which used HDInsight along with something realtime. If Twitter wasn’t enough of an advertisement then reading Nathan’s blog should be. He talks about this great appeal called Lambda architecture which is real time straddling batch. This is a perfect synthesis of our experience with HDInsight plus a little “Real Time Shizzle” as @andybareweb calls it.

We’re working on a project at the moment which gets a high throughput of messages and could do with a real-time engine to transform these and give us some cumulative analytics. We decided at the start that we wanted to use the Windows Azure Service Bus to decouple the various application parts and allow messages to build up logs for HDInsight to parse at the end of day but at the same time allow Storm to process messages off the Service Bus and give us that real-time shizzle.

Storm is a very simple framework with a great deal of support. There are numerous deployments on AWS and none on Windows Azure because the magic EC2 button doesn’t exist for Azure (a deployment is 1 node to thousands – yes it scales linearly). We approached this as a two-fold task. The first is to deploy Storm across a cluster of Windows Azure nodes. If you’re lucky enough to be one of our customers we’ve now pushed this deployment into our clustermanager tool so for the bargain price of a donkey and three farmyard hens you can get an Elastic Elastacloud Storm deployment in Windows Azure. Not sold yet? Well …

We decided to start building a useful set of Windows Azure contrib tools beginning with a “spout” for the ServiceBus. Storm has a concept of spouts which allow you to ingest messages from a feed and consume them in some way and then emit them back out to a series of “bolts”. Bolts transform the messages so you can apply any type of standard transform or lookup which will enable you to interact with lists, redis, storage or anything to check incoming messages and transform them against persistent or transient state. I’ve seen a wide spectrum of examples now from log parsing to real time machine learning and semantic analysis of twitter feeds.

Over summer I used Storm with an Apache Kafka Spout and decided that it would be good if we could do the same thing with the Windows Azure Service Bus given that we already have a reliable messaging subsystem in Windows Azure.

You can see the project at https://github.com/azurecoder/servicebusspout

We’re using it but I hope it will also be used by others too. Feel free to contribute or give feedback as it’s just a first pass. I’ve actually written some tests for this as well!

I’ve set it up to use both Maven and Leiningen. We’ve got it working as part of our continuous integration process now with TeamCity. Here are some instructions for the Java noobs.

Clone the repo to start:

> git clone https://github.com/azurecoder/servicebusspout

To start either of the following will work:

> mvn install
> lein uberjar

You should end up with lots of outputs.

You should also be able to import this into Eclipse using:

mvn Eclipse:Eclipse

There you are. Ready to go. Now you can write a wordcount bolt!

Just to give you a little bit code. The following will allow messages from a ServiceBus Queue to be consumed and pass them onto the bolt “bolt-reader”. It runs this on a local test cluster, does some work and then shuts down.

I’m in the process of adding this to Maven Central and will post an update for those that want to consume the packaged spout from Maven.

BTW, it also supports subscribing to Service Bus Topics. Have fun and give feedback.

Tagged with: , , ,
Posted in Big Data, HDInsight, Linux Virtual Machines

New vmdepot image Azure Data Analysis available

I decided that it would be good to put all of the things that I’m using to learn Data Science into a single vm depot image. So today … Azure Data Analysis is born!

Create an image from the vmdepot

I started out by adding the R language and upgrading to 3.02 and included a whole heap of useful packages such as rJava, Rcpp, openmpi and many statistical and machine learning packages. All of the tools are in this image to parallelise R using openmpi so you can get some great performance across Azure VMs.

Copying and registering the image

Adding the endpoints (80, 4040, 443/8888)

I’ve added the update of my favourite tools Spark. This is now at 0.8 which is my first deployment as an Apache Incubator project. You can start up the Spark Console by typing:

> spark-shell

This will bring up the Spark REPL and allow you to enter scala commands with a SparkContext. This latest version will allow you to see the job scheduler on port 4040.


You can also use Shark from the command line.

> shark

Then you should be able to use Shark!

In addition to Spark I’ve enabled iPython Notebook. In order to run it you’ll have to enter.

> ipython notebook –profile=nbserver

iPython notebook

You can then navigate to the web root which will be https://*.cloudapp.net and you should see a command prompt. At the textbox enter the password Elastacloud123 and you should be able to login to iPython notebook and start creating notebooks. Everything you need for machine learning and azure is already installed pandas, scikit-learn, numpy, scipy etc.

Storm and Kafka have also been deployed to the machine as well as zeromq so you should be able to get up and running experimenting with Storm.

Have fun!

Posted in Linux Virtual Machines, vmdepot

Beginning Spark on Windows Azure

I’ve been using Spark and Shark for the past year. As a result I decided to release a vmdepot image to help people that want to experiment without the hassle of trying to install the software which can be fiddly and time-consuming. I won’t talk about Spark except to say that I love it and it has inspired me to learn quite a bit of Scala. Check out the course by Martin Odersky on Coursera!

Anyhow, try out the image using the following with the CLI where username and password is the username and password of the vm.

> azure vm create DNS_PREFIX -o vmdepot-6582-1-16 -l “North Europe” USER_NAME [PASSWORD] –ssh 

In order to use the image login and setup a root session (if you want to use the logged in user you can but you’ll have to copy all of the environment variables from the root .bashrc to your home – although I plan to update the image set them up in /etc/bash.bashrc in the near future).

In any case to get started run a root session and navigate to:

> cd /usr/spark/spark-0.7.3

Then we’ll run a command to check we can calculate the value of pi from a Spark example using a couple of concurrent threads.

> ./run spark.examples.SparkPi local[2]

You should see this as the last line if it has worked.

> Pi is roughly 3.1438

Not hugely accurate!

We’ll leave it there for the time being and check whether Shark works. Feel free to explore the samples directory which contain all kinds of goodies including a k-means clustering example.

Navigate to:

> cd /usr/spark/shark/shark-0.7.0/bin

Then run the following commands:


> CREATE TABLE src(key INT, value STRING);

> LOAD DATA LOCAL INPATH ‘${env:HIVE_HOME}/examples/files/kv1.txt’ INTO TABLE src;

If you get an error here with chmod – allocate some virtual memory like so:

> fallocate -l 10g /mnt/resource/swap1

> chmod 600 /mnt/resource/swap1

> mkswap /mnt/resource/swap1

> swapon /mnt/resource/swap1

Use: nano /etc/fstab and append:

> /mnt/resource/swap1 swap swap defaults 0 0

You should now finally be able to run a query:

> select * from src;

And get back 500 results.

There you go. Spark in a few easy steps. I’ll be looking at some ways that I can make this test VM easier to use in the near future and Elastacloud will be updating this regularly with new builds. Enjoy!


Tagged with: , ,
Posted in Linux Virtual Machines, vmdepot

Autoscaling for Windows Azure Websites: Welcome to WasabiWeb

So, the autoscalar was released for Cloud Services and websites recently. Finally! Anyway, some of you may have read my post earlier this year when I reverse-engineered the Azure Websites Service Management API and built it into fluent management. Well … A couple of weeks ago I had the chance to spend some time at TechEd Europe working as staff answering some question about Windows Azure.

Whilst I was there I went to see Mark Russinovich and spent time talking to many MVP colleagues and others but had time on my hands to finally build out the autoscalar which I wanted to months ago. It just so happened that I finished this at exactly the same time as the auoscalar preview was released.

As someone who has used WASABi on many occasions I really understand the power of building a framework around monitoring and alerts. I think where scaling is concerned this is all important. Much of our conversations with new customers revolves around them understanding their scale unit. The scale unit is the way that you can milk the most from your deployment in an attempt to understand how to get it to work at full capacity and not fall over but when it runs out of resources it should scale up. The logical extension to this would be run some software to check the patterns in order to pre-empt what an application will be doing 5 minutes from now based on sum over histories.

Scaling traditionally works by testing metrics over a sample period and then scaling up or down based on metric thresholds. WASABi uses diagnostic data to test assertions and then updates the ServiceConfiguration.cscfg with a change in instance count. Websites has a similar instance count which is embedded in the config. This is the NumberOfWorkers. The count can be anything up to 10 which is the maximum number available per ServerFarm in any webspace. You can consider as a location for websites.

Principally you would want to scale a website based on a set of conditions like WASABi. You would want to chain the output of these metrics tests together to ensure that some logical operation was performed on the aggregate. This is what I have built into WasabiWeb (sorry to anyone in the patterns team of Microsoft – I couldn’t think of a better name!)

The current implementation of websites is probably a first stab for Microsoft and will improve over time. The fact that it’s integrated is great but it should be viewed for what it is. It gives you the ability to average CPU load over a fixed time of an hour and then decide if the load has been consistently high or low. Thresholds are 40% and 80%. This is important because the autoscalar here isn’t immediately reactive to resource deficiency. This means that you’ll save money in the long haul because it will ensure that your average hourly values smooth out but won’t be any more reactive than that.

In order to create an autoscalar therefore you can use the following few lines of code in a console application or a windows service or anything that blocks on the main thread. In this example we look at the UKWAUG website and test the metrics against our rules every 5 minutes. The two examples rules offer a metric name with an upper bound and a lower bound threshold for the metrics. The WebsiteManagementConnector uses the WasabiWebLogicalOperation which can take And or Or values. When these two rules are Anded unless both rules are the same i.e. both metrics values are greater than their upper bound rules or both metric values are less than their upper bound rules then no change will be applied to the instance count.

We subscribe to the event ScaleUpdate and whenever there is a scale change our event is raised. In order to begin the monitoring and scaling process we call the method MonitorAndScale. There are a variety of rules we can test for including:

  • BytesReceived
  • BytesSent
  • CPUTime
  • DataSent
  • Requests
  • NetworkBytesRead
  • NetworkBytesWritten
  • Http2xx
  • Http5xx

The list above only represents about half of the available metrics so you can see that with the entire set you begin to formulate a very rich set of rules on whether to scale the site or not that are more application specific than just CPU. The sites are scalable for both Reserved (now called Standard) and Shared.

In addition to the above there are alerts which can be set to notify you of metrics being breached within the sample window. This means that you’ll be informed without a scaling up or down of the service.

One example of where you might want to use this is if you wanted to track the HTTP errors in your application. You might also want to check Requests + Bytes to see whether users were targeting upload pages.

It doesn’t just need to be about scaling though … If you want to use the rules engine to set alerts and monitor the thresholds you set without autoscaling you can do that too!

This one monitors all Http 200 responses so you can count all of your page successes within a period of time.

For those that a new to fluent you can get it from nuget here. For those that want to use WasabiWeb here it is. The source is on github just look through some old posts :-)

Have fun.

Tagged with: , , , , ,
Posted in Azure REST APIs, Service Management API, Windows Azure Websites

Enabling an HTTP listener for remote powershell in a Windows Azure VM

I’ve been working with powershell for a while now but have recently extended my understanding to beyond the scope of Azure. When remote powershell was released I thought this was a perfect way to try and dynamically build compute clusters in Windows. For one of our projects I’ve recently had to use remote powershell to perform a number of operations from creating an Active Directory on the fly (dcpromo) to getting back info from the remote Virtual Machine.

Powershell is enabled by default when you check the enable remote powershell box in Windows Azure. There are two types of remote powershell HTTP and HTTPS. Both are listeners. By default HTTPS is enabled but you have to enable it with a certificate that you can use.

In order to simplify matters and primarily because Michael Washam has a great blog post on enabling powershell with the HTTPS listener I’ll stick with the HTTP one.

We may have a powershell script like this to create a new VM. Note the use of the -EnableWinRMHttp as by default this is not enabled using powershell.

Also the public endpoint has to be created on port 5985 which will allow access to the VM. As long as we open the endpoint we should be able to use remote powershell. If we forget to use the -EnableWinRMHttp switch then we’ll have to go onto the box and type in the following:

This will enable remote powershell if it hasn’t been enabled yet and add a firewall rule to open up the correct port on the virtual machine. It will then enable the credentials acceptance for a remote powershell session on the server and allow remote origin scripts to be executed. All in all it’s better to use -EnableWinRMHttp. If you try and image the above it won’t survive.

When the service management API is used to create a virtual machine it understands that it needs to create powershell listeners so the XML supports this natively. By default the Azure Powershell CmdLets create an HTTPS listener by filling in these details.

Anyway, to connect to this is fairly simple now. We can now just add the following in Powershell:

Once you have this access turning this virtual machine into a domain controller is a doddle. Simply enter the following powershell CmdLet into your newly added remote powershell session and wait!

Happy trails!


Tagged with: , , ,
Posted in Service Management API, Windows Azure Active Directory

IP restricting Windows Azure Sql Database with the Service Management API

One common request or annoyance from customers is the fact that Windows Azure Sql Database (WASD) supports two modes of protection with its firewall only.

1. External IP addresses that are used to manage databases through SMSS or some other tool

2. An “any” rule with an IP range of – which denotes that any Windows Azure virtual machine can access the Sql database server

Its not that clear cut but lets have a short lesson on the latter is achieved … Each role instance has an IP address of 10.x.x.x and traffic can be routed between roles on internal endpoints through this network interface. An external service like Windows Azure Sql Database has no knowledge of this since it’s not part of a virtual network. This means that when an application hosted on a virtual machine connects to the WASD the firewall will pick up the public address of the cloud service.

Try this out by removing all rules and connect to a database in a sample web application and you’ll see an IP address that is referenced within a public range here:


That being said a better way of using the firewall would be to avoid the “any” rule altogether and lock down the WASD to the specific IP address of the cloud service.

Fluent Management was conceptualised to create deployment workflows in Windows Azure and one that I like quite a lot is the idea of building an agent or a monitoring application of some kind which will monitor a cloud service deployment. If the address changes (infinite lease unlikely but possible depending on the circumstances) then the agent will pick up the fact that the IP has changed blow away the firewall rules and create a new set.

I set out to do just this and this is what my code looks like:

It’s a new SqlDatabaseClient class which now has that single method to be able to take a cloud service name and look at either the staging or production slot and optionally blow away all other rules or append the new IP to an existing set of rules.

There are several ways to do this using Fluent Management. One way could be coded as follows:

This will use the Linq provider to determine the current public IP address of the cloud service and it can be checked against the stored previous one(s). This can also be used in conjunction with the fluent API deployment for cloud services and WASD. More can be read about this on the wiki.

The latest version of fluent management is which will be packaged and I’ll add to nuget in the next couple of days.

Posted in Azure REST APIs, Service Management API, Windows Azure Sql Database