Thanks to Maxim from the HDInsight team for this one.
Elastacloud has been using HDInsight since the start, primarily through @andybareweb, although the skills have been transferred throughout the team between projects. I thought I’d take this opportunity to blog about another way of submitting jobs.
Let’s start with a few assumptions first. People will use HDInsight so that they can write Map-Reduce in .NET. There you go. I’m not saying that there are no other reasons to use HDInsight. We’re currently working on projects with heavy use of PIG and HIVE but the real differentiating factor is the use of Map-Reduce in C# or, better still, F#. Having spent some years working on Java projects and a brief time over summer working on Python, I would rather sit in my own vomit than begin a Hadoop project with either of these. Throwing away programmatic terseness, type safety, fantastic abstractions and other nuggets, not to mention a first-class unit testing framework, a strongly typed SDK for streaming and the ability to use lambdas and expressions on demand? There’s no contest for me. Maybe this is why we have such a good record of completing Hadoop projects. Thanks in part to Microsoft.
I wanted to write this because I’ve swapped using the strongly typed SDK in our current project in favour of using raw job submission for a Map-Reduce. Without some insider knowledge this is fairly difficult to work out, so I thought I’d save everyone the trouble of getting stuck, given that the only examples are based on wordcount (yes, we do sometimes have to do more than count words in jobs, although TF-IDF is becoming commonplace).
A Hadoop job signature may look like this:
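(The original listing isn’t reproduced here, but with the strongly typed SDK — Microsoft.Hadoop.MapReduce — a job definition looks roughly like the sketch below; MyMapper, MyReducer and the ASV paths are illustrative names, not anything from our project.)

```csharp
// Sketch of a strongly typed job using the Microsoft.Hadoop.MapReduce SDK.
// MyMapper/MyReducer and the ASV paths are illustrative placeholders.
public class MyJob : HadoopJob<MyMapper, MyReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        return new HadoopJobConfiguration
        {
            InputPath = "asv://container@account.blob.core.windows.net/input",
            OutputFolder = "asv://container@account.blob.core.windows.net/output"
        };
    }
}
```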
What happens here is that when you connect to the cluster the SDK will automatically delete the output folder so that HDInsight can rewrite to that location without incident. With many jobs you can end up managing a difficult system which gets somewhat cluttered. Every time you submit a job the HDInsight SDK will reflect on the assemblies you’re using and their contained types, then upload these to a location in ASV (or WASB – Windows Azure Storage Blob – as it’s now called): a new GUID-named folder under a specified user folder, with the binaries in a contained dotnetcli folder. A lot going on! This means that every time you connect to the cluster before you submit a job you can have a potentially large interaction with Blob Storage. Imagine that you’re doing this several hundred times! The time will add up quite quickly. If you could take a step back and manage this process yourself, so that the software only needed to be uploaded once in some circumstances, that would be better. Right?
When the reflection occurs in the SDK the following will appear on the command line window that submits to the cluster via the Hadoop service Templeton:
Output folder exists.. deleting.
File dependencies to include with job:
In this case Elastacloud.ClusterManagement submits the job to the cluster using the SDK methods but Elastacloud.Types.dll contains the types. Making this separation of concerns is important, as HDInsight will not be able to reflect on the .exe and will throw a BadImageFormatException. It’s also worth noting that the data nodes have now been upgraded to include .NET 4.5; a lot of people haven’t realised this and slip back to targeting 4.0 thinking it will fix the problem (until recently the cluster nodes only supported .NET 4.0).
So we can create a bunch of StreamingJobParameters like so:
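(A minimal sketch of the parameters follows; in the Microsoft.Hadoop.Client package the class was, if memory serves, StreamingMapReduceJobCreateParameters, and the driver .exe names are recalled from the SDK’s MRLib folder — the paths are illustrative.)

```csharp
// Raw streaming job definition; the driver .exes live in MRLib on the cluster.
// All paths here are illustrative placeholders.
var parameters = new StreamingMapReduceJobCreateParameters
{
    Input = "wasb://container@account.blob.core.windows.net/input",
    Output = "wasb://container@account.blob.core.windows.net/output",
    Mapper = "Microsoft.Hadoop.MapDriver.exe",
    Reducer = "Microsoft.Hadoop.ReduceDriver.exe",
    StatusFolder = "/jobstatus"   // stderr, stdout and stdin outputs land here
};
```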
Here the Mapper and Reducer are kicked off from .exe drivers in the MRLib directory. We also set our own status folder as needed, which will contain the outputs from stderr, stdout and stdin.
We can now add the files to the definition. It’s important to note that these files should be copied to Blob Storage first, so we could either copy them one time or emulate the HadoopJob behaviour and copy them to a new folder every time.
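(Adding the dependency might look like this; the WASB path is an illustrative placeholder.)

```csharp
// The dependency must already exist at this WASB location; either copy it
// once up front or mimic HadoopJob's behaviour and copy it per submission.
parameters.Files.Add("wasb:///user/hadoop/jobs/Elastacloud.Types.dll");
```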
This is the most important part:
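(A sketch of the environment variables follows. The MSFT_HADOOP_* names are the ones the .NET streaming drivers read, recalled from the SDK rather than taken from documentation, and Elastacloud.Types.MyMapper/MyReducer are illustrative type names.)

```csharp
// Environment variables the .NET streaming drivers use to locate the
// mapper and reducer types inside the uploaded assembly.
// Variable names recalled from the SDK; type names are illustrative.
parameters.CmdEnv.Add("MSFT_HADOOP_MAPPER_DLL=Elastacloud.Types.dll");
parameters.CmdEnv.Add("MSFT_HADOOP_MAPPER_TYPE=Elastacloud.Types.MyMapper");
parameters.CmdEnv.Add("MSFT_HADOOP_REDUCER_DLL=Elastacloud.Types.dll");
parameters.CmdEnv.Add("MSFT_HADOOP_REDUCER_TYPE=Elastacloud.Types.MyReducer");
```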
There you go. A bunch of MSFT environment variables that you’ll have to populate to make this work and identify the correct types. It’s important to point out that a number of additional arguments can be added which may change the behaviour of the job, such as limiting or increasing the number of mappers or reducers.
Once this is done the job can be submitted using the following:
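(A sketch of the submission, assuming the streaming parameters built above. ReadClusterInformation is our own helper, not an SDK call; BasicAuthCredential and JobSubmissionClientFactory are from Microsoft.Hadoop.Client, with property names recalled from memory.)

```csharp
// ReadClusterInformation is the author's helper (config or REST API lookup);
// BasicAuthCredential/JobSubmissionClientFactory come from Microsoft.Hadoop.Client.
var cluster = ReadClusterInformation();
var credentials = new BasicAuthCredential
{
    Server = new Uri(string.Format("https://{0}.azurehdinsight.net", cluster.ClusterName)),
    UserName = cluster.UserName,
    Password = cluster.Password
};
var jobClient = JobSubmissionClientFactory.Connect(credentials);
var creationResults = jobClient.CreateStreamingJob(parameters);
```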
ReadClusterInformation will be used to get the cluster information from config or the REST API and help populate the ClusterName and credentials if need be.
JobClient can then be used to submit the job to the cluster.
Unlike the SDK which will wait on the job we’ll need to build this code in ourselves.
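(A simple polling loop does the trick; the JobDetails/JobStatusCode names are recalled from the Microsoft.Hadoop.Client package, so treat this as a sketch.)

```csharp
// The typed SDK blocks for you; with raw submission we poll Templeton
// ourselves until the job reaches a terminal state.
var details = jobClient.GetJob(creationResults.JobId);
while (details.StatusCode != JobStatusCode.Completed &&
       details.StatusCode != JobStatusCode.Failed)
{
    Thread.Sleep(TimeSpan.FromSeconds(10));
    details = jobClient.GetJob(details.JobId);
}
```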
That way we can block until the job completes. Whilst the HadoopJob SDK offers a great abstraction and takes care of a lot of heavy lifting, if you want to control some of the elements of job submission to Templeton then this approach may work better.