
Codit Blog

Posted on Friday, November 20, 2015 2:35 PM

by Tom Kerkhove

Recently I have been working on an open-source project around Azure Data Lake Analytics. It will contain custom U-SQL components that you can use in your own projects and already contains an Xml Attribute Extractor.

In this brief article I'll discuss how you can use NuGet to enable automated builds for Azure Data Lake Analytics extensibility projects.

To build my code I'm using MyGet, because it builds the code and automatically packages it in one or more NuGet packages.

The only thing I needed to do was sign up and point MyGet to my repository. Every time I push my code to GitHub it will automatically be built & re-packaged.

Unfortunately, the build service encountered some issues when it was building my code:

2015-11-06 02:14:23 [Information] Start building project D:\temp\tmp9931\src\Codit.Analytics.sln...

C:\Program Files (x86)\MSBuild\14.0\bin\Microsoft.Common.CurrentVersion.targets(1819,5): warning MSB3245: Could not resolve this reference. Could not locate the assembly "Microsoft.Analytics.Interfaces".

Check to make sure the assembly exists on disk. If this reference is required by your code, you may get compilation errors.

Obviously the server didn't have the correct DLLs to build it, but first: how did I create my project?

With the Azure Data Lake Tools for Visual Studio you create a solution/project for U-SQL based on one of the provided templates.

For my custom extractor I have created a "Class Library (For U-SQL Application)" project with a corresponding test project. Once it's created, it automatically references the DLLs you need.
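
For context, here is a minimal sketch of what such a class looks like against the Microsoft.Analytics.Interfaces API those DLLs expose; the extractor itself is hypothetical (it simply emits each input line as one string column) and is not the Xml Attribute Extractor from the project.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

namespace Codit.Analytics.Samples
{
    // Hypothetical extractor: emits every line of the input file as a single string column.
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class LineExtractor : IExtractor
    {
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            using (var reader = new StreamReader(input.BaseStream))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    output.Set<string>(0, line);      // write into the first column of the output schema
                    yield return output.AsReadOnly(); // hand the row back to the U-SQL runtime
                }
            }
        }
    }
}
```

In a U-SQL script you would then reference the assembly and read a file with something like `EXTRACT line string FROM "/input.txt" USING new Codit.Analytics.Samples.LineExtractor();`.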

The problem with these references is that they point to DLLs inside the Visual Studio installation folder.

C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\PublicAssemblies\Microsoft.Analytics.Interfaces.dll

As the build server doesn't have the tooling installed, it can't find them. Unfortunately there is no official NuGet package for these assemblies yet, although this is something that is under review.

But don't worry! I've wrapped all the DLLs in two NuGet packages so we can build all our stuff until there is an official NuGet package:

  • Install-Package TomKerkhove.Analytics.Extensibility - Contains all the DLLs for writing your own extensibility (NuGet page)

  • Install-Package TomKerkhove.Analytics.Extensibility.Testing - Contains the unit test DLL and installs the above package. (NuGet page)

By installing these packages you override the existing DLL references and your build server is good to go!

Thanks for reading,


Categories: Big Data, Azure
written by: Tom Kerkhove

Posted on Thursday, October 29, 2015 2:56 PM

by Filip Van Raemdonck

Big Data is a very broad topic, involving collecting, processing, visualizing and analyzing the right data in the right way. For all these needs Microsoft offers products such as Event Hubs, Stream Analytics, Data Lake and Power BI, topics that have already been touched on in other Codit blog posts. This article, however, will not cover all of these, but will dive into machine learning, a very important technique for processing the data.

Machine learning: what is it?

Machine learning is a branch of artificial intelligence in which computers learn the behavior of data and use that behavior to analyze new data and, possibly, predict values. Two approaches can be distinguished: supervised learning and unsupervised learning.

  • In unsupervised learning, we group our data into similar groups, without predicting anything.
  • With supervised learning, we have historical data for which we already know certain properties, and we want to predict those properties for new data (e.g. predicting when an event will happen, or whether it will happen at all).

Both supervised and unsupervised learning have some subcategories.

Supervised learning has the following categories:

  • Regression: We search for a function which predicts the outcome for a new set of input elements.
  • Binary classification: In this case we try to predict whether a product is in a certain category or not, where these categories are self-defined. An example is detecting whether a component will break in the next five years.
  • Classification: This is the same as binary classification, but with more categories. We can use this e.g. to detect whether a component is likely to break within zero to five years, five to ten years, ten to fifteen years, or will keep working for more than fifteen years.

Unsupervised learning has the following categories:

  • Anomaly detection: In anomaly detection we analyze existing data and look for a certain pattern in it. If new data comes in and behaves differently, that data is possibly corrupted or points to a defect.
  • Clustering: Splitting the data into groups where all members have similar properties.

Example: predictive maintenance

Today, it is important to keep things working. If something breaks, it can disrupt a whole process which can result in high losses. This is why lots of companies renew a lot of their components after a certain amount of time.

However, this is not always optimal: depending on the circumstances, the actual lifetime of a component can vary a lot. With predictive maintenance we try to predict when a certain component will break, which allows us to use it longer than we otherwise would, saving money. For this remaining-lifetime calculation, different machine learning techniques offer different possibilities:

  • Regression: predicting at what point in time a component will fail
  • Binary classification: predicting whether something will fail in the next x days
  • Classification: predicting whether something will fail in the next x days, between day "X" and day "Y", ...
  • Anomaly detection: detecting whether the behavior of a component is diverging from its normal behavior
  • Recommendation: predicting how much someone will like a particular movie, ...

Setup and data preparation

As said, machine learning is just one part of the whole IoT and Big Data story. It is a way of processing data, but it is very important to know what data you are dealing with and what you want to do with it. Before you start with the machine learning itself, you need an answer to the following questions:

  • What do you want to predict?
  • Do you have data measures which are relevant to the topic/question?
  • Is the data accurate?
  • Is the data connected?
  • Do you have enough data (because a lot of data is needed)?

You may have to preprocess your data to remove outliers and add new derived data. Keep in mind the type of data you want to predict and what it will be used for. If you try to predict the end of the lifetime of a component, perhaps a component of an aircraft, you would rather predict the failure too early than too late. There are some solutions to this problem:

  • In binary classification, lots of algorithms allow you to work with probabilities, e.g. flagging the group of components that have a likelihood of more than 5% to fail.
  • Keep in mind the type of data you work with: you will typically have very few samples of actual defects, which can influence your prediction.
  • Copy samples so that you have more samples of failed components (see the sketch below).
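
To illustrate that last point, here is a naive oversampling sketch in C#; the sample type and the target failure share are made up for the example, and in practice you would only rebalance the training split, never the test data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample: a set of measurements labelled with whether the component eventually failed.
public class ComponentSample
{
    public double Temperature { get; set; }
    public double Vibration { get; set; }
    public bool Failed { get; set; }
}

public static class Balancing
{
    // Duplicate the rare failure samples until they make up roughly the requested share of the set.
    public static List<ComponentSample> Oversample(IList<ComponentSample> data, double failureShare = 0.3)
    {
        var failures = data.Where(s => s.Failed).ToList();
        var healthy = data.Where(s => !s.Failed).ToList();
        if (failures.Count == 0) return data.ToList(); // nothing to balance

        var balanced = new List<ComponentSample>(healthy);
        var random = new Random(42);
        int wanted = (int)(failureShare * healthy.Count / (1 - failureShare));
        for (int i = 0; i < Math.Max(wanted, failures.Count); i++)
            balanced.Add(failures[random.Next(failures.Count)]);

        return balanced;
    }
}
```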

The machine learning algorithm

Depending on what your needs are, you choose your machine learning algorithm. I already talked about supervised and unsupervised learning and the categories within them, but within every category there are also several machine learning techniques to choose from. For choosing the right ML technique, you can use the ML cheat sheet shown in figure 1.


Figure 1: ML cheat sheet

When using this cheat sheet, take into account that you sometimes have to justify your decisions, which requires a transparent machine learning process. In that case it can be a good idea not to choose algorithms like neural networks or decision forests, but a transparent algorithm like linear regression.

Machine learning in Azure

Getting started with machine learning in Azure is very easy. With Azure ML, you can drag and drop all the modules you need. Typically, you start with your data sources as the input to train your model.

Next you preprocess your data: remove outliers, remove or replace missing values, … and calculate derived data such as averages if you need to and haven't done so yet. Once you have the data on which you will train your model, you can split it, using one part to test the model you build while training it on the other part. For the model itself you simply drag in the algorithm you want to use plus a training module, which will use the training data to generate a trained model. You can evaluate this trained model if you want to make sure you are not overfitting your data.

Next you can drag and drop the models you want to use and train them on your training data.


Generating your experiment

At the bottom, press the 'Set up web service' button to start setting up your web service. Run the new setup again, but don't forget to remove the value you want to predict from the input! The data splitter, if you use one, also has to be removed. Execute the experiment again and you can now set up your web service. If you also want to retrain your experiment at runtime, you can deploy your training experiment as well.
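
Once the web service is deployed, any application can call it over HTTPS. The sketch below assumes the classic request-response format of Azure ML web services; the endpoint URI, API key and column names are placeholders, and the real values are listed on the API help page that Azure ML generates for your service.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class ScoringClient
{
    // Placeholders: the real URI and API key are shown on the web service's API help page.
    const string ScoringUri = "https://<region>.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0&details=true";
    const string ApiKey = "<api key>";

    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", ApiKey);

            // Column names must match the input schema of the published experiment (hypothetical here).
            var payload = "{ \"Inputs\": { \"input1\": { \"ColumnNames\": [ \"temperature\", \"vibration\" ], " +
                          "\"Values\": [ [ \"75\", \"0.2\" ] ] } }, \"GlobalParameters\": {} }";

            var response = await client.PostAsync(ScoringUri, new StringContent(payload, Encoding.UTF8, "application/json"));
            Console.WriteLine(await response.Content.ReadAsStringAsync()); // JSON containing the predicted value(s)
        }
    }
}
```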



Example: Smart grids

We often see how much prices can differ depending on time or place. But why are these prices so different? Many factors could cause this: economic conditions, … In this part we will look at a more local question and ask ourselves how to price our own product. We calculate the optimal price of our product with the help of price elasticity, forecasting and optimization.

Price elasticity

Price elasticity expresses how your sales volume will change as a function of your product price:
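
Written out, the usual definition of the elasticity E of demand Q with respect to price P is:

$$E = \frac{\partial Q / Q}{\partial P / P} = \frac{\partial Q}{\partial P}\cdot\frac{P}{Q}$$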


The elasticity will express whether your price is too high or too low:

  • Elasticity < -1: elastic (high price, low demand)
  • Elasticity > -1: inelastic (low price, high demand)
  • Elasticity = -1: optimal, the balance between price and demand

When we look at the graph we can easily find the optimal point. But that is only possible if we know the curve, and if the curve depends only on the product's own price. Finding this curve will be the next challenge.

By taking the logarithm of the elasticity formula and then its derivative, we can calculate the optimal price.
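
As a sketch of that derivation, assume a constant-elasticity demand curve (an assumption the original keeps implicit). Taking logs makes the elasticity the slope of the log-log relation, and maximizing revenue R = P·Q recovers the balance point from the list above:

$$\ln Q = \ln c + E \ln P \;\;\Longleftrightarrow\;\; Q = c\,P^{E}$$

$$R = P\,Q = c\,P^{\,E+1} \quad\Rightarrow\quad \frac{dR}{dP} = c\,(E+1)\,P^{E} = 0 \quad\Rightarrow\quad E = -1$$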


The previous section gave us a purely theoretical example. In any practical case, we have to find this curve through machine learning, with the help of linear regression. Here the demand will depend not only on the price of the product itself, but also on the price of other products and on other factors. In the final stage a linear regression was fit to:

  • The price of the product itself
  • The price of competing products
  • The type of day, where we looked at whether it is a weekend, a holiday, … These are easy to take into account by using a dummy variable which is one in case of a weekend or holiday and zero otherwise

This linear regression shows us how sales evolve as a function of the prices: not only the product's own price, but those of the competing products as well. The input to the ML program consists of all the current prices of the different products, as well as properties of the product (such as what the product contains and whether it is a holiday period). For an accurate prediction, enough historical data must be available.
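
To make the feature encoding concrete, here is a small C# sketch of how such a fitted model would be applied; the coefficient values are purely hypothetical and stand in for whatever the regression learns.

```csharp
using System;

class PriceModel
{
    // Hypothetical coefficients, as a trained linear regression would produce them:
    // expected sales = intercept + b1*ownPrice + b2*competitorPrice + b3*isWeekendOrHoliday
    const double Intercept = 250.0, OwnPriceCoef = -14.0, CompetitorPriceCoef = 6.5, DayTypeCoef = 30.0;

    static double PredictSales(double ownPrice, double competitorPrice, bool weekendOrHoliday)
    {
        // The day type becomes a 0/1 dummy variable, exactly as described above.
        double dayType = weekendOrHoliday ? 1.0 : 0.0;
        return Intercept + OwnPriceCoef * ownPrice + CompetitorPriceCoef * competitorPrice + DayTypeCoef * dayType;
    }

    static void Main()
    {
        Console.WriteLine(PredictSales(ownPrice: 2.10, competitorPrice: 1.95, weekendOrHoliday: true));
    }
}
```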


Today, with so much data available, machine learning can help us in a lot of cases. It can predict, classify and recommend, and it has many applications. It is part of the drive towards automation, in which companies want to automate as much as possible without human intervention, saving a lot of valuable time and money.
However, you always have to prepare your data in a proper way and you have to choose a good model which neither overfits nor underfits the available test data.

Categories: Azure
written by: Filip Van Raemdonck

Posted on Thursday, October 22, 2015 11:30 AM

by Tom Kerkhove

Recently Microsoft revealed their plans for Azure Data Lake and how it groups several big data services in Azure under one umbrella.

In this blog post I will walk you through what it offers and how it enables you to reuse your existing skills to analyse big data.

What is Azure Data Lake?

Azure Data Lake is a family of Azure services that enable you to analyse your big data workloads in a managed manner.

It consists of three services:

  • Azure Data Lake Store - A data repository that enables you to store any type of data in its raw format.
  • Azure Data Lake Analytics - An analytics service that enables you to run jobs on data sets without having to think about clusters.
  • Azure Data Lake HDInsight - An analytics service that enables you to analyse data sets on a managed cluster running open-source technologies such as Hadoop, Spark, Storm & HBase. You can run these clusters on Windows or, since recently, also on Linux. This is an existing service that joined the family.

Here is an overview of the big data family.

© Microsoft Azure

Azure Data Lake is built on the learnings from Microsoft's internal big data service called Cosmos, which uses SCOPE as a query language and is built on top of Dryad, a Microsoft Research project.

Services such as Bing, Xbox Live, Office365 & Skype are built on top of Cosmos running more than a million jobs a day.

Let's dive a bit deeper into what Azure Data Lake Store & Analytics have to offer.

Data Lake Store - Thé storage for big data in Azure

Data Lake Store is a data repository that enables you to store all kinds of data in their raw format without defining schemas.

The store offers unlimited storage with immediate read/write access and scales to the throughput you need for your workloads. It also offers low-latency small writes for big data sets.

Azure Data Lake Store is fully integrated with Azure Active Directory, allowing you to reuse your existing directory and leverage enterprise-grade security on top of your data.

This is the "Azure Data Lake" Microsoft originally announced at //BUILD/

These are the typical characteristics of a Data Lake, an existing concept in the data space. Let's have a look what exactly a data lake is.


What is a Data Lake?

Martin Fowler wrote an interesting article explaining what a data lake is and how it's different from a data warehouse.

  • A data warehouse enables you to store data in a structured manner and is coupled to one or more schemas. Before the data is stored in the warehouse, it sometimes needs to be cleansed and transformed in the destination schema.
  • A data lake is a single repository that holds all your data in its raw format. It allows data scientists to analyse that data without unknowingly losing anything valuable.

 © Martin Fowler

However, both concepts have their drawbacks.

  • Without decent metadata on what's in your Data Lake it might turn into a "data swamp" before you know it.
  • A data warehouse is tied to its schemas, which means that you're not storing data that isn't interesting at this moment but could be later on...

I really recommend reading Martin's article on Data Lakes.


Re-using existing tools & technologies

The Store is WebHDFS-compatible, allowing you to use your existing tools to manage your data or to analyse it from any processing technology.


© Microsoft Azure


How about Azure Storage?

Azure now has two data services for storing blobs - Azure Storage & Azure Data Lake Store. When do I use which one?

I think it's safe to say that Azure Data Lake Store is thé data store for all your big data in Azure. If you're planning on analysing it, store it in Data Lake Store because of its unlimited size, low read/write latency, etc.

So when should I use Azure Storage? Simple: in all other scenarios, as long as the data size stays within the Azure Storage limits.

From a pricing perspective I think we might speculate that Azure Data Lake Store will be more expensive than Azure Storage because of several features such as enterprise-grade security based on Azure AD.

However, it still depends on several other factors, such as what the data durability offering will be, whether there are different data tiers, etc.

Analysing your data with Data Lake Analytics

Azure Data Lake Analytics allows you to run analysis jobs on data without having to worry about clusters. Azure will run your jobs on clusters that are set up & managed for you. By using Apache YARN, Analytics is able to manage the resources for its processing engine as well as possible.

By using the U-SQL query language -which we will discuss next- you can process data from several data sources such as Azure Data Lake Store, Azure Blob Storage, Azure SQL Database but also from other data stores built on HDFS. Microsoft is also planning support for other query languages -such as Hive- but no timeframe was defined.

By using more units you can easily scale the processing of your job. This means that your job will be executed over multiple vertices that will distribute the workload by splitting files into segments for example.

With Azure Data Lake Analytics you pay only for the jobs you run instead of paying for a cluster that sits idle. However, nothing has been announced on pricing yet.

Here is an example of what Azure Data Lake Analytics looks like in the portal.

© Matt Winkler, Microsoft

SQL + C# = U-SQL

U-SQL is a new query language designed by Microsoft that enables you to run your queries in a distributed manner. It is built on the learnings from T-SQL, ANSI SQL & Hive and combines a SQL syntax with C# extensibility. Whether you're working on small files or on files bigger than one exabyte, U-SQL can query it!

Using an Extract-Transform-Output pattern you can process your data with the out-of-the-box extractors & outputters, or you can build your own in C#!

The language also supports other interesting features, such as partitioned tables or variables in folder paths, which allow you to add the variable to the output data set.
You can also run inline C# statements or call external assemblies from within your script.

Michael Rys explains how you can use U-SQL with some examples.

Azure Data Lake Tools for Visual Studio

With these new services comes new tooling called the Azure Data Lake Tools for Visual Studio. These tools allow you to be more productive with U-SQL, Azure Data Lake Store & Analytics from within Visual Studio.

One of the features allows you to preview a file in the Store. You can then select the delimiter, qualifier & encoding in order to get a live preview. It's also possible to save that file locally or preview it in Excel.

© Michael Rys, Microsoft

Another feature allows you to trigger, visualize, debug & playback Analytics jobs. This allows you to see how much data is read/written, pin-point performance issues or just see the current status of your job.

© Michael Rys, Microsoft

One other interesting thing is that you can create & build your U-SQL scripts in Visual Studio. This allows you to verify your syntax before running your jobs and enables you to add them to your source control and collaborate on them.

In general, this tooling allows you to be productive with Azure Data Lake without the need to leave your (precious) IDE! I can personally say that these tools are really great and a nice addition to Visual Studio!

This video walks you through some of the features available.

Integrating Azure Data Lake in your data pipelines

Gaurav Malhotra announced that Azure Data Factory will offer a source & sink for Azure Data Lake Store.

This will allow you to move data to Azure Data Lake Store or export it to another data store in Azure or on-premises, e.g. Azure Blob Storage or an on-premises file system.

Unfortunately, there is no news on Azure Data Lake Analytics integration (yet). It sounds feasible to be able to trigger jobs from within your data pipelines.

Openness at Microsoft

We've all noticed that things at Microsoft have changed on all fronts since Satya took over, including the openness of the company.

In the past few years Microsoft has been contributing to several open-source technologies such as Docker and has open-sourced some of its own technologies under the .NET Foundation.

In the run-up to the announcements made around Azure Data Lake, Microsoft has been working closely with partners such as Hortonworks to make the platform compatible with existing open-source technologies. They have also contributed to several projects, going from YARN & Hive to WebHDFS and beyond.

T. K. “Ranga” Rengarajan announced that Microsoft has been working closely with partners to ensure you have the best applications available to work with Azure Data Lake. Some of these partners are BlueTalon & Dataguise for big data security & governance, and Datameer for end-to-end big data analytics.

Interested in more? Raghu walks you through some of the contributions that were made by Microsoft.

Azure Data Lake Opens A New World For Developers

The Azure Data Lake services offer data scientists & developers an easy-to-use platform for storing & analysing big data, either by using clusters (HDInsight) or by using jobs (Analytics). With the addition of HDInsight to the family, Azure Data Lake is your answer for big data in Azure.

People who have experience with big data can choose to run their jobs on a managed cluster -running Hadoop, Spark or another technology- or let Analytics run them in the future.

On the other hand, Analytics allows developers to reuse their existing skills. That's exactly what Codit did.

U-SQL was especially helpful because we were able to get up and running using our existing skills with .NET and SQL. This made big data easy because we did not have to learn a whole new paradigm.

With Azure Data Lake, we were able to process data coming in from smart meters and combine it with the energy spot market prices to give our customers the ability to optimize their energy consumption and potentially save hundreds of thousands of dollars. - Sam Vanhoutte, CTO at Codit

We have been part of the private preview for a while and I'm astonished how powerful the platform is. When you're running big queries you can easily assign more processing units and go from 1+ hour of processing to literally 5 minutes.

While sensors are sending us all their events, we are storing them all in Azure Data Lake in their raw format. With Azure Data Lake Analytics we are able to process them and explore what the data is telling us. If the data is no longer relevant, we can aggregate it and archive it for long-term storage or for deep learning.

I've been interested in big data for a while, but managing a cluster scared me off and I followed from afar. With Azure Data Lake I'm now capable of doing it myself without having to worry about the clusters I'm running, because I don't have to!

Microsoft just made big data easy.

Azure Data Lake Store & Analytics will be in public preview later this year.

Thanks for reading,


Thanks to Matt Winkler for reviewing.

Here are the official announcements:

written by: Tom Kerkhove

Posted on Friday, October 9, 2015 4:20 PM

by Tom Kerkhove

Last week Microsoft made a lot of announcements regarding Microsoft Azure during AzureCon.

In this blog post I'll discuss a few of them that I personally found interesting.

Azure has changed a lot since it was released and keeping track of all the changes is (almost) impossible.

In one of the AzureCon sessions I found a nice overview of the Azure Landscape summarizing almost all of the services that are available at the moment.

Unfortunately this landscape is already out-of-date and is missing some services such as Azure Data Lake (private preview), IoT Hub (public preview), IoT Suite (public preview) and others. That said, it still gives a nice overview of what the platform offers as of today.

Today I'll walk you through some of the announcements made at AzureCon last week and what you can expect from the services.

Azure IoT Hub

One of the most -if not THE most- important aspects of IoT is security. We need to know who is sending data to us and where they are physically located, and be able to revoke access, push software updates, and more, all at scale. Building this yourself is hard and requires a big investment.

Because of this Microsoft released Azure IoT Hub in public preview, a managed service that enables secure device-to-cloud ingestion & cloud-to-device messaging between millions of devices and the cloud. By using bi-directional communication you are able to send commands from the cloud to your devices.

With IoT Hub comes a device registry, allowing you to store metadata of a device and use a per-device authentication model. When you suspect that a device has gone rogue, you simply revoke it in the IoT Hub.

Devices can communicate with IoT Hub by using HTTP 1.1 or AMQP 1.0. For those who are new to AMQP, I recommend watching Clemens Vasters' "AMQP 1.0" video series.
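
As an illustration of the device side, here is a minimal sketch using the Azure IoT device SDK (Microsoft.Azure.Devices.Client); the connection string and telemetry payload are placeholders you would replace with your own.

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client; // NuGet: Microsoft.Azure.Devices.Client

class Device
{
    // Placeholder: the per-device connection string issued when you register the device in the IoT Hub registry.
    const string ConnectionString = "HostName=<your-hub>.azure-devices.net;DeviceId=<device-id>;SharedAccessKey=<key>";

    static async Task Main()
    {
        var client = DeviceClient.CreateFromConnectionString(ConnectionString);

        // Device-to-cloud: send a telemetry event.
        var telemetry = "{ \"temperature\": 21.3 }";
        await client.SendEventAsync(new Message(Encoding.UTF8.GetBytes(telemetry)));

        // Cloud-to-device: wait for a command sent from the back-end.
        Message command = await client.ReceiveAsync();
        if (command != null)
        {
            Console.WriteLine(Encoding.UTF8.GetString(command.GetBytes()));
            await client.CompleteAsync(command); // acknowledge so the hub removes it from the queue
        }
    }
}
```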

Interested in more?

  • Learn how you can manage your devices using IoT Hub & IoT Suite in this article
  • Learn how to connect your IoT devices with the Azure IoT libraries here
  • Read Clemens Vasters' whitepaper on Service Assisted Communication
  • Learn how IoT Hub is different from Event Hubs in this article
  • Learn how you can support additional protocols in this article
  • Get an overview of Azure IoT Hub here

Azure IoT Suite

Next to Azure IoT Hub they've also released Azure IoT Suite in public preview. This suite is an abstraction layer on top of existing services such as IoT Hub, Stream Analytics, DocumentDb, Event Hubs and others allowing you to focus on your scenario rather than the implementation.

Based on the preconfigured solution you choose, the service will generate all the artifacts in a matter of minutes and you're ready to go. Once this is completed we can change the configuration, scaling, etc... as we've always done through the Azure portal. 
An example of such a preconfigured solution is remote monitoring of your devices in the field.

The suite comes with a management portal with a dashboard that gives you an overview of the telemetry history per device, a map with your devices, a history of alerts and some gauges showing, for example, the humidity. Of course, this is different for each solution you choose to provision.

There is also integrated device management (on top of IoT Hub) but personally I'm glad to see built-in support for rules & actions. This allows us to add business logic to the solution without writing any code!

Microsoft Azure Certified for IoT

As part of this announcement Microsoft also announced the Microsoft Azure Certified for IoT program for devices that are tested and verified to work with the Azure IoT Suite.

We believe that IoT Suite is a very good way to quickly generate a reference implementation that can then be customized for the customer. This is ideal for prototyping, demos and standard solutions.

Another great thing to note is that all the preconfigured solutions from Microsoft are available on GitHub, allowing you to customize what you want - the management portal, for example. You can find the Remote Monitoring example here.

To take it a step further - It would be great to have the ability to save our reference architecture as a template and re-provision it again later on or share it with our peers.

You can now get started with Azure IoT Suite and provision a solution for you here.

Interested in more?

  • Watch an introductory video on Azure IoT Suite here
  • Read more about the Microsoft Azure Certified for IoT program here

Azure Container Service

Almost one year ago Docker & Microsoft announced their partnership to drive the adoption of distributed applications with containerisation. Since then Microsoft has been working on a Docker Engine for Windows Server and has contributed to the Docker ecosystem, and containerisation has become the next big thing - works on your machine? Ship your machine!

During AzureCon Microsoft announced the Azure Container Service, a service to easily create & manage clusters of hosts running Docker, Apache Mesos, Mesosphere Marathon & Docker Swarm. For this Microsoft partnered with Mesosphere, a company building on top of the Apache Mesos project.

While the service is at an early stage, you can already deploy a quickstart ARM template that creates a Mesos cluster with Marathon & Swarm for you. Later this year the Azure Container Service itself will become available, which will make this even easier. The service will be in charge of creating and managing the Azure infrastructure, while Docker stays in charge of running the app code.

Interested in more?

  • Learn more about Docker, Azure Container Service & Windows Server containers here
  • Read more on Mesosphere and how Mesos powers the service here
  • Follow the Docker & Microsoft partnership here
  • Learn more about Mesosphere here
  • Read more on containers here
  • Read the announcement here

Azure Compute Pre-Purchase Plan

As of the 1st of December you will be able to pre-purchase your compute power with the Compute Pre-Purchase Plan. This allows you to reserve capacity for predictable workloads, such as development VMs during business hours, and save up to 63% of what you pay today! This will be available in every region.

From my understanding this is an offering similar to AWS EC2 Reserved Instances; here's what Amazon is offering.

Azure Security Center

Over the past couple of months we've seen new services that increase the security of your solutions in Azure - one example is Azure Key Vault. Haven't heard about it? Read more about it here!

During AzureCon Microsoft added an additional service for building more secure solutions - Azure Security Center. Security Center provides a unified dashboard with security information on top of your existing Azure resources. The goal is to get insight into which resources are vulnerable and to detect events that went undetected in the past. Next to that, the service heavily analyses all the data and uses machine learning to improve the detection system.

An example of this is somebody trying to brute-force your VMs: Azure Security Center then tries to determine where the attacker is located and creates awareness around this.

Based on policies you define, the service will also give recommendations to improve the security of your resources. The example given was a policy stating that every Web App should have a firewall configured; this allows the service to detect Web Apps without a firewall and recommend a fix.

While the service isn't publicly available yet, you can already request an invite here. Public preview is scheduled for later this year.

Interested in more?

  • Read more on what the Azure Security Center offers here
  • Learn more about the Azure Security Center in this video
  • Learn more about security & compliances in Azure here
  • Learn more about Encryption and key management with Azure Key Vault in this video

New SAS capabilities for Azure Storage

After adding support for client-side encryption with Azure Key Vault, the Azure Storage team has extended its security capabilities with the addition of three features for Shared Access Signatures:

  • Account-level SAS tokens - You can now create SAS tokens at the account level as an alternative to storage account keys. This allows you to give a person or application access to manage your account without exposing your account keys. Currently only Blob & File access is supported; Queues & Tables are coming in the next two months.
  • IP Restrictions - Specify one or a range of IP addresses for your SAS token from which requests are allowed, others will be blocked.
  • Protocol - Restrict account & service-level SAS tokens to HTTPS only.

For more information on the new SAS capabilities or other Azure Storage announcements, read the announcement or read about using Shared Access Signatures (SAS) for Azure Storage here.
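
To give an idea of how these options come together, here is a sketch of creating an account-level SAS token with the .NET storage client (WindowsAzure.Storage 5.x or later); the connection string and IP range are placeholders.

```csharp
using System;
using Microsoft.WindowsAzure.Storage; // NuGet: WindowsAzure.Storage

class AccountSasSample
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage connection string>"); // placeholder

        var policy = new SharedAccessAccountPolicy
        {
            Permissions = SharedAccessAccountPermissions.Read | SharedAccessAccountPermissions.List,
            Services = SharedAccessAccountServices.Blob | SharedAccessAccountServices.File,
            ResourceTypes = SharedAccessAccountResourceTypes.Container | SharedAccessAccountResourceTypes.Object,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddHours(8),
            IPAddressOrRange = new IPAddressOrRange("192.168.0.1", "192.168.0.255"), // restrict callers to this range
            Protocols = SharedAccessProtocol.HttpsOnly                               // HTTPS only
        };

        string sasToken = account.GetSharedAccessSignature(policy);
        Console.WriteLine(sasToken); // append this token to resource URIs instead of handing out account keys
    }
}
```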

General availability

During AzureCon several services were announced to become generally available in the near future.
Here are some of them:

  • Azure HDInsight on Linux
  • App Service Environment
  • Azure File Storage
  • Mobile Engagement
  • Azure Backup

They also announced that the regions Central India (Pune), South India (Chennai) & West India (Mumbai) are now available for everyone. Here is the full list of all the supported locations.


There were a ton of announcements, of which these were only a few. If you want to read all of them, I suggest you go to the official Microsoft Azure blog.

All the sessions of AzureCon are available on-demand on Channel9.

Thanks for reading,


Categories: Azure, IoT, Security
written by: Tom Kerkhove

Posted on Thursday, September 10, 2015 8:18 PM

by Sam Vanhoutte

On September 10 and 11, Codit was present at the first Cortana Analytics Workshop in Redmond, WA. This blog post is a report for the community on the content we collected and our impressions of the Cortana Analytics offering. "From data to decisions to action" is the tagline of the event.

Once again, beautiful weather welcomed us in Redmond when we arrived for the first Cortana Analytics workshop. A lot of people from all over the world joined this event, which was highly anticipated in the Microsoft data community.

As Codit is betting big on new scenarios such as SaaS and hybrid integration, API management and IoT, we really understand that the real value for our customers will be gained through (Big) Data intelligence. Next to a lot of new faces, there were also quite a few integration companies attending the workshop, such as our partner Axon Olympus, our Colombian friends from IT Sinergy and Chris Kabat from SPR.

I will start with some impressions and opinions from our side.  After that, you can find more information on some of the sessions we attended.

Key takeaways

At first sight, the Cortana Analytics Suite combines a lot of the existing data services on Azure:

  • Information management: Azure Data Catalog, Azure Event Hubs, Azure Data Factory
  • Big data stores: Azure Data Lake, Azure SQL Data Warehouse
  • Machine Learning and Analytics: Azure ML, Azure Stream Analytics, Azure HDInsight
  • Interaction: Power BI (dashboarding), Cortana (personal assistant), Perceptual Intelligence (face, vision, speech, text)

So far, no specific pricing information has been announced. The only thing said in this regard was "one simple subscription model".

There are a lot of options for implementing big data solutions on Azure, and new services get added frequently. It will become very important to make the right choices for the right solution.

  • Will we store our data in blob storage or in HDFS (Data Lake)?
  • Will we leverage Spark, Storm or the easy-to-use Stream Analytics?
  • How will we query data?

Keynote session

The keynote session, in a packed McKinley room, was both entertaining and informative. The event was kicked off by Corporate Vice President Joseph Sirosh, who positioned Microsoft's Big Data and Analytics offering and Cortana Analytics. The suite should give people the answers to typical questions such as 'What happened?', 'Why did it happen?', 'What will happen?' and 'What should I do?'. To answer those questions, the Cortana Analytics Suite gives you the tools to access data, analyze it and take decisions.

Every conference has a diagram that returns in every single session. For Cortana Analytics it is the following one, which shows all the services that are part of the platform (apologies for the phone picture quality).

The top differentiators for Cortana Analytics Suite

  • Agility (volume, variety, velocity)
  • Platform (storage, compute, real-time ingest, connectivity to all your data, information management, big data environments, advanced analytics, visualization, IDEs)
  • Assistive intelligence (Cortana!)
  • Apps (vertical toolkits, hosted APIs, ecosystem)
  • Features: elasticity, fully managed, economic, open source (R & Python)
  • Facilitators
  • Secure, compliant, auditable
  • One simple subscription model
  • Future-proof, with research from MSR & Bing

Cortana Analytics should be Agile, Simple and beautiful.

A firm statement was that "if you can cook by following a recipe, you should be able to use Cortana Analytics".  While I don't think that means we can go and fire all of our data scientists, I really believe that the technology to perform data analytics becomes more easily available and understandable for traditional developers and architects.

This is achieved through the following things.

  • Fully managed cloud services
  • A fully integrated platform
  • Very simple to purchase
  • Productize, simplify and democratize
  • Partner eco system


A lot of the services were demonstrated using some interesting and well-known examples. Especially the demo and its underlying architecture were very interesting. That application was only possible through the tremendous scalability of Azure and the intelligent combination of the right services on the Azure platform.

Demystifying Cortana Analytics

Jason Wilcox, Director of Engineering, was up next. During a long intro on data analytics, he described the 'process' for data analytics as follows:

  1. Find out what data sets are available
  2. Gain access to the data
  3. Shape the data
  4. Run first experiment
  5. Repeat steps 1, 2, 3 and 4 until you get it right
  6. Find the insight
  7. Operationalize & take action

In his talk, Jason explained the following things about Cortana Analytics:

  • It is a set of fully managed services (true PaaS!)
  • It works with all types of data (structured and unstructured) at any scale. Azure Data Lake is a native HDFS (Hadoop File System) for the cloud. It is integrated with HDInsight, Hortonworks and Cloudera (and more services to come). It is accessible from all HDFS-compatible projects, such as Spark, Storm, Flume, Sqoop, Kafka, R, etc. And it is fully built on open standards!
  • Operationalize the data through Azure Data Catalog (publish data assets), which will be integrated in Cortana Analytics
  • Cortana Analytics is open, to embrace and extend, and allows customers to use best-of-breed tools.
  • It's an end-to-end solution from data ingestion to presentation.


Real-time data processing: how do I choose the right solution?

Two Europeans from the Data Insight Global Practice (Simon Lidberg, SE and Benjamin Wright-Jones, UK) gave an overview of the three major streaming analysis services available on Azure: Azure Stream Analytics, Apache Storm and Apache Spark.

Azure Stream Analytics

Azure Stream Analytics is a well-known service for us and we've been using it for more than a year now. We also have some blog posts and talks about it. Simon gave a high-level overview of the service:

  • Fully managed service: No hardware deployment
  • Scalable: Dynamically scalable
  • Easy development: SQL Language
  • Built-in monitoring: View system performance through Azure portal

The typical Twitter sentiment demo was shown afterwards. In my opinion, Azure Stream Analytics is indeed extremely easy to get started with and to build quick-win scenarios on Azure for telemetry ingestion, alerting and out-of-the-box reporting.

Storm on HDInsight

Storm is a streaming framework available on HDInsight. A quick overview of HDInsight was given, positioning things like Map/Reduce (batch), Pig (script), Hive (SQL), HBase (NoSQL) and Storm (streaming).

This is Apache Storm

  • Real time processing
  • Open Source
  • Visual Studio integration (C# and Java)
  • Available on HDInsight

Spark on HDInsight

Spark is extremely fast (3x faster than Hadoop in 2014). Spark also unifies and combines batch processing, real-time processing, stream analytics, machine learning and interactive SQL.

Spark is integrated very well with the Azure platform.  There is an out of the box PowerBI connector and there is also support for direct integration with Azure Event Hubs.  

The differentiators for Spark were described as follows:

  • Enterprise Ready (fully managed service)
  • Streaming Capabilities (first class connector for Event Hubs)
  • Data Insight 
  • Flexibility and choice

Spark vs Storm comparison

Spark differs in a number of ways:

  • Workload: Spark implements a method for batching incoming updates vs individual events (Storm)
  • Latency: Seconds latency (Spark) vs Sub-second latency (Storm)
  • Fault tolerance: Exactly once (Spark) vs At least once (Storm)

When to use what?

The following table compared the three technologies. My advice would be to opt for Stream Analytics for quick wins and straightforward scenarios. For more complex and specialized scenarios, Storm might be a better solution. "It depends" would be the typical answer to the question above.

Below is a good comparison table, where the '*' typically means "with some limitations".

(Comparison criteria: multi-tenant service, deployment model, deployment complexity, open-source support, supported languages (.NET, Java, Python; SparkSQL, Scala, Python, Java, …) and Power BI integration, listed as "Yes, native" for two of the three services.)

Overview of the Cortana Analytics Big Data stack (pt2)

In this session, four technologies were demonstrated and presented by four different speakers. A very good session to get insight into the broad ecosystem of HDInsight-related services.

We were shown Hadoop (Hive for querying), Storm (for streaming), HBase (NoSQL) and the new Big Data applications that will become available on the new Azure portal soon.

A nice demo leveraging HBase on Hadoop was the tweet sentiment demo.

Real-World Data Collection for Cortana Analytics

This was a very interesting session, presenting a real-world IoT project (soil moisture). It's always good to hear people talk from the field: people who have had issues and solved them, people who have the true experience. This was one of those sessions. It's hard to give a good summary, so I'll just write down some of the findings I took away.

  • You need to get the data right!
  • Data scientists should know the meaning of the data and sit down with the right functional people.
  • An IoT solution should be built in such a way that it supports changing sensors, data types and versions over time.

Data Warehousing in Cortana Analytics

In this session, we got a good overview of the highly scalable Azure SQL Data Warehouse offering, presented by Matt Usher. The characteristics of Azure SQL DW are:

  • You can start small, but scale huge
  • Designed for on-demand scale (<1 minute for resizing!)
  • Massive parallel processing
  • Petabyte scale
  • Combine relational and non-relational data (via PolyBase with HDInsight!)
  • It is integrated with AzureML, Power BI and Azure Data Factory
  • There is SQL Server compatibility (UDF's, Table partitioning, Collations, Indices and Columnstore support)


Closing keynote: the new Power BI

James Phillips, CVP, Microsoft

Power BI is obviously the flagship visualization tool and it gets a lot of attention. While it still has shortcomings for a lot of scenarios, it's indeed an awesome tool that allows you to build reports very fast. In this session, we got an overview of the new enhancements to Power BI and some insight into what's coming next.

While most of these features were already known, it was good to get an overview and recap of them:

  • Power BI content packs
  • Custom Power BI visuals
  • Natural language query
  • On premise data connectivity

Some things I did not know yet

  • There is R support in Power BI Desktop (plotting graphs and generating word clouds)
  • When you add devToolsEnabled=true in the url, there are custom dev tools available in Power BI
  • Cortana can be integrated with Power BI.  


This was it for the first day.  You can expect another blog post on Day 2 and more on Machine Learning from my colleague Filip.

Cheers! Sam


written by: Sam Vanhoutte