
Azure Data Explorer and Industrial IoT

Discover our experiences working with Azure Data Explorer in the context of a large-scale Industrial IoT project.

Azure Data Explorer, formerly known internally as “Kusto”, is a hyper-scale, append-only, distributed database. At Codit, we have had the pleasure of working with the Kusto engineering team since January 2018, when we onboarded Azure Data Explorer as the main database for a client’s IoT telemetry.

In this blog post, I’d like to share our experiences working with Azure Data Explorer in the context of a large-scale Industrial IoT project.

Azure Data Explorer is an append-only database, which means you can add data to Azure Data Explorer, but you cannot alter existing records. Therefore, its sweet spot is for high-throughput ingress of events that need to be persisted and queried at low latency. It is a perfect fit for IoT scenarios, where you have large ingress of sensor signals, which you never need to alter.

Furthermore, Azure Data Explorer is optimized for time-series analysis across very large datasets. We have seen Azure Data Explorer perform very complex and sophisticated queries across terabytes of data at a price/performance ratio that none of the other databases we evaluated could match.

Database structure

In essence, Azure Data Explorer is a distributed cluster instance provisioned from Azure. This cluster is maintained and operated by Microsoft. Your team does not have to spend effort designing indexes or performing typical DBA maintenance tasks. An Azure Data Explorer cluster hosts databases, and each database hosts tables. Your data needs to be mapped into a tabular format, but Azure Data Explorer does support dynamic column types, which can hold arbitrary semi-structured data such as JSON.
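As an illustration, a telemetry table with a dynamic column could be created with a control command like the following (the table and column names are hypothetical examples, not the actual schema from our project):

```kql
.create table Telemetry (
    DeviceId: string,
    Timestamp: datetime,
    Value: real,
    Payload: dynamic    // free-form column: can hold arbitrary JSON per record
)
```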

Security model

Azure Data Explorer’s security model is based on Azure Active Directory. You assign permissions to clusters, databases and tables based on Azure Active Directory users, security groups or service principals. If you are used to working with Azure Active Directory (either with Azure Active Directory-issued JWT user tokens, or with Azure Active Directory service principals), you will have no problems integrating your user and application access with Azure Data Explorer tables.
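For example, permissions are granted through control commands that reference Azure Active Directory principals; the database, table and principal names below are hypothetical placeholders:

```kql
// Give an Azure Active Directory user read access to a database
.add database TelemetryDb viewers ('aaduser=jane.doe@contoso.com')

// Give a service principal (application) ingestion rights on a single table
.add table Telemetry ingestors ('aadapp=<application-id>;<tenant-id>')
```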

Data ingestion

Azure Data Explorer initially offered two ways of ingesting data.

Queued ingestion. With this method, data is ingested asynchronously. Your messages are first stored in blob storage, and a message is put into a queue that is monitored by Azure Data Explorer. Azure Data Explorer receives this message and ingests the messages from blob storage. This method decouples the ingestion of messages from the Azure Data Explorer engine, allows for efficient retry policies, and handles large-throughput scenarios efficiently. All this orchestration is managed for you by the Client SDK, but under the covers, this is what happens. The drawback of this method is that the latency from ingestion to queryability can range from seconds to minutes (data is not immediately available for queries).
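The decoupling can be pictured with a minimal, self-contained sketch that uses in-memory stand-ins for blob storage and the queue; in reality the Client SDK and Azure services perform this orchestration for you:

```python
import json
import queue

# In-memory stand-ins for Azure Blob Storage and the ingestion queue.
blob_storage = {}                # blob name -> serialized payload
ingestion_queue = queue.Queue()  # messages pointing at blobs
table = []                       # the "database table" records land in

def queued_ingest(records, blob_name):
    """Client side: upload a batch to 'blob storage' and enqueue a pointer."""
    blob_storage[blob_name] = json.dumps(records)
    ingestion_queue.put({"blob": blob_name})

def engine_drain():
    """Engine side: pick up queue messages and ingest the referenced blobs."""
    while not ingestion_queue.empty():
        msg = ingestion_queue.get()
        table.extend(json.loads(blob_storage[msg["blob"]]))

# The client returns as soon as the blob and queue message are written...
queued_ingest([{"device": "d1", "value": 21.5}], "batch-001")
assert len(table) == 0  # ...but the data is not yet queryable...
engine_drain()
assert len(table) == 1  # ...until the engine drains the queue.
```

The key point is that the client call returns as soon as the blob and queue message are written; queryability follows only once the engine processes the queue.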

Direct ingestion. With this method, the client pushes the data immediately into the Azure Data Explorer engine. The advantage is that the time to queryability is lower than with queued ingestion. However, the client is now responsible for error handling (which queued ingestion does for you by using blobs and queues), and this ingestion method puts a higher load on the cluster.

The road map also includes additional ingestion sources, to achieve even quicker and tighter integrations with other Azure services. See the Event Hubs integration as an example: https://docs.microsoft.com/en-us/azure/data-explorer/ingest-data-event-hub

You can start ingesting data by leveraging the capabilities of the Ingestion SDK:
https://www.nuget.org/packages/Microsoft.Azure.Kusto.Ingest

Data access

You query your data set by using KQL (the Kusto Query Language). Yes, it’s another language that you need to learn, but it’s so simple, intuitive and straightforward that you’ll be fully productive in no time at all. It is also the same query language already used by Azure Log Analytics and Application Insights, so you might be familiar with it already.
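For instance, a typical time-series query over a hypothetical Telemetry table (the table and column names are illustrative) reads almost like a pipeline of data transformations:

```kql
Telemetry
| where Timestamp > ago(1h)                              // last hour of signals
| where DeviceId == "device-42"
| summarize AvgValue = avg(Value) by bin(Timestamp, 5m)  // 5-minute averages
| order by Timestamp asc
```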

You can send your KQL queries to your Azure Data Explorer cluster via a REST endpoint.
However, we recommend using Azure Data Explorer’s own client SDKs for this purpose, as they handle a lot of the complexity for you.
There are .NET, Python and Java SDKs that you can use to query the database. KQL and Azure Data Explorer’s query capabilities are broad enough to merit multiple blog posts of their own.
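To make the REST option concrete, here is a rough Python sketch that assembles (but does not send) a query request. The endpoint path and the `db`/`csl` field names follow the public query REST API, and the cluster URI, database name and token are hypothetical placeholders:

```python
import json

def build_query_request(cluster_uri, database, kql, bearer_token):
    """Assemble (but do not send) a Kusto REST query request.
    The bearer token is an Azure Active Directory-issued JWT."""
    return {
        "url": f"{cluster_uri}/v1/rest/query",
        "headers": {
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
        # The request body carries the target database and the KQL text.
        "body": json.dumps({"db": database, "csl": kql}),
    }

request = build_query_request(
    "https://mycluster.westeurope.kusto.windows.net",  # example cluster URI
    "TelemetryDb",
    "Telemetry | take 10",
    "<aad-token>",
)
```

In practice, the client SDKs build and send this request for you, including authentication and response parsing.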

You can start playing with queries using these SDKs:
https://www.nuget.org/packages/Microsoft.Azure.Kusto.Data/
https://github.com/Azure/azure-kusto-java
https://github.com/Azure/azure-kusto-python

Why do I need Azure Data Explorer (yet another database)?

I will tell you why we selected Azure Data Explorer for our purpose. This question matters because the range of database offerings in Azure keeps growing, and it’s difficult to understand the differences between them.

Our key guiding principles for selecting a database for our IoT telemetry were the following:
• Fully managed. Because devs don’t like to waste their time solving infrastructure problems
• Optimized for time series analysis
• Highly-scalable and very performant (again, looking at time-series queries)
• Easy to integrate with (simple to push data into and with a simple API surface)

We analyzed the following databases:
• InfluxDB: Influx required us to deploy and manage our own clusters (VMs), which did not meet our selection criteria above.
• Cosmos DB: Cosmos DB is not optimized for running complex time-series queries. Many of the operators we needed are not supported at all, and for the rest we could not achieve an acceptable price/performance ratio given the volume of data we have to handle.
• Time Series Insights: TSI is, in fact, built on top of Azure Data Explorer. However, our customer found the pricing model of TSI unattractive for their very specific multi-tenant scenarios, and our devs found that the APIs and client SDKs of Azure Data Explorer accelerated their development work.

Azure Data Explorer is not a silver bullet. For low-volume scenarios, the entry price point is prohibitively expensive. If you need an all-in-one package and no multi-tenancy, you might be better off with Time Series Insights (which gives you a database, an API and a visualization engine). If you need to edit records, then you might be better off with Cosmos DB.

Should I embrace Azure Data Explorer at this early stage?

Azure Data Explorer is a new service offering in Azure. However, the service has been heavily used internally at Microsoft for several years already. In fact, most of Azure’s own internal telemetry is handled by internal Azure Data Explorer clusters. Within a period of three years, Azure Data Explorer’s deployment inside Azure grew to a scale bigger than what most organizations will ever need. The technology is mature, stable and reliable. Also, as a Ring 1 service, you will find that Azure Data Explorer is already available in many Azure regions (atypical for new Azure services, which are usually first rolled out in a handful of regions).

Azure Data Explorer, as a new Azure service, will need to invest further in customer-facing topics, such as fine-tuning its pricing model, providing mature, publicly accessible documentation, and enabling Microsoft’s support organization (Premier and CSS) to handle Azure Data Explorer cases. Now that it is Generally Available, the offering also still needs to mature and build a community. However, the service and the technology are mature enough for our use cases, and after working closely with the Microsoft Azure Data Explorer engineering team, I am confident they will make the right investments to grow in this space too.
