
Invictifying Azure Data Factory

This blog will showcase a new view on setting up Azure Data Factory by making use of concepts of the Invictus Methodology.

Invictus Methodology

Codit has been creating integration architectures for years. These years of experience have resulted in a mature integration architecture and methodology, which we have named Invictus.

Like in the not-so-old days of BizTalk, it is vital to set up your messaging integrations in a way that promotes reuse, agility and flexibility. Decoupling is the most important concept for achieving this, and it manifests itself in several areas:

  • On message format: When mapping data from source to target, we never do this directly but always with a common/canonical format in between.
  • On consumption: The source system of the data has no knowledge of the potentially multiple consumers of that data. Likewise, the consumers of data have no knowledge of each other and make no assumptions on the source of the data.
  • On availability: We cannot assume that all systems involved have the same availability.
  • On message exchange patterns: The technical manner of receiving data (by push or pull) should be strictly independent of the way the data is published.
  • On status of data: Although the source system dictates the state of the data, it is only allowed to do so within its own domain, not throughout the entire landscape. For new consumers the data is new, but the data itself may be old.

Codit has taken these principles and concepts to the cloud in the Invictus for Azure offering. Invictus for Azure is Codit's off-the-shelf integration solution that allows you to easily implement your integrations in Azure. Without going into too much technical detail, the picture below shows the various Azure components used in Invictus for Azure. It clearly shows that there is a strict decoupling between receive and send processing.

Overview of the Invictus Methodology

With Invictus for Azure and the architecture it supports, setting up quality integrations in Azure is a breeze.

Azure Data Factory

Integration in Azure is not only messaging but can also involve large datasets. Although Logic Apps can handle large sets of data, there are hard limits (see https://docs.microsoft.com/en-us/azure/logic-apps/logic-apps-limits-and-config for details). Moreover, Logic Apps are message-oriented by nature. In other words, Logic Apps may not be the best tool to get the job done. For larger datasets, we can use Azure Data Factory, Microsoft's answer for moving and transforming large datasets.

The most noticeable parts of Azure Data Factory are pipelines and dataflows. Pipelines are similar to Logic Apps as they describe the workflow that needs to be done. A typical pipeline contains various sequential steps like copying data, invoking functions and triggering dataflows.

Sample of an ADF pipeline

Dataflows can be compared to mapping one message to another, but they can do more, such as:

  • joining different sets of data (streams)
  • aggregation of streams
  • filtering data from streams

Sample of a dataflow in ADF

Best of both worlds

Like any Microsoft tool, Azure Data Factory is flexible and powerful. But the saying "with great power comes great responsibility" has never been more true. Microsoft will not prevent you from creating tightly coupled integrations with Logic Apps, and the same applies to Azure Data Factory. If we do not take decoupling into account, we end up with tightly coupled pipelines and dataflows, losing the flexibility and agility the business requires from IT.

So why not combine both? Can we set up Azure Data Factory with the ideas and principles of the Invictus Methodology? And with regard to messaging, can we bridge Azure Data Factory and messaging so that we detect and publish changes from large datasets?

Invictifying Azure Data Factory

The cornerstone of Invictifying Azure Data Factory is decoupling. We achieve this by splitting up pipelines and dataflows, similar to how we split up Logic Apps. With the Invictus Methodology in mind, the architecture for Azure Data Factory looks like this:

Invictifying Azure Data Factory Architecture

The picture shows a clear separation between the source system and the target systems, just like the Logic App setup in the architecture described by Invictus. Instead of an Azure Service Bus, we use Azure Blob Storage. Any subscribing pipeline acts on the BlobCreatedEvent.
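Each consuming pipeline subscribes independently to the same storage event. A minimal sketch of handling such an event, with a simplified payload (the field names follow the Event Grid schema for Blob Storage, but the real event carries many more fields):

```python
from typing import Optional

# Simplified BlobCreated event as delivered by Azure Event Grid.
SAMPLE_EVENT = {
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/datasets/blobs/orders.json",
    "data": {"api": "PutBlob"},
}

def blob_path_from_event(event: dict) -> Optional[str]:
    """Return '<container>/<blob>' for a BlobCreated event, or None otherwise."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return None
    # Subject format: /blobServices/default/containers/<container>/blobs/<blob>
    _, _, container_and_blob = event.get("subject", "").partition("/containers/")
    return container_and_blob.replace("/blobs/", "/", 1) or None
```

Because consumers react to the event rather than being called by the source pipeline, the source has no knowledge of who consumes the dataset.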

All consuming pipelines act on the same, entire dataset in storage. That is fine, but what if we only want the changes in the dataset? For one of our clients, it was decided not to send all the data every time, but to switch to changes-only (deltas), so the client only received data when there were changes. For this we can use the Changefeed mechanism in Azure Cosmos DB (see https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed-design-patterns for details).

The picture below shows a pipeline that is subscribed to the same BlobCreatedEvent but interacts with Azure Cosmos DB to determine the changes.

Dealing with deltas

The mechanism to determine whether a record has changed is fingerprinting. Instead of comparing each and every field, we generate a fingerprint for each record and compare it with the fingerprint based on the data in Azure Cosmos DB. A matching fingerprint means the data is exactly the same, so the record is filtered from the incoming stream. All records with a different fingerprint are saved or updated in Azure Cosmos DB. The data in Azure Cosmos DB is stored in the canonical format.
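A minimal sketch of the fingerprinting step, with a plain dict standing in for the fingerprint lookup (and upsert) against Azure Cosmos DB; all names are hypothetical:

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    # Serialize the canonical record with sorted keys so the same data
    # always produces the same hash, regardless of field order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def filter_changed(incoming: list, known: dict) -> list:
    """Keep only records whose fingerprint differs from the stored one.
    `known` maps record id -> fingerprint and stands in for Azure Cosmos DB."""
    changed = []
    for record in incoming:
        fp = fingerprint(record)
        if known.get(record["id"]) != fp:
            changed.append(record)
            known[record["id"]] = fp  # save/update, like the Cosmos DB upsert
    return changed
```

Running the same batch through `filter_changed` twice yields an empty result the second time: unchanged records are filtered out before they ever reach a consumer.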

The update or insert in Azure Cosmos DB triggers the Changefeed. The Changefeed is captured by an Azure Durable Function, which splits the Changefeed documents into separate messages and publishes each message on the Invictus Service Bus. From there, any Logic App can pick up the message for further processing.
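The fan-out done by the Durable Function can be sketched as follows; `send_message` is a hypothetical stand-in for the Service Bus sender, not the real SDK call:

```python
import json

def split_and_publish(changefeed_batch: list, send_message) -> int:
    """Split a Changefeed batch into one message per document and publish each.
    `send_message` stands in for the Service Bus sender."""
    for document in changefeed_batch:
        # One document = one canonical-format record that was upserted, so
        # each change travels through the bus as an individual message.
        send_message(json.dumps(document))
    return len(changefeed_batch)
```

Splitting the batch here means downstream Logic Apps stay plain message consumers and never need to know the data originated from a large dataset.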

For subscribers that require all data at a regular interval, we do not couple them with the incoming data stream. Instead, we use Azure Cosmos DB as data source by creating an Azure Data Factory pipeline that is scheduled to retrieve all data from Azure Cosmos DB. After transforming the data stream to the required format, the data is sent to the target system.

Full data consumers

This way, any subscriber can retrieve all data at any time, completely independent of the frequency at which the source system publishes the data.
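The scheduled full export can be sketched like this (all names hypothetical: `read_all` stands in for the Cosmos DB query, `transform` for the dataflow mapping, and `send` for delivery to the target system):

```python
def full_export(read_all, transform, send) -> int:
    """Scheduled full export: read every canonical record, transform it to
    the target format, and send it to the target system."""
    count = 0
    for record in read_all():
        send(transform(record))
        count += 1
    return count
```

Because the export reads from Azure Cosmos DB rather than from the incoming data stream, its schedule is entirely decoupled from the source system's publishing rhythm.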

Summary

Maximizing decoupling of systems is an important goal when setting up integrations. This applies not only to Logic Apps but to other Microsoft integration technologies as well. In this blog, I've tried to show you that Azure Data Factory becomes even more powerful if we apply the same principles and ideas we use in messaging.
