GDPR is around the corner which mandates that every company serving European customers need to comply with a lot of additional rules such as being transparent in what data is being stored, how it is being processed, and more.
The most interesting ones are actually that you need to be able to request what data they are storing about you and make it available to you, all of it. A great example is how Google allows you to select the data you want to have and give it to you, try it here.
Being inspired by this, I decided to build a similar flow running on Azure and show how easy it is to achieve this.
Consolidating user data with Azure Serverless
In this sample, I'm using a fictitious company that is called Themis Inc. which provides a web application where users can signup, create a profile and does awesome things. That application is powered by a big data set of survey information which is being processed to analyze and see if the company can deliver targeted ads for specific users.
Unfortunately, this means that the company is storing Personal Identifiable Information (PII) for the user profile and the survey results for that user. Both of these datasets need to be consolidated and provided as a download to the user.
For the sake of this sample, we are actually using the StackExchange data set and the web app simply allows me to request all my stored information.
This is a perfect fit for Azure Serverless where we will combine Azure Data Factory_, the unsung serverless hero,_ with Azure Logic Apps, Azure Event Grid and Azure Data Lake Analytics.
How it all fits together
If we look at the consolidation process, it actually consists of three steps:
- Triggering the data consolidation and send an email to the customer that we are working on it
- Consolidating, compressing and making the data available for download
- Sending an email to the customer with a link to the data
Here is an overview of all the pieces fit together:
Azure Logic Apps is a great way to orchestrate steps that are part of your application. Because of this, we are using a Logic App that is in charge of handling new data consolidation requests that were requested by customers in the web app. It will trigger the Data Factory pipeline that is in charge of preparing all the data. After that, it will get basic profile information about the user by calling the Users API and send out an email that the process has started.
The core of this flow is being managed by an Azure Data Factory pipeline which is great to orchestrate one or more data operations that represent a business process. In our case, it will get all the user information from our Azure SQL DB and get all data, related to that specific user, in our big data set that is stored on Azure Data Lake Store. Both data sets are being moved to a container in Azure Blob Storage and compressed after which a new Azure Event Grid event is being published with a link to the data.
To consolidate all the user information from our big data set we are using U-SQL because it allows me to write a very small script and submit this, while Azure Data Lake Analytics runs and looks through your data. This is where Data Lake Analytics shines because you don't need to be a big data expert to use it, it does all the heavy lifting for you by determining how it needs to execute it, scale it, and so on.
Last but not least, a second Logic App is subscribing to our custom Event Grid topic and sends out emails to customers with a link to their data.
By using Azure Event Grid topics, we remove the responsibility of the pipeline to know who should act on his outcome and trigger it. It also makes our current architecture flexible by providing extension points that can be used by other processes to integrate with it in the process in case we need to make the process more complex. It also removes the responsibility from the pipeline to know who should act on his outcome.
This is not the end
Users can now download their stored data, great! But there is more...
Use an API Gateway
The URLs that are currently exposed by our Logic Apps & Data Factory pipelines are generated by Azure and are tightly coupled to those resources.
As the cloud is constantly changing, this can become a problem when you decide to use another service or somebody simply deletes and you need to recreate it where it will have a new URL. Azure API Management is a great service for this where it will basically shield away from the backend process from the consumer and act as an API gateway. This means that if your backend changes; you don't need to update all you consumers, simply update the gateway instead.
Azure Data Factory pipelines can be triggered via HTTP calls but this has to be done via a REST API - Great! The downside is that it is secured via Azure AD which brings some overhead in certain scenarios. Using Azure API Management, you can shield this from your consumers by using an API key and leave the AD authentication up to the API gateway.
GDPR mandates that every platform needs to give a user the capability to delete all the data for a specific user on request. To achieve this a similar approach can be used or even refactor the current process so that they re-use certain components such as the Logic Apps.
Azure Serverless is a very great way to focus on what we need to achieve and not worry about the underlying infrastructure. Another big benefit is that we only need to pay for what we are using. Given this flow will be used very sporadically this is perfect because we don't want to set up an infrastructure which needs to be maintained and hosted if it will only be used once a month.
Azure Event Grid makes it easy to decouple our processes during this flow and provide more extension points where there is a need for this.
Personally, I am a fan of Azure Data Factory because it makes me as a developer so easy to automate data processes and comes with the complete package - Code & visual editor, built-in monitoring, etc.
Last but not least, this is a wonderful example of how you can combine both Azure Logic Apps & Azure Data Factory to build automated workflows. While at first, they can seem as competitors, they are actually a perfect match - One focusses on the application orchestration while the other one does the data orchestration. You can read more about this here.
Want to see this in action? Attend my "Next Generation of Data Integration with Azure Data Factory" talk at Intelligent Cloud Conference on 29th of May.
In a later post, we will go more into detail on how we can use these components to build this automated flow. Curious to see the details already? Everything will be available on GitHub.
Thanks for reading,