
Codit Blog

Posted on Tuesday, February 6, 2018 2:05 PM

by Glenn Colpaert

Exposing data with different content types in a performant way can be a challenging problem to solve. Azure Search tackles this problem by providing a full-text search experience for web and mobile applications.

When talking about modern application development, where enterprise applications exist both on-premises and in the cloud, companies want to integrate beyond their firewall, typically with SaaS-based applications or APIs exposed by third parties.

Next to integrating with different services or applications, many companies also want to expose data in a simple, fast and secure way through API-first development. Performance and a great experience are key to the success of your APIs. This is where Azure Search comes in...

Scenario

For a data-sharing platform that we are currently building, we have a large number of files stored on Blob Storage. These files are ingested into the platform through different endpoints and protocols (HTTPS, FTP, AS2,...). When files are ingested, different types of metadata are extracted from the file content and added to the file before it is stored in Blob Storage. These metadata values are then made available for querying through APIs.

Our first implementation of the API queried Blob Storage directly to search for the specific files that match the metadata filters provided in the API calls. We started noticing the limits of this implementation because of the large number of blobs inside our storage container. This is due to the limited query capabilities of Blob Storage: we needed to list the blobs and then do the filtering inside the implementation, based on the metadata.

To optimize our searches and performance we quickly introduced Azure Search into our implementation. To be clear, Azure Search is not in charge of executing queries across all the blobs in Azure Blob Storage; rather, it indexes all the blobs in the storage account and puts a search layer on top of them.

Azure Search

Azure Search is a search-as-a-service cloud solution that gives developers APIs and tools for adding a rich search experience over your content in web, mobile, and enterprise applications. All of this without managing infrastructure or the need to become a search expert.

To use Azure Search there are a couple of steps you need to take. First of all you need to provision the Azure Search service; this will be the scope of your capacity, billing and authentication, and it is fully managed through the Azure Portal or through the Management API.

When your Azure Search service is provisioned, you need to define one or more indexes that are linked to your search service. An index is a searchable collection of documents and contains multiple fields that you can query on.
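As an illustration, this is roughly what defining such an index could look like with the Python azure-search-documents SDK. The index and field names below are made up for this example; in our platform they map to the metadata values we attach to each file.

```python
# pip install azure-search-documents
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchFieldDataType,
)

endpoint = "https://<your-search-service>.search.windows.net"
credential = AzureKeyCredential("<admin-api-key>")

# Illustrative fields: a key plus a few metadata values to search and filter on.
index = SearchIndex(
    name="files-index",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="fileName", sortable=True),
        SimpleField(name="customerId", type=SearchFieldDataType.String, filterable=True),
        SimpleField(name="documentType", type=SearchFieldDataType.String, filterable=True, facetable=True),
        SimpleField(name="ingestedOn", type=SearchFieldDataType.DateTimeOffset, filterable=True, sortable=True),
    ],
)

SearchIndexClient(endpoint, credential).create_index(index)
```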

Once your index is created, you can define a schedule for the indexer that populates it: once every hour, several times an hour, and so on.
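A sketch of what wiring up a blob data source and a scheduled indexer could look like, again with the Python SDK. The data source, indexer and container names, and the hourly interval, are all illustrative.

```python
# pip install azure-search-documents
from datetime import timedelta
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexer, SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection, IndexingSchedule,
)

endpoint = "https://<your-search-service>.search.windows.net"
credential = AzureKeyCredential("<admin-api-key>")
client = SearchIndexerClient(endpoint, credential)

# Data source pointing at the blob container that holds the ingested files.
data_source = SearchIndexerDataSourceConnection(
    name="files-blob-datasource",
    type="azureblob",
    connection_string="<storage-connection-string>",
    container=SearchIndexerDataContainer(name="files"),
)
client.create_data_source_connection(data_source)

# Indexer that fills the index and runs on a schedule (here: once per hour).
indexer = SearchIndexer(
    name="files-indexer",
    data_source_name="files-blob-datasource",
    target_index_name="files-index",
    schedule=IndexingSchedule(interval=timedelta(hours=1)),
)
client.create_indexer(indexer)
```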

When everything is configured and up and running, you can use the Azure Search service to query your indexed data for results.

Fitting it all together

Below you can find a simplified example of our initial implementation. You can see that we were directly querying the blob storage and needed to fetch the attributes for each individual blob and match them with the search criteria.
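A minimal sketch of that approach, assuming the Python azure-storage-blob SDK; the container name and the customerId metadata key are illustrative.

```python
# pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("files")

def find_blobs_by_metadata(key, value):
    """Walk every blob in the container and keep the ones whose metadata matches."""
    matches = []
    for blob in container.list_blobs(include=["metadata"]):
        metadata = blob.metadata or {}
        if metadata.get(key) == value:
            matches.append(blob.name)
    return matches

# Every call enumerates the full container before it can filter - this is
# what stopped scaling once the number of blobs grew.
results = find_blobs_by_metadata("customerId", "12345")
```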

 

This is how our current high-level implementation looks. We are using the Azure Search engine to provide both queries and filters to find what we need immediately.
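A simplified sketch of such a query with the Python azure-search-documents SDK; the index, field names and filter values are illustrative and match the example index above.

```python
# pip install azure-search-documents
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="files-index",
    credential=AzureKeyCredential("<query-api-key>"),
)

# Combine a full-text query with an OData filter on the indexed metadata.
results = search_client.search(
    search_text="invoice",
    filter="customerId eq '12345' and documentType eq 'order'",
    select=["id", "fileName", "ingestedOn"],
    order_by=["ingestedOn desc"],
)

for result in results:
    print(result["fileName"])
```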

Conclusion

Azure Search immediately gave us the performance we needed and was fairly easy to set up and use.

We struggled a bit finding our way around some of the limitations of the basic Azure Search query options, but quickly came to the conclusion that the Lucene query syntax provides the rich query capabilities we needed to search the metadata.
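For example, switching the query type to "full" enables the Lucene syntax, which adds wildcards, fuzzy matching and field-scoped queries on top of the simple syntax. Continuing the illustrative search_client and field names from above:

```python
# A wildcard and a fuzzy term, scoped to the searchable fileName field.
results = search_client.search(
    search_text="fileName:INV-2018* OR fileName:invoices~1",
    query_type="full",
)
```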

 

Cheers, Glenn

Categories: Azure, Architecture
written by: Glenn Colpaert

Posted on Friday, May 18, 2018 1:59 PM

by Tom Kerkhove

GDPR mandates that you make data available to users on their request. In this post I show you how you can use Azure Serverless to achieve this with very little effort.

GDPR is around the corner; it mandates that every company serving European customers needs to comply with a lot of additional rules, such as being transparent about what data is being stored, how it is being processed, and more.

One of the most interesting requirements is that you need to be able to request what data a company is storing about you and have it made available to you, all of it. A great example is how Google allows you to select the data you want and gives it to you; try it here.

Being inspired by this, I decided to build a similar flow running on Azure and show how easy it is to achieve this.

Consolidating user data with Azure Serverless

In this sample, I'm using a fictitious company called Themis Inc. which provides a web application where users can sign up, create a profile and do awesome things. That application is powered by a big data set of survey information, which is processed to analyze whether the company can deliver targeted ads to specific users.

Unfortunately, this means that the company is storing Personally Identifiable Information (PII) for the user profile and the survey results for that user. Both of these datasets need to be consolidated and provided as a download to the user.

For the sake of this sample, we are actually using the StackExchange data set and the web app simply allows me to request all my stored information.

This is a perfect fit for Azure Serverless, where we will combine Azure Data Factory, the unsung serverless hero, with Azure Logic Apps, Azure Event Grid and Azure Data Lake Analytics.

How it all fits together

If we look at the consolidation process, it actually consists of three steps:

  1. Triggering the data consolidation and sending an email to the customer that we are working on it
  2. Consolidating, compressing and making the data available for download
  3. Sending an email to the customer with a link to the data

Here is an overview of how all the pieces fit together:

Azure Logic Apps is a great way to orchestrate steps that are part of your application. Because of this, we are using a Logic App that is in charge of handling the new data consolidation requests made by customers in the web app. It will trigger the Data Factory pipeline that is in charge of preparing all the data. After that, it will get basic profile information about the user by calling the Users API and send out an email that the process has started.

The core of this flow is managed by an Azure Data Factory pipeline, which is great for orchestrating one or more data operations that represent a business process. In our case, it will get all the user information from our Azure SQL DB and get all the data related to that specific user from our big data set stored on Azure Data Lake Store. Both data sets are moved to a container in Azure Blob Storage and compressed, after which a new Azure Event Grid event is published with a link to the data.
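As an illustration, publishing such a custom event from Python could look like the sketch below; the topic endpoint, event type, customer id and download URL are made up for the example.

```python
# pip install azure-eventgrid
from azure.core.credentials import AzureKeyCredential
from azure.eventgrid import EventGridPublisherClient, EventGridEvent

client = EventGridPublisherClient(
    "<custom-topic-endpoint>",  # e.g. https://<topic>.<region>-1.eventgrid.azure.net/api/events
    AzureKeyCredential("<topic-access-key>"),
)

# Custom event announcing that the consolidated archive is ready for download.
event = EventGridEvent(
    subject="customers/12345/data-export",
    event_type="Themis.DataConsolidation.Completed",
    data={
        "customerId": "12345",
        "downloadUrl": "https://<storage-account>.blob.core.windows.net/exports/12345.zip",
    },
    data_version="1.0",
)

client.send(event)
```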

To consolidate all the user information from our big data set we are using U-SQL, because it allows me to write a very small script and submit it, while Azure Data Lake Analytics runs it against the data. This is where Data Lake Analytics shines: you don't need to be a big data expert to use it, as it does all the heavy lifting for you by determining how the job needs to be executed, scaled, and so on.

Last but not least, a second Logic App is subscribing to our custom Event Grid topic and sends out emails to customers with a link to their data.

By using Azure Event Grid topics, we remove the pipeline's responsibility to know who should act on its outcome and to trigger that next step. It also makes our current architecture flexible by providing extension points that other processes can use to integrate with the flow in case we need to make the process more complex.

This is not the end

Users can now download their stored data, great! But there is more...

Use an API Gateway

The URLs that are currently exposed by our Logic Apps & Data Factory pipelines are generated by Azure and are tightly coupled to those resources.

As the cloud is constantly changing, this can become a problem when you decide to use another service, or when somebody simply deletes a resource and you need to recreate it, at which point it will have a new URL. Azure API Management is a great service for this: it basically shields the backend process from the consumer and acts as an API gateway. This means that if your backend changes, you don't need to update all your consumers; you simply update the gateway instead.

Azure Data Factory pipelines can be triggered via HTTP calls through a REST API - great! The downside is that it is secured via Azure AD, which brings some overhead in certain scenarios. Using Azure API Management, you can shield this from your consumers by using an API key and leaving the AD authentication up to the API gateway.
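A rough sketch of the difference from the consumer's point of view, in Python. The subscription, resource group, factory, pipeline and gateway names are placeholders, the gateway route is illustrative, and the userId parameter is an assumption for this example.

```python
# pip install requests azure-identity
import requests
from azure.identity import DefaultAzureCredential

# Calling the Data Factory REST API directly: the caller first has to obtain
# an Azure AD token for the management endpoint.
token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
adf_url = (
    "https://management.azure.com/subscriptions/<subscription-id>"
    "/resourceGroups/<resource-group>/providers/Microsoft.DataFactory"
    "/factories/<factory-name>/pipelines/<pipeline-name>/createRun"
    "?api-version=2018-06-01"
)
requests.post(
    adf_url,
    headers={"Authorization": f"Bearer {token.token}"},
    json={"userId": "12345"},  # pipeline parameters (illustrative)
)

# Behind API Management the consumer only needs a subscription key; the
# gateway handles the AD authentication and forwards the call to the pipeline.
requests.post(
    "https://<your-apim-instance>.azure-api.net/gdpr/consolidate",  # illustrative gateway route
    headers={"Ocp-Apim-Subscription-Key": "<subscription-key>"},
    json={"userId": "12345"},
)
```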

User Deletion

GDPR mandates that every platform needs to give a user the capability to delete all the data for that specific user on request. To achieve this, a similar approach can be used, or the current process can even be refactored so that certain components, such as the Logic Apps, are reused.

Conclusion

Azure Serverless is a great way to focus on what we need to achieve and not worry about the underlying infrastructure. Another big benefit is that we only need to pay for what we use. Given that this flow will be used very sporadically, this is perfect: we don't want to set up infrastructure that needs to be maintained and hosted if it will only be used once a month.

Azure Event Grid makes it easy to decouple our processes in this flow and to provide more extension points where they are needed.

Personally, I am a fan of Azure Data Factory because it makes it so easy for me as a developer to automate data processes, and it comes with the complete package - code & visual editor, built-in monitoring, etc.

Last but not least, this is a wonderful example of how you can combine both Azure Logic Apps & Azure Data Factory to build automated workflows. While at first they can seem like competitors, they are actually a perfect match - one focuses on application orchestration while the other does data orchestration. You can read more about this here.

Want to see this in action? Attend my "Next Generation of Data Integration with Azure Data Factory" talk at Intelligent Cloud Conference on 29th of May.

In a later post, we will go more into detail on how we can use these components to build this automated flow. Curious to see the details already? Everything will be available on GitHub.

Thanks for reading,

Tom.