
Codit Blog

Posted on Wednesday, December 30, 2015 2:00 PM

Tom Kerkhove by Tom Kerkhove

In my previous blog post I introduced a blog series in which we will analyse StackExchange data by using Microsoft Azure Data Lake Store & Analytics.

Today I'll illustrate where we can download the StackExchange sample data & how we can upload and store it in the Data Lake Store by using PowerShell.

There are several options for data storage in Azure, each with a specific goal. For data analytics -especially with Azure Data Lake Analytics- Azure Data Lake Store is the de facto choice.

The StackExchange data is made available on Archive.org as zip-files. We will use an Azure VM to download it from the website, unzip every folder and upload it to the Store. Let us start!

Why do we use Azure Data Lake Store over Azure Blob Storage?

Before we start, you might ask why we are using Azure Data Lake Store instead of Azure Blob Storage.

The reason is very simple - We are planning to store a decent amount of data and perform analytics on it with Azure Data Lake Analytics.

While Azure Blob Storage can be used with Azure Data Lake Analytics, it is recommended to use Azure Data Lake Store instead. The service is built for running analytical workloads on top of it and is designed to scale along with its load.

Azure Data Lake Store also offers unlimited storage, without limits at the file or account level, while this isn't the case for Azure Storage.

However - storing all your data in Azure Blob Storage will be a lot cheaper than storing it in Azure Data Lake, even when you are using Read-Access Geo-Redundant Storage (RA-GRS).

These are only some of the differences; the two services also differ in areas such as access control, encryption, etc.

To summarize - there is no silver bullet. It depends on your scenario and on how much data you want to store. My suggestion: if you'll be doing big data processing in Azure, use Azure Data Lake Store!

If for some reason you decide that the store you've picked doesn't fit your needs, you can still move your data with tools like Azure Data Factory or PowerShell.

Note - During the public preview Azure Data Lake Store is cheaper; keep in mind that preview pricing is 50% of the GA pricing.

Preparing our environment

For this phase we'll need to provision two resources: A new Azure Data Lake Store account & an Azure VM in the same region.

But do I really need an Azure VM?

Benefits of using an Azure VM

It is also possible to do everything locally, but I personally recommend using a VM because we can more easily let it run overnight and it will be faster.

It allows us to download a file of 28 GB inside the Azure datacenter, unzip 250+ folders overnight and upload 150 GB to the Store. This means that we only pay for 28 GB of ingress instead of 150 GB; however, keep in mind that you also pay for the VM itself.

You only benefit from this if the resources are allocated within the same region; otherwise Azure will charge you for 150 GB of egress & ingress.

Provisioning a new Data Lake Store

To provision a Data Lake Store resource, browse to the Azure portal and click on 'New > Data + Storage > Data Lake Store (Preview)'.

Give it a self-describing name, assign a resource group and location, and click 'Create'.

After a couple of minutes, the Store will be created and you should see something similar to this.

As you can see it includes monitoring on the total storage utilization, has a special ADL URI to point to your account and has a Data Explorer. The latter allows you to navigate and browse through your data that is stored in your account.

At the end of this article you should be able to navigate through all the contents of the data dump.

Provisioning & configuring a VM

Last but not least, we'll provision a new Azure VM in which we will download, unzip & upload all the data.

In the Azure Portal, click 'New > Compute' and select a Windows Server template of your choice. Here I'm using the 'Windows Server 2012 R2 Datacenter' template.

Assign a decent host name, user name & solid password and click 'Create'.

We will also add an additional data disk to the VM on which we will store the unzipped data as the default disk is too small.

To do so, navigate to the VM we've just provisioned and open the 'Settings' blade.

Select 'Disks', click on 'Attach New' and give it a decent name. 

We don't need to increase the default value as 1024 GB is more than enough.

Once the disk is added it will show up in the overview. Here you can see my stackexchange-data.vhd data disk.

Now that the disk is added we can connect to the machine and prepare it by formatting the disk and giving it a decent name.
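If you prefer to script the disk preparation instead of clicking through Disk Management, a minimal sketch with the standard Windows Server storage cmdlets could look like this (the volume label is just an example and it assumes the new disk is the only uninitialized one):

# Initialize the newly attached raw disk, create one partition and format it as NTFS
# (the volume label is an example; adjust to taste)
Get-Disk | Where-Object PartitionStyle -Eq 'RAW' |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "stackexchange-data" -Confirm:$false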

Now that we have a Data Lake Store account and a VM to handle the data we are ready to handle the data set.

Retrieving the StackExchange data

StackExchange has made some of its data available on Archive.org, allowing you to download insights about all of its websites.

The website provides several options for downloading everything, ranging from a torrent to individual zips to one large zip.

I personally downloaded everything in one zip and two additional files - Sites.xml & SitesList.xml.

As we can see I've stored all the information on the new data disk that we have added to the VM.

Extracting the data

Time to unzip the large files into individual zip files per website; to do so you can use tools such as 7-Zip.

Once it's done it should look similar to this.

Next up - unzipping all the individual websites. It is recommended to select all the zip-files and unzip them at once.
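If you'd rather script the extraction than click through 7-Zip, a rough sketch could look like this (the dump folder and the 7-Zip install path are assumptions; adjust the filter if your archives have a different extension):

# Assumed locations - change these to match your data disk and 7-Zip installation
$dumpFolder = "F:\2015-August-Stackexchange"
$sevenZip   = "C:\Program Files\7-Zip\7z.exe"

# Extract every per-site archive into a folder named after the archive
Get-ChildItem -Path $dumpFolder -Filter *.7z | ForEach-Object {
    $destination = Join-Path $dumpFolder $_.BaseName
    & $sevenZip x $_.FullName "-o$destination" -y
}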

Grab a couple of coffees because it will take a while.

You should end up with around 150 GB of data, excluding the zip-files.

So what kind of data do we have?!

Looking at the data

Now that we have unwrapped all the data we can have a look at what data is included in the data dump.

As mentioned before, the zip contains a folder for each StackExchange website, including all the meta-websites.
Each folder contains all the relevant data for that specific website, ranging from users & posts to comments, votes and beyond.

Here is the data that is included for coffee-stackexchange-com in this example:

+ coffee-stackexchange-com
    - Badges.xml
    - Comments.xml
    - PostHistory.xml
    - PostLinks.xml
    - Posts.xml
    - Tags.xml
    - Users.xml
    - Votes.xml

However, there is one exception - since StackOverflow is so popular, there is a lot more data and thus bigger files. Because of this, its dump is split into a dedicated folder per file.

Here is an overview of how the data is structured:

+ stackapps-com
    - Badges.xml
    - ...
    - Votes.xml
+ stackoverflow-com-badges
    - Badges.xml
+ stackoverflow-com-...
+ stackoverflow-com-votes
    - Votes.xml
+ startups-stackexchange-com
    - Badges.xml
    - ...
    - Votes.xml

With that structure in mind, let's have a look at how we can upload the data to Azure.

Uploading to Azure with PowerShell

To upload all the data, it would be good to automate the process; luckily Azure provides a lot of PowerShell cmdlets that allow you to do exactly that.

For our scenario I've created a script called ImportStackExchangeToAzureDataLakeStore.ps1 that loops over all the extracted folders & uploads their files to a new directory in Azure Data Lake Store.

Although it's a simple script I'll walk you through some of the interesting commands that are used in the script.

In order to interact with Azure Data Lake Store from within PowerShell we need to use the Azure Resource Manager (Rm) cmdlets.

To do so we first need to authenticate, select the subscription we want to use and register the Data Lake Store resource provider.

# Log in to your Azure account
Login-AzureRmAccount

# Select a subscription 
Set-AzureRmContext -SubscriptionId $SubscriptionId

# Register for Azure Data Lake Store
Register-AzureRmResourceProvider -ProviderNamespace "Microsoft.DataLakeStore" 

With the Test-AzureRmDataLakeStoreItem command we can check whether a specific path, i.e. a folder or file, already exists in the account.

$FolderExists = Test-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Path $DataLakeStoreRootLocation

If the specified path does not exist, we can create it in the store with the New-AzureRmDataLakeStoreItem command.

New-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Folder $DestinationFolder

In our scenario we combine these two commands to check if the folder per website, i.e. coffee-stackexchange-com, already exists. If this is not the case, we will create it before we start uploading the *.xml-files to it.

Uploading is just as easy: we call Import-AzureRmDataLakeStoreItem with the local path to the file and the destination where it should be saved in the store.

Import-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Path $FullFile -Destination $FullDestination
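To give you an idea of how these commands fit together, here is a simplified sketch of the core loop - not the exact contents of the script on GitHub, but the same idea (variable names mirror the ones used above):

# For every extracted website folder: make sure a matching folder exists in the
# Store, then upload all *.xml files into it
foreach ($siteFolder in Get-ChildItem -Path $DumpLocation -Directory) {
    $destinationFolder = "$DataLakeStoreRootLocation/$($siteFolder.Name)"

    if (-not (Test-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Path $destinationFolder)) {
        New-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Folder $destinationFolder | Out-Null
    }

    foreach ($file in Get-ChildItem -Path $siteFolder.FullName -Filter *.xml) {
        Import-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName `
            -Path $file.FullName `
            -Destination "$destinationFolder/$($file.Name)"
    }
}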

That's it, that's how easy it is to interact with Azure Data Lake Store from PowerShell!

To start it we simply call the function and pass in some metadata: which subscription we want to use, the name of the Data Lake Store account, where we want to upload the data to and where our extracted data is located.

C:\Demos > Import-StackExchangeToAzureDataLakeStore -DataLakeStoreAccountName 'codito' -DataLakeStoreRootLocation '/stackexchange-august-2015' -DumpLocation 'F:\2015-August-Stackexchange\' -SubscriptionId '<sub-id>'

While it's running you should see how it is going through all the folders and uploading the files to Azure Data Lake.

Once the script is done we can browse through all our data in the Azure portal by using the Data Explorer.

Alternatively you could also upload it to Azure Blob Storage with ImportStackExchangeToAzureBlobStorage.ps1.
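The Blob Storage variant works along the same lines. As a hedged illustration (not the exact contents of that script), uploading a single file with the classic Azure Storage cmdlets looks roughly like this; the storage account, key and container names are placeholders:

# Placeholders - use your own storage account, access key and container
$context = New-AzureStorageContext -StorageAccountName "<storage-account>" -StorageAccountKey "<storage-key>"

Set-AzureStorageBlobContent -File $FullFile `
    -Container "stackexchange-august-2015" `
    -Blob "coffee-stackexchange-com/Users.xml" `
    -Context $context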

Conclusion

We've seen how we can provision an Azure Data Lake Store and how we can use infrastructure in Azure to download, unzip and upload the StackExchange data to it. We've also had a look at how the dump is structured and what data it contains.

I've made my scripts available on GitHub so you can test them out yourself!
Don't forget to turn off your VM afterwards...

In the next blog post we will see how we can aggregate all the Users.xml data into one CSV file with Azure Data Lake Analytics by writing one U-SQL script. This will allow us to analyze the data later on, before we visualize it.

If you have any questions or suggestions, feel free to write a comment below.

Thanks for your time,

Tom.

Categories: Azure
Tags: Data Lake
written by: Tom Kerkhove

Posted on Wednesday, April 29, 2015 10:00 PM

Tom Kerkhove by Tom Kerkhove

In this brief blog post I will summarize the extended features for Azure SQL Database, walk through the new data offerings and give you some pointers for deeper insights.

Today Microsoft announced additional features for Azure SQL Database and two new big data services called Azure Data Lake & Azure SQL Data Warehouse at //BUILD/.

Extending the Azure SQL Database capabilities

Scott Guthrie announced new capabilities for Azure SQL Database, ranging from full-text search to elastic database pools to encrypting databases with Transparent Data Encryption (TDE), which presumably uses Azure Key Vault behind the scenes.

Learn more about the extended capabilities here:

  • Build 2015 : Elastic Database Tools (Video)
  • Microsoft Announces Elastic SQL Database Pools For Azure (Article)

SQL Data Warehouse

SQL Data Warehouse allows you to store petabytes of relational data in one place and integrates with Machine Learning & Power BI.

While Azure SQL Data Warehouse is released two years after AWS' Redshift, Scott compared both services and argued that SQL Data Warehouse is a leap ahead: it also offers the solution on-premises, is more flexible and comes with full SQL support.

Azure SQL Data Warehouse will become available as public preview in June.

Azure Data Lake

Last but not least is Azure Data Lake, currently in private preview, which allows you to store & manage infinite amounts of data and keep it in its original format. This allows you to store your valuable data for ages without losing important segments of it.

Data Lake will be a central store for performing low-latency analytics jobs with enterprise-grade security and integration with other services like Azure Stream Analytics. It is compatible with Hadoop's HDFS and Microsoft's HDInsight as well as open-source tooling like Spark & Storm.

Learn more about Azure SQL Data Warehouse & Azure Data Lake here:

  • Microsoft announces Azure SQL Data Warehouse and Azure Data Lake in preview (Article)

  • Microsoft BUILDs its cloud Big Data story (Article)

  • Introduction to the Data Lake-concept (Article)

Current IoT offering in Microsoft Azure

Let's wrap up the day with a nice overview of the current Azure offering in the IoT space.

Let's hope that tomorrow's keynote will unveil what the famous Azure IoT Suite includes and what tricks Microsoft has up its sleeve.
In the meantime, have a look at today's IoT breakout sessions:

  • Internet of Things overview (link)
  • Azure IoT Security (link)
  • Best practices for creating IoT solutions with Azure (link)

All images in this blog post are the property of The Verge & VentureBeat.

Thanks for reading,

Tom.

Categories: Azure
Tags: IoT
written by: Tom Kerkhove

Posted on Friday, November 20, 2015 2:35 PM

Tom Kerkhove by Tom Kerkhove

Recently I have been working on an open-source project around Azure Data Lake Analytics. It will contain custom U-SQL components that you can use in your own projects and already contains an XML Attribute Extractor.

In this brief article I'll discuss how you can use NuGet to perform automated builds for Azure Data Lake Analytics extensibility.

To build my code I'm using MyGet.org because it builds my code & automatically packages it into one or more NuGet packages.

The only thing I needed to do was sign-up and point MyGet to my repository. Every time I push my code to GitHub it will automatically be built & re-packaged.

Unfortunately, the build service encountered some issues when it was building my code:

2015-11-06 02:14:23 [Information] Start building project D:\temp\tmp9931\src\Codit.Analytics.sln...

C:\Program Files (x86)\MSBuild\14.0\bin\Microsoft.Common.CurrentVersion.targets(1819,5): warning MSB3245: Could not resolve this reference. Could not locate the assembly "Microsoft.Analytics.Interfaces".

Check to make sure the assembly exists on disk. If this reference is required by your code, you may get compilation errors.

Obviously the server didn't have the correct DLLs to build it, but first: how did I create my project?

With the Azure Data Lake Tools for Visual Studio you create a solution/project for U-SQL based on these templates:

For my custom extractor I have created a "Class Library (For U-SQL Application)" project with a corresponding Test-project. Once it's created it automatically references the DLLs you need.

The problem with these references is that they point to DLLs in the Visual Studio folder.

C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\PublicAssemblies\Microsoft.Analytics.Interfaces.dll

As the build server doesn't have the tooling installed, it can't find them. Unfortunately there is no NuGet package for this yet; however, this is something that is under review.

But don't worry! I've wrapped all the DLLs in two NuGet packages so we can build all our stuff until there is an official NuGet package:

  • Install-Package TomKerkhove.Analytics.Extensibility - Contains all the DLLs for writing your own extensibility (NuGet page)

  • Install-Package TomKerkhove.Analytics.Extensibility.Testing - Contains the unit test DLL and installs the above package. (NuGet page)

By installing these packages you are overriding the existing DLLs & your build server is good to go!

Thanks for reading,

Tom.

Categories: Azure
Tags: ALM, Data Lake
written by: Tom Kerkhove

Posted on Wednesday, December 30, 2015 2:00 PM

Tom Kerkhove by Tom Kerkhove

As of Wednesday the 28th of October, Azure Data Lake Store & Analytics are in public preview, allowing you to try them out yourself. You won't have to worry about any clusters, which lets us focus on our business logic!

To celebrate this, I'm writing a series that will take you through the process of storing the data in Data Lake Store, processing it with Data Lake Analytics and visualizing the gained knowledge in Power BI.

 

I will break-up the series into four major parts:

  1. Storing the data in Azure Data Lake Store or Azure Storage
  2. Aggregating the data with Azure Data Lake Analytics
  3. Analyzing the data with Azure Data Lake Analytics
  4. Visualizing the data with Power BI

During this series we will use open-source data from StackExchange.
This allows us to deal with real-world data and the difficulties that come with it.

In my next post I'll walk you through the steps to upload the data and how we can do this in a cost-efficient way.

Thanks for reading,

Tom Kerkhove.

Categories: Azure
written by: Tom Kerkhove

Posted on Thursday, October 22, 2015 11:30 AM

Tom Kerkhove by Tom Kerkhove

Recently Microsoft revealed their plans for Azure Data Lake and how it groups several big data services in Azure under one umbrella.

In this blog post I will walk you through what it offers and how it enables you to reuse your existing skills to analyse big data.

What is Azure Data Lake?

Azure Data Lake is a family of Azure services that enable you to analyse your big data workloads in a managed manner.

It consists of three services:

  • Azure Data Lake Store - A data repository that enables you to store any type of data in its raw format.
  • Azure Data Lake Analytics - An analytics service that enables you to run jobs on data sets without having to think about clusters.
  • Azure Data Lake HDInsight - An analytics service that enables you to analyse data sets on a managed cluster running open-source technologies such as Hadoop, Spark, Storm & HBase. You can run these clusters either on Windows or, since recently, also on Linux. This is an existing service that joined the family.

Here is an overview of the big data family.

© Microsoft Azure

Azure Data Lake is built on the learnings from Microsoft's internal big data service called Cosmos. Cosmos uses SCOPE as a query language and is built on top of Dryad, a Microsoft Research project.

Services such as Bing, Xbox Live, Office 365 & Skype are built on top of Cosmos, which runs more than a million jobs a day.

Let's dive a bit deeper into what Azure Data Lake Store & Analytics have to offer.

Data Lake Store - Thé storage for big data in Azure

Data Lake Store is a data repository that enables you to store all kinds of data in their raw format without defining schemas.

The store offers unlimited storage with immediate read/write access and scales its throughput to what your workloads need. It also offers low-latency small writes for big data sets.

Azure Data Lake Store is fully integrated with Azure Active Directory, allowing you to reuse your existing directory and leverage enterprise-grade security on top of your data.

This is the "Azure Data Lake" Microsoft originally announced at //BUILD/

These are the typical characteristics of a Data Lake, an existing concept in the data space. Let's have a look at what exactly a data lake is.

 

What is a Data Lake?

Martin Fowler wrote an interesting article explaining what a data lake is and how it's different from a data warehouse.

  • A data warehouse enables you to store data in a structured manner and is coupled to one or more schemas. Before the data is stored in the warehouse, it sometimes needs to be cleansed and transformed in the destination schema.
  • A data lake is a single repository that holds all your data in its raw format. It allows data scientists to analyse the data without unknowingly losing anything valuable.

 © Martin Fowler

However, both concepts have their drawbacks.

  • Without decent metadata on what's in your Data Lake it might turn into a "data swamp" before you know it.
  • A data warehouse is tied to its schemas, which means you're only storing data that is interesting at this moment, and might be discarding data that could be interesting later on...

I really recommend reading Martin's article on Data Lakes.

 

Re-using existing tools & technologies

The Store is WebHDFS-compatible, allowing you to use your existing tools to manage your data or analyse it from any processing technology.

 

© Microsoft Azure

 

How about Azure Storage?

Azure now has two data services for storing blobs - Azure Storage & Azure Data Lake Store. When do I use which one?

I think it's safe to say that Azure Data Lake Store is thé data store for all your big data in Azure. If you're planning on analysing it, store it in Data Lake Store because of its unlimited size, low read/write latency, etc.

So when should I use Azure Storage? Simple: in all other scenarios, as long as the data size stays within the Azure Storage limits.

From a pricing perspective I think we might speculate that Azure Data Lake Store will be more expensive than Azure Storage because of several features such as enterprise-grade security based on Azure AD.

However, it still depends on several other factors, such as what the data durability offering will be, whether there are different data tiers, etc.

Analysing your data with Data Lake Analytics

Azure Data Lake Analytics allows you to run analysis jobs on data without having to worry about clusters. Azure will run your jobs on clusters that are set up & managed by Microsoft. By using Apache YARN, Analytics is able to manage the resources for its processing engine as well as possible.

By using the U-SQL query language -which we will discuss next- you can process data from several data sources such as Azure Data Lake Store, Azure Blob Storage, Azure SQL Database but also from other data stores built on HDFS. Microsoft is also planning support for other query languages -such as Hive- but no timeframe was defined.

By assigning more units you can easily scale the processing of your job. This means that your job will be executed over multiple vertices that distribute the workload, for example by splitting files into segments.

With Azure Data Lake Analytics you pay only for the jobs you run, instead of paying for a cluster that sits idle. However, nothing has been announced on pricing yet.
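To make the "more units" idea concrete: with the AzureRM Data Lake Analytics cmdlets, submitting a U-SQL script with a chosen degree of parallelism looks roughly like this (account name, job name and script path are placeholders; check your module version for the exact parameter names):

# Placeholders - point this at your own Analytics account and U-SQL script
Submit-AzureRmDataLakeAnalyticsJob -Account "<analytics-account>" `
    -Name "Aggregate StackExchange users" `
    -ScriptPath "C:\Scripts\AggregateUsers.usql" `
    -DegreeOfParallelism 10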

Here is an example of what Azure Data Lake Analytics looks like in the portal.

© Matt Winkler, Microsoft

SQL + C# = U-SQL

U-SQL is a new query language designed by Microsoft that enables you to run your queries in a distributed manner. It is built on the learnings from T-SQL, ANSI SQL & Hive and combines a SQL syntax with C# extensibility. Whether you're working on small files or files bigger than 1 exabyte, U-SQL can query it!

By using an Extract-Transform-Output pattern you can process your data with the out-of-the-box extractors & outputters, or you can build your own in C#!

The language also supports other interesting features, such as using partitioned tables or using variables in folder paths; the latter allow you to add the variable to the output data set.
You can also run inline C# statements or call external assemblies from within your script.

Michael Rys explains how you can use U-SQL with some examples. Here is one of them.

Azure Data Lake Tools for Visual Studio

With these new services comes new tooling called the Azure Data Lake Tools for Visual Studio. These tools allow you to be more productive with U-SQL, Azure Data Lake Store & Analytics from within Visual Studio.

One of the features allows you to preview a file in the Store. You can then select the delimiter, qualifier & encoding in order to get a live preview. It's also possible to save that file locally or preview it in Excel.

© Michael Rys, Microsoft

Another feature allows you to trigger, visualize, debug & playback Analytics jobs. This allows you to see how much data is read/written, pin-point performance issues or just see the current status of your job.

© Michael Rys, Microsoft

One other interesting thing is that you can create & build your U-SQL scripts in Visual Studio. This allows you to verify your syntax before running your jobs and enables you to add them to your source control and collaborate on them.

In general, this tooling allows you to be productive with Azure Data Lake without the need to leave your (precious) IDE! I can personally say that these tools are really great and a nice addition to Visual Studio!

This video walks you through some of the features available.

Integrating Azure Data Lake in your data pipelines

Gaurav Malhotra announced that Azure Data Factory will offer a source & sink for Azure Data Lake Store.

This will allow you to move data to Azure Data Lake Store or export it to another data store in Azure or on-premises, e.g. Azure Blob Storage or an on-premises file system.

Unfortunately, there is no news on Azure Data Lake Analytics integration (yet). It sounds feasible to be able to trigger jobs from within your data pipelines.

Openness at Microsoft

We've all noticed that things at Microsoft have changed on all fronts since Satya took over, including the openness of the company.

In the past few years Microsoft has been contributing to several open-source technologies such as Docker and has open-sourced some of its own technologies through the .NET Foundation.

In the run-up to the announcements made around Azure Data Lake, Microsoft has been working closely with partners such as Hortonworks to make its platform compatible with existing open-source technologies. It has also contributed to several projects, ranging from YARN & Hive to WebHDFS and beyond.

T. K. “Ranga” Rengarajan announced that Microsoft has been working closely with partners to ensure you have the best applications available to work with Azure Data Lake. Some of these partners are BlueTalon & Dataguise for big data security & governance, and Datameer for end-to-end big data analytics.

Interested in more? Raghu walks you through some of the contributions that were made by Microsoft.

Azure Data Lake Opens A New World For Developers

The Azure Data Lake services offer data scientists & developers an easy-to-use platform for storing & analysing big data by using clusters (HDInsight) or jobs (Analytics). With the addition of HDInsight to the family, Azure Data Lake is your answer for Big Data in Azure.

People who have experience with big data can choose to run their jobs on a managed cluster -running Hadoop, Spark or another technology- or let Analytics run them in the future.

On the other hand, Analytics allows developers to reuse their existing skills. That's exactly what Codit did.

U-SQL was especially helpful because we were able to get up and running using our existing skills with .NET and SQL. This made big data easy because we did not have to learn a whole new paradigm.

With Azure Data Lake, we were able to process data coming in from smart meters and combine it with the energy spot market prices to give our customers the ability to optimize their energy consumption and potentially save hundreds of thousands of dollars. - Sam Vanhoutte, CTO at Codit

We have been part of the private preview for a while and I'm astonished by how powerful the platform is. When you're running big queries you can easily assign more processing units and go from 1+ hour of processing to literally 5 minutes.

While sensors are sending us their events, we store them all in Azure Data Lake in their raw format. With Azure Data Lake Analytics we are able to process them and explore what the data is telling us. If the data is no longer relevant, we can aggregate it and archive it for long-term storage or for deep learning.

I've been interested in big data for a while, but managing a cluster scared me and I followed from afar. With Azure Data Lake I'm now capable of doing it myself without having to worry about the clusters I'm running, because I don't have to!

Microsoft just made big data easy.

Azure Data Lake Store & Analytics will be in public preview later this year.

Thanks for reading,

Tom.

Thanks to Matt Winkler for reviewing.



Categories: Azure
written by: Tom Kerkhove