In this post, we will learn about the Azure Data Factory.
What is Azure Data Factory (ADF)?
Its an ETL tool based in the cloud. Everything that we do in azure data factory is executed using the azure compute resources. We will do the development in azure portal using the UI instead of using the visual studio. We have the drag and drop interactivity with the minimum code. This is unlike the databricks where we need to write the code. Data factory is similar to the SSIS on premises. We use the expression language here in the GUI tool where you write the code in the expression builder. We use Azure data factory for the data orchestration and data transformation. One of the nice things about Data factory is that it integrates nicely with both the hybrid environments, on premises and cloud environments together.
How to set up Azure Data Factory?
Its very simple. We provide a name which should be globally unique, subscription, resource group, version and location. We are going to learn to setup the ADF here in this blog post.
- Pipelines which is same as the package on on-premises environments. Pipeline is a collection of activities and actions that do a particular work (loading the file, cleaning and importing the data etc.)
- Data Flows are the data flow activities where you do the data transformation (cleanup, merge the data, adding columns in the code etc.)
- Mapping Data Flows is similar to the SSIS on on-premises. We have source, transformations that cleanup the data and transform the data and the destination where the data be loaded.
- Wrangling Data Flows are derived from the Power Query editor in Power BI. It is a UI way of cleaning data (for example removing values, unpivoting the data)
In order to connect to the source connections in ADF, we need to provide where the source connection information from like from where the data is coming from by providing the server name and the credentials just like we provide in the SSIS. We provide that information in ADF by using linked services which is similar to the SSIS connection managers on-premises. These linked services can be used among the pipelines in ADF.
Dataset is a connection to a specific table or a specific query that we are pulling the data from a table. Dataset can be a file, documents or a storage location with in the data lake account.
Within your pipeline of the ADF, we have many different activities which will define the type of action that we want to perform on our data. One example activity can be a stored procedure inserting/updating/deleting a data in a table, running a databricks notebook directly and executing it using data factory through the activity known as databricks activity, copy activity etc.
Now, lets go ahead and do this in Action!
Just by provisioning the Data factory doesn’t charge you anything. Only you get charged for the pipelines and the amount of resources the pipelines runs generally. Create a data factory from the create new resources tab and search for the data factory.
Once the window opens, provide the resource name and the location where your company located region, give the data factory a name and version V2 as this is the latest version available.
You can set the git repository for the source control in the next tab but right now lets not setup that right now, so check on the configure git later and then click on review and create
We do take the data in the data factory and publish in your code repository and then from the code repository, we build the pipeline release from the code repository and so we can publish to the azure data factory in the production.
Now, the data factory is created. We can now create the connections for our data sources, pipelines and data flows. Go to the resource, click on the resource we created. You can see the data factory got created.
This opens the administration page but if you wish to do the development, we need to open up the author & monitor page where we can build the connection managers, datasets and pipelines.
A window opens up like the above. You can see the side panel where you can see the data factory. This is where all of our pipelines and data flows are built. Then we get the built-in monitoring that comes out of the box and we have manage. Lets click on the manage button.
When you click on the manage, you see the connections where you can create the linked services, integration runtimes. Under the source control, you can create the git configuration, ARM template. Under author we see triggers which are used for schedule executions just like the SQL Agent on the on-premises servers.
Let’s go ahead and create the linked services. On the top, you can see new. Click on that and create a connection manager. We will create two connection managers here. One for the azure data lake account and one for the Azure SQL Database so we can interact with the data sources within them.
Lets first create the Azure data lake. Click on the Azure tab on the right side near the search. Click on the Azure Data lake storage to create a connection manager for Azure data lake.
Give the proper name to represent it as the azure data lake for easy understanding. Here I choose “AzureDataLake_dbanuggets_datafactory” where dbanuggets-datafactory being my data factory account. Give the authentication method as the Account key here but for the best practices we should not be using the Account key as it will impact using the Azure devops in your code repository because it will not allow to store the sensitive information but store the sensitive information like all the passwords and all in the key vault so we can reference the key vault name from here. For our testing, lets keep this as the Account key and for the account selection method, use the “From azure subscription” and choose the azure subscription as “Microsoft Azure sponsorship” and the storage account name is going to be my storage account that we have created on our first part of this series.
Click on the test connection at the bottom of the screen, once test connection is succeded, click on the create
The Azure data lake connection has been created
Now, lets go ahead and create the Azure SQL database connection. Create on New at the top on the linked servers page, choose Azure tab and scroll down to choose Azure SQL Database and then click on continue
Give the details and test the connection
But my connection failed here. Why is that?
When you click on more side to the error, you will be able to see the details of the failure
Azure data factory is a service in azure and it cannot connect to Azure server. We need to go to the Azure server firewall rules and check the properties over there. When you check the error, It failed because the IP address mentioned in the virtual machine is not our IP address. That’s the IP address of the virtual machine that’s running in the background that the azure data factory is using to try to connect to the server. We will go back to the all resources window and find the server that we created earlier.
Under the security, find the firewall rules
Go ahead and click on the firewall rules where you will be choosing “Allow Azure services and resources to access this server” which means any service in azure can now try to authenticate to this server. This might be a concern for lot of customers because if we are on Azure VM, if that option is on, then from any part of the world anyone can connect to by trying different usernames and passwords. Other than adding that IP address from that VM, for our testing sake, lets enable the “Allow Azure services and resources to access this server” temporarily so the data factory works with this. but as the IP address of the data factory that we saw in the error is dynamic and that will actually changes. So, once we turn on the azure services to yes here and then click on save at the top of the window.
Once again do the test connection and it is successful now
Click on create. Now, the Azure SQL Database connection is created
Lets go ahead with a simple example. Lets go ahead and pull the data from the dbanuggets table into the azure data lake. We will be taking the data from the dbanuggets table and copy that to the logical container for grouping those files together in our data lake. So, we need a dataset that is pointing to the dbanuggets table in azure data factory and another one to write the file into the azure data lake.
To do this go to the Author on the Azure data factory side bar and see the different factory resources.
Pipelines is the collection of the different activities that does the work. Data sets are the connections for the reusability.
Create a new databaset by clicking on the ellipsis symbol right after the datasets
Add new dataset
Select the Azure tab under the search bar and choose the Azure data lake storage Gen 2 and click on continue at the bottom of the window
In the formats, click the CSV Delimited text
Give the name of the container, linked server name we created before, the file path actually doesn’t exist so if we give a file path that does not exist, it actually creates one for us and give the file name across the file path section. Keep the first row as header. For the import schema, mention as none because that csv file doesn’t exist yet. We will create one. Then click on okay at the bottom of the page.
We have created our first resource in our data factory other than the connection manager and we can see that as a tab across the top. As you create the new resources they will be opening up as the new tabs on the top. Once we complete working on them we will have to click on the publish all and close them. Once this resource is created you can see many options that you can choose from. Select the filesystem account that we created earlier.
Parameters are really useful to help the dataset reusable. Instead of having the dataset dynamically for the bigproduct, we can make dynamic through the parameters so it can also be dynamically change the name of the file automatically by using the parameters and expressions so it can be reusable.
We need to create one more dataset that will be pointing to the bigproducts table from the SQL database. Click on the datasets ellipsis again and choose the Azure SQL Database and click continue
Give the name and connect to the Azure SQL database and to the table and for the import schema, the beauty of the Azure data factory is we do not have to know the schema up front which is something known as late arrive schema, they can actually change any time and data factory will handle that. This is not available in the SSIS packages which is completely schema bound. If one of your table schema changes for example, if you drop any column or modify the name of any column, all of your SSIS packages referencing the table will break. We can choose none to not worry about the schema and it becomes dynamic but lets choose to import the schema for our example.
After we created this dataset, we need to click on the publish all so we can save the data that has been created until now to the data factory in production. So, if in case you do not click on the publish and leave by closing the windows or if something happens, then all your data will be lost. So, lets go ahead and publish.
If you setup source control, azure devops setup than all this published code is sent to the repository and not publishing to the data factory. As we now created two datasets, lets go ahead and move the data from the database to the data lake.
We need to create a pipeline for that. Go to the pipeline tab on the left side of the pane>new pipeline>all the way on the right opens up a window>give the name
Give the name and click on the properties button to close
From the data transformations activity, drag the copy data which will only copy the data
Down, you can see all the activities. Give the name and go to the sink. Sink meaning destination. You can see the 1 on the top of the source and sink tabs is to tell us that there are required parameters that we need to set in order to complete the activity. In the general tab, give the name
In the source tab, give the source and for the destination as well. We can do the mapping manually or we can let it map itself which is the late arriving schema, what ever schema it may be, ADF will be able to map itself. This is such a beautiful feature.
We are getting the data from the table
In the sink (destination), provide the destination details
We have other details as well if you would like to change but for this example, lets not change those.
We do not want to change anything in the mapping, let the mapping takes place automatically. Now, go ahead and run the debug. This is like the testing mode in the azure data factory. We are running the most recent copy of that pipeline not running the one which we published for the azure data factory.
Note: If the debug failed for you just like me before fixing it, please make sure your source and destination connections were already setup correctly and published properly.
When you go to all resources tab on the left side pane and open the storage account and then to the storage explorer, you can see the newly created container products and when you click on that container, you will be able to see the csv file we created.
Publish the changes again
If you wanted to change the name of the file dynamically, what if we wanted to add the date along with the filename we already created. Open the file path location again for the csv we created and at the file name location, select the add dynamic content at the bottom of the name
A new window will be popped up to add the dynamic content
Go to the string functions and then to the concat operator
Once you click on the concat, we will automatically have the concat function pop up in the add dynamic content and then add the below content to add the date after the name. for now, just add the name you wish to have before the date
Then for the date, go to the date function and press the utcnow
After this you will automatically see the utcnow() been added to the content, just add the .csv to the end of the content. click finish.
Now, go ahead and debug the pipeline one more time to see how this works. Debug succeeded.
Go ahead and check the file has been created by clicking on refresh
A new file has been created along with the date. So cool, right?
But then we wanted only the date part, lets go ahead and do that:
Go ahead and get the substring function from the dataset window
Go ahead and insert the utcnow() in to the substring and only get the date beginning from the 0 to 10th character. remember, in SQL Server we will get the beginning position from 1 but in ADF, we will get from 0. Now, click finish.
now, run the debug again and see its succeded
Go to the storage explorer again and see the file again
In this post, we have learned about the Azure Data Factory. We have learned creating the linked services, datasets and pipelines. We tested the pipeline by using the debug button and created the files with different format by adding the dynamic content. In the coming post, lets learn about creating the data flows.
Thanks for reading!