Data flows are where we do the bulk of our transformations and data cleansing. In this session, let's go ahead and create a data flow and run it.
Open ADF, go to the Data flows pane on the left, and create a new data flow. Enable Data flow debug.
We will take the example of creating a data flow that merges two tables: BigProducts and TransactionHistory. First we need to bring in our data sources. We already have a dataset for the BigProducts table, but we also need one for the TransactionHistory table. Create that dataset just like the other datasets we created earlier in this part of Azure fundamentals.
Now, go back to the data flow tab and add two sources, one for BigProducts and one for TransactionHistory. Just click the Add Source placeholder for each.
To add additional transformations, click the plus (+) symbol next to a stream. Click it now and choose the Join transformation.
In the Join settings, provide the left stream, the right stream, the kind of join you need, and the join condition, as shown below:
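Conceptually, the Join transformation is a standard relational join between the two streams. As a rough plain-Python sketch (not the ADF engine itself; the column names ProductID, Name, and Quantity are assumptions for illustration):

```python
# Hypothetical rows standing in for the two sources; the real
# column names in your datasets may differ.
bigproducts = [
    {"ProductID": 1, "Name": "Bike"},
    {"ProductID": 2, "Name": "Helmet"},
    {"ProductID": 3, "Name": "Gloves"},
]
transactionhistory = [
    {"ProductID": 1, "Quantity": 2},
    {"ProductID": 3, "Quantity": 5},
]

# Inner join on ProductID, mirroring the left stream, right stream,
# and join condition configured in the transformation.
joined = [
    {**p, **t}
    for p in bigproducts
    for t in transactionhistory
    if p["ProductID"] == t["ProductID"]
]
```

An inner join keeps only rows with a match on both sides, which is why a left or full outer join in ADF can introduce the nulls we deal with later.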
Now, let's go ahead and remove the columns we do not want. Again click the plus icon on the right side of the Join transformation, choose Select, delete the unwanted columns, and rename the remaining columns to make them compatible with the destination schema.
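The Select transformation is essentially a projection plus a rename. A minimal sketch in plain Python, assuming hypothetical column names (ProductID, Name, rowguid, Quantity) and an example rename of Name to ProductName:

```python
# One row coming out of the join (hypothetical columns).
row = {"ProductID": 1, "Name": "Bike", "rowguid": "abc", "Quantity": 2}

# Mapping of columns to keep -> their names in the destination,
# like the column list you configure in the Select transformation.
keep_and_rename = {
    "ProductID": "ProductID",
    "Name": "ProductName",
    "Quantity": "Quantity",
}

# Drop everything not in the mapping and apply the new names.
selected = {new: row[old] for old, new in keep_and_rename.items()}
```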
Now, let's use a Derived Column transformation in ADF. With the derived column, we want to check the Name column we brought in from the TransactionHistory join; if there is no match, we need a value that we define instead of null. Again press the plus symbol after the Select transformation and choose Derived Column under Schema modifiers.
For the incoming stream, choose the Select transformation because that's the last step, and choose the column you want to replace when there are nulls. In the Expression field, choose Open expression builder, like below.
In the Expression elements pane, choose Functions and search for the iifNull function.
Drag the function into the editor pane like above. In the expression iifNull(), go ahead and edit it to add the Name column.
With this expression, we are saying: if Name is not null, bring the name as is; if Name is null, replace the null with NA. Click Save and finish.
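The behavior of iifNull() can be sketched in plain Python like this (a rough equivalent for intuition, not the ADF implementation; the "NA" fallback is the value we chose above):

```python
def iif_null(value, fallback="NA"):
    """Rough equivalent of ADF's iifNull(): return the value as-is
    when it is not null, otherwise the replacement we define."""
    return value if value is not None else fallback

# Names coming out of the join; unmatched rows show up as None.
names = ["Bike", None, "Gloves"]
cleaned = [iif_null(n) for n in names]
```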
We want to load the result into a file in the data lake. Behind the scenes, the data flow runs on Azure Databricks. Let's set up the destination. For the destination dataset, let's just clone one of the datasets we already have and edit it.
This gives us a copy of the BigProducts dataset. Change the dataset name, and remove the file name from the file path so that the file is created dynamically.
Go back to the data flow now and add a Sink.
In the sink properties, choose the dataset we just created.
In the Settings tab, set the output to a single file.
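The single-file setting collapses what would otherwise be partitioned output into one file. As a loose illustration in plain Python, here is the whole result set written as one CSV (the column names and rows are assumptions carried over from the earlier steps; an in-memory buffer stands in for the data lake file):

```python
import csv
import io

# Transformed rows ready for the sink (hypothetical data).
rows = [
    {"ProductID": 1, "ProductName": "Bike", "Quantity": 2},
    {"ProductID": 3, "ProductName": "Gloves", "Quantity": 5},
]

# Write everything into a single CSV "file", analogous to the
# sink's output-to-single-file setting.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["ProductID", "ProductName", "Quantity"])
writer.writeheader()
writer.writerows(rows)
output = buf.getvalue()
```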
Now the data flow is complete. We cannot run a data flow directly; it has to run from a pipeline. Create a new pipeline from the Pipelines tab on the left side.
Drag the Data flow activity onto the pipeline canvas.
Run Debug from the top. The debug run succeeded.
Microsoft spins up a Databricks cluster in the background, runs the code against that cluster, and once the run completes and we get the results, the cluster shuts down. In order to debug and test this in Data Factory, we need to turn on the Data flow debug option. By enabling this option, we are manually creating the Databricks cluster that runs in the background for testing purposes.
Check that we have output under the storage account > products container > producttransactionhistory file.
In this post, we learned about data flows, created a sample data flow, and used transformations to shape the data as we need. In the coming post, we will be learning about Azure Synapse.
Thanks for reading!