Data flows are where we do the bulk of transformations and data cleansing. In this session, lets go ahead and create Data flow and run it.
Open the ADF, go to the data flows pane on the left and create a new data flow. Enable the data flow debug.
We will take the example of creating a data flow by merging two tables, bigproducts and transactionhistory tables. Here, we need to bring in our data source. We already have one for our bigproduct table but we need to also have for TransactionHistory table. Let’s create another dataset for TransactionHistory table. Create the dataset just like the other datasets we created in this part of Azure fundamentals.
Now, go back to the data flow tab and add two sources. Bigproduct and transactionhistory sources. Just click on the add source space.
Now, lets add another data source
To add additional transforms, we need to add by clicking on the addition button.
Click on the plus symbol to add a join condition.
In the join condition, provide the left table and the right table, kind of join you needed and the join condition like shown below:
Now, lets go ahead and remove all the columns we do not wanted. Again click on the plus icon on the right side of the join condition
choose select and get rid of all the columns we do not want and then rename the columns to make it compatible with the destination
Delete the columns you do not want after we choose the select
Now, lets use the derived column in ADF. With the derived column, we wanted to check the name that we bought from the transactionhistory table, if there is no match we need to get a value that we define instead of null. Again press the plus symbol after the select and choose the derived column under the schema modifiers.
In the incoming stream, choose select because that’s the last step and choose the column that you want to replace when there are any nulls, in the expression choose the open expression builder like below.
In the expression elements, choose functions and search for iif condition
You need to drag the function into the pane like above. In the expression iifnull(), go ahead and edit it to add name
In the expression, we are mentioning If the name is not null, bring the name as is. If the name is null, replace the null with NA. Click save and finish
We want to load this into the file in the data lake. In the data flow, in the background it is using the Azure databricks. Let’s give the destination. For the destination let’s just clone one of the datasets and edit.
Here it gives me the same copy of the bigproducts file. Change the name of the file. Remove the filename and it dynamically creates the file
Go to the data flow now, Create sink
Now, in the properties, choose the dataset as the dataset we newly created
In the settings tab of the window, choose the output to the single file
Now, the data flow is completed. We can run the data flow from the pipeline and we directly cannot run the data flow itself. Create a new pipeline from the pipelines tab on the left side
Drag the data flow activity
Run the debug on the top. Debug succeeded.
Microsoft will spin up a databricks cluster in the background, it will runs the code against the databricks cluster, once it runs and we get the results the databricks cluster will shutdown. In order to debug and test this in data factory,we need to turn on the dataflow debug option. By enabling this option, we are manually creating the databbricks cluster for testing purposes that’s running in the background.
See and check if we have output under the storage account>products container>producttransactionhistory file.
In this post, we have learned about Data flows and created a sample of data flow and used transformations to transform the data as we need. In the coming post, we will be learning about the Azure synapse.
In this post, we will learn about the Azure Data Factory.
What is Azure Data Factory (ADF)?
Its an ETL tool based in the cloud. Everything that we do in azure data factory is executed using the azure compute resources. We will do the development in azure portal using the UI instead of using the visual studio. We have the drag and drop interactivity with the minimum code. This is unlike the databricks where we need to write the code. Data factory is similar to the SSIS on premises. We use the expression language here in the GUI tool where you write the code in the expression builder. We use Azure data factory for the data orchestration and data transformation. One of the nice things about Data factory is that it integrates nicely with both the hybrid environments, on premises and cloud environments together.
How to set up Azure Data Factory?
Its very simple. We provide a name which should be globally unique, subscription, resource group, version and location. We are going to learn to setup the ADF here in this blog post.
Pipelines which is same as the package on on-premises environments. Pipeline is a collection of activities and actions that do a particular work (loading the file, cleaning and importing the data etc.)
Data Flows are the data flow activities where you do the data transformation (cleanup, merge the data, adding columns in the code etc.)
Mapping Data Flows is similar to the SSIS on on-premises. We have source, transformations that cleanup the data and transform the data and the destination where the data be loaded.
Wrangling Data Flows are derived from the Power Query editor in Power BI. It is a UI way of cleaning data (for example removing values, unpivoting the data)
In order to connect to the source connections in ADF, we need to provide where the source connection information from like from where the data is coming from by providing the server name and the credentials just like we provide in the SSIS. We provide that information in ADF by using linked services which is similar to the SSIS connection managers on-premises. These linked services can be used among the pipelines in ADF.
Dataset is a connection to a specific table or a specific query that we are pulling the data from a table. Dataset can be a file, documents or a storage location with in the data lake account.
Within your pipeline of the ADF, we have many different activities which will define the type of action that we want to perform on our data. One example activity can be a stored procedure inserting/updating/deleting a data in a table, running a databricks notebook directly and executing it using data factory through the activity known as databricks activity, copy activity etc.
Now, lets go ahead and do this in Action!
Just by provisioning the Data factory doesn’t charge you anything. Only you get charged for the pipelines and the amount of resources the pipelines runs generally. Create a data factory from the create new resources tab and search for the data factory.
Once the window opens, provide the resource name and the location where your company located region, give the data factory a name and version V2 as this is the latest version available.
You can set the git repository for the source control in the next tab but right now lets not setup that right now, so check on the configure git later and then click on review and create
We do take the data in the data factory and publish in your code repository and then from the code repository, we build the pipeline release from the code repository and so we can publish to the azure data factory in the production.
Now, the data factory is created. We can now create the connections for our data sources, pipelines and data flows. Go to the resource, click on the resource we created. You can see the data factory got created.
This opens the administration page but if you wish to do the development, we need to open up the author & monitor page where we can build the connection managers, datasets and pipelines.
A window opens up like the above. You can see the side panel where you can see the data factory. This is where all of our pipelines and data flows are built. Then we get the built-in monitoring that comes out of the box and we have manage. Lets click on the manage button.
When you click on the manage, you see the connections where you can create the linked services, integration runtimes. Under the source control, you can create the git configuration, ARM template. Under author we see triggers which are used for schedule executions just like the SQL Agent on the on-premises servers.
Let’s go ahead and create the linked services. On the top, you can see new. Click on that and create a connection manager. We will create two connection managers here. One for the azure data lake account and one for the Azure SQL Database so we can interact with the data sources within them.
Lets first create the Azure data lake. Click on the Azure tab on the right side near the search. Click on the Azure Data lake storage to create a connection manager for Azure data lake.
Give the proper name to represent it as the azure data lake for easy understanding. Here I choose “AzureDataLake_dbanuggets_datafactory” where dbanuggets-datafactory being my data factory account. Give the authentication method as the Account key here but for the best practices we should not be using the Account key as it will impact using the Azure devops in your code repository because it will not allow to store the sensitive information but store the sensitive information like all the passwords and all in the key vault so we can reference the key vault name from here. For our testing, lets keep this as the Account key and for the account selection method, use the “From azure subscription” and choose the azure subscription as “Microsoft Azure sponsorship” and the storage account name is going to be my storage account that we have created on our first part of this series.
Click on the test connection at the bottom of the screen, once test connection is succeded, click on the create
The Azure data lake connection has been created
Now, lets go ahead and create the Azure SQL database connection. Create on New at the top on the linked servers page, choose Azure tab and scroll down to choose Azure SQL Database and then click on continue
Give the details and test the connection
But my connection failed here. Why is that?
When you click on more side to the error, you will be able to see the details of the failure
Azure data factory is a service in azure and it cannot connect to Azure server. We need to go to the Azure server firewall rules and check the properties over there. When you check the error, It failed because the IP address mentioned in the virtual machine is not our IP address. That’s the IP address of the virtual machine that’s running in the background that the azure data factory is using to try to connect to the server. We will go back to the all resources window and find the server that we created earlier.
Under the security, find the firewall rules
Go ahead and click on the firewall rules where you will be choosing “Allow Azure services and resources to access this server” which means any service in azure can now try to authenticate to this server. This might be a concern for lot of customers because if we are on Azure VM, if that option is on, then from any part of the world anyone can connect to by trying different usernames and passwords. Other than adding that IP address from that VM, for our testing sake, lets enable the “Allow Azure services and resources to access this server” temporarily so the data factory works with this. but as the IP address of the data factory that we saw in the error is dynamic and that will actually changes. So, once we turn on the azure services to yes here and then click on save at the top of the window.
Once again do the test connection and it is successful now
Click on create. Now, the Azure SQL Database connection is created
Lets go ahead with a simple example. Lets go ahead and pull the data from the dbanuggets table into the azure data lake. We will be taking the data from the dbanuggets table and copy that to the logical container for grouping those files together in our data lake. So, we need a dataset that is pointing to the dbanuggets table in azure data factory and another one to write the file into the azure data lake.
To do this go to the Author on the Azure data factory side bar and see the different factory resources.
Pipelines is the collection of the different activities that does the work. Data sets are the connections for the reusability.
Create a new databaset by clicking on the ellipsis symbol right after the datasets
Add new dataset
Select the Azure tab under the search bar and choose the Azure data lake storage Gen 2 and click on continue at the bottom of the window
In the formats, click the CSV Delimited text
Give the name of the container, linked server name we created before, the file path actually doesn’t exist so if we give a file path that does not exist, it actually creates one for us and give the file name across the file path section. Keep the first row as header. For the import schema, mention as none because that csv file doesn’t exist yet. We will create one. Then click on okay at the bottom of the page.
We have created our first resource in our data factory other than the connection manager and we can see that as a tab across the top. As you create the new resources they will be opening up as the new tabs on the top. Once we complete working on them we will have to click on the publish all and close them. Once this resource is created you can see many options that you can choose from. Select the filesystem account that we created earlier.
Parameters are really useful to help the dataset reusable. Instead of having the dataset dynamically for the bigproduct, we can make dynamic through the parameters so it can also be dynamically change the name of the file automatically by using the parameters and expressions so it can be reusable.
We need to create one more dataset that will be pointing to the bigproducts table from the SQL database. Click on the datasets ellipsis again and choose the Azure SQL Database and click continue
Give the name and connect to the Azure SQL database and to the table and for the import schema, the beauty of the Azure data factory is we do not have to know the schema up front which is something known as late arrive schema, they can actually change any time and data factory will handle that. This is not available in the SSIS packages which is completely schema bound. If one of your table schema changes for example, if you drop any column or modify the name of any column, all of your SSIS packages referencing the table will break. We can choose none to not worry about the schema and it becomes dynamic but lets choose to import the schema for our example.
After we created this dataset, we need to click on the publish all so we can save the data that has been created until now to the data factory in production. So, if in case you do not click on the publish and leave by closing the windows or if something happens, then all your data will be lost. So, lets go ahead and publish.
If you setup source control, azure devops setup than all this published code is sent to the repository and not publishing to the data factory. As we now created two datasets, lets go ahead and move the data from the database to the data lake.
We need to create a pipeline for that. Go to the pipeline tab on the left side of the pane>new pipeline>all the way on the right opens up a window>give the name
Give the name and click on the properties button to close
From the data transformations activity, drag the copy data which will only copy the data
Down, you can see all the activities. Give the name and go to the sink. Sink meaning destination. You can see the 1 on the top of the source and sink tabs is to tell us that there are required parameters that we need to set in order to complete the activity. In the general tab, give the name
In the source tab, give the source and for the destination as well. We can do the mapping manually or we can let it map itself which is the late arriving schema, what ever schema it may be, ADF will be able to map itself. This is such a beautiful feature.
We are getting the data from the table
In the sink (destination), provide the destination details
We have other details as well if you would like to change but for this example, lets not change those.
We do not want to change anything in the mapping, let the mapping takes place automatically. Now, go ahead and run the debug. This is like the testing mode in the azure data factory. We are running the most recent copy of that pipeline not running the one which we published for the azure data factory.
Note: If the debug failed for you just like me before fixing it, please make sure your source and destination connections were already setup correctly and published properly.
When you go to all resources tab on the left side pane and open the storage account and then to the storage explorer, you can see the newly created container products and when you click on that container, you will be able to see the csv file we created.
Publish the changes again
If you wanted to change the name of the file dynamically, what if we wanted to add the date along with the filename we already created. Open the file path location again for the csv we created and at the file name location, select the add dynamic content at the bottom of the name
A new window will be popped up to add the dynamic content
Go to the string functions and then to the concat operator
Once you click on the concat, we will automatically have the concat function pop up in the add dynamic content and then add the below content to add the date after the name. for now, just add the name you wish to have before the date
Then for the date, go to the date function and press the utcnow
After this you will automatically see the utcnow() been added to the content, just add the .csv to the end of the content. click finish.
Now, go ahead and debug the pipeline one more time to see how this works. Debug succeeded.
Go ahead and check the file has been created by clicking on refresh
A new file has been created along with the date. So cool, right?
But then we wanted only the date part, lets go ahead and do that:
Go ahead and get the substring function from the dataset window
Go ahead and insert the utcnow() in to the substring and only get the date beginning from the 0 to 10th character. remember, in SQL Server we will get the beginning position from 1 but in ADF, we will get from 0. Now, click finish.
now, run the debug again and see its succeded
Go to the storage explorer again and see the file again
In this post, we have learned about the Azure Data Factory. We have learned creating the linked services, datasets and pipelines. We tested the pipeline by using the debug button and created the files with different format by adding the dynamic content. In the coming post, lets learn about creating the data flows.
Azure SQL Database is a managed version of SQL that is in Azure. Microsoft is managing the database, backend, infrastructure, backups. If until now you have been managing the services on-premises and you would like to move to the Azure but still manage some of the stuff yourself, then you may be looking for the Infrastructure as a Service where you will still be responsible for patching your operating systems, updating your software. So, you would like to go to the SQL Server on a VM which is very cost effective and you will be getting maximum close to the feature functionalities of on-premises. If you want all the feature functionalities just like your on-premises then you can go with the SQL Server on a VM.
We have Azure SQL DB where you can create this DB in less than few minutes and it is very less expensive. We can have a Azure SQL DB hosted on a Microsoft managed data center and we can do our processing on the database like we can read, write and modify the data with less price. We don’t have to pay for the server or create any servers for the hosting the database. Everything will be taken care by the Microsoft.
Azure SQL Database has some limitations. When you run the database as Azure SQL Database, we might not get all the functionalities we get for the database on on-premises server. We have another option here as Platform as a Service where everything will be managed for you which is what known as SQL Managed Instance. Here we have a SQL Database that is managed by the VM but you get all the server level functionalities which you lose when you have the Azure SQL DB. When you wanted to use big data analytics or machine learning or data science and if you are looking for powerful resources to accommodate these, the option for this is SQL pools (formerly known as Azure SQL Data warehouse). Azure SQL Data warehouse is a large parallel processing service.
When we create the Azure SQL data warehouse in Azure, it take just few minutes to set up and start running. When you compare that to DWH on-premises Server build, it takes a lot of time like at least couple of weeks and cost lot of money to procure all the hardware that you need to make it happen. This has completely changed with Azure as this is a very quick process to setup an DWH environment in a matter of minutes with very low cost.
Azure SQL Database:
Azure SQL Database is a Platform as a Service meaning that Microsoft will be managing the hardware, infrastructure, backups, DR failover, data redundancy, patching operating system so our responsibility is to take care of the database, tables and objects in it, writing, reading and modifying the data in the database. It is a relational database as a service. Microsoft guarantees that database will be available 99.99% of the times and they will provide the credits if in case the database goes down for a length of time, Microsoft will financially back that agreement by crediting the money back. You only pay for what you need here.
Here, we get three built in replicas of the database for high availability. We also get the point in time restore capabilities up to 35 days with the default setting available. For the Azure SQL database, we can choose the performance level and the price we are going to pay. We can use the SQL Server management studio tools and the SQL environment is similar and very familiar with Azure SQL Database. We can also replicate the Azure database to many geographical locations. Just like we discussed in the previous post about the storage accounts, if your entire data center goes down where your Azure SQL Database is located, then we can immediately failover to the other geographical location so start using the database there.
Couple of more built-in features about Azure SQL Database. When you delete your database in Azure, these databases are automatically backed up. We automatically have Transparent data encryption, Always encrypted and we have row level security. We also get the Azure defender for SQL which monitors the activity of logins and users connecting to the database where we can setup to send us the alerts if any malicious activity happens. We also get automatic database tuning and also Azure Active Directory Integration.
Azure SQL Database Limitations:
We cannot backup the database from our end. Microsoft will take care of the backups.
There is no windows authentication. All we do is through Active directory.
There is no Change Data Capture, SQL Agent, Database Mail, PolyBase, Service broker, MDS/DQS, BI services (SSIS, SSRS, SSAS), Limited VNET integration, can be only the transactional and snapshot subscriber only.
When we say there is no SQL Agent doesn’t mean that we cannot create Jobs. We can use Azure data factory instead of SSIS to move the data and to clean the data, instead of using the SQL agent, we will be using the triggers inside our Azure data factory that kicks off those events to either trigger on the schedule or when event occurs.
PaaS for the Databases is really beneficial because we do have more time for optimizing the queries, focusing on the performance tuning on applications, monitoring the applications use by the users and scaling the applications well so companies can save money on Azure as well.
We can connect to the Azure Database using SQL Server management studio just like the on-premises database. We can also use the tools visual studio and PowerShell even when we are connecting the resources to Azure.
Let’s just go ahead and practice this by building the Virtual Server, meaning we are not purchasing any physical hardware or physical servers for creating this virtual server. Microsoft virtually create these logical servers and this logical Server will be just saved as the logical container that we created in our previous post. This logical server will save all the metadata information about the databases, elastic pools, any data warehouses that are attached to the server.
To create the logical Server, go to the “create a resource” pane on the left side of the panel and search for SQL Server (logical server) in the search bar which will give the service.
Click on the create and it brings the GUI creation screen for that resource.
Now, go ahead and give the resource Group name which is “dbanuggets” and in the server details tab, give the server name which should be unique across all of the Azure. If someone anywhere in the world already used the server name you are providing, then you will not be able to create the Server. This is because we use the server name when we connect to the databases on the server. Give the server name as “dbanuggetsserver” and Give the location details.
In the administrative accounts, provide the SQL Server account admin login and the password that will allow us to authenticate to the server and let us see the databases within that server. Give the username and password and press create.
Go ahead and press create by pressing the create button at the bottom of the screen again.
It took a minute to create and received the notification that the deployment is complete.
If you go to your resource group>select resource group dbanuggets that we created in previous post> and check to see the server has been created and listed under the resource group.
As our server is ready now, let’s go ahead and create the database now.
Go to the create a resource pane on the left side of the panel and the resource we wanted to create now is SQL Database. so go ahead and search that
Click Create on SQL Database
You can go ahead and provide the Resource group, location and name of the database
When you scroll down, you see two important options. One is “Do you want to use an elastic pool”
Azure SQL Databases are the individual databases that have their own resources allocated to them. This is very different from on-premises environments. On the on-premises environments all the databases are created on same provisioned server and they share all the available resources among them. In Azure, each database has its own resources but to improve the performance and to save the cost, it is important to create a pool of resources and attach all the databases to those resources.
If you have databases that have traffic during certain periods of the day or you have some of the databases that are not often used but when they are used, they need full performance. Those are the databases which will essentially be good candidates for creating a pool but not all the databases comes into this category. If you would like to get the similar kind of resources where the databases share the resources like in the on-prem environments, then enabling the elastic pool can be helpful.
Compute+Storage meaning what amount compute like the disk space, memory, CPU, disk space or disk speed for reading and writing and what amount of storage we want to setup and configure for this database. We can do that by either clicking the “configure database” option. When you click on it, it will take us to the new tab and it shows two kinds of different pricing structures: One is vCore which is the new model and the other is DTU which is a legacy model.
From the below tab opened, you can see the price options increase from left to right starting with basic, standard and premium.
For our examples, we will be using the DTU model because it is easier and less priced entry. The vCore model is easier to compare to on-premise environments. For example, in on-premises we have these many cores so we will go with same number of vCores. It is easy with the vCore model to set it up. DTU is being supported but as a backward computability and as a legacy model.
We will choose basic 5DTU’s. What is this 5DTU? It is some amount of compute, CPU, memory and speed on your hard drive of how fast we can write and read our data. If we go from % DTU’s to 10 DTU’s meaning we are doubling all of these resources. For 5DTU’s, we get only 2GB of storage and that will cost us 5$ a month which is very less. Click apply.
review and then click create.
If in future you would like to go to the higher service tier for your Azure SQL database, then it is completely possible. You can easily scale up and scale down as well. You can also change from the DTU model to the vCore model so we can scale in between them easily with out any issues.
If in case you need to scale up your resources later to some thousands of DTU’s and after couple of months, you do not need those many resources, then you can easily scale down in Azure SQL Database. That is pay as you go. Review and create button at the bottom of the screen.
Your deployment is now complete
Open the link of the database, Under the SQL database, below screen shows up as a blade and you can click on the copy to clipboard to copy the server name like below
Copy the server name to the clipboard and open the SQL Server Management Studio and paste the servername that you copied from the azure portal. Use SQL authentication and give the login ID and password you created when creating the database on azure.
Once you press connect, a screen pops up and it doesn’t allow me to connect because what we have not done on the server is that we did not give access to ourselves from the IP address
This is our first level of step to take in protecting your data. Because I created an admin account, I can directly press sign in and it will add in my current IP address. We can go to the azure portal and in the server page on the left side of the panel click on Firewalls and Virtual networks.
Click on Add Client IP
Give the rule name
Click Save at the top of the same window
Now, try to connect to the server by connecting to the azure servername, id and password. You should be able to connect now.
This looks so similar to the database on premises environment and now we are connecting to the azure database from the local management studio.
I created a sample table in the database by using the below script and added few rows
After creating and inserting this data. Lets do select * from dbo.bigproduct and see the results.
You can create tables, insert data, read and write the data in to the database right from the management studio. This is just the same as on-premises database. There is a little bit of latency insert the data as we are paying just couple of dollars on a database per month in this free service Azure account.
In this session, we leaned about the Azure SQL Database. We started by creating a virtual server, created a database on that server on azure, we added the admin account by adding the SQL authentication account, provided the permissions on to the client IPaddress and connected to the database using SQL Server management studio. We created a table and inserted the data in to the Azure SQL Database using management studio on local machine.
In the coming blog, we are going to learn about Azure Data Factory fundamentals. Stay tuned..
Microsoft Azure is a cloud computing platform and infrastructure for building, testing, deploying and managing your applications and services. Since many years, we have servers, databases on-premises where companies are responsible for buying, installing, managing the hardware and software. Azure was introduced in year 2008 and since then we have seen huge transition in the cloud computing as it provides services in managing the hardware, infrastructure, securing the data including the backups and many other services.
This is a new blog series about Azure Fundamentals, if you would you like to get the notifications as soon as I post a new article to this series, subscribe here at the WANT TO BE NOTIFIED Section at the right side of the screen.
There are twenty two service categories available in Azure and some hundreds of services to choose from. If you are new to Azure, you can sign up for a free Azure account. Microsoft offers a free 200$ credit for 30 days so you can go ahead and freely explore their services like you can test and deploy enterprise apps for free, try data analytics with your data and try many more services. Here is the link from Microsoft you can go to and create a free account:
Azure Data Services
For on-premises, we are responsible to procure the hardware, disk space, CPU and all the other hardware in place and to maintain them. We are responsible for the networking, patching, virtualization, operating system, middleware, apps and data. What if virtualization, servers, storage and networking are managed by the Microsoft and we can mainly focus on the application development and we can manage our data? That’s where the IaaS (Infrastructure as a Service) comes in. In this service, we can only take care of the operating system, patching, install the software, apps and data.
PaaS (Platform as a service) is a service where Microsoft is going to take care of lot more services so you can solely focus on the application development and data development. Microsoft manages all the services managed by IaaS but also operating system, middle ware and runtime. If your database needed to be restored in any case, backups will be provided to restore at any time.
SaaS (Software as a service) is a service like office 365, Dynamics 365 and they do everything for you as you insert, write and read your data from those platforms.
Data Services in Azure:
These services are the Infrastructure as a service and Platform as a service solutions. In the below diagram, we can see on the left side all the resources we can connect to which can connect from the on-premises/cloud/structured/unstructured data types. We do have SSIS but it doesn’t actually have lot of cloud ETL capabilities as it cannot connect to all new data sources. As we move to Azure, there is Azure data factory which provides the graphical user interface where you can drag and drop and perform different activities. For extracting and transforming data, we also have another tool Azure Databricks which gives us lot of flexibility where you can write R, Python and Spark SQL but for using these requires coding. Professionals who are data scientists and data engineers use the databricks.
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics.
Once we apply the transformations, we load that data into the Azure data Lake storage or any other storage areas like Azure storage, Azure SQL DB, SQL pool or store the data in the non relational database systems like Cosmos DB. Once we store the data, we can report the data using Power BI.
We will take all the resources that we create in azure into a logical container. Resource groups are a great way to take all of the resources we create for a project, department or for any other purposes and store them into a resource group. This is same as the directory location where we put all the resources as a resource group. We can update, deploy or delete them as a group. All the resources that we create will be under a resource group. All the metadata information that we have under the resources will be stored in resource group like what type of resource we are creating, what resources were allocated to it and the names of them.
There are always ways to automate this but for this example, for simplicity purposes we are going to do this through GUI.
Open your Azure portal, go to the navigation pane and click on the “Create a resource”
This will bring us to the Market place where we can see the 22 different categories and some hundreds of different services that are available. Find the “Resource Groups” by typing it in the search bar
Select the Resource Group and click create
This will bring us to the resource creation page. We can give the resource group a name, choose the location where it is going to be stored. For latency sake, for productivity and performance, please choose the right region details. We do have other options out there but to get started with azure, this should be good. When you check the other Tags out there, these are for the information like to which project is this resource group associated with, who is the owner and who purchased it.
It is better to keep all of our resources in same region so we can reduce the cost. If we move the data from one region to other, there is actually an extra charge for that for moving data. When you are creating your databases, storage accounts, data factory to move data, it is important to know to keep the data in same data center.
Then at the bottom of the screen, click review and create.
It should give you the validation passed and press the create at the bottom of the screen. Once you do that, your resource group has been created and you will get the notification at the top right corner that the resource group has been created.
You can either go to the resource group from the notification section or you can also go from the navigation pane by clicking on the “Resource groups”
When you click on the resource group “dbanuggets”, you see a vertical pane opened up across your screen which is known as Blade in Azure. When you click on any tabs in it, you will see vertical tabs opening up. We are currently looking at the resource group we created “dbanuggets” and under that name you see it as a Resource group. We see all administrative properties under that and we can setup properties like security like access control, policies, resource costs, deployment history that we can control. So, we have created a logical container, resource group.
If you want to close this resource group pane, you can simply press the cross button at the top right corner of the pane to go back to the previous blade.
Azure storage is the general purpose account v2 and the Azure data lake store Gen2. This is the cheapest option for the storage that we have available. If you are creating a storage account on azure, usually Azure data lake gen 2 will be good enough unless you need the features that are only available in the general purpose accounts that are not existed in the Azure data lake store.
Azure Data lake uses the Active directory. Blob storage account uses the shared access key. If someone can get the name of the storage account and get to the access key, that might be dangerous situation. Access key should be regenerated every couple of weeks or so in a schedule to keep the key secure. With the Azure data lake Gen2 accounts, we can take the advantage of login Id of the Azure active directory. There are some regulatory and compliance that we get for general purpose that we do not get with data lake. One of them is WORM storage which means Write Once Read Many. Once the data has been written, it cannot be modified. Azure blob storage offers that capability of doing WORM storage and Azure data lake doesn’t. Azure blob storage have a capability of soft storage where if you accidentally deleted anything, we can restore that. Azure data lake doesn’t offer that.
Azure data lake is designed to store massive amounts of data for big data analytics. It provides unlimited scalability, Hadoop compatibility, optimized azure blob file system (ADFS) designed for big data analytics, Zone-redundant and Geo-redundant storage.
Lets create Azure Data Lake:
Go to the navigate pane and search for the “Storage account”. To create general purpose storage account or Azure data lake is through the same storage account service. You will see the storage account service, click create.
You will see a create storage account pane, and the top you see many tabs like data protection, advanced, tags. You can create the storage account first and then later configure these.
Select the resource group, We will be selecting the one that we created with name “dbanuggets” and give a storage account name. There are some requirements here that we have to remember while creating the storage account name. It should be all lower case letters and it should be globally unique across all of azure because when we create the storage account, we can connect to it through the name.
Give the name of the storage account in all lower case letters. Here I am creating the storage account with name “dbanuggetsstorage” and choose the location of the storage to match the resource group. Here I am choosing the East US. Then choose the kind of account you are going to create. Remember, this is not where we actually change this to general purpose v2 or azure data lake gen2.
Then choose the Replication tab and select the type of redundancy you need. The cheapest option and the one that is built in is the “Locally-redundant storage(LRS)” meaning if you are working with data lake or the storage account, Microsoft makes sure we have like three copies of that data at any given time. This is an amazing thing as we now have three copies available and do not have to worry about what will happen if one data rack goes down. We will still have other two racks readily available. This option is the redundancy with in the data center so if the data center itself goes down, then we need to check for different options available in the azure like zone redundant storage (ZRS), geographical redundancy storage(GRS) and many others. There is a cost associated with the options you choose from here. Here for this example, we are going to choose the cheapest option “Local-redundant storage(LRS)”.
When you check the other tabs like networking, here we can control the access through IP addresses. The next tab is data protection. Here we can do like soft deletes for blobs. If someone will delete the file accidentally, we can recover those files. You can do versioning for the blobs and also turn on the point in time restore for container.
The next tab is advanced tab where we can choose where we actually want the storage account to be. You can see this in the advanced tab> Data Lake Storage Gen2> diabled or enabled. This is the single place where you click to enable or disable the Data Lake storage which will signify if your storage is going to be Data lake and this needs to be done at the time of the creating the resource. We cannot go and modify this later. Once we choose the storage type here, we cannot change it later. Let’s go ahead and choose the Data lake to be enabled.
Once you enable this option and go to the Data protection tab, lot of options that were there under the data protection are disabled we lose the point in time restore, the ability to do the soft deletes, the versioning for the blobs and we also lose the WORM storage (meaning once we write the data, we cannot modify it). These are the limitations that we get with the Azure Data Lake, so if you need these options we have to go with the blob storage account. Remember that we are not restricted with only one storage account for all our projects. We can have for some projects blob storage and for others Azure Data Lake storage. We can have some data that requires the regulatory compliance like the WORM storage and you do not want any one to accidentally delete the files and you need the soft delete functionality, in that case for that specific project we can choose the blob storage account. We can always have and use the different services for our projects in Azure.
On the Tags tab, we can give the name and the value to that tag name so once we go to the billing information, this resource is now tagged and we can see our costs clearly. It is a good practice to tag all of our resources. Click review and create at the bottom of the page.
You see the validation passed on the top of the screen. You can review the resources that you wanted to create and click create at the bottom of the screen.
I am seeing the deployment in progress once I press the create button. The storage account that you created will not cost us anything now. You get billed for storage account is when you write the data to the storage account and when you read data from it. If we created the storage account but not created anything in it, we do not get billed for that. If we created and using very little like creating only few files then we will only be minimally be charged. We can monitor the costs down the lane in the Azure.
You will be seeing the notification button at the top right corner of the screen notifying the deployment is complete.
To see how the storage account that we created look like, go to the navigation pane on the left side of the screen, choose the resource group> select the resource group we created (in my example, dbanuggets). Here is our storage account “dbanuggetsstorage”.
Once you click on that storage account, a new blade will be opened in the vertical pane and you will be able to see all associated details of that storage account.
When you scroll down, you will see the “containers”. Click on that and here when you want to upload any files or you need to remove any files from azure storage account, we need to create a container for that first.
Click on the containers and create a container by giving a name.
Give the name like here I provided the name of the container as “dbawork”. Remember, we are creating a directory here and we need to give the name all small letters. I am keeping my public access Private here in the example. Then, click create.
You will see the notification that the container has been created successfully.
We can now upload files and download files. Usually, we all do this through automating. We will be doing this through azure data factory. Now, lets go ahead and manually upload a file in our example and see how easy it is to import data into Azure. Click on the container we just created.
Click on the upload button at the top
Select the folder to upload the file
Upload the file and press upload
File has been uploaded. This file is now in azure and stored in azure data center. Any file that we upload to storage account, we have to upload the files to a container. These containers are just like the directory locations where we have folders having the files on our on-premises environments. We can make sure we have security setup at the container level.
You can directly click on that file that is now in the container and click edit to directly edit the content in the file.
A new blade pops up opening the file and you can edit it
For easy uploading and managing your files to the storage, there is this beautiful application you can use “Azure storage Explorer”
Azure storage Explorer is an amazing tool where you can connect to your azure account and see all the different resources you have in azure. It is both for the general purpose and azure data lake Gen2. We can upload files, folders, change permissions. You can upload, download and you can also modify the security on these containers easily through this application. If you do not have it already, download it here.
Once you download the Azure storage explorer, signin with the azure account you created. I signed in with the azure account I created and I can see that in the account section in the application that you can see at the left side of the panel.
Under the explorer panel, expand the Azure subscription tab to see all the resources you have like storage account and the containers I have along with the fileshares.
You do not have to login to the azure portal, you can upload, download from here itself. We usually automate the upload processes through azure data factory or through the business applications.
In this post, we have learned about Azure Data Services- Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service and we have seen the difference in between these services. We have leaned about the Azure Storage-Blob Storage, Data lake storage Gen2. We have learned with examples so you can get started with Azure.
In the next post, we will learn the fundamentals of Azure SQL DB.