With the invitation from Andy Leonard for the May edition of T-SQL Tuesday, here are my thoughts on how to respond to technology changes.
As the saying goes, “When there is no change, there is no growth.”
This is true in every aspect of life, including technology. As technology changes, it makes our lives simpler and easier than before. With continuous advancements, many professionals find themselves rushing to learn new technologies. These advancements add new features to existing technologies, improve performance, and solve problems, making our lives easier. The only question is: how quickly can we cope with this change?
The only answer, I believe, is to train employees on the changes in technology. It is really important for companies not only to bring in the latest technologies for use but also to focus on training employees to use them efficiently.
Apart from formal training, many employees love learning new technologies and always look for ways to do so. For almost every new technology we have now, a free training platform may already be available so individuals can learn it and use it with ease in their work. Take Microsoft Azure, for example: Microsoft offers free training through Microsoft Learn, along with a free Azure account so professionals can get hands-on experience as they learn. This platform has helped many professionals across the world, including me, to advance our careers. Huge thanks to Microsoft for taking this initiative.
To keep up with evolving technologies, I follow several resources along with the technical documentation itself. There are blogs dedicated to specific technologies, communities helping each other learn, and several technical newsletters and publications to keep us updated. I also follow articles and posts on LinkedIn and Twitter to stay as current as possible with upcoming technologies.
Along with these, there are several YouTube channels covering the concepts of new technologies as they evolve, and training companies that provide courses from beginner to advanced level along with hands-on labs.
Staying aware of what is going on is not a tough job, but investing the time and interest to learn these technologies and use them effectively in our daily work is what really matters.
At the beginning, everything seems confusing. The number of technologies evolving at the same time is overwhelming for sure, but if we take one technology at a time and focus on how it can actually simplify our daily work, we will embrace new technologies for the innovations they offer.
Finally, it is very important to be ready for more and more advancements and to anticipate change. As technology evolves, it becomes much easier to use. Keeping ourselves updated is the only way to keep growing!
In this post, we will learn about Azure Data Factory.
What is Azure Data Factory (ADF)?
It's a cloud-based ETL tool. Everything we do in Azure Data Factory is executed using Azure compute resources. Development happens in the Azure portal using a UI rather than Visual Studio, with drag-and-drop interactivity and minimal code. This is unlike Databricks, where we need to write code. Data Factory is similar to SSIS on-premises. Where code is needed, we use the expression language, written in the GUI's expression builder. We use Azure Data Factory for data orchestration and data transformation. One of the nice things about Data Factory is that it integrates nicely with hybrid environments, bridging on-premises and cloud together.
How to set up Azure Data Factory?
It's very simple. We provide a globally unique name, a subscription, a resource group, a version, and a location. We will walk through the setup in this blog post. First, some key concepts:
Pipelines are the equivalent of packages in on-premises environments. A pipeline is a collection of activities and actions that perform a particular piece of work (loading a file, cleaning and importing data, etc.).
Data Flows are activities where you do data transformation (cleanup, merging data, adding columns, etc.).
Mapping Data Flows are similar to SSIS data flows on-premises. We have a source, transformations that clean and transform the data, and a destination where the data is loaded.
Wrangling Data Flows are derived from the Power Query editor in Power BI. They provide a UI-driven way of cleaning data (for example, removing values or unpivoting data).
To connect to sources in ADF, we need to provide the connection information, such as where the data is coming from, the server name, and the credentials, just like we do in SSIS. We provide that information in ADF using linked services, which are similar to SSIS connection managers on-premises. These linked services can be shared among the pipelines in ADF.
A dataset is a connection to a specific table, or a specific query pulling data from a table. A dataset can also be a file, a document, or a storage location within the data lake account.
Within an ADF pipeline, we have many different activities that define the type of action we want to perform on our data. Examples include a stored procedure activity that inserts, updates, or deletes data in a table; a Databricks notebook activity that runs a notebook directly from Data Factory; and the copy activity.
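For reference, behind the designer ADF stores each pipeline as a JSON document. A simplified, illustrative sketch of a pipeline with a single copy activity might look roughly like this (the pipeline, activity, and dataset names are hypothetical, and real definitions carry more properties):

```json
{
  "name": "CopySqlToLakePipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyBigProductToLake",
        "type": "Copy",
        "inputs":  [ { "referenceName": "AzureSqlTableDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "DataLakeCsvDataset", "type": "DatasetReference" } ]
      }
    ]
  }
}
```

You rarely need to hand-edit this JSON; the drag-and-drop designer generates it, but it is what gets committed when you connect ADF to a Git repository.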
Now, let's go ahead and see this in action!
Provisioning a data factory by itself doesn't cost you anything; you are generally charged only for pipeline runs and the resources they consume. Create a data factory from the "Create a resource" tab by searching for "Data Factory".
Once the window opens, provide the resource group, the region where your company is located, a name for the data factory, and version V2, as this is the latest version available.
You can set up a Git repository for source control in the next tab, but let's not set that up right now; check "Configure Git later" and then click "Review + create".
Normally, the work we do in the data factory is published into a code repository; from that repository we build a pipeline release, and then publish to the production Azure Data Factory.
Now the data factory is created, and we can create the connections for our data sources, pipelines, and data flows. Click "Go to resource" on the resource we created; you can see the data factory.
This opens the administration page, but to do development we need to open the Author & Monitor page, where we can build connection managers, datasets, and pipelines.
A window opens up like the one above. In the side panel you can see Author, where all of our pipelines and data flows are built; the built-in monitoring that comes out of the box; and Manage. Let's click on the Manage button.
When you click on Manage, you see Connections, where you can create linked services and integration runtimes. Under Source control, you can configure Git and ARM templates. Under Author, we see triggers, which are used for scheduled executions, just like SQL Agent on on-premises servers.
Let's go ahead and create the linked services. At the top you can see "New". Click on it to create a connection manager. We will create two connection managers here: one for the Azure Data Lake account and one for the Azure SQL Database, so we can interact with the data sources within them.
Let's first create the Azure Data Lake linked service. Click on the Azure tab next to the search box, then click on Azure Data Lake Storage to create a connection manager for it.
Give it a proper name that clearly represents the Azure Data Lake, for easy understanding. Here I chose "AzureDataLake_dbanuggets_datafactory", where dbanuggets-datafactory is my data factory account. Choose Account key as the authentication method here, but as a best practice we should not use the account key: it causes problems when using Azure DevOps for your code repository, because the repository will not store sensitive information. Instead, store secrets such as passwords in Key Vault and reference the Key Vault from here. For our testing, let's keep Account key. For the account selection method, use "From Azure subscription", choose the "Microsoft Azure Sponsorship" subscription, and pick the storage account we created in the first part of this series.
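Under the covers, a linked service is also stored as JSON. A simplified sketch of what an ADLS Gen2 linked service might look like (the account name and key are placeholders, the exact property shape can vary by version, and in production the secret would be a Key Vault reference rather than an inline key):

```json
{
  "name": "AzureDataLake_dbanuggets_datafactory",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account-name>.dfs.core.windows.net",
      "accountKey": "<storage-account-key>"
    }
  }
}
```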
Click on "Test connection" at the bottom of the screen, and once the test succeeds, click Create.
The Azure Data Lake connection has been created.
Now, let's create the Azure SQL Database connection. Click New at the top of the linked services page, choose the Azure tab, scroll down to Azure SQL Database, and click Continue.
Fill in the details and test the connection.
But my connection failed here. Why is that?
When you click More next to the error, you will see the details of the failure.
Azure Data Factory is a service in Azure, and yet it cannot connect to the Azure SQL server. We need to go to the server's firewall rules and check the settings there. Looking at the error, it failed because the IP address mentioned is not our IP address; it is the IP address of the virtual machine running in the background that Azure Data Factory is using to try to connect to the server. Go back to the All resources window and find the server we created earlier.
Under Security, find the firewall rules.
Click on the firewall rules and choose "Allow Azure services and resources to access this server", which means any service in Azure can now attempt to authenticate to this server. This can be a concern for many customers, because with that option on, anyone running on an Azure VM anywhere in the world can attempt to connect by trying different usernames and passwords. The alternative is to add the specific IP address from that VM, but the Data Factory IP address we saw in the error is dynamic and will actually change. So, for our testing's sake, let's temporarily enable "Allow Azure services and resources to access this server" so the data factory works, and then click Save at the top of the window.
Run the test connection once again, and it succeeds now.
Click Create. The Azure SQL Database connection is now created.
Let's go ahead with a simple example: pulling the data from the dbanuggets table into the Azure Data Lake. We will take the data from the dbanuggets table and copy it to a logical container in our data lake that groups those files together. So, we need one dataset pointing to the dbanuggets table and another to write the file into the data lake.
To do this, go to Author in the Azure Data Factory sidebar and see the different factory resources.
Pipelines are collections of activities that do the work; datasets are reusable connections.
Create a new dataset by clicking on the ellipsis symbol next to Datasets.
Add new dataset
Select the Azure tab under the search bar, choose Azure Data Lake Storage Gen2, and click Continue at the bottom of the window.
For the format, click DelimitedText (CSV).
Give the name of the container and the linked service we created before. The file path doesn't have to exist yet; if we give a path that does not exist, it will be created for us. Provide the file name in the file path section and keep "First row as header" checked. For the import schema option, choose None, because that CSV file doesn't exist yet; we will create it. Then click OK at the bottom of the page.
We have created our first resource in the data factory other than the connection managers, and we can see it as a tab across the top. As you create new resources, they open as new tabs; once we finish working on them, we click Publish all and close them. With the resource created, you can see many options to choose from. Select the file system account that we created earlier.
Parameters are really useful for making a dataset reusable. Instead of hard-coding the dataset to the bigproduct file, we can make it dynamic through parameters, so the name of the file can change automatically using parameters and expressions.
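As an illustrative sketch (the names here are hypothetical, and real dataset JSON carries more properties), a parameterized delimited-text dataset might look roughly like this, with the file name driven by a parameter through the @dataset() expression:

```json
{
  "name": "DataLakeCsvDataset",
  "properties": {
    "type": "DelimitedText",
    "parameters": { "fileName": { "type": "String" } },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "products",
        "fileName": "@dataset().fileName"
      }
    }
  }
}
```

Any pipeline that uses this dataset then supplies a value for fileName at run time, so one dataset definition serves many files.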
We need one more dataset pointing to the bigproducts table in the SQL database. Click the Datasets ellipsis again, choose Azure SQL Database, and click Continue.
Give the name, connect to the Azure SQL Database and the table, and consider the import schema option. The beauty of Azure Data Factory is that we do not have to know the schema up front; this is known as late-arriving schema. The schema can change at any time, and Data Factory will handle it. This is not the case with SSIS packages, which are completely schema-bound: if a table's schema changes, for example if you drop a column or rename one, all of the SSIS packages referencing that table will break. We could choose None so we don't have to worry about the schema and it stays dynamic, but let's import the schema for our example.
After creating this dataset, we need to click Publish all to save everything created so far to the production data factory. If you close the windows without publishing, or if something happens, all your work will be lost. So, let's go ahead and publish.
If you have set up source control with Azure DevOps, all this published code is sent to the repository instead of being published directly to the data factory. Now that we have created two datasets, let's move the data from the database to the data lake.
We need to create a pipeline for that. Go to the Pipelines section in the left pane > New pipeline > a window opens all the way on the right > give it a name.
Give the name and click on the Properties button to close the pane.
From the activities list, drag in the Copy data activity, which will simply copy the data.
At the bottom you can see all the activity settings. The "1" on top of the Source and Sink tabs tells us there are required properties we need to set to complete the activity (sink means destination). In the General tab, give the activity a name.
In the Source tab, set the source, and do the same for the destination. We can do the mapping manually, or let ADF map it automatically, which is the late-arriving schema behavior: whatever the schema may be, ADF will map it itself. This is such a beautiful feature.
We are getting the data from the table
In the sink (destination), provide the destination details
There are other settings you can change as well, but for this example, let's leave them as they are.
We do not want to change anything in the mapping; let it happen automatically. Now, go ahead and run Debug. This is like a testing mode in Azure Data Factory: we are running the most recent copy of the pipeline, not the one published to the data factory.
Note: if the debug fails for you, as it did for me before I fixed it, please make sure your source and destination connections are set up correctly and published properly.
When you go to the All resources tab in the left pane, open the storage account, and go to Storage Explorer, you can see the newly created container "products"; when you click on it, you will see the CSV file we created.
Publish the changes again
What if we want to change the name of the file dynamically, for example to add the date to the file name we already created? Open the file path for the CSV again, and under the file name field, select "Add dynamic content".
A new window pops up for adding the dynamic content.
Go to the string functions and choose the concat function.
Once you click on concat, the concat function appears automatically in the dynamic content box. For now, just add the name you wish to have before the date.
Then, for the date, go to the date functions and select utcnow.
You will see utcnow() added to the content automatically; just add ".csv" to the end of the content and click Finish.
Now, go ahead and debug the pipeline one more time to see how this works. Debug succeeded.
Go ahead and check that the file has been created by clicking Refresh.
A new file has been created with the date in its name. So cool, right?
But we want only the date part, so let's go ahead and do that:
Grab the substring function from the dynamic content window.
Insert utcnow() into substring and take only the date, starting at character 0 and taking 10 characters. Remember, in SQL Server positions start at 1, but in ADF they start at 0. Now, click Finish.
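As a rough Python analogy (not ADF code) of what the final expression concat('<name>', substring(utcnow(), 0, 10), '.csv') does, assuming the usual ISO 8601 timestamp shape returned by utcnow():

```python
from datetime import datetime, timezone

def build_file_name(prefix: str) -> str:
    """Mimic concat(prefix, substring(utcnow(), 0, 10), '.csv') from ADF.

    ADF's utcnow() returns an ISO 8601 timestamp such as
    2021-05-04T12:30:45.1234567Z, and substring() is 0-based,
    so characters 0 through 9 are the yyyy-MM-dd date part.
    """
    utc_now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    date_part = utc_now[0:10]           # substring(utcnow(), 0, 10)
    return prefix + date_part + ".csv"  # concat(prefix, date_part, '.csv')

print(build_file_name("dbanuggets"))    # e.g. dbanuggets2021-05-04.csv
```

The 0-based start index is exactly the difference from T-SQL's SUBSTRING, where the same slice would start at position 1.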
Now, run the debug again and see it succeed.
Go to Storage Explorer once more and look at the file.
In this post, we learned about Azure Data Factory. We learned to create linked services, datasets, and pipelines. We tested the pipeline using the Debug button and created files with different names by adding dynamic content. In the coming post, let's learn about creating data flows.
Azure SQL Database is a managed version of SQL Server in Azure: Microsoft manages the database engine, the back end, the infrastructure, and the backups. If until now you have been managing services on-premises and you would like to move to Azure while still managing some things yourself, you may be looking for Infrastructure as a Service, where you remain responsible for patching the operating system and updating your software. In that case you would go with SQL Server on a VM, which is cost-effective and gets you closest to the on-premises feature set. If you want all the features you have on-premises, SQL Server on a VM is the way to go.
We also have Azure SQL Database, where you can create a database in a few minutes at low cost. We can have an Azure SQL Database hosted in a Microsoft-managed data center and do our processing there, reading, writing, and modifying the data at a low price. We don't have to pay for, or create, any servers to host the database; everything is taken care of by Microsoft.
Azure SQL Database has some limitations: when you run your database as Azure SQL Database, we might not get all the functionality we have on an on-premises server. Another Platform as a Service option, where everything is still managed for you, is SQL Managed Instance. Here we have a SQL database managed on a VM, but you get back the server-level functionality you lose with Azure SQL Database. If you want to do big data analytics, machine learning, or data science and are looking for powerful resources to accommodate them, the option is SQL pools (formerly Azure SQL Data Warehouse), a massively parallel processing service.
When we create an Azure SQL data warehouse, it takes just a few minutes to set up and start running. Compare that to an on-premises DWH server build, which takes a lot of time, at least a couple of weeks, and costs a lot of money to procure all the hardware needed to make it happen. This has completely changed with Azure: setting up a DWH environment is a matter of minutes at very low cost.
Azure SQL Database:
Azure SQL Database is Platform as a Service, meaning Microsoft manages the hardware, infrastructure, backups, DR failover, data redundancy, and operating system patching; our responsibility is the database itself, its tables and objects, and reading, writing, and modifying the data. It is a relational database as a service. Microsoft guarantees the database will be available 99.99% of the time and financially backs that agreement: if the database goes down for a length of time, Microsoft provides credits. You only pay for what you need here.
We get three built-in replicas of the database for high availability, and point-in-time restore for up to 35 days with the default settings. For Azure SQL Database, we can choose the performance level and the price we are going to pay. We can use SQL Server Management Studio, and the SQL environment is similar and very familiar. We can also replicate the database to many geographical locations: just as we discussed for storage accounts in the previous post, if the entire data center hosting your Azure SQL Database goes down, we can immediately fail over to another geographical location and start using the database there.
A couple more built-in features of Azure SQL Database: when you delete your database in Azure, it is automatically backed up. We get Transparent Data Encryption, Always Encrypted, and row-level security. We also get Azure Defender for SQL, which monitors login and user activity and can be set up to send alerts if anything malicious happens, plus automatic database tuning and Azure Active Directory integration.
Azure SQL Database Limitations:
We cannot back up the database ourselves; Microsoft takes care of the backups.
There is no Windows authentication; instead, we use Azure Active Directory.
There is no Change Data Capture, SQL Agent, Database Mail, PolyBase, Service Broker, or MDS/DQS; no BI services (SSIS, SSRS, SSAS); only limited VNET integration; and the database can only be a transactional or snapshot replication subscriber.
Saying there is no SQL Agent doesn't mean we cannot schedule jobs. We can use Azure Data Factory instead of SSIS to move and clean data, and instead of SQL Agent we use triggers inside Azure Data Factory to kick off those events, either on a schedule or when an event occurs.
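To make the scheduling point concrete, here is a hedged sketch of what an ADF schedule trigger definition looks like as JSON (the trigger name, start time, and referenced pipeline are illustrative, not from this post):

```json
{
  "name": "DailyLoadTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": { "frequency": "Day", "interval": 1, "startTime": "2021-05-01T02:00:00Z" }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopySqlToLakePipeline", "type": "PipelineReference" } }
    ]
  }
}
```

This plays roughly the role a daily SQL Agent job would play on-premises: a recurrence definition attached to the work it should kick off.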
PaaS for databases is really beneficial because we get more time for optimizing queries, tuning application performance, monitoring how users use the applications, and scaling them well, so companies can save money on Azure as well.
We can connect to an Azure SQL Database using SQL Server Management Studio, just like an on-premises database. We can also use tools such as Visual Studio and PowerShell when connecting to resources in Azure.
Let's go ahead and practice this by building a virtual server, meaning we are not purchasing any physical hardware or physical servers. Microsoft creates these logical servers virtually, and the logical server is saved simply as a logical container, like the one we created in our previous post. It stores all the metadata about the databases, elastic pools, and any data warehouses attached to the server.
To create the logical server, go to "Create a resource" on the left side of the panel and search for "SQL Server (logical server)" in the search bar, which will bring up the service.
Click Create, and it brings up the creation screen for the resource.
Now give the resource group name, which is "dbanuggets", and in the server details tab give the server name, which must be unique across all of Azure. If someone anywhere in the world has already used the name you provide, you will not be able to create the server; this is because we use the server name when connecting to the databases on it. Give the server name as "dbanuggetsserver" and provide the location details.
In the administrator account section, provide the SQL Server admin login and password that will allow us to authenticate to the server and see the databases within it. Give the username and password and press Create.
Go ahead and press Create again via the button at the bottom of the screen.
It took a minute to create, and I received the notification that the deployment is complete.
Go to your resource groups, select the dbanuggets resource group we created in the previous post, and check that the server has been created and listed under it.
Now that our server is ready, let's go ahead and create the database.
Go to "Create a resource" on the left side of the panel; the resource we want to create now is SQL Database, so go ahead and search for it.
Click Create on SQL Database
Go ahead and provide the resource group, location, and name of the database.
When you scroll down, you see two important options. One is "Do you want to use an elastic pool?"
Azure SQL Databases are individual databases with their own allocated resources. This is very different from on-premises environments, where all the databases are created on the same provisioned server and share the available resources among them. In Azure, each database has its own resources, but to improve performance and save cost, you can create a pool of resources and attach multiple databases to it.
If you have databases with traffic only during certain periods of the day, or databases that are not used often but need full performance when they are, those are good candidates for a pool; not all databases fall into this category. If you would like databases to share resources, as they do in on-prem environments, then enabling the elastic pool can be helpful.
The other option is Compute + storage, meaning how much compute (CPU, memory, and disk speed for reading and writing) and how much storage we want to configure for this database. We can set this by clicking the "Configure database" option. It takes us to a new tab showing two different pricing structures: vCore, which is the new model, and DTU, which is a legacy model.
In the tab that opens, you can see the price options increase from left to right, starting with Basic, then Standard and Premium.
For our examples, we will use the DTU model because it is easier and has a lower-priced entry point. The vCore model is easier to compare to on-premises environments: for example, if on-premises we have a certain number of cores, we go with the same number of vCores, so it is easy to size. DTU is still supported, but as a backward-compatible legacy model.
We will choose Basic with 5 DTUs. What is a DTU? It is a blended amount of compute: CPU, memory, and the speed at which your storage can read and write data. Going from 5 DTUs to 10 DTUs means doubling all of these resources. With 5 DTUs we get only 2 GB of storage, and it costs about $5 a month, which is very little. Click Apply.
Review, and then click Create.
If in the future you would like to move to a higher service tier for your Azure SQL database, it is completely possible: you can easily scale up and scale down. You can also change from the DTU model to the vCore model, so you can move between them without any issues.
If you need to scale up to thousands of DTUs and then, after a couple of months, no longer need that many resources, you can easily scale down in Azure SQL Database. That is pay-as-you-go. Click the Review + create button at the bottom of the screen.
Your deployment is now complete
Open the database link. Under the SQL database, the screen below shows up as a blade, and you can click "Copy to clipboard" to copy the server name, like below.
Copy the server name to the clipboard, open SQL Server Management Studio, and paste the server name you copied from the Azure portal. Use SQL authentication and give the login ID and password you created when creating the server in Azure.
Once you press Connect, a screen pops up and it doesn't allow me to connect, because we have not yet granted access to our own IP address on the server.
This is the first step in protecting your data. Because I created an admin account, I can directly press Sign in and it will add my current IP address. Alternatively, we can go to the Azure portal and, on the server page, click on "Firewalls and virtual networks" in the left panel.
Click on Add Client IP
Give the rule a name
Click Save at the top of the same window
Now, try to connect to the server using the Azure server name, login ID, and password. You should be able to connect now.
This looks very similar to an on-premises database, and now we are connecting to the Azure database from our local Management Studio.
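If you prefer to connect from code instead of SSMS, the same server name and SQL credentials work over ODBC. Here is a minimal Python sketch that only builds the connection string (the server, database, and login names below follow this walkthrough but are placeholders; the actual connect call via pyodbc is left as a comment, since it needs the ODBC driver installed and a real password):

```python
def azure_sql_connection_string(server: str, database: str, user: str, password: str) -> str:
    """Build an ODBC connection string for Azure SQL Database with SQL authentication.

    Azure SQL servers are reached as <server>.database.windows.net on port 1433,
    and connections should be encrypted.
    """
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER=tcp:{server}.database.windows.net,1433;"
        f"DATABASE={database};"
        f"UID={user};PWD={password};"
        "Encrypt=yes;TrustServerCertificate=no;"
    )

conn_str = azure_sql_connection_string("dbanuggetsserver", "dbanuggets", "sqladmin", "<password>")
print(conn_str)
# An actual connection would then be:
#   import pyodbc
#   conn = pyodbc.connect(conn_str)
```

Remember that the firewall rule for your client IP still applies, whatever tool you connect with.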
I created a sample table in the database using the script below and added a few rows.
After creating the table and inserting this data, let's run select * from dbo.bigproduct and see the results.
You can create tables, insert data, and read and write data in the database right from Management Studio, just the same as with an on-premises database. There is a little latency inserting the data, as we are paying just a couple of dollars a month for this database on the free Azure account.
In this session, we learned about Azure SQL Database. We started by creating a virtual server, created a database on that server in Azure, added the admin account with SQL authentication, granted permissions to our client IP address, and connected to the database using SQL Server Management Studio. We created a table and inserted data into the Azure SQL Database from Management Studio on a local machine.
In the coming blog, we are going to learn Azure Data Factory fundamentals. Stay tuned!
With the invitation from Steve Jones for the April edition of T-SQL Tuesday, I am going to share some of my thoughts on using Jupyter notebooks.
Jupyter Notebook is an open-source web application that can serve us to create, share code and documents with others. It provides that beautiful way to document your code, text, images and query results. Jupyter notebooks in Azure data studio can run PySpark, Python 3, Spark R and Scala, as well as PowerShell. Run and visualize the data as you see the results without having to leave the environment. This is an excellent tool for documenting things. You do not have to write your long emails attaching bunch of screenshots any more. Once you start using the Jupyter notebooks, it will save you so much time and give you peace of mind. This tool can not only be used for documentation purposes but also can be used for your presentations for your demo purposes. Not only for the SQL code, this is a very nice tool for doing machine learning models, visualizing data and much more. Your code can be written in different cells which can be executed individually. This can help us to run specific part of the code in the project. This tool supports Python, R as well.
There are magic commands you can use to make your work easier. %clear, %autosave, %debug and %mkdir are a few of them, and magic commands come in two flavors: line magics and cell magics. A line magic (prefixed with a single %) applies to a single line, while a cell magic (prefixed with %%) applies to the entire block of code in a cell.
There are also interactive dashboards in Jupyter notebooks.
When you wish to share these Jupyter files, remember that they are stored in JSON format. You can save query results to different file formats such as CSV, Excel, JSON, or XML, and you can convert the notebook itself into different formats by using nbconvert. For analytical queries, you can also build a chart from the data.
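Because a notebook file is plain JSON, you can inspect or post-process one with nothing but the standard library. Here is a minimal sketch; the notebook content is a hand-written example in the nbformat v4 layout, not a real saved file:

```python
import json

# A tiny notebook in the nbformat v4 JSON layout (made up for illustration)
raw = '''
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Demo notebook"]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "outputs": [], "source": ["print('hello from a cell')"]}
  ]
}
'''

notebook = json.loads(raw)

# Pull out just the code cells, the way export tools walk the file
code_cells = [c for c in notebook["cells"] if c["cell_type"] == "code"]
print(len(notebook["cells"]))            # total cells
print(len(code_cells))                   # code cells only
print("".join(code_cells[0]["source"]))  # the code inside the first code cell
```

This is the same structure nbconvert reads when it turns a notebook into HTML, PDF, or a script.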
You can use Jupyter notebooks to give presentations to your customers and clients about your projects.
Jupyter notebooks are an amazing tool, as you can document everything you do. Once you start using them, you will definitely not want to go back to your previous tools. I recommend Jupyter notebooks to anyone who would like to explore and take advantage of the many features they have to offer.
Microsoft Azure is a cloud computing platform and infrastructure for building, testing, deploying, and managing your applications and services. For many years we have had servers and databases on-premises, where companies are responsible for buying, installing, and managing the hardware and software. Azure was introduced in 2008, and since then we have seen a huge transition to cloud computing, as it provides services for managing hardware and infrastructure, securing the data (including backups), and much more.
This is a new blog series about Azure fundamentals. If you would like to be notified as soon as I post a new article in this series, subscribe in the WANT TO BE NOTIFIED section at the right side of the screen.
There are twenty-two service categories available in Azure and hundreds of services to choose from. If you are new to Azure, you can sign up for a free Azure account. Microsoft offers a free $200 credit for 30 days, so you can freely explore the services: test and deploy enterprise apps, try data analytics with your own data, and much more. Here is the Microsoft link where you can create a free account:
Azure Data Services
On-premises, we are responsible for procuring the hardware, disk space, CPU, and everything else, and for maintaining it all. We are responsible for the networking, patching, virtualization, operating system, middleware, apps, and data. What if virtualization, servers, storage, and networking were managed by Microsoft, so we could focus mainly on application development and managing our data? That is where IaaS (Infrastructure as a Service) comes in. With this model, we only take care of the operating system, patching, software installation, apps, and data.
PaaS (Platform as a Service) is a model where Microsoft takes care of many more services, so you can focus solely on application and data development. Microsoft manages everything covered by IaaS, plus the operating system, middleware, and runtime. If your database ever needs to be restored, backups are available to restore from at any time.
SaaS (Software as a Service) covers offerings like Office 365 and Dynamics 365, where everything is managed for you; you simply write and read your data in those platforms.
Data Services in Azure:
These services are the Infrastructure as a Service and Platform as a Service solutions. In the diagram below, we can see on the left side all the sources we can connect to, whether the data is on-premises or in the cloud, structured or unstructured. We do have SSIS, but it does not have many cloud ETL capabilities, as it cannot connect to all the new data sources. Moving to Azure, there is Azure Data Factory, which provides a graphical user interface where you can drag and drop different activities. For extracting and transforming data, we also have Azure Databricks, which gives us a lot of flexibility: you can write R, Python, and Spark SQL, but using it requires coding. Data scientists and data engineers mostly use Databricks.
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics.
Once we apply the transformations, we load the data into Azure Data Lake Storage or another storage area such as Azure Storage, Azure SQL DB, or a SQL pool, or store it in a non-relational database system like Cosmos DB. Once the data is stored, we can report on it using Power BI.
Resource groups are logical containers for the resources we create in Azure. They are a great way to take all of the resources we create for a project, a department, or any other purpose and store them together, much like a directory holding related files. We can update, deploy, or delete the resources as a group. Every resource we create lives in a resource group, and the resource group also stores metadata about its resources: what type each resource is, what was allocated to it, and its name.
There are always ways to automate this, but for simplicity we are going to do it through the GUI in this example.
Open your Azure portal, go to the navigation pane, and click "Create a resource".
This brings us to the Marketplace, where we can see the 22 categories and the hundreds of services available. Find "Resource group" by typing it in the search bar.
Select Resource group and click Create.
This brings us to the resource creation page. Give the resource group a name and choose the location where it will be stored. For the sake of latency, productivity, and performance, choose the right region. There are other options available, but to get started with Azure this should be enough. The Tags tab holds information such as which project the resource group is associated with, who owns it, and who purchased it.
It is better to keep all of our resources in the same region so we can reduce cost: moving data from one region to another incurs an extra charge. When you create your databases, storage accounts, and data factories to move data, it is important to keep the data in the same data center.
Then, at the bottom of the screen, click Review + create.
It should say "Validation passed"; press Create at the bottom of the screen. Once you do, your resource group is created and you get a notification at the top right corner of the screen.
You can go to the resource group either from the notification section or from the navigation pane by clicking "Resource groups".
When you click on the resource group "dbanuggets", a vertical pane opens across your screen; this is known as a blade in Azure. Clicking any tab in it opens further vertical panes. We are currently looking at the resource group we created, "dbanuggets", and under its name you can see it is a resource group. Below that are the administrative properties we can configure: security settings such as access control and policies, resource costs, and deployment history. So, we have created our logical container, the resource group.
If you want to close this resource group pane, simply press the X button at the top right corner of the pane to go back to the previous blade.
Azure Storage offers the general-purpose v2 account and the Azure Data Lake Storage Gen2 account. This is the cheapest storage option we have available. If you are creating a storage account in Azure, Azure Data Lake Gen2 is usually good enough, unless you need features that are only available in general-purpose accounts and do not exist in Azure Data Lake Storage.
Azure Data Lake uses Azure Active Directory, while a blob storage account uses a shared access key. If someone obtains the name of the storage account and gets to the access key, that can be a dangerous situation; the access key should be regenerated on a schedule, every couple of weeks or so, to keep it secure. With Azure Data Lake Gen2 accounts, we can take advantage of Azure Active Directory logins instead. There are some regulatory and compliance features we get with general-purpose accounts that we do not get with Data Lake. One of them is WORM (Write Once, Read Many) storage: once the data has been written, it cannot be modified. Azure blob storage offers WORM storage; Azure Data Lake does not. Blob storage also has a soft-delete capability, so if you accidentally delete anything, it can be restored; Azure Data Lake does not offer that either.
Azure Data Lake is designed to store massive amounts of data for big data analytics. It provides unlimited scalability, Hadoop compatibility, the optimized Azure Blob File System (ABFS) driver designed for big data analytics, and zone-redundant and geo-redundant storage.
Let's create an Azure Data Lake:
Go to the navigation pane and search for "Storage account". Creating a general-purpose storage account or an Azure Data Lake goes through the same storage account service. When you see the storage account service, click Create.
You will see the Create storage account pane; at the top there are several tabs such as Data protection, Advanced, and Tags. You can create the storage account first and configure these later.
Select the resource group, here the "dbanuggets" group we created, and give the storage account a name. There are some requirements to remember when naming a storage account: the name must be all lowercase, and it must be globally unique across all of Azure, because we connect to the storage account through its name.
Here I am creating the storage account with the name "dbanuggetsstorage" and choosing a location that matches the resource group, East US. Then choose the kind of account you are going to create. Remember, this is not where we choose between general-purpose v2 and Azure Data Lake Gen2.
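Azure enforces strict rules on storage account names: 3 to 24 characters, lowercase letters and digits only (global uniqueness can only be confirmed by Azure itself at creation time). Here is a quick local pre-check sketch; the function name is mine for illustration, not part of any Azure SDK:

```python
import re

# Legal storage account name: 3-24 characters, lowercase letters and digits only.
# Global uniqueness still has to be verified by Azure when the account is created.
_NAME_RE = re.compile(r"^[a-z0-9]{3,24}$")

def is_valid_storage_account_name(name: str) -> bool:
    """Local sanity check before submitting the name in the portal or an API call."""
    return bool(_NAME_RE.fullmatch(name))

print(is_valid_storage_account_name("dbanuggetsstorage"))  # True
print(is_valid_storage_account_name("DBA_Storage"))        # False: uppercase and underscore
```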
Then go to the Replication option and select the type of redundancy you need. The cheapest, default option is locally-redundant storage (LRS): whether you are working with a data lake or a plain storage account, Microsoft makes sure there are three copies of the data at any given time. This is an amazing thing, as we do not have to worry about one data rack going down; two other copies remain readily available. LRS is redundancy within a single data center, so if the data center itself goes down, we need one of the other options Azure offers, such as zone-redundant storage (ZRS) or geo-redundant storage (GRS). There is a cost associated with whichever option you choose. For this example, we are going to pick the cheapest one, locally-redundant storage (LRS).
In the Networking tab, we can control access through IP addresses. The next tab is Data protection, where we can enable soft delete for blobs: if someone deletes a file accidentally, we can recover it. You can also enable versioning for blobs and turn on point-in-time restore for containers.
The next tab is the Advanced tab, where we choose what kind of storage account this will actually be: Advanced > Data Lake Storage Gen2 > enabled or disabled. This single setting determines whether your storage account is a data lake, and it must be set at creation time; we cannot go back and modify it later. Let's go ahead and enable Data Lake.
Once you enable this option and go back to the Data protection tab, many of the options there are disabled: we lose point-in-time restore, soft delete, blob versioning, and WORM storage (once data is written, it cannot be modified). These are the limitations of Azure Data Lake, so if you need those options, go with a blob storage account instead. Remember that we are not restricted to one storage account for all our projects: some projects can use blob storage while others use Azure Data Lake Storage. If some data requires regulatory compliance such as WORM storage, or you need soft delete so no one accidentally loses files, that specific project can use a blob storage account. We can always mix different services across our projects in Azure.
On the Tags tab, we can give a tag name and value so that, in the billing information, the resource is tagged and we can see our costs clearly. It is a good practice to tag all of our resources. Click Review + create at the bottom of the page.
You will see "Validation passed" at the top of the screen. Review the resources you are about to create and click Create at the bottom.
Once I press Create, I see the deployment in progress. The storage account we just created will not cost us anything yet: you get billed when you write data to the storage account and when you read data from it. If we create the storage account but put nothing in it, we are not billed; if we use it very lightly, creating only a few files, we are charged minimally. We can monitor the costs later in Azure.
You will see a notification at the top right corner of the screen when the deployment is complete.
To see what the storage account we created looks like, go to the navigation pane on the left side of the screen, choose Resource groups, and select the resource group we created (in my example, dbanuggets). Here is our storage account, "dbanuggetsstorage".
Once you click on the storage account, a new blade opens in the vertical pane and you can see all the details associated with that storage account.
When you scroll down, you will see "Containers". Click on that; before we can upload files to, or remove files from, an Azure storage account, we need to create a container.
Click on the containers and create a container by giving a name.
Here I named the container "dbawork". Remember, we are creating a directory here, and the name must be all lowercase. I am keeping the public access level Private in this example. Then click Create.
You will see the notification that the container has been created successfully.
We can now upload and download files. Usually we automate this; later we will do it through Azure Data Factory. For now, let's manually upload a file in this example and see how easy it is to import data into Azure. Click on the container we just created.
Click on the upload button at the top
Select the folder to upload the file
Upload the file and press upload
The file has been uploaded; it is now in Azure, stored in an Azure data center. Any file we upload to a storage account has to go into a container. Containers are just like the directory locations holding folders and files in our on-premises environments, and we can set up security at the container level.
You can click directly on the file that is now in the container and click Edit to edit its contents.
A new blade pops up, opening the file so you can edit it.
For easily uploading and managing your files in storage, there is a beautiful application you can use: Azure Storage Explorer.
Azure Storage Explorer is an amazing tool that lets you connect to your Azure account and see all the different resources you have in Azure. It works for both general-purpose accounts and Azure Data Lake Gen2. You can upload and download files and folders, change permissions, and easily modify the security on these containers through the application. If you do not have it already, download it here.
Once you download Azure Storage Explorer, sign in with the Azure account you created. I signed in with mine, and I can see it in the account section at the left side of the panel.
Under the Explorer panel, expand the Azure subscription tab to see all the resources you have, like the storage accounts and containers, along with the file shares.
You do not have to log in to the Azure portal; you can upload and download from here directly. We usually automate upload processes through Azure Data Factory or through business applications.
In this post, we learned about the Azure data service models, Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), and the differences between them. We also learned about Azure Storage: blob storage and Data Lake Storage Gen2, with examples so you can get started with Azure.
In the next post, we will learn the fundamentals of Azure SQL DB.
When I was preparing my presentation "Method Behind the Magic of Internals in SQL Server", I planned to write a blog post about data page internals, and I am finally able to. Here are some little nuggets about data page internals in SQL Server.
SQL Server stores data in the form of 8KB pages. Each page is linked to the previous and the next page by their page numbers. There are two broad kinds of pages in SQL Server: pages that contain the actual data, and pages that contain data about other pages, that is, metadata. We will have a separate blog post on the types of pages in SQL Server. No matter how many blog posts we read on this topic, it never gets old 🙂
Let's quickly look at the structure of a page.
The page header is always 96 bytes in size. This area contains metadata about the page itself, like the page number, the previous page, the next page, and much more (a separate blog post about this is coming in the near future).
The payload area is where the actual data gets stored, in slots. At the bottom of the page is the row offset array, which holds 2-byte pointers to the exact offset where each corresponding row begins on the page.
The data inside the payload area may not be in physical order, but it is always logically ordered through the row offset array, in index sort order. SQL Server reads the row offset array from right to left. Slot 0 holds a pointer to the lowest key value on the page, slot 1 points to the next lowest key value, and so on.
There are two main kinds of data types in SQL Server: fixed-length and variable-length. Fixed-length data types always use the same space, regardless of whether the value is NULL. For variable-length data types, SQL Server uses only the required space, plus two extra bytes.
Each byte in the picture has a purpose. The first two bytes, Status Bits A and Status Bits B, contain bitmap information about the row: whether the row has been logically deleted (ghosted), the row type, whether there is a versioning tag, and whether the row contains NULL values or variable-length columns. The next 2 bytes store the length of the fixed-length part of the row. The next n bytes store the fixed-length data itself. After that comes the null bitmap, which holds both a 2-byte column count and the null bitmap array itself: one bit per column, regardless of whether the column is nullable.
Next is the variable-length data portion, where the first 2 bytes store the number of variable-length columns in the row. The next n bytes are the offset array for the variable-length columns; even when a column's value is NULL, SQL Server still stores a 2-byte offset for it, unless the NULL columns fall at the end of the row. Then comes the actual variable-length data. Finally, there is an optional 14-byte versioning tag used for row versioning, by the optimistic isolation levels and by online index rebuilds.
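The byte-swapping and bitmap arithmetic above are easy to check for yourself. Here is a small sketch using Python's struct module to decode the little-endian 2-byte values and the null bitmap from the first row of the example table; the hex strings are taken from the walkthrough, not from a live DBCC PAGE dump:

```python
import struct

# 2-byte values on a page are little-endian: raw bytes 08 00 mean the value 0x0008
fixed_len_end = struct.unpack("<H", bytes.fromhex("0800"))[0]
print(fixed_len_end)  # 8: offset where the fixed-length part of the row ends

# One-byte null bitmap 0x04 = 00000100: bit 2 (zero-based) is set,
# so the third column of the row is NULL
null_bitmap = 0x04
null_columns = [i + 1 for i in range(4) if null_bitmap & (1 << i)]
print(null_columns)  # [3]

# Variable-length column count bytes 03 00 -> 3 entries in the offset array
var_count = struct.unpack("<H", bytes.fromhex("0300"))[0]
print(var_count)  # 3
```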
Let's go ahead and create a brand-new table and insert two rows to see what they actually look like. Using a DBCC command, let us first view all the pages for the table.
create table dbo.dbanuggets
(
    ID int not null,
    nugget1 varchar(255) null,
    nugget2 varchar(255) null,
    nugget3 varchar(255) null
);
insert into dbo.dbanuggets(ID, nugget1, nugget3) values (1,replicate('d',10),replicate('b',10));
insert into dbo.dbanuggets(ID, nugget2) values (2,replicate('a',10));
-- dbcc ind ('databasename', 'dbo.tablename', -1) displays info about all pages of the table
dbcc ind ('dbanuggets', 'dbo.dbanuggets', -1);
To view the page header and the column information, turn on trace flag 3604 with DBCC TRACEON(3604) so the DBCC PAGE output is sent to the client:
When you check the first record here, it has a PageType of 10, which is a metadata page, the IAM (Index Allocation Map). This page tracks which pages belong to a particular table. The next record, with a PageType of 1, is the actual data page containing the data rows.
Let's not focus on the IAM page for now; we will cover allocation map pages in one of the following blog posts.
Check the row structure here:
The third value in the row is the two-byte value 0800. This is byte-swapped, so it reads as 0008: the offset where the fixed-length part of the row ends. The next 4 bytes store the fixed-length data, the ID column. The next 2 bytes show how many columns the row has; here we have 4. Next comes the one-byte null bitmap; since we have 4 columns, one byte is enough. The value shown is 04, which is 00000100 in binary: it indicates that the third column in the row contains a NULL value. The next 2 bytes hold the number of variable-length columns, 0300 in swapped order, meaning 3 columns. Next comes the offset array holding the position where each variable-length column's data ends, and then the actual data for the variable-length columns.
The second row looks like this:
The row structure is a bit different here:
Observe that the nugget1 and nugget3 columns hold NULL values. The null bitmap in the second row shows the binary value 00001010, which tells us that nugget1 and nugget3 are NULL. The variable-length column count indicates there are only two columns: SQL Server stores no information in the data row for a trailing NULL variable-length column. So, for variable-length columns that will often hold NULL values, we can manage the size of the data row by declaring them at the end of the table definition.
There are 8,060 bytes available for rows on a single data page. The null bitmap is always present in heap tables and in clustered index leaf-level rows, even when the table has no nullable columns. Nonclustered indexes omit the null bitmap if there are no nullable columns in the index.
In this post, we learned what a data page looks like and the different parts of a data page. We also learned what the data row structure looks like, with examples, and how NULL values take no space in the row data but are tracked by SQL Server in the null bitmap. I will be posting more about page internals in future blog posts.
In Part 1 of decoding the physical join operators, we learned about the different types: nested loops, merge joins, and hash joins. We saw when each is useful, how to take advantage of them for query performance, and when they should be avoided.
In this part, we will learn more about these operators and how indexes help them perform better so queries execute faster.
To use a merge join, the join must include an equality predicate: a column of one table must be equal to a column of the other. If the join criteria are range comparisons, such as greater than, less than, or not equal to, the optimizer will not use a merge join.
In Part 1, we joined every single value from the first table to find all the matching values in the second table. What if we only want to work with a few values, say 6, from table 1 instead of all of them? For a hash join, SQL Server still has to go through every single value in the first table before reading any values from the second table. With nested loops, as soon as we read one row from table 1, SQL Server can look for the matching rows in the second table. You might object that, for each of the 6 values from table 1, we still have to read all the values from the second table. That can be pretty slow, until we create an index. With an index on the join column of the second table, the lookups become very fast: after reading a row from the outer table, SQL Server can use the index to find the location of the matching row in the inner table. That helps, right? Instead of scanning all the values in the second table 6 times, SQL Server just does 6 index lookups: it reads 6 values from table 1 and performs 6 index lookups against table 2.
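The effect of that index on the inner side of a nested loops join can be sketched in a few lines of Python, with a plain dict standing in for the index (the tables and keys here are made up for illustration):

```python
# Outer table: the 6 rows we care about. Inner table: many rows keyed by id.
outer = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f")]
inner = [(i, f"row-{i}") for i in range(1, 10_001)]

# "Create the index": one pass over the inner table builds a lookup structure,
# analogous to a nonclustered index on the join column
index = {key: payload for key, payload in inner}

# Indexed nested loops: one cheap lookup per outer row instead of a full scan
results = [(k, v, index[k]) for k, v in outer if k in index]
print(len(results))  # 6 matches, found with 6 lookups instead of 6 full scans
```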
A hash join starts by reading all the rows from the first table, building the hash table. Then it starts reading rows from the inner table, and it keeps going until it finds the 6 matching rows, which may mean reading every row in the second table as well if there is no index. In this case, the hash join is much slower than the nested loops join.
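The build-then-probe behaviour described above can be sketched like this, with toy data and a plain dict in place of SQL Server's in-memory hash table:

```python
from collections import defaultdict

build_input = [(1, "order-1"), (2, "order-2"), (2, "order-2b"), (3, "order-3")]
probe_input = [(2, "detail-x"), (3, "detail-y"), (4, "detail-z")]

# Build phase: read ALL rows of the first input into a hash table
# keyed on the join column before touching the second input
hash_table = defaultdict(list)
for key, payload in build_input:
    hash_table[key].append(payload)

# Probe phase: stream the second input and look each row up in the hash table
matches = [(key, b, p) for key, p in probe_input for b in hash_table.get(key, [])]
print(matches)
```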
A merge join needs the data sorted first, so an index on the join column of the first table can avoid SQL Server's own explicit sort. If we have an index on the join columns, then instead of sorting all the rows in the outer table, we can simply read the matching rows through the index; and if there is an index on the join column of the second table, SQL Server will use it to feed the merge join as well.
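A merge join over two already-sorted inputs can be sketched as two cursors advancing in step. This toy version assumes the left input has unique keys, the one-to-many case we get once a unique index exists:

```python
def merge_join(left, right):
    """left and right are lists of (key, payload) sorted by key; left keys are unique."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, left[i][1], right[j][1]))
            j += 1                      # left key is unique: keep scanning the right side
        elif lk < rk:
            i += 1                      # advance whichever cursor holds the smaller key
        else:
            j += 1
    return out

left = [(1, "L1"), (2, "L2"), (4, "L4")]
right = [(2, "Ra"), (2, "Rb"), (3, "Rc"), (4, "Rd")]
print(merge_join(left, right))
```

Each input is read exactly once, which is why the join is cheap when a supporting index has already done the sorting.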
If we are returning a small fraction of the rows from each table, nested loops will be efficient: for every value in the first table, the matching values in the second table can be found quickly using an index. Nested loops are very good when we are working with a couple of rows from each table. But if we want to compare large amounts of data from the first table, we would have to do that many index lookups on the second table; in that situation it is better to switch to a hash join, which builds the hash table from the first input and then probes with a single full scan of the second table.
The tipping point at which the optimizer switches between nested loops and hash join as the fastest option can be really small. If your statistics are out of date, the row estimates can lead the optimizer to choose a non-optimal join type, causing performance issues.
select soh.*, sod.*
from Sales.SalesOrderHeader soh
inner join Sales.SalesOrderDetail sod on soh.SalesOrderID = sod.SalesOrderID
where soh.SalesOrderID = 43659; -- example predicate returning a single order
In the above example, SQL Server used the nested loops operator. It did a clustered index seek on the outer table, where the estimate based on the predicate is 1 row, and a clustered index seek on the inner table, with an estimated 15 rows. Here we have an index on the inner table; we need one to avoid an expensive scan operation. The inner clustered index seek executes only once because only one qualifying row is returned from the outer table. The compute scalar operator appears twice because we are returning two computed columns, one from each table.
Let's create two new tables with columns ID and Rollnumber and insert values into them. Since we have no unique constraint or unique index defined on the Rollnumber column, the optimizer does not know the column is unique, so it does not choose a merge join on its own. We can force the merge join with a query hint and check the execution plan.
select * from [dbo].[Table1];
select * from [dbo].[Table2];

select a.*, b.*
from [dbo].[Table1] a
inner join [dbo].[Table2] b on a.Rollnumber = b.Rollnumber
option (merge join);
SQL Server has to sort the data coming from table 1 and table 2 first, and it used a merge join because we forced the optimizer with the hint in the query. If you look at the memory grant, it used 1MB.
When you hover your mouse over the merge join operator, you will see Many to Many shown as True. That means SQL Server assumed there may be duplicate values in the outer table's join column, because there is no supporting index or check constraint defined.
Because it expected duplicate values, SQL Server also created a worktable. You can see this by running SET STATISTICS IO ON and checking the Messages tab.
Now let's create a unique nonclustered index on the second table, Table2, the inner table SQL Server chose in the previous plan. With the index in place, SQL Server chooses Table2 as the outer input, because it now knows the values in Table2 are unique. As the rows are read directly from the index we created, the explicit sort operator is no longer needed for that input.
create unique nonclustered index IDX_rollnumber on table2(Rollnumber)
When you check the Many to Many property now, it is False.
Since we created the index and SQL Server now knows the values in Table2 are unique, it did not need the explicit sort or the worktable, and the Many to Many property is False this time. Now let's create the unique nonclustered index on the first table, Table1. This time there are no explicit sort operators at all, as the Rollnumber column is already in sorted order thanks to the newly created index on Table1.
create unique nonclustered index IDX_rollnumber on table1(Rollnumber)
The prerequisite for a merge join is that the data is sorted, either by a supporting index or by an explicit sort operator that SQL Server adds to the execution plan.
Let's run a very simple query in the AdventureWorks2016 database to generate a hash match operator:
select p.BusinessEntityID, p.FirstName, p.LastName, pp.PhoneNumber
from Person.Person p
inner join Person.PersonPhone pp on p.BusinessEntityID = pp.BusinessEntityID;
If you look at the estimated number of rows, they match the actuals. If the estimates are off compared to the actuals, the hash join can spill rows to disk to do its work, which is really expensive.
When you observe the execution plan, the PersonPhone table was used as the build input to create the hash table, and that hash table is then probed with rows from the Person table to get the matching records. You can also see which columns the probing was done on by hovering your mouse over the Hash Match operator and looking under Hash Keys Probe.
There are three types of physical join operators: Nested Loops, Merge Join, and Hash Join. Nested loops are suitable when joining small tables or small fractions of big tables. Hash joins are better when we are joining all the rows of a table, or large tables. A Merge join can be efficient if the rows are already sorted; sorting itself is slow, but the join can be fast if indexes are present on the joining columns.
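To see all three operators side by side, you can force each one with a join hint and compare the resulting plans (a sketch against the table1/table2 pair from this example; forcing join hints is for experimentation only, since it overrides the optimizer):

```sql
-- Force each physical join operator to compare the plans (experimentation only)
SELECT t1.Rollnumber FROM table1 AS t1 INNER LOOP JOIN table2 AS t2
    ON t1.Rollnumber = t2.Rollnumber;

SELECT t1.Rollnumber FROM table1 AS t1 INNER MERGE JOIN table2 AS t2
    ON t1.Rollnumber = t2.Rollnumber;

SELECT t1.Rollnumber FROM table1 AS t1 INNER HASH JOIN table2 AS t2
    ON t1.Rollnumber = t2.Rollnumber;
```

Run each with the actual execution plan turned on and compare the operator costs, row counts, and whether an extra Sort appears for the merge join.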
How will knowing all this information help you? You can tune your queries by checking that the table statistics are up to date and that the right indexes exist. One of the most important inputs the optimizer uses to choose the join type and the join order is the statistics-based estimate of how many rows it thinks will come out of each table. While tuning queries, check for indexes on the join columns: creating a missing index can let the optimizer choose a nested loop or merge join, whereas with no index it is often forced into a hash join. Making sure the statistics are up to date and the necessary indexes exist helps the optimizer choose a good enough plan for your queries.
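A sketch of that check using the statistics DMV and the missing-index DMV (the object name dbo.table1 is a placeholder for whichever join table you are tuning):

```sql
-- How fresh are the statistics on a join table? (placeholder object name)
SELECT s.name AS StatName,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.table1');

-- Index suggestions the optimizer has recorded for this database
SELECT d.equality_columns, d.inequality_columns, d.included_columns
FROM sys.dm_db_missing_index_details AS d
WHERE d.database_id = DB_ID();
```

A high modification_counter relative to rows, or a missing-index row listing your join column under equality_columns, points at exactly the two causes described above.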
I have been meaning to write about this topic for a long time. I always had trouble clearly understanding how Nested Loop joins and Merge joins work. Once I learned how they actually work, I planned to write a blog post so you can learn how these joins work as well. By the end of this post, you will have a nice smile on your face, because this is so easy once you understand the concept behind these joins. This is a 2 part series.
Types of Physical Operators
Nested Loop Joins:
For example, the diagram shows a representation of Table 1 (the outer table) and Table 2 (the inner table). For every value in the outer table, SQL Server will query the inner table looking for matching values, so every value in the second table is read once for every value in the first table. In the example below, the first value (in orange) is compared with each value in table 2. Once all of table 2 has been compared, SQL Server loops to the next value in table 1 and compares each value in table 2 again, and this process repeats until every value from table 1 has been compared with every value of table 2 to produce the matched values as the output of this operator. This is a lot of work, and it takes a lot of CPU.
If the inner table is presorted by having an index, this saves time: as we iterate over the outer table's values, SQL Server can now seek directly to the rows it needs to compare instead of comparing against every row in the second table. This reduces the number of comparisons, which can significantly improve performance.
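In T-SQL terms, "presorting the inner table" just means indexing its join column, so each outer-row lookup becomes a seek (a sketch using the table and column names from this example):

```sql
-- Index the inner table's join column so each nested loop iteration
-- can seek to matching rows instead of scanning all of table2
CREATE NONCLUSTERED INDEX IX_table2_Rollnumber
    ON table2 (Rollnumber);
```

In the actual plan, the inner side of the Nested Loops operator should then show an Index Seek instead of a scan.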
Now, how do we know whether the nested loops in our plans are appropriate? Look at the actual execution plan and see how many rows come out of the Nested Loops iterator. If only a few rows come out of that iterator, a nested loop join is fine. Another cause might be the estimates: if they are off, the optimizer may use a nested loop join where a different join would serve the data better. We also might be missing an index that could presort the data going into the join, which can likewise be the reason a nested loop join was chosen.
What if we sorted the data, so that both tables are now sorted? Look at the diagram below for reference. With both tables in sorted order, SQL Server takes the first value from table 1 and compares it with the first value in table 2, which is also 1: a match. It then moves to the next value in table 2, which is 2, and compares it with value 1 from table 1; this doesn't match, and because table 2 is sorted and 2 is greater than 1, SQL Server knows there can be no more matching rows in table 2. It stops scanning and moves to the next value in table 1, which is 2, resuming from the last value it joined on (value 1 in table 2). Value 2 doesn't match value 1, so it moves to value 2 in table 2, finds the match, then moves to value 3, sees that 3 is greater than 2, and stops the comparison, knowing no more matches can follow in sorted data. It then takes value 3 from table 1, resumes from the last value it joined on (value 2 in table 2), finds the matching value 3, moves to value 4, sees the mismatch, and stops again. It continues with value 4 from table 1 in the same way, and this process repeats until every value in table 1 has been compared. SQL Server walks and compares both tables bit by bit.
This is the basis of a Merge Join.
In the Merge join, we sorted the tables on the join columns and walked down them bit by bit, comparing rows. Because the rows are sorted, SQL Server can stop processing as soon as it finds a value in the second table that is greater than the current row's value in the first table.
The Merge join is the fastest physical join operator in SQL Server. It is so efficient because it can go through each input a single time; it doesn't have to loop back and repeat work like the nested loop join. The exception is a many-to-many merge join, where SQL Server cannot simply stream both inputs once. Instead, it creates a worktable in tempdb. The comparison starts just like a regular merge join, but when SQL Server finds duplicate values in the second table, it writes them into the worktable and does the comparisons there; if it also finds duplicate values in the first table, it compares those first-table values against the rows already stored in the worktable. Once the iterator moves past the duplicated values, the worktable is cleared out, and the merge join continues until all values have been compared. If you would like to know in depth how worktables operate, please refer to this wonderful article from sqlserverfast.com.
A Merge join is not a bad thing, and it may already be efficient in your execution plan. When you see merge joins and the plan performs slowly, focus on the upstream operations feeding the merge join. If the data is presorted by an existing index, great; if SQL Server had to sort it itself, check whether you can simply add the missing column as the last key column of the index, or whether a different join algorithm would be better. The other scenario is having lots of duplicate values in your data: in that case SQL Server uses worktables to handle how the duplicates are joined. So, if you see duplicate values or tempdb usage, it is worth looking for better options.
Combining tables with a merge join is faster than a nested loop join, but we need to sort the data first, and you know that sorting is slow. So how can we do better? That's where the Hash join comes in.
For every value in table 1, we apply a hash function and build a hash table; every value is now in the hash table we built. Once that is done, we go to the second table, apply the same hash function to each value in table 2, and probe the hash table for matching values: when we find one, we have a join. We repeat the process for all the values in the second table. This is the Hash join.
The Hash join happens in two phases. The first is the build phase, where SQL Server builds an in-memory hash table from one of the two inputs: hashes are calculated on the join keys of the input data, and the whole rows are stored in the hash table under hash buckets. Mostly there is one row per hash bucket, except when multiple rows have duplicate join keys or there is a hash collision (totally different join keys getting the same hash, which is a rare scenario). Once the hash table is built, SQL Server starts the probe phase, where it calculates the join key hash for each row in the second input and checks whether it exists in the hash table built in the first phase. If it finds a match, it then checks whether the join keys themselves actually match, because of the possibility of hash collisions.
The Hash join is not the fastest type of join, because it needs to run the hash function, which uses more CPU to calculate the hashes of the join keys, but it is really useful for joining very large tables, and it uses tempdb if memory is not sufficient to build the hash table. If SQL Server cannot fit all the buckets of the hash table in memory because the input is very large, the hash join can spill to disk and use tempdb. In this scenario, SQL Server keeps as many buckets as possible in memory, and the ones it can't keep are stored in tempdb. In the probe phase, SQL Server joins any rows that match the in-memory hash buckets; rows whose match cannot be found in the in-memory portion are written to tempdb. Once all the rows from the probe input have been processed, SQL Server loads the related spilled data from tempdb back into memory and continues the comparisons there, staging data in and out of tempdb as memory gets freed up.
What can we do if we see a Hash Match in our execution plans? First, consider whether this operator is appropriate for the plan, because it is a blocking operator: until the build phase completes, all the downstream operators in the execution plan have to wait. We know that the Hash join can join any data sets, but when it spills to disk it can cause performance issues. That's when we need to check whether the memory grant allocated to the query was insufficient because of incorrect estimates: a grant too small to build the hash table in memory causes the spill to disk. In this scenario, check that the statistics are up to date, and check whether adding a column to an index might help the optimizer choose a different, more optimal physical operator to process the query effectively.
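One way to sketch that investigation: compare the granted memory against what the query wanted, then refresh the statistics so the next grant is sized from accurate estimates (the object name below is a placeholder):

```sql
-- Spills often pair with granted_memory_kb far below ideal_memory_kb
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       ideal_memory_kb
FROM sys.dm_exec_query_memory_grants;

-- Refresh statistics so the optimizer can size the grant correctly
UPDATE STATISTICS dbo.table1 WITH FULLSCAN;  -- placeholder object name
```

In SSMS, the same spill shows up as a warning triangle on the Hash Match operator in the actual execution plan.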
In this part 1, we have learned how the physical join operators work, when they are efficient and when they are not, and where to look when they are not efficient so we can make our queries faster. In the second part of the series, we are going to look at how indexes help these physical operators go faster.
In Part 3 of the series, we learned about Sorts, Hash matches, and Serial nested loops. We saw when they can be useful and when they need to be avoided, with examples. In this session, let's learn about Seek and Residual predicates with examples.
Let's take an example of a very simple query running against the BigTransactionHistory table and check the execution plan.
Credits: Adam Machanic
SELECT *
FROM bigTransactionHistory AS bth
WHERE ProductId BETWEEN 1001 AND 50001;
When you check the execution plan, the estimates are off here: the estimated number of rows was 80, but the actual number of rows was 0. Did you also notice we have parallelism here? Parallelism is used for high-cost query plans, so why for this simple query?
Run the same query again, checking the statistics this time:
SET STATISTICS IO ON
SELECT *
FROM bigTransactionHistory AS bth
WHERE ProductId BETWEEN 1001 AND 50001;
SQL Server performed 131444 logical reads, which is a lot.
There are two types of predicates in SQL Server: index (seek) predicates and residual predicates. Residual predicates are predicates that are not satisfied by the index but are still pushed down to the storage engine. When you observe the predicate information in this execution plan, you see the index predicate (Seek Predicate) on ProductID, and the residual predicate was evaluated by the storage engine instead of the query processor: the filtering cost was paid in the storage engine, and the filtered-out rows were never passed to the query processor.
So what are we missing here? How do we track down where the issue is? Notice that the line going from the Index Seek to the Parallelism operator is very thin. These line widths can be really useful for tracking down issues when your execution plans are big, so we do not have to inspect each and every iterator in the plan. The residual predicate helps the query run faster, but it makes the plan harder to debug, because the rows it discards never appear between operators. If you want to test query performance in lower environments, enabling trace flag 9130 can be really helpful for debugging the plan: it disables residual predicate pushdown and forces the residual predicates to be processed as an explicit Filter in the query processor. When you enable this trace flag and run the same query, look at the difference in the line width. Enabling this trace flag is useful when you see a lot of I/O and cannot figure out where the issue is.
Warning: Do not enable trace flag 9130 in production, because it will cause query slowness. Enable it on non-production servers to quickly debug queries. This is a gem for query tuning developers and DBAs.
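On a non-production server, a sketch of scoping the trace flag to a single query with QUERYTRACEON (which requires elevated permissions) instead of enabling it server-wide:

```sql
-- Non-production debugging only: disable residual predicate pushdown
-- for this one query so the discarded rows appear as an explicit Filter
SELECT *
FROM bigTransactionHistory AS bth
WHERE ProductId BETWEEN 1001 AND 50001
OPTION (QUERYTRACEON 9130);
```

The same query now shows a thick line flowing out of the Index Seek into a Filter operator, making the hidden row volume visible.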
Sometimes, when your query has more than one predicate on the same column of a table, SQL Server will use a seek predicate for one of them and a residual predicate for the other. This may not be a good thing. Instead, we can use a CASE expression and modify the code.
Below is one of the examples using the same predicate on same column:
CREATE TABLE #Purchaseorders
--Primary key ([SalesOrderID])
WHEN [SalesOrderID] BETWEEN 43659 AND 43701 THEN '20110531'
WHEN [SalesOrderID] BETWEEN 43702 AND 43705 THEN '20110601'
WHEN [SalesOrderID] BETWEEN 54990 AND 55990 THEN '20081110'
DECLARE @OrderDate DATE='20110531'
FROM #Purchaseorders as r
SELECT TOP (1000)
FROM [Sales].[SalesOrderHeader] AS SOH
AND SOH.[OrderDate] BETWEEN r.OrderDate AND r.DueDate
AND SOH.[OrderDate] >= @OrderDate
) AS a
Instead of using both predicates on the same column, we can rewrite the query to use a CASE expression and check the execution plan.
DECLARE @OrderDate DATE='20110531'
FROM #Purchaseorders as r
SELECT TOP (1000)
FROM [Sales].[SalesOrderHeader] AS SOH
AND SOH.[OrderDate] BETWEEN r.OrderDate AND r.DueDate
AND SOH.[OrderDate] >= @OrderDate
) AS a
Check the execution plan now: the optimizer decided to use both of them as residual predicates in this scenario, but this time it is not us confusing the optimizer between a seek predicate and a residual predicate; it is the optimizer's own choice to use residual predicates here.
In this post of the performance tuning series, we went through seek and residual predicates, and learned how enabling trace flag 9130 on your non-production servers can be really helpful for debugging a plan, as it disables residual predicate pushdown and forces the residual predicates to be processed as a Filter in the query processor. You can then follow the thick lines in your query plans to quickly see where the issue is.
Thanks for reading, and don't forget to subscribe to this website! There is a WANT TO BE NOTIFIED? section on the right side of the main page where you can enter your email address to receive new blog post notifications every week!
In Part 2 of the series, we learned about Key lookups and Spools. We saw when they can be useful and when they need to be avoided, with examples. In this session, let's learn about Sorts, Hash matches, and Serial nested loops.
This is a continuation of the performance tuning series, Part 1 and Part 2. As I mentioned in the first part of this series, there is no end to it: as I learn performance tuning tips and tricks, I will keep adding to this series. If you would love to learn along with me and level up your query tuning skills, subscribe to dbanuggets here and give me that boost to work more 🙂
Sorts are important and necessary for many queries, as most queries need to sort their results, and sorted input is often exactly what we need. Let's see this with an example query on the BigTransactionHistory table that gets the 500 products that sold for the highest cost, greater than 1500. We take the rows from bigtransactionhistory where the actual cost is greater than or equal to 1500, apply a row number partitioned by productid (so each productid gets its own set of row numbers) ordered by actualcost descending, and take the top 500 rows from there. Run the query below, which does exactly that, and check the execution plan.
SELECT TOP (500) WITH TIES
Actualcost desc)as r
) AS x
The query runs in 8 seconds. When you observe the query plan in a data-centric view: first SQL Server did a nonclustered index scan, sorted the data so it could compute the row numbers, calculated the row numbers with the Window Aggregate and Filter operators, sorted the rows again, gathered streams through Parallelism, and took the top 500 rows. The first Sort, right after the nonclustered index scan, sorts many rows; hover your mouse over the rows coming out of the index scan to see how many rows are being sorted.
What if we rewrite the same query to get nested loops instead, as they are cheaper here? To get a nested loop in the execution plan, we can use CROSS APPLY, remove the PARTITION BY, and process one product at a time.
SELECT TOP (500) WITH TIES
from bigproduct as p
bt.Actualcost desc)as r
from bigtransactionhistory as bt
bt.Actualcost>=1500 ) as x
Check the execution plan. The query now runs much faster, in 3 seconds. It did a clustered index scan on the bigproduct table, and once per product it goes into the loop, grabs just that product's rows, sorts just those rows, and calculates the row numbers just for that product (Segment and Sequence Project). The Sort in this plan is still expensive, but many small sorts are far less expensive than one sort over a huge number of rows.
The Hash Match operator is appropriate for data warehouse workloads, but not for OLTP workloads. It is appropriate for a medium set of rows against a large set of rows, but not good for a large set of rows against another large set of rows. Let's take an example. In my company, we regularly receive data from different vendors, and we import it into multiple tables depending on the type of data. Before we import any data, we need to check whether the data already exists in the database. That's the example we will take here. Run the query below and check the execution plan.
[SalesOrderID] BETWEEN 43659 AND 1181640
FROM #sohe as b
SELECT * FROM [Sales].[SalesOrderHeaderEnlarged] as sohe
AND sohe.[SalesOrderID] BETWEEN 43659 AND 1081640
OPTION (MAXDOP 1)
When you observe the execution plan, we see a Clustered Index Seek on the SalesOrderHeaderEnlarged table; the seek is on the smaller set of SalesOrderIDs. It did not do a seek on the temp table #sohe because we do not have an index on the temp table. That is why SQL Server is forced to do the Hash Match. This is the type of situation where SQL Server is forced into a Hash Match because there is no suitable index to seek on.
As you can see, the execution plan includes an index recommendation that says performance will be improved by 13%. Let's go ahead and create the recommended index on the temp table, then run the same query again to see the difference in the execution plan.
CREATE CLUSTERED INDEX ix_SalesOrderID_OrderDate
    ON #sohe (SalesOrderID, OrderDate);
FROM #sohe as b
SELECT * FROM [Sales].[SalesOrderHeaderEnlarged] as sohe
AND sohe.[SalesOrderID] BETWEEN 43659 AND 1081640
OPTION (MAXDOP 1)
This time we have a Merge join, because we have two sets of data that are identical in size, which is great for a merge join. Here the merge join is better than the Hash Match: in this scenario, creating the index changed the Hash Match into a Merge join, which is exactly what we need to improve performance.
Serial Nested Loops:
Sometimes we have never-ending queries due to serial nested loops. Let's see this with an example. We will use the AdventureWorks2016 database; as my database is in 2019 compatibility mode and the server is SQL Server 2019, let me go ahead and turn AUTO_UPDATE_STATISTICS OFF at the database level. Then let's see how many rows have to be updated on the Sales.SalesOrderHeaderEnlarged table to hit the threshold that invalidates the statistics on the CustomerID column.
ALTER DATABASE [adventureworks2016Big]
SET AUTO_UPDATE_STATISTICS OFF
UPDATE [Sales].[SalesOrderHeaderEnlarged]
SET CustomerID = 11091
WHERE CustomerID < 16575;
SELECT
    [sch].[name] + '.' + [so].[name] AS [TableName] ,
[ss].[name] AS [Statistic] ,
[sp].[modification_counter] AS [RowModifications] ,
FROM [sys].[stats] [ss]
JOIN [sys].[objects] [so]
ON [ss].[object_id] = [so].[object_id]
JOIN [sys].[schemas] [sch]
ON [so].[schema_id] = [sch].[schema_id]
LEFT OUTER JOIN [sys].[indexes] AS [si]
ON [so].[object_id] = [si].[object_id]
AND [ss].[name] = [si].[name]
OUTER APPLY [sys].[dm_db_stats_properties]([so].[object_id], [ss].[stats_id]) sp
WHERE [so].[object_id] = OBJECT_ID(N'Sales.SalesOrderHeaderEnlarged')
ORDER BY [ss].[stats_id];
When you observe the results above, we are focused on the IX_SalesOrderHeaderEnlarged_CustomerID statistic, which was created on the CustomerID column. We need to update 49221 rows to invalidate this particular statistic, as shown in the NewThreshold column, and we updated 890k records, as you can see in the RowModifications column. The statistic is now outdated, but since you turned AUTO_UPDATE_STATISTICS OFF, SQL Server will not go ahead and update the stats the next time you run a query that uses the outdated statistic. With no update to the statistic, execution plans will keep using the same outdated statistics information for queries on the CustomerID column. Let's go ahead and run a query that uses the outdated CustomerID statistic and check the execution plan.
FROM [Sales].[SalesOrderHeaderEnlarged] oh
ON oh.SalesOrderID = od.salesOrderID
When you observe the execution plan above, the highlighted red rectangle shows the estimated number of rows coming out of the SalesOrderHeaderEnlarged table, which is just 2156 rows, but look at the actual rows above the estimate: 890k rows, way above the estimate. Why is that? Because AUTO_UPDATE_STATISTICS is OFF: the stats update did not kick in when we ran the query using the outdated statistic. Now let's enable AUTO_UPDATE_STATISTICS and see the magic by running the same query one more time, then check the execution plan.
ALTER DATABASE [adventureworks2016Big] SET AUTO_UPDATE_STATISTICS ON WITH NO_WAIT
FROM [Sales].[SalesOrderHeaderEnlarged] oh
ON oh.SalesOrderID = od.salesOrderID
When you observe the execution plan now, the plan itself has changed, and the estimates line up very closely with the actuals for the clustered index scan on the SalesOrderHeaderEnlarged table.
Point to remember: when AUTO_UPDATE_STATISTICS is turned ON at the database level and a statistic is outdated, the next time any query uses that outdated statistic, SQL Server will first update the stats behind the scenes, invalidate the execution plan, and create a new execution plan for that query using the newly updated statistics. Only after that does the query execute, and the user gets the results. Here, SQL Server decided a clustered index scan was better, since about 890k rows come out of the table.
It is always better to make sure we keep statistics up to date at the database level and also maintain them with maintenance plans. In older versions of SQL Server, before SQL Server 2016, the optimizer used a different threshold to invalidate stats: 20% + 500 rows had to change to invalidate a statistic and trigger AUTO_UPDATE_STATISTICS the next time a query used it. If your table is huge, it is hard for 20% of the rows to change and hit that threshold. For that reason, you have the option of enabling trace flag 2371, which switches to the dynamic threshold of roughly SQRT(1000 * number of rows) changed rows, far smaller than the old threshold for large tables.
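A quick sketch of how the two thresholds compare for a given table size (this is pure arithmetic, so the expected values in the comments are just the formulas evaluated):

```sql
-- Compare the old (20% + 500) and dynamic SQRT(1000 * rows) thresholds
DECLARE @rows BIGINT = 10000000;  -- example: a 10-million-row table

SELECT @rows AS TableRows,
       (@rows * 0.20) + 500                 AS OldThreshold,  -- 2,000,500 rows
       CAST(SQRT(1000.0 * @rows) AS BIGINT) AS NewThreshold;  -- 100,000 rows
```

For this 10-million-row table, the dynamic threshold fires after about 100,000 modifications instead of over 2 million, which is why it matters so much for large tables.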
In Part 3 of the performance tuning series, we learned about Sorts, Hash matches, and Serial nested loops. We saw when they can be appropriate and when they are not, and how to avoid them when they are inappropriate to improve query performance. As we add more posts to this series, we will get better at tuning our queries.