Azure Fundamentals for Beginners: Azure SQL Database

Azure SQL Database is a managed SQL Server database service in Azure: Microsoft manages the database engine, the backend infrastructure, and the backups. If you have been managing services on-premises and want to move to Azure while still managing some things yourself, you may be looking for Infrastructure as a Service, where you remain responsible for patching the operating system and updating the software. In that case, SQL Server on an Azure VM is a cost-effective option that gets you closest to the feature set of an on-premises installation. If you want the full feature set you have on-premises, go with SQL Server on a VM.

With Azure SQL Database, you can create a database in just a few minutes at a low cost. The database is hosted in a Microsoft-managed data center, and we can read, write, and modify the data without paying for or provisioning any servers to host it. Microsoft takes care of all of that.

Azure SQL Database has some limitations: you might not get all the functionality you have with a database on an on-premises server. Another Platform as a Service option, where everything is still managed for you, is SQL Managed Instance. It is also a managed service, but it gives you the server-level functionality that you lose with Azure SQL Database. If you want to run big data analytics, machine learning, or data science workloads and need powerful resources for them, the option is SQL pools (formerly known as Azure SQL Data Warehouse), a massively parallel processing service.

When we create an Azure SQL Data Warehouse, it takes just a few minutes to set up and start running. Compare that to building a data warehouse on an on-premises server, which takes a lot of time, at least a couple of weeks, and a lot of money to procure all the hardware you need. Azure has completely changed this: you can set up a data warehouse environment in a matter of minutes at a very low cost.

Azure SQL Database:

Azure SQL Database is a Platform as a Service offering, meaning Microsoft manages the hardware, infrastructure, backups, DR failover, data redundancy, and operating system patching, so our responsibility is the database itself: the tables and objects in it and reading, writing, and modifying its data. It is a relational database as a service. Microsoft guarantees 99.99% availability and financially backs that agreement: if the database is down for longer than the SLA allows, Microsoft credits money back. You only pay for what you need.

We get three built-in replicas of the database for high availability, and point-in-time restore for up to 35 days with the default settings. For Azure SQL Database, we can choose the performance level and the price we are going to pay. We can use SQL Server Management Studio, and the SQL environment feels very familiar. We can also replicate the database to other geographic locations. Just like we discussed in the previous post about storage accounts, if the entire data center hosting your Azure SQL Database goes down, we can immediately fail over to another geographic location and start using the database there.
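Active geo-replication can also be set up with T-SQL rather than the portal. The statement below is only a rough sketch: the partner server name "dbanuggetsserver-dr" is a hypothetical secondary logical server in another region, not something we create in this post, and the command is run in the master database of the primary server.

-- Sketch only: create a readable secondary of the database on a partner logical server
-- 'dbanuggetsserver-dr' is a hypothetical server that must already exist in another region
ALTER DATABASE [dbanuggets-Adventureworks]
    ADD SECONDARY ON SERVER [dbanuggetsserver-dr];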

A couple more built-in features of Azure SQL Database: databases are automatically backed up, so even a deleted database can be restored from those backups while they are retained. We get Transparent Data Encryption by default, plus Always Encrypted and row-level security. We also get Azure Defender for SQL, which monitors login and user activity against the database and can be set up to alert us if any malicious activity happens. Finally, we get automatic database tuning and Azure Active Directory integration.
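As a small illustrative check (not a required step), you can confirm from T-SQL that Transparent Data Encryption is turned on for your databases:

-- is_encrypted = 1 means Transparent Data Encryption is enabled for that database
SELECT name, is_encrypted
FROM sys.databases;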

Azure SQL Database Limitations:

We cannot take backups of the database ourselves; Microsoft takes care of the backups.

There is no Windows authentication; we use Azure Active Directory (or SQL authentication) instead.

There is no Change Data Capture, SQL Agent, Database Mail, PolyBase, Service Broker, MDS/DQS, or BI services (SSIS, SSRS, SSAS); VNET integration is limited; and for replication it can only be a transactional or snapshot subscriber.

Saying there is no SQL Agent doesn't mean we cannot schedule jobs. We can use Azure Data Factory instead of SSIS to move and clean the data, and instead of SQL Agent we can use triggers inside Azure Data Factory to kick off pipelines on a schedule or when an event occurs.

PaaS for databases is really beneficial because it frees up time for optimizing queries, tuning application performance, monitoring how users use the applications, and scaling the applications well, so companies can save money on Azure too.

We can connect to an Azure SQL database using SQL Server Management Studio just like an on-premises database. We can also use tools such as Visual Studio and PowerShell when working with resources in Azure.

Let's practice this by building a logical server. We are not purchasing any physical hardware or physical servers here; Microsoft creates these servers virtually, and the logical server acts much like the logical container (the resource group) we created in the previous post. It stores the metadata about the databases, elastic pools, and any data warehouses attached to it.

To create the logical server, go to "Create a resource" in the left-hand navigation pane and search for "SQL Server (logical server)" in the search bar; this brings up the service.

Click Create, which brings up the creation screen for that resource.

Now give the resource group name, which is “dbanuggets”, and in the server details section give the server name, which must be unique across all of Azure. If someone anywhere in the world has already used the server name you provide, you will not be able to create the server, because the server name is part of the address we use to connect to the databases on it. Give the server name as “dbanuggetsserver” and fill in the location details.

Under the administrator account, provide the SQL Server admin login and password that will allow us to authenticate to the server and see the databases within it. Enter the username and password.

Then press the Create button at the bottom of the screen.

It took about a minute to create, and I received a notification that the deployment was complete.

Go to Resource groups > select the resource group “dbanuggets” that we created in the previous post, and check that the server has been created and is listed under the resource group.

Now that our server is ready, let's create the database.

Go to "Create a resource" in the left-hand pane; the resource we want to create now is SQL Database, so search for that.

Click Create on SQL Database

Provide the resource group, location, and name of the database.

When you scroll down, you see two important options. The first is "Do you want to use an elastic pool?"

Azure SQL databases are individual databases with their own resources allocated to them. This is very different from on-premises environments, where all the databases are created on the same provisioned server and share the available resources. In Azure, each database has its own resources, but to improve performance and save cost it can make sense to create a shared pool of resources and attach multiple databases to it.

If you have databases that only see traffic during certain periods of the day, or databases that are rarely used but need full performance when they are, those are good candidates for a pool; not every database falls into this category. If you want databases to share resources the way they do in an on-premises environment, enabling an elastic pool can be helpful.

The second option is Compute + storage: how much compute (CPU, memory, and disk speed for reading and writing) and how much storage we want to configure for this database. We can set this by clicking the "Configure database" option. Clicking it opens a new tab that shows two different pricing structures: vCore, which is the newer model, and DTU, which is the legacy model.

In the tab that opens, you can see the price options increase from left to right: Basic, Standard, and Premium.

For our examples we will use the DTU model because it is a simpler and cheaper entry point. The vCore model is easier to compare to on-premises environments: if on-premises we have a certain number of cores, we can pick the same number of vCores, which makes sizing straightforward. DTU is still supported, but as a backward-compatible, legacy model.

We will choose Basic with 5 DTUs. What is a DTU? It is a blended measure of compute: CPU, memory, and how fast we can read and write data. Going from 5 DTUs to 10 DTUs means roughly doubling all of these resources. For 5 DTUs we get only 2 GB of storage, and it costs about $5 a month, which is very little. Click Apply.

Review and then click Create.

If in the future you would like to move your Azure SQL database to a higher service tier, that is completely possible; you can easily scale up and scale down. You can also change from the DTU model to the vCore model and back without any issues.
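As a sketch of what that scaling looks like outside the portal, the edition and service objective can also be changed with T-SQL; the tier names below are just example targets for the database we create in this post.

-- Sketch: scale the database up from Basic to Standard S3...
ALTER DATABASE [dbanuggets-Adventureworks]
    MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S3');

-- ...and back down again later when the extra resources are no longer needed
ALTER DATABASE [dbanuggets-Adventureworks]
    MODIFY (EDITION = 'Basic', SERVICE_OBJECTIVE = 'Basic');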

If you need to scale up later to thousands of DTUs and then, after a couple of months, no longer need that many resources, you can easily scale back down. That is pay-as-you-go. Click the Review + create button at the bottom of the screen.

Your deployment is now complete

Open the database link. Under the SQL database, the screen below shows up as a blade, and you can click "Copy to clipboard" to copy the server name, as shown below.

Copy the server name to the clipboard, open SQL Server Management Studio, and paste the server name you copied from the Azure portal. Use SQL authentication and enter the login ID and password you created when creating the server in Azure.

When you press Connect, a screen pops up and the connection is refused, because we have not yet granted access to our own IP address on the server.

This is the first layer of protection for your data. Because I created an admin account, I could simply press Sign in on that pop-up and it would add my current IP address. Alternatively, we can go to the Azure portal and, on the server page, click "Firewalls and virtual networks" in the left-hand panel.

Click on Add Client IP

Give the rule name

Click Save at the top of the same window

Now try connecting to the server again with the Azure server name, login ID, and password. You should be able to connect.
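If you prefer to script this step instead of using the portal, server-level firewall rules can also be managed with T-SQL from the master database of the logical server. The rule name and IP address below are placeholders; substitute your own client IP.

-- Run in the master database of the logical server
EXECUTE sp_set_firewall_rule
    @name = N'dbanuggets-client',
    @start_ip_address = '203.0.113.10',
    @end_ip_address = '203.0.113.10';

-- List the existing server-level firewall rules
SELECT * FROM sys.firewall_rules;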

This looks very similar to an on-premises database environment, and we are now connected to the Azure database from our local Management Studio.

I created a sample table in the database using the script below and added a few rows.

USE [dbanuggets-Adventureworks]  -- note: USE cannot switch databases in Azure SQL Database, so connect SSMS directly to this database
GO
CREATE TABLE [dbo].[bigProduct](
	[ProductID] [int] NOT NULL,
	[Name] [nvarchar](80) NULL,
	[ProductNumber] [nvarchar](56) NULL,
	[Color] [nvarchar](15) NULL,
	[SafetyStockLevel] [smallint] NOT NULL,
	[ReorderPoint] [smallint] NOT NULL,
	[StandardCost] [money] NOT NULL,
	[ListPrice] [money] NOT NULL,
	[Size] [nvarchar](5) NULL,
	[SizeUnitMeasureCode] [nchar](3) NULL,
	[WeightUnitMeasureCode] [nchar](3) NULL,
	[Weight] [decimal](8, 2) NULL,
	[DaysToManufacture] [int] NOT NULL,
	[ProductLine] [nchar](2) NULL,
	[Class] [nchar](2) NULL,
	[Style] [nchar](2) NULL,
	[ProductSubcategoryID] [int] NULL,
	[ProductModelID] [int] NULL,
	[SellStartDate] [datetime] NOT NULL,
	[SellEndDate] [datetime] NULL,
	[DiscontinuedDate] [datetime] NULL,
 CONSTRAINT [pk_bigProduct] PRIMARY KEY CLUSTERED 
(
	[ProductID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT [dbo].[bigProduct] ([ProductID], [Name], [ProductNumber], [Color], [SafetyStockLevel], [ReorderPoint], [StandardCost], [ListPrice], [Size], [SizeUnitMeasureCode], [WeightUnitMeasureCode], [Weight], [DaysToManufacture], [ProductLine], [Class], [Style], [ProductSubcategoryID], [ProductModelID], [SellStartDate], [SellEndDate], [DiscontinuedDate]) VALUES (1001, N'Adjustable Race1000', N'AR-5381-1000', NULL, 1000, 750, 0.0000, 0.0000, NULL, NULL, NULL, NULL, 0, NULL, NULL, NULL, NULL, NULL, CAST(N'2008-04-30T00:00:00.000' AS DateTime), NULL, NULL)
GO
INSERT [dbo].[bigProduct] ([ProductID], [Name], [ProductNumber],  [Color], [SafetyStockLevel], [ReorderPoint], [StandardCost], [ListPrice], [Size], [SizeUnitMeasureCode], [WeightUnitMeasureCode], [Weight], [DaysToManufacture], [ProductLine], [Class], [Style], [ProductSubcategoryID], [ProductModelID], [SellStartDate], [SellEndDate], [DiscontinuedDate]) VALUES (1002, N'Bearing Ball1000', N'BA-8327-1000', NULL, 1000, 750, 0.0000, 0.0000, NULL, NULL, NULL, NULL, 0, NULL, NULL, NULL, NULL, NULL, CAST(N'2008-04-30T00:00:00.000' AS DateTime), NULL, NULL)
GO
INSERT [dbo].[bigProduct] ([ProductID], [Name], [ProductNumber],  [Color], [SafetyStockLevel], [ReorderPoint], [StandardCost], [ListPrice], [Size], [SizeUnitMeasureCode], [WeightUnitMeasureCode], [Weight], [DaysToManufacture], [ProductLine], [Class], [Style], [ProductSubcategoryID], [ProductModelID], [SellStartDate], [SellEndDate], [DiscontinuedDate]) VALUES (1003, N'BB Ball Bearing1000', N'BE-2349-1000', NULL, 800, 600, 0.0000, 0.0000, NULL, NULL, NULL, NULL, 1, NULL, NULL, NULL, NULL, NULL, CAST(N'2008-04-30T00:00:00.000' AS DateTime), NULL, NULL)
GO
INSERT [dbo].[bigProduct] ([ProductID], [Name], [ProductNumber],  [Color], [SafetyStockLevel], [ReorderPoint], [StandardCost], [ListPrice], [Size], [SizeUnitMeasureCode], [WeightUnitMeasureCode], [Weight], [DaysToManufacture], [ProductLine], [Class], [Style], [ProductSubcategoryID], [ProductModelID], [SellStartDate], [SellEndDate], [DiscontinuedDate]) VALUES (1004, N'Headset Ball Bearings1000', N'BE-2908-1000',  NULL, 800, 600, 0.0000, 0.0000, NULL, NULL, NULL, NULL, 0, NULL, NULL, NULL, NULL, NULL, CAST(N'2008-04-30T00:00:00.000' AS DateTime), NULL, NULL)
GO
INSERT [dbo].[bigProduct] ([ProductID], [Name], [ProductNumber],  [Color], [SafetyStockLevel], [ReorderPoint], [StandardCost], [ListPrice], [Size], [SizeUnitMeasureCode], [WeightUnitMeasureCode], [Weight], [DaysToManufacture], [ProductLine], [Class], [Style], [ProductSubcategoryID], [ProductModelID], [SellStartDate], [SellEndDate], [DiscontinuedDate]) VALUES (1316, N'Blade1000', N'BL-2036-1000', NULL, 800, 600, 0.0000, 0.0000, NULL, NULL, NULL, NULL, 1, NULL, NULL, NULL, NULL, NULL, CAST(N'2008-04-30T00:00:00.000' AS DateTime), NULL, NULL)
GO

After creating the table and inserting the data, let's run a SELECT against dbo.bigProduct and look at the results.
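For reference, the check looks like this; the count query is just an extra sanity check on how many rows made it in.

select * from dbo.bigProduct;

select count(*) as RowsInserted from dbo.bigProduct;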

You can create tables, insert data, and read and write data in the database right from Management Studio, just the same as with an on-premises database. There is a little latency when inserting the data, since we are paying only a couple of dollars per month for this database on a free Azure account.

Summary:

In this session, we learned about Azure SQL Database. We started by creating a logical server, created a database on that server in Azure, added an admin account using SQL authentication, granted access to our client IP address, and connected to the database using SQL Server Management Studio. We then created a table and inserted data into the Azure SQL database from Management Studio on the local machine.

In the coming blog, we are going to learn about Azure Data Factory fundamentals. Stay tuned!

Thanks for reading!

T-SQL Tuesday #137: Jupyter Notebooks

Following Steve Jones's invitation for this month's T-SQL Tuesday (April), I am going to share some of my thoughts on using Jupyter notebooks.

Jupyter Notebook is an open-source web application for creating and sharing code and documents with others. It provides a beautiful way to document your code, text, images, and query results. Jupyter notebooks in Azure Data Studio can run PySpark, Python 3, Spark R and Scala, as well as PowerShell. You can run queries and visualize the results without leaving the environment. It is an excellent tool for documenting things: no more long emails with a bunch of screenshots attached. Once you start using Jupyter notebooks, they save you a lot of time and give you peace of mind. The tool is useful not only for documentation but also for presentations and demos, and beyond SQL code it works very nicely for building machine learning models, visualizing data, and much more. Your code is written in separate cells that can be executed individually, which lets you run just a specific part of a project.

There are magic commands you can use to make your work easier; %clear, %autosave, %debug, and %mkdir are a few of them. Magic commands come in two flavors, line-wise and cell-wise: line-wise magics apply to a single line, while cell-wise magics apply to an entire block (cell) of code.

There are also interactive dashboards in Jupyter notebooks.

Jupyter notebook files themselves are stored in JSON format. When you want to share them, you can save results using different file options such as CSV, Excel, JSON, or XML, and you can convert the notebook itself into different formats using nbconvert. For analytical queries, you can also build a chart from the data.

You can use Jupyter notebooks to give presentations about your projects to your customers and clients.

Jupyter notebooks are an amazing tool because you can document everything you do. Once you start using them, you will not want to go back to the tools you used before. I recommend them to anyone who would like to explore and take advantage of the many features Jupyter notebooks have to offer.

Azure Fundamentals for Beginners: Azure Data Services

Microsoft Azure is a cloud computing platform and infrastructure for building, testing, deploying, and managing your applications and services. For many years we have run servers and databases on-premises, where companies are responsible for buying, installing, and managing the hardware and software. Azure was introduced in 2008, and since then we have seen a huge transition to cloud computing, because it takes over managing the hardware and infrastructure, securing the data, handling backups, and many other services.

This is a new blog series about Azure fundamentals. If you would like to be notified as soon as I post a new article in this series, subscribe in the WANT TO BE NOTIFIED section on the right side of the screen.


There are twenty-two service categories available in Azure and hundreds of services to choose from. If you are new to Azure, you can sign up for a free Azure account: Microsoft offers a free $200 credit for 30 days so you can explore the services, test and deploy enterprise apps, try data analytics on your own data, and much more. Here is the Microsoft link where you can create a free account:

Azure Data Services

On-premises, we are responsible for procuring the hardware (disk, CPU, and everything else) and maintaining it, and we are responsible for the networking, patching, virtualization, operating system, middleware, applications, and data. What if virtualization, servers, storage, and networking were managed by Microsoft so we could focus mainly on application development and managing our data? That's where IaaS (Infrastructure as a Service) comes in: we only take care of the operating system, patching, installing software, the applications, and the data.


PaaS (Platform as a Service) is where Microsoft takes care of a lot more, so you can focus solely on application and data development. Microsoft manages everything covered by IaaS plus the operating system, middleware, and runtime. If your database ever needs to be restored, backups are available to restore from at any time.

SaaS (Software as a Service) covers offerings like Office 365 and Dynamics 365, where everything is done for you; you simply insert, write, and read your data from those platforms.

Data Services in Azure:

These services span Infrastructure as a Service and Platform as a Service solutions. In the diagram below, on the left side are all the sources we can connect to: on-premises or cloud, structured or unstructured data. We do have SSIS, but it does not have many cloud ETL capabilities and cannot connect to all the newer data sources. As we move to Azure, there is Azure Data Factory, which provides a graphical user interface where you can drag and drop different activities. For extracting and transforming data we also have Azure Databricks, which gives us a lot of flexibility to write R, Python, and Spark SQL, but using it requires coding; data scientists and data engineers typically use Databricks.


Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. 

Once we apply the transformations, we load the data into Azure Data Lake Storage or another store such as Azure Storage, Azure SQL DB, or a SQL pool, or into a non-relational database system like Cosmos DB. Once the data is stored, we can report on it using Power BI.

Resource Group:

All the resources we create in Azure go into a logical container called a resource group. Resource groups are a great way to take all of the resources created for a project, a department, or any other purpose and keep them together; think of it as a directory that holds the related resources. We can update, deploy, or delete them as a group. Every resource we create lives in a resource group, and the resource group stores the metadata about those resources, such as what type of resource it is, what was allocated to it, and its name.

Creating a Resource:

There are ways to automate this, but for this example, for simplicity, we are going to do it through the GUI.

Open the Azure portal, go to the navigation pane, and click "Create a resource".


This brings us to the Marketplace, where we can see the 22 categories and hundreds of different services available. Find "Resource group" by typing it in the search bar.


Select the Resource Group and click create


This brings us to the resource creation page, where we can give the resource group a name and choose the location where it will be stored. For the sake of latency, productivity, and performance, choose the right region. There are other options available, but to get started with Azure this is enough. The Tags tab holds information such as which project the resource group is associated with, who owns it, and who purchased it.

It is better to keep all of our resources in the same region to reduce cost: moving data from one region to another incurs an extra charge. When you create your databases, storage accounts, and data factories, it is important to keep the data in the same data center.


Then at the bottom of the screen, click review and create.


It should show "Validation passed"; press Create at the bottom of the screen. Once you do that, your resource group is created and you will get a notification in the top-right corner confirming it.


You can go to the resource group either from the notification or from the navigation pane by clicking "Resource groups".


When you click on the resource group "dbanuggets", a vertical pane opens across your screen; this is known as a blade in Azure. Clicking on any tab within it opens further vertical panes. We are currently looking at the resource group we created, "dbanuggets", labeled as a Resource group under its name. Beneath that we see the administrative properties we can control, such as access control (security), policies, resource costs, and deployment history. So, we have created our logical container, the resource group.


If you want to close this resource group pane, simply press the X at the top-right corner of the pane to go back to the previous blade.


Azure Storage:

Azure Storage comes in two main flavors here: the general-purpose v2 account and Azure Data Lake Storage Gen2. This is the cheapest storage option we have available. If you are creating a storage account in Azure, Azure Data Lake Gen2 is usually good enough, unless you need features that are only available in the general-purpose account and do not exist in Azure Data Lake Storage.

Storage Security:

Azure Data Lake uses Azure Active Directory; a blob storage account uses shared access keys. If someone gets hold of the storage account name and the access key, that can be a dangerous situation, so access keys should be regenerated on a schedule, every couple of weeks or so, to keep them secure. With Azure Data Lake Gen2 accounts we can take advantage of Azure Active Directory logins instead. There are some regulatory and compliance features we get with the general-purpose account that we do not get with Data Lake. One of them is WORM storage, which stands for Write Once Read Many: once the data has been written, it cannot be modified. Azure Blob Storage offers WORM storage and Azure Data Lake does not. Azure Blob Storage also has soft delete, so if you accidentally delete something we can restore it; Azure Data Lake does not offer that.

Azure Data Lake is designed to store massive amounts of data for big data analytics. It provides virtually unlimited scalability, Hadoop compatibility, the optimized Azure Blob File System (ABFS) designed for big data analytics, and zone-redundant and geo-redundant storage.

Let's create an Azure Data Lake:

Go to the navigation pane and search for "Storage account". Both a general-purpose storage account and an Azure Data Lake are created through the same storage account service. When you see the storage account service, click Create.


You will see a Create storage account pane; along the top are several tabs such as Data protection, Advanced, and Tags. You can create the storage account first and configure these later.


Select the resource group; we will use the one we created, named "dbanuggets", and give the storage account a name. There are some requirements to remember for the storage account name: it must be all lowercase and it must be globally unique across all of Azure, because once the storage account is created we connect to it by that name.


Give the storage account name in all lowercase letters. Here I am creating a storage account named "dbanuggetsstorage" and choosing a location that matches the resource group, East US. Then choose the kind of account you are going to create. Remember, this is not where we choose between general-purpose v2 and Azure Data Lake Gen2.


Then go to the Replication option and select the type of redundancy you need. The cheapest option, and the baseline, is "Locally-redundant storage (LRS)": whether you are working with a data lake or a regular storage account, Microsoft keeps three copies of the data at any given time. This is great, because with three copies available we do not have to worry about a single rack going down; the other two copies are still readily available. LRS is redundancy within a single data center, so if the data center itself goes down we need one of the other options Azure offers, such as zone-redundant storage (ZRS), geo-redundant storage (GRS), and others. There is a cost associated with whichever option you choose. For this example, we are going to choose the cheapest option, "Locally-redundant storage (LRS)".


In the other tabs, such as Networking, we can control access by IP address. The next tab is Data protection, where we can enable soft delete for blobs, so if someone deletes a file accidentally we can recover it. You can also enable versioning for blobs and turn on point-in-time restore for containers.


The next tab is the Advanced tab, where we choose what kind of storage account this will actually be. You can see this under Advanced > Data Lake Storage Gen2 > disabled or enabled. This single setting determines whether your storage account is a Data Lake, and it has to be set when the resource is created; we cannot change it later. Let's enable the Data Lake option.


Once you enable this option and go back to the Data protection tab, a lot of the options there are disabled: we lose point-in-time restore, soft delete, blob versioning, and WORM storage (where data, once written, cannot be modified). These are the limitations that come with Azure Data Lake, so if you need these options you have to go with a blob storage account. Remember that we are not restricted to a single storage account for all our projects: some projects can use blob storage and others Azure Data Lake Storage. If some data requires regulatory compliance features like WORM storage, or you need soft delete so files cannot be lost to an accidental deletion, then for that specific project choose a blob storage account. We can always mix and match services across our projects in Azure.


On the Tags tab, we can give a tag name and value so that when we look at the billing information, the resource is tagged and we can see our costs clearly. It is good practice to tag all of our resources. Click Review + create at the bottom of the page.


You will see "Validation passed" at the top of the screen. Review the resource you are about to create and click Create at the bottom of the screen.


Once I press the Create button, I can see the deployment in progress. The storage account we created does not cost anything yet: you get billed when you write data to the storage account and when you read data from it. If we create the storage account but put nothing in it, we are not billed, and if we use very little, just a few files, we are only minimally charged. We can monitor the costs in Azure later on.


You will see a notification in the top-right corner of the screen indicating that the deployment is complete.


To see the storage account we created, go to the navigation pane on the left side of the screen, choose Resource groups, and select the resource group we created (in my example, dbanuggets). There is our storage account, "dbanuggetsstorage".


When you click on that storage account, a new blade opens in the vertical pane and you can see all the details associated with that storage account.


When you scroll down, you will see "Containers". Before we can upload files to, or remove files from, the Azure storage account, we need to create a container.


Click on the containers and create a container by giving a name.


Give it a name; here I named the container "dbawork". Remember, we are essentially creating a directory here and the name must be all lowercase. I am keeping the public access level Private in this example. Then click Create.


You will see the notification that the container has been created successfully.


We can now upload and download files. Usually this is automated; we will do it through Azure Data Factory later. For now, let's manually upload a file and see how easy it is to get data into Azure. Click on the container we just created.


Click on the upload button at the top


Select the folder to upload the file


Upload the file and press upload


The file has been uploaded; it is now in Azure, stored in an Azure data center. Any file we upload to a storage account has to go into a container. Containers are just like the directories and folders that hold files in our on-premises environments, and we can set up security at the container level.


You can click on the file now sitting in the container and click Edit to edit its content directly.


A new blade pops up with the file open, and you can edit it.


For easily uploading and managing your files in storage, there is a great application you can use: Azure Storage Explorer.

Azure Storage Explorer is an amazing tool that connects to your Azure account and shows all the different resources you have in Azure. It works with both general-purpose accounts and Azure Data Lake Gen2. You can upload and download files and folders and easily modify permissions and security on these containers through the application. If you do not have it already, download it here.

Once you download Azure Storage Explorer, sign in with the Azure account you created. I signed in with mine, and it shows up in the account section of the application on the left side of the panel.


Under the Explorer panel, expand the Azure subscription to see all the resources you have, such as the storage account and its containers, along with the file shares.


You do not have to log in to the Azure portal; you can upload and download right from here. We usually automate upload processes through Azure Data Factory or through business applications.

Summary:

In this post, we learned about Azure data services: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), and we saw the differences between them. We also learned about Azure Storage, both Blob Storage and Data Lake Storage Gen2, with examples so you can get started with Azure.

In the next post, we will learn the fundamentals of Azure SQL DB.

Thanks for reading!

Data Page Internals: Structure of the Data Row

When I was preparing my presentation "Method Behind the Magic of Internals in SQL Server", I planned to write a blog post about data page internals, and I am finally able to write it. Here are some little nuggets about data page internals in SQL Server.


SQL Server stores data in 8 KB pages. Each page is linked to the previous and next page numbers. Broadly, there are two kinds of pages in SQL Server: pages that contain the actual data, and pages that contain data about other pages, that is, metadata. We will have a separate blog post on the types of pages in SQL Server. No matter how many blog posts we read on this topic, it never gets old 🙂

Let's quickly look at the structure of a page.


The page header is always 96 bytes in size. It contains metadata about the page itself, such as the page number, previous page, next page, and much more (a separate blog post on this is coming in the near future).

The payload area is where the actual data is stored, in slots. At the bottom of the page is the row offset array, which holds 2-byte pointers to the exact location where each corresponding row begins on the page.

Data inside the payload area is not necessarily stored in physical row order, but it is always logically ordered by the row offset array in index sort order. SQL Server reads the row offset array from right to left: slot 0 holds a pointer to the lowest key value on the page, slot 1 points to the next lowest key value, and so on.

There are two main kinds of data types in SQL Server: fixed-length and variable-length. Fixed-length data types always use the same space, regardless of whether the value is NULL or NOT NULL. With variable-length data types, SQL Server uses only the space required, plus two extra bytes.
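Here is a tiny illustration of that difference, using a throwaway demo table that is not part of the examples that follow.

-- char(10) always occupies its full declared width; varchar(10) stores only what is needed
create table dbo.LengthDemo
(
    FixedCol    char(10)    null,
    VariableCol varchar(10) null
);
insert into dbo.LengthDemo values ('abc', 'abc');

select datalength(FixedCol)    as FixedBytes,    -- 10: padded to the declared length
       datalength(VariableCol) as VariableBytes  -- 3: only the bytes actually used
from dbo.LengthDemo;

drop table dbo.LengthDemo;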

Each byte you see in the picture has a purpose. The first two single-byte blocks, Status Bits A and Status Bits B, contain bitmap information about the row, such as whether the row has been logically deleted (ghosted), the row type, whether a versioning tag is present, and whether the row contains NULL values or variable-length columns. The next 2 bytes store the length of the fixed-length portion of the row. The next n bytes store the fixed-length data itself. After that comes the null bitmap, which consists of the 2-byte column count of the row and the null bitmap array, which has one bit per column regardless of whether the column is nullable or not.

Next comes the variable-length data portion: the first 2 bytes give the number of variable-length columns in the row, and the next n bytes form the offset array for those columns. Even when a column's value is NULL, SQL Server still stores a 2-byte offset for each variable-length column. After that comes the actual variable-length data. Finally, there is a 14-byte versioning tag used for row versioning, which supports the optimistic isolation levels and online index rebuilds.

Let's create a brand new table, insert two rows, and see what they actually look like. Using the DBCC IND command, let's first view all the pages for the table.

use dbanuggets
go
create table dbo.dbanuggets
(
    ID int not null,
    nugget1 varchar(255) null,
    nugget2 varchar(255) null,
    nugget3 varchar(255) null
);

insert into dbo.dbanuggets(ID, nugget1, nugget3)  values (1,replicate('d',10),replicate('b',10));
insert into dbo.dbanuggets(ID, nugget2) values (2,replicate('a',10));

/* dbcc ind ('databasename','dbo.tablename',-1 /* Displaying info about all pages from the table */);*/

dbcc IND ('dbanuggets', 'dbo.dbanuggets', -1)

To view the page header and the row details, turn on trace flag 3604 and run DBCC PAGE:

dbcc traceon(3604);
go
/* Syntax:
dbcc page
(
    'databasename'  -- database name
    ,1              -- file ID
    ,11712          -- page ID
    ,3              -- output mode: 3 displays the page header and row details
);
*/
dbcc page ('dbanuggets', 1, 360, 3);

When you check the DBCC IND output, the first record has a PageType of 10, which is a metadata page: the IAM (Index Allocation Map) page that tracks which pages belong to a particular table. Let's not focus on it now; we will cover allocation map pages in one of the following blog posts. The next record, with PageType 1, is the actual data page that contains the data rows.
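As a side note, on SQL Server 2019 and later the same page header information is also exposed through a documented DMV, without needing trace flag 3604; the page number below assumes the data page found through DBCC IND above.

-- Returns page header fields (page type, slot count, free space, and so on) for the data page
select *
from sys.dm_db_page_info(db_id('dbanuggets'), 1, 360, 'DETAILED');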

Check the row structure here:

Looking at the row, the third element is the two-byte value 0800. Byte-swapped this is 0008, the offset that tells SQL Server where the fixed-length part of the row ends. The next 4 bytes store the fixed-length data, the ID column. After that, a 2-byte value shows how many columns the row has; here it is 4. Next is the one-byte null bitmap; since we have 4 columns, one byte is enough. The value shown is 04, which is 00000100 in binary, indicating that the third column in the row contains a NULL value. The next 2 bytes give the number of variable-length columns, 0300 in swapped order, which means 3 columns. Then come the 2-byte offsets marking where each variable-length column's data ends, followed by the actual data of the variable-length columns.

The second row looks like this:

Row structure is a bit different here:

Observe that in this row the nugget1 and nugget3 columns hold NULL values. The null bitmap for the second row shows the binary value 00001010, which tells us that the nugget1 and nugget3 columns are NULL. The variable-length column count indicates that only two variable-length columns are stored: SQL Server does not store trailing NULL variable-length columns in the data row. We can therefore manage the size of the data row by declaring variable-length columns that will hold many NULL values at the end of the table definition.

There are 8,060 bytes available for a row on a single data page. The null bitmap is always present in heap rows and in clustered index leaf-level rows, even when the table has no nullable columns. Nonclustered index rows, however, do not have a null bitmap if there are no nullable columns in the index.

Summary:

In this post, we learned what a data page looks like and the different parts of a data page. We also walked through the structure of a data row with examples, and saw how NULL values take little or no space in the row data and are instead tracked by SQL Server in the null bitmap. I will be posting more about page internals in future blog posts.

Decoding the Physical Join Operators: How the Right Index Helps!

In Part 1 of Decoding the Physical Join Operators, we learned about the different types of physical join operators: nested loops, merge joins, and hash joins. We saw when each is useful, how to take advantage of them for query performance, and when they need to be avoided.

In this part, we will learn more about these operators and how the right indexes help them perform better, so the queries execute faster.

To use a merge join, the join must include an equality predicate, meaning a column of one table must be compared as equal to a column of the other table. If the join criteria are range comparisons, greater than, less than, or not equal to, the optimizer will not use a merge join.

In Part 1, we joined every single value from the first table against the second table to find all the matching values. What if we only want to work with a few values from table 1, say six of them, instead of all of them? With a hash join, SQL Server still has to go through every single value from the first table before reading any values from the second table. With nested loops, as soon as we read one row from table 1, SQL Server can look for the matching rows in the second table. A question arises here: for each of the six values we are looking for in table 1, don't we still have to read all the values from the second table? Yes, and that can be pretty slow until we create an index. With an index on the join column of the second table, the lookups become very fast: after reading a row from the outer table, SQL Server can use the index to find the location of the matching row in the inner table. Instead of scanning all the values in the second table six times, SQL Server just does six index lookups to find the matching values. So it reads six values from table 1 and does six index lookups against table 2.

A hash join, by contrast, starts by reading all the rows from the first table to build the hash table. Then it starts reading the rows of the inner table, and it keeps going until it finds the six matching rows, which may mean reading all the rows from the second table as well if there is no index. In this scenario, the hash join is much slower than the nested loop join.

A merge join needs sorted inputs, so an index on the first table can avoid an explicit sort by SQL Server's own sort operator. If we have an index on the join columns, then instead of sorting all the rows of the outer table we can simply read the matching rows from the index in order, and if we have an index on the join column of the second table, SQL Server can use that index to feed the merge join as well.

If we are returning a small fraction of the rows from each table, nested loops are efficient. For a hash match, if we are only looking for a couple of values in the first table, the index on the first table helps locate them for the build phase, and the index on the join column of the second table helps during the probe phase. With nested loops, for every value from the first table, the matching values in the second table can be found quickly using the index, so nested loops are very good when we are working with a couple of rows from each table. But if we are comparing large amounts of data from the first table, we have to do that many index lookups against the second table; in that situation it is usually better to switch to a hash join and fully scan the second table.

The tipping point between nested loops and hash joins being the fastest can be really small. If your statistics are out of date, the row estimates can lead the optimizer to choose a non-optimal join type, causing performance issues.

Examples:

Nested Loop:

use Adventureworks2016
go
select soh.*, sod.*
from 
sales.SalesOrderHeader soh
inner join sales.salesorderdetail sod on soh.SalesOrderID=sod.SalesOrderID
where soh.salesorderid=71832

In the example above, SQL Server used a nested loops operator. It did a clustered index seek on the outer table, where the estimate based on the predicate is 1 row, and a clustered index seek on the inner table, where the estimated number of rows is 15. We have an index on the inner table; we need one there to avoid an expensive scan. The clustered index seek on the inner side executes only once because only one qualifying row is returned from the outer table. We also see the compute scalar operator twice because we are returning computed columns from both tables.

Merge Join:

Let's create two new tables with columns ID and Rollnumber and insert values into them. Since there is no unique constraint or unique index defined on the Rollnumber column, the optimizer does not know the column is unique, and because of that it does not choose a merge join on its own. We can force the merge join with a query hint and check the execution plan.

select * from [dbo].[Table1]
select * from [dbo].[Table2]
select a.*, b.*
from
table1 a
inner join
table2 b
on a.Rollnumber=b.Rollnumber
option (merge join)
go

SQL Server has to sort the data coming from table 1 and table 2 first, and it used a merge join because we forced the optimizer with the hint in the query. If you look at the memory grant, it used 1 MB.

When you hover your mouse over the merge join operator, you will see the Many to Many property set to True. That means SQL Server assumed there may be duplicate values in the outer table's join column, because there is no supporting index or constraint defined.

Because it expected duplicate values, SQL Server also created a worktable. You can see this by running SET STATISTICS IO ON and checking the Messages tab.
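For reference, that check is simply the following; rerun the forced merge join after turning the setting on.

set statistics io on;
go
select a.*, b.*
from table1 a
inner join table2 b on a.Rollnumber = b.Rollnumber
option (merge join);
-- the Messages tab now reports logical reads, including reads against 'Worktable'
-- created for the many-to-many merge join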

Now let's create a unique nonclustered index on the second table, table2, which is the inner table SQL Server chose in the execution plan. With the index in place, SQL Server now chooses table2 as the outer input, because it knows the values in table2 are unique. And since the rows are read directly from the index we created, the explicit sort operator is no longer needed on that side.

create unique nonclustered index IDX_rollnumber on table2(Rollnumber)

If you check the Many to Many property now, it is False.

Because we created the index and SQL Server now knows the values in table2 are unique, it skipped the explicit sort, no longer needs the worktable, and the Many to Many property is False this time. Now let's create the unique nonclustered index on the first table, table1, as well. This time there are no explicit sort operators at all, because the Rollnumber column is kept in sorted order by the newly created index on table1.

create unique nonclustered index IDX_rollnumber on table1(Rollnumber)

The prerequisite for a merge join is that the data is sorted, either by a supporting index or by SQL Server adding an explicit sort operator to the execution plan.

Hash Match:

Let's generate a hash match operator by running the very simple query below in the AdventureWorks2016 database.

select 
p.firstname,
p.lastname,
pp.phonenumber
from person.person p
inner join person.personphone pp
on
p.BusinessEntityID=pp.BusinessEntityID
go

If you look at the estimated number of rows, it matches the actuals. If the estimates are off compared to the actuals, the hash join can spill rows to disk to complete the operation, which is really expensive.

In the execution plan, the PersonPhone table is used as the build (outer) input to create the hash table, and that hash table is then probed with the Person table to find the matching records. You can see which columns the probe uses by hovering your mouse over the hash match operator and checking the Hash Keys Probe property.

Summary:

There are three types of physical join operators: nested loops, merge join, and hash join. Nested loops are suitable when joining small tables or small fractions of big tables. Hash joins are better when joining all the rows of a table or large tables. A merge join can be very efficient if the rows are already sorted; the sort itself is slow, but the join is fast when indexes exist on the joining columns.

How does knowing all this help you? You can tune your queries by making sure table statistics are up to date and the right indexes exist. One of the most important inputs the optimizer uses to choose the join type and join order is the statistics-based estimate of how many rows will come out of each table. While tuning queries, check for indexes on the join columns: if they are missing, creating them helps the optimizer choose a nested loop join, whereas without an index it may be forced into a hash join. Keeping statistics up to date and verifying that the necessary indexes exist helps the optimizer choose a good enough plan for your queries.
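As a small example of that check, run against the AdventureWorks table used earlier in this post, you can look at when statistics were last updated and refresh them if they are stale.

-- When were the statistics on the join table last updated?
select s.name as StatisticName,
       stats_date(s.object_id, s.stats_id) as LastUpdated
from sys.stats as s
where s.object_id = object_id('Sales.SalesOrderHeader');

-- Refresh them if they are out of date
update statistics Sales.SalesOrderHeader;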

Decoding the Physical Join Operators: Part 1

I have wanted to write about this topic for a long time. I always had trouble clearly understanding how nested loop joins and merge joins work. Once I learned how they actually work, I planned to write a blog post about it so you can understand these joins as well. By the end of this post, you will have a nice smile on your face, because it all feels easy once you understand the concept behind these joins. This is a two-part series.

Types of Physical Operators

Visualizing Nested Loops Joins And Understanding Their Implications

Nested Loop Joins:

In the diagram you see a representation of Table 1 (the outer table) and Table 2 (the inner table). For every value in the outer table, SQL Server queries the inner table looking for matching values, so we read every value in the second table once for every value in the first table. In the example below, the first value (shown in orange) is compared with each value in table 2. Once all of table 2 has been compared, SQL Server loops to the next value in table 1 and compares it against each value of table 2 again, and this repeats until every value from table 1 has been compared with every value of table 2 to produce the matched values as the output of the operator. That is a lot of work, and it takes a lot of CPU.

If the inner table is presorted by having an index, this saves time because, when we iterate over the inner table, SQL Server can seek directly to the rows we need to compare against the outer table instead of comparing against every row in the second table. This reduces the number of comparisons, which can genuinely improve performance.

Now, how do we know whether the nested loops in our plans are appropriate or not? Look at the actual execution plan and check how many rows come out of the Nested Loops iterator. If only a few rows come out of that iterator, a nested loop join is fine. Another cause might be the estimates: they might be off for that data, so instead of using a better join, the optimizer picks a nested loop join. We might also be missing an index that could presort the data going into the join, which can likewise be the reason a nested loop join is chosen.
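
As a small hedged illustration (assuming the AdventureWorks2016 sample database; the HumanResources.Employee table is from that sample and not from this series), you can nudge the optimizer into a nested loop plan with a join hint and inspect the operator. The hint is only for studying the plan; it should not be left in production code.

-- Force a nested loop join so the operator is visible in the actual execution plan.
-- Remove the hint to see the plan the optimizer picks on its own.
SELECT p.FirstName, p.LastName, e.JobTitle
FROM Person.Person AS p
INNER JOIN HumanResources.Employee AS e
    ON p.BusinessEntityID = e.BusinessEntityID
OPTION (LOOP JOIN);
GO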

Merge Join:

What if we sorted the data, so that both tables are now sorted? Look at the diagram below for reference. With both tables in sorted order, SQL Server joins them by walking down both inputs together. It takes the first value in Table 1 (value 1) and compares it with the first value in Table 2, which is also 1, so we have a match. It then moves to the next value in Table 2, which is 2, and compares it with value 1 from Table 1; that does not match, and because Table 2 is sorted and 2 is greater than 1, SQL Server knows there can be no more matching rows in Table 2 for this value, so it stops and moves on. It takes the next value in Table 1, which is 2, and resumes the comparison from the last value it joined on in Table 2: value 2 does not match value 1, so it moves to value 2 in Table 2, finds a match, moves on to value 3, sees that 3 is greater than 2, and stops again because the data is sorted. Next it takes value 3 from Table 1, resumes from the last joined value (2) in Table 2, finds the matching value 3, moves to value 4, sees a mismatch, knows there are no more matching rows and stops. Then it takes value 4 from Table 1 and repeats the same process, continuing until every value in Table 1 has been compared. SQL Server walks and compares both tables bit by bit, and that is the basis of a Merge Join.

In the Merge join, the tables are sorted on the join columns and SQL Server walks down both inputs bit by bit, comparing the rows. Because the rows are sorted, SQL Server can stop processing the current row as soon as it finds a value in the second table that is greater than the value of the current row in the first table.
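
As a quick hedged illustration (again assuming the AdventureWorks sample), Sales.SalesOrderHeader and Sales.SalesOrderDetail both have SalesOrderID as the leading key of their clustered indexes, so their rows already arrive sorted on the join column; the hint below simply makes the merge join explicit so you can inspect it in the plan.

-- Both inputs are already ordered by SalesOrderID (their clustered keys),
-- which is exactly the presorted situation a merge join wants.
SELECT soh.SalesOrderID, soh.OrderDate, sod.ProductID, sod.OrderQty
FROM Sales.SalesOrderHeader AS soh
INNER JOIN Sales.SalesOrderDetail AS sod
    ON soh.SalesOrderID = sod.SalesOrderID
OPTION (MERGE JOIN);
GO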

Merge join is the fastest physical join operator in SQL Server. It is so efficient because it can go through each input a single time and doesn't have to loop back and repeat work like the nested loop join does. With a many-to-many merge join, however, SQL Server cannot process the inputs in a single pass. It instead creates a work table in tempdb and loops over it to process the many-to-many matches. The comparison works just like a regular merge join, but when SQL Server finds duplicate values in the second table, it writes them into the work table in tempdb and does the comparisons there; if it also finds duplicate values in the first table, it compares those first-table values against the rows already stored in the work table. Once the iterator moves past the duplicated values, the work table is cleared out and the merge join continues until all the values have been compared. If you would like to know in depth how these work tables operate, please refer to this wonderful article from sqlserverfast.com

Merge join is not a bad thing; it may already be efficient in your execution plan. When you see a merge join and performance is slow on that plan, focus on the upstream operators feeding the merge join. Is the data presorted because you already have an index, or is SQL Server sorting it on its own? In the latter case, check whether simply adding the missing column as the last key column of an existing index, or using a different join algorithm, would be better. The other scenario is having lots of duplicate values in your data: in that case SQL Server will use work tables to handle how the duplicate values are joined. So, if you see many duplicate values or heavy tempdb use, looking for better options is a good idea.

Combining the tables with a merge join is faster than a nested loop join, but we need to sort the data first, and you know that sorting is slow. So, how can we do better? That's where the Hash join comes in.

Hash Match:

For every value in Table 1, we apply a hash function and build a hash table from the results, so every value is now in the hash table. Once that has been done, we go to the second table, apply the same hash function to each value in Table 2, and probe the hash table we built for matching values; when we find one, we have a join. We repeat this for all the values in the second table. This is the Hash join.

A Hash join happens in two phases. The first is the build phase, where SQL Server builds an in-memory hash table from one of the two inputs. Hashes are calculated from the join keys of the input data, and whole rows are stored in the hash table under hash buckets. Mostly we have one row per hash bucket, except when we have multiple rows with duplicate join keys or there is a hash collision (totally different join keys getting the same hash, which is a rare scenario). Once the hash table is built, SQL Server starts the probe phase, where it calculates the join key hash for each row in the second input and checks whether it exists in the hash table created in the build phase. If it finds a match, it then checks whether the join keys themselves actually match, because sometimes we can have hash collisions.
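
For completeness, here is a hedged sketch (AdventureWorks2016 assumed, reusing the tables from the earlier demo) that forces the hash algorithm on the same join, so you can see the build and probe inputs on the Hash Match operator. As with the other hints, this is only for studying the plan.

-- Force a hash join so the build and probe phases are visible in the plan.
SELECT p.FirstName, p.LastName, pp.PhoneNumber
FROM Person.Person AS p
INNER JOIN Person.PersonPhone AS pp
    ON p.BusinessEntityID = pp.BusinessEntityID
OPTION (HASH JOIN);
GO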

Hash join is not the fastest type of join because it needs to run the hash function, which uses more CPU to calculate the hashes of the join keys, but it is really useful for joining very large tables, and it uses tempdb to hold the hash table if there is not enough memory to build it. If SQL Server cannot fit all the buckets of the hash table in memory because the input is very large, the hash can spill to disk and use tempdb. In this scenario, SQL Server keeps as many buckets as possible in memory and stores the ones it cannot keep in memory in tempdb. In the probe phase, SQL Server joins any rows whose buckets are in the in-memory hash table; for any hashes it cannot match in memory, it writes the rows to tempdb. Once all the rows from the probe input have been processed, it loads the related data back from tempdb into memory and continues the comparisons there. SQL Server can stage some of the data to disk during the join and load it back into memory as memory frees up.

What can we do if we see the Hash Match in our execution plans? First, check whether this operator is actually useful in the plan, because it is a blocking operator: until the build and probe phases are complete, all the downstream operators in the execution plan have to wait. We know that the hash join can join any data sets, but when it spills to disk it can cause performance issues. That's when we need to look at whether the memory grant that was allocated is insufficient because of incorrect estimates: if the memory grant allocated to the query is not enough to build the hash table in memory, it spills to disk. In this scenario, check whether the statistics are up to date and whether adding one more column to an index might help the optimizer choose a different, more optimal physical operator to process the query effectively.

Summary:

In this part 1, we learned how the physical join operators work, when they can be efficient and when they are not, and where to look when they are not efficient so we can make our queries faster. In the second part of the series, we are going to look at how indexes help these physical operators go faster.

Performance Tuning Series: Part 4

In Part 3 of the series, we learned about Sorts, Hash matches and Serial nested loops. We saw when they can be useful and when they need to be avoided, with examples. In this post, let's learn about Seek and Residual predicates with examples.

Let's take an example of a very simple query running on the bigTransactionHistory table and check the execution plan.

Credits: Adam Machanic

SELECT * FROM 
bigTransactionHistory as bth
WHERE
ProductId BETWEEN 1001 AND 50001
AND Actualcost>5000000

When you check the execution plan, the estimates are off here: the estimated number of rows was 80 but the actual number of rows was 0. Did you also notice we have parallelism here? Parallelism is used for high cost query plans, so why for this simple query?

Run the same query again, this time with statistics enabled:

SET STATISTICS IO ON
GO
SELECT * FROM 
bigTransactionHistory as bth
WHERE
ProductId BETWEEN 1001 AND 50001
AND Actualcost>5000000

SQL Server had to do 131444 logical reads, which is a lot.

There are two types of predicates in SQL Server: index (seek) predicates and residual predicates. Residual predicates are predicates that are not satisfied by the index but are still pushed down to the storage engine. When you look at the predicate information in the execution plan here, the plan shows the index predicate (seek predicate) on ProductID, and the residual predicate on ActualCost was handled by the storage engine instead of the query processor: it filtered on the actual cost in the storage engine and did not pass those rows up to the query processor.

So what are we missing here? How do we track down where the issue is? Look at the line going from the Index Seek to the parallelism operator: it is very thin. These lines can be really useful for tracking down issues when your execution plans are big, so we do not have to inspect each and every iterator in the plan. The residual predicate makes the query plan harder to debug, even though it helps the query run faster. If you want to test query performance in lower environments, enabling trace flag 9130 can be really helpful for debugging the plan: it disables residual predicate pushdown and forces the query optimizer to process the residual predicates as a Filter operator in the query processor. When you enable this trace flag and run the same query, look at the difference in the line. Enabling this trace flag is useful when you see a lot of I/O and cannot figure out where it is coming from.

Warning: do not enable trace flag 9130 in production because it will cause query slowness. Enable it on non-production servers to quickly debug queries. This is a gem for query tuning developers and DBAs.
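
As a hedged sketch on a non-production server (reusing the same demo query as above), the trace flag can be scoped to a single statement with QUERYTRACEON instead of being turned on server-wide:

-- Non-production only: surface residual predicates as an explicit Filter operator.
SELECT *
FROM bigTransactionHistory AS bth
WHERE ProductID BETWEEN 1001 AND 50001
  AND ActualCost > 5000000
OPTION (QUERYTRACEON 9130);
GO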

Sometimes, when a query filters more than once on the same column of a table, SQL Server will use a seek predicate for one of the predicates and a residual predicate for the other. That may not be a good thing. Instead, we can rewrite the code using a CASE expression.

Below is an example that uses two predicates on the same column:

CREATE TABLE #Purchaseorders
(
[SalesOrderID] INT,
[OrderDate] DATETIME,
[DueDate] DATETIME,
--Primary key ([SalesOrderID])
)

INSERT #Purchaseorders
SELECT 
[SalesOrderID],
CASE
	WHEN [SalesOrderID] BETWEEN 43659 AND 43701 THEN '20110531'
	WHEN [SalesOrderID] BETWEEN 43702 AND 43705 THEN '20110601'
	WHEN [SalesOrderID] BETWEEN 54990 AND 55990 THEN '20081110'
	END,
	'20110612'
FROM [Sales].[SalesOrderDetail]
GO

dbcc dropcleanbuffers

DECLARE @OrderDate DATE='20110531'
SELECT  r.*
FROM #Purchaseorders as r
CROSS APPLY
(
SELECT TOP (1000)
SOH.[OrderDate],
SOH.[SalesOrderID],
SOH.[TotalDue]
FROM [Sales].[SalesOrderHeader] AS SOH
WHERE SOH.SalesOrderID=r.SalesOrderID
AND SOH.[OrderDate] BETWEEN r.OrderDate AND r.DueDate
AND SOH.[OrderDate] >= @OrderDate
ORDER BY 
SOH.[OrderDate]
) AS a
GO

Instead of using both predicates on the same column, we can modify the query to use a CASE expression and check the execution plan again.

DECLARE @OrderDate DATE='20110531'
SELECT  r.*
FROM #Purchaseorders as r
CROSS APPLY
(
SELECT TOP (1000)
SOH.[OrderDate],
SOH.[SalesOrderID],
SOH.[TotalDue]
FROM [Sales].[SalesOrderHeader] AS SOH
WHERE SOH.SalesOrderID=r.SalesOrderID
-- the two OrderDate predicates are combined into a single range using a CASE expression
AND SOH.[OrderDate] BETWEEN CASE WHEN r.OrderDate >= @OrderDate
                                 THEN r.OrderDate
                                 ELSE @OrderDate
                            END AND r.DueDate
ORDER BY 
SOH.[OrderDate]
) AS a
GO

Check the execution plan now: the optimizer decided to handle both conditions as residual predicates in this scenario, but it is no longer us confusing the optimizer into splitting them between a seek predicate and a residual predicate; using the residual predicate here is purely the optimizer's choice.

Summary:

In this post of the Performance tuning series, we went through Seek and Residual predicates and learned how enabling trace flag 9130 on your non-production servers can be really helpful for debugging a plan, as it disables residual predicate pushdown and forces the query optimizer to process residual predicates as a Filter in the query processor. You can then follow the thick lines in your query plans to quickly see where the issue is.

Thanks for reading, and don't forget to subscribe to this website. There is a WANT TO BE NOTIFIED? section on the right side of the main page where you can enter your email address to receive new blog post notifications every week!

Performance Tuning Series: Part 3

In Part 2 of the series, we learned about Key Lookups and Spools. We saw when they can be useful and when they need to be avoided, with examples. In this post, let's learn about Sorts, Hash matches and Serial nested loops.

This is a continuation of Performance tuning series Part 1 and Part 2. As I mentioned in the first part, there is no end to this series: as I learn more performance tuning tips and tricks, I will keep adding to it. If you would love to learn along with me and level up your query tuning skills, subscribe to dbanuggets here and give me that boost to work more 🙂

Sorts:

Sorts are important and necessary for many queries, as most queries need to sort their results, and sorted input is often exactly what we need. Let's look at an example query on the bigTransactionHistory table that returns the 500 products that sold for the highest cost, limited to sales of 1500 or more. We take the rows from bigTransactionHistory where the actual cost is greater than or equal to 1500, assign a row number partitioned by ProductID (so each ProductID gets its own set of row numbers) ordered by ActualCost descending, and then take the top 500 rows overall. Run the query below, which does exactly that, and check the execution plan.

SELECT TOP (500) WITH TIES
ProductID, 
ActualCost
from
(
SELECT 
productID,
ActualCost,
Row_number() over
(
Partition by
productID
order by 
Actualcost desc)as r
from bigtransactionhistory
where
Actualcost>=1500
) AS x
WHERE x.r=1
ORDER BY 
x.Actualcost DESC
GO

The query runs in 8 seconds. When you look at the query plan in the data-centric view, SQL Server first did a nonclustered index scan, sorted the data so it could assign row numbers, calculated the row numbers using the window aggregate and filter operators, sorted the rows again, gathered streams through parallelism and took the top 500 rows. The first sort after the nonclustered index scan is sorting a huge number of rows; hover your mouse over the arrow coming out of the index scan to see how many rows are being sorted.

What if we rewrite the same query to get nested loops instead, since they are cheaper here? To get a nested loop in the execution plan, we can use CROSS APPLY, remove the PARTITION BY, and have the query work on one product at a time.

SELECT TOP (500) WITH TIES
ProductID, 
ActualCost
from bigproduct as p
CROSS APPLY
(
SELECT 
bt.ActualCost,
Row_number() over
(
order by 
bt.Actualcost desc)as r
from bigtransactionhistory as bt
where
bt.ProductID=p.ProductID and 
bt.Actualcost>=1500 ) as x
where x.r=1
ORDER BY 
x.Actualcost DESC
GO

Check the execution plan. The query now runs much faster, in 3 seconds. It did a clustered index scan on the bigProduct table, and once per product it goes into the loop, grabs the rows just for that product, sorts just those rows and calculates the row numbers just for that product (Segment and Sequence Project). The sort in this plan is also expensive, but many small sorts are much less expensive than one sort over a huge number of rows.

Hash Match:

The Hash Match operator is appropriate for data warehouse workloads but not for OLTP workloads. It works well for a medium-sized set of rows against a large set of rows, but not for a large set against another large set. Let's take an example. In my company, we regularly receive data from different vendors and import it into multiple tables depending on the type of data. Before we import any data, we need to check whether the data already exists in the database. That's the example we will use here. Run the query below and check the execution plan.

SELECT TOP(5000000)
[SalesOrderID],
[OrderDate],
[Status],
[TotalDue]
INTO #sohe
FROM [Sales].[SalesOrderHeaderEnlarged]
WHERE
[SalesOrderID] BETWEEN 43659 AND 1181640


SELECT COUNT(*)
FROM #sohe as b
WHERE 
EXISTS(
SELECT * FROM [Sales].[SalesOrderHeaderEnlarged] as sohe
WHERE
sohe.[OrderDate]=b.[OrderDate]
AND sohe.[SalesOrderID]=b.[SalesOrderID]
AND sohe.[SalesOrderID] BETWEEN 43659 AND 1081640
)
OPTION (MAXDOP 1)
GO

When you observe the execution plan, we see a clustered index seek on the SalesOrderHeaderEnlarged table; the index seek is on the smaller set of SalesOrderIDs. It could not seek on the temp table #sohe because we do not have an index on the temp table, so SQL Server is forced to do a Hash Match. This is the kind of situation where SQL Server is forced into a hash match because there is no suitable index to seek on.

As you can see, the execution plan includes an index recommendation that says performance will improve by 13%. Let's go ahead and create the recommended index on the temp table, then run the same query again and see the difference in the execution plan.

CREATE CLUSTERED INDEX ix_SalesOrderID_OrderDate
ON #sohe
(
[SalesOrderID],[OrderDate]
)
SELECT COUNT(*)
FROM #sohe as b
WHERE 
EXISTS(
SELECT * FROM [Sales].[SalesOrderHeaderEnlarged] as sohe
WHERE
sohe.[OrderDate]=b.[OrderDate]
AND sohe.[SalesOrderID]=b.[SalesOrderID]
AND sohe.[SalesOrderID] BETWEEN 43659 AND 1081640
)
OPTION (MAXDOP 1)
GO

We get a Merge join this time, because we now have two sets of data that are close in size and ordered on the join keys, which is great for a merge join. Here the merge join is better than the hash match: creating the index changed the hash match into a merge join, which is exactly what we needed to improve performance.

Serial Nested Loops:

Sometimes we end up with never-ending queries due to serial nested loops. Let's see this with an example. We will use the AdventureWorks2016 database; since my database is in 2019 compatibility mode on a SQL Server 2019 instance, let me first turn OFF auto update statistics at the database level. Then let's see how many rows have to be updated in the Sales.SalesOrderHeaderEnlarged table to hit the threshold that invalidates the statistics on the CustomerID column.

ALTER DATABASE  [adventureworks2016Big]
SET AUTO_UPDATE_STATISTICS OFF

UPDATE [Sales].[SalesOrderHeaderEnlarged] 
SET CustomerID = 11091
WHERE CustomerID < 16575;
GO

SELECT
    [sch].[name] + '.' + [so].[name] AS [TableName],
    [ss].[name] AS [Statistic],
    [sp].[modification_counter] AS [RowModifications],
    SQRT([sp].[rows] * 1000) AS [NewThreshold]
FROM [sys].[stats] [ss]
JOIN [sys].[objects] [so]
    ON [ss].[object_id] = [so].[object_id]
JOIN [sys].[schemas] [sch]
    ON [so].[schema_id] = [sch].[schema_id]
LEFT OUTER JOIN [sys].[indexes] AS [si]
    ON [so].[object_id] = [si].[object_id]
    AND [ss].[name] = [si].[name]
OUTER APPLY [sys].[dm_db_stats_properties]([so].[object_id], [ss].[stats_id]) sp
WHERE [so].[object_id] = OBJECT_ID(N'Sales.SalesOrderHeaderEnlarged')
ORDER BY [ss].[stats_id];
GO

In the results above, we are focused on the IX_SalesOrderHeaderEnlarged_CustomerID statistic created on the CustomerID column. We would need to update 49221 rows to invalidate this particular statistic, as shown in the NewThreshold column, and we updated about 890k records, as you can see in RowModifications. The statistic is now outdated, but because we turned AUTO_UPDATE_STATISTICS OFF, SQL Server will not update the stats the next time a query uses that outdated statistic. Since there are no updates to the statistic, execution plans will keep using the same outdated statistics information for queries on the CustomerID column. Let's go ahead and run a query that uses the outdated CustomerID statistic and check the execution plan.

SELECT
	oh.[CustomerID],
	oh.[OrderDate],
	oh.[ShipMethodID],
	od.[OrderQty],
	od.[ProductID]
FROM [Sales].[SalesOrderHeaderEnlarged] oh
JOIN 
[Sales].[SalesOrderDetailEnlarged] od
ON oh.SalesOrderID = od.salesOrderID
where [CustomerID]=11091

When you observe the execution plan above, the highlighted red rectangle shows the estimated number of rows coming out of the SalesOrderHeaderEnlarged table, which is just 2156 rows, but look at the actual rows above the estimate: it is about 890k rows, far above the estimate. Why is that? Because AUTO_UPDATE_STATISTICS is OFF, the statistics update did not kick in when we ran the query using the outdated statistic. Now let's turn AUTO_UPDATE_STATISTICS back ON, run the same query one more time, and watch the magic. Check the execution plan this time.

USE [master]
GO
ALTER DATABASE [adventureworks2016Big] SET AUTO_UPDATE_STATISTICS ON WITH NO_WAIT
GO

USE [adventureworks2016Big];
GO
SELECT
	oh.[CustomerID],
	oh.[OrderDate],
	oh.[ShipMethodID],
	od.[OrderQty],
	od.[ProductID]
FROM [Sales].[SalesOrderHeaderEnlarged] oh
JOIN 
[Sales].[SalesOrderDetailEnlarged] od
ON oh.SalesOrderID = od.salesOrderID
where [CustomerID]=11091

When you observe the execution plan now, the plan itself has changed, and the estimates line up very closely with the actuals for the clustered index scan on the SalesOrderHeaderEnlarged table.

Point to remember: when AUTO_UPDATE_STATISTICS is turned ON at the database level and a statistic is outdated, the next time any query uses that outdated statistic, SQL Server will first update the stats behind the scenes, invalidate the execution plan, create a new execution plan for the query and use the newly updated statistics information in it. Only after that does the query execute and return results to the user. In this case, SQL Server decided a clustered index scan was better because around 890k rows come out of the table.
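
As a quick hedged check (the database name matches the demo above), you can confirm whether the setting is on for a database before relying on this behavior:

-- Verify the auto update statistics setting at the database level.
SELECT name, is_auto_update_stats_on
FROM sys.databases
WHERE name = N'adventureworks2016Big';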

It is always better to keep the statistics up to date at the database level and to maintain them using maintenance plans as well. Older versions of SQL Server, before SQL Server 2016, use a different threshold to invalidate stats: 20% of the rows plus 500 rows need to change to invalidate the statistic and kick off auto update statistics the next time a query uses it. If your table is huge, it is hard for 20% of the rows to change and hit that threshold. For that reason, you have the option to enable trace flag 2371 so the optimizer uses a dynamic threshold instead, which only needs roughly SQRT(1000 * number of rows) changes, far smaller than the old threshold for large tables. From SQL Server 2016 (compatibility level 130) onwards, this dynamic threshold is the default behavior.
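
A minimal sketch of the arithmetic (using a row count close to the demo table above, which produced the 49221 threshold; the exact figure is hypothetical) shows why the dynamic threshold matters on bigger tables:

-- Compare the old and the dynamic auto-update thresholds for a table of ~2.4 million rows.
DECLARE @rows BIGINT = 2422000;                      -- approximate row count (illustrative)
SELECT
    0.20 * @rows + 500     AS OldThreshold,          -- pre-2016 rule: 20% + 500 rows (~484,900)
    SQRT(1000.0 * @rows)   AS DynamicThreshold;      -- TF 2371 / 2016+ rule: ~SQRT(1000 * rows) (~49,214)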

Summary:

In Part 3 of the Performance tuning series, we learned about Sorts, Hash matches and Serial nested loops. We saw when they are appropriate and when they are not, how to avoid them when they are inappropriate, and how that improves query performance. As we add more posts to this series, we will get better at tuning our queries.

Thanks for reading!

Performance Tuning Series: Part 2

In Part 1 of the Performance tuning series, we learned about the resource-consuming iterators and when we see them in our execution plans. In this part of the series, let's talk about some of the things that drain performance.

First, let's look at Lookups. We will start by running DBCC DROPCLEANBUFFERS, which removes all the clean (unmodified) pages from the buffer pool so the next run reads from disk. We are using the AdventureWorks2016 database for the demo. Run the query below and check the actual execution plan.

Select 
a.*
into #a
from [Sales].[SalesOrderHeader] as SOH
CROSS APPLY
(
SELECT TOP(1000)
* FROM [Sales].[SalesOrderDetail] as SOD
WHERE SOH.[SalesOrderID]=SOD.[SalesOrderID]
ORDER BY [OrderDate]
) as a
where
SOH.[SalesOrderID] between 43659 and 44659

When you observe the execution plan, SQL Server did a clustered index seek on the SalesOrderHeader table, then a nested loop and a clustered index seek on the SalesOrderDetail table, followed by two Compute Scalar operators. What is this Compute Scalar operator, and why is it here?

Compute Scalar:

The SalesOrderDetail table has a computed column, LineTotal. If you look at the table definition, the LineTotal column is defined as below:

[LineTotal]  AS (isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0))),

After the clustered index seek, SQL Server passes the rows to the first Compute Scalar operator to calculate:

([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty]

Then it has to check whether that value is NULL, and if it is, substitute 0.0. That's the second Compute Scalar operator.

Then, as the rows pass from the Compute Scalar operator to Top, it grabs the top 1000 rows, gathers the data and inserts it into the temp table. The query took 0 seconds to run.

Later, you may add additional columns to these tables; for example, say you had to add a DetailID column to the SalesOrderDetail table.

Alter table [Sales].[SalesOrderDetail]
Add DetailID INT Null
GO

Run the same query again. Before running it, drop the temp table we created and run DBCC DROPCLEANBUFFERS to remove the clean pages from memory.

drop table #a

dbcc dropcleanbuffers
Select 
a.*
into #a
from [Sales].[SalesOrderHeader] as SOH
CROSS APPLY
(
SELECT TOP(1000)
* FROM [Sales].[SalesOrderDetail] as SOD
WHERE SOH.[SalesOrderID]=SOD.[SalesOrderID]
ORDER BY [OrderDate]
) as a
where
SOH.[SalesOrderID] between 43659 and 44659

Now the same query has a different query plan: there is a nested loop and a key lookup. Every row that comes out of the SalesOrderDetail table has to do a key lookup, run the first Compute Scalar, run the second Compute Scalar, and then the Top grabs the 1000 rows and passes them on to the next operator to be inserted into the temp table. Those rows are then gathered into a single stream because this is a parallel query plan. This table only has a small number of rows, so we did not see much difference in execution time, but if your tables hold a lot of data, this will take much longer to run.

There are a few main things to take away from this. We need to make sure our development environments have the same, or nearly the same, amount of data as production, because with a small amount of data like we have in this example, we did not see much difference in execution time; a developer assuming the code will behave the same way in production can cause terrible performance issues once it is deployed. It is important to have production-like data volumes in non-production environments, and to do proper load testing before pushing code to production. The third thing is to avoid an outermost SELECT * in the query: columns get added to tables over time for other queries, which is yet another reason load testing our queries matters. A sketch of that last point follows below.
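
As a hedged sketch, the demo query above could list only the columns it actually needs, so a column added later (like the DetailID we just created) does not silently widen the query. The #a_narrow temp table name is made up to avoid clashing with the #a table used above.

-- List columns explicitly instead of SELECT * so later schema changes don't widen the query.
SELECT a.SalesOrderID, a.SalesOrderDetailID, a.OrderQty, a.ProductID, a.LineTotal
INTO #a_narrow
FROM [Sales].[SalesOrderHeader] AS SOH
CROSS APPLY
(
    SELECT TOP (1000)
        SOD.SalesOrderID, SOD.SalesOrderDetailID, SOD.OrderQty, SOD.ProductID, SOD.LineTotal
    FROM [Sales].[SalesOrderDetail] AS SOD
    WHERE SOH.[SalesOrderID] = SOD.[SalesOrderID]
    ORDER BY [OrderDate]
) AS a
WHERE SOH.[SalesOrderID] BETWEEN 43659 AND 44659;
GO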

Spools:

Spools are caches in tempdb. Let's take an example that gives us a spool in the execution plan. First, we create a #Products temp table from the bigProduct table using CROSS APPLY for ProductIDs between 1001 and 20001: every product gets one row, products divisible by 5 get an extra row, products divisible by 7 get an extra row, and products divisible by both 5 and 7 end up with three rows. That is the first query. For the second query, we go to the #Products temp table and grab the top 40 rows per product by doing a CROSS APPLY with a derived table over the bigTransactionHistory table. Run the queries and check the actual execution plan. Credits: Adam Machanic

SELECT ProductID
INTO #Products
FROM bigproduct
CROSS APPLY
(
SELECT 1 
UNION ALL
SELECT 2
where productID % 5 =0
UNION ALL
SELECT 3
WHERE ProductID % 7 =0
)x(m)
WHERE 
productID BETWEEN 1001 and 20001
GO

--spool
SELECT 
p.productID,
AVG(x.actualcost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
SELECT 
t.*,
ROW_NUMBER() OVER
(
PARTITION BY
p.ProductID
ORDER BY 
t.ActualCost DESC
) AS r
FROM bigTransactionHistory as t
WHERE
p.ProductID =t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
p.ProductID
GO

SQL Server first does a table scan on the #Products temp table, sorts the rows, and then there is a lazy spool (number 4); that is called a performance spool and it is the most common spool. It works by first sorting the ProductIDs so all the same ProductIDs are grouped together. Each row goes into the nested loop (number 3), which asks the lazy spool for the information about that ProductID. Say the row is for ProductID 2: the nested loop requests ProductID 2 from the spool, the lazy spool runs its subtree (number 5), grabs the results, stores them in the spool in tempdb and serves them up to the next operators. When the next row comes in from the nested loop and it is ProductID 2 again, the lazy spool returns the results directly from tempdb without running the subtree again and passes them to the next operator in the query tree. This helps because the subtree costs around 62% of the plan cost when you check the subquery. If we have non-distinct data, this lazy spool helps: it is much cheaper to put the data in tempdb and serve it from there. This query executed in 8 seconds. Now, is that good? We will find out shortly by disabling the spool.

When you check the execution plan again and hover your mouse over the lazy spool, you see rebinds and rewinds.

A rebind occurs when the subtree behind the spool has to rerun, and a rewind occurs when the data is served straight out of the spool. Here, data was requested from the spool 12845 times (number of executions). Out of those, we only had to run the subtree 9577 times (actual rebinds), and 3268 times the data was served from the spool (actual rewinds).

We can also verify that information one step behind the spool, at the Sequence Project operator in the subtree (number 5 in the execution plan). Hover your mouse over the Sequence Project operator and you will see its number of executions lines up with the number of actual rebinds on the lazy spool operator.

So, this query took 8 seconds to run. Is that good or bad? Let's turn off the spool and see. We can turn off the spool in two different ways. The query runs per ProductID and averages the top 40 rows. If we have one row for ProductID 2 in the temp table, we average the top 40 rows and get the result. If we have two rows for ProductID 2, we end up averaging 80 rows, but they are the same 40 rows repeated twice, so we get the same average. To solve this, we can simply take the DISTINCT ProductIDs from the #Products temp table, which lets the optimizer treat the ProductIDs as distinct. Spools like this tend to show up when tables are missing primary keys or unique constraints, or when statistics are out of date. In this case, we can just get the distinct ProductIDs from the temp table. Change the query as shown below:

SELECT 
p.productID,
AVG(x.actualcost) AS AvgCostTop40
FROM (SELECT DISTINCT ProductID from #Products) AS p
CROSS APPLY
(
SELECT 
t.*,
ROW_NUMBER() OVER
(
PARTITION BY
p.ProductID
ORDER BY 
t.ActualCost DESC
) AS r
FROM bigTransactionHistory as t
WHERE
p.ProductID =t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
p.ProductID
GO

When you look at the actual execution plan now, the spool has disappeared and the query executed in 2 seconds. That's cool. Usually, storing the data in tempdb is not a good option.

We can also turn off the spool by using trace flag 8690 in the query, which makes the query optimizer disable the performance spool optimization.

Caution: don't turn on trace flags in production just because you read about them on some website.

SELECT 
p.productID,
AVG(x.actualcost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
SELECT 
t.*,
ROW_NUMBER() OVER
(
PARTITION BY
p.ProductID
ORDER BY 
t.ActualCost DESC
) AS r
FROM bigTransactionHistory as t
WHERE
p.ProductID =t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
p.ProductID
OPTION (QUERYTRACEON 8690)
GO

There is no more spool in the execution plan and the query runs in 2 seconds.

Let's insert non-unique rows into the table and see what the execution plan looks like. Insert the data first, then run the query with the trace flag still on.

INSERT INTO #products
SELECT * FROM #products
GO
SELECT 
p.productID,
AVG(x.actualcost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
SELECT 
t.*,
ROW_NUMBER() OVER
(
PARTITION BY
p.ProductID
ORDER BY 
t.ActualCost DESC
) AS r
FROM bigTransactionHistory as t
WHERE
p.ProductID =t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
p.ProductID
OPTION (QUERYTRACEON 8690)
GO

This time the same query executed in 5 seconds, where before it executed in 2 seconds. That's because we inserted more data: when you hover your mouse over the Sequence Project operator, we now see 25690 executions, which is a lot. Now run the same query without the trace flag and check the execution plan: the query uses the lazy spool and its subtree again, and runs in 8 seconds. Even with double the amount of non-unique data, the spool is not faster here.

These kinds of spools are generally not good in query plans. The exception is when we are repeatedly reading the same non-unique values, for example looking up product categories or names for millions of rows; in that case a spool can be useful, but apart from that, we want to avoid spools in the execution plan.

Summary:

In this post of the Performance tuning series, we focused on Lookups and Spools. We learned about these two operators, when they can be useful and when we have to avoid them in query plans. Lookups can be fine for a small set of data where we only need to look up a couple of rows, but they get bad when you need a lot of data. Spools should be avoided in almost all plans, except when you are repeatedly searching the same non-unique values to reuse the same results (in this example, the ProductIDs).

Thanks for reading!

T-SQL Tuesday #136: Least Favorite Data Type

With Brent Ozar's invitation for this month's T-SQL Tuesday, I would like to share some thoughts on my least favorite data type in SQL Server. Thank you, Brent, for hosting the March edition of T-SQL Tuesday.

VARCHAR(MAX)

Though VARCHAR(MAX) is suitable in situations with large strings of data, it has its own complications that we need to consider. In my career as a DBA, I have seen SQL developers use VARCHAR(MAX) when they should not at least a couple of times. Fixing data types once they are in production is painful and risky.

When data is stored in a VARCHAR(n) column, the values are stored on a standard data page. VARCHAR(MAX) uses IN_ROW_DATA pages for up to 8000 bytes of data, but if the data is more than 8000 bytes, it is stored on LOB_DATA pages and a pointer (reference) is stored on the IN_ROW_DATA page. We cannot use a VARCHAR(MAX) column as a key column in an index. You can add a VARCHAR(MAX) column as an included column, but that is not seekable and duplicates the column completely, which costs a lot of storage. We also cannot compress LOB data, and data retrieval from LOB pages is relatively slow. If the LOB data is stored in-row, it can be compressed, but only when we compress the table with PAGE-level compression; LOB data cannot be compressed with ROW-level compression, regardless of whether it is stored in row or out of row.
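
A minimal sketch of the index-key limitation (the dbo.Notes table and the index names here are made up for illustration):

-- Hypothetical table: Title is a bounded VARCHAR, Body is VARCHAR(MAX).
CREATE TABLE dbo.Notes
(
    NoteID INT IDENTITY(1,1) PRIMARY KEY,
    Title  VARCHAR(200) NOT NULL,
    Body   VARCHAR(MAX) NULL
);

CREATE INDEX IX_Notes_Title ON dbo.Notes (Title);   -- works: a bounded column can be an index key
-- CREATE INDEX IX_Notes_Body ON dbo.Notes (Body);  -- fails: VARCHAR(MAX) is invalid as a key column
CREATE INDEX IX_Notes_Title_Body ON dbo.Notes (Title) INCLUDE (Body);  -- allowed, but not seekable on Body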

Any variable you declare with VARCHAR(MAX) as the data type can spill to tempdb, which can cause performance problems. If you use a VARCHAR(MAX) column in your queries and sort on it, the sort requires a lot of memory, which may cause memory issues. Selecting data of this type that is larger than 8000 bytes can also cause a number of off-page I/Os.

Recommendations:

If you run queries that search on a column, that column should be a fixed or limited-length column rather than VARCHAR(MAX). If your row size can be kept under 8000 bytes, the data stays on IN_ROW_DATA pages, which avoids page splits and avoids extra locking or latching during updates. Limiting the length of strings is really important when designing any database. Using VARCHAR(n) is recommended over VARCHAR(MAX) whenever possible.

I am really looking forward to what other SQL family members have to say about their least and most favorite data types!