Decoding the Physical Join Operators: How the Right Index Helps!

In Part 1 of Decoding the Physical Join Operators, we learned about the three types of physical join operators: Nested Loops, Merge joins and Hash joins. We saw how each of them works, how to take advantage of each one for query performance, and when they need to be avoided.

In this part, we will look at how indexes really help these operators perform better so our queries can execute faster.

In order to use a Merge join, the join must include an equality predicate, meaning a column of one table must be compared as equal to a column of the other table. If the join criteria are only range comparisons or inequalities (greater than, less than, not equal to), the optimizer will not use a Merge join.

In Part 1, we joined every single value from the first table against all the matching values from the second table. What if we only want to work with a few values from table 1 (for example, six values) instead of all of them? For the hash join, SQL Server still has to go through every single value from the first table before reading any values from the second table. With nested loops, as soon as we read one row from table 1, SQL Server can look for the matching rows in the second table. You may have a question here: for each of the six values we are looking for in table 1, don't we still have to read all the values from the second table? Yes, and that can be pretty slow, until we create an index. Once an index is created on the join column of the second table, all the lookups become very fast. After reading the first row from the outer table, SQL Server can use the index to find the location of the matching row in the inner table. That helps, right? Instead of scanning the second table six times, SQL Server just does six index lookups to find the matching values. So, it reads six values from table 1 and does six index lookups on table 2.
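To make that concrete, here is a minimal sketch of the idea; dbo.Orders, dbo.OrderLines and the OrderID column are hypothetical names used only for illustration, not tables from this post:

-- Hypothetical tables: dbo.Orders (outer) and dbo.OrderLines (inner), joined on OrderID.
-- Without this index, every outer row forces a scan of dbo.OrderLines.
CREATE NONCLUSTERED INDEX IX_OrderLines_OrderID
    ON dbo.OrderLines (OrderID);

-- With the index in place, each of the handful of outer rows costs one index seek on the inner table.
SELECT o.OrderID, ol.ProductID
FROM dbo.Orders AS o
INNER JOIN dbo.OrderLines AS ol
    ON ol.OrderID = o.OrderID
WHERE o.OrderID IN (1, 2, 3, 4, 5, 6);  -- the six values we care about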

For the hash join, SQL Server starts by reading all the rows from the first table to build the hash table. Then it starts reading the rows from the inner table, and it keeps doing that until it finds the six matching rows, which means it may have to read all the rows from the second table as well if there is no index. In this case, the hash join is much slower than the nested loops join.

A Merge join has to work on sorted data, so an index on the first table can avoid a sort that SQL Server would otherwise have to do itself. If we have an index on the join column, then instead of sorting all the rows in the outer table, SQL Server can simply read the matching rows from the index in order. Likewise, if we have an index on the join column of the second table, SQL Server can use that index to get the rows already sorted for the merge join.

If we are returning a small fraction of the rows from each table, nested loops will be efficient. For the hash match, if we are only looking for a couple of values in the first table, an index on the first table lets us find them easily during the build phase, and an index on the second table's join column helps during the probe phase as well. With nested loops, for every value from the first table, the matching values in the second table can be found quickly using the index. Nested loops are very good when we are working with a couple of rows from each table. If we need to compare a large amount of data from the first table, though, we have to do just as many index lookups on the second table. In that situation it is better to switch to a hash join and do a full scan of the second table.

The tipping point at which the optimizer chooses nested loops or a hash join as the fastest option can be really small. If your statistics are out of date, the row estimates can lead the optimizer to choose a non-optimal join type, causing performance issues.

Examples:

Nested Loop:

use Adventureworks2016
go
select soh.*, sod.*
from 
sales.SalesOrderHeader soh
inner join sales.salesorderdetail sod on soh.SalesOrderID=sod.SalesOrderID
where soh.salesorderid=71832

In the above example, SQL Server used a Nested Loops operator. It did a clustered index seek on the outer table, where the estimate based on the predicate is 1 row, and a clustered index seek on the inner table, where the estimated number of rows is 15. Here we have an index on the inner table; we need that index on the inner table to avoid an expensive scan operation. The clustered index seek on the inner table is executed only once because we are returning just one qualifying row from the outer table. We also see the Compute Scalar operator twice because we are returning two computed columns, one from each table.

Merge Join:

Let's create two new tables with columns ID and Rollnumber and insert some values into them. Since we do not have any unique constraint or unique index defined on the Rollnumber column, the optimizer doesn't know anything about the uniqueness of the column. Because it doesn't know about the uniqueness, it doesn't choose the merge join on its own. We can force the merge join with a query hint and check the execution plan.
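The setup script is not shown here, so as a minimal sketch, something like the following matches the description (the exact values are only illustrative):

-- Two simple tables with ID and Rollnumber, no unique constraint or index on Rollnumber.
CREATE TABLE dbo.Table1 (ID INT, Rollnumber INT);
CREATE TABLE dbo.Table2 (ID INT, Rollnumber INT);

INSERT INTO dbo.Table1 (ID, Rollnumber) VALUES (1, 101), (2, 102), (3, 103), (4, 104);
INSERT INTO dbo.Table2 (ID, Rollnumber) VALUES (1, 101), (2, 102), (3, 103), (4, 104);
GO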

select * from [dbo].[Table1]
select * from [dbo].[Table2]
select a.*, b.*
from
table1 a
inner join
table2 b
on a.Rollnumber=b.Rollnumber
option (merge join)
go

SQL Server has to sort the data coming from Table1 and Table2 first, and it used a Merge join because we forced the optimizer with the hint in the query. If you look at the memory grant, it used 1 MB.

When you hover your mouse over the Merge Join operator, you will see Many to Many shown as True. That means SQL Server assumed there may be duplicate values in the outer table's join column, because there is no supporting index or constraint to prove otherwise.

Because it expected duplicate values, SQL Server also created a worktable. You can see this by turning SET STATISTICS IO ON and looking at the Messages tab.
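If you want to try it yourself, turning the setting on before re-running the forced merge join is enough:

SET STATISTICS IO ON;
GO
-- Re-run the forced merge join from above; the Messages tab now lists a 'Worktable' entry
-- along with the reads against Table1 and Table2.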

Now, let's go ahead and create a unique nonclustered index on the second table, Table2. This was the inner table SQL Server chose in the execution plan. Now SQL Server chooses Table2 as the outer input, because it knows the values in Table2 are unique. And since the rows are read directly from the index we created, the explicit Sort operator on that side is no longer needed.

create unique nonclustered index IDX_rollnumber on table2(Rollnumber)

When you observe the Many to Many property now, it is False.

Because we created the index and SQL Server now knows the values in Table2 are unique, it did not need the explicit sort on that input, it no longer needs to create the worktable, and the Many to Many property is False this time. Now, let's go ahead and create a unique nonclustered index on the first table, Table1. This time there are no explicit Sort operators at all, because the Rollnumber column is already in sorted order thanks to the newly created index on Table1.

create unique nonclustered index IDX_rollnumber on table1(Rollnumber)

The prerequisite for the Merge join is that the data is sorted, either by a supporting index or by SQL Server adding its own Sort operator to the execution plan.

Hash Match:

Let's run a very simple query against the AdventureWorks2016 database to generate a Hash Match operator.

select 
p.firstname,
p.lastname,
pp.phonenumber
from person.person p
inner join person.personphone pp
on
p.BusinessEntityID=pp.BusinessEntityID
go

If you look at the estimated number of rows, they match the actuals. If the estimates are off compared to the actuals, the hash join can spill rows to disk to do its work, which is really expensive.

When you observe the execution plan, the PersonPhone table was used as the build input to create the hash table, and the Person table was used as the probe input to find the matching records. You can also see which columns the probe was done on by hovering your mouse over the Hash Match operator and looking at the Hash Keys Probe property.

Summary:

There are three types of physical join operators: Nested Loops, Merge join and Hash join. Nested loops are suitable when joining small tables or small fractions of big tables. Hash joins are better when we are joining most or all of the rows of large tables. A Merge join is efficient when the rows are already sorted; sorting itself is slow, but with indexes on the join columns the sort can be avoided.

How does knowing all this help you? You can tune your queries more easily by checking that the table statistics are up to date and that the right indexes exist. One of the most important inputs the optimizer uses to choose the join type and join order is the statistics-based estimate of how many rows it thinks will come out of each table. While tuning queries, check for indexes on the join columns. If they are missing, creating them helps the optimizer choose a nested loops join; without an index, the optimizer is pushed toward a hash join. Making sure the statistics are up to date and the necessary indexes exist helps the optimizer choose a good enough plan for your queries.

Decoding the Physical Join Operators: Part 1

I have wanted to write about this topic for a long time. I always had trouble clearly understanding how Nested Loops joins, Merge joins and Hash joins actually work. Once I learned how they really work, I planned to write a blog post about it so you can understand how these joins work as well. By the end of this post, you will have a nice smile on your face as you realize how easy this is once you understand the concept behind these joins. This is a 2 part series.

Types of Physical Operators

Nested Loop Joins:

For example, in the diagram you can see the representation of Table 1 (the outer table) and Table 2 (the inner table). For every value in the outer table, SQL Server will query the inner table looking for matching values, so we read the entire inner table once for every value in the outer table. In the example below, the first value (shown in orange) is compared with each value in Table 2. Once all of Table 2 has been compared, SQL Server loops to the next value in Table 1 and compares it against every value in Table 2 again, and this repeats until each value from Table 1 has been compared with each and every value of Table 2 to produce the matching rows as the output of this operator. That is a lot of work, and it takes a lot of CPU.

If the inner table is presorted by having an index, this saves time: as we iterate over the inner table, SQL Server can seek directly to the rows we need to compare against the outer table instead of comparing against every row in the second table. This reduces the number of comparisons, which improves performance.

Now, how do we know whether nested loops are appropriate in our plans or not? You can view the actual execution plan and check how many rows come out of the Nested Loops operator. If only a few rows come out of that operator, nested loops are fine. Another cause might be the estimates: if the estimates are off, the optimizer may use a nested loops join when a different join would be better for that amount of data. We might also be missing an index that could presort the data going into the join; that too can be a reason the optimizer chooses the nested loops join.

Merge Join:

What if we sorted the data, so that it is now sorted in both tables? Look at the diagram below for reference. With both tables sorted, SQL Server takes the first value from table 1 and compares it with the first value in table 2, which is also 1, so we have a match. It then moves to the next value in table 2, which is 2, and compares it with value 1 from table 1; it doesn't match. Because table 2 is sorted and 2 is greater than 1, SQL Server knows there can be no more matching rows in table 2 for value 1, so it stops there and moves on to the next value in table 1, which is 2. It resumes the comparison from the last value it had a join on, which is value 1 in table 2. Value 2 from table 1 doesn't match value 1 from table 2, so it moves to value 2 in table 2, finds the match, moves on to value 3 in table 2, sees that 3 is greater than 2, and stops, because the sort order guarantees there are no more matches. It then goes to value 3 in table 1, resumes from the last value it had a join on (value 2 in table 2), finds the matching value 3, moves on to value 4 in table 2, finds a mismatch, knows there are no more matching rows and stops. Then it goes to value 4 in table 1 and repeats the same processing, and so on until the comparison completes for every value in table 1. SQL Server walks down both tables and compares them bit by bit. This is the basis of a Merge Join.

In the Merge join, we sort both tables on the join columns and walk down them bit by bit, comparing the rows. Because the rows are sorted, SQL Server can stop processing as soon as it finds a value in the second table that is greater than the value of the current row in the first table.
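As a quick way to see the operator, here is a minimal sketch against the AdventureWorks tables used earlier in this post. Both tables are clustered on SalesOrderID, so they are already sorted on the join column; the join hint is there only so the plan reliably shows a Merge Join:

-- Both inputs are already ordered by SalesOrderID (their clustered index key),
-- so the merge join can walk the two inputs without any Sort operator.
SELECT soh.SalesOrderID, sod.ProductID
FROM Sales.SalesOrderHeader AS soh
INNER MERGE JOIN Sales.SalesOrderDetail AS sod
    ON sod.SalesOrderID = soh.SalesOrderID;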

The Merge join is the fastest physical join operator in SQL Server. It is so efficient because it can go through each input a single time; it doesn't have to loop back and repeat work like the nested loops join. The exception is the many-to-many merge join: SQL Server cannot simply rewind the inner input for each duplicate, so it creates a worktable in tempdb to process those many-to-many joins. The comparison starts just like a regular merge join, but when SQL Server finds duplicate values in the second table it writes them into the worktable in tempdb and does the comparisons there. If it also finds duplicate values in the first table, it compares those first-table values against the values already stored in the worktable. Once the iterator moves past the duplicated values, the worktable gets cleared out, and the merge join continues until all the values have been compared. If you would like to know in depth how these worktables work, please refer to this wonderful article from sqlserverfast.com.

A Merge join is not a bad thing; it may already be efficient in your execution plan. When you see a merge join and the plan is slow, focus on the upstream operators feeding the merge join. Is the data presorted because you already have an index, or is SQL Server sorting it with its own Sort operator? If SQL Server is adding the sort, check whether you can simply add the missing column as the last key column of an existing index, or whether a different join algorithm would be better. The other scenario is having lots of duplicate values in your data: in that case SQL Server uses worktables to handle how the duplicate values get joined. So if you see lots of duplicates or tempdb usage, it is worth looking for a better option.

Combining the tables with the merge join is faster than the nested loops join, but we need to sort the data first, and we know that sorting is slow. So how can we do better? That's when the Hash join comes in.

Hash Match:

For every value in table 1, we apply a hash function to the join key and build a hash table from it. Every value is now in the hash table we have built. Once that is done, we go to the second table, apply the same hash function to each value in table 2, and probe the hash table we built for matching values, and we have a join. We repeat the process for all the values in the second table. This is the Hash join.

A Hash join happens in two phases. The first phase is the build phase, where SQL Server builds an in-memory hash table from one of the two inputs. Hashes are calculated from the join keys of the input data, and the whole rows are stored in the hash table under hash buckets. Mostly we have one row per hash bucket, except when we have multiple rows with duplicate join keys or there is a hash collision (totally different join keys getting the same hash, which is a rare scenario). Once the hash table is built, SQL Server starts the probe phase, where it calculates the join key hash for each row in the second input and checks whether it exists in the hash table built during the build phase. If it finds a match, it then checks whether the join keys themselves actually match, because of those possible hash collisions.
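Just to illustrate the build and probe idea, here is a conceptual sketch that uses CHECKSUM as a stand-in hash function; this is not what SQL Server literally does internally, it only mimics the two phases:

-- Build phase: hash the join key of the first input and store the rows.
SELECT CHECKSUM(p.BusinessEntityID) AS hash_key, p.BusinessEntityID
INTO #build
FROM Person.Person AS p;

-- Probe phase: hash each row of the second input, look for a matching bucket,
-- then recheck the real join key in case of a hash collision.
SELECT pp.PhoneNumber
FROM Person.PersonPhone AS pp
INNER JOIN #build AS b
    ON b.hash_key = CHECKSUM(pp.BusinessEntityID)   -- match on the hash first
   AND b.BusinessEntityID = pp.BusinessEntityID;    -- then confirm the actual key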

The hash join is not the fastest type of join, because it needs to run the hash function, which uses more CPU to calculate the hashes of the join keys, but it is really useful for joining really large tables, and it uses tempdb if memory is not sufficient to build the hash table. If SQL Server cannot fit the whole hash table in memory because it is very large, the hash join can spill to disk and use tempdb. In this scenario, SQL Server keeps as many buckets as possible in memory, and the ones it cannot keep in memory are stored in tempdb. In the probe phase, SQL Server joins any rows whose buckets are in the in-memory hash table; rows whose buckets are not in memory are written to tempdb. Once all the rows from the probe input have been processed, it loads the spilled data from tempdb back into memory and continues the comparisons there, staging data out to tempdb and bringing it back in as memory frees up.

What can we do if we see a Hash Match in our execution plans? First we have to decide whether this operator is appropriate for the plan, because it is a blocking operator: until the build phase is complete, all the downstream operators in the execution plan have to wait. We know the hash join can join any data sets, but when it spills to disk it can cause performance issues. That's when we need to look at the memory grant and check whether it was insufficient because of incorrect estimates. If the memory grant allocated to the query is not enough to build the hash table in memory, it spills to disk. In this scenario, check whether the statistics are up to date, and check whether adding a column to an index might help the optimizer choose a different, more optimal physical operator to process the query.
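If you suspect memory grant problems, one simple place to look, besides the spill warnings in the actual plan, is the memory grants DMV; a minimal sketch:

-- Requested vs granted memory for currently executing queries.
SELECT session_id, requested_memory_kb, granted_memory_kb
FROM sys.dm_exec_query_memory_grants;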

Summary:

In this Part 1, we learned how the physical join operators work, when they are efficient and when they are not, and where to look when they are not efficient so we can make our queries faster. In the second part of the series, we will look at how indexes help these physical operators go faster.

Performance Tuning Series: Part 4

In Part 3 of the series, we learned about Sorts, Hash matches and Serial nested loops. We saw when they can be useful and when they need to be avoided, with examples. In this post, let's learn about Seek and Residual predicates, with examples.

Let's take an example of a very simple query running against the BigTransactionHistory table and check the execution plan.

Credits: Adam Machanic

SELECT * FROM 
bigTransactionHistory as bth
WHERE
ProductId BETWEEN 1001 AND 50001
AND Actualcost>5000000

When you check the execution plan, the estimates are off here: the estimated number of rows was 80, but the actual number of rows was 0. Did you also notice we have parallelism here? Parallelism is normally used for high-cost query plans, so why do we see it for this simple query?

Run the same query again, this time with statistics IO turned on:

SET STATISTICS IO ON
GO
SELECT * FROM 
bigTransactionHistory as bth
WHERE
ProductId BETWEEN 1001 AND 50001
AND Actualcost>5000000

SQL Server did 131,444 logical reads, which is a lot.

There are two types of predicates in SQL Server: index (seek) predicates and residual predicates. Residual predicates are predicates that the index seek cannot satisfy, but they can still be pushed down to the storage engine. When you look at the predicate information in this execution plan, you see the seek predicate on ProductID, and the residual predicate being applied by the storage engine instead of the query processor: the ActualCost filter was evaluated in the storage engine, so those rows were never passed up to the query processor.

So what are we missing here? How do we track down where the issue is? Look at the line going from the Index Seek to the Parallelism operator: it is very thin. These lines can be really useful for tracking down problems when execution plans are big, so we don't have to inspect each and every operator in the plan. The residual predicate makes the plan harder to debug, even though it helps the query run faster. If you want to test query performance in a lower environment, enabling trace flag 9130 can be really helpful for debugging the plan. What this trace flag does is disable residual predicate pushdown and force the residual predicates to be processed as a Filter operator in the query processor. When you enable this trace flag and run the same query, look at the difference in that line. Enabling this trace flag is useful when you see a lot of I/O and cannot figure out where it is coming from.

Warning: do not enable trace flag 9130 in production, because it will slow queries down. Enable it on non-production servers to quickly debug queries. This is a gem for query tuning developers and DBAs.
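For a quick test on a non-production server, the trace flag can be turned on for the session or scoped to a single statement; a minimal sketch using the query from above:

-- Session level, for testing only:
DBCC TRACEON (9130);
-- ... run the query and compare the plan ...
DBCC TRACEOFF (9130);

-- Or scoped to a single statement:
SELECT * FROM bigTransactionHistory AS bth
WHERE ProductId BETWEEN 1001 AND 50001
AND Actualcost > 5000000
OPTION (QUERYTRACEON 9130);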

Sometimes you write queries with more than one predicate on the same column of a table. In that case SQL Server may use one of them as the seek predicate and apply the other as a residual predicate, which may not be what you want. Instead, we can rewrite the code using a CASE expression.

Below is an example that uses two predicates on the same column:

CREATE TABLE #Purchaseorders
(
[SalesOrderID] INT,
[OrderDate] DATETIME,
[DueDate] DATETIME,
--Primary key ([SalesOrderID])
)

INSERT #Purchaseorders
SELECT 
[SalesOrderID],
CASE
	WHEN [SalesOrderID] BETWEEN 43659 AND 43701 THEN '20110531'
	WHEN [SalesOrderID] BETWEEN 43702 AND 43705 THEN '20110601'
	WHEN [SalesOrderID] BETWEEN 54990 AND 55990 THEN '20081110'
	END,
	'20110612'
FROM [Sales].[SalesOrderDetail]
GO

dbcc dropcleanbuffers

DECLARE @OrderDate DATE='20110531'
SELECT  r.*
FROM #Purchaseorders as r
CROSS APPLY
(
SELECT TOP (1000)
SOH.[OrderDate],
SOH.[SalesOrderID],
SOH.[TotalDue]
FROM [Sales].[SalesOrderHeader] AS SOH
WHERE SOH.SalesOrderID=r.SalesOrderID
AND SOH.[OrderDate] BETWEEN r.OrderDate AND r.DueDate
AND SOH.[OrderDate] >= @OrderDate
ORDER BY 
SOH.[OrderDate]
) AS a
GO

Instead of using both predicates on the same column, we can rewrite the query to use a CASE expression on the OrderDate predicates and check the execution plan.

DECLARE @OrderDate DATE='20110531'
SELECT  r.*
FROM #Purchaseorders as r
CROSS APPLY
(
SELECT TOP (1000)
SOH.[OrderDate],
SOH.[SalesOrderID],
SOH.[TotalDue]
FROM [Sales].[SalesOrderHeader] AS SOH
WHERE SOH.SalesOrderID=r.SalesOrderID
-- combine the two OrderDate predicates into a single range via a CASE expression
AND SOH.[OrderDate] BETWEEN CASE WHEN r.OrderDate >= @OrderDate
                                 THEN r.OrderDate ELSE @OrderDate END
                        AND r.DueDate
ORDER BY 
SOH.[OrderDate]
) AS a
GO

Check the execution plan now: the optimizer chose to apply the OrderDate filter as a residual predicate in this scenario, but this time it is not us confusing the optimizer between a seek predicate and a residual predicate; it is the optimizer's own choice to use a residual predicate here.

Summary:

In this post of the Performance Tuning series, we went through Seek and Residual predicates and learned how enabling trace flag 9130 on non-production servers can be really helpful for debugging a plan: it disables residual predicate pushdown and forces the residual predicates to be processed as a Filter in the query processor. You can then follow the thick lines in your query plans to quickly find where the issue is.

Thanks for reading, and don't forget to subscribe to this website. There is a WANT TO BE NOTIFIED? section on the right side of the main page where you can enter your email address to receive new blog post notifications every week!

Performance Tuning Series: Part 3

In Part 2 of the series, we learned about Key Lookups and Spools. We saw when they can be useful and when they need to be avoided, with examples. In this post, let's learn about Sorts, Hash matches and Serial nested loops.

This is a continuation of Performance Tuning series Part 1 and Part 2. As I mentioned in the first part, there is no end to this series: as I learn new performance tuning tips and tricks, I will keep adding to it. If you would love to learn along with me and level up your query tuning skills, subscribe to dbanuggets here and give me that boost to work more 🙂

Sorts:

Sorts are important and necessary for many queries, because most queries need to sort their results, and sorted input is often exactly what we need. Let's look at an example query against the BigTransactionHistory table that returns the 500 products that sold for the highest cost, considering only sales of 1500 or more. We take the rows from BigTransactionHistory where the actual cost is greater than or equal to 1500, assign a row number partitioned by ProductID (so each ProductID gets its own set of row numbers) ordered by ActualCost descending, and then take the top 500 rows from there. Run the query below, which does exactly that, and check the execution plan.

SELECT TOP (500) WITH TIES
ProductID, 
ActualCost
from
(
SELECT 
productID,
ActualCost,
Row_number() over
(
Partition by
productID
order by 
Actualcost desc)as r
from bigtransactionhistory
where
Actualcost>=1500
) AS x
WHERE x.r=1
ORDER BY 
x.Actualcost DESC
GO

The query runs in 8 seconds. Reading the plan from the data flow side, SQL Server first did a nonclustered index scan, sorted the data so it could assign the row numbers, calculated the row numbers with the window aggregate and filter operators, sorted the rows again, gathered the streams through the parallelism operator, and returned the top 500 rows. The first sort, right after the NC index scan, is sorting a lot of rows; hover your mouse over the arrow coming out of the NC index scan and you can see how many rows are being sorted.

What if we rewrite the same query so that we get nested loops instead, since they are cheaper here? To get the nested loops in the execution plan, we can use CROSS APPLY, remove the PARTITION BY, and let the query work on one product at a time.

SELECT TOP (500) WITH TIES
ProductID, 
ActualCost
from bigproduct as p
CROSS APPLY
(
SELECT 
bt.ActualCost,
Row_number() over
(
order by 
bt.Actualcost desc)as r
from bigtransactionhistory as bt
where
bt.ProductID=p.ProductID and 
bt.Actualcost>=1500 ) as x
where x.r=1
ORDER BY 
x.Actualcost DESC
GO

Check the execution plan. The query now runs much faster, in 3 seconds. It does a clustered index scan on the BigProduct table, goes into the loop once per product, grabs the rows just for that product, sorts just those rows, and calculates the row numbers just for that product (the Segment and Sequence Project operators). The sort in this plan is still expensive, but many small sorts are far less expensive than one sort over a huge number of rows.

Hash Match:

The Hash Match operator is appropriate for data warehouse workloads but not for OLTP workloads. It works well for a medium-sized set of rows against a large set of rows, but not for a large set of rows against another large set of rows. Let's take an example. In my company, we regularly receive data from different vendors and we import it into multiple tables depending on the type of data. Before we import anything, we need to check whether the data already exists in the database. That's the example we will use here. Run the query below and check the execution plan.

SELECT TOP(5000000)
[SalesOrderID],
[OrderDate],
[Status],
[TotalDue]
INTO #sohe
FROM [Sales].[SalesOrderHeaderEnlarged]
WHERE
[SalesOrderID] BETWEEN 43659 AND 1181640


SELECT COUNT(*)
FROM #sohe as b
WHERE 
EXISTS(
SELECT * FROM [Sales].[SalesOrderHeaderEnlarged] as sohe
WHERE
sohe.[OrderDate]=b.[OrderDate]
AND sohe.[SalesOrderID]=b.[SalesOrderID]
AND sohe.[SalesOrderID] BETWEEN 43659 AND 1081640
)
OPTION (MAXDOP 1)
GO

When you observe the execution plan, we see a clustered index seek on the SalesOrderHeaderEnlarged table; the seek is on the smaller range of SalesOrderIDs. SQL Server could not seek on the temp table #sohe because we do not have an index on it, so it is forced to use a Hash Match. This is the kind of situation where SQL Server falls back to the hash match because there is no proper index to seek on.

As you can see, the execution plan includes an index recommendation that says performance would improve by 13%. Let's go ahead and create the recommended index on the temp table, then run the same query again and see the difference in the execution plan.

CREATE CLUSTERED INDEX ix_SalesOrderID_OrderDate
ON #sohe
(
[SalesOrderID],[OrderDate]
)
SELECT COUNT(*)
FROM #sohe as b
WHERE 
EXISTS(
SELECT * FROM [Sales].[SalesOrderHeaderEnlarged] as sohe
WHERE
sohe.[OrderDate]=b.[OrderDate]
AND sohe.[SalesOrderID]=b.[SalesOrderID]
AND sohe.[SalesOrderID] BETWEEN 43659 AND 1081640
)
OPTION (MAXDOP 1)
GO

This time we get a Merge join, because we now have two sets of data that are similar in size and both sorted, which is great for a merge join. Here the merge join is better than the hash match: creating the index changed the hash match into a merge join, which is exactly what we needed to improve performance.

Serial Nested Loops:

Sometimes we have never-ending queries caused by serial nested loops. Let's see this with an example. We will use the AdventureWorks2016 database; the database is in 2019 compatibility mode on a SQL Server 2019 instance. First, let me turn Auto Update Statistics OFF at the database level. Then let's see how many rows have to change in the Sales.SalesOrderHeaderEnlarged table to hit the threshold that invalidates the statistics on the CustomerID column.

ALTER DATABASE  [adventureworks2016Big]
SET AUTO_UPDATE_STATISTICS OFF

UPDATE [Sales].[SalesOrderHeaderEnlarged] 
SET CustomerID = 11091
WHERE CustomerID < 16575;
GO

SELECT
	[sch].[name] + '.' + [so].[name] AS [TableName] ,
	[ss].[name] AS [Statistic] ,
	[sp].[modification_counter] AS [RowModifications] ,
	SQRT([sp].[rows]*1000) AS [NewThreshold]
FROM [sys].[stats] [ss]
JOIN [sys].[objects] [so] 
	ON [ss].[object_id] = [so].[object_id]
JOIN [sys].[schemas] [sch] 
	ON [so].[schema_id] = [sch].[schema_id]
LEFT OUTER JOIN [sys].[indexes] AS [si] 
	ON [so].[object_id] = [si].[object_id]
	AND [ss].[name] = [si].[name]
OUTER APPLY [sys].[dm_db_stats_properties]([so].[object_id], [ss].[stats_id]) sp
WHERE [so].[object_id] = OBJECT_ID(N'Sales.SalesOrderHeaderEnlarged')
ORDER BY [ss].[stats_id];
GO

In the results above, we are focused on the IX_SalesOrderHeaderEnlarged_CustomerID statistic, which is created on the CustomerID column. We need to update 49,221 rows to invalidate this particular statistic, as shown in the NewThreshold column, and we updated about 890k rows, as you can see in RowModifications. The statistic is now outdated, but because we turned Auto Update Statistics OFF, SQL Server will not go and update it the next time a query uses it. Since the statistic is not updated, execution plans will keep using the same outdated statistics information for queries on the CustomerID column. Let's go ahead and run a query that uses the outdated CustomerID statistic and check the execution plan.

SELECT
	oh.[CustomerID],
	oh.[OrderDate],
	oh.[ShipMethodID],
	od.[OrderQty],
	od.[ProductID]
FROM [Sales].[SalesOrderHeaderEnlarged] oh
JOIN 
[Sales].[SalesOrderDetailEnlarged] od
ON oh.SalesOrderID = od.salesOrderID
where [CustomerID]=11091

In the execution plan above, the highlighted red rectangle shows the estimated number of rows coming out of the SalesOrderHeaderEnlarged table, which is just 2,156 rows, but look at the actual rows right above the estimate: it's about 890k rows, way above the estimate. Why is that? Because Auto Update Statistics is set to OFF, the statistics update did not kick in when we ran the query using the outdated statistic. Now let's turn Auto Update Statistics back ON, run the same query one more time, and see the magic. Check the execution plan this time.

USE [master]
GO
ALTER DATABASE [adventureworks2016Big] SET AUTO_UPDATE_STATISTICS ON WITH NO_WAIT
GO

USE [adventureworks2016Big];
GO
SELECT
	oh.[CustomerID],
	oh.[OrderDate],
	oh.[ShipMethodID],
	od.[OrderQty],
	od.[ProductID]
FROM [Sales].[SalesOrderHeaderEnlarged] oh
JOIN 
[Sales].[SalesOrderDetailEnlarged] od
ON oh.SalesOrderID = od.salesOrderID
where [CustomerID]=11091

When you observe the execution plan now, the plan itself has changed. The estimates line up very closely with the actuals for the clustered index scan on the SalesOrderHeaderEnlarged table.

Point to remember: when Auto Update Statistics is turned ON at the database level and a statistic is outdated, the next time you run any query that uses that statistic, SQL Server first updates the stats behind the scenes, invalidates the cached execution plan, creates a new execution plan using the freshly updated statistics, and only then executes the query and returns the results to the user. In this case SQL Server decided a clustered index scan was better, because the number of rows coming out of the table is around 890k.

It is always better to keep the statistics up to date at the database level and maintain them with maintenance plans as well. Older versions of SQL Server (before SQL Server 2016) use a different threshold to invalidate stats: 20% of the rows plus 500 have to change to invalidate a statistic and trigger the auto update the next time a query uses it. If your table is huge, it is hard to change 20% of the rows and hit that threshold. For that reason, you can enable trace flag 2371 to get the newer, dynamic threshold of SQRT(1000 * number of rows), which is much smaller than the old threshold for large tables.
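To put numbers on that, here is a quick comparison of the two thresholds for a table of one million rows (the row count is just an example):

-- Old threshold: 20% of the rows + 500; new dynamic threshold: SQRT(1000 * rows).
SELECT
    1000000 * 0.20 + 500    AS OldThreshold,   -- 200,500 rows
    SQRT(1000.0 * 1000000)  AS NewThreshold;   -- about 31,623 rows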

Summary:

In Part 3 of the Performance Tuning series, we learned about Sorts, Hash matches and Serial nested loops. We saw when they are appropriate and when they are not, how to avoid them when they are inappropriate, and how that improves query performance. As we add more posts to this series, we will get better at tuning our queries.

Thanks for reading!

Performance Tuning Series: Part 2

In Part 1 of the Performance Tuning series, we learned about the resource-consuming iterators and when we see them in our execution plans. In this part of the series, let's talk about some of the things that drain performance.

First, let's look at lookups. We will start by running DBCC DROPCLEANBUFFERS so all the unchanged pages in memory are flushed out. We are using the AdventureWorks2016 database for the demo. Run the query below and check the actual execution plan.

Select 
a.*
into #a
from [Sales].[SalesOrderHeader] as SOH
CROSS APPLY
(
SELECT TOP(1000)
* FROM [Sales].[SalesOrderDetail] as SOD
WHERE SOH.[SalesOrderID]=SOD.[SalesOrderID]
ORDER BY [OrderDate]
) as a
where
SOH.[SalesOrderID] between 43659 and 44659

When you observe the execution plan, SQL Server did a clustered index seek on the SalesOrderHeader table, then a nested loop and a clustered index seek on the SalesOrderDetail table, followed by two Compute Scalar operators. What is this Compute Scalar operator, and why do we see it here?

Compute Scalar:

The SalesOrderDetail table has a computed column, LineTotal. If you look at the table definition, you will see LineTotal is defined as below:

[LineTotal]  AS (isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0))),

After the clustered index seek, SQL Server passes the rows to the first Compute Scalar operator to calculate:

([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty]

Then it has to check whether that value is NULL and, if it is, substitute 0.0. That's the second Compute Scalar operator.

Then, as the rows pass from the Compute Scalar operators to the Top operator, it grabs the top 1000 rows, and the data is inserted into the temp table. The query took 0 seconds to run.

Later, suppose you add an additional column to one of the tables, for example a DetailID column to the SalesOrderDetail table.

Alter table [Sales].[SalesOrderDetail]
Add DetailID INT Null
GO

Run the same query again. Before running it, drop the temp table we created and run DBCC DROPCLEANBUFFERS to flush the clean pages from memory.

drop table #a

dbcc dropcleanbuffers
Select 
a.*
into #a
from [Sales].[SalesOrderHeader] as SOH
CROSS APPLY
(
SELECT TOP(1000)
* FROM [Sales].[SalesOrderDetail] as SOD
WHERE SOH.[SalesOrderID]=SOD.[SalesOrderID]
ORDER BY [OrderDate]
) as a
where
SOH.[SalesOrderID] between 43659 and 44659

When you look now, the same query has a different query plan. There is a nested loop and a Key Lookup: every row that comes out of the SalesOrderDetail table has to do a key lookup, then the two Compute Scalar operations, then the Top 1000 rows are grabbed and passed on to the next operator to be inserted into the temp table. Those rows are then gathered through the Gather Streams operator, since this is a parallel plan. We have a small number of rows in this table, so we did not see much difference in execution time, but if your tables have a lot of data, this will take a lot longer to run.

There are a few main takeaways here. We need to make sure our development environments have the same, or close to the same, amount of data as production, because with small amounts of data, like in our example, we did not see much difference in execution time. Developers assuming the code will behave the same way in production can lead to terrible performance issues once the code is deployed, so it is important to have production-like data volumes in non-production environments. The other thing is to do proper testing, including load testing, before pushing code to production. The third thing is to avoid SELECT * in the outermost query: columns get added to tables over time to serve other queries, which is exactly why we need to load test our queries.

Spools:

Spools are caches in tempdb. Let's take an example that produces a spool in the execution plan. First, we create a Products temp table from the BigProduct table using CROSS APPLY, for ProductIDs between 1001 and 20001. With the CROSS APPLY, each product gets one row, products evenly divisible by 5 get an extra row, products divisible by 7 get another extra row, and products divisible by both 5 and 7 get three rows in total. That is the first query. In the second query, we go to the Products temp table and grab the top 40 rows per product by doing a CROSS APPLY against the BigTransactionHistory table. Run the query and check the actual execution plan. Credits: Adam Machanic

SELECT ProductID
INTO #Products
FROM bigproduct
CROSS APPLY
(
SELECT 1 
UNION ALL
SELECT 2
where productID % 5 =0
UNION ALL
SELECT 3
WHERE ProductID % 7 =0
)x(m)
WHERE 
productID BETWEEN 1001 and 20001
GO

--spool
SELECT 
p.productID,
AVG(x.actualcost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
SELECT 
t.*,
ROW_NUMBER() OVER
(
PARTITION BY
p.ProductID
ORDER BY 
t.ActualCost DESC
) AS r
FROM bigTransactionHistory as t
WHERE
p.ProductID =t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
p.ProductID
GO

SQL Server first does a table scan on the Products temp table, the rows are sorted, and there is a Lazy Spool (number 4); this is called a performance spool and it is the most common kind of spool. It works by first sorting on ProductID, so all the rows for the same ProductID are grouped together. Each row goes into the nested loop (number 3) and asks the lazy spool for the information about its ProductID, say ProductID 2. The nested loop requests the spool for ProductID 2; the lazy spool runs its subtree (number 5), grabs the results, stores them in the spool (which lives in tempdb), and passes that information on to the next operators. When the next row comes in and it is ProductID 2 again, the nested loop requests the lazy spool for ProductID 2 and the spool serves the results straight out of tempdb without running the subtree again, passing them to the next operator in the query tree. This helps because the subtree costs around 62% of the plan cost. If we have non-distinct data, this lazy spool helps: it is much cheaper to store the data in tempdb and read it from there than to rerun the subtree. This query executed in 8 seconds. Now, is that good? Let's find out, first by looking at the spool details and then by disabling the spool.

When you check the execution plan again and hover your mouse over the lazy spool, we see rebinds and rewinds.

A rebind occurs when the subtree behind the spool has to rerun, and a rewind occurs when the data is served straight out of the spool. So, the data was requested from the spool 12,845 times (the number of executions). Out of those, we only had to run the subtree behind the spool 9,577 times (actual rebinds), and 3,268 times the data was served up by the spool (actual rewinds).

We can also verify that information one step behind the spool, at the Sequence Project operator in the subtree (number 5 in the execution plan). Hover your mouse over the Sequence Project operator: its number of executions should line up with the number of actual rebinds on the lazy spool operator.

So, this query took 8 seconds to run. Is that good or bad? Let's turn off the spool and see. We can turn off the spool in two different ways. The query works per ProductID and averages the top 40 rows. If we have one row for ProductID 2 in the temp table, we average the top 40 rows and get the result. If we have two rows for ProductID 2 in the temp table, we now have 80 rows to average, but it is the same 40 rows repeated twice for the same ProductID, so we get the same average. To solve this, we can simply take the DISTINCT ProductIDs from the Products temp table, which tells the optimizer the ProductIDs coming out of the temp table are unique. Spools like this show up when primary keys or unique constraints are missing on the tables, and out-of-date statistics can contribute as well. In this case, we just select the distinct ProductIDs from the temp table. Change the query as shown below:

SELECT 
p.productID,
AVG(x.actualcost) AS AvgCostTop40
FROM (SELECT DISTINCT ProductID from #Products) AS p
CROSS APPLY
(
SELECT 
t.*,
ROW_NUMBER() OVER
(
PARTITION BY
p.ProductID
ORDER BY 
t.ActualCost DESC
) AS r
FROM bigTransactionHistory as t
WHERE
p.ProductID =t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
p.ProductID
GO

When you look at the actual execution plan now, the spool has disappeared and the query executes in 2 seconds. That's cool. Storing data in tempdb is usually not a good option anyway.

We can also turn off the spool by using trace flag 8690 in the query. This trace flag makes the query optimizer disable the performance spool optimization.

Caution: don't turn on trace flags in production just because you read about them on some website.

SELECT 
p.productID,
AVG(x.actualcost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
SELECT 
t.*,
ROW_NUMBER() OVER
(
PARTITION BY
p.ProductID
ORDER BY 
t.ActualCost DESC
) AS r
FROM bigTransactionHistory as t
WHERE
p.ProductID =t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
p.ProductID
OPTION (QUERYTRACEON 8690)
GO

There is no more spool in the execution plan and the query runs in 2 seconds.

Let's insert non-unique rows into the table and see what the execution plan looks like. Insert the data first, then run the query with the trace flag still on.

INSERT INTO #products
SELECT * FROM #products
GO
SELECT 
p.productID,
AVG(x.actualcost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
SELECT 
t.*,
ROW_NUMBER() OVER
(
PARTITION BY
p.ProductID
ORDER BY 
t.ActualCost DESC
) AS r
FROM bigTransactionHistory as t
WHERE
p.ProductID =t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
p.ProductID
OPTION (QUERYTRACEON 8690)
GO

This time the same query executed in 5 seconds; before, it executed in 2 seconds. That's because we inserted more data: when you hover your mouse over the Sequence Project (Compute Scalar) operator, we now see 25,690 executions, which is a lot. Now run the same query without the trace flag and look at the execution plan. The query uses the lazy spool and its subtree again, and runs in 8 seconds. Even with double the amount of non-unique data, the spool is not faster here.

These types of spools are generally not good in query plans, except when the same non-unique values are looked up over and over, for example reading ProductIDs to get product categories or names for millions of rows. In that case a spool can be useful, but apart from that, we want to keep spools out of the execution plan.

Summary:

In this post of the Performance Tuning series, we focused on Lookups and Spools. We learned about these two operators, when they can be useful and when we have to avoid them in query plans. Lookups can be fine with a small set of data where we only need to look up a couple of rows, but they get bad when we need a lot of data. Spools should be avoided in most plans, except when you are repeatedly looking up the same non-unique values (in this example, ProductIDs) to return the same results.

Thanks for reading!

T-SQL Tuesday #136: Least Favorite Data Type

With Brent Ozar's invitation for this month's T-SQL Tuesday, I would like to share some thoughts on my least favorite data type in SQL Server. Thank you, Brent, for hosting the March edition of T-SQL Tuesday.

VARCHAR(MAX)

Though VARCHAR(MAX) is suitable in situations with large strings of data, it has its own complications that we need to consider. In my career as a DBA, I have seen SQL developers use VARCHAR(MAX) when they should not at least a couple of times, and fixing data types once they are in production is painful and risky.

When data is stored in a VARCHAR(n) column, the values are stored on standard data pages. VARCHAR(MAX) uses IN_ROW_DATA pages for up to 8,000 bytes of data, but if the value is larger than 8,000 bytes, it is stored on LOB_DATA pages and a pointer (reference) is kept on the IN_ROW_DATA page. We cannot use a VARCHAR(MAX) column as a key column in an index. You can add a VARCHAR(MAX) column as an included column, but that isn't seekable and duplicates the column completely, which is a lot of storage. LOB data cannot be compressed, and retrieving it is relatively slow. If the LOB data is stored in-row, it can be compressed, but only when we compress the table with PAGE-level compression; LOB data cannot be compressed with ROW-level compression, regardless of whether it is stored in-row or out-of-row.
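A tiny sketch of that index limitation, using a hypothetical table only to illustrate the point:

CREATE TABLE dbo.NotesDemo
(
    NoteID   INT IDENTITY PRIMARY KEY,
    NoteText VARCHAR(MAX)
);

-- This fails: a VARCHAR(MAX) column cannot be used as a key column in an index.
-- CREATE NONCLUSTERED INDEX IX_NotesDemo_NoteText ON dbo.NotesDemo (NoteText);

-- It can only be added as an included column, which is not seekable and duplicates the data.
CREATE NONCLUSTERED INDEX IX_NotesDemo_NoteID
    ON dbo.NotesDemo (NoteID) INCLUDE (NoteText);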

Variables declared as VARCHAR(MAX) can spill to tempdb, which can cause performance problems. If you are sorting on a VARCHAR(MAX) column in your queries, the sort requires a large memory grant, which can cause memory issues, and selecting values larger than 8,000 bytes causes extra off-page I/O.

Recommendations:

If you are running queries that search on a column, that column should be a fixed- or limited-length type instead of VARCHAR(MAX). If you can keep the row size under 8,000 bytes, the data stays on the IN_ROW_DATA page, which avoids page splits and reduces locking and latching during updates. Limiting the length of strings is really important when designing any database. Using VARCHAR(n) is recommended over VARCHAR(MAX) whenever possible.

I am really looking forward to what other SQL family members have to say about their least and most favorite data types!

Performance Tuning Series: Part 1

The more you know how to tune slow-running queries, the more confident you are in your job as a DBA. No matter how many times we tune queries, most of us still feel we are missing something, and we keep looking for tips and tricks to make our queries run faster! Am I right? I know this because I always get interested and excited reading articles or attending webinars about performance tuning. There is always something new to learn every single time you read about this topic.

In this series of articles, come along with me and learn some Performance tuning tips and tricks to make your queries run faster. This series will be an open series, meaning I will be adding new parts to it as I learn something new in this area. If you wish to learn along with me as we go, please feel free to subscribe by entering your email address in “WANT TO BE NOTIFIED?” section and get immediate updates to your email as soon as I add new posts to this blog.

In this first part, let's start with the basics.

When we submit any query to SQL Server, the optimizer creates a query plan for it behind the scenes, and the query processor does what that plan tells it to do. The SQL Server optimizer is a cost-based optimizer: whichever query plan costs less wins, and that "good enough" plan is saved in the plan cache. A good plan is essential for good query performance. A query plan is based on estimates; sometimes, due to bad estimates, the optimizer chooses a non-optimal plan and saves it in the plan cache. In that case, we need to be able to identify and fix the issue.

How do query plans work?

Let's look at an example. We will be using the AdventureWorks2016 database. First, get the estimated execution plan for the query below; the estimates show the row, execution and cost estimates.

select * from [Person].[Person]

When you hover your mouse on the clustered index scan, you see the estimates.

The estimates show the number of executions as 1, meaning the clustered index scan operator is expected to execute once. Estimated Number of Rows Per Execution shows 19,972, which is correct: we see 19,972 rows coming out of the table, so the estimate is 100% accurate. The estimated row size is 8,277 bytes. On top of that, we see the costs (estimated I/O cost, operator cost, subtree cost, CPU cost), which in very early versions of SQL Server represented the estimated number of seconds the query was expected to run. Internally, the cost is used by the optimizer as a metric to decide on the best plan it can come up with; it is only used at optimization time, not at run time. Beyond that, the cost number doesn't have much external meaning for us.

Now, let's run the same query with the actual execution plan turned on. The query plan is identical to the estimated execution plan we had previously; almost all the time, the estimated plan is the same as the actual execution plan, except when a recompilation occurs. In the actual plan, we see the actual number of rows read is 19,972 and the number of executions is 1. There is no "actual cost", because cost is not an actual measurement; it's just a metric SQL Server uses at optimization time, not at run time.

Let's take another example, with the top 100 rows from the table. First, check the estimated execution plan.

select TOP(100) * from [Person].[Person]

When we see a clustered index scan, we usually assume SQL Server will scan and read the entire table, right? But that is not always the case. Here, we only wanted the top 100 rows. Look at the clustered index scan information by hovering your mouse over the operator: the actual number of rows shows as 100. Now, why is that?

Let's first see how a query plan is read. There are two directions in any execution plan: data flows from right to left, and logic flows from left to right.

Each operator in a query plan is also known as an iterator. In the query plan above, the Select iterator requests a row from the Top iterator, which in turn requests a row from the clustered index scan iterator. A row is passed from the clustered index scan to the Top iterator, which then passes it to the Select operator. For the next row, the Select iterator again asks the Top iterator for a row, and this back-and-forth repeats for each row until the Top iterator reaches 100. When you hover your mouse over the Top iterator, you see the Top Expression; internally this is called a row goal. Once the row goal on the Top iterator is hit and 100 rows have been handed to the Select iterator, the clustered index scan iterator shuts down, then the Top iterator shuts down, and the query completes.

Let’s see with an example (Credits: Adam Machanic)


select TOP(1000) 
p.ProductID, 
pch.TransactionID
from [Production].[Product] as p
INNER LOOP JOIN [Production].TransactionHistory as pch
with (FORCESEEK) ON
p.[ProductID]=pch.[ProductID]
WHERE
pch.ActualCost>50
and p.StandardCost<10

When you hover your mouse over the rows flowing out of the Top operator, we get 357 rows, and from the clustered index scan, we see the number of rows read was 223, not the full 504 rows in the table. This is because SQL Server doesn't need to keep scanning once the Top operator's row goal is satisfied or the predicates have filtered enough rows. For each row coming out of the scan, the nested loop does a seek on the TransactionHistory table.

When you hover your mouse over the index seek on the TransactionHistory table, you will see the number of executions for this index seek is 223 and the number of rows read is 40,058.

That's 40,058 rows read across 223 seeks, which is a lot of index operations. These are some of the key things to consider when we tune queries: how many times the indexes were accessed and how many rows were read. When your execution plans are slow, it is always good to compare the estimated versus the actual number of rows. If they differ by some order of magnitude, that might be a statistics problem, or the optimizer may simply be unable to estimate well, as with multi-statement table-valued functions. The optimizer has no insight into those and cannot predict the number of rows coming out of them: in SQL Server 2012 a multi-statement TVF is estimated at only 1 row, in SQL Server 2014 at 100 rows, but newer versions can estimate much closer to the actuals using the Intelligent Query Processing features (interleaved execution for multi-statement TVFs, and table variable deferred compilation for table variables in SQL Server 2019).

If it is a statistics problem, make sure the statistics are up to date. Thick lines in the execution plan represent a large number of rows being passed to the next iterator; thin lines mean fewer rows. The sooner we can filter the data, that is, the further down the plan any predicates can be pushed, the faster the query will run. One example is a view whose predicate cannot be pushed down: replacing the view with an inline table-valued function lets the optimizer push the predicate down without extra work.
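
As a sketch of that idea (the view and function names here are hypothetical), a filtering view versus an inline table-valued function could look like the following; because the inline TVF is expanded into the calling query, the optimizer sees the predicate directly.

-- Hypothetical view: callers' extra filters may not be folded in early.
CREATE OR ALTER VIEW dbo.vw_ExpensiveTransactions_Demo
AS
SELECT TransactionID, ProductID, ActualCost
FROM Production.TransactionHistory
WHERE ActualCost > 50;
GO

-- Hypothetical inline TVF: the whole definition is inlined into the calling statement.
CREATE OR ALTER FUNCTION dbo.itvf_ExpensiveTransactions_Demo (@MinCost money)
RETURNS TABLE
AS
RETURN
(
    SELECT TransactionID, ProductID, ActualCost
    FROM Production.TransactionHistory
    WHERE ActualCost > @MinCost
);
GO

-- The ProductID filter travels into the inlined query, so it can be applied early.
SELECT TransactionID, ActualCost
FROM dbo.itvf_ExpensiveTransactions_Demo(50)
WHERE ProductID = 790;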

Check for the thicker lines in the execution plan and watch for the resource-consuming operators: lookups, sorts, spools, hashes and serial nested loops.

Lookups: There are two types of lookups: RID lookups (against heaps) and Key lookups (against clustered indexes). Say you run a simple query that filters on a column which has a nonclustered index (NCI). The query can use the NCI, but if the SELECT list also contains columns that are not in that index, SQL Server must fetch them from the heap (a RID lookup) or from the clustered index (a Key lookup). If the query returns a lot of rows, that means a lot of lookups and can cause performance problems.
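
Here is a small sketch against AdventureWorks, assuming a nonclustered index on Person.Person(LastName) exists (the covering index name below is made up): the first query seeks that index but needs a key lookup per row for the extra column, while the covering index removes the lookup.

-- The seek finds rows by LastName, but EmailPromotion is not in the nonclustered index,
-- so SQL Server performs a key lookup into the clustered index for every matching row.
SELECT LastName, FirstName, EmailPromotion
FROM Person.Person
WHERE LastName = 'Smith';

-- A covering index (hypothetical name) carries the extra columns and avoids the lookups.
CREATE NONCLUSTERED INDEX IX_Person_LastName_Demo
ON Person.Person (LastName)
INCLUDE (FirstName, EmailPromotion);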

Spools: These act as a cache inside the query processor. The query processor has to cache data either because we have inadequate indexes or because we did not tell the optimizer about the uniqueness of our data, for example with primary keys and unique constraints. Spools are created as hidden worktables in tempdb and store data at run time. Seeing them is often a sign that appropriate indexes or constraints are missing.
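
A minimal sketch of the uniqueness point, with a made-up staging table: declaring the key tells the optimizer the data is unique, which is information it can use instead of spooling intermediate results.

-- Hypothetical staging table: without a key, the optimizer knows nothing about uniqueness.
CREATE TABLE dbo.OrderStaging_Demo
(
    OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL
);

-- Declaring the primary key documents the uniqueness and can help avoid spool operators.
ALTER TABLE dbo.OrderStaging_Demo
ADD CONSTRAINT PK_OrderStaging_Demo PRIMARY KEY CLUSTERED (OrderID);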

Sorts: Many queries need sorting, and sorts can be painful when the input is large. ORDER BY, merge joins and stream aggregation all need sorted data, and sorting inside SQL Server is expensive. Alternatives are to sort in the application, or to provide an index that delivers the rows already in order.
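
A sketch of the indexing alternative (the index name is made up): when an index already delivers rows in the requested order, the Sort operator can disappear from the plan.

-- Without a supporting index, this ORDER BY typically needs an explicit Sort operator.
SELECT TransactionID, ProductID, TransactionDate
FROM Production.TransactionHistory
ORDER BY TransactionDate;

-- An index on the sort column lets SQL Server read the rows already ordered.
CREATE NONCLUSTERED INDEX IX_TransactionHistory_TransactionDate_Demo
ON Production.TransactionHistory (TransactionDate)
INCLUDE (ProductID);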

Hashes: There are two hash operators: Hash Match, used for joins, and Hash Match Aggregate, used for aggregation (stream aggregation handles ordered input, hash aggregation uses a hash table). A hash match builds a hash table in memory, and if there is not enough memory it spills to disk. Hash joins are fine in data warehouses with star schemas, but in OLTP systems they are often a sign that an appropriate index is missing.
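
As a sketch (the index name is made up; AdventureWorks may already ship a similar index), indexing the join column of the inner table gives the optimizer the option of nested loops with seeks instead of a hash join over a full scan for selective OLTP-style queries.

-- Hypothetical supporting index on the join column of the inner table.
CREATE NONCLUSTERED INDEX IX_TransactionHistory_ProductID_Demo
ON Production.TransactionHistory (ProductID);

-- With the index in place, this selective join can be satisfied by nested loops with seeks.
SELECT p.ProductID, th.TransactionID
FROM Production.Product AS p
INNER JOIN Production.TransactionHistory AS th
    ON th.ProductID = p.ProductID
WHERE p.ProductID = 790;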

Nested Loops: These can run serially. Nested loops are the most basic join method and work best for small inputs. An inappropriate nested loops join can make a query seriously slow, and this typically happens when statistics are out of date.
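
Refreshing statistics is straightforward; the table below is just an example target.

-- Refresh statistics on a single table with a full scan.
UPDATE STATISTICS Production.TransactionHistory WITH FULLSCAN;

-- Or refresh statistics across the whole database.
EXEC sp_updatestats;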

Summary:

In this post, we covered the basics of reading execution plans, both estimated and actual. A clustered index scan is not always bad and does not always mean the entire table is scanned. We looked at the resource-consuming iterators, when they appear in our execution plans, and a few tips to improve the performance of slow-running queries.

In the coming post, we will see more examples on tuning the queries.

Thanks for reading!

SQL Server 2019 Intelligent Query Processing: Approximate QP with Approx_Count_Distinct

In the previous post, we learned about the Scalar UDF Inlining feature. In this post, let's focus on Approximate QP with Approx_Count_Distinct, introduced as part of the SQL Server 2019 Intelligent Query Processing features. Like COUNT(DISTINCT column), it returns the number of distinct values, but it computes an approximate count using less CPU and memory to process the query.

First, let's see how the regular COUNT(DISTINCT column) works and check the resulting count. The example below uses the AdventureworksDW2016 database.

SET STATISTICS TIME ON
select count(distinct LastName) from [dbo].[DimEmployee]

Now, run the same query using Approx_Count_Distinct and see the distinct count

SET STATISTICS TIME ON
select APPROX_COUNT_DISTINCT(lastname) from [dbo].[DimEmployee]

The distinct count is lower by one compared with the query using COUNT(DISTINCT column). Compare the execution plans of both queries by looking at the SELECT operator properties: the compile CPU, compile memory and compile time are lower for the plan using Approx_Count_Distinct. Overall, the query using the new Approx_Count_Distinct function performs better than the one using COUNT(DISTINCT column).

Let's check another example, this time using the Adventureworks2016 database.

SET STATISTICS TIME ON
select count(distinct SalesOrderID) from [Sales].[SalesOrderDetailEnlarged]

This time, run the query using Approx_Count_Distinct and check the resulting count.
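
The statement follows the same pattern as the earlier DimEmployee example:

SET STATISTICS TIME ON
select APPROX_COUNT_DISTINCT(SalesOrderID) from [Sales].[SalesOrderDetailEnlarged]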

When you compare the results, there is a small variation: the query using Approx_Count_Distinct returned 27997 more distinct values than the query using COUNT(DISTINCT column). The feature produces an approximate distinct count while using fewer resources. Compare the resource utilization by putting the two actual execution plans side by side. The bottom plan, generated by COUNT(DISTINCT column), took 1.084s for the index scan, while the top plan, generated by Approx_Count_Distinct, took only 0.374s, which is much less. Also check the SELECT operator properties for the compile CPU, memory and compile time of both plans; in this case they were higher for the Approx_Count_Distinct query than for the COUNT(DISTINCT column) query.

The approximate count is not always higher. There are cases where the query using Approx_Count_Distinct returns fewer distinct values than the query using COUNT(DISTINCT column). Let's look at a different example, starting with COUNT(DISTINCT column).

SET STATISTICS TIME ON
select count(distinct CreditCardID) from [Sales].[SalesOrderHeaderEnlarged]

Run the same query using the Approx_Count_Distinct

SET STATISTICS TIME ON
select APPROX_COUNT_DISTINCT(CreditCardID) from [Sales].[SalesOrderHeaderEnlarged]

This time the distinct count from Approx_Count_Distinct is lower than the one from the previously run COUNT(DISTINCT column) query. The CreditCardID column is nullable and contains NULL values. Comparing the SELECT operator properties of the actual execution plans, the compile CPU, memory and time are again lower for the Approx_Count_Distinct query than for the COUNT(DISTINCT column) plan.

Summary:

By using the Approximate QP feature with Approx_Count_Distinct, we can get an approximate count of distinct non-null values with better performance than COUNT(DISTINCT column). In this blog post, we have seen scenarios where the new feature returned both lower and higher counts than the exact query. Of the three examples in this post, two of the queries using Approx_Count_Distinct showed a performance boost, with lower compile CPU, memory and time.

Thanks for Reading!

SQL Server 2019 Intelligent Query Processing: Scalar UDF Inlining

In the previous post, we learned about the Batch Mode on Row Store feature. In this post, let's focus on the Scalar UDF Inlining feature introduced as part of the SQL Server 2019 Intelligent Query Processing features. This feature can really improve the performance of queries that invoke scalar user-defined functions (UDFs), without having to modify any code.


A scalar user-defined function takes one or more parameters and returns a single value. For example, if a calculation is needed in many of your queries, instead of repeating the code in each query we can place it inside a function and call that function wherever the calculation is required. These functions simplify the code. Scalar functions are fine for small sets of rows but can hurt performance with large amounts of data.
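
As a tiny sketch of such an encapsulated calculation (the function name and logic are made up), a scalar UDF might look like this:

-- Hypothetical scalar UDF wrapping a calculation that several queries could reuse.
CREATE OR ALTER FUNCTION dbo.ufn_DiscountedPrice_Demo
(
    @ListPrice   money,
    @DiscountPct decimal(5, 2)
)
RETURNS money
AS
BEGIN
    RETURN @ListPrice * (1 - @DiscountPct / 100);
END;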

First, let's run a sample query against the Adventureworks2016 database, without and then with a UDF in it, and look at the I/O information.

SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT
Product.Name AS PName,
Product.ProductNumber AS PNumber,
Product.StandardCost AS StandardPrice,
History.EndDate as Enddate
FROM  Production.Product Product
INNER JOIN Production.ProductListPriceHistory History
ON Product.ProductID = History.ProductID

Look at the number of logical reads for both the ProductListPriceHistory and Product tables and the total elapsed time.

Let's run the same query again, this time adding a scalar UDF to it, and look at the same information.

SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT
Product.Name AS PName,
Product.ProductNumber AS PNumber,
Product.StandardCost AS StandardPrice,
History.EndDate as Enddate,
dbo.ufnGetProductListPrice(Product.ProductID,History.StartDate) as Listprice
FROM  Production.Product Product
INNER JOIN Production.ProductListPriceHistory History
ON Product.ProductID = History.ProductID

When you observe the logical reads for the ProductListPriceHistory and Product tables and the total elapsed time, the elapsed time is higher this time (76ms).

The logical reads are the same; the difference is the elapsed time. It is a small difference for my 395 rows, but it can grow significantly when pulling a large number of rows. We cannot get accurate I/O statistics for scalar UDFs, and the optimizer cannot estimate their execution cost accurately either. If you check the execution plan of the query with the UDF, the Compute Scalar operator shows a cost of 0%, yet the SELECT operator properties show that the UDF time (UdfElapsedTime) is almost 70%. Because this cost is not properly estimated by the optimizer, queries using scalar UDFs can run into performance issues.

Scalar UDFs also force the query to run serially, never in parallel. Even if the UDF does not touch any data (for example, a UDF that just returns GETDATE()), it still hurts performance because the whole query runs serially. In the execution plan above, the Compute Scalar operator invokes the UDF 395 times, once per row. The UDF's own plan is not shown inside the actual plan because it could generate a different plan for each execution, which would be 395 plans in this case.

Scalar UDF Inlining

The Scalar UDF Inlining feature converts the UDF into a simple scalar expression or subquery that is placed inline into the calling query, similar to using APPLY. All of this happens automatically, so we do not have to change any code. To enable the feature on SQL Server 2019, set the database compatibility level to 150, then run the same query with the UDF and check the elapsed time again; it drops to around 60ms in this test.

ALTER DATABASE AdventureWorks2016 SET COMPATIBILITY_LEVEL = 150;

SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT
Product.Name AS PName,
Product.ProductNumber AS PNumber,
Product.StandardCost AS StandardPrice,
History.EndDate as Enddate,
dbo.ufnGetProductListPrice(Product.ProductID,History.StartDate) as Listprice
FROM  Production.Product Product
INNER JOIN Production.ProductListPriceHistory History
ON Product.ProductID = History.ProductID

Now when you check the actual execution plan, you will not see the UDF being invoked because of the inlining feature, and the UdfCpuTime and UdfElapsedTime counters no longer appear in the SELECT operator properties.

We can disable or enable the scalar UDF feature on the database level, function level and also at the query level.

Database level:

ALTER DATABASE SCOPED CONFIGURATION SET TSQL_SCALAR_UDF_INLINING = OFF; -- or ON

Query Level using a HINT:

OPTION (USE HINT('DISABLE_TSQL_SCALAR_UDF_INLINING'));

Function Level:

CREATE OR ALTER FUNCTION UDF(@Parameter1 [int])
RETURNS [int]
WITH INLINE = OFF -- or ON
AS
BEGIN
    --Code here
    RETURN @Parameter1
END

For this new Scalar UDF Inlining feature, there are a few requirements that a scalar UDF needs to meet. You can check whether a function is inlineable by querying sys.sql_modules and looking at the is_inlineable column.
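
A quick way to check this is to list the scalar functions in the database along with that column:

-- Lists scalar UDFs and whether SQL Server 2019 considers them inlineable.
SELECT OBJECT_NAME(m.object_id) AS udf_name,
       m.is_inlineable
FROM sys.sql_modules AS m
JOIN sys.objects AS o
    ON o.object_id = m.object_id
WHERE o.type = 'FN';   -- scalar user-defined functions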

Conclusion:

The Scalar UDF Inlining feature was introduced to improve the performance of queries where invoking scalar UDFs is the bottleneck. It transforms the UDF into scalar expressions or scalar subqueries that are placed inline into the calling query, replacing the UDF operator. Those expressions and subqueries are then optimized by the optimizer. As a result, the UDF operator no longer appears in the execution plan, which is where the performance boost comes from.

Thank you for Reading!

SQL Server 2019 Intelligent Query Processing: Batch Mode on Row Store

In the previous post, we learned about Table variable deferred compilation. In this blog, let's focus on the Batch Mode on Row Store feature introduced in SQL Server 2019. It improves the performance of analytical queries by using batch mode query processing, optimizing CPU usage so that analytical queries run faster. We do not have to do anything to enable it if the database compatibility level is 150.

The feature targets CPU-bound analytical workloads without needing columnstore indexes. We can also use query hints to explicitly allow or disallow batch mode for a query.

Columnstore indexes are used for data warehouses and analytical workloads, improving I/O and optimizing CPU through batch mode execution. They are not a good fit for OLTP systems because they add overhead to delete and update operations. The Batch Mode on Row Store feature helps solve this problem for analytical workloads running against OLTP (rowstore) tables.

Row mode execution processes rows one at a time, whereas batch mode processes rows in batches of roughly 900 rows. That is a big benefit because it uses the CPU much more efficiently.

First, let's see how the plan looks with the feature disabled, by turning the batch mode on rowstore option OFF:

USE AdventureWorks2016Big
GO
ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_ON_ROWSTORE = OFF;

Then run the below query and check the actual execution plan

SELECT ModifiedDate,CarrierTrackingNumber ,
SUM(OrderQty*UnitPrice) FROM Sales.SalesOrderDetailEnlarged
GROUP BY ModifiedDate,CarrierTrackingNumber

When you observe the query plan and hover over the clustered index scan, the Actual Execution Mode is 'Row' and it processed 9341409 rows.

Now, enable batch mode by adding a hint to the same query and run it again. You can also turn the database-level option back ON (we turned it OFF for the previous example); that statement is shown after the query.

SELECT ModifiedDate,CarrierTrackingNumber ,
SUM(OrderQty*UnitPrice) FROM Sales.SalesOrderDetailEnlarged
GROUP BY ModifiedDate,CarrierTrackingNumber
OPTION(USE HINT('ALLOW_BATCH_MODE'))
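
Alternatively, to turn the feature back on for the whole database instead of hinting individual queries, flip the scoped configuration back:

-- Re-enable batch mode on rowstore at the database level.
ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_ON_ROWSTORE = ON;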

There are some limitations that prevent batch mode processing, such as in-memory OLTP tables, indexes other than B-trees, heaps, LOB columns, and XML and sparse columns. The Batch Mode on Row Store feature significantly improves analytical workloads whose queries contain GROUP BY, aggregations or sorts. Queries that return only a few rows, or that do not perform any aggregation, will not see much benefit from this feature.

Summary:

In this post, we learned about the Batch Mode on Row Store feature of SQL Server 2019. It helps analytical workloads and uses the CPU effectively. If your databases are at compatibility level 150, you are already taking advantage of this feature.

In the next blog post, we will learn about the next Intelligent Query Processing feature, T-SQL Scalar UDF Inlining.

Thanks for Reading!