Performance Tuning Series: Part 2

In Part 1 of the Performance Tuning series, we learned about resource-consuming iterators and when we see them in our execution plans. In this part of the series, let's talk about some other performance-draining operators: Lookups and Spools.

First, let's look at Lookups. We will run DBCC DROPCLEANBUFFERS so that all the clean (unchanged) pages are removed from the buffer pool, giving us a cold cache to test against. We are using the AdventureWorks2016 database for the demo.
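For reference, the cache-clearing step looks like this. Running CHECKPOINT first is a common practice (it writes dirty pages to disk so they become clean and can be dropped as well); this is standard documented syntax, though it is not part of the original demo script:

CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO

Now run the query below and check the actual execution plan.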

SELECT
    a.*
INTO #a
FROM [Sales].[SalesOrderHeader] AS SOH
CROSS APPLY
(
    SELECT TOP (1000)
        *
    FROM [Sales].[SalesOrderDetail] AS SOD
    WHERE SOH.[SalesOrderID] = SOD.[SalesOrderID]
    ORDER BY [OrderDate]
) AS a
WHERE
    SOH.[SalesOrderID] BETWEEN 43659 AND 44659

When you observe the execution plan, SQL Server did a clustered index seek on the SalesOrderHeader table, then a nested loop joining to a clustered index seek on the SalesOrderDetail table, followed by two Compute Scalar operators. What is this Compute Scalar operator, and why do we see it here?

Compute Scalar:

The SalesOrderDetail table has a computed column, LineTotal. If you look at the table definition, you will see the LineTotal column defined as below:

[LineTotal]  AS (isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0))),
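If you want to verify the definition yourself, one way (a small sketch using the standard catalog views) is to query sys.computed_columns:

-- List computed columns and their definitions for SalesOrderDetail
SELECT name, definition, is_persisted
FROM sys.computed_columns
WHERE object_id = OBJECT_ID(N'Sales.SalesOrderDetail');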

After the clustered index seek, SQL Server passes the rows to the first Compute Scalar operator to calculate:

([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty]

Then it has to check whether that value is NULL and, if it is, substitute 0.0. That's the second Compute Scalar operator.

Then, as the rows pass from the Compute Scalar operators to the Top operator, it grabs the top 1000 rows, gathers the data, and inserts it into the temp table. The query took 0 seconds to run.
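If you want a more precise timing than the SSMS status bar, a common approach (not part of the original demo) is to wrap the run with SET STATISTICS TIME:

SET STATISTICS TIME ON;
GO
-- run the query here, then check the Messages tab for CPU and elapsed time
SET STATISTICS TIME OFF;
GO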

Later, suppose you have to add an additional column to one of the tables; for example, say you add a DetailID column to the SalesOrderDetail table:

ALTER TABLE [Sales].[SalesOrderDetail]
ADD DetailID INT NULL
GO

Run the same query again. Before running it, drop the temp table we created and run DBCC DROPCLEANBUFFERS to remove the clean pages from memory.

DROP TABLE #a

DBCC DROPCLEANBUFFERS

SELECT
    a.*
INTO #a
FROM [Sales].[SalesOrderHeader] AS SOH
CROSS APPLY
(
    SELECT TOP (1000)
        *
    FROM [Sales].[SalesOrderDetail] AS SOD
    WHERE SOH.[SalesOrderID] = SOD.[SalesOrderID]
    ORDER BY [OrderDate]
) AS a
WHERE
    SOH.[SalesOrderID] BETWEEN 43659 AND 44659

When you observe it now, the same query has a different plan. There is now a nested loop and a key lookup: every row that comes out of the SalesOrderDetail table has to do a key lookup, then the two Compute Scalar operations, and then the Top operator grabs the top 1000 rows and passes them on to the next operator to insert into the temp table. Those rows are then gathered into a single stream, as this is a parallel plan. We have a small number of rows in this table, so we did not see much difference in execution time, but if your tables hold a lot of data, this plan shape can take far longer to run.
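A quick aside on why the lookup appears at all: a key lookup means the index that satisfied the seek does not carry all the columns the query asks for, so every matching row triggers an extra trip to the clustered index. When the column list is explicit and small, the standard fix is a covering index; with SELECT * no practical covering index exists, which is one more argument against it. A sketch only, assuming the query actually needed just OrderQty and UnitPrice (the index name is illustrative, not from the original post):

-- Hypothetical covering index: the seek column plus included columns
CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_SalesOrderID_Covering
ON [Sales].[SalesOrderDetail] ([SalesOrderID])
INCLUDE ([OrderQty], [UnitPrice]);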

There are a few main lessons here. First, make sure your development environments carry the same, or nearly the same, amount of data as production: with a small data set like the one in our example, we did not see much difference in execution time, and developers assuming the code will behave the same way in production can cause terrible performance issues once it is deployed. Second, do proper load testing before moving code to production. Third, avoid SELECT * in the outermost query: columns get added to tables over time to serve other queries, and SELECT * silently picks them up, as the DetailID example showed (see the sketch below).
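As a sketch of that last point, here is the demo query with an explicit column list instead of SELECT *; the columns chosen here are illustrative (pick only the ones your consumers actually need), and a query written this way would not silently pick up later additions like DetailID:

SELECT
    a.SalesOrderID,
    a.SalesOrderDetailID,
    a.OrderQty,
    a.UnitPrice,
    a.LineTotal
INTO #a
FROM [Sales].[SalesOrderHeader] AS SOH
CROSS APPLY
(
    SELECT TOP (1000)
        SOD.SalesOrderID,
        SOD.SalesOrderDetailID,
        SOD.OrderQty,
        SOD.UnitPrice,
        SOD.LineTotal
    FROM [Sales].[SalesOrderDetail] AS SOD
    WHERE SOH.[SalesOrderID] = SOD.[SalesOrderID]
    ORDER BY [OrderDate]
) AS a
WHERE SOH.[SalesOrderID] BETWEEN 43659 AND 44659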

Spools:

Spools are caches in tempdb. Let's take an example that gives us a spool in the execution plan. First, we create a #Products temp table from the bigproduct table, restricted to ProductIDs between 1001 and 20001, using CROSS APPLY to multiply the rows: each product gets one row, products evenly divisible by 5 get a second row, products divisible by 7 get another row, and products divisible by both 5 and 7 get three rows. That is the first query. For the second query, we go to the #Products temp table and, for each product, grab the top 40 rows by doing a CROSS APPLY with a derived table on the bigTransactionHistory table. Run the query and check the actual execution plan. Credits: Adam Machanic.

SELECT ProductID
INTO #Products
FROM bigproduct
CROSS APPLY
(
    SELECT 1
    UNION ALL
    SELECT 2
    WHERE ProductID % 5 = 0
    UNION ALL
    SELECT 3
    WHERE ProductID % 7 = 0
) x (m)
WHERE
    ProductID BETWEEN 1001 AND 20001
GO

--spool
SELECT
    p.ProductID,
    AVG(x.ActualCost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
    SELECT
        t.*,
        ROW_NUMBER() OVER
        (
            PARTITION BY p.ProductID
            ORDER BY t.ActualCost DESC
        ) AS r
    FROM bigTransactionHistory AS t
    WHERE
        p.ProductID = t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
    p.ProductID
GO

SQL Server first does a table scan on the #Products temp table and sorts the rows, and then there is a lazy spool (number 4 in the plan); this is called a performance spool and is the most common kind of spool. It works off the sorted ProductIDs, so all identical ProductIDs arrive grouped together. Each row goes into the nested loop (number 3), which asks the lazy spool for the information about that ProductID. Say a row comes in for ProductID 2: the lazy spool runs its subtree (number 5), grabs the results, stores them in the spool in tempdb, and serves them up to the next operators. When the next row from the nested loop is again for ProductID 2, the lazy spool returns the results straight from tempdb without running the subtree again and passes them to the next operator in the query tree. This helps because the subtree costs around 62% of the plan cost. If we have non-distinct data, the lazy spool saves work: it's much cheaper to store the data once in tempdb and reuse it. This query executed in 8 seconds. Now, is that good? Let's see by disabling the spool.

When you check the execution plan again and hover your mouse over the lazy spool, you see rebinds and rewinds.

A rebind occurs when the subtree behind the spool has to be rerun; a rewind occurs when the data is served straight out of the spool. Here, data was requested from the spool 12845 times (Number of Executions). Of those, the subtree only had to run 9577 times (Actual Rebinds), and 3268 times the data was served up from the spool (Actual Rewinds); 9577 + 3268 = 12845.

We can also verify this one step behind the spool, at the Sequence Project operator in the subtree (number 5 in the execution plan). Hover your mouse over the Sequence Project operator and you will see that its Number of Executions lines up with the Actual Rebinds of the lazy spool operator.

So, this query took 8 seconds to run. Is that good or bad? Let's turn off the spool and see. We can turn off the spool in two different ways. The query runs per ProductID and averages the top 40 rows. If we have one row for ProductID 2 in the temp table, we average its top 40 rows and get the result. If we have two rows for ProductID 2, we process 80 rows, but they are the same 40 rows repeated twice for the same ProductID, so we get the same average. The optimizer adds the spool precisely because it cannot be sure the ProductIDs are unique; that can happen when tables are missing primary keys or unique constraints, or when statistics are out of date. In this case, we can simply select the distinct ProductIDs from the temp table, which tells the optimizer the values are unique. Change the query as shown below:

SELECT
    p.ProductID,
    AVG(x.ActualCost) AS AvgCostTop40
FROM (SELECT DISTINCT ProductID FROM #Products) AS p
CROSS APPLY
(
    SELECT
        t.*,
        ROW_NUMBER() OVER
        (
            PARTITION BY p.ProductID
            ORDER BY t.ActualCost DESC
        ) AS r
    FROM bigTransactionHistory AS t
    WHERE
        p.ProductID = t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
    p.ProductID
GO

When you observe the actual execution plan now, the spool has disappeared and the query executes in 2 seconds. That's cool; storing data in tempdb is usually not a good option.
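Another way to give the optimizer the same uniqueness guarantee (a sketch, not from the original demo; the temp table name is illustrative) is to declare it up front on the temp table:

-- Hypothetical variant: a primary key tells the optimizer the
-- ProductID values are unique, so no performance spool is needed.
CREATE TABLE #ProductsUnique
(
    ProductID INT NOT NULL PRIMARY KEY
);

INSERT INTO #ProductsUnique (ProductID)
SELECT DISTINCT ProductID
FROM #Products;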

The other way to turn off the spool is trace flag 8690, applied within the query, which makes the query optimizer disable the performance spool optimization.

Caution: don't turn on trace flags in production just because you read about them on a website.

SELECT
    p.ProductID,
    AVG(x.ActualCost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
    SELECT
        t.*,
        ROW_NUMBER() OVER
        (
            PARTITION BY p.ProductID
            ORDER BY t.ActualCost DESC
        ) AS r
    FROM bigTransactionHistory AS t
    WHERE
        p.ProductID = t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
    p.ProductID
OPTION (QUERYTRACEON 8690)
GO

There is no longer a spool in the execution plan, and the query runs in 2 seconds.
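On SQL Server 2016 SP1 and later there is also a documented hint, USE HINT ('NO_PERFORMANCE_SPOOL'), that disables the performance spool without resorting to QUERYTRACEON. A sketch of the same query using it:

SELECT
    p.ProductID,
    AVG(x.ActualCost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
    SELECT
        t.*,
        ROW_NUMBER() OVER
        (
            PARTITION BY p.ProductID
            ORDER BY t.ActualCost DESC
        ) AS r
    FROM bigTransactionHistory AS t
    WHERE
        p.ProductID = t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
    p.ProductID
-- Documented equivalent of trace flag 8690 (SQL Server 2016 SP1+)
OPTION (USE HINT ('NO_PERFORMANCE_SPOOL'))
GO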

Let's insert non-unique rows into the table and see what the execution plan looks like. Insert the data first, then run the query with the trace flag still on.

INSERT INTO #Products
SELECT * FROM #Products
GO

SELECT
    p.ProductID,
    AVG(x.ActualCost) AS AvgCostTop40
FROM #Products AS p
CROSS APPLY
(
    SELECT
        t.*,
        ROW_NUMBER() OVER
        (
            PARTITION BY p.ProductID
            ORDER BY t.ActualCost DESC
        ) AS r
    FROM bigTransactionHistory AS t
    WHERE
        p.ProductID = t.ProductID
) AS x
WHERE x.r BETWEEN 1 AND 40
GROUP BY
    p.ProductID
OPTION (QUERYTRACEON 8690)
GO

This time the same query executed in 5 seconds, where before it executed in 2 seconds. That's because we inserted more data: hover your mouse over the Sequence Project (Compute Scalar) operator and you now see 25690 executions (double the previous 12845), which is a lot. Now run the same query without the trace flag and check the execution plan: the query uses the lazy spool again and runs in 8 seconds. Even with double the amount of non-unique data, the spool is not faster here.

These types of spools are rarely good in query plans. The exception is when we are repeatedly reading the same non-unique values, for example looking up product categories or names for the same ProductIDs across millions of rows; in that case the spool is useful, but apart from that, we want to keep spools out of our execution plans.

Summary:

In this post of the Performance Tuning series, we focused on Lookups and Spools: what these two operators do, when they can be useful, and when we should keep them out of our query plans. Lookups can be fine for a small set of data, where we only need to look up a couple of rows, but they get expensive when you retrieve lots of data. Spools should be avoided in almost all plans, except when you are repeatedly searching the same non-unique values (in this example, ProductIDs) to grab the same results.

Thanks for reading!
