Memory Grant Internals Part 4

This part is the continuation of the series to the post Memory Grant Internals part 3

In our previous part of the series, we have learned about what are these query memory/work space memory and how it works, what happens when the memory allocated to the queries is not sufficient and how the queries have to wait before they get the memory grants. In this part4 of the series, lets learn about how queries will spill to disk when the grants allocated are not enough and how to make sure we address and solve these issues.

If your tables are huge and you are sorting them in your queries, then it will take enormous amount of memory to do the sorting task and it is very hard to allocate memory for such sorts unless you have a huge amount of memory. In such cases, data have to spill to disk. Spilling to disk meaning using the “SQL server scratch pad” which is tempdb to do these sorts and as these spills directly to tempdb on disk, our queries become very slow to run.

As we learned in the first part of the series, memory is needed by these memory consuming iterators: Hash match, Sorts and Exchange iterators. There are two phases for each of these iterators. A build phase and a probe phase. Lets go and learn how these three iterators work in detail:

Hash Match: At the build phase, it will take the outer input and build the hash table. It builds a set of hash buckets. Now what are these hash buckets? Each hash bucket is like a memory location which is a pointer to the linked list of rows. Any rows that will hash to the bucket go into the linked list and be placed there. So, the input side is scanned, buckets are created, rows are hashed and then these rows are placed into the linked list in the respective buckets using a Hash function and during this phase, we have a hash table created. Next, the match operation will start where the inner set will be scanned and the rows on inner set are hashed and go to the respective buckets as well. Then they iterate over the linked list to find the match rows.

How do the hash match, spills to disk?

During the build time, the memory is allocated based on the cardinality estimates and the estimates are based on the input size and the probe. The buckets are created in the memory and we place the rows in those respective buckets. If the grant is exceeded then some of the buckets will be send to the disk. As some of the buckets are already in memory and some in disk, the initial probe of the data using the inner set. The probe rows will be scanned if the row hash to the in memory bucket the match is done for those rows. If the rows match to the on disk bucket, the probe row will be return to the disk along with the outer side bucket. So, because of this we have more disk writes to tempdb at the probe time. This is a build side spill.

There is also probe side spills that can occur at the second phase of the hash match iterator. The buckets that has spilled to disk from the first phase should be now moved to the memory back again and then rebuild the hash table, then the spilled inner probe rows will be read back to the memory to match. In this situation the memory is exceeded again so the buckets have to be rewritten to disk. This is called recursion. So this will be the cycle that the plan will not complete making the plan to bailout. Bailout meaning the plan will use the unoptimized nested loop which cannot complete as it uses inner table multiple times. This plan will use little memory but it can run forever.

Sort: Sort will begin sorting in the memory space that was allocated using the quicksort algorithm (The memory requirement is of at least 200% of the input size). If the memory grant is exceeded even a little bit then the entire intermediate set will spill directly to disk and the sort will continue on the disk completely using the merge sort algorithm. . For the Hash match, if the memory grant is exceeded then few of the buckets will go and spill to tempdb.

Exchange: Exchange iterators will deal with the parallel set of rows from multiple threads. There can be three types of exchanges. Distribute(one stream comes in and multiple streams comes out), repartition(depending on the DOP, n streams can come in and n streams can come out and distributed through a different algorithm) and Gather(DOP n streams comes in and 1 stream comes out). For the distribute and the gather, we have DOP set of buffers but for repartition, there is DOP on each side (DOPx2 sets of buffers). There are two sides for the exchange iterator. Producer and consumer side. These sides of the iterator will communicate with each other using buffers/packets of rows. Each side has DOPx2 per stream sets of buffers.

Gather and repartition iterators can come as merging or order preserving meaning if all the streams that are coming in are ordered then the iterators can output the data in the same order. If there is a data skew and if the huge amount of data is flowing in and flowing in very fast then the merging iterators can not keep these as the same order. In such cases, all the data will be spilled to disk and resorted. This situation is called as intra query parallel deadlock which will extremely slow down the queries.

How to check for these spills?

sys.dm_os_waiting_tasks where all the spills will be waiting on the wait IO_completion which means the slow disks. We can also use the SQL Trace to monitor these spills by capturing the Exchange events Hash warning, sort warning and exchange spill. We can also use the sys.dm_db_[task|session]_space_usage which will have the columns for user object allocations like temp tables and table variables and have internal object allocations column which gives information about anything that spills out to tempdb from the workspace memory.

Lets see how these spills can be monitored with the examples (Credits: Adam Machanic)

Create an Extended event filtering the database and choosing the events for Hash and Sort warnings by running below query:

-- Create Event Session SortandHashWarning
CREATE EVENT SESSION SortandHashWarning ON SERVER
ADD EVENT sqlserver.sort_warning(
ACTION(sqlserver.sql_text,sqlserver.tsql_frame)
WHERE (sqlserver.database_id=(8))), -- replace database_id
ADD EVENT sqlserver.hash_spill_details(
ACTION(sqlserver.sql_text)),
ADD EVENT sqlserver.hash_warning(
ACTION(sqlserver.sql_text))
ADD TARGET package0.event_file(SET filename=N'C:\Spills.xel',max_file_size=(50),max_rollover_files=(2))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
GO

Enable the event and watch live data by right clicking on the extended event we created and choose watch live data.

Sort warnings:

Run the below query to see if the data spills to disk:

select top(1000) * from 
(
select top (668935)
*
from bigTransactionHistory
) as x
order by 
x.ActualCost desc
go

When you observe, no data has actually spilled to disk. What if we go ahead and change the rows and see if data spills to disk:

select top(1000) * from 
(
select top (668936)
*
from bigTransactionHistory
) as x
order by 
x.ActualCost desc
go

When you observe, we have 8 sort warnings as this query went parallel (DOP 8 plan) so sort warning was fired on each thread. If you check the duration, it is 0.3sec but previous query is 0.2seconds. There is only some difference in the duration because we only bought one row extra when compared with previous query. By the time the sort spilled to disk, most of the data is already sorted in memory, but if the getting lot and lot of data need to be spilled then the duration will be more as more data needs to be sorted in disk.

Hash spill warnings:

Lets look at the Hash spill with an example. Run the below query and watch live data in extended events.

select count(*) from 
(
select top (3800000) *
from bigTransactionHistory
) as bth
INNER HASH JOIN
(
Select top(10000)
*
from bigTransactionHistory
) as bth1 on
bth.productid=bth1.ProductID
go

This query did not spill to disk. Now lets increase the number of row we get and see the event information.

Now, you can see the hash warnings and the duration is 0.84sec. With out the spill, the duration was 0.84sec. There is no difference in the duration without the spill and with the spill. Due to the Grace hash algorithm used by the hash match iterator, we only spills some of the buckets to the disk. Hash match spills much less unless it hits the bailout stage. Did you observe, we have only 3 threads spilling to disk for the hash warning but not all 8 threads. That is because if one thread have more data then the other thread, that thread might not spill but the threads having more data can spill. In this scenario, it spilled three threads. This happens when you have skew in your data.

Lets go ahead and see that information in the execution plan for the same query by running it again.

When you observe the input number of rows passing through the Hash match iterator, by looking at the properties (F4), the number of rows for each thread is not even, the last three threads are having more rows compared to other threads. These three threads have spilled to disk and that is the reason we have seen only three hash match warnings in the event information.

Exchange spill:

Lets go ahead and add two more events to capture the exchange spill: Exchange_spill event and Lock_deadlock_chain event.

Now lets run a query that will have an exchange spill (Credits: Erik Darling)

Creating the table first

TRUNCATE TABLE DEADLOCK;
INSERT INTO DEADLOCK WITH (TABLOCK)
SELECT
  (RN - 1) / 12500
, REPLICATE('Z', 100)
FROM (
       SELECT TOP (50000) ROW_NUMBER()
        OVER (ORDER BY (SELECT NULL)) RN
       FROM master..spt_values t1
       CROSS JOIN master..spt_values t2
       CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);
UPDATE STATISTICS DEADLOCK CI__DEADLOCK WITH FULLSCAN;

Now, run the below query which can give the expange spills:

SELECT t1.ID
FROM DEADLOCK t1
WHERE EXISTS (
       SELECT 1
       FROM DEADLOCK t2
       WHERE t1.ID = t2.ID
)
ORDER BY t1.ID
OPTION (QUERYTRACEON 8649, MERGE JOIN, MAXDOP 2);

Due to the data coming to the final parallelism iterator and the merging iterator as you can see the order by at the last parallelism iterator. If we have these inter query parallel deadlocks, we can go ahead and see if we have any of these merging iterators in our execution plan.

Summary:

In this part 4 of the series, we have learned about types of spills that can happen from these memory consuming iterators when the grants allocated are not enough and how to make sure we address these issues. In the next coming part, we will learn about how to solve these issues.

Thanks for reading!

2 thoughts on “Memory Grant Internals Part 4

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s