While preparing for my Tempdb presentation, I learned many interesting facts about Tempdb. Thanks so much, Bob Ward (t|g), for providing me with the resources to prepare for my presentation. Bob Ward presented an amazing 3-hour session about Tempdb at the PASS Summit a couple of years ago. This information is invaluable.
Why am I blogging about Tempdb, then?
A lot of things about Tempdb surprised me as I learned about it from Bob Ward's presentation. Here is the link to the complete session. I thought of writing a blog post about these interesting and cool things so you all can know about them as well.
Do you know?
Each time you restart SQL Server, tempdb gets created from the model database and acquires all its properties, including any objects you created in model. First, the primary data file is created by copying one extent at a time from the model database to tempdb. Then the transaction log file is generated, and after that the secondary data files are created. Until tempdb is created, the model and tempdb databases are locked and no one can connect to them, though user connections to the instance are allowed at this point because the master database is opened before tempdb.
You can create tempdb very quickly by using the -f startup parameter while still retaining the original tempdb sizes. This helps when you are having problems starting tempdb.
If you have enabled instant file initialization, creating the data files is quick because SQL Server does not have to zero out the data files before using them. Instant file initialization does not apply to transaction log files. If you want to increase the size of the log file later, the entire new portion of the log has to be zeroed out. When tempdb is recreated on restart and a large transaction log file already exists, the whole log does not have to be zeroed out; only some records in the first virtual log files do.
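You can check whether instant file initialization is in effect straight from a DMV (a sketch; the `instant_file_initialization_enabled` column assumes SQL Server 2016 SP1 or later and requires VIEW SERVER STATE):

```sql
-- Shows Y/N per service; Y means data files can skip zeroing
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;
```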
When a temp table is dropped, its metadata is cached. The temp table data is truncated, and the metadata is kept for reuse.
A temp table is not cached when it is not associated with a stored procedure, function, or trigger. Temp table caching is not supported for ad-hoc queries.
If multiple sessions run the same stored procedure with the same temp table at the same time, multiple temp tables get created, and all of them can be reused later because they are all cached.
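You can get a rough view of how many temp tables and table variables are currently cached from the memory cache counters (a sketch; the cache store name below is how it appears on recent versions):

```sql
-- Entry counts for the temp table / table variable cache store
SELECT name, type, entries_count, entries_in_use_count
FROM sys.dm_os_memory_cache_counters
WHERE name = N'Temporary Tables & Table Variables';
```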
Tempdb does minimal logging. Very little logging is needed because the data is not persisted across a SQL Server restart; tempdb is recreated each time SQL Server starts. The database does not need crash recovery, but it still needs a transaction log for rollbacks. There is no redo for tempdb because the objects we create are not persisted on restart.
When the automatic checkpoint process runs on SQL Server, it skips tempdb. When you manually run a checkpoint, the checkpoint does happen on tempdb, and it can take some time because tempdb is not checkpointed by the regular automatic process. The checkpoint process allows the transaction log of tempdb to be truncated.
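A manual checkpoint on tempdb looks like this (a sketch; on a busy system it can take a while, as noted above):

```sql
USE tempdb;
-- Force a manual checkpoint so the tempdb log can be truncated
CHECKPOINT;

-- See what, if anything, is still holding up log truncation
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'tempdb';
```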
Table variables are stored in tempdb, and there is no rollback support for table variables.
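A quick demonstration: inside a transaction that rolls back, the temp table insert is undone, but the table variable insert is not (a sketch you can run in any database):

```sql
DECLARE @tv TABLE (id INT);
CREATE TABLE #tt (id INT);

BEGIN TRAN;
    INSERT INTO @tv VALUES (1);
    INSERT INTO #tt VALUES (1);
ROLLBACK;

SELECT COUNT(*) AS tv_rows FROM @tv;   -- 1: the table variable kept its row
SELECT COUNT(*) AS tt_rows FROM #tt;   -- 0: the temp table insert rolled back

DROP TABLE #tt;
```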
Index sorting is done in the user database and not in tempdb unless you use the SORT_IN_TEMPDB option. If the sort does not fit in memory, it spills to disk and uses temporary space in tempdb.
You enable the version store in tempdb when you use snapshot isolation, read committed snapshot isolation, online index rebuilds, or triggers.
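You can see how much tempdb space the version store is consuming per database with this DMV (a sketch; available on newer versions — on older ones, sys.dm_tran_version_store gives row-level detail instead):

```sql
-- Version store space consumed in tempdb, per source database
SELECT DB_NAME(database_id) AS database_name,
       reserved_page_count,
       reserved_space_kb
FROM sys.dm_tran_version_store_space_usage;
```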
If you try to change the size of a tempdb file using Management Studio, the file is not actually altered to the new size; a SHRINKFILE happens behind the scenes. When you alter the file manually using a script, the ALTER DATABASE succeeds and the new size takes effect when you restart the server.
Auto shrink is not supported for tempdb, and we cannot shrink the internal objects inside tempdb.
When you run CHECKDB on tempdb, it runs with an implicit TABLOCK because a database snapshot is not supported for CHECKDB on tempdb; in other words, online CHECKDB is not supported for tempdb. CHECKDB on tempdb runs checks on all the tables, including the temp tables we create. We cannot run repair on tempdb because tempdb cannot be placed in single-user mode. CHECKSUM is supported on tempdb.
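Running it looks like this (a sketch; expect brief table locks on tempdb objects because of the implicit TABLOCK described above):

```sql
-- Consistency check on tempdb; acquires table locks instead of using a snapshot
DBCC CHECKDB ('tempdb') WITH NO_INFOMSGS;
```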
We can change the location of the tempdb files while they are in use, but the change takes effect only after a SQL Server restart.
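Moving a tempdb file is just an ALTER DATABASE ... MODIFY FILE; the new path is picked up at the next restart (a sketch — the target path is an example, and tempdev/templog are the default logical file names):

```sql
-- The new location takes effect only after SQL Server restarts
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, FILENAME = N'D:\TempDB\tempdb.mdf');

ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, FILENAME = N'D:\TempDB\templog.ldf');
```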
These are some of the interesting and cool things I learned about tempdb! I hope you learned something new today. Thanks for reading!
Memory-optimized tempdb metadata was introduced in SQL Server 2019 to solve the metadata contention problem we have in Tempdb.
Before learning how to take advantage of this feature, let's learn what metadata contention is.
Metadata contention happens when there is contention on the system pages that store the data about the objects inside tempdb. Each time you create, update, or delete any object in tempdb, these system tables need to be updated. Since tempdb is used by all the applications and databases on the server and there is only one tempdb per instance, metadata contention can occur when these system tables are updated while temporary objects are being created in tempdb.
Metadata contention has been an issue since SQL Server 2000. In SQL Server 2005, temp table caching was introduced to reduce the metadata contention. For inserts, caching the metadata of the temp tables used by stored procedures helped reduce the metadata contention for many years. Periodically, though, the metadata of these cached objects needs to be removed from the cache. During that cleanup, the related rows have to be found and deleted from the metadata tables. As more and more table metadata was added in newer versions of SQL Server, contention was observed while these deletes happened from the cache. Cached objects need to be deleted whenever there is memory pressure, or when the cached objects are invalidated by altering the temp tables after they are created.
To solve this problem, three main changes were added before the memory-optimized tempdb metadata feature was implemented. I will briefly explain those changes here.
The cleanup process was changed from synchronous to asynchronous. Cleanup used to be synchronous because temp tables were dropped when the stored procedure ended. With the asynchronous process, the drop does not occur immediately and is treated as a no-operation, regardless of whether the table is explicitly dropped with the DROP TABLE syntax or the cached object is invalidated by adding an index after creating the temp table. Instead, these objects are moved to a deleted list, and a cleanup thread driven by the cache clock hand sweep process cleans up the deleted list and the related data in the system tables as well.
Only one helper thread does the cleanup, one per NUMA node. Previously, objects on the deleted list were deleted by the multiple threads that came in to create temp tables: any thread that came to create a temp table first had to delete these cached objects from the deleted list, and find and delete their metadata from the system tables, before moving ahead. This logic was later changed so that only one thread is dedicated to the cleanup process, one helper thread per NUMA node.
The latching algorithm was changed. The old algorithm deleted the metadata rows from the tempdb system tables while deleting the cached object from the cache, holding an exclusive latch on the pages, known as PAGELATCH_EX. This exclusive latch was held for the entire process and released only once the metadata rows were cleaned up. The new latching algorithm acquires a shared latch while the cleanup process searches for the metadata rows in the system tables. Once it finds the rows to delete, it takes the exclusive latch and deletes the rows. If no rows are found, the shared latch is simply released.
In SQL Server 2019, Microsoft introduced a new feature called memory-optimized tempdb metadata, which stores the metadata of temp tables in non-durable memory-optimized tables. Let's see what the PAGELATCH_EX wait types look like in general. We run the stored procedure below, created without enabling the feature (the default).
alter PROC TEST
AS
BEGIN
    CREATE TABLE #table1 ([DepartmentID] smallint);   -- column definition assumed for illustration
    INSERT INTO #table1
    SELECT TOP 1000 [DepartmentID]
    FROM [HumanResources].[Department] WITH(NOLOCK);
END
--use SQLQueryStress and run the below query. I used 50 iterations and 100 threads and see the results.
DECLARE @ID INT;
SET @ID = 1;
WHILE @ID <= 100
BEGIN
    EXEC TEST;   -- assumed: each iteration executes the procedure above
    SET @ID = @ID + 1;
END
Check for the page latch waits by running the below query:
--credits: Pam Lahoud
SELECT er.session_id, er.wait_type, er.wait_resource,
    OBJECT_NAME(page_info.[object_id], page_info.database_id) AS [object_name],
    SUBSTRING(st.text, (er.statement_start_offset / 2) + 1,
        ((CASE er.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE er.statement_end_offset
          END - er.statement_start_offset) / 2) + 1) AS statement_text,
    page_info.database_id, page_info.[file_id], page_info.page_id, page_info.[object_id]
FROM sys.dm_exec_requests AS er
CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) AS st
CROSS APPLY sys.fn_PageResCracker(er.page_resource) AS r
CROSS APPLY sys.dm_db_page_info(r.[db_id], r.[file_id], r.page_id, 'DETAILED') AS page_info
WHERE er.wait_type LIKE '%page%';
You will have to refresh a couple of times to see the PAGELATCH_EX wait type. This is metadata contention: we can tell because the object name, sysschobjs, is a system table. To see how the feature works, we need to enable the feature and restart SQL Server.
ALTER SERVER CONFIGURATION SET MEMORY_OPTIMIZED TEMPDB_METADATA=ON;
Restart SQL Server and check whether the feature is enabled:
--should give you the value of 1 if the feature is enabled
SELECT SERVERPROPERTY('IsTempDBMetadataMemoryOptimized') AS IsTempDBMetadataMemoryOptimized;
Run the same query again using SQLQueryStress with 50 iterations and 100 threads.
You will not see any PAGELATCH wait types this time.
If you have heavy workloads on your tempdb, this feature is totally worth trying. With it turned on, metadata contention is reduced because there is no latching or locking of metadata pages. However, there are some limitations to this feature, and that is the reason it is not enabled by default.
You need to restart your SQL Server instance once you enable this feature, and you cannot use columnstore indexes on #temp tables with this feature turned on.
In this final part of the series, let's talk about how we balance these memory grants. If many of your queries request more memory than they require, that causes concurrency issues. Memory grants are allocated to every query before it executes, based on cardinality estimates and the memory-consuming iterators in the plan. Once query memory is allocated, it is released only when the query execution completes, even if the query actually uses only part of the grant. If queries request more memory than they need, we waste memory; if queries receive less memory than they need, there is a performance impact on the running queries.
These issues occur for a few reasons: inaccurate estimates, or non-optimal plans in scenarios where your data is skewed and the plan was compiled for the initial parameter value. For example, if you execute a stored procedure for the first time with a parameter value returning 100 rows, a plan gets generated; when the next parameter value returns 2 million rows, we still use the same execution plan, which is non-optimal for parameters returning that many rows. This is known as parameter sniffing. Sometimes we have an "out of model" situation, meaning something in the query that the query optimizer is not built to handle properly. We need to get out of these "out of model" scenarios to get good estimates. Sometimes rewriting the queries helps as well.
There are a few ways we can fix these problems.
Update the statistics. Even if auto update statistics is set to ON, maintain the statistics regularly using a maintenance plan.
Recompile the query. Recompiling helps keep parameter sniffing issues away.
Try to avoid multi-statement TVFs. Starting with SQL Server 2017, adaptive query processing lines up the estimates for these TVFs with the actual number of rows, but that is not the case for older versions of SQL Server.
Use OPTION (OPTIMIZE FOR) to influence the optimizer.
Check for variable-length data types. For example, suppose a column has a varchar(4000) data type but all your values are just a single character. The estimate is based on the average row size: varchar and other variable-length columns are estimated at only 50% of the defined size.
Use Resource Governor. If you have many high-grant, low-cost queries waiting in the queue, you can categorize them into a workload group by creating a pool just for these types of queries. If you need concurrency, limiting the grant percentage and setting the timeout options on the resource pool helps.
Let's see these with examples.
1. Low memory grant even though the stats were updated. Run the below query and see the actual execution plan, along with the extended event live data, to see whether it spills to disk.
select top(1000) * from
(
    select top (600000) *
    from DBO.bigTransactionHistory
    order by ActualCost desc
) as a
OPTION (MAXDOP 1)
The memory grant was 55 MB, which is much less than the maximum grant, and there is a spill warning in the event information as well. Why did the spill happen when we have enough memory? Statistics were up to date, but the estimated row size may have differed from the actual, and that is the reason it had to spill to disk. In this case, the optimizer was not able to get a good plan for this query.
2. Multi-statement table-valued function. In the second example, let's see how many rows multi-statement TVFs get estimated at. First, let's look at the SQL Server 2012 database compatibility level.
SELECT tst.CustomerID, COUNT(*)
FROM Sales.Customer c
INNER JOIN dbo.TEST_MTVF('1/1/2014') tst on c.CustomerID = tst.CustomerID
INNER JOIN Production.Product prod on tst.ProductID = prod.ProductID
GROUP BY tst.CustomerID
Observe the actual execution plan and hover over the rows passing from the table to the nested loops iterator. The estimated number of rows is 1, but the actual number of rows is 37339, which is a big difference. For SQL Server 2012 and below, the estimate for an MSTVF is always 1, no matter how many actual rows there are.
Now let's place the database in the SQL Server 2016 compatibility level and see the estimates. Run the same query again and check the actual execution plan. When you observe below, the estimate shows 100 rows. For database compatibility levels 130 (SQL Server 2016) and 120 (SQL Server 2014), the estimate for an MSTVF is always 100 rows, no matter how many actual rows there might be.
Let's check the latest version, 2019. Run the same query and check the estimates for the MSTVF. Now the estimates line up exactly with the actual number of rows. The same is true for the SQL Server 2017 database compatibility level (140). That's because adaptive query processing, introduced in 2017, optimizes the execution plan in a better way and gets correct estimates for MSTVFs.
3. Variable-length data types. Let's run the below query, which has a varchar data type, and see how the estimates work here:
select *, CONVERT(VARCHAR(8000), 'a') as new   -- illustrative reconstruction
into #B
from dbo.bigProduct;   -- source table assumed

select * from #B
cross join
( SELECT 1
) AS n (l)
ORDER BY name, l
While the above query runs, if you query select * from sys.dm_exec_query_memory_grants, you will see that the granted memory is 183 MB but the used memory is only 19 MB. The rest of the granted memory is wasted. The grant was sized based on the varchar(8000) column, but only some of that memory is actually used to process the query. Because of this, all the other queries running concurrently have to wait for this query's memory to be released. Once memory has been granted, it is not released until the query completes, whether the query uses that memory or not.
4. Using TOP with the OPTIMIZE FOR option. In this example, let's run the below query and see the event information. We can use a variable with the OPTIMIZE FOR option like below; there is no sort spill warning.
declare @C int=100000
select top(1000) * from
( select top (@C) *
  from dbo.bigTransactionHistory   -- assumed source table
  order by ActualCost desc
) as a
option (MAXDOP 1, OPTIMIZE FOR (@C=150000))
We can use the OPTIMIZE FOR option both ways: we can limit the grant by mentioning a smaller number of rows in OPTIMIZE FOR to get a lower grant, or request a bigger grant by optimizing for a larger value than the variable holds, as in the example above.
5. We can also change the resource semaphore option request_max_memory_grant_percent to the needed percentage; the default is 25%. This places a limit on how much memory a query in the workload group can take. In the example below, we set the limit to 15% by altering the workload group:
Alter Workload Group [default] with (request_max_memory_grant_percent = 15);
Alter Resource Governor RECONFIGURE;
In this final part of the series, we learned how to balance these memory grants, with examples. This is the final part of the Memory Grant Internals series. Too much or too little memory grant is bad; balancing the memory is important. Across the series, we started the learning path with memory-consuming iterators, how these iterators work and share memory between them, what Resource Governor is and its parts, how spills to disk happen with these iterators, how to identify these spills, and finally how to avoid them.
Hope you liked the series and learned from it. If you have any ideas about the topics we can cover in coming blog posts, please do share with me. I would love to learn and share the knowledge while I learn along with you.
In the previous part of the series, we learned what query memory (workspace memory) is and how it works, what happens when the memory allocated to a query is not sufficient, and how queries may have to wait before they get their memory grants. In this part 4 of the series, let's learn how queries spill to disk when the grants allocated are not enough, and how to make sure we address and solve these issues.
If your tables are huge and your queries sort them, the sort takes an enormous amount of memory, and it is very hard to allocate memory for such sorts unless you have a huge amount of memory. In such cases, the data has to spill to disk. Spilling to disk means using the "SQL Server scratch pad", which is tempdb, to do these sorts, and because these spills go directly to tempdb on disk, our queries become very slow.
As we learned in the first part of the series, memory is needed by these memory-consuming iterators: hash match, sort, and exchange iterators. The hash match has two phases, a build phase and a probe phase. Let's learn how these three iterators work in detail:
Hash match: In the build phase, it takes the outer input and builds the hash table as a set of hash buckets. Now what are these hash buckets? Each bucket is a memory location holding a pointer to a linked list of rows. Any row that hashes to the bucket goes into that linked list. So the input side is scanned, buckets are created, rows are hashed with a hash function and placed into the linked list in the respective bucket, and at the end of this phase we have a hash table. Next, the match operation starts: the inner set is scanned, its rows are hashed to the respective buckets as well, and the linked list is iterated to find the matching rows.
How does a hash match spill to disk?
At build time, memory is allocated based on the cardinality estimates, which are based on the input size and the probe. The buckets are created in memory and rows are placed in their respective buckets. If the grant is exceeded, some of the buckets are sent to disk. Because some buckets are now in memory and some on disk, the initial probe of the inner set is partial: probe rows are scanned, and if a row hashes to an in-memory bucket, the match is done right away. If a row hashes to an on-disk bucket, the probe row is written to disk along with the outer-side bucket. Because of this, we get more disk writes to tempdb at probe time. This is a build-side spill.
There are also probe-side spills, which can occur in the second phase of the hash match iterator. The buckets that spilled to disk in the first phase have to be moved back into memory to rebuild the hash table, and the spilled inner probe rows are read back into memory to match. If the memory is exceeded again, the buckets have to be rewritten to disk. This is called recursion. The cycle can reach a point where the plan cannot make progress, causing the plan to bail out. Bailout means the plan falls back to an unoptimized nested loops algorithm that scans the inner table multiple times. This plan uses little memory, but it can run more or less forever.
Sort: A sort begins sorting in the allocated memory space using the quicksort algorithm (the memory requirement is at least 200% of the input size). If the memory grant is exceeded even a little bit, the entire intermediate set spills directly to disk, and the sort continues completely on disk using the merge sort algorithm. Contrast this with the hash match: when its grant is exceeded, only a few of its buckets spill to tempdb.
Exchange: Exchange iterators deal with parallel sets of rows from multiple threads. There are three types of exchange: distribute (one stream comes in and multiple streams come out), repartition (DOP streams come in and DOP streams come out, redistributed through a different algorithm), and gather (DOP streams come in and one stream comes out). For distribute and gather there is a DOP set of buffers, but for repartition there is DOP on each side (DOP x 2 sets of buffers). An exchange iterator has two sides, producer and consumer, which communicate with each other using buffers (packets of rows).
Gather and repartition iterators can be merging (order-preserving), meaning that if all the incoming streams are ordered, the iterator outputs the data in the same order. If there is data skew and a huge amount of data flows in very fast, the merging iterators cannot keep up while preserving the order. In such cases, all the data is spilled to disk and re-sorted. This situation is called an intra-query parallel deadlock, and it slows queries down dramatically.
How to check for these spills?
Check sys.dm_os_waiting_tasks, where all the spills will be waiting on the IO_COMPLETION wait, which points at slow disks. We can also use SQL Trace to monitor these spills by capturing the Hash Warning, Sort Warnings, and Exchange Spill events. We can also use sys.dm_db_task_space_usage and sys.dm_db_session_space_usage, which have columns for user object allocations (temp tables and table variables) and for internal object allocations (anything that spills to tempdb from workspace memory).
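For example, a quick look at which running tasks are allocating internal objects in tempdb (a sketch; the counts are in 8 KB pages):

```sql
-- Tasks currently allocating tempdb space; internal objects = spills, sorts, hashes
SELECT session_id,
       internal_objects_alloc_page_count,
       internal_objects_dealloc_page_count,
       user_objects_alloc_page_count
FROM sys.dm_db_task_space_usage
WHERE internal_objects_alloc_page_count > 0;
```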
Let's see how these spills can be monitored, with examples (credits: Adam Machanic).
Create an extended event session filtering on the database and choosing the events for hash and sort warnings by running the below query:
-- Create Event Session SortandHashWarning
CREATE EVENT SESSION SortandHashWarning ON SERVER
ADD EVENT sqlserver.sort_warning(
    WHERE (sqlserver.database_id=(8))),   -- replace database_id
ADD EVENT sqlserver.hash_spill_details(
    WHERE (sqlserver.database_id=(8))),   -- filter assumed to match the first event
ADD EVENT sqlserver.hash_warning(
    WHERE (sqlserver.database_id=(8)))    -- filter assumed to match the first event
ADD TARGET package0.event_file(SET filename=N'C:\Spills.xel',max_file_size=(50),max_rollover_files=(2))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF);
Start the session and watch live data by right-clicking the extended event session we created and choosing Watch Live Data.
Run the below query to see if the data spills to disk:
select top(1000) * from
( select top (668935) *
  from dbo.bigTransactionHistory   -- assumed source table
  order by ActualCost
) as x
When you observe, no data has actually spilled to disk. What if we go ahead and add one more row to the TOP and see if the data spills to disk:
select top(1000) * from
( select top (668936) *
  from dbo.bigTransactionHistory   -- assumed source table
  order by ActualCost
) as x
When you observe, we have 8 sort warnings because this query went parallel (a DOP 8 plan), so a sort warning fired on each thread. If you check the duration, it is 0.3 sec, while the previous query took 0.2 sec. There is only a small difference in duration because we only brought in one extra row compared with the previous query. By the time the sort spilled to disk, most of the data was already sorted in memory; if a lot more data needs to be spilled, the duration will be higher, as more data needs to be sorted on disk.
Hash spill warnings:
Let's look at the hash spill with an example. Run the below query and watch the live data in extended events.
select count(*) from
( select top (3800000) *
  from dbo.bigTransactionHistory   -- assumed source table
) as bth
INNER HASH JOIN
( select top (3800000) *
  from dbo.bigTransactionHistory
) as bth1 on
  bth.TransactionID = bth1.TransactionID   -- join column assumed
This query did not spill to disk. Now let's increase the number of rows we get and see the event information.
Now you can see the hash warnings, and the duration is 0.84 sec, about the same as without the spill. Due to the Grace hash algorithm used by the hash match iterator, only some of the buckets spill to disk; a hash match spills much less unless it hits the bailout stage. Did you notice that only 3 threads spilled to disk for the hash warning, not all 8? That is because if one thread has more data than another, the threads with more data can spill while the others do not. In this scenario, three threads spilled. This happens when you have skew in your data.
Let's go ahead and see that information in the execution plan by running the same query again.
When you observe the number of input rows passing through the hash match iterator by looking at the properties (F4), the rows are not evenly distributed across threads; the last three threads have more rows than the others. Those three threads spilled to disk, and that is why we saw only three hash warnings in the event information.
Let's go ahead and add two more events to capture the exchange spill: the exchange_spill event and the lock_deadlock_chain event.
Now let's run a query that will have an exchange spill (credits: Erik Darling).
Creating the table first
-- table definition assumed for illustration
CREATE TABLE DEADLOCK (ID INT NOT NULL, PAD VARCHAR(100));
CREATE CLUSTERED INDEX CI__DEADLOCK ON DEADLOCK (ID);

TRUNCATE TABLE DEADLOCK;

INSERT INTO DEADLOCK WITH (TABLOCK)
SELECT (RN - 1) / 12500
     , REPLICATE('Z', 100)
FROM (
    SELECT TOP (50000) ROW_NUMBER()
        OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
    CROSS JOIN master..spt_values t3
) AS x
OPTION (MAXDOP 1);

UPDATE STATISTICS DEADLOCK CI__DEADLOCK WITH FULLSCAN;
Now, run the below query, which can produce the exchange spills:
SELECT t1.ID   -- select lists assumed
FROM DEADLOCK t1
WHERE EXISTS (
    SELECT 1
    FROM DEADLOCK t2
    WHERE t1.ID = t2.ID
)
ORDER BY t1.ID
OPTION (QUERYTRACEON 8649, MERGE JOIN, MAXDOP 2);
The spill happens because the data arriving at the final parallelism iterator goes through a merging (order-preserving) iterator, as you can see from the Order By on the last parallelism iterator. If we hit these intra-query parallel deadlocks, we can check whether we have any of these merging iterators in our execution plan.
In this part 4 of the series, we learned about the types of spills that can happen from these memory-consuming iterators when the grants allocated are not enough. In the next part, we will learn how to solve these issues.
We have 1 GB of memory. 1 GB is about 128,000 pages: each page is 8 KB, and 8 KB x 128,000 gives roughly 1 GB. There are two default resource semaphore pools, shown in DBCC MEMORYSTATUS as Query Memory Objects (default) available and Small Query Memory Objects (default) available. When we add both values, 87823 + 4619 = 92442, and divide by 128,000, we get 0.72, which is the fraction of the total buffer pool available for query memory.
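The same numbers can be pulled from a DMV instead of DBCC MEMORYSTATUS (a sketch; one row per resource semaphore, values in kilobytes):

```sql
-- Regular (resource_semaphore_id = 0) and small-query (= 1) semaphores
SELECT resource_semaphore_id,
       target_memory_kb,
       available_memory_kb,
       granted_memory_kb,
       grantee_count,
       waiter_count
FROM sys.dm_exec_query_resource_semaphores;
```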
Let's go ahead and run a query and check the execution plan for the memory grant:
select top (1000) * from
( select top (10000000) *
  from [Sales].[SalesOrderHeaderEnlarged]
  order by TotalDue desc   -- assumed sort column to force a memory grant
) as a
OPTION (MAXDOP 1)
When you observe the memory grant, it is 183 MB. What if we go and get more rows this time?
select top (1000) * from
( select top (15000000) *
  from [Sales].[SalesOrderHeaderEnlarged]
  order by TotalDue desc   -- assumed sort column to force a memory grant
) as a
OPTION (MAXDOP 1)
If you run select * from sys.dm_exec_query_memory_grants and look at the ideal memory SQL Server wanted to allocate for this query, we see 1022120 KB (about 1 GB). Because max server memory is set to 1 GB, the requested_memory_kb is 183496 KB (183 MB) and used_memory_kb is 183496 KB (183 MB).
The memory grant for the above query is 183496 / 8 = 22937 pages. The total grant available to queries is 92442 pages, so we can run the above query about 4 times before hitting the total memory grant (22937 x 4 = 91748). We can use the SQLQueryStress tool to run the query and watch the memory grants. Let's run the query with 100 iterations and 10 threads.
While the query runs, check the memory grants by running select * from sys.dm_exec_query_memory_grants and see the queries running.
Three queries are running, and the others are waiting for memory grants, shown by a NULL grant_time: they are waiting for grants to be released so they can run. Only three queries run at once because each query requires 150% of the requested memory grant. Once these queries complete, the other queries get their memory grants.
Workload groups have three importance levels for queries: low (queues 0-4), medium (queues 5-9), and high (queues 10-14). By default, the default workload group has medium importance.
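Changing a workload group's importance is a one-line alter (a sketch; shown on the default group for simplicity, though in practice you would usually use a dedicated group):

```sql
-- Raise the scheduling importance of queries in this workload group
ALTER WORKLOAD GROUP [default] WITH (IMPORTANCE = HIGH);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```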
Adam Machanic has explained, with some awesome example queries, how a high-cost query with a low memory grant and a high-cost query that requires a high memory grant behave. Let's go ahead and look at these examples (credits: Adam Machanic).
Let's go ahead and run the below query, which has a high query cost but a low memory grant. Run SQLQueryStress at the same time.
select top (5000) p.*, ac.actualcost   -- select list and CROSS APPLY shape assumed
from dbo.bigProduct as p
cross apply
( Select count(*) as actualcost
  from dbo.bigTransactionHistory as b
  where b.ActualCost between p.productid-1000 and p.productid+1000
) as ac
ORDER BY name
OPTION (MAXDOP 1)
When you check the actual execution plan, the estimated cost is very high: 11479.3.
But when you check the input to the sort, its estimated data size is only 263 KB, which should need very little memory.
While you run the above query, run SQLQueryStress again to see the running queries and the queues.
When you do select * from sys.dm_exec_query_memory_grants, you see that spid 56 is waiting, with grant_time shown as NULL, which means the query is waiting. This query is waiting for 1.1 MB of memory, even though the requested memory is very small. This is due to the high query cost. Look at the queue id: it is shown as 9, so the low-cost queries, which have queue id 8, all get to run before spid 56 can. The timeout is 86400 seconds, which is a long time for a timeout to occur. SQL Server prioritizes low-cost queries to run first, and as this query is high in cost and low in memory, it still has to wait until all the low-cost queries run.
Now, if we have lots of queries with a high query cost that only need a little memory to run, how do we resolve this? We can alter the default workload group and set request_memory_grant_timeout_sec. Lets go ahead and alter the default workload group to set the timeout to 10 seconds.
Alter workload group [default] with (request_memory_grant_timeout_sec = 10)
Alter resource governor reconfigure
Now run the same query again with the other queries running through SQLQueryStress and look at the results by running select * from sys.dm_exec_query_memory_grants. When you observe, the timeout_sec on all queries has changed to 10 seconds. Now the query with spid 56 only has to wait 10 seconds before it gets the memory granted to run.
After 10 seconds, if you run select * from sys.dm_exec_query_memory_grants again, the query should be running and the grant_time is no longer NULL.
We can also view these waits for memory grants in sys.dm_os_waiting_tasks. You can see all the spids waiting on their resource semaphore for a grant before they can run.
select * from sys.dm_os_waiting_tasks
Lets run a high-cost, high-memory-grant query and check the execution plan for the query cost (129778) and the memory grant (566MB). Remember, the total available memory is only 1GB, so this query needs almost the entire available memory to run.
select
    p.ProductID,
    b.thecount
from
    ( select top (1000)   -- inner row goal assumed; the original TOP value is not recoverable
          ProductID
      from dbo.bigTransactionHistory
      order by ActualCost ) as p
cross apply
    ( select count(*) as thecount
      from dbo.bigTransactionHistory as b
      where b.ActualCost between p.ProductID - 1000 and p.ProductID + 1000 ) as b
order by b.thecount desc, p.ProductID
OPTION (MAXDOP 1)
Lets execute this query along with the other queries running through SQLQueryStress. When you check sys.dm_exec_query_memory_grants, the requested memory grant is 185MB, but the query needs 150% of the requested memory grant to be available before it can start running.
Now, when I query sys.dm_exec_query_memory_grants again and check the granted memory, the query has its grant: we hit the 10-second timeout, and once that timeout completed, it received a granted memory of 640KB, enough to start running the query.
Here, "timeout" does not mean the query itself times out; instead, the query waits for the timeout period (10 seconds in this case), and then SQL Server decides how to grant the memory. Once the timeout passes, one of three things happens. If there is enough memory for the full grant, the query runs. If there is not enough memory even for the minimum required grant, we get an error. If there is enough memory for the minimum grant but not for the full grant, the grant is scaled all the way back and only the minimum memory needed to run is granted. That is what happened here: this query waited for the 10-second timeout and then ran with the minimum required grant (640KB).
In this part 3 of the memory grant internals series, we have seen how query memory/workspace memory works. We have also learned what happens when we do not have enough memory. That's when things can get complicated: the sorts and hashes have to spill to disk in tempdb.
In the next part of the series, we will learn how queries spill to disk when the memory grant is not large enough for the query to sort in memory. When a query spills to disk, it runs much slower. Lets learn about that in the next part of the series.
Once the query optimizer finishes its job, the query processor starts its work. The query processor first looks at the importance of the query based on its workload group. So, what is this workload group? To know what it is, we need to dive deep into Resource Governor. As we are dedicating this series to diving deep into memory grant internals, lets focus this post on Resource Governor, which handles the memory grants for our queries.
Resource Governor was introduced in SQL Server 2008. Its job is to control and manage the consumption of the CPU, memory and IOPS resources used by incoming requests. Resource Governor ensures that one runaway query does not consume the entire memory.
There are three different components in the resource governor.
Resource Pool: A defined share of the CPU, memory and IOPS resources is collectively called a resource pool.
Workload Group: A container that groups incoming connections based on priority is known as a workload group.
Classifier: A user-defined function created in the master database in which we define the criteria to route incoming requests to their respective workload groups. A classifier function can classify incoming connections by application name, database name, login name, etc.
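As a sketch of what a classifier could look like (the login name and workload group name below are made up for illustration, not from the original post):

```sql
create function dbo.fn_rg_classifier()
returns sysname
with schemabinding
as
begin
    -- route a hypothetical reporting login to a hypothetical workload group;
    -- everything else falls through to the default group
    declare @grp sysname = N'default';
    if SUSER_SNAME() = N'ReportsUser'
        set @grp = N'ReportsGroup';
    return @grp;
end;
```

The function is then registered with ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_rg_classifier), followed by ALTER RESOURCE GOVERNOR RECONFIGURE.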
If your SQL Server hosts many database applications connecting to a single instance, requests from all the applications compete for the resources needed to process them, which can cause serious performance issues. With the help of Resource Governor, we can distribute the available resources to specific workloads based on low, medium and high priorities. SQL Server supports up to 64 user-defined resource pools.
When we install SQL Server, two pools are automatically created: the internal pool and the default pool. The internal pool is used for internal background processes, like DAC connections. The internal pool cannot be altered, and it takes priority over any user-defined resource pool. The default pool handles all user connections that are not routed to a user-defined resource pool.
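You can confirm these two built-in pools (and any user-defined ones) from the catalog views:

```sql
-- lists the internal and default pools, plus any user-defined pools
select pool_id, name
from sys.resource_governor_resource_pools;

-- workload groups and the pool each one belongs to
select group_id, name, importance, pool_id
from sys.resource_governor_workload_groups;
```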
So, when a request is sent to the SQL Server, that session will be diverted to the associated workload group. This workload group will use the resources provided in the respective resource pool.
Now, lets get to the query processor. First, the query processor categorizes the incoming query based on its workload group. Next, it looks at the estimated cost and the size of the memory grant and places the query into a set of queues known as resource semaphores. These queues force queries to wait to get into the semaphore so they can get some memory to run. A resource semaphore is a gateway where SQL queries have to wait until they get the memory to even start processing. If SQL Server runs out of memory and queries are waiting for memory to be released, we see resource semaphore wait types.
A resource pool has two primary resource semaphores, a small one and a large one, each with its own memory. The small semaphore is for queries with a cost less than 3 and a grant size less than 5MB; these queries use a smaller pool of memory. The large semaphore is for larger queries and uses a different set of queues for large memory grants.
Each of these semaphores has three sets of queues: low, medium and high priority. High-priority queries get their memory grants first, and low-priority queries have to wait. The large semaphore has five queues per set. Within a set, priority works differently: low-cost queries get higher priority. Low-cost queries run faster than high-cost queries, and that is why SQL Server prioritizes them first; they finish quickly compared with high-cost queries that need larger memory grants.
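The small and large semaphores described above can be observed directly through a DMV:

```sql
-- resource_semaphore_id 0 = regular (large) semaphore, 1 = small-query semaphore
select pool_id, resource_semaphore_id,
       total_memory_kb, available_memory_kb,
       grantee_count, waiter_count
from sys.dm_exec_query_resource_semaphores;
```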
Queries don't start unless the available memory is 150% of the requested query memory, because SQL Server always allocates a little additional buffer memory. Take a query with a 100MB grant: it will look for 150MB including the additional buffer. What if only the 100MB is available at the beginning and the additional buffer is not? In that case, the query waits, up to 25 times the query cost, in seconds. When that timeout occurs, if the exact grant (100MB) is available without the buffer amount, the query starts to run. In another scenario, even after the timeout, the exact 100MB is not available and only less than 100MB is free; in that case, the grant can be decreased to the minimum required grant. And if even the minimum required grant is not available, we receive error 8645. We can configure the timeout by using Resource Governor.
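For example, the timeout can be shortened for the default workload group (the 60-second value here is just illustrative):

```sql
-- wait at most 60 seconds for a grant instead of 25x the query cost
alter workload group [default] with (request_memory_grant_timeout_sec = 60);
alter resource governor reconfigure;
```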
Configuration options for Resource Governor
sp_configure: You can set options at the server level. You have max server memory (MB); the query memory is taken out of the buffer pool, which sits within max server memory. We also have the server-level default query wait (s), in seconds, whose default value of -1 means a query waits up to 25x its estimated cost. We also have the server-level option min memory per query (KB), which is 1MB (1024KB) by default.
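These server-level options can be inspected (and changed) with sp_configure; shown here read-only, using the actual option names:

```sql
exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'max server memory (MB)';
exec sp_configure 'query wait (s)';
exec sp_configure 'min memory per query (KB)';
```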
Resource pool: There are two main settings, min_memory_percent and max_memory_percent. These determine how much of the buffer pool's query memory can be used by the resource pool.
Workload Group: Within each workload group, we can set the importance level. We have request_memory_grant_timeout_sec, where we can set the timeout specifically for the workload group, and request_max_memory_grant_percent, where we can set the percentage of the resource pool's grant memory that a query in this workload group can take. When you have several different types of workloads on the server, we can classify them and give each workload a very specific memory allotment.
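Putting the pool and group settings together, a hypothetical reporting workload could be capped like this (all names and numbers below are illustrative, not from the original post):

```sql
-- pool: reserve 10% and cap at 40% of query memory
create resource pool ReportsPool
    with (min_memory_percent = 10, max_memory_percent = 40);

-- group: low importance, 60s grant timeout,
-- a single query may take at most 25% of the pool's grant memory
create workload group ReportsGroup
    with (importance = low,
          request_memory_grant_timeout_sec = 60,
          request_max_memory_grant_percent = 25)
    using ReportsPool;

alter resource governor reconfigure;
```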
We can check the memory status by using these commands:
DBCC MEMORYSTATUS: look at the current and max values of Small Query Memory Objects and Query Memory Objects for information about the maximum workspace/query memory. These two correspond to the default resource semaphores.
DMVs like sys.dm_exec_query_resource_semaphores (information about all the resource semaphores and their queues) and sys.dm_exec_query_memory_grants provide a lot of information to check when you have memory issues. The latter returns one row for each query that has memory-consuming iterators, telling you whether the query is waiting for memory or has been granted it, and how much requested, required and ideal memory it has. Another DMV we can use is sys.dm_os_waiting_tasks, which returns rows for requests that are waiting, including waits for memory.
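A typical monitoring query combining the grant and semaphore views might look like this:

```sql
-- one row per memory-consuming query, joined to its semaphore's state
select g.session_id, g.request_time, g.grant_time,
       g.requested_memory_kb, g.granted_memory_kb,
       g.required_memory_kb, g.ideal_memory_kb,
       s.waiter_count, s.available_memory_kb
from sys.dm_exec_query_memory_grants as g
join sys.dm_exec_query_resource_semaphores as s
    on  g.pool_id = s.pool_id
    and g.resource_semaphore_id = s.resource_semaphore_id
order by g.request_time;
```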
Lets start doing a hands on lab:
Let me set my max server memory to 1GB so we have a small, even playing field:
sp_configure 'show advanced options', 1;
RECONFIGURE;
sp_configure 'max server memory', 1000;
RECONFIGURE;
Run a simple query with the actual execution plan enabled to see how the memory grant is allocated for a query:
Select top (1000) *
from dbo.bigTransactionHistory
order by ActualCost
OPTION (MAXDOP 1)
We are choosing OPTION (MAXDOP 1) because we want to run this on a single thread; if we use parallelism, memory grants are allocated per thread. To keep things simple, we use a MAXDOP value of 1. Now, when you observe the execution plan and check the memory grant on the select operator, you see the value 10776KB.
Now see the estimated data size by hovering over the arrow pointing to the sort operator. The estimated data size is 4707KB. The input size for the sort is 200%, double the estimated data size: 4707KB x 2 = 9414KB, plus an additional buffer, makes the memory grant 10776KB. That is how memory grants are calculated from the estimated data size.
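A quick check of the arithmetic (the leftover above the doubled input size is the extra buffer SQL Server adds; the exact buffer formula is not spelled out here):

```sql
select 4707 * 2            as sort_input_kb,    -- 9414
       10776 - (4707 * 2)  as extra_buffer_kb;  -- 1362
```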
Now, lets go ahead and see how memory is shared between the iterators. Lets run the below query and see the execution plan.
select top (50)
    *
from
    ( select top (100)
          *
      from
          ( select top (200)   -- innermost TOP assumed; the original value is not recoverable
                *
            from dbo.bigTransactionHistory
            order by ActualCost desc ) as a
      order by ActualCost ) as b
order by ActualCost desc
OPTION (MAXDOP 1)
Press F4 to see the properties of the first sort and look at its memory fractions. The memory fraction input shows 1, which means the input phase can use all of the granted memory; once the build phase is complete, half the memory is released, represented as 0.5 in the memory fraction output. That released memory is available to the next iterator.
Now, press F4 on the next sort iterator and check the memory fraction values. The total memory available now is 0.5, passed down from the previous sort. During the build phase it uses 0.5, and after the build it still has 0.5 available. This memory is then passed to the next sort iterator.
Lets look at the third sort iterator; press F4 to see the values. It uses the same 0.5 from the previous iterator for its build phase, and after the build phase the total memory granted to this query is released.
This is how we analyze the memory utilization of a given plan: look at the memory grant on the final iterator, then follow the memory fractions down the stream to see how much each iterator tries to use.
In this memory grant internals part 2, we have learned about Resource Governor, its components, its configuration options, how to monitor memory status, and how memory is used by each iterator and shared between them before being released.
In part 3 of the series, we are going to learn how to monitor the memory grants by using the DMVs, with examples and much more…
Memory!! Our servers need memory to store our data and execution plans, to hold temporary data, and to do some system operations as well. Memory is needed for query compilation, to store the plan in the cache once compiled, and to process our queries' sort and hash operations.
Now, you may have a question: why should I care about knowing the internals of memory grants?
Memory is like a set of toys we give to a few kids at a party. Everything is well and fine while they generously share the toys with each other. Then many more kids join the party, and now everyone wants to play with the same set of toys. You know what happens next: screaming and fighting everywhere. No kid is happy until they get a chance to play with the toys. What do we say to kids fighting over the toys?? "We need to share…"
The same thing happens with memory. Every query needs a memory grant (query memory grant) that must be allocated before the query moves to the execution phase. Memory grants are sized based on the hash and sort operations in the plan and on the cardinality estimates.
Granting a query more memory than it requires causes concurrency issues; too little memory, on the other hand, causes pages to spill to disk (tempdb). Too much is bad, and too little is bad too. We need to make sure we are sharing the memory the right way!
By the end of this memory grant internals series, you will understand how to manage memory grants effectively by using Resource Governor. We will also learn about the memory grant feedback feature (SQL Server 2019) and how it helps allocate memory grants that keep hash and sort operations from spilling to disk.
Introduction to Memory Grants
SQL Server has different types of components (plan cache, buffer pool and other memory components) in its memory area and requires memory for many things. We also require memory to process our queries. Each operator in the execution plan requires memory to pass rows from one operator to the next. Some iterators have to buffer the rows coming into them before sending them on to the next iterator in the plan; these iterators need more memory and are called memory-consuming iterators. The grants given to these memory-consuming iterators are called query memory grants.
There are three main memory consuming iterators:
Hash Match iterator: takes the input rows and builds a hash table before the hash table can be probed.
Sort iterator: takes the rows and sorts them in the required order.
Exchange iterator: takes the incoming sets of rows and exchanges them between buffers to send to the next iterator.
Memory is not free! So, where do these memory grants come from?
There is a component known as fixed memory in the memory architecture of SQL Server. This memory is "fixed" because it cannot be shrunk while it is in use. It holds the memory grants we use for hashing and sorting in our queries. Unlike caches such as the plan cache and the buffer pool, which can easily expand and shrink under memory pressure, fixed memory cannot expand or shrink while in use. Once memory is granted to a query for its execution, the grant cannot be released until the query completes, whether the query actually uses the granted memory or not.
Fixed memory takes its memory from the buffer pool and the plan cache. As more and more queries request memory grants, SQL Server uses buffer pool memory, up to a maximum of approximately 75% of the total buffer pool.
Every query sent to SQL Server first goes through the pre-optimization stage, which is parsing and binding. Then the optimizer comes up with an ideal grant: the total amount of memory the query needs to perform all of its operations (the memory-consuming iterators) in memory, without having to spill to disk. If the ideal grant is very high, the grant is leveled down based on the resource pool settings.
We are going to learn about the resource governor and resource pool in coming parts of the series.
How does the optimizer get to the ideal grant?
Statistics: The optimizer trusts the statistics to estimate the number of rows coming out of the iterators in the execution plan. It looks at the row count and row size estimates.
Hash and sort memory-consuming iterators: The grant depends on the hashes and sorts in the execution plan and on the row count and row size of their input sets. These iterators can actually share memory grants between them, which is known as memory fractions.
Sharing the memory:
Memory-consuming iterators (hash and sort) have two phases: a build phase and a utilization phase. For the sort operator, the build phase takes the rows and sorts them; once they are sorted, they are scanned. For the sort build phase, the memory needed is about 200% of the input data size, which is huge; after the build phase, the memory drops to 100%. For the hash iterator, the build phase takes the rows and builds a hash table; then the hash table is probed. Once memory is released at the operator level after the build phase, the next operator downstream can use that same memory.
In this part of the series, we have learned:
What are memory grants?
What are the main memory grant iterators?
Where do these memory grants come from?
How does the compiler get the ideal grant for a query?
How hash and sort operators share the memory?
In part 2 of the series, we are going to learn how Resource Governor plays a part in managing these memory grants.