With Rob Volk Invitation to October month of T-SQL Tuesday for the Data Analogies, I tried thinking of an analogy for SQL Server Statistics.
Here is my analogy:
For example, One of our friends requested us to volunteer for organizing their kid first birthday party. You and me, we both are more than happy to help. We are so excited to celebrate it with all our friends inviting as many people as possible.
We gathered invitee list together making thousand people in total including adults and children. We planned to create a RSVP list with some beautiful invitation link requesting all the invitees to submit the RSVP with head count number for girl kids and boy kids along with the head count number of families without kids. We sent the invite two months before the event date expecting everyone would submit the RSVP. Based up on the number of head count (girls, boys and families without kids), we want to purchase the return gifts.
Girls get Barbie dolls, boys get Car toy and families without kids get a nice home decor set.
We waited for the RSVP submissions and gathered the results five days before the event date so we can shop for the return gifts. Out of thousand invitees, we received only three hundred RSVP’s. We believed that is the only head count we will see on our event date and we just bought return gifts and food items for three hundred people.
On our surprise, we see thousand people on the event date at the party since the party is on a weekend. We now understood that many people were actually willing to come to the party but for some reason, they did not had chance to do the RSVP. Now, we are all tensed up not knowing what to do.
This is not at all what we expect would happen. We expected something and what actually happening at this event is something else. Everything got messed up now. There is stress everywhere more than fun in the party. We immediately requested some of our friends to order for more dishes, asked some other friends to grab some random gifts from near by store. Many invitees were waiting for their turn to get the food on their plate. People frustrations were clearly seen on their faces. One situation carried to next situation in the event making the whole event a disaster. We anyway passed the event but with no satisfaction. We should have had good estimates of the head count to have the good party plan.
Instead, if we have got the good percent of RSVP’s like around 80% of the total thousand invitees, we would have been in the better shape with the plan. 80% of thousand is 800 people. This would have given us the better estimate close to the actual estimate of the head count. What if 100% of the invitees submitted the RSVP, that would be wonderful, right? Estimation would have been exact match with our actual head count in the party. WOW, that is so good to be true. There is no one single party I have seen where all invitees submitted their RSVP’s. There are always people missing RSVP’s. It is always still good to have the hundred percent estimate though. From invitees point of view, it takes time and effort to make sure they all update the RSVP’s to get the exact estimate.
For making this event successful, we should have been a little cautious while planning for getting these RSVP’s updated by the invitees. We would have send them the notification for every few days or once every week depending up on the number of invitees list. The more invitees we add to the event and the more updates we do to the list, the more frequent notification emails for our RSVP’s should be.
Now we know what we have missed. As we always learn from our past mistakes, we now learn we would have had send the notifications regularly so we get the RSVP’s updated and so we get close estimation which ultimately produces a good party plan.
What if the same concept applies to the SQL Server statistics?
SQL Server optimizer always get the estimate from statistics for how many number of rows (Invitee column statistics) it might expect to get. How does it do that? It will gather and trust the estimates being correct from statistics. Invitees is the column statistics here and RSVP’s are our statistics updates. Our notifications to update the RSVP’s is same as updating our Invitees column statistics using sql agent update stats job or through AUTO_UPDATE_STATS option enabled or through Ola hellengren scripts.
From the estimates of our RSVP’s, we ordered food and gifts for only 300 people out of 1000 people. In the same way, SQL optimizer will allocate less memory based up on the estimates. Because of the less number of the gifts we bought, we requested other friends to go and purchase gifts immediately from the near by gift store. This caused stress on our friends and they did not find all the gifts they need anyway. This made the whole party plan go bad. In the same way, due to the bad statistics estimation, Sql optimizer allocates less memory to process the query which is not sufficient and so it will finally have to spill to the disk. One bad column statistics estimate carries the stress to other operators in the query plan effecting the other operators in the plan and eventually effect complete execution plan.
To solve this issue, Sql Server invitees column stats should be updated regularly.
Depending up on the number of inserts and updates happen to this invitees column in the table, we have to plan for updating these statistics as well. As it takes energy and time for most of the invitees to go and update the RSVP’s, it takes resources and time for updating the invitees column statistics as well.
In order to get good estimates, RSVP’s should be updated as much as possible. If there are not many invitees that are being updated frequently, decent percentage of RSVP’s should be always good for good enough party plan. In the same way, decent amount of percentage stats update should be good enough to generate good plan. If updates happens regularly to the invitees column, frequent statistics updates is recommended.
It is always best to set the AUTO_UPDATE_STATISTICS option enabled. With this option enabled, Sql Server will go ahead and update these statistics automatically in the background.
For the versions SQL Server 2014 and below, the auto update stats kicks in when the threshold hits when 20% of the records gets updated+500 record change but for the later versions SQL Server 2016 and above, auto update stats kicks in when the threshold hits sqrt(1000 * total number of rows) which is way small than the older threshold.
Bottom line from this post is,
The better the estimates, the better the party plan will be 🙂
One thought on “T-SQL Tuesday #131: SQL Server Statistics Analogy”
Nice insights on statistics. Learnt sonething new.