With the invitation from Ben Weissman for the June edition of T-SQL Tuesday, here are some of my thoughts about the hybrid world. Thanks, Ben, for hosting this month of T-SQL Tuesday!
With the increasing amount of data in our databases, managing resources can get tough. Many shops currently have their databases spread across different environments, both on-premises and in the cloud. While they plan to move their data completely into the cloud, the process may not be easy if they are dealing with a lot of data across different datacenters. That's where Azure Arc data services can really help, managing all of these from a single secure location. We can manage SQL Managed Instances and PostgreSQL instances in one single panel, and we can run this on our own infrastructure or in any other public cloud.
Azure Arc data services in SQL
For companies that run on-premises, Azure Arc provides the latest technologies like evergreen SQL, which means there is no end of support for their older database versions. This technology provides continuous automatic updates and the features supporting databases. It helps with capacity optimization and smooth scale-out for on-premises data workloads with no application downtime.
I have not yet had a chance to work with companies using hybrid environments. That doesn't have to limit me from learning these amazing technologies Microsoft has to offer, so I have already started my learning journey with Azure Arc data services. It is a bit overwhelming to understand everything at once, as there is so much to learn about this new technology, but it is never too late to begin learning. As many companies have started moving to the cloud, it's never too late to improve our skills. With this T-SQL Tuesday invitation from Ben, I started learning about Azure Arc data services. Here is the list of resources I am following, and plan to follow, to get started with Azure Arc data services, Kubernetes, and Big Data Clusters:
Big Data Clusters for the Absolute Beginner: Thank you, Mohammad Darab, for this amazing introductory session on Big Data Clusters. I'm sure you are going to give big kudos to Darab after watching this session. It's that amazing! Also, don't forget to look at the playlist on Darab's YouTube channel focusing on Big Data Clusters.
Microsoft Learn: Microsoft offers free training. Customize the path you want to learn and start the learning journey. You can also get a free Azure account with $200 of credit for a month to get hands-on experience.
While I was preparing for my tempdb presentation, I learned many interesting facts about tempdb. Thanks so much, Bob Ward (t|g), for providing me with the resources to prepare for my presentation. Bob Ward presented an amazing three-hour session about tempdb at the PASS Summit a couple of years ago. This information is invaluable.
Why am I blogging about these interesting things about tempdb, then?
There were a lot of things about tempdb that surprised me as I learned about it from Bob Ward's presentation. Here is the link to the complete session. I thought of writing a blog post about these interesting and cool things so you all can know about them as well.
Do you know?
Each time you restart SQL Server, tempdb is created from the model database and acquires all of model's properties, including any objects you created in the model database. First the primary data file is created by copying one extent at a time from model to tempdb. Then the transaction log file is generated, and after that the secondary data files are created for tempdb. Until tempdb is created, the model and tempdb databases are locked and no one can connect to them, though user connections to the instance are allowed at this point because the master database is opened before tempdb.
You can create tempdb very quickly by using the -f startup parameter while still retaining tempdb's original sizes. This helps when you are having problems starting tempdb.
If you have enabled instant file initialization, creating the data files will be quick because SQL Server does not have to zero out the database files and can start using them immediately. Instant file initialization does not apply to transaction log files. If you would like to increase the size of the log file later, the entire new portion has to be zeroed out. When SQL Server restarts and a large transaction log file for tempdb already exists, it does not have to zero out the entire log; it only has to zero out the first few virtual log files.
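As a quick way to check whether instant file initialization is in effect, SQL Server 2016 SP1 and later expose it through sys.dm_server_services; a small sketch:

```sql
-- Check whether instant file initialization (IFI) is enabled for the
-- database engine service account (column available in SQL 2016 SP1+)
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;
```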
When a temp table is dropped, the metadata of the temp table is cached. The temp table's data is truncated, keeping the metadata for reuse.
A temp table cannot be cached when it is not associated with a stored procedure, function, or trigger. Temp table caching is not supported for ad hoc queries.
If multiple sessions run the same stored procedure containing the same temp table at the same time, multiple temp tables can be created, and they can all be reused because all of these tables are cached.
Tempdb does minimal logging. Very little logging is needed because the data is not persisted when SQL Server restarts; tempdb is recreated on every restart. We do not need crash recovery for this database; we need the transaction log only for rollbacks. There is no redo for tempdb because the objects we create are not persisted across a restart.
When the checkpoint process runs automatically on SQL Server, it skips tempdb. When you manually run a CHECKPOINT, the checkpoint does happen on tempdb, and it can take some time to run because checkpoints there are not done by the regular automatic process. The checkpoint process truncates tempdb's transaction log.
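To see this yourself, a small sketch (for a test instance) that runs a manual checkpoint against tempdb and compares log space before and after:

```sql
-- Log space used per database, including tempdb
DBCC SQLPERF (LOGSPACE);

-- Manual checkpoint on tempdb; the automatic process skips it
USE tempdb;
CHECKPOINT;

-- tempdb's "Log Space Used (%)" should drop after the checkpoint
DBCC SQLPERF (LOGSPACE);
```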
Table variables are stored in tempdb, and there is no rollback support for table variables.
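A minimal sketch demonstrating this difference against a regular temp table (the table and column names are just for illustration):

```sql
DECLARE @tv TABLE (id INT);
CREATE TABLE #tt (id INT);

BEGIN TRAN;
    INSERT INTO @tv VALUES (1);
    INSERT INTO #tt VALUES (1);
ROLLBACK;

-- The table variable keeps its row (table variables do not roll back);
-- the temp table's insert is rolled back
SELECT (SELECT COUNT(*) FROM @tv) AS table_variable_rows,  -- 1
       (SELECT COUNT(*) FROM #tt) AS temp_table_rows;      -- 0

DROP TABLE #tt;
```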
Index sorting is done in the user database, not in tempdb, unless you use the SORT_IN_TEMPDB option. If the sort does not fit in memory, it will spill to disk and use temporary space in tempdb.
You are enabling the version store in tempdb if you are using snapshot isolation, read committed snapshot isolation, online index rebuilds, or triggers.
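If you want to see how much space the version store is taking in tempdb, sys.dm_db_file_space_usage exposes it; a small sketch:

```sql
-- Version store pages reserved in tempdb, converted to MB (8 KB pages)
SELECT SUM(version_store_reserved_page_count) * 8 / 1024.0 AS version_store_mb
FROM tempdb.sys.dm_db_file_space_usage;
```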
If you try to change the size of a tempdb file using Management Studio, an ALTER of the file does not actually happen; a SHRINKFILE happens behind the scenes. When you manually alter the file using a script, the ALTER DATABASE succeeds, and the size change takes effect when you restart the server.
Auto shrink is not supported for tempdb, and we cannot shrink the internal objects inside tempdb.
When you run CHECKDB on tempdb, it runs with an implicit TABLOCK because a database snapshot is not supported for CHECKDB on tempdb; online CHECKDB is therefore not supported either. CHECKDB on tempdb runs table checks on all the tables, including the temp tables we create. We cannot run repair on tempdb because tempdb cannot be placed in single-user mode. CHECKSUM is supported on tempdb.
We can change the location of the tempdb files while they are in use, but the change takes effect only after a SQL Server restart.
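The move is done with ALTER DATABASE ... MODIFY FILE; a sketch, using the default logical file names and a hypothetical target path:

```sql
-- Point tempdb's files at a new location (the D:\TempDB path is hypothetical);
-- the commands succeed immediately, but the files are only created at the
-- new location on the next SQL Server restart
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, FILENAME = 'D:\TempDB\tempdb.mdf');

ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, FILENAME = 'D:\TempDB\templog.ldf');
```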
These are some of the interesting and cool things I learned about tempdb! I hope you learned something new today. Thanks for reading!
Memory-optimized tempdb metadata was introduced in SQL Server 2019 to solve the metadata contention problem we have in tempdb.
Before learning how to take advantage of this feature, let's learn what metadata contention is.
Metadata contention happens when there is contention on the system pages that store the data about the objects inside tempdb. Each time you create, update, or delete any object in tempdb, these system tables need to be updated. Since tempdb is used by all the applications and databases on the server, and there is only one tempdb per instance, metadata contention can occur when these system tables are updated while temporary objects are created in tempdb.
Metadata contention has been an issue since SQL Server 2000. In SQL Server 2005, temp table caching was introduced to reduce the metadata contention. For inserts, caching the metadata of the temp tables used in stored procedures helped reduce metadata contention for many years. Periodically, the metadata of these cached objects needs to be removed from the cache. During this cleanup process, the related rows have to be found and deleted from the metadata tables. As more and more table metadata was added in newer versions of SQL Server, contention was observed while those deletes from the cache happened. These cached objects need to be deleted whenever there is memory pressure, or when the cached objects are invalidated by altering the temp tables after they are created.
To address this problem, three main changes were made before the memory-optimized tempdb metadata feature was implemented. I will briefly explain those changes here.
The cleanup process was changed from synchronous to asynchronous. The cleanup process used to be synchronous, as the temp tables were dropped when the stored procedure ended. With the asynchronous process, the drop does not occur immediately and is treated as a no-op, whether the table is explicitly dropped using the DROP TABLE syntax or the cached object is invalidated by adding an index after the temp table is created. These objects are instead moved to a deleted list, and a cleanup thread driven by the cache clock-hand sweep process cleans up the deleted list along with the related rows in the system tables.
Only one helper thread per NUMA node does the cleanup. Previously, once objects were sent to the deleted list, they were deleted by the multiple threads that came in to create temp tables. Any thread that came to create a temp table first had to delete these cached objects from the deleted list, and find and delete the metadata from the system tables, before moving ahead. This logic was later changed to a single thread dedicated to the cleanup process, with one helper thread per NUMA node.
The latching algorithm was changed. When deleting a cached object from the cache, the old algorithm deleted the metadata rows from tempdb's system tables while holding an exclusive latch on the pages, known as PAGELATCH_EX. This exclusive latch was held for the entire process and released only once the metadata rows were cleaned up. The new latching algorithm instead acquires a shared latch while the cleanup process searches for the metadata rows in the system tables. Only once it finds rows to delete does it take the exclusive latch and delete them; if no rows are found, the shared latch is simply released.
In SQL Server 2019, Microsoft introduced a new feature called memory-optimized tempdb metadata, which stores the metadata of temp tables as non-durable memory-optimized tables. Let's see what the PAGELATCH_EX wait types look like in general. We will run the stored procedure below, created without enabling the feature (the default):
CREATE OR ALTER PROC TEST
AS
BEGIN
    -- Column list reconstructed; [Name] is assumed from AdventureWorks' HumanResources.Department
    CREATE TABLE #table1 ([DepartmentID] SMALLINT, [Name] NVARCHAR(50));

    INSERT INTO #table1 ([DepartmentID], [Name])
    SELECT TOP 1000 [DepartmentID], [Name]
    FROM [HumanResources].[Department] WITH (NOLOCK);
END
--use SQLQueryStress and run the below query. I used 50 iterations and 100 threads; see the results.
DECLARE @ID INT;
SET @ID = 1;
WHILE @ID <= 100
BEGIN
    EXEC TEST;  -- loop body reconstructed: repeatedly run the procedure created above
    SET @ID = @ID + 1;
END
Check for the page latch waits by running the below query:
--credits: Pam Lahoud
--query reconstructed around the surviving fragments
SELECT er.session_id, er.wait_type, er.wait_resource,
       OBJECT_NAME(page_info.[object_id], page_info.database_id) AS [object_name],
       SUBSTRING(st.text, (er.statement_start_offset / 2) + 1,
           ((CASE er.statement_end_offset
                 WHEN -1 THEN DATALENGTH(st.text)
                 ELSE er.statement_end_offset
             END - er.statement_start_offset) / 2) + 1) AS statement_text,
       page_info.database_id, page_info.[file_id], page_info.page_id, page_info.[object_id],
       page_info.page_type_desc
FROM sys.dm_exec_requests AS er
CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) AS st
CROSS APPLY sys.fn_PageResCracker(er.page_resource) AS r
CROSS APPLY sys.dm_db_page_info(r.[db_id], r.[file_id], r.page_id, 'DETAILED') AS page_info
WHERE er.wait_type LIKE '%page%';
You will have to refresh a couple of times to see the PAGELATCH_EX wait type. This is metadata contention; we can tell because the object name, sysschobjs, is a system table. To see how the feature works, we need to enable it and restart SQL Server.
ALTER SERVER CONFIGURATION SET MEMORY_OPTIMIZED TEMPDB_METADATA=ON;
Restart SQL Server and check whether the feature is enabled:
--should give you the value of 1 if the feature is enabled
SELECT SERVERPROPERTY('IsTempDBMetadataMemoryOptimized') AS IsTempDBMetadataMemoryOptimized;
Run the same query again using SQLQueryStress with 50 iterations and 100 threads.
You will not see any wait types this time.
If you have heavy workloads on your tempdb, this feature is totally worth trying. There are some limitations, though, which is why the feature is not enabled by default. With this feature turned on, metadata contention is reduced because there is no latching or locking of metadata pages.
You need to restart your SQL Server instance once you enable this feature, and you cannot use columnstore indexes on #temp tables while it is turned on.
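If the limitations bite (for example, you do need columnstore indexes on #temp tables), the feature can be turned back off with the same server configuration option, again followed by a restart:

```sql
-- Turn the feature off; takes effect after the next SQL Server restart
ALTER SERVER CONFIGURATION SET MEMORY_OPTIMIZED TEMPDB_METADATA = OFF;
```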
Data flows are where we do the bulk of transformations and data cleansing. In this session, let's go ahead and create a data flow and run it.
Open ADF, go to the Data flows pane on the left, and create a new data flow. Enable data flow debug.
We will take the example of creating a data flow by merging two tables, bigproducts and transactionhistory. Here, we need to bring in our data sources. We already have one for our bigproduct table, but we also need one for the TransactionHistory table, so let's create another dataset for it, just like the other datasets we created in this part of the Azure fundamentals series.
Now, go back to the data flow tab and add two sources, the bigproduct and transactionhistory sources. Just click on the add source space.
Now, let's add another data source.
To add additional transforms, click the plus button.
Click on the plus symbol to add a join condition.
In the join condition, provide the left table, the right table, the kind of join you need, and the join condition, as shown below:
Now, let's go ahead and remove all the columns we do not want. Again, click the plus icon on the right side of the join.
Choose Select and get rid of all the columns we do not want, then rename the columns to make them compatible with the destination.
Delete the columns you do not want after choosing the Select.
Now, let's use the derived column in ADF. With the derived column, we want to check the name that we brought from the transactionhistory table; if there is no match, we need to get a value that we define instead of null. Again press the plus symbol after the Select and choose Derived Column under the schema modifiers.
For the incoming stream, choose the Select, because that's the last step, and choose the column that you want to replace when there are nulls; for the expression, open the expression builder like below.
In the expression elements, choose functions and search for the iifNull function.
You need to drag the function into the pane like above. In the expression iifNull(), go ahead and edit it to add the name column.
In the expression, we are saying: if the name is not null, bring the name as is; if the name is null, replace the null with NA. Click Save and finish.
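The final derived-column expression is just a one-liner; assuming the column is called Name, it looks something like this in the expression builder:

```
iifNull(Name, 'NA')
```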
We want to load this into a file in the data lake. Behind the scenes, the data flow is using Azure Databricks. Let's set the destination. For the destination, let's just clone one of the datasets and edit it.
Here it gives me an identical copy of the bigproducts dataset. Change the name of the dataset, and remove the filename so it creates the file dynamically.
Go back to the data flow now and create a sink.
Now, in the properties, choose the dataset we newly created.
In the Settings tab of the window, choose to output to a single file.
Now the data flow is complete. We can run the data flow only from a pipeline; we cannot run the data flow directly by itself. Create a new pipeline from the Pipelines tab on the left side.
Drag the data flow activity
Run the debug on the top. Debug succeeded.
Microsoft spins up a Databricks cluster in the background, runs the code against that cluster, and once it runs and we get the results, the Databricks cluster shuts down. In order to debug and test this in Data Factory, we need to turn on the data flow debug option. By enabling this option, we are manually creating the Databricks cluster that runs in the background for testing purposes.
Check whether we have output under the storage account > products container > producttransactionhistory file.
In this post, we learned about data flows, created a sample data flow, and used transformations to transform the data as we need. In the coming post, we will learn about Azure Synapse.
With the invitation from Andy Leonard for the May edition of T-SQL Tuesday, here are my thoughts on how to respond to technology changes.
As the saying goes, "When there is no change, there is no growth."
This is true in every aspect of life, including technology. As technology changes, it makes our lives simpler and easier than before. With the continuous advancements in technology, many professionals find themselves rushing to learn the new technologies. These advancements add new features to existing technologies, improve performance, and solve problems, making our lives easier. The only question is: how quickly can we cope with this change?
The only answer, I believe, is to train employees on the new technologies. It is really important for companies not only to bring in the latest technologies but also to focus on employee training so those technologies are used efficiently.
Apart from formal training, there are many employees who love to learn and always look for ways to learn new technologies. For every new technology we have now, a free training platform may already be available to train individuals so they can use the technology with ease in their work. Take Microsoft Azure, for example: we have free training from Microsoft known as Microsoft Learn. Microsoft also offers a free Azure account so professionals can get hands-on experience as they learn. This platform has helped many professionals across the world, including me, in advancing our careers. Huge thanks to Microsoft for taking this initiative.
To keep up with the latest evolving technologies, I follow several resources along with the technical documentation itself. There are blogs dedicated to specific technologies, communities helping each other learn, and several technical newsletters and publications that keep us updated. I follow articles and posts on LinkedIn and Twitter to stay as current as possible with upcoming technologies.
Along with these, there are several YouTube channels covering the concepts of new technologies as they evolve. There are also training companies that provide training from beginner to advanced level, along with hands-on labs.
Keeping ourselves up to date on what is going on is not a tough job, but investing the time and interest to learn these technologies and use them effectively in our daily work life is really important.
At the beginning, everything seems confusing. Seeing the number of technologies evolving at the same time is overwhelming for sure, but if we take one technology at a time and focus on how it can actually simplify our daily work, we will embrace new technologies for the innovations they offer.
Finally, it is very important to be ready for more and more advancements and to anticipate the changes. As technology evolves, it becomes much easier to use. Keeping ourselves updated is the only way to keep growing!
In this post, we will learn about the Azure Data Factory.
What is Azure Data Factory (ADF)?
It's an ETL tool based in the cloud. Everything that we do in Azure Data Factory is executed using Azure compute resources. We do the development in the Azure portal using the UI instead of Visual Studio, with drag-and-drop interactivity and minimal code. This is unlike Databricks, where we need to write code. Data Factory is similar to SSIS on-premises. We use an expression language here in the GUI tool, writing code in the expression builder. We use Azure Data Factory for data orchestration and data transformation. One of the nice things about Data Factory is that it integrates nicely with hybrid environments, bringing on-premises and cloud together.
How to set up Azure Data Factory?
It's very simple. We provide a name (which should be globally unique), subscription, resource group, version, and location. We are going to learn to set up ADF in this blog post.
Pipelines are the same as packages in on-premises environments. A pipeline is a collection of activities and actions that do a particular piece of work (loading a file, cleaning and importing the data, etc.).
Data Flows are the data flow activities where you do the data transformation (cleanup, merging data, adding columns, etc.).
Mapping Data Flows are similar to SSIS on-premises. We have a source, transformations that clean up and transform the data, and a destination where the data is loaded.
Wrangling Data Flows are derived from the Power Query editor in Power BI. They are a UI-driven way of cleaning data (for example, removing values or unpivoting data).
In order to connect to sources in ADF, we need to provide the source connection information, such as where the data is coming from, the server name, and the credentials, just like we do in SSIS. We provide that information in ADF using linked services, which are similar to SSIS connection managers on-premises. These linked services can be shared among the pipelines in ADF.
A dataset is a connection to a specific table, or a specific query pulling data from a table. A dataset can also be a file, a document, or a storage location within the data lake account.
Within an ADF pipeline, we have many different activities that define the type of action we want to perform on our data. Examples include a stored procedure inserting, updating, or deleting data in a table; running a Databricks notebook directly from Data Factory through the Databricks activity; the copy activity; and so on.
Now, let's go ahead and see this in action!
Just provisioning the data factory doesn't charge you anything; you only get charged for the pipelines and the resources the pipeline runs consume. Create a data factory from the create new resource tab by searching for "data factory".
Once the window opens, provide the resource group, the location (the region where your company is located), a name for the data factory, and version V2, as this is the latest version available.
You can set up a Git repository for source control in the next tab, but let's not set that up right now; check "Configure Git later" and then click Review + create.
We take the code in the data factory and publish it to a code repository; from the code repository we build the pipeline release, and from there we can publish to the Azure data factory in production.
Now the data factory is created, and we can create the connections for our data sources, pipelines, and data flows. Go to the resource and click on the resource we created; you can see that the data factory got created.
This opens the administration page, but if you wish to do development, we need to open the Author & Monitor page, where we can build the connection managers, datasets, and pipelines.
A window opens up like the above. In the side panel you can see the data factory; this is where all of our pipelines and data flows are built. Then we get the built-in monitoring that comes out of the box, and we have Manage. Let's click on the Manage button.
When you click on Manage, you see the connections, where you can create linked services and integration runtimes. Under source control, you can create the Git configuration and ARM templates. Under Author we see triggers, which are used for scheduled executions, just like SQL Agent on on-premises servers.
Let's go ahead and create the linked services. At the top, you can see New; click on that to create a connection manager. We will create two connection managers here: one for the Azure data lake account and one for the Azure SQL Database, so we can interact with the data sources within them.
Let's first create the Azure data lake connection. Click on the Azure tab near the search box, then click on Azure Data Lake Storage to create a connection manager for the Azure data lake.
Give it a proper name that represents the Azure data lake for easy understanding; here I chose "AzureDataLake_dbanuggets_datafactory", with dbanuggets-datafactory being my data factory account. Use Account key as the authentication method here. As a best practice, though, we should not use the account key: it will impact using Azure DevOps for your code repository, which will not store sensitive information. Instead, store sensitive information such as passwords in a key vault and reference the key vault from here. For our testing, let's keep this as Account key. For the account selection method, use "From Azure subscription", choose the "Microsoft Azure Sponsorship" subscription, and pick the storage account we created in the first part of this series.
Click on Test connection at the bottom of the screen; once the test connection has succeeded, click on Create.
The Azure data lake connection has been created.
Now, let's go ahead and create the Azure SQL Database connection. Click on New at the top of the linked services page, choose the Azure tab, scroll down to choose Azure SQL Database, and then click Continue.
Give the details and test the connection.
But my connection failed here. Why is that?
When you click on More next to the error, you will be able to see the details of the failure.
Azure Data Factory is a service in Azure, and yet it cannot connect to the Azure SQL server. We need to go to the server's firewall rules and check the settings there. When you check the error, it failed because the IP address mentioned is not our IP address; it is the IP address of the virtual machine running in the background that Azure Data Factory is using to try to connect to the server. Go back to the all resources window and find the server we created earlier.
Under Security, find the firewall rules.
Go ahead and click on the firewall rules, where you can choose "Allow Azure services and resources to access this server", which means any service in Azure can now try to authenticate to this server. This might be a concern for a lot of customers: if that option is on, any Azure service from anywhere can attempt to connect by trying different usernames and passwords. The alternative is to add the IP address of the Data Factory VM we saw in the error, but that IP address is dynamic and will change. So, for our testing's sake, let's temporarily enable "Allow Azure services and resources to access this server" so the data factory works, then click Save at the top of the window.
Do the test connection once again, and it is successful now.
Click on Create. Now the Azure SQL Database connection is created.
Let's go ahead with a simple example: pulling the data from the dbanuggets table into the Azure data lake. We will take the data from the dbanuggets table and copy it into a logical container that groups those files together in our data lake. So, we need a dataset pointing to the dbanuggets table in Azure Data Factory, and another one to write the file into the Azure data lake.
To do this, go to Author on the Azure Data Factory side bar and see the different factory resources.
Pipelines are collections of the different activities that do the work. Datasets are the connections, defined once for reusability.
Create a new dataset by clicking on the ellipsis symbol right after Datasets.
Add new dataset
Select the Azure tab under the search bar and choose the Azure data lake storage Gen 2 and click on continue at the bottom of the window
In the formats, click the CSV Delimited text
Give the name of the container and the linked service name we created before. The file path doesn't actually have to exist; if we give a file path that does not exist, it is created for us. Give the file name in the file path section. Keep "First row as header" checked. For the import schema option, choose None, because the CSV file doesn't exist yet; we will create it. Then click OK at the bottom of the page.
We have created our first resource in our data factory other than the connection managers, and we can see it as a tab across the top. As you create new resources, they open as new tabs at the top; once we finish working on them, we have to click Publish all and close them. Once this resource is created, you can see many options to choose from. Select the filesystem account we created earlier.
Parameters are really useful for making a dataset reusable. Instead of having the dataset hard-coded for bigproduct, we can make it dynamic through parameters, so the name of the file can change automatically using parameters and expressions, making the dataset reusable.
We need to create one more dataset that points to the bigproducts table in the SQL database. Click on the datasets ellipsis again, choose Azure SQL Database, and click Continue.
Give the name, connect to the Azure SQL database and the table, and then choose the import schema option. The beauty of Azure Data Factory is that we do not have to know the schema up front; this is known as late-arriving schema: the schema can change at any time, and Data Factory will handle it. This is not available in SSIS packages, which are completely schema-bound; if one of your table schemas changes, for example if you drop a column or rename a column, all the SSIS packages referencing the table will break. We can choose None to not worry about the schema and keep it dynamic, but let's choose to import the schema for our example.
After we create this dataset, we need to click Publish all so the work created so far is saved to the data factory in production. If you do not publish and just close the windows, or if something happens, all your work will be lost. So, let's go ahead and publish.
If you have set up source control with Azure DevOps, then all this published code is sent to the repository rather than being published to the data factory. Now that we have created the two datasets, let's go ahead and move the data from the database to the data lake.
We need to create a pipeline for that. Go to the pipeline tab on the left side of the pane > New pipeline > a window opens all the way on the right > give the name.
Give the name and click on the Properties button to close it.
From the data transformation activities, drag the Copy data activity, which will only copy the data.
Down below, you can see all the activities. Give the name, then go to the sink; sink means destination. The 1 on top of the Source and Sink tabs tells us that there are required properties we need to set in order to complete the activity. In the General tab, give the name.
In the Source tab, set the source, and do the same for the destination. We can do the mapping manually, or we can let it map by itself, which is the late-arriving schema: whatever the schema may be, ADF will be able to map it itself. This is such a beautiful feature.
We are getting the data from the table
In the sink (destination), provide the destination details
There are other settings you can change as well, but for this example, let's leave them as they are.
We do not want to change anything in the mapping; let it happen automatically. Now, go ahead and run Debug. This is like a testing mode in Azure Data Factory: it runs the most recent copy of the pipeline in the designer, not the version published to the data factory.
Note: If the debug run fails for you, as it did for me at first, make sure your source and destination connections are set up correctly and have been published properly.
Go to the All resources tab on the left side pane, open the storage account, and then open Storage Explorer. You can see the newly created products container, and when you click on it, you will see the CSV file we created.
Publish the changes again
What if we wanted to change the file name dynamically, for example adding the date to the file name we already created? Open the file path for the CSV dataset again, and in the file name field, select Add dynamic content below the name.
A new window pops up where you can add the dynamic content.
Go to the string functions and select the concat function.
Once you click concat, the function automatically appears in the dynamic content box. For now, just add the name you wish to have before the date.
Then, for the date, go to the date functions and click utcnow.
You will see utcnow() added to the content; just append .csv to the end and click Finish.
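The finished dynamic content looks something like this (the 'bigproducts' prefix is just an example name, not necessarily the one from the original screenshots):

```
@concat('bigproducts', utcnow(), '.csv')
```

utcnow() returns a full ISO 8601 timestamp, so the resulting file name carries both the date and the time.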
Now, go ahead and debug the pipeline one more time to see how this works. Debug succeeded.
Go ahead and check that the file has been created by clicking Refresh.
A new file has been created along with the date. So cool, right?
But what if we want only the date part? Let's go ahead and do that:
Grab the substring function from the dynamic content window.
Insert utcnow() into the substring function and take the date from character 0 through the 10th character. Remember, in SQL Server string positions start at 1, but in ADF they start at 0. Now, click Finish.
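Put together, the expression becomes (again, the file name prefix is my own example):

```
@concat('bigproducts', substring(utcnow(), 0, 10), '.csv')
```

substring(utcnow(), 0, 10) keeps the first ten characters of the timestamp, which is exactly the yyyy-MM-dd date. An alternative with the same effect is formatDateTime(utcnow(), 'yyyy-MM-dd'), which states the intent more explicitly.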
Now, run the debug again and see that it succeeded.
Go to the storage explorer again and see the file again
In this post, we learned about Azure Data Factory. We learned how to create linked services, datasets and pipelines. We tested the pipeline using the Debug button and created files with different names by adding dynamic content. In the coming post, let's learn about creating data flows.
Azure SQL Database is a managed version of SQL Server that runs in Azure: Microsoft manages the database, the backend infrastructure and the backups. If until now you have been managing services on-premises and you would like to move to Azure but still manage some of it yourself, then you may be looking for Infrastructure as a Service, where you remain responsible for patching your operating systems and updating your software. In that case you would go with SQL Server on a VM, which is cost effective and gets you closest to the feature set of on-premises SQL Server. If you want all the features exactly as on-premises, SQL Server on a VM is the option.
With Azure SQL DB, you can create a database in a few minutes at a low cost. The database is hosted in a Microsoft-managed data center, and we can read, write and modify the data without paying for or provisioning any servers to host it. Everything is taken care of by Microsoft.
Azure SQL Database has some limitations: when you run your database as Azure SQL Database, you might not get all the functionality you have on an on-premises server. Another Platform as a Service option, SQL Managed Instance, addresses this: it is still fully managed for you, but you get the server-level functionality you lose with Azure SQL DB. When you want big data analytics, machine learning or data science and need powerful resources to accommodate them, the option is SQL pools (formerly Azure SQL Data Warehouse), a massively parallel processing service.
When we create an Azure SQL Data Warehouse, it takes just a few minutes to set up and start running. Compare that to building a DWH server on-premises, which takes a lot of time, at least a couple of weeks, and costs a lot of money to procure all the required hardware. Azure has completely changed that: you can set up a DWH environment in a matter of minutes at a very low cost.
Azure SQL Database:
Azure SQL Database is Platform as a Service, meaning Microsoft manages the hardware, infrastructure, backups, DR failover, data redundancy and operating system patching; our responsibility is the database itself: the tables and objects in it, and writing, reading and modifying the data. It is a relational database as a service. Microsoft guarantees the database will be available 99.99% of the time and financially backs that agreement: if the database goes down for a length of time, Microsoft credits the money back. You only pay for what you need here.
Here, we get three built-in replicas of the database for high availability. We also get point-in-time restore capabilities for up to 35 days. For Azure SQL Database, we can choose the performance level and the price we are going to pay. We can use the SQL Server Management Studio tools, and the SQL environment is similar and very familiar. We can also replicate the database to many geographical locations: just as we discussed in the previous post about storage accounts, if the entire data center hosting your Azure SQL Database goes down, we can immediately fail over to another geographical location and start using the database there.
A couple more built-in features of Azure SQL Database: databases are automatically backed up, even when you delete them. We automatically get Transparent Data Encryption, Always Encrypted and row-level security. We also get Azure Defender for SQL, which monitors the activity of logins and users connecting to the database and can be set up to alert us if any malicious activity happens. We also get automatic database tuning and Azure Active Directory integration.
Azure SQL Database Limitations:
We cannot back up the database ourselves; Microsoft takes care of the backups.
There is no Windows authentication; authentication is done through SQL logins or Azure Active Directory.
There is no Change Data Capture, SQL Agent, Database Mail, PolyBase, Service Broker, MDS/DQS or BI services (SSIS, SSRS, SSAS); VNET integration is limited; and in replication the database can be a transactional or snapshot subscriber only.
Saying there is no SQL Agent doesn't mean we cannot run jobs. We can use Azure Data Factory instead of SSIS to move and clean the data, and instead of SQL Agent we use triggers inside Azure Data Factory to kick off those pipelines, either on a schedule or when an event occurs.
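As a sketch of that replacement, a schedule trigger in ADF is just a small JSON definition; the names and times below are my own examples, not from the original post:

```json
{
  "name": "DailyLoadTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2021-06-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyBigProducts",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

This plays the role a nightly SQL Agent job would have played on-premises: it runs the referenced pipeline once a day at the configured start time.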
PaaS for databases is really beneficial because it frees up more time for optimizing queries, tuning application performance, monitoring how users use the applications, and scaling the applications well, so companies can save money on Azure too.
We can connect to an Azure SQL Database using SQL Server Management Studio, just like an on-premises database. We can also use tools such as Visual Studio and PowerShell when working with Azure resources.
Let's go ahead and practice this by building a logical server, meaning we are not purchasing any physical hardware or physical servers. Microsoft creates these logical servers virtually, and much like the resource group from our previous post, a logical server is simply a logical container: it holds the metadata about the databases, elastic pools and data warehouses attached to it.
To create the logical server, go to the "Create a resource" pane on the left side of the portal and search for "SQL Server (logical server)" in the search bar, which brings up the service.
Click Create, and the creation screen for that resource appears.
Now, give the resource group name, "dbanuggets", and in the server details, give the server name, which must be unique across all of Azure: if someone anywhere in the world has already used the server name you are providing, you will not be able to create the server. This is because we use the server name when connecting to the databases on the server. Give the server name "dbanuggetsserver" and fill in the location details.
In the administrator account section, provide the SQL Server admin login and the password that will allow us to authenticate to the server and see the databases within it. Give the username and password and press Create.
Press the Create button at the bottom of the screen.
It took a minute to create, and I received a notification that the deployment is complete.
Go to Resource groups > select the dbanuggets resource group we created in the previous post > and check that the server has been created and is listed under the resource group.
As our server is ready now, let’s go ahead and create the database now.
Go to the Create a resource pane on the left side of the portal; the resource we want to create now is SQL Database, so go ahead and search for that.
Click Create on SQL Database
You can go ahead and provide the Resource group, location and name of the database
When you scroll down, you see two important options. One is “Do you want to use an elastic pool”
Azure SQL Databases are individual databases with their own resources allocated to them. This is very different from on-premises environments, where all the databases are created on the same provisioned server and share the available resources. In Azure, each database has its own resources, but to improve performance and save cost we can create a pool of resources and attach several databases to it.
Databases that see traffic only during certain periods of the day, or databases that are rarely used but need full performance when they are, are good candidates for a pool; not all databases fall into this category. If you would like databases to share resources the way they do in on-prem environments, enabling an elastic pool can be helpful.
The other option is Compute + storage: how much compute (CPU, memory, and disk speed for reading and writing) and how much storage we want to configure for this database. We can set that by clicking the "Configure database" option. It takes us to a new tab showing two different pricing structures: vCore, which is the new model, and DTU, which is the legacy model.
In the tab that opens, you can see the price options increase from left to right: Basic, Standard and Premium.
For our examples, we will use the DTU model because it is simpler and has a cheaper entry point. The vCore model is easier to compare to on-premises environments: if on-premises we have a certain number of cores, we can choose the same number of vCores, which makes sizing straightforward. DTU is still supported, but as a legacy model kept for backward compatibility.
We will choose Basic with 5 DTUs. What is a DTU? It is a blended amount of compute: CPU, memory, and how fast the storage can read and write our data. Going from 5 DTUs to 10 DTUs means doubling all of these resources. For 5 DTUs, we get only 2 GB of storage at about $5 a month, which is very cheap. Click Apply.
Review and then click Create.
If in the future you would like to move your Azure SQL database to a higher service tier, that is completely possible. You can easily scale up and scale down, and you can even switch between the DTU and vCore models without any issues.
If you need to scale up to thousands of DTUs and then, a couple of months later, no longer need that many resources, you can easily scale back down: that is pay as you go. Click the Review + create button at the bottom of the screen.
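Scaling does not even require the portal; it can be done with plain T-SQL. A sketch, where the tier and service objective names are examples:

```sql
-- Scale up to Standard S2 (50 DTUs); run in the master database of the logical server
ALTER DATABASE [dbanuggets]
MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S2');

-- Scale back down to Basic once the extra resources are no longer needed
ALTER DATABASE [dbanuggets]
MODIFY (EDITION = 'Basic', SERVICE_OBJECTIVE = 'Basic');
```

The scale operation runs asynchronously, so the database stays online while the change is applied.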
Your deployment is now complete
Open the database link. The SQL database blade shows up as below, and you can click Copy to clipboard to copy the server name.
Copy the server name to the clipboard, open SQL Server Management Studio, and paste the server name you copied from the Azure portal. Use SQL authentication and give the login and password you created along with the server in Azure.
Once you press Connect, a screen pops up and the connection fails, because we have not yet granted access to our own IP address on the server.
This is the first layer of protection for your data. Because I created an admin account, I can press Sign in directly from this dialog and it will add my current IP address. Alternatively, go to the Azure portal, and on the server page, on the left side of the panel, click Firewalls and virtual networks.
Click on Add Client IP
Give the rule name
Click Save at the top of the same window
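The same kind of rule can also be created with T-SQL once you can connect from somewhere that is already allowed; the rule name and IP address below are placeholders:

```sql
-- Run in the master database of the logical server
EXECUTE sp_set_firewall_rule
    @name = N'HomeOffice',             -- rule name (example)
    @start_ip_address = '203.0.113.5', -- replace with your public IP
    @end_ip_address   = '203.0.113.5';
```

Using the same value for the start and end address allows exactly one IP; widening the range opens the whole span of addresses in between.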
Now, try connecting again with the Azure server name, login and password. You should be able to connect now.
This looks just like an on-premises database environment, and we are now connected to the Azure database from the local Management Studio.
I created a sample table in the database using a short script and added a few rows.
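The original script isn't reproduced here, so as a stand-in, a minimal script matching the table name used in the next step (the columns are my own invention) might look like:

```sql
-- Hypothetical stand-in for the original script: a small product table plus a few rows
CREATE TABLE dbo.bigproduct
(
    ProductID   int IDENTITY(1,1) PRIMARY KEY,
    ProductName nvarchar(100) NOT NULL,
    ListPrice   decimal(10,2) NOT NULL
);

INSERT INTO dbo.bigproduct (ProductName, ListPrice)
VALUES (N'Widget',   10.00),
       (N'Gadget',   25.50),
       (N'Sprocket',  7.99);
```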
After creating the table and inserting the data, let's run select * from dbo.bigproduct and see the results.
You can create tables, insert data, and read and write data in the database right from Management Studio, just the same as with an on-premises database. There is a little latency when inserting the data, since we are paying only a couple of dollars a month for this database on a free-tier Azure account.
In this session, we learned about Azure SQL Database. We started by creating a logical server, created a database on that server in Azure, added an admin account using SQL authentication, granted access to our client IP address, and connected to the database using SQL Server Management Studio. We then created a table and inserted data into the Azure SQL Database from Management Studio on the local machine.
In the coming blog, we are going to learn about Azure Data Factory fundamentals. Stay tuned!
With the invitation from Steve Jones for the April month of T-SQL Tuesday, I am going to share some of my thoughts on using Jupyter notebooks.
Jupyter Notebook is an open-source web application that lets us create and share code and documents with others. It provides a beautiful way to document your code, text, images and query results. Jupyter notebooks in Azure Data Studio can run PySpark, Python 3, Spark R and Scala, as well as PowerShell. You can run queries and visualize the results without having to leave the environment. It is an excellent tool for documenting things: no more long emails with a bunch of screenshots attached. Once you start using Jupyter notebooks, they will save you a lot of time and give you peace of mind. The tool is useful not only for documentation but also for demos in presentations, and beyond SQL it is very nice for building machine learning models, visualizing data and much more. Your code is written in separate cells that can be executed individually, which lets you run a specific part of a project's code on its own.
There are magic commands you can use to make your work easier; %clear, %autosave, %debug and %mkdir are a few of them. Magic commands come in two flavors: line-wise, for when you wish to run a single command, and cell-wise, for when you wish to run an entire block of code.
There are also interactive dashboards in Jupyter notebooks.
Jupyter files themselves are stored in JSON format. When you wish to share your work, you can save the notebook's results in different formats such as CSV, Excel, JSON or XML, and you can convert the notebook itself into different formats using nbconvert. For analytical queries, you can also build a chart from the data.
You can use the Jupyter notebooks for giving presentations to your customers and clients about your projects.
Jupyter notebooks are an amazing tool, as you can document everything you do. Once you start using them, you will definitely not want to turn back to your previous tools. I recommend Jupyter notebooks to anyone who would like to explore and take advantage of the many features they have to offer.
Microsoft Azure is a cloud computing platform and infrastructure for building, testing, deploying and managing your applications and services. For many years we have had servers and databases on-premises, where companies are responsible for buying, installing and managing the hardware and software. Azure was introduced in 2008, and since then we have seen a huge transition to cloud computing, as Azure provides services for managing the hardware and infrastructure and for securing the data, including backups, among many other services.
This is a new blog series about Azure fundamentals. If you would like to be notified as soon as I post a new article in this series, subscribe at the WANT TO BE NOTIFIED section on the right side of the screen.
There are twenty-two service categories available in Azure and some hundreds of services to choose from. If you are new to Azure, you can sign up for a free Azure account: Microsoft offers a free $200 credit for 30 days, so you can freely explore the services, test and deploy enterprise apps for free, try data analytics with your own data, and much more. Here is the Microsoft link where you can create a free account:
Azure Data Services
On-premises, we are responsible for procuring the hardware (disk space, CPU and everything else) and maintaining it, as well as for the networking, patching, virtualization, operating system, middleware, apps and data. What if virtualization, servers, storage and networking were managed by Microsoft, so we could focus mainly on application development and managing our data? That's where IaaS (Infrastructure as a Service) comes in: we take care only of the operating system, patching, software installation, apps and data.
With PaaS (Platform as a Service), Microsoft takes care of a lot more, so you can focus solely on application and data development. Microsoft manages everything it manages in IaaS plus the operating system, middleware and runtime. If your database ever needs to be restored, backups are available to restore from at any time.
SaaS (Software as a Service) covers services like Office 365 and Dynamics 365, where everything is done for you; you simply write and read your data in those platforms.
Data Services in Azure:
These services are Infrastructure as a Service and Platform as a Service solutions. In the diagram below, we can see on the left side all the sources we can connect to: on-premises or cloud, structured or unstructured data. We do have SSIS, but it doesn't have many cloud ETL capabilities, as it cannot connect to all the new data sources. As we move to Azure, there is Azure Data Factory, which provides a graphical user interface where you can drag and drop and perform different activities. For extracting and transforming data, we also have another tool, Azure Databricks, which gives us a lot of flexibility: you can write R, Python and Spark SQL, but using it requires coding. It is mostly data scientists and data engineers who use Databricks.
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics.
Once we apply the transformations, we load the data into Azure Data Lake Storage or other storage areas like Azure Storage, Azure SQL DB or a SQL pool, or into non-relational database systems like Cosmos DB. Once the data is stored, we can report on it using Power BI.
Everything we create in Azure goes into a logical container. Resource groups are a great way to take all the resources we create for a project, a department or any other purpose and store them together; think of a resource group as a directory for resources. We can update, deploy or delete the resources as a group. Every resource we create lives in a resource group, and the resource group stores the metadata about those resources: what type of resource it is, what was allocated to it, and its name.
There are ways to automate this, but for this example, for simplicity, we are going to use the GUI.
Open your Azure portal, go to the navigation pane and click on the “Create a resource”
This brings us to the Marketplace, where we can see the 22 different categories and the hundreds of different services available. Find "Resource group" by typing it in the search bar.
Select the Resource Group and click create
This brings us to the resource creation page. Give the resource group a name and choose the location where it is going to be stored. For the sake of latency, productivity and performance, please choose the right region. There are other options out there, but to get started with Azure this should be good. The Tags option holds information such as which project the resource group is associated with, who the owner is and who purchased it.
It is better to keep all of our resources in the same region to reduce cost: moving data from one region to another actually incurs an extra charge. When you create your databases, storage accounts and data factory to move data, it is important to keep the data in the same data center.
Then at the bottom of the screen, click review and create.
It should show "Validation passed"; press Create at the bottom of the screen. Once you do, your resource group is created and you get a notification in the top right corner.
You can go to the resource group either from the notification or from the navigation pane by clicking "Resource groups".
When you click on the resource group "dbanuggets", a vertical pane opens across your screen, known in Azure as a blade; clicking any tab in it opens further vertical panes. We are currently looking at the resource group we created, "dbanuggets", labelled under its name as a Resource group. Underneath we see all the administrative properties, and we can control things like security (access control), policies, resource costs and deployment history. So, we have created our logical container, a resource group.
If you want to close this resource group pane, you can simply press the cross button at the top right corner of the pane to go back to the previous blade.
Azure Storage comes as the general-purpose v2 account and as Azure Data Lake Storage Gen2, and it is the cheapest storage option available. If you are creating a storage account in Azure, Azure Data Lake Gen2 will usually be good enough, unless you need features that exist only in general-purpose accounts and not in Azure Data Lake Storage.
Azure Data Lake uses Azure Active Directory, while blob storage accounts use a shared access key. If someone gets hold of the storage account name and the access key, that is a dangerous situation, so the access key should be regenerated on a schedule, every couple of weeks or so, to keep it secure. With Azure Data Lake Gen2 accounts, we can instead take advantage of Azure Active Directory logins. There are also some regulatory and compliance features we get with general-purpose accounts that we do not get with Data Lake. One of them is WORM storage, Write Once Read Many: once the data has been written, it cannot be modified. Azure blob storage offers WORM storage and Azure Data Lake doesn't. Blob storage also has soft delete, so if you accidentally delete anything, you can restore it; Azure Data Lake doesn't offer that either.
Azure Data Lake is designed to store massive amounts of data for big data analytics. It provides unlimited scalability, Hadoop compatibility, the optimized Azure Blob File System (ABFS) driver designed for big data analytics, and zone-redundant and geo-redundant storage.
Let's create an Azure Data Lake:
Go to the navigation pane and search for "Storage account"; both general-purpose storage accounts and Azure Data Lake are created through the same storage account service. When you see the storage account service, click Create.
You will see the create storage account pane, and at the top there are several tabs such as Data protection, Advanced and Tags. You can create the storage account first and configure these later.
Select the resource group; we will select the one we created, named "dbanuggets", and give a storage account name. There are some requirements to remember here: the name must be all lowercase and globally unique across all of Azure, because we connect to the storage account through its name.
Give the storage account name in all lowercase letters. Here I am creating the account "dbanuggetsstorage" and choosing the storage location to match the resource group; I am choosing East US. Then choose the kind of account you are going to create. Remember, this is not where we choose between general-purpose v2 and Azure Data Lake Gen2.
Then go to the Replication option and select the type of redundancy you need. The cheapest, built-in option is locally-redundant storage (LRS), meaning that whether you are working with a data lake or a plain storage account, Microsoft keeps three copies of your data at any given time. That is amazing: with three copies available we do not have to worry about one data rack going down, as the other two are readily available. However, this redundancy is within a single data center, so if the data center itself goes down, we need one of the other options Azure offers, like zone-redundant storage (ZRS) or geo-redundant storage (GRS). Each option has an associated cost. For this example, we will choose the cheapest option, locally-redundant storage (LRS).
In the Networking tab, we can control access through IP addresses. The next tab is Data protection, where we can enable soft delete for blobs (so if someone deletes a file accidentally, we can recover it), turn on versioning for blobs, and enable point-in-time restore for containers.
The next tab is Advanced, where we choose what kind of storage account this will actually be: Advanced > Data Lake Storage Gen2 > enabled or disabled. This single setting determines whether your storage account is a data lake, and it must be set when the resource is created; we cannot modify it later. Let's enable Data Lake Storage Gen2.
Once you enable this option and go back to the Data protection tab, many of the options there are disabled: we lose point-in-time restore, soft delete, blob versioning and WORM storage (once we write the data, we cannot modify it). These are the limitations that come with Azure Data Lake, so if you need those options you have to go with a blob storage account. Remember that we are not restricted to one storage account for all our projects: some projects can use blob storage and others Azure Data Lake. If some data requires regulatory compliance like WORM storage, or you need soft delete so no one can accidentally lose files, for that specific project choose a blob storage account. We can always mix and use different services across our projects in Azure.
On the Tags tab, we can give a tag name and value, so when we look at the billing information the resource is tagged and we can see our costs clearly. It is good practice to tag all of our resources. Click Review + create at the bottom of the page.
You see the validation passed on the top of the screen. You can review the resources that you wanted to create and click create at the bottom of the screen.
Once I press the Create button, I see the deployment in progress. The storage account we created will not cost us anything yet: you get billed when you write data to the storage account and when you read data from it. If we create the storage account but put nothing in it, we are not billed; if we use very little, say only a few files, we are charged only minimally. We can monitor the costs later down the line in Azure.
You will be seeing the notification button at the top right corner of the screen notifying the deployment is complete.
To see what the storage account we created looks like, go to the navigation pane on the left side of the screen and choose Resource groups > the resource group we created (in my example, dbanuggets). Here is our storage account "dbanuggetsstorage".
Once you click on that storage account, a new blade opens in the vertical pane and you will be able to see all the associated details of that storage account.
When you scroll down, you will see "Containers". Click on that. Whenever we want to upload files to or remove files from an Azure storage account, we first need a container to hold them.
Click on Containers and create a container by giving it a name.
Give it a name; here I named the container "dbawork". Remember, we are creating a directory here, and the name must be in all lowercase letters. I am keeping the public access level Private in this example. Then click Create.
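The all-lowercase requirement comes from Azure's container naming rules: 3 to 63 characters, only lowercase letters, numbers, and hyphens, starting and ending with a letter or number, with no consecutive hyphens. As a small illustration (a sketch, not official tooling), the rule can be checked in Python:

```python
import re

# Azure container name rules: 3-63 chars, lowercase letters/digits/hyphens,
# must start and end with a letter or digit, no consecutive hyphens.
_CONTAINER_RE = re.compile(r"^(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def is_valid_container_name(name: str) -> bool:
    return bool(_CONTAINER_RE.match(name))

print(is_valid_container_name("dbawork"))    # True
print(is_valid_container_name("DBAWork"))    # False: uppercase not allowed
print(is_valid_container_name("db"))         # False: shorter than 3 chars
print(is_valid_container_name("bad--name"))  # False: consecutive hyphens
```

The portal enforces the same rules for you; this is only to show why "dbawork" is fine and "DBAWork" is not.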
You will see the notification that the container has been created successfully.
We can now upload and download files. Usually we do all of this through automation, and later we will be doing it through Azure Data Factory. For now, let's go ahead and manually upload a file so we can see how easy it is to import data into Azure. Click on the container we just created.
Click on the Upload button at the top.
Browse to the file you want to upload.
Choose the file and press Upload.
The file has been uploaded. It now lives in Azure, stored in an Azure data center. Any file we upload to a storage account must be uploaded into a container. Containers are just like the directory locations that hold our folders and files in on-premises environments, and we can set up security at the container level.
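If you prefer code over the portal, the same upload can be sketched in Python with the azure-storage-blob SDK. This is a minimal sketch, not production code: the connection string, container, and file names below are placeholders for illustration, and the URL helper is pure string formatting.

```python
import os

def blob_url(account: str, container: str, blob: str) -> str:
    """Build the public URL of a blob; pure string formatting."""
    return f"https://{account}.blob.core.windows.net/{container}/{blob}"

def upload_file(conn_str: str, container: str, path: str) -> None:
    """Upload a local file to a container (requires: pip install azure-storage-blob)."""
    from azure.storage.blob import BlobServiceClient
    service = BlobServiceClient.from_connection_string(conn_str)
    blob_client = service.get_blob_client(container=container,
                                          blob=os.path.basename(path))
    with open(path, "rb") as data:
        blob_client.upload_blob(data, overwrite=True)

if __name__ == "__main__":
    print(blob_url("dbanuggetsstorage", "dbawork", "sample.csv"))
    # With a real connection string in the environment, the upload would be:
    # upload_file(os.environ["AZURE_STORAGE_CONNECTION_STRING"], "dbawork", "sample.csv")
```

In real projects we would automate this through Azure Data Factory or a business application rather than uploading files one by one.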
You can click directly on the file that is now in the container and click Edit to edit its content in place.
A new blade pops up opening the file, and you can edit it.
For easily uploading and managing your files in storage, there is a beautiful application you can use: Azure Storage Explorer.
Azure Storage Explorer is an amazing tool that connects to your Azure account and shows all the different resources you have in Azure. It works with both general-purpose storage accounts and Azure Data Lake Storage Gen2. You can upload and download files and folders, and you can also easily modify the security on these containers through this application. If you do not have it already, download it here.
Once you download Azure Storage Explorer, sign in with the Azure account you created. I signed in with my account, and I can see it in the Account section on the left side of the application.
Under the Explorer panel, expand the Azure subscription to see all the resources you have, such as the storage accounts and containers, along with the file shares.
You do not even have to log in to the Azure portal; you can upload and download right from here. In practice, we usually automate the upload processes through Azure Data Factory or through business applications.
In this post, we have learned about the Azure service models Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), and the differences between them. We have also learned about Azure Storage, Blob Storage, and Data Lake Storage Gen2, all with examples so you can get started with Azure.
In the next post, we will learn the fundamentals of Azure SQL DB.
When I was preparing my presentation "Method Behind the Magic of Internals in SQL Server", I planned to write a blog post about data page internals, and I am finally able to write it. Here are some little nuggets about data page internals in SQL Server.
SQL Server stores data in 8 KB pages. Each page carries pointers to the previous page and the next page. There are two kinds of pages in SQL Server where data gets stored: pages that contain the actual data, and pages that contain data about other pages, meaning metadata information. We will have a separate blog post on the types of pages in SQL Server. No matter how many blog posts we read on this topic, it never gets old 🙂
Let's quickly see the structure of a page.
The page header is always 96 bytes in size. This area contains metadata about the page itself, like the page number, the previous page, the next page, and much more (a separate blog post about the header is coming in the near future).
The payload area is where the actual data gets stored, in slots. At the bottom of the page is the row offset array, which holds 2-byte pointers to the exact offset where each corresponding row begins on the page.
The data inside the payload area might not always be in the physical order of the data rows, but it is always logically ordered by the row offset array in index sort order. SQL Server reads the row offset array from right to left: slot 0 holds a pointer to the lowest key value on the page, slot 1 points to the next lowest key value, and so on.
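To make the slot array concrete, here is a small Python sketch (an illustration of the layout described above, not SQL Server code) that reads the 2-byte little-endian offsets from the end of a toy 8 KB page, right to left:

```python
def read_slot_array(page: bytes, slot_count: int) -> list:
    """Read the row offset array from the end of an 8 KB page.

    Each entry is a 2-byte little-endian offset; slot 0 sits at the
    very end of the page, slot 1 just before it, and so on.
    """
    offsets = []
    for slot in range(slot_count):
        start = len(page) - 2 * (slot + 1)
        offsets.append(int.from_bytes(page[start:start + 2], "little"))
    return offsets

# Toy 8192-byte page: two rows, the first starting right after the 96-byte header.
page = bytearray(8192)
page[8190:8192] = (96).to_bytes(2, "little")   # slot 0 -> row at offset 96
page[8188:8190] = (135).to_bytes(2, "little")  # slot 1 -> row at offset 135
print(read_slot_array(bytes(page), 2))  # [96, 135]
```

The 135 here is just a made-up second offset for the toy page; the point is that each slot stores where its row begins, in index key order.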
There are two main kinds of data types in SQL Server: fixed-length and variable-length. Fixed-length data types always occupy the same space, regardless of whether the value is NULL or NOT NULL. With variable-length data types, SQL Server uses only the space the value requires, plus two extra bytes.
Each byte you see in the picture has a purpose. The first two bytes, Status Bits A and Status Bits B, contain bitmap information about the row: whether the row has been logically deleted/ghosted, the row type, whether there is a versioning tag, and whether the row contains NULL values or variable-length columns. The next 2 bytes store the length of the fixed-length portion of the row. The next n bytes store the fixed-length data itself. After that comes the null bitmap, which holds both the 2-byte column count of the row and the null bitmap array: one bit per column, regardless of whether the column is nullable.
Next is the variable-length data portion, where the first 2 bytes hold the number of variable-length columns in the row. The next n bytes are the offset array for those columns; even when a column's value is NULL, SQL Server still stores a 2-byte offset for each variable-length column, as long as at least one later variable-length column holds a value. After that comes the actual variable-length data. Finally, there is a 14-byte versioning tag used for row versioning: it supports the optimistic isolation levels and online index rebuilds.
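Putting the pieces above together, here is a Python sketch that estimates the minimum size of a data row under this layout. It is an illustration only, assuming no 14-byte versioning tag; the real storage engine has more edge cases:

```python
from math import ceil

def min_row_size(fixed_bytes, total_cols, var_values):
    """Estimate the on-page size of a data row under the layout above.

    fixed_bytes: total bytes of fixed-length data (e.g. 4 for one INT)
    total_cols:  number of columns in the table
    var_values:  byte lengths of the variable-length values, None for NULL
    """
    size = 2 + 2                   # status bits A/B + fixed-length end offset
    size += fixed_bytes            # the fixed-length data itself
    size += 2                      # two-byte column count
    size += ceil(total_cols / 8)   # null bitmap: one bit per column
    # SQL Server stores nothing for trailing NULL variable-length columns.
    while var_values and var_values[-1] is None:
        var_values = var_values[:-1]
    if var_values:
        size += 2                          # variable-length column count
        size += 2 * len(var_values)        # one 2-byte offset per column
        size += sum(v or 0 for v in var_values)  # the variable-length data
    return size

# Row 1: ID=1, nugget1='d'*10, nugget2=NULL, nugget3='b'*10
print(min_row_size(4, 4, [10, None, 10]))    # 39
# Row 2: ID=2, nugget1=NULL, nugget2='a'*10, trailing NULL nugget3 costs nothing
print(min_row_size(4, 4, [None, 10, None]))  # 27
```

Notice how the second row is smaller even though both rows store 10 bytes of character data: the trailing NULL variable-length column is dropped entirely, while the middle NULL still costs a 2-byte offset entry.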
Let’s go ahead and create a brand-new table and insert two rows to see what they actually look like. Using a DBCC command, let us first view all the pages belonging to the table.
create table dbo.dbanuggets
(
    ID int not null,
    nugget1 varchar(255) null,
    nugget2 varchar(255) null,
    nugget3 varchar(255) null
);

insert into dbo.dbanuggets(ID, nugget1, nugget3) values (1, replicate('d',10), replicate('b',10));
insert into dbo.dbanuggets(ID, nugget2) values (2, replicate('a',10));
/* dbcc ind ('databasename','dbo.tablename', -1) -- displays info about all pages of the table */
dbcc ind ('dbanuggets', 'dbo.dbanuggets', -1);
To view the page header and the row contents, first enable trace flag 3604 with DBCC TRACEON(3604) so that DBCC PAGE sends its output to the client, then run DBCC PAGE against the data page:
When you check the first record here, it has a PageType of 10, which is a metadata page: the IAM (Index Allocation Map). The IAM page tracks which pages belong to a particular object; let's not focus on it now, as we will cover allocation map pages in one of the following blog posts. The next record, with PageType 1, is the actual data page that contains the data rows.
Check the row structure here:
The third and fourth bytes in the row form the two-byte value 0800. Byte-swapped, that reads 0x0008, the offset that tells SQL Server where the fixed-length part of the row ends. The next 4 bytes store the ID column, our fixed-length data. The following 2 bytes show how many columns we have; here we have 4 columns. Next comes the one-byte null bitmap; since we have 4 columns, one byte is enough. Its value, 04, is 00000100 in binary, which means the third column in the row contains a NULL value. The next 2 bytes show the number of variable-length columns, 0300 in swapped order, meaning 3 columns. Then come the offset values marking where each variable-length column's data ends, followed by the actual data for the variable-length columns.
The second row looks like this:
The row structure is a bit different here:
Observe that the nugget1 and nugget3 columns hold NULL values. The null bitmap in the second row shows the binary value 00001010, which tells us that nugget1 and nugget3 are NULL. The variable-length column count, however, indicates only two columns: SQL Server keeps no information in the data row about a trailing NULL variable-length column. So when we have variable-length columns that will hold many NULL values, we can keep the data rows smaller by declaring those columns last in the table definition.
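The bitmap decoding can be sketched in a few lines of Python (an illustration, not SQL Server code); bit i of the bitmap corresponds to column i+1 in the table definition:

```python
def null_columns(bitmap, columns):
    """Decode a one-byte null bitmap: bit i set means column i+1 is NULL."""
    return [name for i, name in enumerate(columns) if bitmap & (1 << i)]

cols = ["ID", "nugget1", "nugget2", "nugget3"]
print(null_columns(0b00000100, cols))  # first row:  ['nugget2']
print(null_columns(0b00001010, cols))  # second row: ['nugget1', 'nugget3']
```

This matches what we read from the two rows above: 04 flags the third column, and 0A flags the second and fourth.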
There are 8,060 bytes available for a row on a single data page. The null bitmap is always present in heap tables and in clustered index leaf-level rows, even when the table has no nullable columns. Nonclustered indexes, on the other hand, have no null bitmap when none of the index columns are nullable.
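As a back-of-envelope illustration of that limit (a sketch, not an official formula: it ignores fill factor, forwarded records, and other overheads), each row on a page also needs a 2-byte slot array entry, so we can roughly bound how many rows fit on one page:

```python
PAGE_SIZE = 8192
HEADER = 96  # page header size in bytes

def rows_per_page(row_size):
    """Rough upper bound on rows per page: payload is the page minus the
    96-byte header, and each row also costs a 2-byte slot array entry."""
    return (PAGE_SIZE - HEADER) // (row_size + 2)

print(rows_per_page(100))  # about 79 hundred-byte rows fit on one page
```

Narrower rows mean more rows per page and fewer pages to read, which is one more reason the NULL-placement trick above matters.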
In this post, we have learned what a data page looks like and the different parts of a data page. We have also learned what the data row structure looks like, with examples, and how NULL values in variable-length columns take no space in the row data but are tracked by SQL Server in the null bitmap. I will be posting more about page internals in future blog posts.