In November 2011, I wrote a post titled SQL Server – Performance – Best Practice – Create Index Before or After a data insert?. The post dealt with answering the question – which method was better – whether to create an index before or after a bulk-insert? The post was essentially targeted to code that generated temporary tables for SSRS reporting or bulk processing.
Erik Eckhardt provided some great feedback on the post. Essentially, his feedback was:
- He wanted to see a Profiler trace comparison between the two approaches
- A fragmentation study was in order – which method produced the most efficient indexes?
I had intended to carry out this study a long time ago (during the Christmas week-end), but it’s better late than never. So, here’s the results of the study.
Creating the test data
The approach that I will be taking today is:
- Create two identical tables in the TempDB
- On one table, indexes will be created before the insert of test data is done
- On the other, I will first insert some data, and then create indexes
WARNING: The scripts provided in this post are provided “as-is” and without warranty for understanding purposes only. Using the DBCC scripts provided in the post on your QA or production environments can have serious side effects
The script that I will be using today is same as that in my earlier post. However, please find it below for your kind reference. First up is the query that creates the indices before inserting the data.
USE tempdb GO SET NOCOUNT ON /***************************************************************************************************************** STEP 01: Create a table, complete with a primary key (to create a clustered index), and a non-clustered index *****************************************************************************************************************/ USE tempdb GO CREATE TABLE DontUpdateStatistics ( StatId INT IDENTITY(1,1), RandomId INT DEFAULT (RAND()*1000), StatName VARCHAR(10) DEFAULT 'DEFAULT', CONSTRAINT PK_DontUpdateStatistics_StatId PRIMARY KEY CLUSTERED ( StatId ) ) GO CREATE NONCLUSTERED INDEX NC_DontUpdateStatistics_RandomId ON DontUpdateStatistics(RandomId) GO /***************************************************************************************************************** STEP 02: Insert some test data into the table *****************************************************************************************************************/ USE tempdb GO INSERT INTO DontUpdateStatistics DEFAULT VALUES GO 100000
Next up is the script that I recommend – wherein the indexes are created after the data is inserted.
USE tempdb GO SET NOCOUNT ON /***************************************************************************************************************** STEP 04: Create a table - this time, do not create any indexes *****************************************************************************************************************/ CREATE TABLE UpdateStatistics ( StatId INT IDENTITY(1,1), RandomId INT DEFAULT (RAND()*1000), StatName VARCHAR(10) DEFAULT 'DEFAULT' ) GO /***************************************************************************************************************** STEP 05: Insert some test data into the table *****************************************************************************************************************/ USE tempdb GO INSERT INTO UpdateStatistics DEFAULT VALUES GO 100000 /***************************************************************************************************************** STEP 06: Now create the primary key clustered index and the non-clustered index *****************************************************************************************************************/ ALTER TABLE UpdateStatistics ADD CONSTRAINT PK_UpdateStatistics_StatId PRIMARY KEY CLUSTERED ( StatId ) GO CREATE NONCLUSTERED INDEX NC_UpdateStatistics_RandomId ON UpdateStatistics(RandomId) GO
I will now start by answering the 2nd question first – which method produced an index that was more efficient from a fragmentation perspective?
Fragmentation study
Please note that in both the queries above, the clustered index is created on the column – StatId, which is an identity column. The inserts are therefore in the order we would want them to be in both cases, causing a “write-once” operation as Erik put it.
Once the data has been inserted into the temporary tables, I ran the following query that uses the sys.dm_db_index_physical_stats DMV to take a look at the index physical statistics. NOTE: I will not use the DBCC SHOWCONTIG because it is marked for deprecation (BOL reference here).
USE tempdb GO SELECT 'Table: DontUpdateStatistics' AS TableName, index_id AS IndexId, index_type_desc AS IndexType, index_depth AS IndexDepth, avg_fragment_size_in_pages AS 'AverageFragmentation(Pages)', avg_fragmentation_in_percent AS 'AverageFragmentation(Percentage)', fragment_count AS Fragments, page_count AS 'PageCount', partition_number AS PartitionNumber, alloc_unit_type_desc AS AllocationUnitType, index_level AS IndexLevel FROM sys.dm_db_index_physical_stats(DB_ID(),OBJECT_ID('DontUpdateStatistics'),-1,0,NULL) SELECT 'Table: UpdateStatistics'AS TableName, index_id AS IndexId, index_type_desc AS IndexType, index_depth AS IndexDepth, avg_fragment_size_in_pages AS 'AverageFragmentation(Pages)', avg_fragmentation_in_percent AS 'AverageFragmentation(Percentage)', fragment_count AS Fragments, page_count AS 'PageCount', partition_number AS PartitionNumber, alloc_unit_type_desc AS AllocationUnitType, index_level AS IndexLevel FROM sys.dm_db_index_physical_stats(DB_ID(),OBJECT_ID('UpdateStatistics'),-1,0,NULL)
I looked at the following values:
- avg_fragmentation_in_percent – The percent of logical fragmentation (out-of-order pages in the index)
- fragment_count – The number of fragments (physically consecutive leaf pages) in the index
- avg_fragment_size_in_pages – Average number of pages in one fragment in an index
Essentially, The avg_fragment_size_in_pages and avg_fragmentation_in_percent values are inversely proportional to each other. Therefore, we are looking for a higher value of avg_fragment_size_in_pages, and a lower value of avg_fragmentation_in_percent.
Below is the output that I received when the above queries were executed against my test instance.
In case you are not able to see the screenshot above, here’s the same data in tabular format:
TableName | Index Id |
IndexType | Index Depth |
Average Fragmentation(Pages) |
Average Fragmentation(Percentage) |
Fragments | Page Count |
Partition Number |
Allocation Unit Type |
Index Depth |
Table: DontUpdateStatistics | 1 | CLUSTERED INDEX | 2 | 11.5666666666667 | 1.44092219020173 | 30 | 347 | 1 | IN_ROW_DATA | 0 |
Table: DontUpdateStatistics | 2 | NONCLUSTERED INDEX | 2 | 1 | 99.6108949416342 | 257 | 257 | 1 | IN_ROW_DATA | 0 |
Table: UpdateStatistics | 1 | CLUSTERED INDEX | 2 | 173.5 | 0 | 2 | 347 | 1 | IN_ROW_DATA | 0 |
Table: UpdateStatistics | 2 | NONCLUSTERED INDEX | 2 | 58 | 0.574712643678161 | 3 | 174 | 1 | IN_ROW_DATA | 0 |
Results
Contrary to expectations, the results above clearly indicate that we have higher fragmentation in cases when indexes are created before data is inserted into the tables. It does not matter when these values (i.e. the results of the DMV) are fetched – before or after a SELECT is executed against the table, i.e. the index fragmentation remains the same and are not affected by when the statistics on the table are updated.
CPU Utilization and Elapsed Time study
I would address this part of the study in 2 parts:
- In my original post (here), we already saw that when running the SELECT statements against the table, the Profiler was able to detect that the table: DontUpdateStatistics – did not have the expected statistics
- We have also seen that when STATISTICS IO was turned ON, the SQL Server clearly showed a performance difference between the two SELECT statements
- In this post, I will try to see whether the performance gain in the second SELECT has been redistributed during the index creation or data load phases
To see the overall CPU time and elapsed time statistics, I used the SET STATISTICS TIME ON option wrapped around the Index (both clustered & non-clustered) creation and data insert statements. The result of the execution based on the data returned for the test data of 100,000 records is summarized below:
Data Insert (ms) | Index (clustered & non-clustered) creation (ms) | Totals (ms) | |
CPU Time | — | — | — |
Table: DontUpdateStatistics | 4687 | 0 | 4687 |
Table: UpdateStatistics | 4533 | 400 | 4933 |
Elapsed Time | — | — | — |
Table: DontUpdateStatistics | 24264 | 7 | 24271 |
Table: UpdateStatistics | 22364 | 23872 | 46236 |
Summary
It is now time to summarize the findings.
Data can be loaded to a table in multiple ways. We have been comparing the pros & cons of 2 such methods:
- Create necessary supporting indexes first, insert data afterwards
- Insert data first, create necessary supporting indexes afterwards
My original post compared these approaches in terms of the number of read-ahead reads (number of pages placed into the cache) required before the SQL server can serve our request to fetch data from the tables depending upon when and whether statistics were updated for the tables. In this post, we studied the level of index fragmentation and time statistics.
The following chart is a summary of my findings based on which approach “wins” (Wins represented as “W”, whereas a loss by “L”):
Most up-to-date statistics | Read-ahead reads required to service a query | Index fragmentation level | Time (CPU + Elapsed time) statistics | |
Index before, insert after |
L |
L |
L |
W |
Insert before, index after |
W |
W |
W |
L |
Further reading |
This post |
This post |
In short, if execution time is of utmost importance for a given scenario, the approach #1 is preferred wherein indexes are created before inserting the data (which is the most common approach). However, if fragmentation, up-to-date statistics and reduced number of read-ahead reads are of importance (in case of read-only data), the approach #2 is preferable wherein data is inserted before creating the indexes.
I trust you found the study conducted above useful & interesting. If you have any comments, or would like to know more about it, please do feel free to contact me via the associated discussion or twitter. As always, I really appreciate the feedback that you, the community, provides.
References
- DMV: sys.dm_db_index_physical_stats
- DBCC command: DBCC SHOWCONTIG
- Reorganizing and Rebuilding Indexes
- SET STATISTICS IO
- SET STATISTICS TIME
Until we meet next time,
Pingback: #0391 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Performance – Table Lock option | SQLTwins by Nakul Vachhrajani