SQL Server – Performance – Best Practice – Create Index Before or After a data insert?


As a database developer myself, I am sure that what I am about to document today will be useful to almost everyone who has ever written a stored procedure against indexed tables for SSRS reporting or massive batch processing. What I will be trying today is to find an answer to the dilemma: should indexes be created before or after the insert of data into the table?

The moment of index creation impacts table statistics, and therefore should impact performance. The point that got me thinking about this test is the following line from Books Online (http://msdn.microsoft.com/en-us/library/ms190397.aspx), section “Determining When to Create Statistics”, which quite clearly mentions: “The query optimizer creates statistics for indexes on tables or views when the index is created.”

The approach that I will be taking today is:

  1. Create two identical tables in the TempDB
    • On one table, indexes will be created before the insert of test data is done
    • On the other, I will first insert some data, and then create indexes
  2. Later, I will fire simple SELECT queries against both tables, while monitoring the performance by:
    • setting SET STATISTICS IO ON, and
    • using the SQL Profiler

WARNING: The scripts provided in this post are provided “as-is”, without warranty, and for learning purposes only. Running the DBCC scripts from this post on your QA or production environments can have serious side effects.

Creating the test environment

In order to create the test environment, I will first create a test table, create the necessary indexes and finally add some test data to it.

USE tempdb
GO

/*****************************************************************************************************************
   STEP 01: Create a table, complete with a primary key (to create a clustered index), and a non-clustered index
*****************************************************************************************************************/
USE tempdb
GO
CREATE TABLE DontUpdateStatistics
( StatId INT IDENTITY(1,1),
  RandomId INT DEFAULT (RAND()*1000),
  StatName VARCHAR(10) DEFAULT 'DEFAULT',
  CONSTRAINT PK_DontUpdateStatistics_StatId PRIMARY KEY CLUSTERED 
  (
    StatId
  )
)
GO

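/*****************************************************************************************************************
   STEP 02: Create a non-clustered index on the RandomId column
*****************************************************************************************************************/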
CREATE NONCLUSTERED INDEX NC_DontUpdateStatistics_RandomId ON DontUpdateStatistics(RandomId)
GO

/*****************************************************************************************************************
   STEP 03: Insert some test data into the table
*****************************************************************************************************************/
USE tempdb
GO
INSERT INTO DontUpdateStatistics DEFAULT VALUES
GO 100000

Note that GO 100000 executes the preceding single-row INSERT batch 100,000 times, giving us 100,000 rows of test data. Next, I will create another table. This time, however, the indexes will be created only after the test data has been inserted.

USE tempdb
GO
/*****************************************************************************************************************
   STEP 04: Create a table - this time, do not create any indexes
*****************************************************************************************************************/
CREATE TABLE UpdateStatistics
( StatId INT IDENTITY(1,1),
  RandomId INT DEFAULT (RAND()*1000),
  StatName VARCHAR(10) DEFAULT 'DEFAULT'
)
GO

/*****************************************************************************************************************
   STEP 05: Insert some test data into the table
*****************************************************************************************************************/
USE tempdb
GO
INSERT INTO UpdateStatistics DEFAULT VALUES
GO 100000

/*****************************************************************************************************************
   STEP 06: Now create the primary key clustered index and the non-clustered index
*****************************************************************************************************************/
ALTER TABLE UpdateStatistics
    ADD CONSTRAINT PK_UpdateStatistics_StatId PRIMARY KEY CLUSTERED 
    (
        StatId
    )
GO
CREATE NONCLUSTERED INDEX NC_UpdateStatistics_RandomId ON UpdateStatistics(RandomId)
GO

Run the Test

The first thing I will do is check whether the statistics for both tables have been updated.

Confirm statistics updates

/*****************************************************************************************************************
   STEP 07: See if the statistics on the tables in tempdb were updated
Script source: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
             : by Pinal Dave
*****************************************************************************************************************/
USE tempdb
GO
SELECT name AS index_name,
STATS_DATE(OBJECT_ID, index_id) AS StatsUpdated
FROM sys.indexes
WHERE OBJECT_ID = OBJECT_ID('DontUpdateStatistics') OR
      OBJECT_ID = OBJECT_ID('UpdateStatistics')
GO

As can be seen, the statistics for Table #1 (“DontUpdateStatistics”) have not been updated:

[Image: STATS_DATE output for both tables]

The Test Queries

Next, allow me to present the SELECT queries that will fetch data from the two tables of interest.

USE tempdb
GO
/*****************************************************************************************************************
   STEP 08 A: Clear out the Buffers and the procedure cache.
            : Make sure to run a CHECKPOINT!
Reference source: 
http://beyondrelational.com/blogs/nakul/archive/2011/10/03/dbcc-dropcleanbuffers-needs-to-be-preceded-by-a-checkpoint-to-remove-all-buffers.aspx
*****************************************************************************************************************/
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

/*****************************************************************************************************************
   STEP 08 B: Turn STATISTICS IO ON, and select some data from the DontUpdateStatistics table
*****************************************************************************************************************/
SET STATISTICS IO ON

SELECT * FROM DontUpdateStatistics WHERE StatId BETWEEN 10 AND 1000 
                                     AND RandomId > 300
GO

SET STATISTICS IO OFF

/*****************************************************************************************************************
   STEP 08 C: Clear out the Buffers and the procedure cache.
            : Make sure to run a CHECKPOINT!
Reference source: 
http://beyondrelational.com/blogs/nakul/archive/2011/10/03/dbcc-dropcleanbuffers-needs-to-be-preceded-by-a-checkpoint-to-remove-all-buffers.aspx
*****************************************************************************************************************/
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO

/*****************************************************************************************************************
   STEP 08 D: Turn STATISTICS IO ON, and select some data from the UpdateStatistics table
*****************************************************************************************************************/
SET STATISTICS IO ON

SELECT * FROM UpdateStatistics WHERE StatId BETWEEN 10 AND 1000 
                                 AND RandomId > 300
GO

SET STATISTICS IO OFF

Configure the SQL Profiler

To use the SQL Server Profiler to help back my case, I connected to my SQL Server instance and selected only one event:

Event Group: Errors & Warnings

Event Class: Missing Column Statistics

[Image: SQL Profiler trace properties with the Missing Column Statistics event selected]

The Result

With the SQL Profiler trace running, I turned on the Actual Execution Plan in SSMS (by pressing Ctrl+M) and then executed the test queries. The results were as shown below:

The Execution Plan: [Image]
Statistics Output: [Image]
Profiler Output: [Image]

Confirm statistics updates

/*****************************************************************************************************************
   STEP 09: See if the statistics on the tables in tempdb were updated
Script source: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
             : by Pinal Dave
*****************************************************************************************************************/
USE tempdb
GO
SELECT name AS index_name,
STATS_DATE(OBJECT_ID, index_id) AS StatsUpdated
FROM sys.indexes
WHERE OBJECT_ID = OBJECT_ID('DontUpdateStatistics') OR
      OBJECT_ID = OBJECT_ID('UpdateStatistics')
GO

As can be seen, the statistics for Table #1 (“DontUpdateStatistics”) are now updated. The statistics for Table #2 (“UpdateStatistics”) are unchanged:

[Image: STATS_DATE output after running the test queries]

Post-Mortem

First of all, I would like to draw attention to a line in the Profiler’s description of the “Missing Column Statistics” event class. The line states: “You will not see this event produced unless the option to auto-compute statistics is turned off”. For tempdb, auto-create statistics is ON, so that is not the root cause of this behaviour.

[Image: Profiler documentation for the Missing Column Statistics event class]
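As a quick sanity check (a minimal sketch; these flags are exposed by the sys.databases catalog view), the auto-statistics settings for tempdb can be confirmed with:

USE master
GO
SELECT name,
       is_auto_create_stats_on,  -- corresponds to AUTO_CREATE_STATISTICS
       is_auto_update_stats_on   -- corresponds to AUTO_UPDATE_STATISTICS
FROM sys.databases
WHERE name = 'tempdb'
GO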

If this is not the issue, then what is? Summarizing the results, I can derive the following conclusions:

  • Both queries appear to have executed with an equal performance impact – a 50% split between them according to the execution plan
  • However, looking at the STATISTICS IO output, I can see that:
    • Read-ahead reads for Table #1 (“DontUpdateStatistics”) = 16
    • Read-ahead reads for Table #2 (“UpdateStatistics”) = 4
    • The query against Table #1 (“DontUpdateStatistics”) requires 4 times more read-ahead reads than the query against Table #2
  • The SQL Profiler warns me that column statistics are not available, and therefore a sub-optimal plan may be chosen by the optimizer

Revisiting Books Online (http://msdn.microsoft.com/en-us/library/ms190397.aspx), the section “Determining When to Create Statistics” clearly mentions that the query optimizer creates statistics for indexes on tables or views when the index is created.

More precisely, missing statistics are created when a SELECT is performed on the columns for which statistics were found to be missing.
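To see this in the catalog views (a minimal sketch; sys.stats lists index statistics as well as any auto-created column statistics, whose names start with _WA_Sys_):

USE tempdb
GO
SELECT OBJECT_NAME(s.[object_id]) AS table_name,
       s.name AS stats_name,
       s.auto_created,  -- 1 = auto-created column statistics
       STATS_DATE(s.[object_id], s.stats_id) AS StatsUpdated
FROM sys.stats AS s
WHERE s.[object_id] IN (OBJECT_ID('DontUpdateStatistics'), OBJECT_ID('UpdateStatistics'))
GO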

In our case, we created the indexes on Table #2 (“UpdateStatistics”) after the data was inserted. Because creating an index involves reading the data in the table, the statistics were created at that point. For Table #1 (“DontUpdateStatistics”) though, the indexes were created before inserting the data. When query #1 was executed, the statistics had to be generated first, and only then did the optimization happen – which makes the query a little slower, as it spends more time on read-ahead reads.

If we look up STATISTICS IO in Books Online (http://msdn.microsoft.com/en-us/library/ms184361.aspx), we can see that the read-ahead read count gives us the number of pages placed into the cache. In our case, all pages of Table #1 had to be placed into the cache for the statistics update.
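If the indexes have to exist before the load (for example, because they enforce constraints), one way to avoid paying this price on the very first SELECT is to refresh the statistics explicitly right after the bulk insert. A minimal sketch against our test table:

USE tempdb
GO
-- Refresh all statistics on the table by scanning every row
UPDATE STATISTICS DontUpdateStatistics WITH FULLSCAN
GO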

Conclusion

It’s been a rather long post, with a simple conclusion. It is important to think about when index creation should be performed because using the proper sequence is important from the performance and IO perspective.

Until we meet next time,

Be courteous. Drive responsibly.

10 thoughts on “SQL Server – Performance – Best Practice – Create Index Before or After a data insert?”

  1. Hardik Doshi

    Nice post with good example.

    I like the sentence: It is **important to think** about when index creation should be performed because using the proper sequence is important from the performance and IO perspective.

  2. marc_jellinek@hotmail.com

    Here’s a few points that you didn’t cover:

    **Indexes are used to enforce referential integrity and uniqueness constraints.**

    If you create your table, load your data, then create your indexes, you stand the risk of the CREATE INDEX failing because your pre-existing data doesn’t conform to the assumptions of the index. If you have a column (or columns) which is supposed to have unique values, enforced by an index, PRIMARY KEY constraint, FOREIGN KEY constraint or uniqueness constraint, the CREATE INDEX statement will fail.

    If you are going to create tables, load data, then create indexes (as is common practice in data warehousing), you must pre-validate your data or risk the CREATE INDEX statement failing.
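    A minimal sketch of such a pre-validation check – assuming, purely for illustration, that RandomId in the post’s test table were required to be unique:

    -- Any rows returned here would make a unique index on RandomId fail
    SELECT RandomId, COUNT(*) AS duplicate_count
    FROM UpdateStatistics
    GROUP BY RandomId
    HAVING COUNT(*) > 1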

    **Indexes affect INSERT, UPDATE and DELETE performance**

    Why is it common practice to load data first, then create indexes (or drop indexes, load data, then re-create the indexes)? Because for every row inserted or updated, it is validated against the indexes associated with the table you are inserting into, the indexes are updated and then the row is inserted. For every row deleted, all indexes associated with the table are also updated.
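    A sketch of that pattern using ALTER INDEX on the post’s test table (this applies to the non-clustered index; disabling a clustered index would make the table inaccessible):

    -- Disable the non-clustered index before the bulk load...
    ALTER INDEX NC_UpdateStatistics_RandomId ON UpdateStatistics DISABLE
    GO
    -- ...load the data here...
    -- ...then rebuild, which also refreshes the index's statistics
    ALTER INDEX NC_UpdateStatistics_RandomId ON UpdateStatistics REBUILD
    GO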

    **When to create indexes first**

    If you aren’t going to be there to pre-validate your data, create your indexes first. Your data will be validated against your indexes.

    **When to load data, then (re-)create indexes**

    If you can pre-validate your data, load your data first, then (re-)create your indexes. If you are using clustered indexes, you also want to pre-sort your data.

  3. ErikEckhardt

    One thing I’d like to know is the end-to-end performance difference.

    Even if statistics are already created in one method when doing the first SELECT, this doesn’t prove the overall performance was better. All we really know is that it delays creating the statistics, but the engine still has to do this task in both scenarios, so bundling the cost of it into the SELECT doesn’t really prove anything since that method avoided the cost earlier.

    I’d like to see a profiler trace of the entire process for method 1 compared to a profiler trace of the entire process for method 2. I’d also like to see information on the final fragmentation of the two tables. Theoretically (at least in the strange recesses of my own imagination), creating the index before loading the table could result in less fragmentation. Depending on the planned usage of the table, this could be more important than a minor performance difference up to the first select.

    The amount of reorganization needed to convert a heap to a clustered index depends on the disorganization of the data as inserted. If during row insertion from another source a destination table’s clustered index is already present, it seems to me that less memory or tempdb could be required because it could result in almost a write-once operation (though of course that’s not completely true because there would be page splits nonetheless, but after each write the row could be forgotten about). Contrast this with the work necessary to sort the heap *in-place*. That could theoretically take *double* the new storage allocation temporarily, while the new location is written with the data and before the old location is destroyed–as required by ACID and to make the transaction reversible if it does not complete.

    An analogy would be the difference between moving 52 shuffled cards one at a time to a new sorted list, vs. reorganizing the cards in-place. The first task is substantially easier (think of them end to end on a table rather than stacked).

    It could be that a very disorganized heap would yield best performance with index-first, and a nearly index-order heap would yield best performance with index-last. I think more testing is in order.
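    For reference, the fragmentation Erik asks about can be inspected with the sys.dm_db_index_physical_stats DMF – a minimal sketch against one of the post’s test tables:

    USE tempdb
    GO
    SELECT ips.index_id,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'), OBJECT_ID('UpdateStatistics'), NULL, NULL, 'LIMITED') AS ips
    GO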

  4. satishchandrat

    Our build process involves the creation of the database and all DB objects, driven through T-SQL scripts.
    Some of the master tables have seed data. Once the database, tables and procs are created, including indexes and constraints, the seed data (insert) scripts are executed in sequence. So far, fine.
    As part of supporting multiple languages, the data for one of our language attributes has grown 10-fold. Instead of a single large script file, the seed data script has been split on a per-language basis to limit the script file size.
    The issue being encountered is that there is a long delay in the execution of all the scripts.

    As suggested by Marc, if we split the sequence of execution as
    create tables — load data — apply indexes and constraints,
    would this help, provided the data is pre-validated to support the index and constraint definitions?

    Any inputs/suggestions appreciated.

  5. ErikEckhardt

    Why don’t you try it and post the results back to us? And once the data is loaded, how fragmented are the indexes?

  6. Nakul Vachhrajani

    @ErikEckhardt – I surely will. It’s been a busy month at the office and therefore I have been noting down the things to do in my little notepad. Hopefully, I should get some time over the Christmas weekend.

  7. marc_jellinek@hotmail.com

    Question about a long delay in the execution of the scripts, now that you are handling 10x the data:

    Are you seeing transaction logs and data files autogrowing? If so, pre-allocate the necessary space. Remember, this will apply to both the user database and tempdb.
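    (A hedged illustration of pre-allocating space – the database and logical file names below are placeholders:)

    -- Grow the data and log files once, up front, instead of relying on autogrow
    ALTER DATABASE MyBuildDb MODIFY FILE (NAME = N'MyBuildDb_data', SIZE = 10240MB)
    GO
    ALTER DATABASE MyBuildDb MODIFY FILE (NAME = N'MyBuildDb_log', SIZE = 4096MB)
    GO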

    Also, you may want to check if “instant file initialization” is enabled. You do this by adding the SQL Server service account to the “Perform Volume Maintenance Tasks” local policy. This will allow data files to grow very quickly, but has no effect on transaction logs.

    Where are you seeing the delay? During [create tables], [load], [apply indexes and constraints]?

    Have you paid attention to the order in which indexes are applied? Create CLUSTERED indexes first, then NONCLUSTERED indexes. If you create NONCLUSTERED indexes first, then CLUSTERED indexes, when you create the clustered index, all nonclustered indexes will be re-created.
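    Using the post’s own tables to illustrate that ordering:

    -- Clustered index (via the primary key) first...
    ALTER TABLE UpdateStatistics
        ADD CONSTRAINT PK_UpdateStatistics_StatId PRIMARY KEY CLUSTERED (StatId)
    GO
    -- ...then the non-clustered index, so it is built only once
    CREATE NONCLUSTERED INDEX NC_UpdateStatistics_RandomId ON UpdateStatistics(RandomId)
    GO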

  8. Pingback: #0391 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Performance – Table Lock option | SQLTwins by Nakul Vachhrajani

Let me know what you think about this post by leaving your feedback here!
