Tag Archives: DBA

Articles for the DBA – accidental or otherwise

#0267 – SQL Server – IDENTITY columns – Myths – IDENTITY columns cannot have holes or “gaps”


Whenever a pattern evolves into a frequently used practice, it does not take long for myths to start getting associated with it. IDENTITY columns have become such a common pattern that often I hear about these myths from novice developers. Using this series, I hope to bust those myths once and for all.


Busting the Myth – IDENTITY columns cannot have holes


Whenever a break in a series is encountered, it is referred to as a hole. For example, if we have a series of numbers from 1 to 5 such that the number 2 is missing (1, 3, 4, 5), then we have a hole.


I often hear from developers that a prime reason for using IDENTITY columns is that “holes cannot be generated in the values of a column defined as an IDENTITY column”. This is, in fact, incorrect, as I will prove in this post.


The Test


To prove that holes can be generated in an IDENTITY column, I will use the script shown below. The script:



  1. Prepares the environment by creating a table and adding some test data into it

  2. Checks the current identity value of the table using IDENT_CURRENT()

  3. Opens a transaction

  4. Performs an insert into the table, which has the IDENTITY property defined

  5. Rolls back the transaction, and queries the current identity value of the table again using IDENT_CURRENT()

  6. Inserts one more record into the table and selects all values – clearly indicating the presence of the hole

USE tempdb;
GO

SET NOCOUNT ON;

--1. Prepare the environment
--   Create a table, and add some test data into it

--Safety Check
IF OBJECT_ID('dbo.tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

--Create a table
CREATE TABLE dbo.tIdentity (IdentityId INT IDENTITY(1,1),
                            IdentityValue VARCHAR(10)
                           );
GO

--Add some test data
INSERT INTO dbo.tIdentity (IdentityValue)
VALUES ('One'),
       ('Two'),
       ('Three');
GO

--2. Check the current identity value of the table using IDENT_CURRENT()
SELECT IDENT_CURRENT('dbo.tIdentity') AS IdentityValueBeforeTest;
GO

--3. Perform the test
--3a. Open a transaction
BEGIN TRANSACTION IdentityHolesCheck
    --3b. Perform an insert
    INSERT INTO dbo.tIdentity (IdentityValue)
    VALUES ('Four');
--3c. Rollback the transaction
ROLLBACK TRANSACTION IdentityHolesCheck;
GO

--4. Check the current identity value of the table using IDENT_CURRENT()
SELECT IDENT_CURRENT('dbo.tIdentity') AS IdentityValueAfterTest;
GO

--5. Confirmatory test
INSERT INTO dbo.tIdentity (IdentityValue)
VALUES ('Five');
GO

--6. Select the values from the table
SELECT ti.IdentityId, ti.IdentityValue
FROM dbo.tIdentity AS ti;
GO

--Cleanup
IF OBJECT_ID('dbo.tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

/**********************************************
               RESULTS
**********************************************/
/*
IdentityValueBeforeTest
-----------------------
3

IdentityValueAfterTest
----------------------
4

IdentityId  IdentityValue
----------- -------------
1           One
2           Two
3           Three
5           Five
*/

While this was a simulation, in the practical world this situation is similar to a failed insert. Even though the INSERT did not go through, the IDENTITY value was incremented, leaving a hole behind.
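Here is a minimal sketch of that failed-insert case (my addition, not part of the original test; the table name is hypothetical): the CHECK constraint rejects the second row, yet the identity value is still consumed, leaving a gap.

USE tempdb;
GO

CREATE TABLE dbo.tFailedInsert (Id  INT IDENTITY(1,1),
                                Val INT CHECK (Val > 0)
                               );
GO

INSERT INTO dbo.tFailedInsert (Val) VALUES (1);  --Succeeds; Id = 1
GO
INSERT INTO dbo.tFailedInsert (Val) VALUES (-1); --Fails the CHECK constraint; Id = 2 is consumed anyway
GO
INSERT INTO dbo.tFailedInsert (Val) VALUES (2);  --Succeeds; Id = 3
GO

SELECT Id, Val FROM dbo.tFailedInsert;           --Returns Ids 1 and 3 - the hole at 2 remains
GO

DROP TABLE dbo.tFailedInsert;
GO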


Conclusion


The test shown above clearly proves that it is possible to have holes in an IDENTITY column. Any code that expects an uninterrupted series of values may break if this has not been taken into consideration.


Until we meet next time,


Be courteous. Drive responsibly.

#0266 – SQL Server – IDENTITY Columns – @@IDENTITY, SCOPE_IDENTITY, IDENT_CURRENT and $IDENTITY


IDENTITY columns are perhaps the least understood and yet most used features of Microsoft SQL Server.


In this post, I will explore some system functions and keywords that we as developers and administrators can use to get information about IDENTITY columns.


Whenever a record is inserted into a table that has an IDENTITY column – via an INSERT, SELECT INTO, or bulk copy operation – the identity value of the affected table is incremented. Microsoft SQL Server gives us three system functions that can provide the last used identity value – depending upon the scope and session.


IDENT_CURRENT()


Given a table name, this system function returns the last identity value used for the given table or view. Since this function works on a particular table or view, this value is independent of the session or the scope.


SCOPE_IDENTITY()


This system function provides the last used IDENTITY value within the same scope.


Statements inside the same stored procedure, trigger, or batch share the same scope; statements within triggers fired as a result of a statement inside a stored procedure do not.


If no statement that affects an IDENTITY value has been executed within the current scope, this function returns NULL.


@@IDENTITY


The @@IDENTITY system function returns the last identity value generated in the current session – irrespective of the scope.


If no statement that affects an IDENTITY value has been executed in the session, this function returns NULL. However, please note that because @@IDENTITY is not limited by scope, if a statement causes a trigger to fire which generates identity values in other tables, calling @@IDENTITY immediately after that statement will return the last IDENTITY value generated by the trigger.
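Before the fuller example below, here is a minimal, self-contained sketch (my addition; the table and trigger names are hypothetical) contrasting the two functions when a trigger fires:

USE tempdb;
GO

CREATE TABLE dbo.tMain  (MainId  INT IDENTITY(1,1),   MainValue VARCHAR(10));
CREATE TABLE dbo.tAudit (AuditId INT IDENTITY(100,1), AuditNote VARCHAR(10));
GO

CREATE TRIGGER dbo.trMainInsert ON dbo.tMain AFTER INSERT
AS
    INSERT INTO dbo.tAudit (AuditNote) VALUES ('Audit');
GO

INSERT INTO dbo.tMain (MainValue) VALUES ('One');

SELECT SCOPE_IDENTITY() AS ScopeIdentityValue, --1: generated in dbo.tMain, within this scope
       @@IDENTITY       AS LastUsedIdentity;   --100: generated in dbo.tAudit by the trigger
GO

DROP TABLE dbo.tMain;
DROP TABLE dbo.tAudit;
GO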


Let’s look at each of these via an example:

USE AdventureWorks2008R2;
GO
SET NOCOUNT ON;

SELECT  IDENT_CURRENT('Purchasing.PurchaseOrderDetail') AS PODetailIdentity,
        IDENT_CURRENT('Production.TransactionHistory') AS TransactionIdentity,
        SCOPE_IDENTITY() AS ScopeIdentityValue,
        @@IDENTITY AS LastUsedIdentity;

    --Perform the INSERT into the table: Purchasing.PurchaseOrderDetail
    --The insert trigger on this table: Purchasing.iPurchaseOrderDetail, causes a record
    --to be inserted into Production.TransactionHistory, thereby incrementing its IDENTITY column - TransactionID
    INSERT  INTO Purchasing.PurchaseOrderDetail
            (
              PurchaseOrderID,
              DueDate,
              OrderQty,
              ProductID,
              UnitPrice,
             --LineTotal,
              ReceivedQty,
              RejectedQty,
             --StockedQty,
              ModifiedDate
            )
            SELECT  1 AS PurchaseOrderId,
                    '2005-05-31' AS DueDate,
                    1 AS OrderQty,
                    4 AS ProductId,
                    57.02 AS UnitPrice,
                    --258.06 AS LineTotal,
                    1 AS ReceivedQty,
                    0 AS RejectedQty,
                    --1 AS StockedQty,
                    '2013-05-09' AS ModifiedDate;
    --No GO here: the following SELECT must run in the same batch (scope)
    --for SCOPE_IDENTITY() to return the value shown in the results below.

SELECT  IDENT_CURRENT('Purchasing.PurchaseOrderDetail') AS PODetailIdentity,
        IDENT_CURRENT('Production.TransactionHistory') AS TransactionIdentity,
        SCOPE_IDENTITY() AS ScopeIdentityValue,
        @@IDENTITY AS LastUsedIdentity;

/****RESULTS****/
/*
        PODetailIdentity    TransactionIdentity    ScopeIdentityValue    LastUsedIdentity
Before: 8847                213450                 NULL                  NULL
After : 8848                213451                 8848                  213451
*/

Inserting a new record in the table Purchasing.PurchaseOrderDetail causes the insert trigger on this table, Purchasing.iPurchaseOrderDetail, to be executed. This trigger maintains a history of the PurchaseOrderDetail transactions in the table Production.TransactionHistory, thereby incrementing its IDENTITY column – TransactionID.


As can be seen in the output:



  • IDENT_CURRENT() provides us the last used identity value for the given table

  • SCOPE_IDENTITY() provides us the last used identity value within the scope of the batch

  • @@IDENTITY provides us the last used identity value irrespective of the scope

$IDENTITY


The $IDENTITY keyword allows us to refer to an identity column in a table without using the corresponding column name. Here’s an example:

USE AdventureWorks2008R2 ;
GO
SELECT  $IDENTITY
FROM    Purchasing.PurchaseOrderDetail ;
GO

The query above will return the value of the [PurchaseOrderDetailId] in the table [Purchasing].[PurchaseOrderDetail] since it’s marked as an identity column.


Other functions – IDENT_INCR() and IDENT_SEED()


If one needs to look at the seed and increment values used in the identity column definition, the IDENT_SEED() and IDENT_INCR() functions can be used.

USE AdventureWorks2008R2;
GO
SELECT IDENT_SEED('Production.TransactionHistory') AS IdentitySeed,
       IDENT_INCR('Production.TransactionHistory') AS IdentityIncrement;
GO
/* RESULTS */
/*
IdentitySeed   IdentityIncrement
100000         1
*/

Checking for Identity columns in a table


The Catalog View sys.identity_columns has one record for each identity column and provides information such as the seed, increment, and last value generated for that column, along with other metadata.


Usage is shown in the query below:

USE AdventureWorks2008R2 ;
GO
SELECT * FROM sys.identity_columns
WHERE object_id = OBJECT_ID('Production.TransactionHistory');
GO
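If only the commonly used attributes are needed, a narrower query along the same lines (a sketch) would be:

USE AdventureWorks2008R2 ;
GO
SELECT ic.name AS ColumnName,
       ic.seed_value AS SeedValue,
       ic.increment_value AS IncrementValue,
       ic.last_value AS LastValue
FROM sys.identity_columns AS ic
WHERE ic.object_id = OBJECT_ID('Production.TransactionHistory');
GO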

Until we meet next time,


Be courteous. Drive responsibly.

#0170-SQL Server-Deprecated Features-Column Alias defined by a string enclosed in quotation marks


I recently wrote a piece about the impact of table and column aliases on query performance and plan cache size. As I was writing that post, I recalled that there are multiple ways in which column aliases can be defined in a T-SQL query, depending upon the developer’s preference. However, one of these methods is marked for deprecation in a future release of Microsoft SQL Server (it is still a valid method in SQL Server 2012, so there is no cause for worry in the immediate future).

The available methods for column aliasing and their status is shown in the table below:

Format                          Status
----------------------------    ----------
'column_alias' = expression     Deprecated
expression AS column_alias      Active
expression AS 'column_alias'    Active
expression AS [column_alias]    Active
[column_alias] = expression     Active
expression AS "column_alias"    Active

All of these methods are demonstrated in the query below:

USE AdventureWorks2012;
GO

SELECT                                            --           Format            ;   Status
    'Employee Birth Date' = Employee.BirthDate,   -- 'column_alias' = expression ; Deprecated
    Employee.HireDate AS JoiningDate,             -- expression AS column_alias  ; Active
    Employee.BusinessEntityID AS "EmployeeId",    -- expression AS "column_alias"; Active
    Employee.OrganizationLevel AS [Org. Level],   -- expression AS [column_alias]; Active
    [Salary Flag] = Employee.SalariedFlag,        -- [column_alias] = expression ; Active
    Employee.SickLeaveHours AS "SickHours",       -- expression AS "column_alias"; Active
    'VacationHours' = Employee.VacationHours      -- 'column_alias' = expression ; Deprecated
FROM HumanResources.Employee AS Employee;
GO
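One point worth remembering (my addition): how the double-quoted form is parsed depends on the QUOTED_IDENTIFIER setting. A quick sketch:

SET QUOTED_IDENTIFIER ON;
GO
--Here, "EmployeeId" is parsed as a delimited identifier
SELECT Employee.BusinessEntityID AS "EmployeeId"
FROM HumanResources.Employee AS Employee;
GO

SET QUOTED_IDENTIFIER OFF;
GO
--Here, "EmployeeId" is parsed as a string literal, equivalent to 'EmployeeId'
SELECT Employee.BusinessEntityID AS "EmployeeId"
FROM HumanResources.Employee AS Employee;
GO

SET QUOTED_IDENTIFIER ON; --Restore the default setting
GO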

My recommendations

My recommendations around any such feature are:

  • To have one standard for the entire organization/product
  • The element adopted as the standard should be one that is not marked for deprecation within at least the next 2 major releases of Microsoft SQL Server
  • Finally, and most importantly, the feature should help the team become more productive and efficient

I personally use either the expression AS [column_alias] or the [column_alias] = expression method.

Before you leave, do share your preferred format for column aliasing.

Until we meet next time,

Be courteous. Drive responsibly.

#0138 – SQL Server – Performance – Best Practice – Create Index Before or After a data insert – I/O, Fragmentation, CPU and Elapsed Time


In November 2011, I wrote a post titled SQL Server – Performance – Best Practice – Create Index Before or After a data insert?. The post dealt with answering the question – which method was better – whether to create an index before or after a bulk-insert? The post was essentially targeted to code that generated temporary tables for SSRS reporting or bulk processing.

Erik Eckhardt provided some great feedback on the post. Essentially, his feedback was:

  1. He wanted to see a Profiler trace comparison between the two approaches
  2. A fragmentation study was in order – which method produced the most efficient indexes?

I had intended to carry out this study a long time ago (during the Christmas weekend), but it’s better late than never. So, here are the results of the study.

Creating the test data

The approach that I will be taking today is:

  1. Create two identical tables in the TempDB
    • On one table, indexes will be created before the test data is inserted
    • On the other, I will first insert some data, and then create indexes

WARNING: The scripts provided in this post are provided “as-is” and without warranty for understanding purposes only. Using the DBCC scripts provided in the post on your QA or production environments can have serious side effects.

The script that I will be using today is the same as the one in my earlier post; however, please find it below for your reference. First up is the query that creates the indices before inserting the data.

USE tempdb
GO

SET NOCOUNT ON

/*****************************************************************************************************************
   STEP 01: Create a table, complete with a primary key (to create a clustered index), and a non-clustered index
*****************************************************************************************************************/
USE tempdb
GO
CREATE TABLE DontUpdateStatistics
( StatId INT IDENTITY(1,1),
  RandomId INT DEFAULT (RAND()*1000),
  StatName VARCHAR(10) DEFAULT 'DEFAULT',
  CONSTRAINT PK_DontUpdateStatistics_StatId PRIMARY KEY CLUSTERED 
  (
    StatId
  )
)
GO

CREATE NONCLUSTERED INDEX NC_DontUpdateStatistics_RandomId ON DontUpdateStatistics(RandomId)
GO

/*****************************************************************************************************************
   STEP 02: Insert some test data into the table
*****************************************************************************************************************/
USE tempdb
GO
INSERT INTO DontUpdateStatistics DEFAULT VALUES
GO 100000

Next up is the script that I recommend – wherein the indexes are created after the data is inserted.

USE tempdb
GO

SET NOCOUNT ON

/*****************************************************************************************************************
   STEP 04: Create a table - this time, do not create any indexes
*****************************************************************************************************************/
CREATE TABLE UpdateStatistics
( StatId INT IDENTITY(1,1),
  RandomId INT DEFAULT (RAND()*1000),
  StatName VARCHAR(10) DEFAULT 'DEFAULT'
)
GO

/*****************************************************************************************************************
   STEP 05: Insert some test data into the table
*****************************************************************************************************************/
USE tempdb
GO
INSERT INTO UpdateStatistics DEFAULT VALUES
GO 100000

/*****************************************************************************************************************
   STEP 06: Now create the primary key clustered index and the non-clustered index
*****************************************************************************************************************/
ALTER TABLE UpdateStatistics
    ADD CONSTRAINT PK_UpdateStatistics_StatId PRIMARY KEY CLUSTERED 
    (
        StatId
    )
GO
CREATE NONCLUSTERED INDEX NC_UpdateStatistics_RandomId ON UpdateStatistics(RandomId)
GO

I will answer the 2nd question first – which method produced an index that is more efficient from a fragmentation perspective?

Fragmentation study

Please note that in both the queries above, the clustered index is created on the column – StatId, which is an identity column. The inserts are therefore in the order we would want them to be in both cases, causing a “write-once” operation as Erik put it.

Once the data had been inserted into the temporary tables, I ran the following query, which uses the sys.dm_db_index_physical_stats dynamic management function to take a look at the index physical statistics. NOTE: I will not use DBCC SHOWCONTIG because it is marked for deprecation (BOL reference here).

USE tempdb
GO
SELECT 'Table: DontUpdateStatistics' AS TableName,
       index_id AS IndexId,
       index_type_desc AS IndexType,
       index_depth AS IndexDepth,
       avg_fragment_size_in_pages AS 'AverageFragmentation(Pages)',
       avg_fragmentation_in_percent AS 'AverageFragmentation(Percentage)',
       fragment_count AS Fragments,
       page_count AS 'PageCount',
       partition_number AS PartitionNumber,
       alloc_unit_type_desc AS AllocationUnitType,
       index_level AS IndexLevel
FROM sys.dm_db_index_physical_stats(DB_ID(),OBJECT_ID('DontUpdateStatistics'),-1,0,NULL)

SELECT 'Table: UpdateStatistics'AS TableName,
       index_id AS IndexId,
       index_type_desc AS IndexType,
       index_depth AS IndexDepth,
       avg_fragment_size_in_pages AS 'AverageFragmentation(Pages)',
       avg_fragmentation_in_percent AS 'AverageFragmentation(Percentage)',
       fragment_count AS Fragments,
       page_count AS 'PageCount',
       partition_number AS PartitionNumber,
       alloc_unit_type_desc AS AllocationUnitType,
       index_level AS IndexLevel
FROM sys.dm_db_index_physical_stats(DB_ID(),OBJECT_ID('UpdateStatistics'),-1,0,NULL)

I looked at the following values:

  • avg_fragmentation_in_percent – The percent of logical fragmentation (out-of-order pages in the index)
  • fragment_count – The number of fragments (physically consecutive leaf pages) in the index
  • avg_fragment_size_in_pages – Average number of pages in one fragment in an index

Essentially, the avg_fragment_size_in_pages and avg_fragmentation_in_percent values are inversely proportional to each other. Therefore, we are looking for a higher value of avg_fragment_size_in_pages, and a lower value of avg_fragmentation_in_percent.
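As a quick sanity check (my addition), avg_fragment_size_in_pages can be tied back to the other columns: it is simply page_count / fragment_count. In the results below, for example, the clustered index on DontUpdateStatistics reports 347 pages across 30 fragments, i.e. 347 / 30 ≈ 11.57 pages per fragment, while the same index on UpdateStatistics reports 347 pages across only 2 fragments, i.e. 173.5 pages per fragment.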

Below is the output that I received when the above queries were executed against my test instance.

[Screenshot: Fragmentation Results]

In case you are not able to see the screenshot above, here’s the same data in tabular format:

TableName                    IndexId  IndexType           IndexDepth  AverageFragmentation(Pages)  AverageFragmentation(Percentage)  Fragments  PageCount  PartitionNumber  AllocationUnitType  IndexLevel
Table: DontUpdateStatistics  1        CLUSTERED INDEX     2           11.5666666666667             1.44092219020173                  30         347        1                IN_ROW_DATA         0
Table: DontUpdateStatistics  2        NONCLUSTERED INDEX  2           1                            99.6108949416342                  257        257        1                IN_ROW_DATA         0
Table: UpdateStatistics      1        CLUSTERED INDEX     2           173.5                        0                                 2          347        1                IN_ROW_DATA         0
Table: UpdateStatistics      2        NONCLUSTERED INDEX  2           58                           0.574712643678161                 3          174        1                IN_ROW_DATA         0

Results

Contrary to expectations, the results above clearly indicate that we have higher fragmentation when indexes are created before data is inserted into the tables. It does not matter when these values (i.e. the results of the DMV) are fetched – before or after a SELECT is executed against the table; the index fragmentation remains the same and is not affected by when the statistics on the table are updated.

CPU Utilization and Elapsed Time study

I will address this part of the study in 2 parts:

  1. In my original post (here), we already saw that when the SELECT statements were run against the tables, the Profiler detected that the table DontUpdateStatistics did not have the expected statistics
    • We have also seen that with STATISTICS IO turned ON, SQL Server clearly showed a performance difference between the two SELECT statements
  2. In this post, I will try to see whether the performance gain in the second SELECT has been redistributed to the index creation or data load phases

To see the overall CPU time and elapsed time statistics, I wrapped the index (both clustered & non-clustered) creation and data insert statements in the SET STATISTICS TIME ON option. The results for the test data of 100,000 records are summarized below:

                             Data Insert (ms)   Index (clustered & non-clustered) creation (ms)   Totals (ms)
CPU Time
Table: DontUpdateStatistics  4687               0                                                  4687
Table: UpdateStatistics      4533               400                                                4933
Elapsed Time
Table: DontUpdateStatistics  24264              7                                                  24271
Table: UpdateStatistics      22364              23872                                              46236
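For reference, the timings were captured along these lines (a sketch; wrap each statement of interest from the scripts above in the same way):

SET STATISTICS TIME ON;
GO

--Example: time the index creation on the UpdateStatistics table
ALTER TABLE UpdateStatistics
    ADD CONSTRAINT PK_UpdateStatistics_StatId PRIMARY KEY CLUSTERED (StatId);
GO
CREATE NONCLUSTERED INDEX NC_UpdateStatistics_RandomId ON UpdateStatistics(RandomId);
GO

SET STATISTICS TIME OFF;
GO

The CPU time and elapsed time values come from the “SQL Server Execution Times” messages that STATISTICS TIME prints for each statement.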

Summary

It is now time to summarize the findings.

Data can be loaded to a table in multiple ways. We have been comparing the pros & cons of 2 such methods:

  1. Create necessary supporting indexes first, insert data afterwards
  2. Insert data first, create necessary supporting indexes afterwards

My original post compared these approaches in terms of the number of read-ahead reads (the number of pages placed into the cache) required before SQL Server can serve our request to fetch data from the tables, depending upon when and whether statistics were updated for the tables. In this post, we studied the level of index fragmentation and the time statistics.

The following chart is a summary of my findings based on which approach “wins” (Wins represented as “W”, whereas a loss by “L”):

Criterion                                       Index before, insert after   Insert before, index after   Further reading
Most up-to-date statistics                      L                            W                            Previous post
Read-ahead reads required to service a query    L                            W                            Previous post
Index fragmentation level                       L                            W                            This post
Time (CPU + Elapsed time) statistics            W                            L                            This post

In short, if execution time is of utmost importance for a given scenario, approach #1 is preferred, wherein indexes are created before inserting the data (which is the most common approach). However, if fragmentation, up-to-date statistics and a reduced number of read-ahead reads are of importance (as in the case of read-only data), approach #2 is preferable, wherein data is inserted before creating the indexes.

I trust you found the study conducted above useful & interesting. If you have any comments, or would like to know more about it, please do feel free to contact me via the associated discussion or twitter. As always, I really appreciate the feedback that you, the community, provides.

Until we meet next time,

Be courteous. Drive responsibly.

SQL Server – Performance – Best Practice – Create Index Before or After a data insert?


Being part database developer myself, I am sure that what I am about to document today will be useful to almost everyone who has ever written a stored procedure that works with indexed tables for SSRS reporting or massive batch processing. What I will be trying today is to find an answer to the dilemma – should indexes be created before or after the insert into the table?

The moment of index creation impacts table statistics, and therefore should impact performance. The point that got me thinking about this test is the following line from Books On Line (http://msdn.microsoft.com/en-us/library/ms190397.aspx), Section: “Determining When to Create Statistics”, which quite clearly mentions that: “The query optimizer creates statistics for indexes on tables or views when the index is created.”

The approach that I will be taking today is:

  1. Create two identical tables in the TempDB
    • On one table, indexes will be created before the test data is inserted
    • On the other, I will first insert some data, and then create indexes
  2. Later, I will fire simple SELECT queries against both, monitoring performance and statistics usage by:
    • turning SET STATISTICS IO ON, and
    • using the SQL Profiler

WARNING: The scripts provided in this post are provided “as-is” and without warranty for understanding purposes only. Using the DBCC scripts provided in the post on your QA or production environments can have serious side effects.

Creating the test environment

In order to create the test environment, I will first create a test table, create the necessary indexes and finally add some test data to it.

USE tempdb
GO

/*****************************************************************************************************************
   STEP 01: Create a table, complete with a primary key (to create a clustered index), and a non-clustered index
*****************************************************************************************************************/
USE tempdb
GO
CREATE TABLE DontUpdateStatistics
( StatId INT IDENTITY(1,1),
  RandomId INT DEFAULT (RAND()*1000),
  StatName VARCHAR(10) DEFAULT 'DEFAULT',
  CONSTRAINT PK_DontUpdateStatistics_StatId PRIMARY KEY CLUSTERED 
  (
    StatId
  )
)
GO

CREATE NONCLUSTERED INDEX NC_DontUpdateStatistics_RandomId ON DontUpdateStatistics(RandomId)
GO

/*****************************************************************************************************************
   STEP 03: Insert some test data into the table
*****************************************************************************************************************/
USE tempdb
GO
INSERT INTO DontUpdateStatistics DEFAULT VALUES
GO 100000

Next, I will create another table; however, this time the indexes will be created only after the test data is inserted.

USE tempdb
GO
/*****************************************************************************************************************
   STEP 04: Create a table - this time, do not create any indexes
*****************************************************************************************************************/
CREATE TABLE UpdateStatistics
( StatId INT IDENTITY(1,1),
  RandomId INT DEFAULT (RAND()*1000),
  StatName VARCHAR(10) DEFAULT 'DEFAULT'
)
GO

/*****************************************************************************************************************
   STEP 05: Insert some test data into the table
*****************************************************************************************************************/
USE tempdb
GO
INSERT INTO UpdateStatistics DEFAULT VALUES
GO 100000

/*****************************************************************************************************************
   STEP 06: Now create the primary key clustered index and the non-clustered index
*****************************************************************************************************************/
ALTER TABLE UpdateStatistics
    ADD CONSTRAINT PK_UpdateStatistics_StatId PRIMARY KEY CLUSTERED 
    (
        StatId
    )
GO
CREATE NONCLUSTERED INDEX NC_UpdateStatistics_RandomId ON UpdateStatistics(RandomId)
GO

Run the Test

The first thing I will do is check whether the statistics for both tables have been updated.

Confirm statistics updates

/*****************************************************************************************************************
   STEP 07: See if the statistics on the table in tempDB were updated
Script source: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
             : by Pinal Dave
*****************************************************************************************************************/
USE tempdb
GO
SELECT name AS index_name,
STATS_DATE(OBJECT_ID, index_id) AS StatsUpdated
FROM sys.indexes
WHERE OBJECT_ID = OBJECT_ID('DontUpdateStatistics') OR
      OBJECT_ID = OBJECT_ID('UpdateStatistics')
GO

As can be seen, the statistics for Table #1 (“DontUpdateStatistics”) were not updated:

[Screenshot: statistics update dates for both tables]

The Test Queries

Next, allow me to present a set of SELECT queries that would fetch data from the tables of interest.

USE tempdb
GO
/*****************************************************************************************************************
   STEP 08 A: Clear out the Buffers and the procedure cache.
            : Make sure to run a CHECKPOINT!
Reference source: 
http://beyondrelational.com/blogs/nakul/archive/2011/10/03/dbcc-dropcleanbuffers-needs-to-be-preceded-by-a-checkpoint-to-remove-all-buffers.aspx
*****************************************************************************************************************/
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

/*****************************************************************************************************************
   STEP 08 B: Turn STATISTICS IO ON, and select some data from the DontUpdateStatistics table
*****************************************************************************************************************/
SET STATISTICS IO ON

SELECT * FROM DontUpdateStatistics WHERE StatId BETWEEN 10 AND 1000 
                                     AND RandomId > 300
GO

SET STATISTICS IO OFF

/*****************************************************************************************************************
   STEP 08 C: Clear out the Buffers and the procedure cache.
            : Make sure to run a CHECKPOINT!
Reference source: 
http://beyondrelational.com/blogs/nakul/archive/2011/10/03/dbcc-dropcleanbuffers-needs-to-be-preceded-by-a-checkpoint-to-remove-all-buffers.aspx
*****************************************************************************************************************/
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO

/*****************************************************************************************************************
   STEP 08 D: Turn STATISTICS IO ON, and select some data from the UpdateStatistics table
*****************************************************************************************************************/
SET STATISTICS IO ON

SELECT * FROM UpdateStatistics WHERE StatId BETWEEN 10 AND 1000 
                                 AND RandomId > 300
GO

SET STATISTICS IO OFF

Configure the SQL Profiler

To have the SQL Server Profiler help back my case, I connected to my SQL Server instance and selected only one event:

Event Group: Errors & Warnings

Event Class: Missing Column Statistics

[Screenshot: Profiler trace properties showing the Missing Column Statistics event selected]

The Result

With the SQL Profiler running, I turned on the Actual Execution Plan in SSMS (by pressing Ctrl+M) and then executed the test query. The result was as shown below:

[Screenshot: The Execution Plan]
[Screenshot: Statistics Output]
[Screenshot: Profiler Output]

Confirm statistics updates

/*****************************************************************************************************************
   STEP 09: See if the statistics on the table in tempDB were updated
Script source: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
             : by Pinal Dave
*****************************************************************************************************************/
USE tempdb
GO
SELECT name AS index_name,
STATS_DATE(OBJECT_ID, index_id) AS StatsUpdated
FROM sys.indexes
WHERE OBJECT_ID = OBJECT_ID('DontUpdateStatistics') OR
      OBJECT_ID = OBJECT_ID('UpdateStatistics')
GO

As can be seen, the statistics for Table #1 (“DontUpdateStatistics”) are now updated. The statistics for Table #2 (“UpdateStatistics”) are unchanged:

[Screenshot: statistics update dates after running the test queries]

Post-Mortem

First of all, I would like to draw attention to a line in the Profiler’s description of the “Missing Column Statistics” event class. The line states – “You will not see this event produced unless the option to auto-compute statistics is turned off”. For TempDB, auto-compute statistics is ON, so that’s not the root cause of this issue.

[Screenshot: description of the Missing Column Statistics event class]
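A quick way to confirm these database-level settings (a sketch; not part of the original post) is:

USE tempdb
GO
SELECT name,
       is_auto_create_stats_on,
       is_auto_update_stats_on
FROM sys.databases
WHERE name = 'tempdb'
GO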

If this is not the issue, then what is? Summarizing the results, I can derive the following conclusions:

  • Both queries appear to have executed with equal performance impact – with a 50% split across them according to the execution plan
  • However, looking at the statistics IO output, I can see that:
    • Read-Ahead Reads for table #1 (“DontUpdateStatistics”) = 16
    • Read-Ahead Reads for table #2 (“UpdateStatistics”) = 4
    • The query against Table #1 (“DontUpdateStatistics”) requires 4 times more read-ahead reads than the query against Table #2
  • The SQL Profiler warns me that column statistics are not available, and therefore, a sub-optimal plan can be chosen by the optimizer

Revisiting Books On Line (http://msdn.microsoft.com/en-us/library/ms190397.aspx), the section: “Determining When to Create Statistics” clearly mentions that: “The query optimizer creates statistics for indexes on tables or views when the index is created.”

More precisely, missing statistics are created when a SELECT is performed on the columns for which statistics were not found.
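One way to observe this (a sketch I am adding here) is to look for auto-created column statistics – their names typically begin with _WA_Sys_ – after the SELECT has been run:

USE tempdb
GO
--Auto-created column statistics show up with auto_created = 1
SELECT s.name AS StatisticsName,
       s.auto_created AS AutoCreated,
       STATS_DATE(s.object_id, s.stats_id) AS StatsUpdated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('DontUpdateStatistics')
GO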

In our case, we created the indexes on table #2 (“UpdateStatistics”) after data was inserted. Because creating indexes involves selecting data from the table, a statistics update is performed at that point. For table #1 (“DontUpdateStatistics”), though, the indexes were created before inserting the data. When query #1 was executed, the statistics were generated first and then the optimization happened – which made the query a little slower, as it spent more time on read-ahead reads.

If we look up STATISTICS IO on Books On Line (http://msdn.microsoft.com/en-us/library/ms184361.aspx), we can see that the read-ahead read count gives us the number of pages placed into the cache. In our case, all pages for table #1 have to be placed into the cache for the statistics update.

Conclusion

It’s been a rather long post, with a simple conclusion: think about when index creation should be performed, because choosing the proper sequence matters from both a performance and an I/O perspective.

Until we meet next time,

Be courteous. Drive responsibly.