This is in continuation to yesterday’s post on T-SQL worst practices that affect the accuracy of the results returned/processed
What have I done? T-SQL Worst Practices
I am attempting to summarize all of the worst T-SQL practices that Jacob demonstrated at Tech-Ed 2011 (India). The summarization has been done after classifying these practices into 3 major buckets:
- Practices that can affect accuracy of data returned/processed (Part 01 – read here)
- Practices that can adversely impact performance (Part 02 – today’s post)
- Practices that can have unpredictable execution side-effects (Part 03 – coming tomorrow!)
While the points covered are exactly the same as covered during the Tech-Ed, I have taken hints from Jacob’s queries and have come up with my own versions using the AdventureWorks2008R2 database supplied as a sample database for SQL 11 (“Denali”) CTP01. In cases where I am directly using Jacob’s scripts, I have mentioned it as such.
Part 02 – Practices affecting performance
Using Incorrect data types in a query/Too much reliance on implicit conversions
When I was a little kid in school, my parents used to tell me that there is a lot more than what meets the eye. A lot of things happen in the background before anything reaches us, and sometimes we forget the importance and underestimate the impact of these background processes. Years later, the movie “Munnabhai M.B.B.S.” came along, and that demonstrated, amongst many others, the same concept – how we tend to ignore the cleaning crew who help maintain one of the most important things in a hospital – hygiene.
Anyway, coming over to SQL Server, one of the most taken-for-granted features is implicit conversion. Implicit conversion is the conversion of one data-type to another without the user knowing that such a conversion is happening. Please note that these conversions happen under some pre-defined rules of data-type precedence. Data types of a lower precedence are implicitly converted to a higher precedence if a data-type mismatch is found.
With this background in mind, let’s see how this feature can land us in performance trouble.
Assume that we have the following table:
/*----------------------------------------------------------------------------- Date : 23 March 2011 SQL Version : SQL Server 2005/2008/Denali Author : Jacob Sebastian Email : jacob@beyondrelational.com Twitter : @jacobsebastian Blog : http://beyondrelational.com/blogs/jacob Website : http://beyondrelational.com Summary: This script is part of the demo presented by Jacob Sebastian during Tech-ED India 2011 on "TSQL Worst Practices". Disclaimer: THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE. -----------------------------------------------------------------------------*/ USE Teched GO IF OBJECT_ID('orders', 'U') IS NOT NULL BEGIN DROP TABLE orders END GO CREATE TABLE orders (OrderID INT IDENTITY, OrderDate DATETIME, Amount MONEY, Refno VARCHAR(20)) INSERT INTO orders (OrderDate, Amount, Refno) SELECT TOP 100000 DATEADD(minute, ABS(a.object_id % 50000 ), CAST('2010-02-01' AS DATETIME)), ABS(a.object_id % 10), CAST(ABS(a.object_id) AS VARCHAR) FROM sys.all_objects a CROSS JOIN sys.all_objects b INSERT INTO orders (OrderDate, Amount, Refno) SELECT GETDATE(), 100, '555'
Now, let’s create a non-clustered index on the RefNo column.
CREATE NONCLUSTERED INDEX idx_refno ON orders(refno) include(amount)
Next, let’s run through the following queries. Both these queries perform the same function, i.e. select a record whose reference number is 555.
SELECT count(*) FROM orders WHERE Refno = 555 SELECT COUNT(*) FROM orders WHERE Refno = '555'
These queries do their work right, but one of them is not advisable for production-class usage.
Here’s why: If the actual execution plan of these queries is enabled by pressing Ctrl+M or by going to Query –> Include Actual Execution plan, we see that Query #1 does an index scan instead of an index seek.
The reason is that in the first query, Refno, which is a VARCHAR column is being compared with an integer value. Per rules of data-type precedence (discussed earlier), VARCHAR values need to be implicitly converted to INT because INT has a higher precedence. Thus, SQL Server first has to create a temporary column with the INT representation of each of the Refno values, and then compare that with the supplied criteria.
In the second case, we are supplying SQL Server with the data in character format, and hence, no implicit conversion is required. SQL Server can therefore simply do an index seek to fetch the right results.
Now, let’s do it the other way around. Let’s explicitly convert Refno to an INT and then perform similar SELECT operations.
DROP INDEX idx_refno ON Orders ALTER TABLE Orders ALTER COLUMN Refno INT CREATE NONCLUSTERED INDEX idx_refno ON orders(refno) include(amount) SELECT count(*) FROM orders WHERE Refno = 555 SELECT COUNT(*) FROM orders WHERE Refno = '555'
Looking at the execution plan, we are in for a shocker. We would have expected that now, the second query should be the slower one, but it is not!
This is because the column being compared is of a higher precedence. Therefore, only the supplied condition parameter needs to be implicitly converted to INT, thus reducing the performance impact to being negligible.
Functions may not always be your friends
SQL Server has a great collection of functions, some of which are really, really useful. Data manipulation is a common operation in most systems, and the functions available at our disposal for date manipulation are very powerful. However, they might land us in a performance bottleneck if not used carefully.
First of all, let’s create a very simple, non-clustered index on the OrderDate column of the Sales.SalesOrderHeader table in the AdventureWorks2008R2 database.
USE [AdventureWorks2008R2] GO CREATE NONCLUSTERED INDEX [Idx_MissingIndexOrderDate] ON [Sales].[SalesOrderHeader] ([OrderDate]) GO
Next, let us try to find out all the sales done in the month of April, 2008. Now, there are a lot of solutions available, the most popular ones being the ones below. Which one do you think is the performance hog?
--Number of sales done in April, 2008 SELECT COUNT(*) FROM Sales.SalesOrderHeader sh WHERE MONTH(sh.OrderDate) = 4 AND YEAR(sh.OrderDate) = 2008 SELECT COUNT(*) FROM Sales.SalesOrderHeader sh WHERE sh.OrderDate >= '2008-04-01' AND sh.OrderDate <= '2008-04-30' SELECT COUNT(*) FROM Sales.SalesOrderHeader sh WHERE sh.OrderDate BETWEEN '2008-04-01' AND '2008-04-30'
If you turn on the execution plan, you would notice that Query #1 takes up the most time, i.e. that is the slowest of them all.
The reason is the presence of the functions in the WHERE clause. Functions around columns in a WHERE clause cause table/index scans and must therefore be used judiciously.
If you use these functions in the SELECT clause, you will notice that there isn’t much performance delay. Here’s the proof:
--Number of sales done in April, 2008 SELECT MONTH(sh.OrderDate), YEAR(sh.OrderDate),sh.OrderDate FROM Sales.SalesOrderHeader sh WHERE MONTH(sh.OrderDate) = 4 AND YEAR(sh.OrderDate) = 2008 SELECT MONTH(sh.OrderDate), YEAR(sh.OrderDate),sh.OrderDate FROM Sales.SalesOrderHeader sh WHERE sh.OrderDate BETWEEN '2008-04-01' AND '2008-04-30'
Question to the audience: Now, let’s say that instead of just the date value (which we were dealing with in case of SalesOrderHeader.OrderDate), we have a DATETIME value to work with. What would be good solutions, according to for the same problem statement? Jacob touched upon this during the Tech-Ed, but I am eliminating that part from the post series so that it can become a great user exercise.
Summarizing – Part 02:
Finally, summarizing all the points that have been discussed today (Part 02) and during Tech-Ed 2011
- Incorrect use of data-types (especially in the WHERE clause) can cause performance issues because they would force TABLE/INDEX scans due to implicit conversion
- As far as possible, take the time out during design to ensure that you choose the proper data type. In our example, Refno column should probably never have been a VARCHAR column
- Functions around columns in a WHERE clause cause table/index scans and must therefore be used judiciously
Thank-you, Jacob, for this really great session!
You can read my memories of Tech-Ed by going through the following series:
Tech Ed 2011 (India) – Day 01 – Connect, Learn, Explore, Evolve!
Tech Ed 2011 (India) – Day 02 – IE 9 – India launch – Fast is now beautiful!
Until tomorrow,
Be courteous. Drive responsibly.
Want to ask me a question? Do so at: http://beyondrelational.com/ask/nakul/default.aspx