Category Archives: Blog


T-SQL Worst Practices – a compilation from Jacob’s session at Tech-Ed 2011 – Part 03 – Practices that contribute to execution errors


This is in continuation of yesterday’s post on T-SQL worst practices that affect the performance of queries.

What have I done? T-SQL Worst Practices

I am attempting to summarize all of the worst T-SQL practices that Jacob demonstrated at Tech-Ed 2011 (India). The summarization has been done after classifying these practices into 3 major buckets:

  1. Practices that can affect accuracy of data returned/processed (Part 01 – read here)
  2. Practices that can adversely impact performance (Part 02 – read here)
  3. Practices that can have unpredictable execution side-effects (Part 03 – today’s post)

While the points covered are exactly the same as covered during the Tech-Ed, I have taken hints from Jacob’s queries and have come up with my own versions using the AdventureWorks2008R2 database supplied as a sample database for SQL 11 (“Denali”) CTP01. In cases where I am directly using Jacob’s scripts, I have mentioned it as such.

Part 03 – Practices that contribute to execution errors

The previous two posts focused on the accuracy and efficiency of the application. However, errors (logical or otherwise) during execution are a nightmare for technical support and for any team assigned to maintain the product after its release. Even worse is when these execution errors are sporadic. Today, in the final part of this series, we will look at some coding practices that can result in an unstable product.

Using Ordinal positions in the ORDER BY clause

Manager: I need a list of the first and last initials of all contacts in our database. Can you do that?

You: Sure. That’s quite easy. Here you go:

USE [AdventureWorks2008R2]
GO

SELECT LEFT(p.FirstName, 1) AS FirstInitial, 
       LEFT(p.LastName, 1) AS LastInitial
FROM Person.Person p
GO

Manager: But, I want it sorted by the First Initial.

You: No issues. It’s just a small change.

USE [AdventureWorks2008R2]
GO

SELECT LEFT(p.FirstName, 1) AS FirstInitial, 
       LEFT(p.LastName, 1) AS LastInitial
FROM Person.Person p
ORDER BY 1
GO

Manager: That was fast! But, there might be situations where I need to sort it by the Last Name.

You: Okay, it looks like we will need to go beyond simple ad-hoc queries. I will create a stored procedure for you.

USE [AdventureWorks2008R2]
GO

CREATE PROCEDURE GetContacts
(
   @Sort VARCHAR(20)
)
AS
BEGIN
    SELECT LEFT(p.FirstName, 1) AS FirstInitial, 
           LEFT(p.LastName, 1) AS LastInitial
    FROM Person.Person p
    ORDER BY CASE @Sort WHEN 'first' THEN 1
                        ELSE 2
                        END
END
GO

Manager (in the morning the very next day): The stored procedure you gave me yesterday does not work. Fix it and see that you always test your code before submitting to me.

[image: the output the manager saw – rows returned in no particular order]

You: How is this possible? Today is going to be a bad day. I missed the bus, and now my manager is angry at me. I must do something quickly… Let me try this:

SELECT LEFT(p.FirstName, 1) AS FirstInitial, 
       LEFT(p.LastName, 1) AS LastInitial
FROM Person.Person p
ORDER BY FirstInitial
GO

You: Okay. This works, so let me put it inside the stored procedure to fix the issue that my manager is encountering.

ALTER PROCEDURE GetContacts
(
    @Sort VARCHAR(20)
)
AS
BEGIN
    SELECT LEFT(p.FirstName, 1) AS FirstInitial, 
           LEFT(p.LastName, 1) AS LastInitial
    FROM Person.Person p
    ORDER BY CASE @Sort WHEN 'first' THEN FirstInitial
                        ELSE LastInitial
                        END
END
GO

When you execute the ALTER statement to begin your unit test:

Msg 207, Level 16, State 1, Procedure GetContacts, Line 10
Invalid column name ‘FirstInitial’.
Msg 207, Level 16, State 1, Procedure GetContacts, Line 11
Invalid column name ‘LastInitial’.

You (dumb-struck): ?? What is going on? Nothing is working as expected…

Actually, it’s not entirely your fault. A column alias can be referenced on its own in the ORDER BY clause – that is why the standalone query above works – but it cannot be referenced inside an expression there. Expressions in the ORDER BY clause are evaluated against the columns of the underlying tables, logically before the SELECT list assigns its aliases, so FirstInitial and LastInitial do not exist yet and SQL Server raises an “Invalid column name” error. (The original CASE that returned 1 and 2 had a different problem: inside an expression those are just integer constants, not ordinal positions, so every row received the same sort key and the output order was arbitrary – which is what the manager ran into.) Here’s what you can do to resolve the ugly situation:

ALTER PROCEDURE GetContacts
(
	@Sort VARCHAR(20)
)
AS
BEGIN
    SELECT LEFT(p.FirstName, 1) AS FirstInitial, 
           LEFT(p.LastName, 1) AS LastInitial
    FROM Person.Person p
	ORDER BY CASE @Sort WHEN 'first' THEN LEFT(p.FirstName, 1)
                        ELSE LEFT(p.LastName, 1)
                        END
END
GO

Here’s the result – it works!

[image: results sorted by the requested initial]
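Before moving on, it is worth seeing why ordinal positions are dangerous in the first place. The sketch below (my own, not part of Jacob’s demo) shows how a harmless-looking change to the SELECT list silently changes the sort:

```sql
-- Both statements run without error, but after the columns are
-- swapped, "ORDER BY 1" quietly sorts by the last initial instead
-- of the first initial it was originally written to mean.
SELECT LEFT(p.FirstName, 1) AS FirstInitial,
       LEFT(p.LastName, 1)  AS LastInitial
FROM Person.Person p
ORDER BY 1;

SELECT LEFT(p.LastName, 1)  AS LastInitial,  -- columns reordered
       LEFT(p.FirstName, 1) AS FirstInitial
FROM Person.Person p
ORDER BY 1;
```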

Maintaining Integrity during transactions

At the end of the day, everything within a database comes down to maintaining data integrity. Microsoft SQL Server ensures this, and also provides the development community with the basic building blocks to handle errors during code execution properly and to issue roll-backs so that inconsistent data is never committed to the database.

However, as developers we tend to take a couple of things for granted – one of them being the fact that Microsoft SQL Server leaves it to the user to decide whether or not a transaction should automatically be aborted in the case of an error.
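As a quick check, you can inspect the current session setting via @@OPTIONS (bit 16384 corresponds to XACT_ABORT, per Books On Line); in a default session it is OFF:

```sql
-- Bit 16384 of @@OPTIONS reflects the session's XACT_ABORT setting.
SELECT CASE WHEN (@@OPTIONS & 16384) = 16384
            THEN 'ON'
            ELSE 'OFF'
       END AS XactAbortSetting;
```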

Let’s set the stage by assuming that we have a table called Customers and have some basic data in it.

/*-----------------------------------------------------------------------------
  Date			: 23 March 2011
  SQL Version           : SQL Server 2005/2008/Denali
  Author		: Jacob Sebastian
  Email			: jacob@beyondrelational.com
  Twitter		: @jacobsebastian
  Blog			: http://beyondrelational.com/blogs/jacob
  Website		: http://beyondrelational.com

  Summary:
  This script is part of the demo presented by Jacob Sebastian during 
  Tech-ED India 2011 on "TSQL Worst Practices".

  Disclaimer:  
  THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF 
  ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED 
  TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
  PARTICULAR PURPOSE.
-----------------------------------------------------------------------------*/
USE TechEd; 
GO
IF OBJECT_ID('UpdateCustomer') IS NOT NULL DROP PROCEDURE UpdateCustomer; 
GO

IF OBJECT_ID('Customers','U') IS NOT NULL BEGIN
	DROP TABLE Customers 
END
GO

CREATE TABLE Customers (
	CustID INT IDENTITY,
	CustName VARCHAR(100),
	CustCity VARCHAR(100),
	LastOrder DATETIME
)
GO

INSERT INTO Customers (CustName, CustCity)
SELECT 'Jacob', 'Ahmedabad' UNION ALL
SELECT 'Pinal', 'Bangalore'

Now, let’s create a stored procedure to update the Customers table.

USE TechEd; 
GO

CREATE PROCEDURE UpdateCustomer(
	@CustID INT,
	@SalesAmount MONEY
)
AS

BEGIN TRY
	BEGIN TRAN

	UPDATE Customers SET LastOrder = GETDATE() WHERE CustID = @CustID 
	
	UPDATE customerss SET
		SalesAmount = SalesAmount + @SalesAmount 
	WHERE CustID = @CustID 

	COMMIT TRAN

END TRY
BEGIN CATCH
	IF @@TRANCOUNT > 0 
		ROLLBACK TRAN
	PRINT 'Error Number: ' + CAST (ERROR_NUMBER() AS VARCHAR)
END CATCH
GO

Note that although we have an incorrect table name in the stored procedure, it was successfully created. (Do not go and correct it – this incorrect table name is intentional).

Now, in a SQL Server Management Studio window, run the following query. You will encounter the fatal error shown below – fatal enough for SQL Server to stop execution then and there. This is expected and not an accident.

EXEC UpdateCustomer 1, 100

(1 row(s) affected)
Msg 208, Level 16, State 1, Procedure UpdateCustomer, Line 13
Invalid object name ‘customerss’.
Msg 266, Level 16, State 2, Procedure UpdateCustomer, Line 13
Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 0, current count = 1.

Then, run the below query in another SQL Server Management Studio query window.

SELECT * FROM customers 

You will notice that the query keeps executing for a long time without returning any results.

So, what’s the problem?

The problem here is that although the execution of the stored procedure encountered an error, the transaction that we had started was never aborted or rolled-back. As a result, the Customers table remained under lock, causing the SELECT to wait indefinitely. As proof, you may run the following on query window #1:

SELECT @@trancount
ROLLBACK TRAN

You will see that we have one open transaction. Issuing the rollback will cause query window #2 to return results.
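If you want more evidence, a query along the following lines (run from a third window) will show the blocked session together with the session blocking it; sys.dm_exec_requests is available from SQL Server 2005 onwards:

```sql
-- Lists requests that are currently blocked, together with the
-- session holding the locks they are waiting on.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_resource
FROM sys.dm_exec_requests r
WHERE r.blocking_session_id <> 0;
```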

What’s the worst that can happen? Well, that depends on your definition of “worst”. In my view, you are already in big trouble – your code just broke. If that is not serious enough, your table now stays locked unless an external force (i.e. a manual ROLLBACK) is applied (almost like Newton’s laws, isn’t it?). This can result in your entire application becoming unusable and, ultimately, a “site-down” (or P1) support call with the customer (and your manager) screaming at you.

Solution #01: Create a wrapper stored procedure

One of the solutions is to formalize what we just did manually, i.e. create a wrapper stored procedure, which would:

  • detect that the underlying procedure call failed
  • check if there are any open transactions
  • if yes, roll back all open transactions

We can implement this as follows:

USE TechEd; 
GO

BEGIN TRY
	EXEC UpdateCustomer 1, 100
END TRY BEGIN CATCH
	IF @@TRANCOUNT > 0 ROLLBACK TRAN
	PRINT 'Transaction Rolled Back'
END CATCH
GO

Try it out. It works, but there is a problem: in an enterprise-class system, we may have hundreds of stored procedures, and it is simply not practical to write a wrapper for each of them.

Solution #02: SET XACT_ABORT ON

This solution is actually quite simple (and is the one I would recommend). All we need to do is tell Microsoft SQL Server to automatically issue a ROLLBACK if the transaction encounters a fatal error.

All that one needs to do is to use a SET option called XACT_ABORT and set it to ON. (BOL link here)

USE TechEd; 
GO

ALTER PROCEDURE UpdateCustomer(
	@CustID INT,
	@SalesAmount MONEY
)
AS
SET XACT_ABORT ON
BEGIN TRY
	BEGIN TRAN

	UPDATE Customers SET LastOrder = GETDATE() WHERE CustID = @CustID 
	
	UPDATE customerss SET
		SalesAmount = SalesAmount + @SalesAmount 
	WHERE CustID = @CustID 

	COMMIT TRAN

END TRY
BEGIN CATCH
	IF @@TRANCOUNT > 0 
		ROLLBACK TRAN
	PRINT 'Error Number: ' + CAST (ERROR_NUMBER() AS VARCHAR)
END CATCH
GO

Now, repeat the exercise of executing the following in two separate query editor windows. Notice that this time, we do not encounter locking and the second query succeeds.

EXEC UpdateCustomer 1, 100

-- run in another session
SELECT * FROM customers 

Summarizing – Part 03:

Finally, summarizing all the points that have been discussed today (Part 03) and during Tech-Ed 2011:

  1. Always test your code before it leaves your desk! Although things “should” work, they will fail precisely the one time you don’t test
  2. Ordinal positions in the ORDER BY clause are fragile, and once moved inside an expression they become plain constants that do not sort at all
  3. Expressions in the ORDER BY clause are evaluated against the source columns, so column aliases cannot be used inside them
  4. Always use SET XACT_ABORT ON in stored procedures to roll back open transactions in case of unhandled errors

Thank-you, Jacob, for this really great session!

You can read my memories of Tech-Ed by going through the following series:

Tech Ed 2011 (India) – Day 01 – Connect, Learn, Explore, Evolve!

Tech Ed 2011 (India) – Day 02 – IE 9 – India launch – Fast is now beautiful!

Tech Ed 2011 (India) – Day 03 – For the DBAs – Contained Databases, Extended Events, Virtualization, Wait Stats, Wait Queues & High Availability

Until we meet next time,

Be courteous. Drive responsibly.

Want to ask me a question? Do so at: http://beyondrelational.com/ask/nakul/default.aspx

T-SQL Worst Practices – a compilation from Jacob’s session at Tech-Ed 2011 – Part 02 – Practices that affect performance


This is in continuation of yesterday’s post on T-SQL worst practices that affect the accuracy of the results returned/processed.

What have I done? T-SQL Worst Practices

I am attempting to summarize all of the worst T-SQL practices that Jacob demonstrated at Tech-Ed 2011 (India). The summarization has been done after classifying these practices into 3 major buckets:

  1. Practices that can affect accuracy of data returned/processed (Part 01 – read here)
  2. Practices that can adversely impact performance (Part 02 – today’s post)
  3. Practices that can have unpredictable execution side-effects (Part 03 – coming tomorrow!)

While the points covered are exactly the same as covered during the Tech-Ed, I have taken hints from Jacob’s queries and have come up with my own versions using the AdventureWorks2008R2 database supplied as a sample database for SQL 11 (“Denali”) CTP01. In cases where I am directly using Jacob’s scripts, I have mentioned it as such.

Part 02 – Practices affecting performance

Using Incorrect data types in a query/Too much reliance on implicit conversions

When I was a little kid in school, my parents used to tell me that there is a lot more than what meets the eye. A lot of things happen in the background before anything reaches us, and sometimes we forget the importance and underestimate the impact of these background processes. Years later, the movie “Munnabhai M.B.B.S.” came along and demonstrated, among many other things, the same concept – how we tend to ignore the cleaning crew who help maintain one of the most important things in a hospital: hygiene.

Anyway, coming over to SQL Server, one of the most taken-for-granted features is implicit conversion – the conversion of one data type to another without the user being aware that such a conversion is happening. These conversions happen under pre-defined rules of data-type precedence: values of a lower-precedence data type are implicitly converted to the higher-precedence type whenever a mismatch is found.
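A minimal sketch of precedence at work: INT outranks VARCHAR, so when the two meet, it is the string that gets converted.

```sql
-- '2' is implicitly converted to INT, so '+' means addition here:
SELECT 1 + '2' AS ImplicitlyConverted;               -- returns 3

-- With both sides as character data, '+' means concatenation:
SELECT CAST(1 AS VARCHAR(10)) + '2' AS Concatenated; -- returns '12'
```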

With this background in mind, let’s see how this feature can land us in performance trouble.

Assume that we have the following table:

/*-----------------------------------------------------------------------------
  Date			: 23 March 2011
  SQL Version           : SQL Server 2005/2008/Denali
  Author		: Jacob Sebastian
  Email			: jacob@beyondrelational.com
  Twitter		: @jacobsebastian
  Blog			: http://beyondrelational.com/blogs/jacob
  Website		: http://beyondrelational.com

  Summary:
  This script is part of the demo presented by Jacob Sebastian during 
  Tech-ED India 2011 on "TSQL Worst Practices".

  Disclaimer:  
  THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF 
  ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED 
  TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
  PARTICULAR PURPOSE.
-----------------------------------------------------------------------------*/

USE Teched 
GO

IF OBJECT_ID('orders', 'U') IS NOT NULL BEGIN
	DROP TABLE orders 
END
GO

CREATE TABLE orders (OrderID INT IDENTITY, OrderDate DATETIME, Amount MONEY, Refno VARCHAR(20))
INSERT INTO orders (OrderDate, Amount, Refno) 
SELECT TOP 100000
	DATEADD(minute, ABS(a.object_id % 50000 ), CAST('2010-02-01' AS DATETIME)),
	ABS(a.object_id % 10),
	CAST(ABS(a.object_id) AS VARCHAR)
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
INSERT INTO orders (OrderDate, Amount, Refno)  SELECT GETDATE(), 100, '555'

Now, let’s create a non-clustered index on the RefNo column.

CREATE NONCLUSTERED INDEX idx_refno ON orders(refno)
include(amount)

Next, let’s run the following queries. Both perform the same task, i.e. count the orders whose reference number is 555.

SELECT count(*) 
FROM orders 
WHERE Refno = 555

SELECT COUNT(*) 
FROM orders 
WHERE Refno = '555'

Both queries return the correct result, but one of them is not advisable for production-class usage.

Here’s why: if we enable the actual execution plan (press Ctrl+M, or go to Query –> Include Actual Execution Plan), we see that query #1 does an index scan instead of an index seek.

[image: execution plans – query #1 performs an index scan, query #2 an index seek]

The reason is that in the first query, Refno, a VARCHAR column, is being compared with an integer value. Per the rules of data-type precedence (discussed earlier), the VARCHAR values must be implicitly converted to INT, because INT has the higher precedence. SQL Server therefore has to convert the Refno value of every row to INT before it can compare it with the supplied criterion – which rules out an index seek and forces a scan.

In the second case, we are supplying SQL Server with the data in character format, and hence, no implicit conversion is required. SQL Server can therefore simply do an index seek to fetch the right results.

Now, let’s do it the other way around. Let’s explicitly convert Refno to an INT and then perform similar SELECT operations.

DROP INDEX idx_refno ON Orders

ALTER TABLE Orders ALTER COLUMN Refno INT

CREATE NONCLUSTERED INDEX idx_refno ON orders(refno)
include(amount)


SELECT count(*) 
FROM orders 
WHERE Refno = 555

SELECT COUNT(*) 
FROM orders 
WHERE Refno = '555'

Looking at the execution plan, we are in for a shocker: we would have expected the second query to now be the slower one, but it is not!

[image: execution plans – both queries now perform an index seek]

This is because the column being compared is now of the higher precedence. Only the supplied parameter needs to be implicitly converted to INT – a one-time conversion – so the performance impact is negligible and both queries can seek on the index.

Functions may not always be your friends

SQL Server has a great collection of functions, some of which are really, really useful. Date manipulation is a common operation in most systems, and the functions at our disposal for it are very powerful. However, they can land us in a performance bottleneck if not used carefully.

First of all, let’s create a very simple, non-clustered index on the OrderDate column of the Sales.SalesOrderHeader table in the AdventureWorks2008R2 database.

USE [AdventureWorks2008R2]
GO
CREATE NONCLUSTERED INDEX [Idx_MissingIndexOrderDate]
    ON [Sales].[SalesOrderHeader] ([OrderDate])
GO

Next, let us try to find all the sales made in April 2008. There are many possible solutions; the most popular ones are shown below. Which one do you think is the performance hog?

--Number of sales done in April, 2008
SELECT COUNT(*)
FROM Sales.SalesOrderHeader sh
WHERE MONTH(sh.OrderDate) = 4 AND YEAR(sh.OrderDate) = 2008

SELECT COUNT(*)
FROM Sales.SalesOrderHeader sh
WHERE sh.OrderDate >= '2008-04-01' AND sh.OrderDate <= '2008-04-30'

SELECT COUNT(*)
FROM Sales.SalesOrderHeader sh
WHERE sh.OrderDate BETWEEN '2008-04-01' AND '2008-04-30'

If you turn on the actual execution plan, you will notice that query #1 takes the most time – it is the slowest of them all.

[image: execution plans – query #1 has the highest relative cost]

The reason is the presence of the functions in the WHERE clause. Functions around columns in a WHERE clause cause table/index scans and must therefore be used judiciously.

If you use these functions in the SELECT clause instead, you will notice that there is no comparable performance penalty. Here’s the proof:

--Number of sales done in April, 2008
SELECT MONTH(sh.OrderDate), YEAR(sh.OrderDate),sh.OrderDate
FROM Sales.SalesOrderHeader sh
WHERE MONTH(sh.OrderDate) = 4 AND YEAR(sh.OrderDate) = 2008

SELECT MONTH(sh.OrderDate), YEAR(sh.OrderDate),sh.OrderDate
FROM Sales.SalesOrderHeader sh
WHERE sh.OrderDate BETWEEN '2008-04-01' AND '2008-04-30'

[image: execution plans for the two queries]

Question to the audience: now, let’s say that instead of a date-only value (which is what we were effectively dealing with in SalesOrderHeader.OrderDate), we have a DATETIME value with a time component to work with. What, according to you, would be good solutions for the same problem statement? Jacob touched upon this during the Tech-Ed, but I am leaving that part out of this series so that it can become a great reader exercise.

Summarizing – Part 02:

Finally, summarizing all the points that have been discussed today (Part 02) and during Tech-Ed 2011:

  1. Incorrect use of data types (especially in the WHERE clause) can cause performance issues, because the implicit conversion forces table/index scans
  2. As far as possible, take time during design to choose the proper data type. In our example, the Refno column should probably never have been a VARCHAR column
  3. Functions around columns in a WHERE clause cause table/index scans and must therefore be used judiciously

Thank-you, Jacob, for this really great session!

You can read my memories of Tech-Ed by going through the following series:

Tech Ed 2011 (India) – Day 01 – Connect, Learn, Explore, Evolve!

Tech Ed 2011 (India) – Day 02 – IE 9 – India launch – Fast is now beautiful!

Tech Ed 2011 (India) – Day 03 – For the DBAs – Contained Databases, Extended Events, Virtualization, Wait Stats, Wait Queues & High Availability

Until tomorrow,

Be courteous. Drive responsibly.

Want to ask me a question? Do so at: http://beyondrelational.com/ask/nakul/default.aspx

T-SQL Worst Practices – a compilation from Jacob’s session at Tech-Ed 2011 – Part 01 – Practices that affect accuracy


Last month, all the techies in India had a great time at Tech-Ed 2011 (India). Out of the many sessions conducted in parallel, there was one T-SQL programming session, from the renowned SQL Server MVP Jacob Sebastian, which I found very interesting and different.

What have I done? T-SQL Worst Practices

There is a lot of reading material and there are many classes that discuss T-SQL best practices, but very few that talk about T-SQL worst practices. That is because it takes experience, a lot of deep understanding and intimacy with the product at hand to come up with such a list. This session was one of them, and that is precisely why it grabbed my attention.

In this session, Jacob demonstrated some of the most common T-SQL mistakes one can make (besides, of course, not clapping when one likes a session). None of these are major issues that would jump right out at you; they are things that appear absolutely normal and harmless in the development and QA environments, but turn into performance bottlenecks or end up returning incorrect results when deployed to production.

Today, I attempt to summarize these worst practices into a blog post. We will look at:

  1. Practices that can affect accuracy of data returned/processed (Part 01)
  2. Practices that can adversely impact performance (Part 02)
  3. Practices that can have unpredictable execution side-effects (Part 03)

While the points covered are exactly the same as covered during the Tech-Ed, I have taken hints from Jacob’s queries and have come up with my own versions using the AdventureWorks2008R2 database supplied as a sample database for SQL 11 (“Denali”) CTP01. In cases where I am directly using Jacob’s scripts, I have mentioned it as such.

Part 01 – Practices affecting accuracy of data returned/processed

Using NOT IN is not really the solution

Manager: I need to find all the Sales staff who do not fall within a given set of territory IDs.

You: That’s easy. I will write up a simple T-SQL query to get you the results in a moment.

USE AdventureWorks2008R2
GO
SELECT * FROM Sales.SalesPerson WHERE TerritoryID NOT IN (1,2,3,4,5)

Manager: Is this correct? It does not appear to be so.

USE AdventureWorks2008R2
GO
--17 records
SELECT COUNT(*) FROM Sales.SalesPerson
--8 records
SELECT COUNT(*) FROM Sales.SalesPerson WHERE TerritoryID IN (1,2,3,4,5)
--6 records, instead of the expected 17-8 = 9 records!
SELECT COUNT(*) FROM Sales.SalesPerson WHERE TerritoryID NOT IN (1,2,3,4,5)

From the above counts, we see that three records are missing from the result set (we expected 9 but got 6). These are the records which do not have any TerritoryID allocated!

You: What went wrong? The query should have worked….

Here’s what happened…

Root Cause: NULL values are special. In the world of electronics, we are all familiar with tri-state logic – ON, OFF and a third, mysterious state. NULL technically means unknown and hence, by default, cannot take part in comparison operations.

When we used the NOT IN operator, we asked SQL Server to carry out a comparison that it cannot perform for records with NULL values, and hence those records were ignored.
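A small sketch makes the three-valued logic visible. NOT IN expands into a chain of <> comparisons, and a single NULL makes the whole chain UNKNOWN:

```sql
-- Under ANSI_NULLS ON, comparisons against NULL are UNKNOWN, and a
-- WHERE clause keeps only rows whose predicate evaluates to TRUE.
SELECT CASE WHEN NULL = NULL THEN 'true' ELSE 'not true' END;  -- 'not true'
SELECT CASE WHEN NULL <> 1   THEN 'true' ELSE 'not true' END;  -- 'not true'

-- TerritoryID NOT IN (1,2,3,4,5) is equivalent to:
--   TerritoryID <> 1 AND TerritoryID <> 2 AND ... AND TerritoryID <> 5
-- For a NULL TerritoryID each comparison is UNKNOWN, so the row is dropped.
```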

Solution 01: You can modify this behaviour of SQL Server by setting ANSI_NULLS to OFF. However, this is not recommended because:

  1. Setting ANSI_NULLS to OFF causes the comparison operators in SQL Server to stop following the ISO standards
  2. ANSI_NULLS will always be ON in a future release of SQL Server (refer Books On Line)

Solution 02: Alternatively, you can also use the NOT EXISTS clause to fetch the correct results:

USE AdventureWorks2008R2
GO
-- Solution #01 - Set ANSI_NULLS to OFF. 
/****************************************
This is not a recommended solution
****************************************/
SET ANSI_NULLS OFF
SELECT * FROM Sales.SalesPerson WHERE TerritoryID NOT IN (1,2,3,4,5)
-- Because setting ANSI_NULLS to OFF is not recommended, revert back
SET ANSI_NULLS ON

-- Solution #02 - Using the NOT EXISTS Clause
SELECT *
FROM Sales.SalesPerson sp1
WHERE NOT EXISTS (SELECT *
                  FROM Sales.SalesPerson sp2
                  WHERE sp1.TerritoryID = sp2.TerritoryID
                    AND sp2.TerritoryID IN (1,2,3,4,5))

COUNT(*) and LEFT vs RIGHT joins

Assume the following scenario, wherein we have a list of sales personnel in one table and the orders booked by them in another. The scripts and data are supplied by Jacob, because such data does not exist in the AdventureWorks2008R2 database.

/*-----------------------------------------------------------------------------
  Date			: 23 March 2011
  SQL Version           : SQL Server 2005/2008/Denali
  Author		: Jacob Sebastian
  Email			: jacob@beyondrelational.com
  Twitter		: @jacobsebastian
  Blog			: http://beyondrelational.com/blogs/jacob
  Website		: http://beyondrelational.com

  Summary:
  This script is part of the demo presented by Jacob Sebastian during 
  Tech-ED India 2011 on "TSQL Worst Practices".

  Disclaimer:  
  THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF 
  ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED 
  TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
  PARTICULAR PURPOSE.
-----------------------------------------------------------------------------*/
USE teched 
GO

IF OBJECT_ID('salesrep','U') IS NOT NULL DROP TABLE salesrep
GO

CREATE TABLE salesrep (id int, name varchar(20))
GO

INSERT INTO salesrep(id, name) SELECT 1, 'Jacob'
INSERT INTO salesrep(id, name) SELECT 2, 'Pinal'
INSERT INTO salesrep(id, name) SELECT 3, 'Mike'
INSERT INTO salesrep(id, name) SELECT 4, 'Steve'

IF OBJECT_ID('orders', 'U') IS NOT NULL DROP TABLE orders 
GO

CREATE TABLE orders (salesrep INT, OrderID INT, Amount MONEY)
INSERT INTO orders (salesrep, OrderID, Amount) select 1, 1, 100
INSERT INTO orders (salesrep, OrderID, Amount) select 1, 2, 100
INSERT INTO orders (salesrep, OrderID, Amount) select 2, 3, 100
INSERT INTO orders (salesrep, OrderID, Amount) select 2, 4, 100
INSERT INTO orders (salesrep, OrderID, Amount) select 2, 5, 100

SELECT * FROM salesrep
SELECT * FROM orders 

Manager: I want a list of all Sales Personnel with the number of orders booked by them.

You: Sure. I will have the results ready for your review in 5 minutes.

SELECT s.name,
       COUNT(*) AS OrderCount
FROM salesrep s
LEFT JOIN orders o ON o.salesrep = s.id
GROUP BY s.name  

[image: query results – every sales rep shows an OrderCount of at least 1]

You (thinking): The results don’t look right. What’s going on?

Root Cause: Again, it is the NULL values that bother us. Looking at Books On Line (COUNT aggregate function), we learn that aggregate functions ignore NULL values – with the exception of COUNT(*). COUNT(*) is the black sheep: it counts rows, not values, so it counts a row even when the LEFT JOIN has padded every column from the orders table with NULLs. That is why the reps with no orders still show a count of 1.

Solution(s): There are at least two solutions to this mystery.

  1. The first solution is to use an INNER JOIN instead of the LEFT JOIN. However, this might not be what the manager wanted (remember that the manager asked for a list of all sales personnel with the number of orders booked by them). With an INNER JOIN, we only get the sales personnel who have booked at least one order
  2. The second is to count a column from the right-hand side table of the LEFT JOIN – COUNT(o.orderid) – which ignores the NULLs produced by unmatched rows

-- Solution 01 - Technically correct, but may not be what the manager wanted
SELECT s.name,
       COUNT(*) AS OrderCount
FROM salesrep s
INNER JOIN orders o ON o.salesrep = s.id
GROUP BY s.name   

-- Solution 02 - Technically correct and what the manager wanted
SELECT s.name,
       COUNT(o.orderid) AS OrderCount
FROM salesrep s
LEFT JOIN orders o ON o.salesrep = s.id
GROUP BY s.name  

[image: results of solution #01 and solution #02]

Summarizing – Part 01:

Finally, summarizing all the points that have been discussed today (Part 01) and during Tech-Ed 2011:

  1. Setting ANSI_NULLS to OFF is bad! Any code depending on it should be revisited
  2. Using NOT IN may give you incorrect results when NULL values are possible. Add an explicit IS NOT NULL check, or use NOT EXISTS instead
  3. COUNT(*) with a LEFT JOIN may produce incorrect results
  4. Use COUNT(column) on a column from the right-side table instead of COUNT(*)
  5. There are always hidden requirements, as in the manager’s request in the COUNT(*) example – pay attention to each and every word!

Thank-you, Jacob, for this really great session!

You can read my memories of Tech-Ed by going through the following series:

Tech Ed 2011 (India) – Day 01 – Connect, Learn, Explore, Evolve!

Tech Ed 2011 (India) – Day 02 – IE 9 – India launch – Fast is now beautiful!

Tech Ed 2011 (India) – Day 03 – For the DBAs – Contained Databases, Extended Events, Virtualization, Wait Stats, Wait Queues & High Availability

Until tomorrow,

Be courteous. Drive responsibly.

Measuring the number of rows in a table – are there any alternatives to COUNT(*)?


Today, we will talk about a slightly different, yet very common subject: the different ways to measure the number of rows in a table.

I found a couple of posts on the Internet, and I will reference them as I demonstrate the associated methods. This post is essentially a compilation of those methods, and attempts a comparative analysis to categorize them in terms of:

  • Performance (in terms of time)
  • Accuracy

The most common methods

The most common of all methods is the one that every person who has spent even 5 minutes with Microsoft SQL Server knows about. It is the very generic:

SELECT COUNT(*) FROM HumanResources.Employee

Other similar options are:

SELECT COUNT(1) FROM HumanResources.Employee
SELECT COUNT(BusinessEntityID) FROM HumanResources.Employee

Some developers believe that using COUNT(1), or the first primary key column, is faster than COUNT(*); however, that is untrue. Here’s the proof:

[image: execution plans – all three COUNT variants have the same cost]

If you have tables with row counts greater than the range of the INT data type, you may use COUNT_BIG() instead.
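The COUNT_BIG variant is identical apart from the function name and the BIGINT return type:

```sql
-- COUNT_BIG returns BIGINT, so it cannot overflow at ~2.1 billion rows:
SELECT COUNT_BIG(*) FROM HumanResources.Employee;
```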

Using Catalog Views

Microsoft SQL Server comes with a number of system views that return the metadata used by the SQL Server Database Engine, and this is one of the obvious places to look for row count information. A table with a clustered index will have at least one entry for that index in the backward-compatibility view sys.sysindexes. This entry holds the row count of the index, which in turn equals the number of rows in the table (a clustered index cannot be filtered, so it always covers the entire table).

SELECT si.rows AS 'RowCount'
FROM sys.sysindexes si
WHERE si.id = OBJECT_ID('HumanResources.Employee') AND si.indid < 2

Comparing against the conventional options, we find that this option is faster!

image

However, please note that the catalog views may report stale data.

Using DMVs

Mr. Pinal Dave (blog) provides a great method of accurately determining the number of rows in a table with the use of DMVs in his post here.

His query has been modified to only look for the HumanResources.Employee table below:

SELECT sc.name + '.' + ta.name TableName, SUM(pa.rows) RowCnt
FROM sys.tables ta
INNER JOIN sys.partitions pa ON pa.OBJECT_ID = ta.OBJECT_ID
INNER JOIN sys.schemas sc ON ta.schema_id = sc.schema_id
WHERE ta.is_ms_shipped = 0 AND 
      pa.index_id IN (1,0) AND
      sc.name = 'HumanResources' AND
      ta.name = 'Employee'
GROUP BY sc.name, ta.name
ORDER BY SUM(pa.rows) DESC

This method falls behind in performance when searching for the row count of an individual table, as demonstrated below. However, without the schema name and table name filters in the WHERE clause, it is a handy script to get the row counts for all tables in a database. As a matter of fact, I have verified with the help of SQL Profiler that Microsoft also uses sys.partitions to fetch the row counts shown in the Object Explorer Details window. (If you would like to know more about the Object Explorer Details window, refer to my posts – Part I and Part II – which are part of my series on the underappreciated features of Microsoft SQL Server.)

I would recommend this be in the "tool-kit" of each and every database developer, administrator and operator.

image

Using Aggregation Functions

The following method, from Madhivanan’s post (blog) (read this post here) is quite interesting:

SELECT SUM(1) AS 'RowCount' FROM HumanResources.Employee

Performance-wise, this method is the same as the well-known COUNT(*) or COUNT(1), but it is interesting because it uses a different aggregation function to get the row count.

Using System Stored Procedures

This one again comes from Madhivanan’s post, with a small extension from Books-On-Line.

Microsoft SQL Server comes with a big list of system stored procedures (the most common being sp_help, sp_executesql and many others we use on a regular basis). It is no surprise, then, that we also have a system stored procedure to provide information on the number of rows in a table.

Rows in a table occupy space, and hence it is fitting that the system stored procedure responsible for reporting space usage also reports the number of rows in a table.

--Default version, may return stale results
EXEC sp_spaceused @objname=N'HumanResources.Employee';

--Force update of usage statistics, will always return current usage
EXEC sp_spaceused @objname=N'HumanResources.Employee', @updateusage=N'TRUE';

image

Here’s the catch: these system stored procedures ALWAYS return the row count as an INT. If the number of rows in your table exceeds the range of an INT, do not use them.

It is also slower than the conventional methods mentioned above.

image

Read more on sp_spaceused on Books-On-Line at: http://msdn.microsoft.com/en-us/library/ms188776.aspx
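If you want to consume sp_spaceused programmatically rather than eyeball its output, one approach (an illustration of mine, not from the original scripts) is to capture its result set via INSERT…EXEC; the column names and types below follow the documented sp_spaceused result set:

```sql
-- Sketch: capture sp_spaceused output for programmatic use
CREATE TABLE #SpaceUsed
(
    name       sysname,
    rows       CHAR(11),     -- returned as a character column; the underlying count is INT
    reserved   VARCHAR(18),
    data       VARCHAR(18),
    index_size VARCHAR(18),
    unused     VARCHAR(18)
);

INSERT INTO #SpaceUsed
EXEC sp_spaceused @objname = N'HumanResources.Employee', @updateusage = N'TRUE';

SELECT name, rows FROM #SpaceUsed;

DROP TABLE #SpaceUsed;
```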

The next method, which also uses a system function, came from a friend:

SELECT COUNT(@@ROWCOUNT) FROM HumanResources.Employee

This method performs as well as the conventional ones.

Summary

To summarize, here is a small table that serves as a ready look-up for each of these methods. I hope you find it useful.

Method                                       Performance (Time)   Accuracy
-------------------------------------------  -------------------  -----------
Conventional (SELECT COUNT(*) and others)    No                   Yes
Catalog Views (sys.sysindexes)               Yes                  Approximate
DMVs                                         Partial*             Yes
Aggregate Functions                          No                   Yes
System Stored Procedures                     No                   Approximate

(*Partial = Fares low on performance for individual tables, but scores high for a group of tables).

There are a couple of points which I would like the reader to keep in mind when gathering data for row count analysis:

  • System stored procedures will always return an INT value
  • Size information is not maintained for non-clustered indexes – please exercise due diligence before using these methods with non-clustered indexes (or tables without clustered indexes, i.e. heaps)
  • Row counts may be incorrect for large tables. For accuracy, always update usage statistics by running the following before attempting to fetch an accurate row count:
--Update the table usage statistics
DBCC UPDATEUSAGE ('AdventureWorks2008R2','HumanResources.Employee')
GO
--Per Books On Line:
--Reports and corrects inaccuracies in the sysindexes table, 
--which may result in incorrect space usage reports by the sp_spaceused system stored procedure.

Computing Row Counts for a range of values

When working with a range of values, getting the row count is more difficult, and the options are very limited. To the best of my knowledge, users on SQL Server 2005 can only use the conventional queries:

SELECT COUNT(*) FROM HumanResources.Employee WHERE BirthDate BETWEEN '1950-01-01' and '1960-01-01'

However, users on SQL Server 2008 and up always have the option to create a filtered index (if the range is pre-defined) and then use the catalog view method.
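As a sketch of that idea (the index name and date range below are illustrative, not from the original post), you can create a filtered nonclustered index over the pre-defined range and then read its row count from the catalog:

```sql
-- SQL Server 2008+: filtered index over the range of interest.
-- Note: filtered index predicates use simple comparisons, not BETWEEN.
CREATE NONCLUSTERED INDEX IX_Employee_BirthDate_1950s
ON HumanResources.Employee (BirthDate)
WHERE BirthDate >= '1950-01-01' AND BirthDate <= '1960-01-01';
GO

-- The filtered index's row count equals the number of rows in the range
SELECT pa.rows
FROM sys.partitions pa
INNER JOIN sys.indexes si
        ON pa.object_id = si.object_id
       AND pa.index_id  = si.index_id
WHERE si.name = 'IX_Employee_BirthDate_1950s';
```

Keep in mind that this count is only as fresh as the index metadata, and the index itself must be maintained on every write to the table.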

And finally…

Now for the big question – which of the above methods are my favourites? Here’s my choice:

  1. Conventional methods if I need the accurate row count for a single table
  2. Catalog Views if I only need approximate row counts
  3. For fetching row counts of all tables in a database or a schema, I would go in for Pinal’s script

Do you use any other methods? If yes, do let me know – I will publish the alternatives with due credit. Also, if you use any other methods to get the row counts for range of values, please do share your solutions – they will also be published with your name as the contributor.

Until we meet next time,

Be courteous. Drive responsibly.

Tech Ed 2011 (India) – Day 03 – For the DBAs – Contained Databases, Extended Events, Virtualization, Wait Stats, Wait Queues & High Availability


As you know, I am documenting my experience at the Tech Ed 2011 (India) held from March 23-25, 2011 at Bengaluru, India.

You can find the summary of my experiences on Day 01 and Day 02 at:

Day 01 – http://beyondrelational.com/blogs/nakul/archive/2011/03/28/tech-ed-2011-india-day-01-connect-learn-explore-evolve.aspx

Day 02 – http://beyondrelational.com/blogs/nakul/archive/2011/03/30/tech-ed-2011-india-day-02-ie-9-india-launch-fast-is-now-beautiful.aspx

Today, as India celebrates its entry into the cricket World Cup 2011 final, I will take you through the final day of the wonderful journey that was Tech-Ed 2011.

A day for the DBAs

Day 03 was very special – it had a full track dedicated to the DBA community, guaranteed to attract all DBAs, accidental or otherwise, like bees to honey.

Hands-on Lab on Contained Databases by Jacob Sebastian

After the keynotes, the day immediately picked up speed as the DBA community got to participate in a hands-on lab on Contained Databases – a long awaited feature, and now part of SQL 11 (code named “Denali”). The lab was conducted by none other than renowned SQL Server MVP and leader of the Ahmedabad SQL Server User Group, Jacob Sebastian! His trademark style of explaining difficult concepts by telling a story made this concept a breeze to understand.

I will be taking everyone through the entire hands-on session in a series I am planning for next week. Stay tuned – I am sure you will find it useful.

Understanding SQL Server Behavioural Patterns – SQL Server Extended Events by Pinal Dave

Extended Events are not new to SQL Server 11 (“Denali”), but not many know how to use them efficiently. Pinal Dave, the renowned Microsoft evangelist, demonstrated how to use Extended Events for a couple of common database administration tasks. What I liked most were the demos on how to:

  1. Detect long-running queries
  2. Detect page splits

One of the most interesting parts of the session was how Pinal explained what events and extended events are by relating them to the human heart, and to a scenario wherein a hypothetical elephant needs to be taken to the hospital after collapsing during Tech-Ed. That’s what I call fun with learning!

You can download the scripts used during this demonstration by visiting Pinal’s blog at: http://blog.sqlauthority.com/2010/03/29/sql-server-introduction-to-extended-events-finding-long-running-queries/

IMG_8524

IMG_8527

Data Recovery with Backups by Vinod Kumar

Vinod Kumar was back after Day 01 – this time to show the audience the immense value a good backup strategy can have. Vinod demonstrated how to recover from the following situations from a backup:

  1. Recovering/Restoring to a point in time
  2. Recovering/Restoring to a named transaction
  3. Recovering from a page corruption

Quite frankly, I did not know that one could “mark” a transaction and that this can be used during recovery – just as a point-in-time restore. As a proof of the importance of this feature, Vinod discussed one of the practical implementations that have been done using this feature. In the interest of privacy, the exact nature of the project was not disclosed to the audience.

IMG_8531

SQLCAT: SQL Server Consolidation – Using Virtualization & Lessons Learned by Prem Mehra

Prem Mehra is a Program Manager with Microsoft at Redmond and works in the SQL Customer Advisory Team (SQLCAT). This team comprises solution experts who regularly engage in the largest, most complex and most unique customer deployments worldwide.

In this session, Prem discussed some of the consolidation projects that the SQLCAT team has done in a virtualized/cloud based environment. The audience had some great things to learn from him, and my biggest take-away was the following:

“If a SQL Server is not performing well, the problem is either in the application or in the underlying hardware platform – in the order mentioned.”

Also, Prem demonstrated the ideal approach in moving towards a cloud based deployment (which is shown here as a picture captured of his slide) and how the upcoming SQL Server 11 (“Denali”) will help in the journey.

IMG_8541

IMG_8544

(Source: http://msdn.microsoft.com/en-us/library/ee819082.aspx)

IMG_8550

SQL Server Waits and Queues – Your Gateway to Performance Troubleshooting by Pinal Dave

Do you really know how to remove SQL Server bottlenecks when a server seems to be really, really slow? Yeah, of course! Look at the wait statistics and queues – how difficult can it be?

Not much. So, are CXPACKET waits bad? Can we eliminate LOGBUFFER waits? What are Signal waits?

Pinal did it once again – with a simple example of a group of people waiting for a cab and then stopping to withdraw cash from an ATM, Pinal very beautifully explained signal waits. He then went on to demonstrate what CXPACKET waits are, why they will always be there, and how they might sometimes even be good. Finally, in a power-packed demo, he showed how moving the data file of a database to his portable USB drive can reduce LOGBUFFER waits. (Yes, you read that right – Pinal moved the data file of a database to his portable USB drive.)

Here’s where you can read Pinal’s series on Wait stats & queues – http://blog.sqlauthority.com/2011/02/28/sql-server-summary-of-month-wait-type-day-28-of-28/
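If you want to start exploring wait statistics on your own server, a minimal query (my own sketch, not from Pinal's scripts) against sys.dm_os_wait_stats surfaces the top waits, including the signal wait component discussed above:

```sql
-- Top accumulated waits since the last service restart (or stats clear).
-- signal_wait_time_ms = time spent waiting on the runnable queue for CPU;
-- subtracting it from wait_time_ms leaves the resource wait component.
SELECT TOP 10
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```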

IMG_8552

SQL Server AlwaysOn: The next generation high availability solution by Balmukund Lakhani

High availability is one of the key requirements of any SQL Server deployment. Who better than the Technical Lead of the SQL Server support team at Microsoft India GTSC to demonstrate how simple it is to set up a cluster, configure a high-availability solution and demonstrate a failover – all LIVE and within the hour!

Balmukund demonstrated the High Availability “AlwaysON” features which will be made available in SQL Server 11 (code named “Denali”). The solution will definitely make the lives of all DBAs and IT administrators easier because combined with the power of Windows Server 2008 R2, setting up a cluster (or an Availability Group) is a breeze. The cluster can have one automatic fail-over partner, with other manual fail-over partners. The best benefit of this solution is that it is similar to mirroring – but without the drawbacks of mirroring.

Here’s a step-by-step guide to getting you started on HADR: http://blogs.msdn.com/b/sqlserverfaq/archive/2010/12/17/sql-server-denali-alwayson-hadr-step-by-setup-setup-guide.aspx

IMG_8556

IMG_8558

Closing Keynote – Visit by Anil Kumble!

Finally, the time came to close the Tech-Ed 2011 (India). With a heavy heart, all participants submitted their feedback forms for the last sessions of the day and proceeded towards the keynote hall for the closing keynote.

The final attraction of this year’s Tech-Ed was a very special guest – none other than the star cricketer Anil Kumble! Gaurav Kapoor, who hosted the live sessions of Tech-Ed, introduced Anil Kumble on stage, and the entire hall erupted in cheers! Hands went up all over to catch a snap or two of Anil Kumble in person. So high was the excitement that nothing could distract the audience. I will let the pictures speak for themselves.

IMG_8560 

A full-house awaiting the start of the closing ceremony.

IMG_8561

IMG_8570

IMG_8581

IMG_8592

What’s next?

My next post will be a deep dive into Jacob’s session on the worst T-SQL practices from Tech-Ed Day 01. Next week, we will continue the series on the underappreciated features of SQL Server, and also explore the “Contained Database” feature of SQL Server 11 (code named “Denali”), based on the hands-on lab Jacob conducted on the same topic on Day 03 of Tech-Ed 2011.

Watch out for some exciting new reading material, hot off the presses!

Until we meet next time,

Be courteous. Drive responsibly.