Windowing functions – Underappreciated features of Microsoft SQL Server


This is part of the series of posts that I am running under the title “Under Appreciated Features of SQL Server“. The series has been inspired by Andy Warren’s editorial on SQLServerCentral.com.

Introduction

Normally, we all work with subsets of rows that meet some pre-defined, user-defined criteria (e.g. records corresponding to transactions done on a given date). Any such user-defined subset of rows is called a “window”. Please note that the keyword here is user-defined. A window thus corresponds to the rows returned as a result of a particular user query.

Any function that operates on each row of this user-defined set of rows, or “window”, is called a “windowing function”. Windowing functions are part of the ISO SQL standard, and Microsoft SQL Server (2005 and up) provides us with ranking and aggregate windowing functions. Today we will study a few of these powerful functions.

SQL Server Books-On-Line is a great learning resource for these, and hence I have basically compiled the information in Books-On-Line as an introduction to the windowing functions.

OVER clause

The OVER clause determines the partitioning and ordering of the rowset before the associated window function is applied. This is similar to the function of the GROUP BY and ORDER BY clauses, but it is executed during the SELECT operation itself, so that its output can be provided to the ranking and aggregate functions we will see in a minute. The ORDER BY clause has also been enhanced to work with the OVER clause, which allows us to create ordered partitions.

The advantage that the OVER clause and the windowing functions have over the conventional GROUP BY and ORDER BY implementations is that the partitioning and ordering can be specified independently for each function. To achieve this conventionally, we would have needed sub-queries, and the OVER clause is far more efficient than that. In the following example, I take the script provided in Books On Line and attempt to reproduce just one aggregate operation (SUM) using the conventional method.

--Enable the actual execution plan to see the performance benefits of the OVER clause
--Refer (http://beyondrelational.com/blogs/nakul/archive/2011/03/10/some-underappreciated-productivity-features-of-ssms.aspx)
--or use Ctrl+M

--Clear out the procedure cache
DBCC FREEPROCCACHE
GO

USE AdventureWorks2008;
GO
SELECT SalesOrderID, ProductID, OrderQty
    ,SUM(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Total'
    ,AVG(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Avg'
    ,COUNT(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Count'
    ,MIN(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Min'
    ,MAX(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Max'
FROM Sales.SalesOrderDetail 
WHERE SalesOrderID IN(43659,43664);
GO

--Clear out the procedure cache
DBCC FREEPROCCACHE
GO

USE AdventureWorks2008;
GO
SELECT sd.SalesOrderID, sd.ProductID, sd.OrderQty,
       (SELECT SUM(sd1.OrderQty) FROM Sales.SalesOrderDetail sd1 WHERE sd1.SalesOrderID = sd.SalesOrderID) AS 'Total'
FROM Sales.SalesOrderDetail sd
WHERE sd.SalesOrderID IN (43659, 43664)
GROUP BY sd.SalesOrderID, sd.ProductID, sd.OrderQty

Let’s see how the actual execution plan came along. (To display the actual execution plan, you may refer to my previous post or use Ctrl+M before executing the query.)

[Screenshot: actual execution plans comparing the OVER clause query with the conventional sub-query approach]

The screen-shot says it all! Performing all five aggregation operations (SUM, AVG, COUNT, MIN and MAX) using the OVER clause is far more efficient than performing a single SUM operation using the conventional method.

Finally, please note that the OVER clause cannot be used with the CHECKSUM aggregate function.
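To round off this section, here is a small sketch of that independence in action: each windowed aggregate below defines its own partition (per order, per product, and over the whole result), something a single GROUP BY could not express in one pass. It assumes the same AdventureWorks2008 database used above.

```sql
USE AdventureWorks2008;
GO
SELECT SalesOrderID, ProductID, OrderQty
    --Each OVER clause partitions the rows independently of the others
    ,SUM(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Total per order'
    ,SUM(OrderQty) OVER(PARTITION BY ProductID)    AS 'Total per product'
    ,SUM(OrderQty) OVER()                          AS 'Grand total'
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43659, 43664);
GO
```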

Ranking functions

As the name suggests, ranking functions return a ranking value for each row in the particular window/partition. Depending upon the type of ranking function used, the ranking may or may not be unique. Please note that all ranking functions are non-deterministic. You may want to keep this in mind for performance considerations.
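One practical consequence of this non-determinism: when the ORDER BY key inside the OVER clause has ties, a function like ROW_NUMBER() is free to number the tied rows differently between runs. The usual remedy, sketched below against AdventureWorks2008, is to append a unique column as a tiebreaker:

```sql
USE AdventureWorks2008;
GO
--Adding the unique SalesOrderDetailID as the final ORDER BY key makes
--the numbering repeatable even when several rows share the same OrderQty
SELECT SalesOrderID, SalesOrderDetailID, OrderQty
    ,ROW_NUMBER() OVER (ORDER BY OrderQty DESC, SalesOrderDetailID) AS 'Row Number'
FROM Sales.SalesOrderDetail
WHERE SalesOrderID = 43659;
GO
```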

Four ranking functions are available in Microsoft SQL Server, and they are as under:

  • RANK – Returns the rank of each row within the partition of a result set. If two records share the same rank, a gap is observed in the sequence; e.g. if two records share rank #1, the next rank issued will be #3. (Unique ranking: No)
  • DENSE_RANK – Returns the rank of rows within the partition of a result set, without any gaps in the ranking; e.g. if two records share rank #1, the next rank issued will be #2. (Unique ranking: No)
  • NTILE – Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs. (Unique ranking: No)
  • ROW_NUMBER – Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition. (Unique ranking: Yes)

To demonstrate these functions in action, I have slightly modified the example provided to us in BOL, as under:

USE AdventureWorks2008;
GO

--Ordering the result by the Postal Code
SELECT p.FirstName, p.LastName
    
    --Unique row-number
    ,ROW_NUMBER() OVER (ORDER BY a.PostalCode) AS 'Row Number'
    
    --Provide a rank, such that the largest Postal Code gets the highest rank number
    --A break in the sequence is expected if more than one record shares the same rank
    ,RANK() OVER (ORDER BY a.PostalCode) AS 'Rank'
    
    --Provide a rank, such that the largest Postal Code gets the highest rank number
    --No break in the sequence even when records share a rank
    ,DENSE_RANK() OVER (ORDER BY a.PostalCode) AS 'Dense Rank'

    --Break the result set into 4 groups.
    --Because the query returns 13 records for AdventureWorks, the first group
    --gets 4 records and the remaining three groups get 3 each. The value seen
    --in the 'Quartile' column is the number of the group the record belongs to
    ,NTILE(4) OVER (ORDER BY a.PostalCode) AS 'Quartile'

    ,s.SalesYTD, a.PostalCode
FROM Sales.SalesPerson s 
    INNER JOIN Person.Person p 
        ON s.BusinessEntityID = p.BusinessEntityID
    INNER JOIN Person.Address a 
        ON a.AddressID = p.BusinessEntityID
WHERE TerritoryID IS NOT NULL 
    AND SalesYTD <> 0;

 

The results of executing the above script are as under:

[Screenshot: result set showing the Row Number, Rank, Dense Rank and Quartile columns]

  • Ordering is by PostalCode to determine the RANK. Hence, we have only two distinct rank values, each beginning at the actual row number where its postal code first appears (i.e. 1 and 6)
  • On the other hand, for DENSE_RANK(), we continue to have two distinct values, but they no longer follow the row numbers (they are simply 1 and 2)
  • Finally, notice that for NTILE, SQL Server divided the 13 records into 4 groups, which resulted in the 1st group being larger than the other 3
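As a practical aside (this pattern is mine, not from BOL), ROW_NUMBER() combined with a common table expression is a handy way to pick the “top 1 per group”. The sketch below reuses the same joins to keep only the top earner per postal code:

```sql
USE AdventureWorks2008;
GO
--Number the sales people within each postal code, highest SalesYTD first,
--then keep only row 1 of every partition
WITH RankedSales AS
(
    SELECT p.FirstName, p.LastName, s.SalesYTD, a.PostalCode
        ,ROW_NUMBER() OVER (PARTITION BY a.PostalCode
                            ORDER BY s.SalesYTD DESC) AS RowNum
    FROM Sales.SalesPerson s
        INNER JOIN Person.Person p
            ON s.BusinessEntityID = p.BusinessEntityID
        INNER JOIN Person.Address a
            ON a.AddressID = p.BusinessEntityID
    WHERE s.TerritoryID IS NOT NULL
        AND s.SalesYTD <> 0
)
SELECT FirstName, LastName, SalesYTD, PostalCode
FROM RankedSales
WHERE RowNum = 1;
GO
```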

Reference

The following are good references for understanding windowing functions:

  1. OVER clause – http://msdn.microsoft.com/en-us/library/ms189461.aspx
  2. RANKING functions – http://msdn.microsoft.com/en-us/library/ms189798.aspx
  3. Aggregate functions – http://msdn.microsoft.com/en-us/library/ms173454.aspx

In Conclusion

Windowing functions are so efficient and easy to use that since the day I was introduced to them, I have used them wherever possible. I suggest you give them a try too. Do post your feedback, and also mention any challenges you face when working with windowing functions.

Until we meet next time,

Be courteous. Drive responsibly.

T-SQL Quiz 2011 is ON!


This month, we will exercise some gray cells. The T-SQL Quiz 2011 is now in progress, and I hope that by now all of the T-SQL fans amongst you have visited it.

For those who came in late, the T-SQL Quiz is an annual quiz on T-SQL organized on this very web-site! I have been following it and participating for the last two years, and below is my experience so far:

  • T-SQL Quiz is a great way to learn 30 new things in 30 days! – a crash course in the comfort of your home without any investment
  • Some of the best Quizmasters contribute to the quiz – they have tons of industry experience, and are active community contributors
  • You can have as many contributions as you wish
  • T-SQL Quiz is always open for viewing – even after the submissions close. This can serve as great future reference when similar situations come up at work
  • Anyone can participate!
  • Most importantly, it’s fun!

So, how can I start?

Well, you start by visiting the T-SQL 2011 Quiz URL – http://beyondrelational.com/quiz/SQLServer/TSQL/2011/default.aspx – and then clicking on each day of the calendar to look at, and answer the question for that particular day.

But, I am online only once a week…

That’s not a problem. Each question is open for a period of 30 days, so you have plenty of time to research, learn about and then answer the question.

The rewards

This is the most interesting part – rewards! The first prize is an Apple iPad sponsored by Redgate Software!

So, what are you waiting for? Give T-SQL Quiz 2011 a try this week-end itself. I am sure you will enjoy the experience. If you have any questions, head over to the FAQs section at http://beyondrelational.com/quiz/faq.aspx

Until we meet next time,

Be courteous. Drive responsibly.

Some underappreciated productivity features of SSMS


As we look forward to Project Juneau, I thought of quickly writing about some features of its predecessor, the SQL Server Management Studio (well known as SSMS). SSMS has power-packed features available at a single click of the mouse, some of which we might not be using in our daily work.

None of what I am about to mention is new; these are just some great productivity-enhancing ways of doing the same things. I hope that you like these tips, and adopt them in your day-to-day encounters with SSMS.

Trace Query in SQL Server Profiler

When working on a performance tuning exercise, SQL Server Profiler is one of the most used tools. Assume that we have a query all ready for execution in SSMS and need to launch a new instance of the Profiler. There’s no need to navigate out of the Query Editor window: the context menu has a direct link to launch the Profiler! What’s better, launching the Profiler using this option logs in to the server automatically and starts your default trace!

Once the Profiler is launched and the default trace started (automatically!), all we need to do is come back to SSMS and run the query.

[Screenshots: launching SQL Server Profiler from the Query Editor context menu]

Query Options

If we want to run a SQL query under several different execution settings, and do not really want to use the keyboard to type in the options with their proper syntax, the context menu of the query editor comes to our rescue.

In the screen-shots below, I demonstrate how a change made through the Query Options window takes effect:

[Screenshots: the Query Options window and the effect of the changed options on query execution]

Because query options are local to the particular session, opening a new SSMS window and running the query there will confirm that the changes are not persisted.
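Incidentally, the dialog is just a friendlier face over session-scoped SET statements. The sketch below (assuming the AdventureWorks2008 database) shows the T-SQL equivalent; because these settings are per-session, a new connection will not inherit them:

```sql
--Session-scoped settings; a new SSMS window will not inherit them
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT TOP (10) SalesOrderID, OrderQty
FROM Sales.SalesOrderDetail;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```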

Other “cool” features

In addition to the two features I wrote about today, you can also do the following directly from the Query Editor window itself:

  1. Change the active connection
  2. Display Estimated and Actual execution plans
  3. Launch the Database engine tuning advisor
  4. Include Client statistics – (great introduction to client statistics is from Mr. Pinal Dave’s (blog) post – http://blog.sqlauthority.com/2009/10/01/sql-server-sql-server-management-studio-and-client-statistics/)

If you want to know more about SQL 11 (“Denali”)

If you want to know more about SQL 11 (“Denali”), a good place to start will be the “What’s new” listing available on this very web-site! You can reach this list right here – http://beyondrelational.com/whatisnew/sqlserver/denali.aspx. Simply click on the feature you want to explore and if one of us from the community has written about it, chances are, it is linked here.

Until we meet next time,

Be courteous. Drive responsibly.

Filtered Indexes – Underappreciated features of Microsoft SQL Server


This is part of the series of posts that I am running under the title "Under Appreciated Features of SQL Server". The series has been inspired by Andy Warren’s editorial on SQLServerCentral.com.

(For those new to indexes, please do not proceed any further. Please go through the “Introduction to Indexes” series by Gail Shaw. You can access the series from here.)

Indexes are not new, but Microsoft SQL Server 2008 introduced a new feature called “Filtered Indexes”. A filtered index is an optimized non-clustered index (for obvious reasons, you cannot have a filtered clustered index), especially suited to cover queries that select from a well-defined subset of data. What this means is that if a particular subset of data is fetched from a table frequently (say, about 80% of the time), then such a table is an ideal candidate for a filtered index on that range of data.

A Brief demo

It looks quite simple, but the major change is that the CREATE INDEX statement now allows for the use of a filter predicate, so that only a portion of the rows in the table is indexed.

If a column contains mostly NULL values and the queries in the application only fetch the non-NULL values, such columns are great candidates for filtered indexes. In the AdventureWorks database, bills of materials are generally important for products whose deliveries are already completed. Hence, the application queries against Production.BillOfMaterials are great candidates for the use of a filtered index.

Here’s a brief demo on how filtered indexes can be created, and how they can be effective.

First up, we will see how we would conventionally address the problem of efficiently fetching ComponentID and StartDate from the Production.BillOfMaterials table. Use Ctrl+M (or Query->Include Actual Execution Plan) to view the actual Execution Plan of the query batch.

--The below example is a direct take from BOL, with a few minor changes
USE AdventureWorks2008;
GO

--Create a Regular Index on the Production.BillOfMaterials table
IF EXISTS (SELECT name FROM sys.indexes
    WHERE name = N'RIBillOfMaterialsWithEndDate' 
    AND object_id = OBJECT_ID(N'Production.BillOfMaterials'))
DROP INDEX RIBillOfMaterialsWithEndDate
    ON Production.BillOfMaterials;
GO

CREATE NONCLUSTERED INDEX "RIBillOfMaterialsWithEndDate"
    ON Production.BillOfMaterials (ComponentID, StartDate, EndDate)
GO

-- Clear the Procedure cache
DBCC FREEPROCCACHE
GO
--This SELECT will use the regular non-clustered index (no filtered index exists yet)
SELECT ComponentID, StartDate FROM Production.BillOfMaterials WHERE EndDate IS NOT NULL

Looking at the execution plan, you can see that SQL Server utilizes the newly created non-clustered index.

[Screenshot: execution plan showing a seek on the regular non-clustered index]

Notice that we had to include the EndDate column as part of the index key. However, if the application is only interested in records whose EndDate IS NOT NULL, a better idea would be to create a filtered non-clustered index.

--The below example is a direct take from BOL, with a few minor changes
USE AdventureWorks2008;
GO

--For a clean test, begin by dropping the "regular index"
IF EXISTS (SELECT name FROM sys.indexes
    WHERE name = N'RIBillOfMaterialsWithEndDate' 
    AND object_id = OBJECT_ID(N'Production.BillOfMaterials'))
DROP INDEX RIBillOfMaterialsWithEndDate
    ON Production.BillOfMaterials;
GO

--Create a Filtered Index on the Production.BillOfMaterials table
IF EXISTS (SELECT name FROM sys.indexes
    WHERE name = N'FIBillOfMaterialsWithEndDate' 
    AND object_id = OBJECT_ID(N'Production.BillOfMaterials'))
DROP INDEX FIBillOfMaterialsWithEndDate
    ON Production.BillOfMaterials;
GO

CREATE NONCLUSTERED INDEX "FIBillOfMaterialsWithEndDate"
    ON Production.BillOfMaterials (ComponentID, StartDate)
    WHERE EndDate IS NOT NULL;
GO

-- Clear the Procedure cache
DBCC FREEPROCCACHE
GO
--This select will make use of the filtered index
SELECT ComponentID, StartDate FROM Production.BillOfMaterials WHERE EndDate IS NOT NULL
GO

Let’s look at the execution plan to see if SQL Server uses the filtered index or not:

[Screenshot: execution plan showing a seek on the filtered index]

The Advantages

Comparing the usage of the two indexes, the regular one and the filtered one, a few key differences come out:

Reduced index storage size

For performance reasons, and in order to make the SELECT an index-only operation, we had to put EndDate in the regular index itself. This means that the overall row size is higher for the regular index than for the filtered index. As proof of this, observe that the “Estimated Row Size” for the regular index is 27B, whereas for the filtered index it is just 19B.

While it may not look like a big difference, it becomes one when you are looking at deployment in a shared data center or the cloud, where one is charged for the amount of space used.

If you have multiple queries that work on pre-defined filters, it may be a better idea to replace a full-table non-clustered index with multiple filtered indexes. Of course, this depends upon the nature of the queries the application uses, and upon the nature of the underlying data.
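A hypothetical sketch of that idea (the index names here are my own): two narrow filtered indexes, one per predicate the application actually uses, instead of one wide full-table index:

```sql
--Serves queries that filter on EndDate IS NOT NULL (completed deliveries)
CREATE NONCLUSTERED INDEX FIBillOfMaterialsCompleted
    ON Production.BillOfMaterials (ComponentID, StartDate)
    WHERE EndDate IS NOT NULL;
GO
--Serves queries that filter on EndDate IS NULL (active bills of material)
CREATE NONCLUSTERED INDEX FIBillOfMaterialsActive
    ON Production.BillOfMaterials (ComponentID, StartDate)
    WHERE EndDate IS NULL;
GO
```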

Reduced index maintenance costs

As a direct result of the above discussion, a filtered index is also cheaper to maintain. An index is maintained only when the data within it is affected by a Data Manipulation Language (DML) operation. Because the data stored in a filtered index is limited, many DML operations executed against the table may not touch the index at all. And even when a filtered index does contain frequently affected data, its smaller size reduces the cost of updating the index and its statistics.
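If you want to see the size difference for yourself rather than rely on the execution plan, a quick sketch against sys.dm_db_partition_stats will do; the filtered index should report noticeably fewer used pages and rows than its full-table counterpart:

```sql
USE AdventureWorks2008;
GO
--List each index on the table with its page and row footprint
SELECT i.name AS IndexName
    ,ps.used_page_count AS UsedPages
    ,ps.row_count AS RowsInIndex
FROM sys.dm_db_partition_stats ps
    INNER JOIN sys.indexes i
        ON i.object_id = ps.object_id
        AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID(N'Production.BillOfMaterials');
GO
```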

Improved Query Performance

Enhancements are not easily accepted unless they give some immediate benefits. A well-designed filtered index improves query performance because it is smaller than the full non-clustered index. It also contributes filtered statistics, which are cheaper to maintain and much more accurate than full-table statistics.

As proof of this, observe the following parameters from the screen-shots above:

  1. Estimated I/O cost
  2. Estimated CPU cost
  3. Estimated operator cost
  4. Estimated subtree cost

In each of these parameters, the filtered index wins hands down. For further confirmation, we will force the query to use each of these indexes in turn:

DBCC FREEPROCCACHE
GO
--This select will make use of the regular index
SELECT ComponentID, StartDate FROM Production.BillOfMaterials WITH (INDEX (RIBillOfMaterialsWithEndDate)) WHERE EndDate IS NOT NULL
GO
DBCC FREEPROCCACHE
GO
--This select will make use of the filtered index
SELECT ComponentID, StartDate FROM Production.BillOfMaterials WITH (INDEX (FIBillOfMaterialsWithEndDate)) WHERE EndDate IS NOT NULL
GO

Here’s the proof!

[Screenshot: side-by-side execution plans for the hinted queries, comparing the regular and filtered indexes]

A Word of Caution

Upon looking at BOL, I encountered the following words of caution – “A filtered index can be used as table hint, but will cause the query optimizer to generate error 8622 if it does not cover all of the rows that the query selects.” This is a very important design consideration when thinking about filtered indexes.

Let’s attempt to simulate a situation wherein we attempt to fetch more rows than what are covered by the filtered index.

SELECT ComponentID, StartDate, EndDate
FROM Production.BillOfMaterials
WITH (INDEX (FIBillOfMaterialsWithEndDate))

The following is the output:

Msg 8622, Level 16, State 1, Line 1

Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

Filtered Indexes vs. Views

Views offer a much wider range of functionality than filtered indexes. Views naturally allow for computed columns, joins, multiple tables and complex predicate logic, none of which filtered indexes can use.

This is exactly why we cannot create filtered indexes on views. However, this does not mean that a filtered index cannot be used when it is defined on a table that a view references. If the query optimizer determines that the results of the query will still be correct, it will go ahead and use the filtered index.
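A small sketch of that last point (the view name here is hypothetical): a plain view whose predicate matches the index filter, so the optimizer is free to use FIBillOfMaterialsWithEndDate underneath:

```sql
CREATE VIEW dbo.vCompletedBillOfMaterials
AS
    SELECT ComponentID, StartDate
    FROM Production.BillOfMaterials
    WHERE EndDate IS NOT NULL;
GO
--The optimizer can satisfy this query from the filtered index, because the
--view's WHERE clause guarantees the results stay correct
SELECT ComponentID, StartDate
FROM dbo.vCompletedBillOfMaterials;
GO
```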

More examples on this, and other filtered index design considerations can be found at: http://msdn.microsoft.com/en-us/library/cc280372.aspx

Some food for thought

Because we have filtered indexes, it is logical that we should have filtered statistics as well. If you have written about or are planning to write about filtered statistics, do link it up to this post. I will be sure to post my feedback 🙂

Until we meet next time,

Be courteous. Drive responsibly.

Compound Operators – the most common mistake


I will be taking a break from my series on “Underappreciated features of Microsoft SQL Server” to address a problem that I see frequently during the code reviews that I conduct in my capacity as a technical lead in the project I work for.

SQL Server 2008 introduced a great new feature: support for compound operators. This feature makes T-SQL programming a breeze for anyone who is a typical “hybrid” database developer, the kind that often mixes, let’s say, C# programming with a little bit of T-SQL here and there.

Compound Operators

As a quick introduction, a compound operator performs some operation and sets the original value to the result of that operation. This is what we commonly see as “i++” in application code: the “++” operator takes the original value of the variable “i” and increments it by 1. The examples for compound operators provided in Books-On-Line are great learning material; you can get to them from here.
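For quick reference, SQL Server 2008 supports eight compound operators (+=, -=, *=, /=, %=, &=, ^= and |=). A minimal sketch of the arithmetic ones:

```sql
DECLARE @i INT = 10;
SET @i += 5;   --@i is now 15
SET @i -= 3;   --12
SET @i *= 2;   --24
SET @i /= 4;   --6
SET @i %= 4;   --2
SELECT @i;     --returns 2
```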

The most common error

I have often seen a very common mistake with respect to compound operators during code reviews: the sequence! As with most things in life, sequence is important here, and the slightest typo will cause SQL Server to behave differently. Let’s see this with a practical example.

DECLARE @iTest INT
SET @iTest = 1

--Wrong! This works like an assignment.
--Effectively, we are assigning a value of +2 to @iTest
SELECT @iTest =+ 2
SELECT @iTest

--Right! This will use Compound Operators
SELECT @iTest += 2
SELECT @iTest

--For those who are curious, this works! (parsed as @iTest + (+2), a simple addition; no assignment happens)
SELECT @iTest ++ 2

-- But, none of the following works
SELECT @iTest ++= 2
SELECT @iTest =++ 2

Simply run the above code in SSMS against your instance of SQL Server 2008 to see the difference. The first SELECT performs a normal, garden-variety assignment (@iTest is set to +2), while the second performs the compound operation we originally intended. Do note that SQL Server will NOT throw an error message, because there is nothing syntactically wrong with the statement! It’s a simple logical error, which might take days to troubleshoot if not looked at carefully.

As you would probably have guessed by now, this is only applicable to the compound operators += and -= (reversing any of the others will cause SQL Server to error out). These are some of the most common operations that we use in code, and our habit of typing fast causes this error to occur very frequently. So the next time you use compound operators, be extra careful. For code reviewers: update your checklists to explicitly check for these errors.
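The same slip bites in set-based code too, where it can silently corrupt data. In this hypothetical sketch (dbo.StockLevels is an imaginary table), the first UPDATE adds 10 to the quantity, while the second overwrites every matching quantity with +10:

```sql
--Intended: increment the quantity by 10
UPDATE dbo.StockLevels SET Quantity += 10 WHERE ProductID = 707;

--Typo: assigns the value +10, wiping out the previous quantity
UPDATE dbo.StockLevels SET Quantity =+ 10 WHERE ProductID = 707;
```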

Until we meet next time,

Be courteous. Drive responsibly.