This is part of the series of posts that I am running under the title “Under Appreciated Features of SQL Server“. The series has been inspired by Andy Warren’s editorial on SQLServerCentral.com.
Introduction
Normally, we all work with subsets of rows that meet a pre-defined, user defined criteria (Ex. records corresponding to transactions done on any given date, etc). Any subset of user-defined records or rows is called a “window”. Please note that the keyword here is user-defined. A window thus corresponds to rows returned as a result of a particular user query.
Any function that operates on each rows of this user-defined set of rows or “window” is called a “windowing function”. Windowing functions are an ISO defined SQL standard, and Microsoft SQL Server (2005 and up) provides us with the option to use ranking and aggregate windowing functions. Today we will study a few of these powerful functions.
SQL Server Books-On-Line is a great learning resource for these, and hence I have basically compiled the information in Books-On-Line as an introduction to the windowing functions.
OVER clause
The OVER clause determines the partitioning and ordering of the rowset before the associated window function is applied. This is similar to the function of the GROUP BY and the ORDER BY clause, but is executed during the SELECT operation itself – so that the output of these operations can be provided to the ranking functions we will see in a minute. The ORDER BY clause has been enhanced to work with the OVER clause that allows us to create ordered partitions.
The advantage that the OVER clause and the windowing functions have over the conventional GROUP BY and ORDER BY implementations is that the partitions and ordering for each function is independent. To do this conventionally, we would have required the use of sub-queries – the OVER clause is far more efficient than that. In the following example, I take the script provided in Books On Line and attempt to reproduce just one aggregate operation (SUM) using the conventional methods.
--Enable the actual execution plan to see the performance benefits of the OVER clause --Refer (http://beyondrelational.com/blogs/nakul/archive/2011/03/10/some-underappreciated-productivity-features-of-ssms.aspx) --or use Ctrl+M --Clear out the procedure cache DBCC FREEPROCCACHE GO USE AdventureWorks2008; GO SELECT SalesOrderID, ProductID, OrderQty ,SUM(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Total' ,AVG(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Avg' ,COUNT(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Count' ,MIN(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Min' ,MAX(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Max' FROM Sales.SalesOrderDetail WHERE SalesOrderID IN(43659,43664); GO --Clear out the procedure cache DBCC FREEPROCCACHE GO USE AdventureWorks2008; GO SELECT sd.SalesOrderID, sd.ProductID, sd.OrderQty, (SELECT SUM(sd1.OrderQty) FROM Sales.SalesOrderDetail sd1 WHERE sd1.SalesOrderID = sd.SalesOrderID) AS 'Total' FROM Sales.SalesOrderDetail sd WHERE sd.SalesOrderID IN (43659, 43664) GROUP BY sd.SalesOrderID, sd.ProductID, sd.OrderQty
Let’s see how the actual execution plan came along. (To display the actual execution plan, you may refer my previous post or use Ctrl+M before executing the query)
The screen-shot says it all! Performing all aggregation operations – SUM, AVG, COUNT, MIN and MAX using the OVER clause are far more efficient than doing a single SUM operation using the conventional methods.
Finally, please note that the OVER clause cannot be used with the CHECKSUM aggregate function.
Ranking functions
As the name suggests, ranking functions return a ranking value for each row in the particular window/partition. Ranking may be unique depending upon the type of ranking function used. Please note that all ranking functions are non-deterministic. You may want to keep this in mind for performance considerations.
Four ranking functions are available in Microsoft SQL Server, and they are as under:
Ranking Function | Brief description (per BOL) | Unique ranking? | BOL Link |
RANK | Returns the rank of each row within the partition of a result set. If two records share the same rank, a gap is observed in the sequence, ex. if two records share the same rank #1, the next rank issued will be #3. |
N | Click here |
DENSE_RANK | Returns the rank of rows within the partition of a result set, without any gaps in the ranking. Ex. if two records share the same rank #1, the next rank issued will be #2. |
N | Click here |
NTILE |
Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs. |
N | Click here |
ROW_NUMBER |
Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition. |
Y | Click here |
To demonstrate these functions in action, I have slightly modified the example provided to us in BOL as under:
USE AdventureWorks2008; GO --Ordering the result by the Postal Code SELECT p.FirstName, p.LastName --Unique row-number ,ROW_NUMBER() OVER (ORDER BY a.PostalCode) AS 'Row Number' --Provide a rank, such that the largest Postal Code gets the highest rank --Break in the sequence expected if more than one record share the same rank ,RANK() OVER (ORDER BY a.PostalCode) AS 'Rank' --Provide a rank, such that the largest Postal Code gets the highest rank --No break in the sequence ,DENSE_RANK() OVER (ORDER BY a.PostalCode) AS 'Dense Rank' --Break the result set in 4 groups. --Because for AdventureWorks, the result is a set of 13 records, we have the set with the --largest postal code has 4 records, rest have 3. The result seen in the column 'Quartile' is --actually the group number in which the record belongs to ,NTILE(4) OVER (ORDER BY a.PostalCode) AS 'Quartile' ,s.SalesYTD, a.PostalCode FROM Sales.SalesPerson s INNER JOIN Person.Person p ON s.BusinessEntityID = p.BusinessEntityID INNER JOIN Person.Address a ON a.AddressID = p.BusinessEntityID WHERE TerritoryID IS NOT NULL AND SalesYTD <> 0;
The results of executing the above script are as under:
- Ordering is by PostalCode to determine the RANK. Hence, we will have only two sets of values – beginning at the actual row number (i.e. 1 and 6)
- On the other hand, for DENSERANK(), we continue to have two sets of values, but they longer follow the Row Number
- Finally, notice that for NTILE, the SQL Server attempted to divide 13 records into 4 groups, which resulted in the 1st group being larger than the other 3
Reference
The following are good references for understanding windowing functions:
- OVER clause – http://msdn.microsoft.com/en-us/library/ms189461.aspx
- RANKING functions – http://msdn.microsoft.com/en-us/library/ms189798.aspx
- Aggregate functions – http://msdn.microsoft.com/en-us/library/ms173454.aspx
In Conclusion
Windowing functions are so efficient and easy to use that starting the day I was introduced to them, I have always used them wherever possible. I suggest you too, give it a try. Do post your feedback and also mention the challenges that you have in working with windowing functions.
Until we meet next time,
Be courteous. Drive responsibly.