This is the second half of my two-part article on windowing functions in SQL Server 2012. In Part 1, I explained the new running and sliding aggregation capabilities added to the *OVER* clause in SQL Server 2012. In this post, I’ll explain the new T-SQL analytic windowing functions. All of these functions operate using the windowing principles I explained in Part 1.

### Eight New Analytic Functions

There are eight new analytic functions that have been added to T-SQL. All of them work in conjunction with an ordered window defined with an associated *ORDER BY* clause that can be optionally partitioned with a *PARTITION BY* clause and framed with a *BETWEEN* clause. The new functions are:

* • FIRST_VALUE*

* • LAST_VALUE*

* • LAG*

* • LEAD*

* • PERCENT_RANK*

* • CUME_DIST*

* • PERCENTILE_CONT*

* • PERCENTILE_DISC*

In the following code listing, the *FIRST_VALUE*, *LAST_VALUE*, *LAG*, and *LEAD* functions are used to analyze a set of orders at the product level.

DECLARE @Orders AS table(OrderDate date, ProductID int, Quantity int) INSERT INTO @Orders VALUES ('2011-03-18', 142, 74), ('2011-04-11', 123, 95), ('2011-04-12', 101, 38), ('2011-05-21', 130, 12), ('2011-05-30', 101, 28), ('2011-07-25', 123, 57), ('2011-07-28', 101, 12) SELECT OrderDate, ProductID, Quantity, WorstOn = FIRST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity), BestOn = LAST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), PrevOn = LAG(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate), NextOn = LEAD(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate) FROM @Orders ORDER BY OrderDate OrderDate ProductID Quantity WorstOn BestOn PrevOn NextOn ---------- --------- -------- ---------- ---------- ---------- ---------- 2011-03-18 142 74 2011-03-18 2011-03-18 NULL NULL 2011-04-11 123 95 2011-07-25 2011-04-11 NULL 2011-07-25 2011-04-12 101 38 2011-07-28 2011-04-12 NULL 2011-05-30 2011-05-21 130 12 2011-05-21 2011-05-21 NULL NULL 2011-05-30 101 28 2011-07-28 2011-04-12 2011-04-12 2011-07-28 2011-07-25 123 57 2011-07-25 2011-04-11 2011-04-11 NULL 2011-07-28 101 12 2011-07-28 2011-04-12 2011-05-30 NULL

In this query, four analytic functions specify an *OVER* clause that partitions the result set by *ProductID*. The product partitions defined for *FIRST_VALUE* and *LAST_VALUE* are sorted by *Quantity*, while the product partitions for *LAG* and *LEAD* are sorted by *OrderDate*. The full result set is sorted by *OrderDate*, so you need to visualize the sorted partition for each of the four functions to understand the output—the result set sequence is not the same as the row sequence used in the windowing functions.

*FIRST_VALUE* and *LAST_VALUE*

The *WorstOn* and *BestOn* columns use *FIRST_VALUE* and *LAST_VALUE* respectively to return the “worst” and “best” dates for the product in each partition. Performance is measured by quantity, so sorting each product’s partition by quantity will position the worst order at the first row in the partition and the best order at the last row in the partition. *FIRST_VALUE* and *LAST_VALUE* can return the value of any column (*OrderDate*, in this case), not just the aggregate column itself. For *LAST_VALUE*, it is also necessary to explicitly define a window over the entire partition with *ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING*. Otherwise, as explained in my coverage of *OVER* clause enhancements in Part 1, the default window is *ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW*, which frames (constrains) the window, and does not consider the remaining rows in the partition that are needed to obtain the highest quantity for *LAST_VALUE*.

In the output, notice that *OrderDate*, *LowestOn*, and *HighestOn* for the first order (product 142) are all the same value (3/18). This is because product 142 was only ordered once, so *FIRST_VALUE* and *LAST_VALUE* operate over a partition that has only this one row in it, with an *OrderDate* value of 3/18. The second row is for product 123, quantity 95, ordered on 4/11. Four rows ahead in the result set (not the partition) there is another order for product 123, quantity 57, placed on 7/25. This means that, for this product, *FIRST_VALUE* and *LAST_VALUE* operate over a partition that has these two rows in it, sorted by quantity. This positions the 7/25 order (quantity 57) first and the 4/11 (quantity 95) last within the partition. As a result, rows for product 123 report 7/25 for *WorstDate* and 4/11 for *BestDate*. The next order (product 101) appears two more times in the result set, creating a partition of three rows. Again, based on the *Quantity* sort of the partition, each row in the partition reports the product’s worst and best dates (which are 7/28 and 4/12, respectively).

*LAG* and *LEAD*

The *PrevOn* and *NextOn* columns use *LAG* and *LEAD* to return the previous and next date that each product was ordered. They specify an *OVER* clause that partitions by *ProductId* as before, but the rows in these partitions are sorted by *OrderDate*. Thus, the *LAG* and *LEAD* functions examine each product’s orders in chronological sequence, regardless of quantity. For each row in each partition, *LAG* is able to access previous (lagging) rows within the same partition. Similarly, *LEAD* can access subsequent (leading) rows within the same partition. The first parameter to *LAG* and *LEAD* specifies the column value to be returned from a lagging or leading row, respectively. The second parameter specifies the number of rows back or forward *LAG* and *LEAD* should seek within each partition, relative to the current row. The query passes *OrderDate* and *1* as parameters to *LAG* and *LEAD*, using product partitions that are ordered by date. Thus, the query returns the most recent past date, and nearest future date, that each product was ordered.

Because the first order’s product (142) was only ordered once, its single-row partition has no lagging or leading rows, and so *LAG* and *LEAD* both return *NULL* for *PrevOn* and *NextOn*. The second order (on 4/11) is for product 123, which was ordered again on 7/25, creating a partition with two rows sorted by *OrderDate*, with the 4/11 order positioned first and the 7/25 order positioned last within the partition. The first row in a multi-row window has no lagging rows, but at least one leading row. Similarly, the last order in a multi-row window has at least one lagging row, but no leading rows. As a result, the first order (4/11) reports *NULL* and 7/25 for *PrevOn* and *NextOn* (respectively), and the second order (7/25) reports 4/11 and *NULL* for *PrevOn* and *NextOn* (respectively). Product 101 was ordered three times, which creates a partition of three rows. In this partition, the second row has both a lagging row and a leading row. Thus, the three orders report *PrevOn* and *NextOn* values for product 101, respectively indicating *NULL*-5/30 for the first (4/12) order, 4/12-7/28 for the second (5/30) order, and 5/30-*NULL* for the third and last order.

The last functions to examine are *PERCENT_RANK* (rank distribution), *CUME_DIST* (cumulative distribution, or percentile), PERCENTILE_CONT (continuous percentile lookup), and PERCENTILE_DISC (discreet percentile lookup). The following queries demonstrate these functions, which are all closely related, by querying sales figures across each quarter of two years.

DECLARE @Sales table(Yr int, Qtr int, Amount money) INSERT INTO @Sales VALUES (2010, 1, 5000), (2010, 2, 6000), (2010, 3, 7000), (2010, 4, 2000), (2011, 1, 1000), (2011, 2, 2000), (2011, 3, 3000), (2011, 4, 4000) -- Distributed across all 8 quarters SELECT Yr, Qtr, Amount, R = RANK() OVER(ORDER BY Amount), PR = PERCENT_RANK() OVER(ORDER BY Amount), CD = CUME_DIST() OVER(ORDER BY Amount) FROM @Sales ORDER BY Amount -- Distributed (partitioned) by year with percentile lookups SELECT Yr, Qtr, Amount, R = RANK() OVER(PARTITION BY Yr ORDER BY Amount), PR = PERCENT_RANK() OVER(PARTITION BY Yr ORDER BY Amount), CD = CUME_DIST() OVER(PARTITION BY Yr ORDER BY Amount), PD5 = PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr), PD6 = PERCENTILE_DISC(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr), PC5 = PERCENTILE_CONT(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr), PC6 = PERCENTILE_CONT(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr) FROM @Sales ORDER BY Yr, Amount Yr Qtr Amount R PR CD ---- --- ------- - ----------------- ----- 2011 1 1000.00 1 0 0.125 2011 2 2000.00 2 0.142857142857143 0.375 2010 4 2000.00 2 0.142857142857143 0.375 2011 3 3000.00 4 0.428571428571429 0.5 2011 4 4000.00 5 0.571428571428571 0.625 2010 1 5000.00 6 0.714285714285714 0.75 2010 2 6000.00 7 0.857142857142857 0.875 2010 3 7000.00 8 1 1 Yr Qtr Amount R PR CD PD5 PD6 PC5 PC6 ---- --- ------- - ----------------- ---- ------- ------- ---- ---- 2010 4 2000.00 1 0 0.25 5000.00 6000.00 5500 5800 2010 1 5000.00 2 0.333333333333333 0.5 5000.00 6000.00 5500 5800 2010 2 6000.00 3 0.666666666666667 0.75 5000.00 6000.00 5500 5800 2010 3 7000.00 4 1 1 5000.00 6000.00 5500 5800 2011 1 1000.00 1 0 0.25 2000.00 3000.00 2500 2800 2011 2 2000.00 2 0.333333333333333 0.5 2000.00 3000.00 2500 2800 2011 3 3000.00 3 0.666666666666667 0.75 2000.00 3000.00 2500 2800 2011 4 4000.00 4 1 1 2000.00 3000.00 2500 2800

The new functions are all based on the *RANK* function introduced in SQL Server 2005. So both these queries also report on *RANK*, which will aid both in my explanation and your understanding of each of the new functions.

*PERCENT_RANK* and *CUME_DIST*

In the first query, *PERCENT_RANK* and *CUME_DIST* (aliased as *PR* and *CD* respectively) rank quarterly sales across the entire two year period. Look at the value returned by *RANK* (aliased as *R*). It ranks each row in the unpartitioned window (all eight quarters) by dollar amount. Both 2011Q2 and 2010Q4 are tied for $2,000 in sales, so *RANK* assigns them the same value (2). The next row break the tie, so *RANK* continues with 4, which accounts for the “empty slot” created by the two previous rows that were tied.

Now examine the values returned by *PERCENT_RANK* and *CUME_DIST*. Notice how they reflect the same information as *RANK* with decimal values ranging from 0 and 1. The only difference between the two is a slight variation in their formulaic implementation, such that *PERCENT_RANK* always starts with 0 while *CUME_DIST* always starts with a value greater than 0. Specifically, *PERCENT_RANK* returns (*RANK* – 1) / (N – 1) for each row, where N is the total number of rows in the window. This always returns 0 for the first (or only) row in the window. *CUME_DIST* returns *RANK* / N, which always returns a value greater than 0 for the first row in the window (which would be 1, if there’s only row). For windows with two or more rows, both functions return 1 for the last row in the window with decimal values distributed among all the other rows.

The second query examines the same sales figures, only this time the result set is partitioned by year. There are no ties within each year, so *RANK* assigns the sequential numbers 1 through 4 to each of the quarters, for 2010 and 2011, by dollar amount. You can see that *PERCENT_RANK* and *CUME_DIST* perform the same *RANK* calculations as explained for the first query (only, again, partitioned by year this time).

*PERCENTILE_DISC* and *PERCENTILE_CONT*

This query also demonstrates *PERCENTILE_DISC* and *PERCENTILE_CONT*. These very similar functions each accept a percentile parameter (the desired *CUME_DIST* value) and “reach in” to the window for the row at or near that percentile. The code demonstrates by calling both functions twice, once with a percentile parameter value of .5 and once with .6, returning columns aliased as *PD5*, *PD6*, *PC5*, and *PC6*. Both functions examine the *CUME_DIST* value for each row in the window to find the one nearest to .5 and .6. The subtle difference between them is that *PERCENTILE_DISC* will return a precise (discreet) value from the row with the matching percentile (or greater), while *PERCENTILE_CONT* interpolates a value based on a continuous range. Specifically, *PERCENTILE_CONT* returns a value ranging from the row matching the specified percentile—or a calculated value higher than that (based on the specified percentile) if there is no exact match—and the row with the next higher percentile in the window. This explains the values they return in this query.

Notice that these functions define their window ordering using

ORDER BYin aWITHIN GROUPclause rather than in theOVERclause. Thus, you do not (and cannot) specifyORDER BYin theOVERclause. TheOVERclause is still required, however, soOVER(with empty parentheses) must be specified even if you don’t want to partition usingPARTITION BY.

For the year 2010, the .5 percentile (*CUME_DIST* value) is located exactly on quarter 1, which had $5,000 in sales. Thus *PERCENTILE_DISC(.5)* returns 5000. There is no row in the window with a percentile of .6, so *PERCENTILE_DISC(.6)* matches up against the first row with a percentile greater than or equal to .6, which is the row for quarter 2 with $6,000 in sales, and thus returns 6000. In both cases, *PERCENTILE_DISC* returns a discreet value from a row in the window at or greater than the specified percentile. The same calculations are performed for 2011, returning 2000 for *PERCENTILE_DISC(.5)* and 3000 for *PERCENTILE_DISC(.6)*, corresponding to the $2,000 in sales for quarter 2 (percentile .5) and the $3,000 in sales for quarter 3 (percentile .75)

As I stated, *PERCENTILE_CONT* is very similar. It takes the same percentile parameter to find the row in the window matching that percentile. If there is no exact match, the function calculates a value based on the scale of percentiles distributed across the entire window, rather than looking ahead to the row having the next greater percentile value, as *PERCENTILE_DISC* does. Then it returns the median between that value and the value found in the row with the next greater percentile. For 2010, the .5 percentile matches up with 5000 (as before). The next percentile in the window is for .75 for 6000. The median between 5000 and 6000 is 5500 and thus, *PERCENTILE_CONT(.5)* returns 5500. There is no row in the window with a percentile of .6, so *PERCENTILE_CONT(.6)* calculates what the value for .6 would be (somewhere between 5000 and 6000, a bit closer to 5000) and then calculates the median between that value and the next percentile in the window (again, .75 for 6000). Thus, *PERCENTILE_CONT(.6)* returns 5800; slightly higher than the 5500 returned for *PERCENTILE_CONT(.5)*.

### Conclusion

This post explained the eight new analytic functions added to T-SQL in SQL Server 2012. These new functions, plus the running and sliding aggregation capabilities covered in Part 1, greatly expand the windowing capabilities of the *OVER* clause available since SQL Server 2005.

January 5, 2012 at 1:18 pm

Great features and post…Thanks