Geospatial Support for Circular Data in SQL Server 2012

SQL Server 2012 adds many significant improvements to the spatial support that was first introduced with SQL Server 2008. In this blog post, I’ll explore one of the more notable enhancements: support for curves and arcs (circular data). SQL Server 2008 only supported straight lines, or polygons composed of straight lines. The three new shapes in SQL Server 2012 are circular strings, compound curves, and curve polygons. All three are supported in Well-Known Text (WKT), Well-Known Binary (WKB), and Geometry Markup Language (GML) by both the geometry (planar, or “flat-earth” model) and geography (ellipsoidal sphere, or geodetic) data types, and all of the existing methods work on the new shapes.

Circular Strings

A circular string defines a basic curved line, similar to how a line string defines a straight line. It takes a minimum of three coordinates to define a circular string; the first and third coordinates define the end points of the line, and the second coordinate (the “anchor” point, which lies somewhere between the end points) determines the arc of the line. Here is the shape represented by CIRCULARSTRING(0 1, .25 0, 0 -1):

The following code produces four circular strings. All of them have the same start and end points, but different anchor points. The lines are buffered slightly to make them easier to see in the spatial viewer.

-- Create a "straight" circular line
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 0, 8 -8)').STBuffer(.1)
UNION ALL  -- Curve it
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 4, 8 -8)').STBuffer(.1)
UNION ALL  -- Curve it some more
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 6, 8 -8)').STBuffer(.1)
UNION ALL  -- Curve it in the other direction
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 -6, 8 -8)').STBuffer(.1)

The spatial viewer in SQL Server Management Studio shows the generated shapes:

The first shape is a “straight circular” line, because the anchor point is position directly between the start and end points. The next two shapes use the same end points with the anchor out to the right (4), one a bit further than the other (6). The last shape also uses the same end points, but specifies an anchor point that curves the line to the left rather than the right (-6).

You can extend circular strings with as many curve segments as you want. Do this by defining another two coordinates for each additional segment. The last point of the previous curve serves as the first end point of the next curve segment, so the two additional coordinates respectively specify the next segment’s anchor and second end point. Thus, valid circular strings will always have an odd number of points. You can extend a circular string indefinitely to form curves and arcs of any kind.

It’s easy to form a perfect circle by connecting two semi-circle segments. For example, the following illustration shows the circle produced by CIRCULARSTRING(0 4, 4 0, 8 4, 4 8, 0 4).

This particular example connects the end of the second segment to the beginning of the first segment to form a closed shape. Note that this is certainly not required of circular strings (or line strings), and that closing the shape by connecting the last segment to the first still does not result in a polygon, which is a two dimensional shape that has area. Despite being closed, this circle is still considered a one-dimensional shape with no area. As you’ll soon see, the curve polygon can be used to convert closed line shapes into true polygons.

Compound Curves

A compound curve is a set of circular strings, or circular strings combined with line strings, that form a desired curved shape. The end point of each element in the collection must match the starting point of the following element, so that compound curves are defined in a “connect-the-dots” fashion. The following code produces a compound curve and compares it with the equivalent geometry collection shape.

-- Compound curve
 DECLARE @CC geometry = '
  COMPOUNDCURVE(
   (4 4, 4 8),
   CIRCULARSTRING(4 8, 6 10, 8 8),
   (8 8, 8 4),
   CIRCULARSTRING(8 4, 2 3, 4 4)
  )'

-- Equivalent geometry collection
 DECLARE @GC geometry = '
  GEOMETRYCOLLECTION(
   LINESTRING(4 4, 4 8),
   CIRCULARSTRING(4 8, 6 10, 8 8),
   LINESTRING(8 8, 8 4),
   CIRCULARSTRING(8 4, 2 3, 4 4)
  )'

-- They both render the same shape in the spatial viewer
 SELECT @CC.STBuffer(.5)
 UNION ALL
 SELECT @GC.STBuffer(1.5)

This code creates a keyhole shape using a compound curve, and also creates an identical shape as a geometry collection (though notice that the LINESTRING keyword is not—and cannot—be specified when defining a compound curve). It then buffers both of them with different padding, so that the spatial viewer clearly shows the two identical shapes on top of one another, as shown:

Both the compound curve and the geometry collection yield identical shapes. In fact, the expression @CC.STEquals(@GC) which compares the two instances for equality returns 1 (for true). The STEquals method tests for “spatial equality,” meaning it returns true if two instances produce the same shape even if they are being rendered using different spatial data classes. Furthermore, recall that segments of a circular string can be made perfectly straight by positioning the anchor directly between the end points, meaning that the circular string offers yet a third option for producing the very same shape. So which one should you use? Comparing these spatial data classes will help you determine which one is best to use in different scenarios.

A geometry collection (which was already supported in SQL Server 2008) is the most accommodating, but carries the most storage overhead. Geometry collections can hold instances of any spatial data class, and the instances don’t need to be connected to (or intersected with) each other in any way. The collection simply holds a bunch of different shapes as a set, which in this example just happens to be several line strings and circular strings connected at their start and end points.

In contrast, the new compound curve class in SQL Server 2012 has the most constraints but is the most lightweight in terms of storage. It can only contain line strings or circular strings, and each segment’s start point must be connected to the previous segment’s end point (although it is most certainly not necessary to connect the first and last segments to form a closed shape as in this example). The DATALENGTH function shows the difference in storage requirements; DATALENGTH(@CC) returns 152 and DATALENGTH(@GC) returns 243. In our current example, DATALENGTH(@CC) returns 152 and DATALENGTH(@GC) returns 243. This means that the same shape requires 38% less storage space by using a compound curve instead of a geometry collection. A compound curve is also more storage-efficient than a multi-segment circular line string when straight lines are involved. This is because there is overhead for the mere potential of a curve, since the anchor point requires storage even when it’s position produces straight lines, whereas compound curves are optimized specifically to connect circular strings and (always straight) line strings.

Curve Polygons

A curve polygon is very similar to an ordinary polygon; like an ordinary polygon, a curve polygon specifies a “ring” that defines a closed shape, and can also specify additional inner rings to define “holes” inside the shape. The only fundamental difference between a polygon and a curve polygon is that the rings of a curve polygon can include circular shapes, whereas an ordinary polygon is composed exclusively with straight lines. Specifically, each ring in a curve polygon can consist of any combination of line strings, circular strings, and compound curves that collectively define the closed shape. For example, the following code produces a curve polygon with the same keyhole outline that I just demonstrated for the compound curve.

-- Curve polygon
 SELECT geometry::Parse('
  CURVEPOLYGON(
   COMPOUNDCURVE(
    (4 4, 4 8),
    CIRCULARSTRING(4 8, 6 10, 8 8),
    (8 8, 8 4),
    CIRCULARSTRING(8 4, 2 3, 4 4)
   )
  )')

This code has simply specified the same compound curve as the closed shape of a curve polygon. Although the shape is the same, the curve polygon is a two-dimensional object, whereas the compound curve version of the same shape is a one-dimensional object. This can be seen visually by the spatial viewer results, which shades the interior of the curve polygon as shown here:

Conclusion

Circular data support is an important new capability added to the spatial support in SQL Server 2012. In this blog post, I demonstrated the three new spatial data classes for curves and arcs: circular strings, compound curves, and curve polygons. Stay tuned for my next post (coming soon), for more powerful and fun spatial enhancements coming soon in SQL Server 2012!

SQL Server 2012 Windowing Functions Part 2 of 2: New Analytic Functions

This is the second half of my two-part article on windowing functions in SQL Server 2012. In Part 1, I explained the new running and sliding aggregation capabilities added to the OVER clause in SQL Server 2012. In this post, I’ll explain the new T-SQL analytic windowing functions. All of these functions operate using the windowing principles I explained in Part 1.

Eight New Analytic Functions

There are eight new analytic functions that have been added to T-SQL. All of them work in conjunction with an ordered window defined with an associated ORDER BY clause that can be optionally partitioned with a PARTITION BY clause and framed with a BETWEEN clause. The new functions are:

 • FIRST_VALUE
 • LAST_VALUE
 • LAG
 • LEAD
 • PERCENT_RANK
 • CUME_DIST
 • PERCENTILE_CONT
 • PERCENTILE_DISC

In the following code listing, the FIRST_VALUE, LAST_VALUE, LAG, and LEAD functions are used to analyze a set of orders at the product level.

DECLARE @Orders AS table(OrderDate date, ProductID int, Quantity int)
INSERT INTO @Orders VALUES
 ('2011-03-18', 142, 74),
 ('2011-04-11', 123, 95),
 ('2011-04-12', 101, 38),
 ('2011-05-21', 130, 12),
 ('2011-05-30', 101, 28),
 ('2011-07-25', 123, 57),
 ('2011-07-28', 101, 12)

SELECT
  OrderDate,
  ProductID,
  Quantity,
  WorstOn = FIRST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity),
  BestOn = LAST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
  PrevOn = LAG(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate),
  NextOn = LEAD(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate)
 FROM @Orders
 ORDER BY OrderDate

OrderDate   ProductID  Quantity  WorstOn     BestOn      PrevOn      NextOn
----------  ---------  --------  ----------  ----------  ----------  ----------
2011-03-18  142        74        2011-03-18  2011-03-18  NULL        NULL
2011-04-11  123        95        2011-07-25  2011-04-11  NULL        2011-07-25
2011-04-12  101        38        2011-07-28  2011-04-12  NULL        2011-05-30
2011-05-21  130        12        2011-05-21  2011-05-21  NULL        NULL
2011-05-30  101        28        2011-07-28  2011-04-12  2011-04-12  2011-07-28
2011-07-25  123        57        2011-07-25  2011-04-11  2011-04-11  NULL
2011-07-28  101        12        2011-07-28  2011-04-12  2011-05-30  NULL

In this query, four analytic functions specify an OVER clause that partitions the result set by ProductID. The product partitions defined for FIRST_VALUE and LAST_VALUE are sorted by Quantity, while the product partitions for LAG and LEAD are sorted by OrderDate. The full result set is sorted by OrderDate, so you need to visualize the sorted partition for each of the four functions to understand the output—the result set sequence is not the same as the row sequence used in the windowing functions.

FIRST_VALUE and LAST_VALUE

The WorstOn and BestOn columns use FIRST_VALUE and LAST_VALUE respectively to return the “worst” and “best” dates for the product in each partition. Performance is measured by quantity, so sorting each product’s partition by quantity will position the worst order at the first row in the partition and the best order at the last row in the partition. FIRST_VALUE and LAST_VALUE can return the value of any column (OrderDate, in this case), not just the aggregate column itself. For LAST_VALUE, it is also necessary to explicitly define a window over the entire partition with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Otherwise, as explained in my coverage of OVER clause enhancements in Part 1, the default window is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which frames (constrains) the window, and does not consider the remaining rows in the partition that are needed to obtain the highest quantity for LAST_VALUE.

In the output, notice that OrderDate, LowestOn, and HighestOn for the first order (product 142) are all the same value (3/18). This is because product 142 was only ordered once, so FIRST_VALUE and LAST_VALUE operate over a partition that has only this one row in it, with an OrderDate value of 3/18. The second row is for product 123, quantity 95, ordered on 4/11. Four rows ahead in the result set (not the partition) there is another order for product 123, quantity 57, placed on 7/25. This means that, for this product, FIRST_VALUE and LAST_VALUE operate over a partition that has these two rows in it, sorted by quantity. This positions the 7/25 order (quantity 57) first and the 4/11 (quantity 95) last within the partition. As a result, rows for product 123 report 7/25 for WorstDate and 4/11 for BestDate. The next order (product 101) appears two more times in the result set, creating a partition of three rows. Again, based on the Quantity sort of the partition, each row in the partition reports the product’s worst and best dates (which are 7/28 and 4/12, respectively).

LAG and LEAD

The PrevOn and NextOn columns use LAG and LEAD to return the previous and next date that each product was ordered. They specify an OVER clause that partitions by ProductId as before, but the rows in these partitions are sorted by OrderDate. Thus, the LAG and LEAD functions examine each product’s orders in chronological sequence, regardless of quantity. For each row in each partition, LAG is able to access previous (lagging) rows within the same partition. Similarly, LEAD can access subsequent (leading) rows within the same partition. The first parameter to LAG and LEAD specifies the column value to be returned from a lagging or leading row, respectively. The second parameter specifies the number of rows back or forward LAG and LEAD should seek within each partition, relative to the current row. The query passes OrderDate and 1 as parameters to LAG and LEAD, using product partitions that are ordered by date. Thus, the query returns the most recent past date, and nearest future date, that each product was ordered.

Because the first order’s product (142) was only ordered once, its single-row partition has no lagging or leading rows, and so LAG and LEAD both return NULL for PrevOn and NextOn. The second order (on 4/11) is for product 123, which was ordered again on 7/25, creating a partition with two rows sorted by OrderDate, with the 4/11 order positioned first and the 7/25 order positioned last within the partition. The first row in a multi-row window has no lagging rows, but at least one leading row. Similarly, the last order in a multi-row window has at least one lagging row, but no leading rows. As a result, the first order (4/11) reports NULL and 7/25 for PrevOn and NextOn (respectively), and the second order (7/25) reports 4/11 and NULL for PrevOn and NextOn (respectively). Product 101 was ordered three times, which creates a partition of three rows. In this partition, the second row has both a lagging row and a leading row. Thus, the three orders report PrevOn and NextOn values for product 101, respectively indicating NULL-5/30 for the first (4/12) order, 4/12-7/28 for the second (5/30) order, and 5/30-NULL for the third and last order.

The last functions to examine are PERCENT_RANK (rank distribution), CUME_DIST (cumulative distribution, or percentile), PERCENTILE_CONT (continuous percentile lookup), and PERCENTILE_DISC (discreet percentile lookup). The following queries demonstrate these functions, which are all closely related, by querying sales figures across each quarter of two years.

DECLARE @Sales table(Yr int, Qtr int, Amount money)
INSERT INTO @Sales VALUES
  (2010, 1, 5000), (2010, 2, 6000), (2010, 3, 7000), (2010, 4, 2000),
  (2011, 1, 1000), (2011, 2, 2000), (2011, 3, 3000), (2011, 4, 4000)

-- Distributed across all 8 quarters
SELECT
  Yr, Qtr, Amount,
  R = RANK() OVER(ORDER BY Amount),
  PR = PERCENT_RANK() OVER(ORDER BY Amount),
  CD = CUME_DIST() OVER(ORDER BY Amount)
 FROM @Sales
 ORDER BY Amount

-- Distributed (partitioned) by year with percentile lookups
SELECT
  Yr, Qtr, Amount,
  R = RANK() OVER(PARTITION BY Yr ORDER BY Amount),
  PR = PERCENT_RANK() OVER(PARTITION BY Yr ORDER BY Amount),
  CD = CUME_DIST() OVER(PARTITION BY Yr ORDER BY Amount),
  PD5 = PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr),
  PD6 = PERCENTILE_DISC(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr),
  PC5 = PERCENTILE_CONT(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr),
  PC6 = PERCENTILE_CONT(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr)
 FROM @Sales
 ORDER BY Yr, Amount

Yr    Qtr  Amount   R  PR                 CD
----  ---  -------  -  -----------------  -----
2011  1    1000.00  1  0                  0.125
2011  2    2000.00  2  0.142857142857143  0.375
2010  4    2000.00  2  0.142857142857143  0.375
2011  3    3000.00  4  0.428571428571429  0.5
2011  4    4000.00  5  0.571428571428571  0.625
2010  1    5000.00  6  0.714285714285714  0.75
2010  2    6000.00  7  0.857142857142857  0.875
2010  3    7000.00  8  1                  1

Yr    Qtr  Amount   R  PR                 CD    PD5      PD6      PC5   PC6
----  ---  -------  -  -----------------  ----  -------  -------  ----  ----
2010  4    2000.00  1  0                  0.25  5000.00  6000.00  5500  5800
2010  1    5000.00  2  0.333333333333333  0.5   5000.00  6000.00  5500  5800
2010  2    6000.00  3  0.666666666666667  0.75  5000.00  6000.00  5500  5800
2010  3    7000.00  4  1                  1     5000.00  6000.00  5500  5800
2011  1    1000.00  1  0                  0.25  2000.00  3000.00  2500  2800
2011  2    2000.00  2  0.333333333333333  0.5   2000.00  3000.00  2500  2800
2011  3    3000.00  3  0.666666666666667  0.75  2000.00  3000.00  2500  2800
2011  4    4000.00  4  1                  1     2000.00  3000.00  2500  2800

The new functions are all based on the RANK function introduced in SQL Server 2005. So both these queries also report on RANK, which will aid both in my explanation and your understanding of each of the new functions.

PERCENT_RANK and CUME_DIST

In the first query, PERCENT_RANK and CUME_DIST (aliased as PR and CD respectively) rank quarterly sales across the entire two year period. Look at the value returned by RANK (aliased as R). It ranks each row in the unpartitioned window (all eight quarters) by dollar amount. Both 2011Q2 and 2010Q4 are tied for $2,000 in sales, so RANK assigns them the same value (2). The next row break the tie, so RANK continues with 4, which accounts for the “empty slot” created by the two previous rows that were tied.

Now examine the values returned by PERCENT_RANK and CUME_DIST. Notice how they reflect the same information as RANK with decimal values ranging from 0 and 1. The only difference between the two is a slight variation in their formulaic implementation, such that PERCENT_RANK always starts with 0 while CUME_DIST always starts with a value greater than 0. Specifically, PERCENT_RANK returns (RANK – 1) / (N – 1) for each row, where N is the total number of rows in the window. This always returns 0 for the first (or only) row in the window. CUME_DIST returns RANK / N, which always returns a value greater than 0 for the first row in the window (which would be 1, if there’s only row). For windows with two or more rows, both functions return 1 for the last row in the window with decimal values distributed among all the other rows.

The second query examines the same sales figures, only this time the result set is partitioned by year. There are no ties within each year, so RANK assigns the sequential numbers 1 through 4 to each of the quarters, for 2010 and 2011, by dollar amount. You can see that PERCENT_RANK and CUME_DIST perform the same RANK calculations as explained for the first query (only, again, partitioned by year this time).

PERCENTILE_DISC and PERCENTILE_CONT

This query also demonstrates PERCENTILE_DISC and PERCENTILE_CONT. These very similar functions each accept a percentile parameter (the desired CUME_DIST value) and “reach in” to the window for the row at or near that percentile. The code demonstrates by calling both functions twice, once with a percentile parameter value of .5 and once with .6, returning columns aliased as PD5, PD6, PC5, and PC6. Both functions examine the CUME_DIST value for each row in the window to find the one nearest to .5 and .6. The subtle difference between them is that PERCENTILE_DISC will return a precise (discreet) value from the row with the matching percentile (or greater), while PERCENTILE_CONT interpolates a value based on a continuous range. Specifically, PERCENTILE_CONT returns a value ranging from the row matching the specified percentile—or a calculated value higher than that (based on the specified percentile) if there is no exact match—and the row with the next higher percentile in the window. This explains the values they return in this query.

Notice that these functions define their window ordering using ORDER BY in a WITHIN GROUP clause rather than in the OVER clause. Thus, you do not (and cannot) specify ORDER BY in the OVER clause. The OVER clause is still required, however, so OVER (with empty parentheses) must be specified even if you don’t want to partition using PARTITION BY.

For the year 2010, the .5 percentile (CUME_DIST value) is located exactly on quarter 1, which had $5,000 in sales. Thus PERCENTILE_DISC(.5) returns 5000. There is no row in the window with a percentile of .6, so PERCENTILE_DISC(.6) matches up against the first row with a percentile greater than or equal to .6, which is the row for quarter 2 with $6,000 in sales, and thus returns 6000. In both cases, PERCENTILE_DISC returns a discreet value from a row in the window at or greater than the specified percentile. The same calculations are performed for 2011, returning 2000 for PERCENTILE_DISC(.5) and 3000 for PERCENTILE_DISC(.6), corresponding to the $2,000 in sales for quarter 2 (percentile .5) and the $3,000 in sales for quarter 3 (percentile .75)

As I stated, PERCENTILE_CONT is very similar. It takes the same percentile parameter to find the row in the window matching that percentile. If there is no exact match, the function calculates a value based on the scale of percentiles distributed across the entire window, rather than looking ahead to the row having the next greater percentile value, as PERCENTILE_DISC does. Then it returns the median between that value and the value found in the row with the next greater percentile. For 2010, the .5 percentile matches up with 5000 (as before). The next percentile in the window is for .75 for 6000. The median between 5000 and 6000 is 5500 and thus, PERCENTILE_CONT(.5) returns 5500. There is no row in the window with a percentile of .6, so PERCENTILE_CONT(.6) calculates what the value for .6 would be (somewhere between 5000 and 6000, a bit closer to 5000) and then calculates the median between that value and the next percentile in the window (again, .75 for 6000). Thus, PERCENTILE_CONT(.6) returns 5800; slightly higher than the 5500 returned for PERCENTILE_CONT(.5).

Conclusion

This post explained the eight new analytic functions added to T-SQL in SQL Server 2012. These new functions, plus the running and sliding aggregation capabilities covered in Part 1, greatly expand the windowing capabilities of the OVER clause available since SQL Server 2005.

SQL Server 2012 Windowing Functions Part 1 of 2: Running and Sliding Aggregates

The first windowing capabilities appeared in SQL Server 2005 with the introduction of the OVER clause and a set of ranking functions (ROW_NUMBER, RANK, DENSE_RANK, and NTILE). The term “window,” as it is used here, refers to the scope of visibility from one row in a result set relative to neighboring rows in the same result set. By default, OVER produces a single window over the entire result set, but its associated PARTITION BY clause lets you divide the result set up into distinct windows—one per partition. Furthermore, its associated ORDER BY clause enables cumulative calculations within each window.

In addition to the four ranking functions, the OVER clause can be used with the traditional aggregate functions (SUM, COUNT, MIN, MAX, AVG). This is extremely useful, because it allows you to calculate aggregations without being forced to summarize all the detail rows with a GROUP BY clause. However, prior to SQL Server 2012, running and sliding calculations with an associated ORDER BY clause was supported only for the ranking functions. Using ORDER BY with OVER for any of the aggregate functions was not allowed. This prevents running and sliding aggregations, severely limited the windowing capability of OVER since its introduction in SQL Server 2005.

Fortunately, SQL Server 2012 finally addresses this shortcoming. In this blog post, the first in a two-part article, I’ll show you how to use OVER/ORDER BY with all the traditional aggregate functions in SQL Server 2012 to provide running aggregates within ordered windows and partitions. I’ll also show you how to frame windows using the ROWS or RANGE clause, which adjusts the size and scope of the window and enables sliding aggregations. SQL Server 2012 also introduces eight new analytic functions that are designed specifically to work with ordered (and optionally partitioned) windows using the OVER clause. I will cover those new analytic functions in Part 2.

To demonstrate running and sliding aggregates, create a table and populate it with sample financial transactions for several different accounts, as shown below. (Note the use of the DATEFROMPARTS function, also new in SQL Server 2012, which is used to construct a date value from year, month, and day parameters.)

CREATE TABLE TxnData (AcctId int, TxnDate date, Amount decimal)
GO

INSERT INTO TxnData (AcctId, TxnDate, Amount) VALUES
  (1, DATEFROMPARTS(2011, 8, 10), 500),  -- 5 transactions for acct 1
  (1, DATEFROMPARTS(2011, 8, 22), 250),
  (1, DATEFROMPARTS(2011, 8, 24), 75),
  (1, DATEFROMPARTS(2011, 8, 26), 125),
  (1, DATEFROMPARTS(2011, 8, 28), 175),
  (2, DATEFROMPARTS(2011, 8, 11), 500),  -- 8 transactions for acct 2
  (2, DATEFROMPARTS(2011, 8, 15), 50),
  (2, DATEFROMPARTS(2011, 8, 22), 5000),
  (2, DATEFROMPARTS(2011, 8, 25), 550),
  (2, DATEFROMPARTS(2011, 8, 27), 105),
  (2, DATEFROMPARTS(2011, 8, 27), 95),
  (2, DATEFROMPARTS(2011, 8, 29), 100),
  (2, DATEFROMPARTS(2011, 8, 30), 2500),
  (3, DATEFROMPARTS(2011, 8, 14), 500),  -- 4 transactions for acct 3
  (3, DATEFROMPARTS(2011, 8, 15), 600),
  (3, DATEFROMPARTS(2011, 8, 22), 25),
  (3, DATEFROMPARTS(2011, 8, 23), 125)

Running Aggregations

Used by itself, the OVER clause operates over a window that encompasses the entire result set of a query. Windows can be partitioned in your queries using OVER with PARTITION BY, enabling partition-level aggregations to be calculated for each window. And with SQL Server 2012, an ORDER BY clause can also be specified with OVER to achieve row-level running aggregations within each window. The following code demonstrates the use of OVER with ORDER BY to produce running aggregations:

SELECT AcctId, TxnDate, Amount,
  RAvg = AVG(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate),
  RCnt = COUNT(*)    OVER (PARTITION BY AcctId ORDER BY TxnDate),
  RMin = MIN(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate),
  RMax = MAX(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate),
  RSum = SUM(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate)
 FROM TxnData
 ORDER BY AcctId, TxnDate

AcctId TxnDate    Amount RAvg        RCnt RMin RMax RSum
------ ---------- ------ ----------- ---- ---- ---- ----
1      2011-08-10 500    500.000000  1    500  500  500
1      2011-08-22 250    375.000000  2    250  500  750
1      2011-08-24 75     275.000000  3    75   500  825
1      2011-08-26 125    237.500000  4    75   500  950
1      2011-08-28 175    225.000000  5    75   500  1125
2      2011-08-11 500    500.000000  1    500  500  500
2      2011-08-15 50     275.000000  2    50   500  550
2      2011-08-22 5000   1850.000000 3    50   5000 5550
 :

The results of this query are partitioned (windowed) by account. Within each window, the account’s running averages, counts, minimum/maximum values, and sums are ordered by transaction date, showing the chronologically accumulated values for each account. No ROWS clause or RANGE clause is specified, so ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is assumed by default. This yields a window frame size that spans from the beginning of the partition (the first row of each account) through the current row. When the account number changes, the previous window is “closed” and new calculations start running for a new window over the next account number.

Sliding Aggregations

You can narrow each account’s window by framing it with a ROWS BETWEEN n PRECEDING AND CURRENT ROW clause within the OVER clause. This enables sliding calculations, as demonstrated by this slightly modified version of the previous query:

SELECT AcctId, TxnDate, Amount,
  SAvg = AVG(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
  SCnt = COUNT(*)    OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS 2 PRECEDING),
  SMin = MIN(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS 2 PRECEDING),
  SMax = MAX(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS 2 PRECEDING),
  SSum = SUM(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS 2 PRECEDING)
 FROM TxnData
 ORDER BY AcctId, TxnDate

AcctId TxnDate    Amount SAvg        SCnt SMin SMax SSum
------ ---------- ------ ----------- ---- ---- ---- ----
1      2011-08-10 500    500.000000  1    500  500  500
1      2011-08-22 250    375.000000  2    250  500  750
1      2011-08-24 75     275.000000  3    75   500  825
1      2011-08-26 125    150.000000  3    75   250  450
1      2011-08-28 175    125.000000  3    75   175  375
2      2011-08-11 500    500.000000  1    500  500  500
2      2011-08-15 50     275.000000  2    50   500  550
2      2011-08-22 5000   1850.000000 3    50   5000 5550
 :

This query specifies ROWS BETWEEN 2 PRECEDING AND CURRENT ROW in the OVER clause for the RAvg column, overriding the default window size. Specifically, it frames the window within each account’s partition to a maximum of three rows: the current row, the row before it, and one more row before that one. Once the window expands to three rows, it stops growing and starts sliding down the subsequent rows until a new partition (the next account) is encountered. The BETWEEN…AND CURRENT ROW keywords that specify the upper bound of the window are assumed default, so to reduce code clutter, the other column definitions specify just the lower bound of the window with the shorter variation ROWS 2 PRECEDING.

Notice how the window “slides” within each account. For example, the sliding maximum for account 1 drops from 500 to 250 in the fourth row, because 250 is the largest value in the window of three rows that begins two rows earlier—and the 500 from the very first row is no longer visible in that window. Similarly, the sliding sum for each account is based on the defined window. Thus, the sliding sum of 375 on the last row of account 1 is the total sum of that row (175) plus the two preceding rows (75 + 125) only—not the total sum for all transactions in the entire account, as the running sum had calculated.

Using RANGE versus ROWS

Finally, RANGE can be used instead of ROWS to handle “ties” within a window. While ROWS treats each row in the window distinctly, RANGE will merge rows containing duplicate ORDER BY values, as demonstrated by the following query:

SELECT AcctId, TxnDate, Amount,
 SumByRows = SUM(Amount) OVER (ORDER BY TxnDate ROWS UNBOUNDED PRECEDING),
 SumByRange = SUM(Amount) OVER (ORDER BY TxnDate RANGE UNBOUNDED PRECEDING)
 FROM TxnData
 WHERE AcctId = 2
 ORDER BY TxnDate

AcctId TxnDate    Amount SumByRows SumByRange
------ ---------- ------ --------- ----------
2      2011-08-11 500    500       500
2      2011-08-15 50     550       550
2      2011-08-22 5000   5550      5550
2      2011-08-25 550    6100      6100
2      2011-08-27 105    6205      6300
2      2011-08-27 95     6300      6300
2      2011-08-29 100    6400      6400
2      2011-08-30 2500   8900      8900

In this result set, ROWS and RANGE both return the same values, with the exception of the fifth row. Because the fifth and sixth rows are both tied for the same date (8/27/2011), RANGE returns the combined running sum for both rows. The seventh row (for 8/29/2011) breaks the tie, and ROWS “catches up” with RANGE to return running totals for the rest of the window.

Conclusion

Windowing functions using the OVER clause have been greatly enhanced in SQL Server 2012. In addition to the 4 ranking functions, running and sliding calculations with OVER/ORDER BY is now supported for all the traditional aggregate functions as well. SQL Server 2012 also introduces eight new analytic functions that are designed specifically to work with ordered (and optionally partitioned) windows using the OVER clause. Stay tuned for Part 2, which will show you how to use these new analytic windowing functions.

Download VSLive Orlando SQL Server 2012 Workshop Materials

I just got back from another great VSLive! A special thanks to all the attendees who were at my SQL Server workshop on Monday. You guys had great questions and were a lot of fun. As promised, I’ve posted the very latest version of my slides and code for you to grab right here:

http://bit.ly/VSL2011SQLOrlando

If you missed VSLive! in Orlando, catch us in Vegas next March. In addition to the full-day SQL 2012 workshop, I’ll be giving breakout sessions on SQL Server Data Tools (SSDT) and .NET 4.0 Data Access. Enjoy!

Throwing Errors in SQL Server 2012

Error handling in T-SQL was very difficult to implement properly before SQL Server 2005 introduced the TRY/CATCH construct, a feature loosely based on .NET’s try/catch structured exception handling model. The CATCH block gives you a single place to code error handling logic in the event that a problem occurs anywhere inside the TRY block above it. Before TRY/CATCH, it was necessary to always check for error conditions after every operation by testing the built-in system function @@ERROR. Not only did code become cluttered with the many @@ERROR tests, developers (being humans) would too often forget to test @@ERROR in every needed place, causing many unhandled exceptions to go unnoticed.

In SQL Server 2005, TRY/CATCH represented a vast improvement over constantly testing @@ERROR, but RAISERROR has (until SQL Server 2012) remained as the only mechanism for generating your own errors. In SQL Server 2012, the new THROW statement (again, borrowed from throw in the .NET model) is the recommended alternative way to raise exceptions in your T-SQL code (although RAISERROR does retain several capabilities that THROW lacks, as I’ll explain shortly).

Re-Throwing Exceptions

The new THROW statement can be used in two ways. First, and as I just stated, it can serve as an alternative to RAISERROR, allowing your code to generate errors when it detects an unresolvable condition in processing. Used for this purpose, the THROW statement accepts parameters for the error code, description, and state, and works much like RAISERROR. A more specialized use of THROW takes no parameters, and can appear only inside a CATCH block. In this scenario, an unexpected error occurs in the TRY block above, triggering execution of the CATCH block. Inside the CATCH block, you can perform general error handling (for example, logging the error, or rolling back a transaction), and then issue a THROW statement with no parameters. This will re-throw the original error that occurred—with its code, message, severity, and state intact—back up to the client, so the error can be caught and handled at the application level as well. This is an easy and elegant way for you to implement a segmented exception handling strategy between the database and application layers.

In contrast, RAISERROR always raises a new error. Thus, it can only simulate re-throwing the original error by capturing the ERROR_MESSAGE, ERROR_SEVERITY, and ERROR_STATE in the CATCH block and using their values to raise a new error. Using THROW for this purpose is much more simple and direct, as demonstrated with the following code:

CREATE TABLE ErrorLog(ErrAt datetime2, Severity varchar(max), ErrMsg varchar(max))
GO

BEGIN TRY
  DECLARE @Number int = 5 / 0;
END TRY
BEGIN CATCH
  -- Log the error info, then re-throw it
  INSERT INTO ErrorLog VALUES(SYSDATETIME(), ERROR_SEVERITY(), ERROR_MESSAGE());
  THROW;
END CATCH

In this code’s CATCH block, error information is recorded to the ErrorLog table and then the original error (divide by zero) is re-thrown for the client to catch:

Msg 8134, Level 16, State 1, Line 4
Divide by zero error encountered.

To confirm that the error was logged by the CATCH block as expected before being re-thrown, query the ErrorLog table:

SELECT * FROM ErrorLog

ErrAt                     Severity ErrMsg
------------------------- -------- ------------------------------------------
2011-10-30 14:14:35.336   16       Divide by zero error encountered.

Comparing THROW and RAISERROR

The following table summarizes the notable differences between THROW and RAISERROR:

THROW RAISERROR
Can only generate user exceptions (unless re-throwing in CATCH block) Can generate user (>= 50000) and system (< 50000) exceptions
Supplies ad-hoc text; doesn’t utilize sys.messages Requires user messages defined in sys.messages (except for code 50000)
Doesn’t support token substitutions Supports token substitutions
Always uses severity level 16 (unless re-throwing in a CATCH block) Can set any severity level
Can re-throw original exception caught in the TRY block Always generates a new exception; the original exception is lost to the client
Error messages are buffered, and don’t appear in real-time Supports WITH NOWAIT to immediate flush buffered output on error

A user exception is an error with a code of 50000 or higher that you define for your application’s use. System exceptions are defined by SQL Server and have error codes lower than 50000. You can use the new THROW statement to generate and raise user exceptions, but not system exceptions. Only RAISERROR can be used to throw system exceptions. Note, however, that when THROW is used in a CATCH block to re-throw the exception from a TRY block, the actual original exception—even if it’s a system exception—will get thrown (as demonstrated in the previous “divide by zero” example).

When RAISERROR is used without an error code, SQL Server assigns an error code of 50000 and expects you to supply an ad-hoc message to associate with the error. The THROW statement always expects you to supply an ad-hoc message for the error, as well as a user error code of 50000 or higher. Thus, the following two statements are equivalent:

THROW 50000, 'An error occurred querying the table.', 1;
RAISERROR ('An error occurred querying the table.', 16, 1);

Both these statements raise an error with code 50000, severity 16, state 1, and the same message text. Compatibility between the two keywords ends there, however, as varying usages impose different rules (as summarized in Table 2-4). For example, only RAISERROR supports token substitution:

RAISERROR ('An error occurred querying the %s table.', 16, 1, 'Customer');

Msg 50000, Level 16, State 1, Line 22
An error occurred querying the Customer table.

THROW has no similar capability. Also, while RAISERROR lets you specify any severity level, THROW will always generate an error with a severity level of 16. This is significant, as level 11 and higher indicates more serious errors than level 10 and lower. For example:

RAISERROR ('An error occurred querying the table.', 10, 1);

An error occurred querying the table.

Because the severity is 10, this error does not echo the error code, level, state, and line number, and is displayed in black rather than the usual red that is used for severity levels higher than 10. In contrast, THROW cannot be used to signal a non-severe error.

The last important difference between the two keywords is the RAISERROR association with sys.messages. In particular, RAISERROR requires that you call sys.sp_addmessage to define error messages associated with user error codes higher than 50000. As explained, the RAISERROR syntax in our earlier examples uses an error code of 50000, and is the only supported syntax that lets you supply an ad-hoc message instead of utilizing sys.messages.

The following code demonstrates how to define customer user error messages for RAISERROR. First (and only once), a tokenized message for user error code 66666 is added to sys.messages. Thereafter, RAISERROR references the error by its code, and also supplies values for token replacements that are applied to the message’s text in sys.messages.

EXEC sys.sp_addmessage 66666, 16, 'There is already a %s named %s.';
RAISERROR(66666, 16, 1, 'cat', 'morris');

Msg 66666, Level 16, State 1, Line 34
There is already a cat named morris.

The THROW statement has no such requirement. You supply any ad-hoc message text with THROW. You don’t need to separately manage sys.messages, but this also means that THROW can’t (directly) leverage centrally managed error messages in sys.messages like RAISERROR does. Fortunately, the FORMATMESSAGE function provides a workaround, if you want to take advantage of the same capability with THROW. You just need to take care and make sure that the same error code is specified in the two places that you need to reference it (once for FORMATMESSAGE and once for THROW), as shown here:

DECLARE @Message varchar(max) = FORMATMESSAGE(66666, 'dog', 'snoopy');
THROW 66666, @Message, 1;

Msg 66666, Level 16, State 1, Line 40
There is already a dog named snoopy.

Summary

Starting in SQL Server 2012, the THROW keyword should be used instead of RAISERROR to raise your own errors. THROW can also be used inside CATCH blocks to raise the original error that occurred within the TRY block. However, RAISERROR is still supported, and can be used to raise system errors or errors with any lesser severity, when necessary. You can start working with THROW by downloading SQL Server 2012 “Denali” CTP3 from http://bit.ly/DenaliCTP3. Try not too have too much fun throwing errors! ;)

Book Announcement! Introducing “Programming Microsoft SQL Server 2012″

I’m very happy to announce that I’ll be authoring a new book on SQL Server 2012, code-named “Denali.” The full title, “Programming Microsoft SQL Server 2012” will technically be published by O’Reilly, yet it will be branded as a Microsoft Press book. O’Reilly recently acquired MS Press, but retained the Microsoft logo and black/red theme, which I’m really glad about. Although O’Reilly is a big name, the MS Press branding still carries more clout. And to be honest, I greatly prefer having a cool tool to a weird animal on the cover.

This book is an update to the SQL Server 2008 edition I wrote three years ago, with many similarities but also some very notable differences. I’m extremely pleased to once again team together with Andrew Brust, who will be writing the chapters on SQL CLR, column stores, BI, and SQL Azure. Andrew was my co-author for the 2008 edition and lead author of the original 2005 edition, so this is actually his third time around on this book. Both previous editions literally burst at the seams in their zeal to cover the entire SQL Server stack. This time around, in consideration of the many readily available BI-focused books, the editors decided to distill the BI footprint of the new Denali book (which accounted for roughly a whole third of the previous 1,000+ page 2008 edition) down to a single overview-style chapter. This simultaneously brings the page count down a little, while actually opening more space for expanded coverage of the relational database engine. As a result, this new book is more narrowly (and uniquely) focused on programming SQL Server for transactional line-of-business applications built with .NET and Visual Studio.

So busy days lie ahead. I’ve actually spent nearly a year already gearing up on Denali, since attending the Denali SDR in Redmond last October. Here are just some of the new topics planned for the 2012 edition:

  • SQL Server Data Tools
  • SQL Azure
  • Column store indexes
  • Sequences
  • Windowing function (OVER BY) improvements
  • Server-side paging
  • FileTable (FILESTREAM + hierarchyid = logical file system)
  • Metadata discovery
  • Spatial enhancements (curves, FULLGLOBE)
  • Contained Databases
  • Self-Service Reporting (Crescent)
  • Entity Framework & LINQ
  • WCF Data Services & WCF RIA Services

This list is by no means exhaustive, and may still change, but it does give a good idea of what to expect. Also, this is in addition to updated coverage from the previous edition that includes broader treatment of topics such as table-valued parameters, SQL Server Audit, and more.

With luck, the book should be out by Q2 2012. I’m looking forward to the work in store, and hope to produce the best piece of work I can. Along the way, I’ll be blogging more previews of what’s to come. So stay tuned, and thanks for reading.

Download VSLive Redmond SQL Server Workshop Materials

Visual Studio Live just wrapped up in Redmond, and it was a very successful conference. As promised, I’ve posted the slides and code for my SQL Server workshop for you to download. You can grab the stuff here:

http://bit.ly/VSL2011SQLRedmond

Thanks to the great crowd we had on Friday. You guys were a lot of fun, and now I’m looking forward to doing it again in Orlando in December!

Received Microsoft 2011 MVP Award for SQL Server!

I’m extremely happy to announce that I’ve just earned Microsoft’s 2011 MVP (Most Valuable Professional) Award for SQL Server! My MVP profile is online at https://mvp.support.microsoft.com/profile/Lobel.

Microsoft recognizes independent software professionals who actively share their expertise and knowledge throughout the technical community with the MVP award. There are just over 4,000 MVPs worldwide, of which there are only 279 MVPs for SQL Server (as of October 2011). So it’s a great thrill and honor for me to join this very select group of dedicated individuals.

There’s nothing I enjoy more than helping other developers learn and grow, and that’s exactly what the MVP award is all about. So thanks to all of you out there for your readership, and I look forward to many more years of speaking and writing on SQL Server, .NET, and all things Microsoft!

Creating a Helpful Extension Method for XAML Visibility

Used recklessly, extension methods are potentially evil. But judiciously applied, they can be extremely helpful. In this short post (well, short for me), I’ll show you how to create an extension method for the bool class that will simplify your .NET code in XAML-based apps (either WPF or Silverlight). In particular, this extension method addresses the fact that the Visibility property of UI controls in XAML is not a Boolean (true/false) value (as has traditionally been the case for both Windows Forms and ASP.NET). Instead, it’s one of three possible enumerated constant values: Hidden, Collapsed, and Visible.

Visible obviously means “visible,” while both Hidden and Collapsed mean “invisible.” The difference between Hidden and Collapsed is merely whether or not the invisible control occupies blank space on the screen that it would occupy if it were visible. Collapsed consumes no space, while Hidden does.

It’s nice to have the three options, but in most cases you’ll find that you just need the two options Visible and Collapsed. If you’re setting visibility to XAML controls using these two enums, you’ve probably noticed that it’s just not as clean as it is with Windows Forms or ASP.NET controls. You typically already have a Boolean value—either as a simple variable, or an expression—that determines whether the control should be visible or not. You could use that value to enable or disable the control, as the IsEnabled property is also a Boolean, but you can’t assign it to the Visibility property because that property expects one of the three Visibility enumerations—not a Boolean. That’s frustrating, because from a UX (user experience) perspective, where these concepts are used to convey “availability” to the user, hiding/showing controls versus enabling/disabling them is a subtlety of UI design. So developers should be able to think (and code) freely in terms of Booleans for both IsEnabled and Visibility.

Consider the C# code to manipulate the Visibility and IsEnabled properties of a button control:

OpenCustomerButton.IsEnabled = false;
OpenCustomerButton.IsEnabled = true;
OpenCustomerButton.Visibility = Visibility.Collapsed;
OpenCustomerButton.Visibility = Visibility.Visible;

bool isAvailable = true;
OpenCustomerButton.IsEnabled = isAvailable;
OpenCustomerButton.Visibility = (isAvailable ? Visibility.Collapsed : Visibility.Visible);

// Set the button's visibility according another button's enabled/disabled state
OpenCustomerButton.Visibility = (ManageCustomersButton.IsEnabled ? Visibility.Visible : Visibility.Collapsed);

You can see that because Visibility takes the enum rather than a bool, it’s harder to work with than IsEnabled, and just isn’t as neat. But we can do something about this. How? By writing a simple (one-line!) extension method that adds a ToVisibility method to the bool class:

public static class BoolExtensions
{
  public static Visibility ToVisibility(this bool isVisible)
  {
    return (isVisible ? Visibility.Visible : Visibility.Collapsed);
  }
}

Now the same UI code is much easier to read and write:

OpenCustomerButton.IsEnabled = false;
OpenCustomerButton.IsEnabled = true;
OpenCustomerButton.Visibility = false.ToVisibility();
OpenCustomerButton.Visibility = true.ToVisibility();

bool isAvailable = true;
OpenCustomerButton.IsEnabled = isAvailable;
OpenCustomerButton.Visibility = isAvailable.ToVisibility();

// Set the button's visibility according to another button's enabled/disabled state
OpenCustomerButton.Visibility = ManageCustomersButton.IsEnabled.ToVisibility();

Like? Me too. Neatness counts!

It would be even better if we could instead extend every UIElement with a Boolean property called IsVisible, and embed the conversion logic bi-directionally in the property’s getter and setter. But, unfortunately, you cannot create extension properties in .NET, only extension methods. So extending bool with a ToVisibility method is the next best thing.

One more tip: Put the BoolExtensions class inside a common code namespace that all your XAML classes import, so ToVisibility is always available without needing to add an extra using statement.

Remember, abusing extension methods quickly leads to messy, buggy code. Instead, find appropriate uses for extension methods that yield cleaner, tighter code. That is all. :)

Community Service Week

Next week, I’ll be presenting on SQL Server Developer Tools (SSDT) at two SQL Server User Groups. Will I see you there?

If you haven’t already heard, SSDT is the newest SQL Server development environment in Visual Studio 2010, released as part of SQL Server 2012 (codename “Denali”). You can download the current CTP3, but Juneau is not included with the Denali CTP3 bits. Instead, it’s a separate download via the Web Platform Installer at: http://www.microsoft.com/web/gallery/install.aspx?appid=JUNEAU10.

You can get a primer on SSDT by reading my earlier introductory blog post. Hopefully that inspires you to come meet me in NJ or NYC next week to get an up-close look at SSDT with lots of demos. Session abstract below. Hope to see you!

Introducing SQL Server Developer Tools (codename “Juneau”)

With the upcoming release of SQL Server 2012 (codename “Denali”), SQL Server Developer Tools (SSDT, codename “Juneau”) will serve as your primary development environment for building applications with SQL Server. While SQL Server Management Studio (SSMS) continues to serve as the primary tool for database administrators, SSDT represents a brand new experience. SSDT plugs in to Visual Studio and provides a new SQL Server node in Server Explorer for connected development of on-premises or cloud databases, and also provides a new database project type for offline development.

With SSDT, developers can finally enjoy building applications without constantly switching between Visual Studio and SSMS. In this session, Lenni will demonstrate how SSDT can be used to develop for (and deploy to) on-premise and SQL Azure databases. In addition to replicating most of the functionality found in SSMS, you will learn how to use features such as code navigation, IntelliSense, and refactoring with your database model-indispensable tools previously available only for traditional application development in Visual Studio. We’ll also cover the new declarative model that allows you to design databases offline and under source control right from within your solution in Visual Studio. Don’t miss out on this demo-centric information-packed session on the next generation of database development tools for SQL Server!

Follow

Get every new post delivered to your Inbox.