Geospatial Support for Circular Data in SQL Server 2012

SQL Server 2012 adds many significant improvements to the spatial support that was first introduced in SQL Server 2008. In this blog post, I’ll explore one of the more notable enhancements: support for curves and arcs (circular data). SQL Server 2008 only supported straight lines, or polygons composed of straight lines. The three new shapes in SQL Server 2012 are circular strings, compound curves, and curve polygons. All three are supported in Well-Known Text (WKT), Well-Known Binary (WKB), and Geography Markup Language (GML) by both the geometry (planar, or “flat-earth,” model) and geography (ellipsoidal, or geodetic, model) data types, and all of the existing methods work on the new shapes.

Circular Strings

A circular string defines a basic curved line, similar to how a line string defines a straight line. It takes a minimum of three coordinates to define a circular string; the first and third coordinates define the end points of the line, and the second coordinate (the “anchor” point, which lies somewhere between the end points) determines the arc of the line. Here is the shape represented by CIRCULARSTRING(0 1, .25 0, 0 -1):

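To render this arc yourself, here is a quick sketch of my own; buffering the line slightly makes it easier to see in the spatial viewer:

-- Render the example arc (buffered slightly for visibility)
SELECT geometry::Parse('CIRCULARSTRING(0 1, .25 0, 0 -1)').STBuffer(.05)
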
The following code produces four circular strings. All of them have the same start and end points, but different anchor points. The lines are buffered slightly to make them easier to see in the spatial viewer.

-- Create a "straight" circular line
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 0, 8 -8)').STBuffer(.1)
UNION ALL  -- Curve it
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 4, 8 -8)').STBuffer(.1)
UNION ALL  -- Curve it some more
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 6, 8 -8)').STBuffer(.1)
UNION ALL  -- Curve it in the other direction
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 -6, 8 -8)').STBuffer(.1)

The spatial viewer in SQL Server Management Studio shows the generated shapes:

The first shape is a “straight circular” line, because the anchor point is positioned directly between the start and end points. The next two shapes use the same end points, with the anchor pulled out to the right (4), one a bit farther than the other (6). The last shape also uses the same end points, but specifies an anchor point that curves the line to the left rather than the right (-6).

You can extend circular strings with as many curve segments as you want. Do this by defining another two coordinates for each additional segment. The last point of the previous curve serves as the first end point of the next curve segment, so the two additional coordinates respectively specify the next segment’s anchor and second end point. Thus, valid circular strings will always have an odd number of points. You can extend a circular string indefinitely to form curves and arcs of any kind.
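For example, here is a minimal sketch of a two-segment circular string (the coordinates are illustrative); the first arc ends at (4 0), which then serves as the start point of the second arc:

-- Two curve segments defined by five points
SELECT geometry::Parse('CIRCULARSTRING(0 0, 2 2, 4 0, 6 -2, 8 0)').STBuffer(.1)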

It’s easy to form a perfect circle by connecting two semi-circle segments. For example, the following illustration shows the circle produced by CIRCULARSTRING(0 4, 4 0, 8 4, 4 8, 0 4).
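If you’d like to reproduce the circle in the spatial viewer, a query like this should do it (my own sketch, buffered slightly as in the earlier examples):

-- A perfect circle from two connected semi-circle segments
SELECT geometry::Parse('CIRCULARSTRING(0 4, 4 0, 8 4, 4 8, 0 4)').STBuffer(.1)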

This particular example connects the end of the second segment to the beginning of the first segment to form a closed shape. Note that this is certainly not required of circular strings (or line strings), and that closing the shape by connecting the last segment to the first still does not result in a polygon, which is a two-dimensional shape that has area. Despite being closed, this circle is still considered a one-dimensional shape with no area. As you’ll soon see, the curve polygon can be used to convert closed line shapes into true polygons.
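One quick way to confirm this (my own check) is to inspect the closed circular string with the existing STDimension and STArea methods, which report a one-dimensional shape with zero area:

-- The closed circular string is still a curve, not a polygon
SELECT geometry::Parse('CIRCULARSTRING(0 4, 4 0, 8 4, 4 8, 0 4)').STDimension()  -- 1
SELECT geometry::Parse('CIRCULARSTRING(0 4, 4 0, 8 4, 4 8, 0 4)').STArea()       -- 0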

Compound Curves

A compound curve is a set of circular strings, or circular strings combined with line strings, that form a desired curved shape. The end point of each element in the collection must match the starting point of the following element, so that compound curves are defined in a “connect-the-dots” fashion. The following code produces a compound curve and compares it with the equivalent geometry collection shape.

-- Compound curve
 DECLARE @CC geometry = '
  COMPOUNDCURVE(
   (4 4, 4 8),
   CIRCULARSTRING(4 8, 6 10, 8 8),
   (8 8, 8 4),
   CIRCULARSTRING(8 4, 2 3, 4 4)
  )'

-- Equivalent geometry collection
 DECLARE @GC geometry = '
  GEOMETRYCOLLECTION(
   LINESTRING(4 4, 4 8),
   CIRCULARSTRING(4 8, 6 10, 8 8),
   LINESTRING(8 8, 8 4),
   CIRCULARSTRING(8 4, 2 3, 4 4)
  )'

-- They both render the same shape in the spatial viewer
 SELECT @CC.STBuffer(.5)
 UNION ALL
 SELECT @GC.STBuffer(1.5)

This code creates a keyhole shape using a compound curve, and also creates an identical shape as a geometry collection (though notice that the LINESTRING keyword is not, and cannot be, specified when defining a compound curve). It then buffers both of them with different padding, so that the spatial viewer clearly shows the two identical shapes on top of one another, as shown:

Both the compound curve and the geometry collection yield identical shapes. In fact, the expression @CC.STEquals(@GC), which compares the two instances for equality, returns 1 (for true). The STEquals method tests for “spatial equality,” meaning it returns true if two instances produce the same shape even if they are defined using different spatial data classes. Furthermore, recall that segments of a circular string can be made perfectly straight by positioning the anchor directly between the end points, so a multi-segment circular string offers yet a third way to produce the very same shape. So which one should you use? Comparing these spatial data classes will help you decide which is best in different scenarios.
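You can verify the spatial equality yourself by running the following in the same batch as the declarations above (a quick check of my own):

-- Returns 1, confirming the two instances describe the same shape
SELECT EqualShapes = @CC.STEquals(@GC)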

A geometry collection (which was already supported in SQL Server 2008) is the most accommodating, but carries the most storage overhead. Geometry collections can hold instances of any spatial data class, and the instances don’t need to be connected to (or intersected with) each other in any way. The collection simply holds a bunch of different shapes as a set, which in this example just happens to be several line strings and circular strings connected at their start and end points.

In contrast, the new compound curve class in SQL Server 2012 has the most constraints but is the most lightweight in terms of storage. It can only contain line strings or circular strings, and each segment’s start point must be connected to the previous segment’s end point (although it is certainly not necessary to connect the first and last segments to form a closed shape, as in this example). The DATALENGTH function shows the difference in storage requirements: in this example, DATALENGTH(@CC) returns 152 and DATALENGTH(@GC) returns 243, meaning the same shape requires roughly 37% less storage space when stored as a compound curve instead of a geometry collection. A compound curve is also more storage-efficient than a multi-segment circular line string when straight lines are involved. A circular string pays overhead for the mere potential of a curve, since the anchor point requires storage even when its position produces a straight segment, whereas compound curves are optimized specifically to connect circular strings and (always straight) line strings.
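To see the storage difference for yourself, run the following in the same batch as the declarations above (my own check; the byte counts are the ones quoted in this post):

-- Compare storage requirements of the two instances
SELECT CompoundCurveBytes      = DATALENGTH(@CC),  -- 152
       GeometryCollectionBytes = DATALENGTH(@GC)   -- 243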

Curve Polygons

A curve polygon is very similar to an ordinary polygon: like an ordinary polygon, it specifies a “ring” that defines a closed shape, and it can also specify additional inner rings to define “holes” inside the shape. The only fundamental difference is that the rings of a curve polygon can include circular shapes, whereas an ordinary polygon is composed exclusively of straight lines. Specifically, each ring in a curve polygon can consist of any combination of line strings, circular strings, and compound curves that collectively define the closed shape. For example, the following code produces a curve polygon with the same keyhole outline that I just demonstrated for the compound curve.

-- Curve polygon
 SELECT geometry::Parse('
  CURVEPOLYGON(
   COMPOUNDCURVE(
    (4 4, 4 8),
    CIRCULARSTRING(4 8, 6 10, 8 8),
    (8 8, 8 4),
    CIRCULARSTRING(8 4, 2 3, 4 4)
   )
  )')

This code has simply specified the same compound curve as the closed shape of a curve polygon. Although the shape is the same, the curve polygon is a two-dimensional object, whereas the compound curve version of the same shape is a one-dimensional object. This can be seen visually by the spatial viewer results, which shades the interior of the curve polygon as shown here:
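To confirm the difference in dimensionality, here is a quick sketch of my own that declares the curve polygon as a variable and inspects it with the existing STDimension and STArea methods:

-- Curve polygon: a two-dimensional shape with area
DECLARE @CP geometry = '
 CURVEPOLYGON(
  COMPOUNDCURVE(
   (4 4, 4 8),
   CIRCULARSTRING(4 8, 6 10, 8 8),
   (8 8, 8 4),
   CIRCULARSTRING(8 4, 2 3, 4 4)
  )
 )'
SELECT Dimensions = @CP.STDimension(),  -- 2 (the compound curve version returns 1)
       Area       = @CP.STArea()        -- greater than zero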

Conclusion

Circular data support is an important new capability added to the spatial support in SQL Server 2012. In this blog post, I demonstrated the three new spatial data classes for curves and arcs: circular strings, compound curves, and curve polygons. Stay tuned for my next post, where I’ll cover more powerful and fun spatial enhancements in SQL Server 2012!

SQL Server 2012 Windowing Functions Part 2 of 2: New Analytic Functions

This is the second half of my two-part article on windowing functions in SQL Server 2012. In Part 1, I explained the new running and sliding aggregation capabilities added to the OVER clause in SQL Server 2012. In this post, I’ll explain the new T-SQL analytic windowing functions. All of these functions operate using the windowing principles I explained in Part 1.

Eight New Analytic Functions

There are eight new analytic functions that have been added to T-SQL. All of them operate over an ordered window defined with an ORDER BY clause, which can optionally be partitioned with a PARTITION BY clause and, for some of the functions, framed with a ROWS or RANGE clause. The new functions are:

 • FIRST_VALUE
 • LAST_VALUE
 • LAG
 • LEAD
 • PERCENT_RANK
 • CUME_DIST
 • PERCENTILE_CONT
 • PERCENTILE_DISC

In the following code listing, the FIRST_VALUE, LAST_VALUE, LAG, and LEAD functions are used to analyze a set of orders at the product level.

DECLARE @Orders AS table(OrderDate date, ProductID int, Quantity int)
INSERT INTO @Orders VALUES
 ('2011-03-18', 142, 74),
 ('2011-04-11', 123, 95),
 ('2011-04-12', 101, 38),
 ('2011-05-21', 130, 12),
 ('2011-05-30', 101, 28),
 ('2011-07-25', 123, 57),
 ('2011-07-28', 101, 12)

SELECT
  OrderDate,
  ProductID,
  Quantity,
  WorstOn = FIRST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity),
  BestOn = LAST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
  PrevOn = LAG(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate),
  NextOn = LEAD(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate)
 FROM @Orders
 ORDER BY OrderDate

OrderDate   ProductID  Quantity  WorstOn     BestOn      PrevOn      NextOn
----------  ---------  --------  ----------  ----------  ----------  ----------
2011-03-18  142        74        2011-03-18  2011-03-18  NULL        NULL
2011-04-11  123        95        2011-07-25  2011-04-11  NULL        2011-07-25
2011-04-12  101        38        2011-07-28  2011-04-12  NULL        2011-05-30
2011-05-21  130        12        2011-05-21  2011-05-21  NULL        NULL
2011-05-30  101        28        2011-07-28  2011-04-12  2011-04-12  2011-07-28
2011-07-25  123        57        2011-07-25  2011-04-11  2011-04-11  NULL
2011-07-28  101        12        2011-07-28  2011-04-12  2011-05-30  NULL

In this query, four analytic functions specify an OVER clause that partitions the result set by ProductID. The product partitions defined for FIRST_VALUE and LAST_VALUE are sorted by Quantity, while the product partitions for LAG and LEAD are sorted by OrderDate. The full result set is sorted by OrderDate, so you need to visualize the sorted partition for each of the four functions to understand the output—the result set sequence is not the same as the row sequence used in the windowing functions.
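For example, to see one product’s partition the way FIRST_VALUE and LAST_VALUE see it, you can run a query like this in the same batch as the @Orders declaration (my own illustrative sketch):

-- Product 101's partition, sorted by Quantity
SELECT OrderDate, Quantity
 FROM @Orders
 WHERE ProductID = 101
 ORDER BY Quantity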

FIRST_VALUE and LAST_VALUE

The WorstOn and BestOn columns use FIRST_VALUE and LAST_VALUE respectively to return the “worst” and “best” dates for the product in each partition. Performance is measured by quantity, so sorting each product’s partition by quantity will position the worst order at the first row in the partition and the best order at the last row in the partition. FIRST_VALUE and LAST_VALUE can return the value of any column (OrderDate, in this case), not just the column being sorted on. For LAST_VALUE, it is also necessary to explicitly define a window over the entire partition with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Otherwise, as explained in my coverage of OVER clause enhancements in Part 1, the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which frames (constrains) the window, and does not consider the remaining rows in the partition that are needed to obtain the highest quantity for LAST_VALUE.
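To see the effect of the default frame, you can drop the explicit ROWS clause and compare (a sketch of my own; run it in the same batch as the @Orders declaration). Without the full frame, LAST_VALUE sees no rows beyond the current one, so it simply echoes each row’s own OrderDate:

-- LAST_VALUE with the default frame stops at the current row
SELECT
  OrderDate, ProductID, Quantity,
  BestOn_Default = LAST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity)
 FROM @Orders
 ORDER BY OrderDate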

In the output, notice that OrderDate, WorstOn, and BestOn for the first order (product 142) are all the same value (3/18). This is because product 142 was only ordered once, so FIRST_VALUE and LAST_VALUE operate over a partition that has only this one row in it, with an OrderDate value of 3/18. The second row is for product 123, quantity 95, ordered on 4/11. Four rows later in the result set (not the partition) there is another order for product 123, quantity 57, placed on 7/25. This means that, for this product, FIRST_VALUE and LAST_VALUE operate over a partition that has these two rows in it, sorted by quantity. This positions the 7/25 order (quantity 57) first and the 4/11 order (quantity 95) last within the partition. As a result, rows for product 123 report 7/25 for WorstOn and 4/11 for BestOn. The next order is for product 101, which appears two more times in the result set, creating a partition of three rows. Again, based on the Quantity sort of the partition, each row in the partition reports the product’s worst and best dates (which are 7/28 and 4/12, respectively).

LAG and LEAD

The PrevOn and NextOn columns use LAG and LEAD to return the previous and next date that each product was ordered. They specify an OVER clause that partitions by ProductID as before, but the rows in these partitions are sorted by OrderDate. Thus, the LAG and LEAD functions examine each product’s orders in chronological sequence, regardless of quantity. For each row in each partition, LAG is able to access previous (lagging) rows within the same partition. Similarly, LEAD can access subsequent (leading) rows within the same partition. The first parameter to LAG and LEAD specifies the column value to be returned from a lagging or leading row, respectively. The second parameter specifies the number of rows back or forward LAG and LEAD should seek within each partition, relative to the current row. The query passes OrderDate and 1 as parameters to LAG and LEAD, using product partitions that are ordered by date. Thus, the query returns the most recent past date, and nearest future date, that each product was ordered.
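LAG and LEAD also accept an optional third argument that substitutes a default value for the NULLs returned when no lagging or leading row exists. The original query doesn’t use it, but a sketch would look like this (run it in the same batch as the @Orders declaration):

-- Replace NULL with a sentinel date when there is no previous order
SELECT
  OrderDate, ProductID,
  PrevOn = LAG(OrderDate, 1, '1900-01-01') OVER(PARTITION BY ProductID ORDER BY OrderDate)
 FROM @Orders
 ORDER BY OrderDate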

Because the first order’s product (142) was only ordered once, its single-row partition has no lagging or leading rows, and so LAG and LEAD both return NULL for PrevOn and NextOn. The second order (on 4/11) is for product 123, which was ordered again on 7/25, creating a partition with two rows sorted by OrderDate, with the 4/11 order positioned first and the 7/25 order positioned last within the partition. The first row in a multi-row window has no lagging rows, but at least one leading row. Similarly, the last order in a multi-row window has at least one lagging row, but no leading rows. As a result, the first order (4/11) reports NULL and 7/25 for PrevOn and NextOn (respectively), and the second order (7/25) reports 4/11 and NULL for PrevOn and NextOn (respectively). Product 101 was ordered three times, which creates a partition of three rows. In this partition, the second row has both a lagging row and a leading row. Thus, the three orders report PrevOn and NextOn values for product 101, respectively indicating NULL-5/30 for the first (4/12) order, 4/12-7/28 for the second (5/30) order, and 5/30-NULL for the third and last order.

The last functions to examine are PERCENT_RANK (rank distribution), CUME_DIST (cumulative distribution, or percentile), PERCENTILE_CONT (continuous percentile lookup), and PERCENTILE_DISC (discrete percentile lookup). The following queries demonstrate these functions, which are all closely related, by querying sales figures across each quarter of two years.

DECLARE @Sales table(Yr int, Qtr int, Amount money)
INSERT INTO @Sales VALUES
  (2010, 1, 5000), (2010, 2, 6000), (2010, 3, 7000), (2010, 4, 2000),
  (2011, 1, 1000), (2011, 2, 2000), (2011, 3, 3000), (2011, 4, 4000)

-- Distributed across all 8 quarters
SELECT
  Yr, Qtr, Amount,
  R = RANK() OVER(ORDER BY Amount),
  PR = PERCENT_RANK() OVER(ORDER BY Amount),
  CD = CUME_DIST() OVER(ORDER BY Amount)
 FROM @Sales
 ORDER BY Amount

-- Distributed (partitioned) by year with percentile lookups
SELECT
  Yr, Qtr, Amount,
  R = RANK() OVER(PARTITION BY Yr ORDER BY Amount),
  PR = PERCENT_RANK() OVER(PARTITION BY Yr ORDER BY Amount),
  CD = CUME_DIST() OVER(PARTITION BY Yr ORDER BY Amount),
  PD5 = PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr),
  PD6 = PERCENTILE_DISC(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr),
  PC5 = PERCENTILE_CONT(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr),
  PC6 = PERCENTILE_CONT(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr)
 FROM @Sales
 ORDER BY Yr, Amount

Yr    Qtr  Amount   R  PR                 CD
----  ---  -------  -  -----------------  -----
2011  1    1000.00  1  0                  0.125
2011  2    2000.00  2  0.142857142857143  0.375
2010  4    2000.00  2  0.142857142857143  0.375
2011  3    3000.00  4  0.428571428571429  0.5
2011  4    4000.00  5  0.571428571428571  0.625
2010  1    5000.00  6  0.714285714285714  0.75
2010  2    6000.00  7  0.857142857142857  0.875
2010  3    7000.00  8  1                  1

Yr    Qtr  Amount   R  PR                 CD    PD5      PD6      PC5   PC6
----  ---  -------  -  -----------------  ----  -------  -------  ----  ----
2010  4    2000.00  1  0                  0.25  5000.00  6000.00  5500  5800
2010  1    5000.00  2  0.333333333333333  0.5   5000.00  6000.00  5500  5800
2010  2    6000.00  3  0.666666666666667  0.75  5000.00  6000.00  5500  5800
2010  3    7000.00  4  1                  1     5000.00  6000.00  5500  5800
2011  1    1000.00  1  0                  0.25  2000.00  3000.00  2500  2800
2011  2    2000.00  2  0.333333333333333  0.5   2000.00  3000.00  2500  2800
2011  3    3000.00  3  0.666666666666667  0.75  2000.00  3000.00  2500  2800
2011  4    4000.00  4  1                  1     2000.00  3000.00  2500  2800

These four functions all build on the RANK function introduced in SQL Server 2005, so both of these queries also report on RANK, which will aid in both my explanation and your understanding of each of the new functions.

PERCENT_RANK and CUME_DIST

In the first query, PERCENT_RANK and CUME_DIST (aliased as PR and CD respectively) rank quarterly sales across the entire two-year period. Look first at the value returned by RANK (aliased as R). It ranks each row in the unpartitioned window (all eight quarters) by dollar amount. Both 2011Q2 and 2010Q4 are tied for $2,000 in sales, so RANK assigns them the same value (2). The next row is not part of the tie, so RANK continues with 4, skipping 3 to account for the “empty slot” created by the two previous rows that were tied.

Now examine the values returned by PERCENT_RANK and CUME_DIST. Notice how they reflect the same information as RANK, but with decimal values ranging from 0 to 1. The difference between the two lies in their formulas. PERCENT_RANK returns (RANK - 1) / (N - 1) for each row, where N is the total number of rows in the window; this always returns 0 for the first (or only) row in the window. CUME_DIST returns the number of rows with a value less than or equal to the current row’s value, divided by N, which reduces to RANK / N when there are no ties; that is why the two tied $2,000 rows report 0.375 (3 / 8) rather than 0.25. CUME_DIST always returns a value greater than 0 for the first row in the window (1, if there’s only one row). For windows with two or more rows, both functions return 1 for the last row in the window, with decimal values distributed among all the other rows.
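You can reproduce both calculations with the windowed aggregates covered in Part 1 (my own sketch; run it in the same batch as the @Sales declaration). The RANGE frame includes peer rows, which is what lets the manual CUME_DIST handle the tie correctly:

-- Manual PERCENT_RANK and CUME_DIST for comparison
SELECT
  Yr, Qtr, Amount,
  PR_Manual = (RANK() OVER(ORDER BY Amount) - 1.0) / (COUNT(*) OVER() - 1),
  CD_Manual = COUNT(*) OVER(ORDER BY Amount RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) * 1.0
              / COUNT(*) OVER()
 FROM @Sales
 ORDER BY Amount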

The second query examines the same sales figures, only this time the result set is partitioned by year. There are no ties within each year, so RANK assigns the sequential numbers 1 through 4 to each of the quarters, for 2010 and 2011, by dollar amount. You can see that PERCENT_RANK and CUME_DIST perform the same RANK calculations as explained for the first query (only, again, partitioned by year this time).

PERCENTILE_DISC and PERCENTILE_CONT

This query also demonstrates PERCENTILE_DISC and PERCENTILE_CONT. These very similar functions each accept a percentile parameter (the desired CUME_DIST value) and “reach in” to the window for the row at or near that percentile. The code demonstrates by calling both functions twice, once with a percentile parameter value of .5 and once with .6, returning columns aliased as PD5, PD6, PC5, and PC6. Both functions examine the CUME_DIST value for each row in the window to find the one nearest to .5 and .6. The subtle difference between them is that PERCENTILE_DISC returns a precise (discrete) value taken from an actual row, namely the first row whose cumulative distribution is at or above the requested percentile, while PERCENTILE_CONT treats the ordered values as a continuous range and interpolates, so its result is generally a computed value that falls between two of the actual values in the window (I explain the exact calculation below). This explains the values they return in this query.

Notice that these functions define their window ordering using ORDER BY in a WITHIN GROUP clause rather than in the OVER clause. Thus, you do not (and cannot) specify ORDER BY in the OVER clause. The OVER clause is still required, however, so OVER (with empty parentheses) must be specified even if you don’t want to partition using PARTITION BY.

For the year 2010, the .5 percentile (CUME_DIST value) is located exactly on quarter 1, which had $5,000 in sales. Thus PERCENTILE_DISC(.5) returns 5000. There is no row in the window with a percentile of .6, so PERCENTILE_DISC(.6) matches up against the first row with a percentile greater than or equal to .6, which is the row for quarter 2 with $6,000 in sales, and thus returns 6000. In both cases, PERCENTILE_DISC returns a discrete value from a row in the window at or greater than the specified percentile. The same calculations are performed for 2011, returning 2000 for PERCENTILE_DISC(.5) and 3000 for PERCENTILE_DISC(.6), corresponding to the $2,000 in sales for quarter 2 (percentile .5) and the $3,000 in sales for quarter 3 (percentile .75).

As I stated, PERCENTILE_CONT is very similar; it takes the same percentile parameter. But rather than returning an actual value from a row in the window, it performs linear interpolation across the ordered values. Conceptually, it computes the position P × (N - 1) + 1, where P is the requested percentile and N is the number of rows in the window, and then interpolates between the values of the two rows on either side of that position. For 2010, PERCENTILE_CONT(.5) resolves to position .5 × 3 + 1 = 2.5, exactly halfway between the second value (5000) and the third value (6000), and thus returns 5500. PERCENTILE_CONT(.6) resolves to position .6 × 3 + 1 = 2.8, which is 80% of the way from 5000 to 6000, and thus returns 5800; slightly higher than the 5500 returned for PERCENTILE_CONT(.5).
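In other words, the 2010 figures work out as follows (my own arithmetic check):

-- Reproducing PERCENTILE_CONT for 2010 by hand
SELECT PC5 = 5000 + 0.5 * (6000 - 5000),  -- position .5 * 3 + 1 = 2.5, halfway between 5000 and 6000
       PC6 = 5000 + 0.8 * (6000 - 5000)   -- position .6 * 3 + 1 = 2.8, 80% of the way from 5000 to 6000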

Conclusion

This post explained the eight new analytic functions added to T-SQL in SQL Server 2012. These new functions, plus the running and sliding aggregation capabilities covered in Part 1, greatly expand the windowing capabilities of the OVER clause available since SQL Server 2005.