Downloading SQL Server Express Edition

I’m often asked how to install SQL Server Express Edition, and what the different download options are. So I finally decided to explain it all in this blog post.

There are several SQL Server Express Edition downloads available on the Microsoft site, and they are available in both 32-bit and 64-bit versions. You can choose to install just the SQL Server Express database engine (and nothing else), or you can choose one of two other (larger) downloads: Express With Tools (which includes SQL Server Management Studio [SSMS]) or Express With Advanced Services (which includes SSMS, Full Text Search, and Reporting Services). There are also separate downloads for SSMS and LocalDB, but these do not include the SQL Server Express database engine needed to host local databases.

To install the SQL Server Express Edition database engine, follow these steps:

  1. Open Internet Explorer, and navigate to http://www.microsoft.com/en-us/download/details.aspx?id=29062.
  2. Click the large orange Download button.
  3. Select the appropriate download for your system.
    1. For 64-bit systems, choose ENU\x64\SQLEXPR_x64_ENU.exe.
    2. For 32-bit or 64-bit WoW systems, choose ENU\x86\SQLEXPR32_x86_ENU.exe.
    3. For 32-bit systems, choose ENU\x86\SQLEXPR_x86_ENU.exe.Note If you need to download SQL Server Management Studio (SSMS) as well, choose the Express With Tools file instead, which is the one that includes WT in the filename.
      F00xx01
  4. Click Next.
  5. If you receive a pop-up warning, click Allow Once.
    F00xx02-markedup
  6. When prompted to run or save the file, choose Run. This starts and runs the download.
  7. If the User Account Control dialog appears after the download files are extracted, click Yes.
  8. In the SQL Server Installation Center, click New SQL Server Stand-Alone Installation.
    F00xx03-markedup
  9. In the SQL Server 2012 Setup wizard, do the following:
    1. On the License Terms page, select I Accept The License Terms and click Next.
    2. On the Product Updates page, allow the wizard to scan for updates, and then click Next.
    3. On the Install Setup Files page, wait for the installation to proceed.
    4. On the Feature Selection page, Click Next.
    5. Continue clicking Next through all the remaining pages until the Installation Progress page, and wait for the installation to proceed.
    6. On the Complete page indicating a successful setup, click Close.
      F00xx04

I hope these instructions help you choose the right installation option for SQL Server Express Edition!

 

 

 

 

The Same, Only Different: Comparing Windows Azure SQL Database and On-Premise SQL Server

One of the most attractive aspects of Windows Azure SQL Database is that it shares virtually the same code-base and exposes the very same tabular data stream (TDS) as on-premise SQL Server. Thus, to a great extent, the same tools and applications that work with SQL Server work just the same and just as well with SQL Database. Notice that I said to a great extent, because despite their commonality, there are quite a few SQL Server features that SQL Database does not support. In this blog post, I discuss how and why these two platforms differ from one another, and explain the SQL Database constraints that you need to be aware of if you have previous experience with SQL Server. Where possible, I also suggest workarounds.

SQL Server and SQL Database differ in several ways; most notably, in terms of size limitations, feature support, and T-SQL compatibility. In many cases, these constraints are simply the price you pay for enjoying a hassle-free, self-managing, self-healing, always-available database in the cloud. That is, Microsoft cannot responsibly support features that impair its ability to ensure high-availability, and must be able to quickly replicate and relocate a SQL Database. This is why SQL Database places limits on database size, and doesn’t support certain specialized features such as FILESTREAM, for example.

Another common reason why a particular feature or T-SQL syntax might not be supported in SQL Database is that it’s simply not applicable. With SQL Database, administrative responsibilities are split between Microsoft and yourself. Microsoft handles all of the physical administration (such as disk drives and servers), while you manage only the logical administration (such as database design and security). This is why any and all T-SQL syntax that relates to physical resources (such as pathnames) is not supported in SQL Database. For example, you don’t control the location for primary and log filegroups. This is why you can’t include an ON PRIMARY clause with a CREATE DATABASE statement, and indeed, why SQL Database does not permit a filegroup reference in any T-SQL statement. Plainly stated, everything pertaining to physical resources (that is, infrastructure) is abstracted away from you with SQL Database

Yet still, in some cases, a certain SQL Server feature or behavior may be unsupported merely because Microsoft has just not gotten around to properly testing and porting it to SQL Database. Windows Azure is constantly evolving, so you need to keep watch for updates and announcements. This blog post is a great starting point, but the best way to stay current is by reviewing the Guidelines and Limitations section of the SQL Database documentation on the MSDN web site.

Size Limitations

With the exception of the free, lightweight Express edition of SQL Server, there is no practical upper limit on database size in any edition of SQL Server. A SQL Server database can grow as large as 524,272 terabytes (for SQL Server Express edition, the limit is 10 gigabytes).

In contrast, SQL Database has very particular size limitations. You can set the maximum size by choosing between the Web and Business editions. With a Web edition database, you can set the maximum database size to either 1 or 5 gigabytes. With a Business edition database, the maximum database size can range from 10 to 150 gigabytes. Other than the available choices for maximum database size, there is no difference in functionality between the two editions. The absolute largest supported database size is 150 gigabytes, although horizontal partitioning strategies (called sharding) can be leveraged for scenarios that require databases larger than 150 gigabytes.

Connection Limitations

SQL Database is far less flexible than SQL Server when it comes to establishing and maintaining connections. Keep the following in mind when you connect to SQL Database:

  • SQL Server supports a variety of client protocols, such as TCP/IP, Shared Memory, and Named Pipes. Conversely, SQL Database allows connections only over TCP/IP.
  • SQL Database does not support Windows authentication. Every connection string sent to SQL Database must always include a login username and password.
  • SQL Database often requires that @<server> is appended to the login username in connection strings. SQL Server has no such requirement.
  • SQL Database communicates only through port 1433, and does not support static or dynamic port allocation like SQL Server does.
  • SQL Database does fully support Multiple Active Result Sets (MARS), which allows multiple pending requests on a single connection.
  • Due to the unpredictable nature of the internet, SQL Database connections can drop unexpectedly, and you need to account for this condition in your applications. Fortunately, the latest version of the Entity Framework (EF6) addresses this issue with the new Connection Resiliency feature, which automatically handles the retry logic for dropped connections.

Unsupported Features

Listed below are SQL Server capabilities that are not supported in SQL Database, and I suggest workarounds where possible. Again, because this content is subject to change, I recommend that you check the MSDN web site for the very latest information.

  • Agent Service You cannot use the SQL Server Agent service to schedule and run jobs on SQL Database.
  • Audit The SQL Server auditing feature records server and database events to either the Windows event log or the file system, and is not supported in SQL Database.
  • Backup/Restore Conventional backups with the BACKUP and RESTORE commands are not supported with SQL Database. Although the BACPAC feature can be used to import and export databases (effectively backing them up and restoring them), this approach does not provide transactional consistency for changes made during the export operation. To ensure transactional consistency, you can either set the database as read-only before exporting it to a BACPAC, or you can use the Database Copy feature to create a copy of the database with transactional consistency, and then export that copy to a BACPAC file.
  • Browser Service SQL Database listens only on port 1433. Therefore, the SQL Server Browser Service which listens on various other ports is naturally unsupported.
  • Change Data Capture (CDC) This SQL Server feature monitors changes to a database, and captures all activity to change tables. CDC relies on a SQL Server Agent job to function, and is unsupported in SQL Database.
  • Common Language Runtime (CLR) The SQL Server CLR features (often referred to simply as SQL CLR) allow you to write stored procedures, triggers, functions, and user-defined types in any .NET language (such as C# or VB), as an alternative to using traditional T-SQL. In SQL Database, only T-SQL can be used; SQL CLR is not supported.
  • Compression SQL Database does not support the data compression features found in SQL Server, which allow you to compress tables and indexes.
  • Database object naming convention In SQL Server, multipart names can be used to reference a database object in another schema (with the two-part name syntax schema.object), in another database (with the three-part name syntax database.schema.object), and (if you configure a linked server) on another server (with the four-part name syntax server.database.schema.object). In SQL Database, two-part names can also be used to reference objects in different schemas. However, three-part names are limited to reference only temporary objects in tempdb (that is, where the database name is tempdb and the object name starts with a # symbol); you cannot access other databases on the server. And you cannot reference other servers at all, so four-part names can never be used.
  • Extended Events In SQL Server, you can create extended event sessions help to troubleshooting a variety of problems, such as excessive CPU usage, memory pressure, and deadlocks. This feature is not supported in SQL Database.
  • Extended Stored Procedures You cannot execute your own extended stored procedures (these are typically custom coded procedures written in C or C++) with SQL Database. Only conventional T-SQL stored procedures are supported.
  • File Streaming SQL Server native file streaming features, including FILESTREAM and FileTable, are not supported in SQL Database. You can instead consider using Windows Azure Blob Storage containers for unstructured data files, but it will be your job at the application level to establish and maintain references between SQL Database and the files in blob storage, though note that there will be no transactional integrity between them using this approach.
  • Full-Text Searching (FTS) The FTS service in SQL Server that enables proximity searching and querying of unstructured documents is not supported in SQL Database.
  • Mirroring SQL Database does not support database mirroring, which is generally a non-issue because Microsoft is ensuring data redundancy with SQL Database so you don’t need to worry about disaster recovery. This does also mean that you cannot use SQL Database as a location for mirroring a principal SQL Server database running on-premises. However, if you wish to consider the cloud for this purpose, it is possible to host SQL Server inside a Windows Azure virtual machine (VM) against which you can mirror an on-premise principal database. This solution requires that you also implement a virtual private network (VPN) connection between your local network and the Windows Azure VM, although it will work even without the VPN if you use server certificates.
  • Partitioning SQL Server allows you to partition tables and indexes horizontally (by groups of rows) across multiple filegroups within a database, which greatly improves the performance of very large databases. SQL Database has a maximum database size of 150 GB and gives you no control over filegroups, and thus does not support table and index partitioning.
  • Replication SQL Server offers robust replication features for distributing and synchronizing data, including merge replication, snapshot replication, and transactional replication. None of these features are supported by SQL Database, however SQL Data Sync can be used to effectively implement merge replication between a SQL Database and any number of other SQL Databases on Windows Azure and on-premise SQL Server databases.
  • Resource Governor The Resource Governor feature in SQL Server lets you manage workloads and resources by specifying limits on the amount of CPU and memory that can be used to satisfy client requests. These are hardware concepts that do not apply to SQL Database, and so the Resource Governor is naturally unsupported.
  • Service Broker SQL Server Service Broker provides messaging and queuing features, and is not supported in SQL Database.
  • System stored procedures SQL Database supports only a few of the system stored procedures provided by SQL Server. The unsupported ones are typically related to those SQL Server features and behaviors not supported by SQL Database. At the same time, SQL Database provides a few new system stored procedures not found in SQL Server that are specific to SQL Database (for example, sp_set_firewall_rule).
  • Tables without a clustered index Every table in a SQL Database must define a clustered index. By default, SQL Database will create a clustered index over the table’s primary key column, but it won’t do so if you don’t define a primary key. Interestingly enough, SQL Database will actually let you create a table with no clustered index, but it will not allow any rows to be inserted until and unless a clustered index is defined for the table. This limitation does not exist in SQL Server.
  • Transparent Data Encryption (TDE) You cannot use TDE to encrypt a SQL Database like you can with SQL Server.
  • USE In SQL Database, the USE statement can only refer to the current database; it cannot be used to switch between databases as it can with SQL Server. Each SQL Database connection is tied to a single database, so to change databases, you must connect directly to the database.
  • XSD and XML indexing SQL Database fully supports the xml data type, as well as most of the rich XML support that SQL Server provides, including XML Query (XQuery), XML Path (XPath), and the FOR XML clause. However XML schema definitions (XSD) and XML indexes are not supported in SQL Database.

Calling C++ From SQL CLR C# Code

Ever since Microsoft integrated the .NET Common Language Runtime (CLR) into the relational database engine back in SQL Server 2005, the recommended technique for extending T-SQL with custom code has been to use SQL CLR with a .NET language (such as C# or VB). This is because CLR code is managed code, meaning that at runtime, the .NET framework ensures that ill-behaved code can never crash the process it’s running in. Prior to SQL CLR, the only way to extend T-SQL was with extended stored procedures written in native C++. Because native code is unmanaged, buggy C++ code can all too easily crash the process it’s running in. In the case of an extended stored procedure, this means crashing SQL Server itself, which I think we can all agree is a bad thing. It is for this very reason that extended stored procedures are deprecated in SQL Server and why SQL CLR is the way to go instead.

That said, you may have an existing C++ dynamic link library (DLL file) which exposes some public function that you need to call from your T-SQL code. Sure, the recommended approach to take is to refactor that C++ code in C# and then call it using SQL CLR. But what if that’s not a viable option? Perhaps you don’t have access to the C++ source code, or perhaps you do and it’s prohibitively expensive to port it to C#. Or maybe you just like living on the edge. In any case, you find that you absolutely need to call into C++, but you don’t want to use extended stored procedures since they are deprecated (and relatively difficult to implement). In this scenario, you can create a C# wrapper function that calls into the C++ DLL, and then implement the C# wrapper function as a SQL CLR user-defined function (UDF). This blog post shows you exactly how to do just that.

First though, to be clear, this technique carries the same risk as extended stored procedures – just one C++ memory access violation resulting from a rogue pointer can instantly crash SQL Server. For this reason, you will see that there are a few additional steps required when implementing such a solution. Fundamentally, these additional steps make it clear that you are introducing risk, and that you absolve SQL Server of any blame if your custom code crashes the SQL Server process as a result.

The next thing to be aware of is 32/64-bit compatibility. That is, you cannot load a 32-bit C++ DLL into a 64-bit SQL Server process, and vice-versa (attempting to do so will surely result in SQL Server throwing a BadImageFormatException error). This presents no problem if you have access to the source code, since you can compile it as either a 32-bit or 64-bit DLL to match the version of SQL Server that you’re running. If you don’t have access to the source code, and the DLL that you have does not match your version of SQL Server, then that presents a greater challenge, although there are several advanced “run out-of-process” solutions that are beyond the scope of this blog post.

The following sections provide step-by-step procedures that walk you through the process of calling C++ from C# in a SQL CLR user-defined function.

  1. Create the C++ library.
  2. Test the C++ library (optional).
    1. Call from a C++ native executable harness.
    2. Call from a C# executable harness.
  3. Create the C# SQL CLR UDF.
  4. Deploy to SQL Server.

For simplicity’s sake, the C++ library code that we’ll be calling from SQL Server is a simple math function named AddIntegers. This function accepts two integer parameters and returns their sum. I’ll demonstrate using Visual Studio 2013, although everything works just the same with earlier Visual Studio versions (I can confirm this for certain with VS 2012 and VS 2010, but it most likely works with even older VS versions as well).

Creating the C++ library

The instructions assume that you’re running 64-bit SQL Server, and thus they also explain how to compile the native C++ code as a 64-bit DLL.

To create the C++ library, follow these steps:

  1. Start Visual Studio 2013 (note that these instructions also work with VS 2010 and 2012).
  2. Create the new C++ library project:
    1. From the File menu, choose New Project.
    2. Under Installed Templates, choose Visual C++
    3. Select the Win32 Project template.
    4. Name the project MathLibNative.
    5. Choose any desired location to create the project; for example, C:\Demo\.
    6. Name the solution MathLibFromSQL.
      CreateSolution
    7. Click OK
  3. When the Win32 Application Wizard appears:
    1. Click Next.
    2. For Application Type, choose DLL.
    3. For Additional Options, check Empty Project.
      CreateEmptyCppDllProject
    4. Click Finish.
  4. Create the header file. This will contain the publically visible signature to the AddIntegers function.
    1. Right-click the MathLibNative project in Solution Explorer and choose Add | New Item.
    2. In the Add New Item dialog, choose Header File (.h).
    3. Name the file MathFunctions.h.
    4. Click Add.
    5. Type the following in the code editor for MathFunctions.h:
      __declspec(dllexport) int AddIntegers(int a, int b);
  5. Implement the AddIntegers function.
    1. Right-click the MathLibNative project in Solution Explorer and choose Add | New Item.
    2. In the Add New Item dialog, choose C++ File (.cpp).
    3. Name the file MathFunctions.cpp.
    4. Click Add.
    5. Type the following in the code editor for MathFunctions.cpp.
      #include "MathFunctions.h"
      
      int AddIntegers(int a, int b)
      {
        int sum = a + b;
        return sum;
      }
  6. Configure the library project for 64-bit, which is required since we intend to load this DLL into 64-bit SQL Server.
    1. From the BUILD menu, choose Configuration Manager.
    2. Click the Active Solution Platform dropdown and choose <New…> to display the New Solution Platform dialog.
    3. In the Type Or Select The New Platform combobox, type MathLibNative.
      NewSolutionPlatform
    4. Click OK to close the New Solution Platform dialog.
    5. Click the Platform dropdown for the MathLibNative project (it is currently set for Win32, meaning 32-bit).
    6. Click <New…> to display the New Project Platform dialog.
    7. Choose x64 from the New Platform dropdown.
      NewProjectPlatform
    8. Click OK to close the New Project Platform dialog.
      ConfigurationManager1
    9. Click Close to close the Configuration Manager dialog.
  7. Press CTRL+SHIFT+B to build the solution and ensure there are no compiler errors. This creates the native DLL file named MathLibNative.dll. The __declspec(dllexport) attribute in the header file (in step 4e above) also causes the compiler to generate a library file named MathLibNative.lib that you can link to from another C++ application (you will do this in the next section when you create an executable C++ test harness).

Testing the C++ library (optional)

Ultimately, we’re going to call into the native DLL file directly from a SQL CLR UDF written in C#. However, it’s often helpful to create a test harness first. This aids in debugging and helps prove that the DLL works as expected. In this section, we’ll create two test harness applications, one native C++ executable and one C# executable. The C++ executable will allow you to single-step debug into the C++ DLL, and the C# executable will help you determine the correct interface for calling into the C++ DLL. Again, this isn’t strictly necessary; you can instead jump ahead to the next section and create the SQL CLR UDF. However, the moment something doesn’t work as expected (and that moment will come), you’ll need to fall back on one or both of these test harnesses to help discover what the problem is, so it’s good to have them.

Creating a native C++ executable test harness

To create the native C++ executable test harness, follow these steps:

  1. Create a new C++ executable project:
    1. Right-click the MathLibFromSql solution in Solution Explorer and choose Add | New Project.
    2. Under Installed Templates, choose Visual C++
    3. Select the Win32 Project template.
    4. Name the project MathLibNativeHarness.
    5. Leave the default for the project location.
    6. Click OK.
  2. When the Win32 Application Wizard appears:
    1. Click Next.
    2. For Application Type, choose Console Application.
    3. For Additional Options, check Empty Project.
    4. Click Finish.
  3. Create the harness code. This is placed in a function named main, which is the entry point for the console executable:
    1. Right-click the MathLibNativeHarness project in Solution Explorer and choose Add | New Item.
    2. In the Add New Item dialog, choose C++ File (.cpp).
    3. Name the file Main.cpp.
    4. Click Add.
    5. Type the following in the code editor for MathFunctions.cpp.
      #include "..\MathLibNative\MathFunctions.h"
      #include <iostream>
      
      int main()
      {
        int sum = AddIntegers(23, 9);
        std::cout << sum << std::endl;
      }
  4. Configure the harness project for 64-bit.
    1. From the BUILD menu, choose Configuration Manager.
    2. Click the Active Solution Platform dropdown choose MathLibNative.
    3. Click the Platform dropdown for the MathLibNativeHarness project.
    4. Click <New…> to display the New Project Platform dialog.
    5. Choose x64 from the New Platform dropdown.
    6. Click OK to close the New Project Platform dialog.
    7. Check the Build checkbox for the MathLibNativeHarness project.
      ConfigurationManager2
    8. Click Close to close the Configuration Manager dialog.
  5. Link to the library.
    1. Right-click the MathLibNativeHarness project in Solution Explorer and choose Properties.
    2. Expand the Linker options and click Input.
    3. For Additional Dependencies, click the dropdown and choose <Edit…>
      LinkerAdditionalDependencies
    4. In the Additional Dependencies dialog, type the full path to the MathLibNative.lib file generated by the compiler when it built the DLL file. If you created the solution in C:\Demo, the full path is C:\Demo\MathLibFromSQL\x64\Debug\MathLibNative.lib.
      AdditionalDependencies
    5. Click OK to close the Additional Dependencies dialog.
    6. Click OK to close the project’s property pages dialog.
  6. Press CTRL+SHIFT+B to build the solution and ensure there are no compiler errors.
  7. Run the C++ test harness.
    1. Right-click the MathLibNativeHarness project in Solution Explorer and choose Set As Startup Project.
    2. Press CTRL+F5 to run the harness without the debugger.
    3. The console output should appear with the expected result (32, which is 23 plus 9), proving that the DLL is being called properly.
      ConsoleOutputNativeHarness

If you wish, you can also run the harness with the debugger by pressing F5. This will display and close the console window too quickly for you to see the output, but you can set breakpoints and debug the code. You can even single-step into the DLL file itself.

Creating a C# executable test harness

Calling the native C++ DLL from C# works differently, and since that’s what we need to do in our SQL CLR UDF with a C# class library, it’s also helpful to figure out how to “get it right” first with a C# console application test harness. One critical part in getting this to work is to identify the correct entry point to the AddIntegers function. This would be easy if the entry point was simply named after the function (i.e., AddIntegers), but unfortunately that’s not the case. The entry point is based on the function name, but is preceded by a question mark symbol and suffixed with a bit of randomly generated text. Fortunately, there is a small, simple, and free tool available on the web named Dependency Walker that can discover the entry point for you.

Before creating the C# executable test harness, use Dependency Walker to discover the entry point to the AddIntegers function. To do so, follow these steps:

  1. Download Dependency Walker from http://www.dependencywalker.com.
  2. Extract the downloaded zip file to a folder of your choice.
  3. Navigate to the folder and launch depends.exe.
  4. In the Security Warning dialog, click Run.
  5. From the File menu, choose Open.
  6. Navigate to the folder where the native C++ DLL file was built (i.e., C:\Demo\MathLibFromSQL\x64\Debug).
  7. Double-click the file MathLibNative.dll.
  8. Dependency Walker may display a message that errors were detected, but this can be safely ignored; just click OK.
  9. Locate the entry point to the AddIntegers function in the second grid on the right.
    DependsWithCallout
  10. Leave this window open so you can easily copy the entry point name and paste it into the C# code in the next procedure.

To create the C# executable test harness, follow these steps:

  1. Create a new C# executable project:
    1. Right-click the MathLibFromSql solution in Solution Explorer and choose Add | New Project.
    2. Under Installed Templates, choose Visual C#
    3. Select the Console Application template.
    4. Name the project MathLibNativeCSharpHarness.
    5. Leave the default for the project location.
    6. Click OK.
  2. Replace the starter code generated by Visual Studio in Program.cs with the following:
    using System;
    using System.Runtime.InteropServices;
    
    namespace MathLibNativeCSharpHarness
    {
      class Program
      {
        [DllImport(@"C:\Demo\MathLibFromSQL\x64\Debug\MathLibNative.dll",
          CallingConvention = CallingConvention.Cdecl,
          EntryPoint = "?AddIntegers@@YAHHH@Z")]
        private static extern int AddIntegers(int a, int b);
    
        static void Main(string[] args)
        {
          int sum = AddIntegers(33, 19);
          Console.WriteLine(sum);
        }
      }
    }
  3. Note the EntryPoint parameter of the DllImport attribute. This is the value that I pasted in from Dependency Walker as I wrote this post. To use the appropriate EntryPoint parameter value for your environment:
    1. Return to the Dependency Walker window that you opened to MathLibNative.dll in the previous section.
    2. Right-click the Function value in the second grid on the right and choose Copy Function Name.
    3. Return to the C# code and paste it in as the EntryPoint parameter value in the DllImport attribute.
  4. Also note the pathname to MathLibNative.dll; adjust it as necessary if you have created the solution in a folder other than C:\Demo.
  5. Configure the harness project for 64-bit.
    1. From the BUILD menu, choose Configuration Manager.
    2. Click the Platform dropdown for the MathLibNativeCSharpHarness project.
    3. Click <New…> to display the New Project Platform dialog.
    4. Choose x64 from the New Platform dropdown (it may already be selected by default).
    5. Click OK to close the New Project Platform dialog.
    6. Check the Build checkbox for the MathLibNativeCSharpHarness project.
      ConfigurationManager3
    7. Click Close to close the Configuration Manager dialog.
  6. Press CTRL+SHIFT+B to build the solution and ensure there are no compiler errors.
  7. Run the C# test harness.
    1. Right-click the MathLibNativeCSharpHarness project in Solution Explorer and choose Set As Startup Project.
    2. Press CTRL+F5 to run the harness without the debugger.
    3. The console output should appear with the expected result (52, which is 33 plus 19), proving that the DLL is being called properly and that we have figured out the correct DllImport attribute to use in the C# code that we’ll create for the SQL CLR UDF.
      ConsoleOutputCSharpHarness

Creating the C# SQL CLR UDF

You’re now ready to write the SQL CLR UDF, which is a C# method decorated with the SqlFunction attribute. This is a simple wrapper method; all it does is call into the C++ DLL just like the C# executable harness we created in the previous procedure does. The wrapper method itself accepts and returns special data types that correspond to SQL Server. In our scenario, the method accepts two SqlInt32 parameters and returns a SqlInt32 value, where SqlInt32 corresponds to the int data type in C++ and C#.

To create the C# SQL CLR UDF, follow these steps:

  1. Create a new C# class library project.
    1. Right-click the MathLibFromSql solution in Solution Explorer and choose Add | New Project.
    2. Under Installed Templates, choose Visual C#
    3. Select the Class Library template.
    4. Name the project MathLibNativeSQLCLR.
    5. Leave the default for the project location.
    6. Click OK.
  2. Delete the Class1.cs file (this file was generated automatically for the project by Visual Studio and is not needed).
    1. Right-click the Class1.cs file in Solution Explorer and choose Delete.
    2. Click OK to confirm that you want to delete the file Class1.cs.
  3. Add the MathFunctions class.
    1. Right-click the MathLibNativeSQLCLR project in Solution and choose Add | New Item.
    2. In the Add New Item dialog, choose Class.
    3. Name the file MathFunctions.cs.
    4. Click Add.
    5. Replace the starter code generated by Visual Studio in MathFunctions.cs with the following:
      using System.Data.SqlTypes;
      using System.Runtime.InteropServices;
      
      namespace MathLibNativeSQLCLR
      {
        public class MathFunctions
        {
          [DllImport(@"C:\Demo\MathLibFromSQL\x64\Debug\MathLibNative.dll",
            CallingConvention = CallingConvention.Cdecl,
            EntryPoint = "?AddIntegers@@YAHHH@Z")]
          private static extern int AddIntegers(int a, int b);
      
          public static SqlInt32 AddIntegersUdf(SqlInt32 a, SqlInt32 b)
          {
            var sum = AddIntegers((int)a, (int)b);
            return sum;
          }
        }
      }
  4. Note the EntryPoint parameter of the DllImport attribute. This is the same value we used for the C# harness in the previous procedure. If you skipped the previous procedure as optional, you need to at least follow the instructions found there for using Dependency Walker to discover the correct entry point name.
  5. Also note the pathname to MathLibNative.dll; adjust it as necessary if you have created the solution in a folder other than C:\Demo.
  6. Configure the SQL CLR project for 64-bit.
    1. From the BUILD menu, choose Configuration Manager.
    2. Click the Platform dropdown for the MathLibNativeSQLCLR project.
    3. Click <New…> to display the New Project Platform dialog.
    4. Choose x64 from the New Platform dropdown (it may already be selected by default).
    5. Click OK to close the New Project Platform dialog.
    6. Check the Build checkbox for the MathLibNativeSQLCLR project.
      ConfigurationManager4
    7. Click Close to close the Configuration Manager dialog.
  7. Press CTRL+SHIFT+B to build the solution and ensure there are no compiler errors.

Deploying to SQL Server

You’re now ready to deploy the SQL CLR UDF you created in the previous procedure to a SQL Server database. To do this, use your tool of choice to connect to your SQL Server instance and open a new query window that you can execute T-SQL commands in. You can of course use SQL Server Management Studio (SSMS), or alternatively, you can use SQL Server Data Tools (SSDT) inside of Visual Studio (SSDT is installed by default with VS 2012 and VS 2013, but needs to be installed separately for VS 2010).

Most of the remaining steps are standard procedure with any SQL CLR implementation. However, some of them are required specifically because we’re calling native C++ code from SQL CLR.

  1. Enable SQL CLR. This, of course, is required to support any custom SQL CLR implementation in SQL Server.
    EXEC sp_configure 'clr enabled', 1
    RECONFIGURE
    GO
  2. Create the database.
    CREATE DATABASE MathLibDb
    GO
    USE MathLibDb
    GO
  3. Set the database’s TRUSTWORTHY property. This is normally not required for SQL CLR, but is needed here so that “unsafe” assemblies (that is, those that call into native C++ code) can be created in the database.
    ALTER DATABASE MathLibDb SET TRUSTWORTHY ON
    GO
  4. Create the assembly. Because this assembly calls into native C++ code, you must also specify PERMISSION_SET = UNSAFE.
    CREATE ASSEMBLY [MathLibNativeSQLCLR]
      AUTHORIZATION dbo
      FROM 'C:\Demo\MathLibFromSQL\MathLibNativeSQLCLR\bin\x64\Debug\MathLibNativeSQLCLR.dll'
      WITH PERMISSION_SET = UNSAFE
    GO
  5. Create the T-SQL UDF.
    CREATE FUNCTION AddIntegersUdf(@A int, @B int)
      RETURNS int
      AS EXTERNAL NAME [MathLibNativeSQLCLR].[MathLibNativeSQLCLR.MathFunctions].[AddIntegersUdf]
    GO

After all this effort, it’s quite rewarding to see the result. To watch the magic happen, invoke the T-SQL UDF just like you would any other, passing in any two numbers to be added. For example:

SELECT dbo.AddIntegersUdf(47, 16)

When you see the query return the sum of the two numbers passed in to the UDF, you know that everything is working correctly:

CallSQLCLRUDF

The T-SQL code calls the SQL CLR UDF, which calls the native C++ DLL that performs the work and returns the result all the way back up to SQL Server. So yes, it takes some effort, but it does work. Remember though, make sure your C++ code is well-behaved or you risk crashing SQL Server!

New Metadata Discovery Features in SQL Server 2012

It has always been possible to interrogate SQL Server for metadata (schema) information. You can easily discover all the objects in a database (tables, views, stored procedures, and so on) and their types by directly querying system tables (not recommended, as they can change from one version of SQL Server to another) or information schema views (which are consistent in each SQL Server version). It is significantly more challenging, however, to discover the result set schema for T-SQL statements or stored procedures that contain conditional logic. Using SET FMTONLY ON/OFF has been the common technique in the past for discovering the schema of a query’s result set without actually executing the query itself. For example, consider the following code:

USE AdventureWorks2012
GO

SET FMTONLY ON
SELECT * FROM HumanResources.Employee;
SET FMTONLY OFF

This SELECT statement, which would normally return all the rows from the HumanResources.Employee table, returns no rows at all. It just reveals the columns. The SET FMTONLY ON statement prevents queries from returning rows of data so that their schemas can be discovered, and this behavior remains in effect until SET FMTONLY OFF is encountered. SQL Server 2012 introduces several new system stored procedures and table-valued functions (TVFs) that provide significantly richer metadata discovery than what can be discerned using the relatively inelegant (and now deprecated) SET FMTONLY ON/OFF approach. These new procedures and functions are:

  • sys.sp_describe_first_result_set
  • sys.dm_exec_describe_first_result_set
  • sys.dm_exec_describe_first_result_set_for_object
  • sys.sp_describe_undeclared_parameters

In this blog post, I’ll explain how to use these new objects to discover schema information in SQL Server 2012.

sys.sp_describe_first_result_set

The sys.sp_describe_first_result_set stored procedure accepts a T-SQL statement and produces a highly detailed schema description of the first possible result set returned by that statement. The following code retrieves schema information for the same SELECT statement you used earlier to get information on all the columns in the HumanResources.Employee table:

EXEC sp_describe_first_result_set
 @tsql = N'SELECT * FROM HumanResources.Employee'

The following screenshot shows the wealth of information that SQL Server returns about each column in the result set returned by the sp_describe_first_result_set call:

sys.dm_exec_describe_first_result_set

There is also a data management function named sys.dm_exec_describe_first_result_set that works very similar to sys.sp_describe_first_result_set. But because it is implemented as a table-valued function (TVF), it is easy to query against it and limit the metadata returned. For example, the following query examines the same T-SQL statement, but returns just the name and data type of nullable columns:

SELECT name, system_type_name
 FROM sys.dm_exec_describe_first_result_set(
  'SELECT * FROM HumanResources.Employee', NULL, 1)
 WHERE is_nullable = 1

Here is the output:

name               system_type_name
-----------------  ----------------
OrganizationNode   hierarchyid
OrganizationLevel  smallint

Parameterized queries are also supported, if you supply an appropriate parameter signature after the T-SQL. The T-SQL in the previous example had no parameters, so it passed NULL for the “parameters parameter.” The following example discovers the schema of a parameterized query.

SELECT name, system_type_name, is_hidden
 FROM sys.dm_exec_describe_first_result_set('
  SELECT OrderDate, TotalDue
   FROM Sales.SalesOrderHeader
   WHERE SalesOrderID = @OrderID',
  '@OrderID int', 1)

Here is the output:

name             system_type_name  is_hidden
---------------  ----------------  ---------
OrderDate        datetime          0
TotalDue         money             0
SalesOrderID     int               1

You’d be quick to question why the SalesOrderID column is returned for a SELECT statement that returns only OrderDate and TotalDue. The answer lies in the last parameter passed to the data management function. A bit value of 1 (for true) tells SQL Server to return the identifying SalesOrderID column, because it is used to “browse” the result set. Notice that it is marked true (1) for is_hidden. This informs the client that the SalesOrderID column is not actually revealed by the query, but can be used to uniquely identify each row in the query’s result set.

What if multiple result sets are possible? There’s no problem with this as long as they all have the same schema. In fact, SQL Server will even try to forgive cases where multiple possible schemas are not exactly identical. For example, if the same column is nullable in one result set and non-nullable in the other, schema discovery will succeed and indicate the column as nullable. It will even tolerate cases where the same column has a different name (but same type) between two possible result sets, and indicate NULL for the column name, rather than arbitrarily choosing one of the possible column names or failing altogether.

The following code demonstrates this with a T-SQL statement that has two possible result sets depending on the value passed in for the @SortOrder parameter. Because both result sets have compatible schemas, the data management function succeeds in returning schema information.

SELECT name, system_type_name
 FROM sys.dm_exec_describe_first_result_set('
    IF @SortOrder = 1
      SELECT OrderDate, TotalDue
       FROM Sales.SalesOrderHeader
       ORDER BY SalesOrderID ASC
    ELSE IF @SortOrder = -1
      SELECT OrderDate, TotalDue
       FROM Sales.SalesOrderHeader
       ORDER BY SalesOrderID DESC',
   '@SortOrder AS tinyint', 0) 

Here is the output:

name         system_type_name
-----------  ----------------
OrderDate    datetime
TotalDue     money

Discovery won’t succeed if SQL Server detects incompatible schemas. In this next example, the call to the system stored procedure specifies a T-SQL statement with two possible result sets, but one returns three columns while the other returns only two columns.

EXEC sys.sp_describe_first_result_set
  @tsql = N'
    IF @IncludeCurrencyRate = 1
      SELECT OrderDate, TotalDue, CurrencyRateID
       FROM Sales.SalesOrderHeader
    ELSE
      SELECT OrderDate, TotalDue
       FROM Sales.SalesOrderHeader'

In this case, the system stored procedure raises an error that clearly explains the problem:

Msg 11509, Level 16, State 1, Procedure sp_describe_first_result_set, Line 53

The metadata could not be determined because the statement 'SELECT OrderDate, TotalDue, CurrencyRateID FROM Sales.SalesOrderHeader' is not compatible with the statement 'SELECT OrderDate, TotalDue FROM Sales.SalesOrderHeader'.

It is noteworthy to mention that the data management function copes with this scenario much more passively. Given conflicting result set schemas, it simply returns NULL and does not raise an error.

sys.dm_exec_describe_first_result_set_for_object

The data management function sys.dm_exec_describe_first_result_set_for_object can be used to achieve the same discovery against any object in the database. It accepts just an object ID and the Boolean “browse” flag to specify if hidden ID columns should be returned. You can use the OBJECT_ID function to obtain the ID of the desired object. The following code demonstrates this by returning schema information for the stored procedure GetOrderInfo.

CREATE PROCEDURE GetOrderInfo(@OrderID AS int) AS
  SELECT OrderDate, TotalDue
   FROM Sales.SalesOrderHeader
   WHERE SalesOrderID = @OrderID
GO

SELECT name, system_type_name, is_hidden
 FROM sys.dm_exec_describe_first_result_set_for_object(OBJECT_ID('GetOrderInfo'), 1)

Here is the output:

name             system_type_name   is_hidden
---------------  -----------------  ---------
OrderDate        datetime           0
TotalDue         money              0
SalesOrderID     int                1

sys.sp_describe_undeclared_parameters

Finally, the sys.sp_describe_undeclared_parameters stored procedure parses a T-SQL statement to discover type information about the parameters expected by the statement, as the following code demonstrates:

EXEC sys.sp_describe_undeclared_parameters
 N'IF @IsFlag = 1 SELECT 1 ELSE SELECT 0'

Here is the output:

parameter_ordinal name    suggested_system_type_id suggested_system_type_name ...
----------------- ------- ------------------------ -------------------------- -------
1                 @IsFlag 56                       int                        ... 

In this example, SQL Server detects the @IsFlag parameter, and suggests the int data type based on the usage in the T-SQL statement it was given to parse.

Enhance Portability with Partially Contained Databases in SQL Server 2012

The dependency of database-specific users upon server-based logins poses a challenge when you need to move or restore a database to another server. Although the users move with the database, their associated logins do not, and thus the relocated database will not function properly until you also setup and map the necessary logins on the target server. To resolve these types of dependency problems and help make databases more easily portable, SQL Server 2012 introduces “partially contained” databases.

The term “partially contained” is based on the fact that SQL Server itself merely enables containment—it does not enforce it. It’s still your job to actually implement true containment. From a security perspective, this means that partially contained databases allow you to create a special type of user called a contained user. The contained user’s password is stored right inside the contained database, rather than being associated with a login defined at the server instance level and stored in the master database. Then, unlike the standard SQL Server authentication model, contained users are authenticated directly against the credentials in the contained database without ever authenticating against the server instance. Naturally, for this to work, a connection string with a contained user’s credentials must include the Initial Catalog keyword that specifies the contained database name.

Creating a Partially Contained Database

To create a partially contained database, first enable the contained database authentication setting by calling sp_configure and then issue a CREATE DATABASE statement with the new CONTAINMENT=PARTIAL clause as the following code demonstrates.

-- Enable database containment

USE master
GO

EXEC sp_configure 'contained database authentication', 1
RECONFIGURE

-- Delete database if it already exists
IF EXISTS(SELECT name FROM sys.databases WHERE name = 'MyDB')
 DROP DATABASE MyDB
GO

-- Create a partially contained database
CREATE DATABASE MyDB CONTAINMENT = PARTIAL
GO

USE MyDB
GO

To reiterate, SQL Server doesn’t enforce containment. You can still break containment by creating ordinary database users for server-based logins. For this reason, it’s easy to convert an ordinary (uncontained) database to a partially contained database; simply issue an ALTER DATABASE statement and specify SET CONTAINMENT=PARTIAL. You’ll then be able to migrate the server-based logins to contained logins and achieve server independence.

Creating a Contained User

Once you have a contained database, you can create a contained user for it by issuing a CREATE USER statement and specifying WITH PASSWORD, as shown here:

CREATE USER UserWithPw
 WITH PASSWORD = N'password$1234'

This syntax is valid only for contained databases; SQL Server will raise an error if you attempt to create a contained user in the context of an uncontained database.

That’s all there is to creating partially contained databases with contained users. The only remaining point that’s worth calling out is that an Initial Catalog clause pointing to a partially contained database must be specified explicitly in a connection string that also specifies the credentials of a contained user in that database. If just the credentials are specified without the database, SQL Server will not scan the partially contained databases hosted on the instance for one that has a user with matching credentials. Instead, it will consider the credentials to be those of an ordinary SQL Server login, and will not authenticate against the contained database.

Other Partially Contained Database Features

Aside from server-based logins, there are many other dependencies that a database might have on its hosted instance. These include linked servers, SQL CLR, database mail, service broker objects, endpoints, replication, SQL Server Agent jobs, and tempdb collation. All these objects are considered to be uncontained entities since they all exist outside the database.

Uncontained entities threaten a database’s portability. Since these objects are all defined at the server instance level, behavior can vary unpredictably when databases are shuffled around from one instance to another. Let’s examine features to help you achieve the level of containment and stability that your circumstances require.

Uncontained Entities View

SQL Server provides a new data management view (DMV) called sys.dm_db_uncontained_entities that you can query on to discover potential threats to database portability. This DMV not only highlights dependent objects, it will even report the exact location of all uncontained entity references inside of stored procedures, views, functions, and triggers.

The following code creates a few stored procedures, and then joins sys.dm_db_uncontained_entities with sys.objects to report the name of all objects having uncontained entity references in them.

-- Create a procedure that references a database-level object
CREATE PROCEDURE GetTables AS
BEGIN
  SELECT * FROM sys.tables
END
GO

-- Create a procedure that references an instance-level object
CREATE PROCEDURE GetEndpoints AS
BEGIN
  SELECT * FROM sys.endpoints
END
GO

-- Identify objects that break containment
SELECT
  UncType = ue.feature_type_name,
  UncName = ue.feature_name,
  RefType = o.type_desc,
  RefName = o.name,
  Stmt = ue.statement_type,
  Line = ue.statement_line_number,
  StartPos = ue.statement_offset_begin,
  EndPos = ue.statement_offset_end
 FROM
  sys.dm_db_uncontained_entities AS ue
  INNER JOIN sys.objects AS o ON o.object_id = ue.major_id

Here is the result of the query:

UncType      UncName    RefType               RefName       Stmt    Line  StartPos  EndPos
-----------  ---------  --------------------  ------------  ------  ----  --------  ---
System View  endpoints  SQL_STORED_PROCEDURE  GetEndpoints  SELECT  5     218       274

The DMV identifies the stored procedure GetEndpoints as an object with an uncontained entity reference. Specifically, the output reveals that a stored procedure references the sys.endpoints view in a SELECT statement on line 5 at position 218. This alerts you to a database dependency on endpoints configured at the server instance level that could potentially pose an issue for portability. The GetTables stored procedure does not have any uncontained entity references (sys.tables is contained), and is therefore not reported by the DMV.

Collations and tempdb

Ordinarily, all databases hosted on the same SQL Server instance share the same tempdb database for storing temporary tables, and all the databases (including tempdb) on the instance use the same collation setting (collation controls string data character set, case sensitivity, and accent sensitivity). When joining between regular database tables and temporary tables, both your user database and tempdb must use a compatible collation. This, again, represents an instance-level dependency with respect to the fact that the collation setting can vary from one server instance to another. Thus, problems arise when moving databases between servers that have different collation settings for tempdb. The code below demonstrates the problem, and how to avoid it by using a contained database.

-- Create an uncontained database with custom collation
USE master
GO
IF EXISTS(SELECT name FROM sys.databases WHERE name = 'MyDB')
 DROP DATABASE MyDB
GO
CREATE DATABASE MyDB COLLATE Chinese_Simplified_Pinyin_100_CI_AS
GO

USE MyDB
GO

-- Create a table in MyDB (uses Chinese_Simplified_Pinyin_100_CI_AS collation)
CREATE TABLE TestTable (TextValue nvarchar(max))

-- Create a temp table in tempdb (uses SQL_Latin1_General_CP1_CI_AS collation)
CREATE TABLE #TempTable (TextValue nvarchar(max))

-- Fails, because MyDB and tempdb uses different collation
SELECT *
 FROM TestTable INNER JOIN #TempTable ON TestTable.TextValue = #TempTable.TextValue

-- Convert to a partially contained database
DROP TABLE #TempTable
USE master

ALTER DATABASE MyDB SET CONTAINMENT=PARTIAL
GO

USE MyDB
GO

-- Create a temp table in MyDB (uses Chinese_Simplified_Pinyin_100_CI_AS collation)
CREATE TABLE #TempTable (TextValue nvarchar(max))

-- Succeeds, because the table in tempdb now uses the same collation as MyDB
SELECT *
 FROM TestTable INNER JOIN #TempTable ON TestTable.TextValue = #TempTable.TextValue

-- Cleanup
DROP TABLE #TempTable
USE master
DROP DATABASE MyDB
GO

This code first creates an uncontained database that uses Chinese_Simplified_Pinyin_100_CI_AS collation on a server instance that uses (the default) SQL_Latin1_General_CP1_CI_AS collation. The code then creates a temporary table and then attempts to join an ordinary database table against it. The attempt fails because the two tables have different collations (that is, they each reside in databases that use different collations), and SQL Server issues the following error message:

Msg 468, Level 16, State 9, Line 81
Cannot resolve the collation conflict between "SQL_Latin1_General_CP1_CI_AS" and
"Chinese_Simplified_Pinyin_100_CI_AS" in the equal to operation.

Then the code issues an ALTER DATABASE…SET CONTAINMENT=PARTIAL statement to convert the database to a partially contained database. As a result, SQL Server resolves the conflict by collating the temporary table in tempdb in the same collation as the contained database, and the second join attempt succeeds.

Summary

Partially contained databases in SQL Server 2012 go a long way helping to improve the portability of databases across servers and instances. In this blog post, I demonstrated how to create a partially contained database with contained users, how to deal with collation issues, and how to use the new data management view to discover threats to containment and identify external dependencies. These capabilities are welcome news for SQL Server DBAs everywhere. Enjoy!

New Spatial Features in SQL Server 2012

SQL Server 2012 adds many significant improvements to the spatial support that was first introduced with SQL Server 2008. Among the more notable enhancements is support for curves (arcs), where SQL Server 2008 only supported straight lines, or polygons composed of straight lines. Microsoft also provides methods that test for non-2012-compatible (curved) shapes, and convert circular data to line data for backward compatibility with SQL Server 2008 (as well as other mapping platforms that don’t support curves).

New Spatial Data Classes

The three new spatial data classes in SQL Server 2012 are:

  • Circular strings
  • Compound curves
  • Curve polygons

All three of these shapes are supported in WKT, WKB, and GML by both the geometry and geography data types, and all of the existing methods work on all of the new circular shapes. My previous post, Geospatial Support for Circular Data in SQL Server 2012 covers these new spatial classes in detail, and shows you how to use them to create circular data. This post focuses on additional spatial features that are new in SQL Server 2012.

New Spatial Methods

Let’s explore a few of the new spatial methods. Some of these new methods complement the new curved shapes, while others add new spatial features that work with all shapes.

The STNumCurves and STCurveN Methods

These two methods can be invoked on any geometry or geography instance. They can be used together to discover information about the curves contained within the spatial instance. The STNumCurves method returns the total number of curves in the instance. You can then pass any number between 1 and what STNumCurves returns to extract each individual curve, and thus iterate all the curves in the instance.

For example, the WKT string CIRCULARSTRING(0 4, 4 0, 8 4, 4 8, 0 4) defines a perfect circle composed of two connected segments; 0 4, 4 0, 8, 4 and 8 4, 4 8, 0 4 (the third coordinate 8 4 is used both as the ending point of the first arc and the starting point of the second arc. The following code demonstrates how to obtain curve information from this circular string using the STNumCurves and STCurveN methods.

-- Create a full circle shape (two connected semi-circles)
DECLARE @C geometry = 'CIRCULARSTRING(0 4, 4 0, 8 4, 4 8, 0 4)'

-- Get the curve count (2) and the 1st curve (bottom semi-circle)
SELECT
  CurveCount = @C.STNumCurves(),
  SecondCurve = @C.STCurveN(2),
  SecondCurveWKT = @C.STCurveN(2).ToString()

This query produces the following output:

CurveCount SecondCurve                                     SecondCurveWKT
---------- ----------------------------------------------- -------------------------------
2          0x000000000204030000000000000000002040000000... CIRCULARSTRING (8 4, 4 8, 0 4)

You can see that STNumCurves indicates there are two curves, and that STCurveN(2) returns the second curve. If you view the results in the spatial viewer, you’ll see just the top half of the circle. This is the semi-circle defined by the second curve, which is converted back to WKT as CIRCULARSTRING (8 4, 4 8, 0 4). Notice that this represents the second segment of the full circle.

The BufferWithCurves Method

SQL Server 2008 introduced the STBuffer method which “pads” a line, effectively converting it into a polygon. If you look closely at the resulting polygon shapes in the spatial viewer, it appears that the points of each line string (including the mid points) are transformed into rounded edges in the polygon. However, the rounded edge look is actually produced by plotting many short straight lines that are clustered very closely together, presenting the illusion of a curve. This approach is necessary since curves were not previously supported before SQL Server 2012 (but the STBuffer method was).

Clearly, using native curve definitions in a curve polygon is more efficient than clustering a multitude of straight lines in an ordinary polygon. For backward compatibility, STBuffer continues to return the (inefficient) polygon as before. So SQL Server 2012 introduces a new method, BufferWithCurves, for this purpose. The following code uses BufferWithCurves to pad lines using true curves, and compares the result with its straight-line cousin, STBuffer.

DECLARE @streets geometry = '
 GEOMETRYCOLLECTION(
  LINESTRING (100 -100, 20 -180, 180 -180),
  LINESTRING (300 -300, 300 -150, 50 -50)
 )'
SELECT @streets.BufferWithCurves(10)

SELECT
  AsWKT = @streets.ToString(),
  Bytes = DATALENGTH(@streets),
  Points = @streets.STNumPoints()
 UNION ALL
 SELECT
  @streets.STBuffer(10).ToString(),
  DATALENGTH(@streets.STBuffer(10)),
  @streets.STBuffer(10).STNumPoints()
 UNION ALL
 SELECT
  @streets.BufferWithCurves(10).ToString(),
  DATALENGTH(@streets.BufferWithCurves(10)),
  @streets.BufferWithCurves(10).STNumPoints()

Here is the resulting shape returned by the first SELECT statement (the collection of padded line shapes generated by BufferWithCurves):

As with STBuffer, the new shapes have rounded edges around the points of the original line strings. However, BufferWithCurves generates actual curves, and thus, produces a significantly smaller and simpler polygon. The second SELECT statement demonstrates by comparing the three shapes—the original line string collection, the polygon returned by STBuffer, and the curve polygon returned by BufferWithCurves. Here are the results:

AsWKT                                                                       Bytes  Points
--------------------------------------------------------------------------  -----  ------
GEOMETRYCOLLECTION (LINESTRING (100 -100, 20 -180, 180 -180), LINESTRIN...  151    6
MULTIPOLYGON (((20.000000000000796 -189.99999999999858, 179.99999999999...  5207   322
GEOMETRYCOLLECTION (CURVEPOLYGON (COMPOUNDCURVE ((20.000000000000796 -1...  693    38

The first shape is the original geometry collection of line strings used for input, which requires only 151 bytes of storage, and has only 6 points. For the second shape, STBuffer pads the line strings to produce a multi-polygon (a set of polygons) that consumes 5,207 bytes and has a total of 322 points—a whopping 3,448 percent increase from the original line strings. In the third shape, BufferWithCurves is used to produce the equivalent padding using a collection of curve polygons composed of compound curves, so it consumes only 693 bytes and has only 38 points—a (relatively) mere 458 percent increase from the original line strings.

The ShortestLineTo Method

This new method examines any two shapes and figures out the shortest line between them. The following code demonstrates:

DECLARE @Shape1 geometry = 'POLYGON ((-20 -30, -3 -26, 14 -28, 20 -40, -20 -30))'
DECLARE @Shape2 geometry = 'POLYGON ((-18 -20, 0 -10, 4 -12, 10 -20, 2 -22, -18 -20))'

SELECT @Shape1
UNION ALL
SELECT @Shape2
UNION ALL
SELECT @Shape1.ShortestLineTo(@Shape2).STBuffer(.25)

This code defines two polygons and then uses ShortestLineTo to determine, generate, and return the shortest straight line that connects them. STBuffer is also used to pad the line string so that it is more clearly visible in the spatial viewer:

The MinDbCompatibilityLevel Method

With the added support for curves in SQL Server 2012 comes support for backward compatibility with previous versions of SQL Server (2008 and 2008 R2) that don’t support curves. The new MinDbCompatibilityLevel method accepts any WKT string and returns the minimum version of SQL Server required to support the shape defined by that string. For example, consider the following code:

DECLARE @Shape1 geometry = 'CIRCULARSTRING(0 50, 90 50, 180 50)'
DECLARE @Shape2 geometry = 'LINESTRING (0 50, 90 50, 180 50)'

SELECT
 Shape1MinVersion = @Shape1.MinDbCompatibilityLevel(),
 Shape2MinVersion = @Shape2.MinDbCompatibilityLevel()

The MinDbCompatibilityLevel method returns 110 (referring to version 11.0) for the first WKT string and 100 (version 10.0) for the second one. This is because the first WKT string contains a circular string, which requires SQL Server 2012 (version 11.0), while the line string in the second WKT string is supported by SQL Server 2008 (version 10.0) and higher.

The STCurveToLine and CurveToLineWithTolerance Methods

These are two methods you can use to convert curves to roughly equivalent straight line shapes. Again, this is to provide compatibility with previous versions of SQL Server and other mapping platforms that don’t support curves.

The STCurveToLine method converts a single curve to a line string with a multitude of segments and points that best approximate the original curve. The technique is similar to what we just discussed for STBuffer, where many short straight lines are connected in a cluster of points to simulate a curve. And, as explained in that discussion, the resulting line string requires significantly more storage than the original curve. To offer a compromise between fidelity and storage, the CurveToLineWithTolerance method accepts “tolerance” parameters to produce line strings that consume less storage space than those produced by STCurveToLine. The following code demonstrates by using both methods to convert the same circle shape from the previous STNumCurves and STCurveN example into line strings.

-- Create a full circle shape (two connected semi-circles)
DECLARE @C geometry = 'CIRCULARSTRING(0 4, 4 0, 8 4, 4 8, 0 4)'

-- Render as curved shape
SELECT
  Shape = @C,
  ShapeWKT = @C.ToString(),
  ShapeLen = DATALENGTH(@C),
  Points = @C.STNumPoints()

-- Convert to lines (much larger, many more points)
SELECT
  Shape = @C.STCurveToLine(),
  ShapeWKT = @C.STCurveToLine().ToString(),
  ShapeLen = DATALENGTH(@C.STCurveToLine()),
  Points = @C.STCurveToLine().STNumPoints()

-- Convert to lines with tolerance (not as much larger, not as many more points)
SELECT
  Shape = @C.CurveToLineWithTolerance(0.1, 0),
  ShapeWKT = @C.CurveToLineWithTolerance(0.1, 0).ToString(),
  ShapeLen = DATALENGTH(@C.CurveToLineWithTolerance(0.1, 0)),
  Points = @C.CurveToLineWithTolerance(0.1, 0).STNumPoints()

The query results show that the original circle consumes only 112 bytes and has 5 points. Invoking STCurveToLine on the circle converts it into a line string that consumes 1,072 bytes and has 65 points. That’s a big increase, but the resulting line string represents the original circle in high fidelity; you will not see a perceptible difference in the two when viewing them using the spatial viewer. However, the line string produced by CurveToLineWithTolerance consumes only 304 bytes and has only 17 points; a significantly smaller footprint, paid for with a noticeable loss in fidelity. As shown by the spatial viewer results below, using CurveToLineWithTolerance produces a circle made up of visibly straight line segments:

The STIsValid, IsValidDetailed and MakeValid Methods

Spatial instance validation has improved greatly in SQL Server 2012. The STIsValid method evaluates a spatial instance and returns a 1 (for true) or 0 (for false) indicating if the instance represents a valid shape (or shapes). If the instance is invalid, the new IsValidDetailed method will return a string explaining the reason why. The following code demonstrates.

DECLARE @line geometry = 'LINESTRING(1 1, 2 2, 3 2, 2 2)'

SELECT
 IsValid = @line.STIsValid(),
 Details = @line.IsValidDetailed()

This line string is invalid because the same point (2 2) is repeated, which results in “overlapping edges,” as revealed by the output from IsValidDetailed:

IsValid  Details
-------  -------------------------------------------------------------------
0        24413: Not valid because of two overlapping edges in curve (1).

SQL Server 2012 is more tolerant of invalid spatial instances than previous versions. For example, you can now perform metric operations (such as STLength) on invalid instances, although you still won’t be able to perform other operations (such as STBuffer) on them.

The new MakeValid method can “fix” an invalid spatial instance and make it valid. Of course, the shape will shift slightly, and there are no guarantees on the accuracy or precision of the changes made. The code in Listing 10-27 uses MakeValid to remove overlapping parts (which can be caused by anomalies such as inaccurate GPS traces), effectively converting the invalid line string into a valid spatial instance.

DECLARE @line geometry = 'LINESTRING(1 1, 2 2, 3 2, 2 2)'
SELECT @line.MakeValid().ToString() AS Fixed

The WKT string returned by the SELECT statement shows the “fixed” line string:

Fixed
-------------------------------------------------------------------
LINESTRING (3 2, 2 2, 1.0000000000000071 1.0000000000000036)

Other Enhancements

The remainder of this post gives brief mention to several other noteworthy spatial enhancements added in SQL Server 2012. These include better geography support, and precision and optimization improvements.

Support for geography Instances Exceeding a Logical Hemisphere

Previous versions of SQL Server supported geography objects as large as (slightly less than) a logical hemisphere (half the globe). This limitation has been removed in SQL Server 2012, which now supports geography instances of any size (even the entire planet).

When you define a geography polygon, the order in which you specify the ring’s latitude and longitude coordinates (known as vertex order) is significant (unlike geometry, where vertex order is insignificant). The coordinate points are always defined according to the left-foot inside rule; when you “walk” the boundary of the polygon, your left foot is on the inside. Thus, vertex order determines whether you are defining a small piece of the globe, relative to the larger piece defined by the entire globe except for the small piece (that is, the rest of the globe).

Since previous versions of SQL Server were limited to half the globe, it was impossible to specify the points of a polygon in the “wrong order,” simply because doing so resulted in too large a shape (and thus, raised an error). That error potential no longer exists in SQL Server 2012, so it’s even more critical to make sure your vertex order is correct, or you’ll be unwittingly working with the exact “opposite” shape.

If you have a geography instance that is known have the wrong vertex order, you can repair it using the new ReorientObject method. This method operates only on polygons (it has no effect on points, line strings, or curves), and can be used to correct the ring orientation (vertex order) of the polygon. The following code demonstrates.

-- Small (less than a logical hemisphere) polygon
SELECT geography::Parse('POLYGON((-10 -10, 10 -10, 10 10, -10 10, -10 -10))')

-- Reorder in the opposite direction for "rest of the globe"
SELECT geography::Parse('POLYGON((-10 -10, -10 10, 10 10, 10 -10, -10 -10))')

-- Reorient back to the small polygon
SELECT geography::Parse('POLYGON((-10 -10, -10 10, 10 10, 10 -10, -10 -10))').ReorientObject()

Three geography polygon instances are defined in this code. The first geography instance defines a very small polygon. The second instance uses the exact same coordinates, but because the vertex order reversed, it defines an enormous polygon whose area represents the entire globe except for the small polygon. As explained, such a definition would cause an error in previous versions of SQL Server, but is now accommodated without a problem by SQL Server 2012. The third instance reverses the vertex order on the same shape as the second instance, thereby producing the same small polygon as the first instance.

Full Globe Support

Along with the aforementioned support for geography instances to exceed a single logical hemisphere comes a new spatial data class called FULLGLOBE. As you may have guessed, this is a shape that represents the entire planet. If you’ve ever wondered how many square meters there are in the entire world, the following query gives you the answer (which is 510,065,621,710,996 square meters, so you can stop wondering).

-- Construct a new FullGlobe object (a WGS84 ellipsoid)
DECLARE @Earth geography = 'FULLGLOBE'

-- Calculate the area of the earth
SELECT PlanetArea = @Earth.STArea()

All of the common spatial methods work as expected on a full globe object. So you could, for example, “cut away” at the globe by invoking the STDifference and STSymDifference method against it using other polygons as cookie-cutter shapes.

New “Unit Sphere” Spatial Reference ID

The default spatial reference ID (SRID) in SQL Server 2012 is 4326, which uses the metric system as its unit of measurement. This SRID also represents the true ellipsoidal sphere shape of the earth. While this representation is most accurate, it’s also more complex to calculate precise ellipsoidal mathematics. SQL Server 2012 offers a compromise in speed and accuracy, by adding a new spatial reference id (SRID), 104001, which uses a sphere of radius 1 to represent a perfectly round earth.

You can create geography instances with SRID 104001 when you don’t require the greatest accuracy. The STDistance, STLength, and ShortestLineTo methods are optimized to run faster on the unit sphere, since it takes a relatively simple formula to compute measures against a perfectly round sphere (compared to an ellipsoidal sphere).

Better Precision

Internal spatial calculations in SQL Server 2012 are now performed with 48 bits of precision, compared to 27 bits used in SQL Server 2008 and SQL Server 2008 R2. This can reduce the error caused by rounding of floating point coordinates for original vertex points by the internal computation.

Summary

This blog post introduced you to some of the powerful new spatial capabilities added to SQL Server 2012. You saw how to use STNumCurves and STCurveN to obtain curve information from circular data, the BufferWithCurves method to produce more efficient padded line shapes than STBuffer, and the ShortestLineTo method to figure out the shortest distance between two shapes. Then you saw how to use the new MinDbCompatibilityLevel, STCurveToLine, and CurveToLineWithTolerance methods for supporting backward compatibility with SQL Server 2008. You also learned how SQL Server 2012 is much better at handling invalid spatial data, using the STIsValid, IsValidDetailed, and MakeValid methods. Finally, you learned about the new full globe support, unit sphere SRID, and improved precision.

You can learn much more about spatial functionality in my new book Programming Microsoft SQL Server 2012, which has an entire chapter dedicated to the topic. I hope you get to enjoy these powerful new spatial capabilities in SQL Server 2012!

How Significant is the SQL Server 2012 Release?

SQL Server, particularly its relational database engine, matured quite some time ago. So the “significance” of every new release over recent years can be viewed—in some ways—as relatively nominal. The last watershed release of the product was actually SQL Server 2005, which was when the relational engine (that, for years, defined SQL Server) stopped occupying “center stage,” and instead took its position alongside a set of services that now, collectively, define the product. These of course include the Business Intelligence (BI) components Reporting Services, Analysis Services, and Integration Services—features that began appearing as early as 1999 but, prior to SQL Server 2005, were integrated sporadically as a patchwork of loosely coupled add-ons, wizards, and management consoles. SQL Server 2005 changed all that with a complete overhaul. For the first time, SQL Server delivered a broader, richer, and more consolidated set of features and services which are built into—rather than bolted onto—the platform. None of the product versions that have been released since that time—SQL Server 2008, 2008 R2, and now 2012—have changed underlying architecture this radically.

That said, each SQL Server release continues to advance itself in vitally significant ways. SQL Server 2008 (released August 6, 2008) added a host of new features to the relational engine—T-SQL enhancements, Change Data Capture (CDC), Transparent Data Encryption (TDE), SQL Audit, FILESTREAM—plus powerful BI capabilities with Excel PivotTables, charts, and CUBE formulas. SQL Server 2008 R2 (released April 21, 2010) was dubbed the “BI Refresh,” adding PowerPivot for Excel, Master Data Services, and StreamInsight, but offering nothing more than minor tweaks and fixes in the relational engine.

The newest release—SQL Server 2012—is set to officially launch on March 7, 2012, and like every new release, this version improves on all of the key “abilities” (availability, scalability, manageability, programmability, and so on). The chief reliability improvement is the new High Availability Disaster Recovery (HADR) alternative to database mirroring. HADR (also commonly known as “Always On”) utilizes multiple secondary servers in an “availability group” for scale-out read-only operations (rather than forcing them to sit idle, just waiting for a failover to occur). Multi-subnet failover clustering is another notable new manageability feature.

SQL Server 2012 adds many new features to the relational engine, many of which I have blogged about, and most of which I cover in my new book (soon to be published). There are powerful T-SQL extensions, most notably the windowing enhancements, 22 new T-SQL functions, improved error handling, server-side paging, sequence generators, and rich metadata discovery techniques. There are also remarkable improvements for unstructured data, such as the FileTable abstraction over FILESTREAM and the Windows file system API, full-text property searching, and Statistical Semantic Search. Spatial support gets a big boost as well, with support for circular data, full-globe support, increased performance, and greater parity between the geometry and geography data types. Contained databases simplify portability, and new “columnstore” technology drastically increases performance of huge OLAP cubes (VertiPaq for PowerPivot and Analysis Services) and data warehouses (a similar implementation in the relational engine).

The aforementioned features are impressive, but still don’t amount to much more than “additives” over an already established database platform. A new release needs more than just extra icing on the cake for customers to perceive an upgrade as compelling. To that end, Microsoft has invested heavily in BI with SQL Server 2012, and the effort shows. The BI portion of the stack has been expanded greatly, delivering key advances in “pervasive insight.” This includes major updates to the product’s analytics, data visualization (such as self-service reporting with Power View), and master data management capabilities, as well Data Quality Services (DQS), a brand new data quality engine. There is also a new Business Intelligence edition of the product that includes all of these capabilities without requiring a full Enterprise edition license. Finally, SQL Server Data Tools (SSDT) brings brand new database tooling inside Visual Studio 2010. SSDT provides a declarative, model-based design-time experience for developing databases while connected, offline, on-premise, or in the cloud.

Of course there are more new features; I’ve only mentioned the most notable ones in this post. Although the relational engine has been rock-solid for several releases, it continues to enjoy powerful enhancements, while the overall product continues to reinvent itself by expanding its stack of services, particularly in the Business Intelligence space.

Geospatial Support for Circular Data in SQL Server 2012

SQL Server 2012 adds many significant improvements to the spatial support that was first introduced with SQL Server 2008. In this blog post, I’ll explore one of the more notable enhancements: support for curves and arcs (circular data). SQL Server 2008 only supported straight lines, or polygons composed of straight lines. The three new shapes in SQL Server 2012 are circular strings, compound curves, and curve polygons. All three are supported in Well-Known Text (WKT), Well-Known Binary (WKB), and Geometry Markup Language (GML) by both the geometry (planar, or “flat-earth” model) and geography (ellipsoidal sphere, or geodetic) data types, and all of the existing methods work on the new shapes.

Circular Strings

A circular string defines a basic curved line, similar to how a line string defines a straight line. It takes a minimum of three coordinates to define a circular string; the first and third coordinates define the end points of the line, and the second coordinate (the “anchor” point, which lies somewhere between the end points) determines the arc of the line. Here is the shape represented by CIRCULARSTRING(0 1, .25 0, 0 -1):

The following code produces four circular strings. All of them have the same start and end points, but different anchor points. The lines are buffered slightly to make them easier to see in the spatial viewer.

-- Create a "straight" circular line
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 0, 8 -8)').STBuffer(.1)
UNION ALL  -- Curve it
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 4, 8 -8)').STBuffer(.1)
UNION ALL  -- Curve it some more
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 6, 8 -8)').STBuffer(.1)
UNION ALL  -- Curve it in the other direction
SELECT geometry::Parse('CIRCULARSTRING(0 8, 4 -6, 8 -8)').STBuffer(.1)

The spatial viewer in SQL Server Management Studio shows the generated shapes:

The first shape is a “straight circular” line, because the anchor point is position directly between the start and end points. The next two shapes use the same end points with the anchor out to the right (4), one a bit further than the other (6). The last shape also uses the same end points, but specifies an anchor point that curves the line to the left rather than the right (-6).

You can extend circular strings with as many curve segments as you want. Do this by defining another two coordinates for each additional segment. The last point of the previous curve serves as the first end point of the next curve segment, so the two additional coordinates respectively specify the next segment’s anchor and second end point. Thus, valid circular strings will always have an odd number of points. You can extend a circular string indefinitely to form curves and arcs of any kind.

It’s easy to form a perfect circle by connecting two semi-circle segments. For example, the following illustration shows the circle produced by CIRCULARSTRING(0 4, 4 0, 8 4, 4 8, 0 4).

This particular example connects the end of the second segment to the beginning of the first segment to form a closed shape. Note that this is certainly not required of circular strings (or line strings), and that closing the shape by connecting the last segment to the first still does not result in a polygon, which is a two dimensional shape that has area. Despite being closed, this circle is still considered a one-dimensional shape with no area. As you’ll soon see, the curve polygon can be used to convert closed line shapes into true polygons.

Compound Curves

A compound curve is a set of circular strings, or circular strings combined with line strings, that form a desired curved shape. The end point of each element in the collection must match the starting point of the following element, so that compound curves are defined in a “connect-the-dots” fashion. The following code produces a compound curve and compares it with the equivalent geometry collection shape.

-- Compound curve
 DECLARE @CC geometry = '
  COMPOUNDCURVE(
   (4 4, 4 8),
   CIRCULARSTRING(4 8, 6 10, 8 8),
   (8 8, 8 4),
   CIRCULARSTRING(8 4, 2 3, 4 4)
  )'

-- Equivalent geometry collection
 DECLARE @GC geometry = '
  GEOMETRYCOLLECTION(
   LINESTRING(4 4, 4 8),
   CIRCULARSTRING(4 8, 6 10, 8 8),
   LINESTRING(8 8, 8 4),
   CIRCULARSTRING(8 4, 2 3, 4 4)
  )'

-- They both render the same shape in the spatial viewer
 SELECT @CC.STBuffer(.5)
 UNION ALL
 SELECT @GC.STBuffer(1.5)

This code creates a keyhole shape using a compound curve, and also creates an identical shape as a geometry collection (though notice that the LINESTRING keyword is not—and cannot—be specified when defining a compound curve). It then buffers both of them with different padding, so that the spatial viewer clearly shows the two identical shapes on top of one another, as shown:

Both the compound curve and the geometry collection yield identical shapes. In fact, the expression @CC.STEquals(@GC) which compares the two instances for equality returns 1 (for true). The STEquals method tests for “spatial equality,” meaning it returns true if two instances produce the same shape even if they are being rendered using different spatial data classes. Furthermore, recall that segments of a circular string can be made perfectly straight by positioning the anchor directly between the end points, meaning that the circular string offers yet a third option for producing the very same shape. So which one should you use? Comparing these spatial data classes will help you determine which one is best to use in different scenarios.

A geometry collection (which was already supported in SQL Server 2008) is the most accommodating, but carries the most storage overhead. Geometry collections can hold instances of any spatial data class, and the instances don’t need to be connected to (or intersected with) each other in any way. The collection simply holds a bunch of different shapes as a set, which in this example just happens to be several line strings and circular strings connected at their start and end points.

In contrast, the new compound curve class in SQL Server 2012 has the most constraints but is the most lightweight in terms of storage. It can only contain line strings or circular strings, and each segment’s start point must be connected to the previous segment’s end point (although it is most certainly not necessary to connect the first and last segments to form a closed shape as in this example). The DATALENGTH function shows the difference in storage requirements; DATALENGTH(@CC) returns 152 and DATALENGTH(@GC) returns 243. In our current example, DATALENGTH(@CC) returns 152 and DATALENGTH(@GC) returns 243. This means that the same shape requires 38% less storage space by using a compound curve instead of a geometry collection. A compound curve is also more storage-efficient than a multi-segment circular line string when straight lines are involved. This is because there is overhead for the mere potential of a curve, since the anchor point requires storage even when it’s position produces straight lines, whereas compound curves are optimized specifically to connect circular strings and (always straight) line strings.

Curve Polygons

A curve polygon is very similar to an ordinary polygon; like an ordinary polygon, a curve polygon specifies a “ring” that defines a closed shape, and can also specify additional inner rings to define “holes” inside the shape. The only fundamental difference between a polygon and a curve polygon is that the rings of a curve polygon can include circular shapes, whereas an ordinary polygon is composed exclusively with straight lines. Specifically, each ring in a curve polygon can consist of any combination of line strings, circular strings, and compound curves that collectively define the closed shape. For example, the following code produces a curve polygon with the same keyhole outline that I just demonstrated for the compound curve.

-- Curve polygon
 SELECT geometry::Parse('
  CURVEPOLYGON(
   COMPOUNDCURVE(
    (4 4, 4 8),
    CIRCULARSTRING(4 8, 6 10, 8 8),
    (8 8, 8 4),
    CIRCULARSTRING(8 4, 2 3, 4 4)
   )
  )')

This code has simply specified the same compound curve as the closed shape of a curve polygon. Although the shape is the same, the curve polygon is a two-dimensional object, whereas the compound curve version of the same shape is a one-dimensional object. This can be seen visually by the spatial viewer results, which shades the interior of the curve polygon as shown here:

Conclusion

Circular data support is an important new capability added to the spatial support in SQL Server 2012. In this blog post, I demonstrated the three new spatial data classes for curves and arcs: circular strings, compound curves, and curve polygons. Stay tuned for my next post (coming soon), for more powerful and fun spatial enhancements coming soon in SQL Server 2012!

SQL Server 2012 Windowing Functions Part 2 of 2: New Analytic Functions

This is the second half of my two-part article on windowing functions in SQL Server 2012. In Part 1, I explained the new running and sliding aggregation capabilities added to the OVER clause in SQL Server 2012. In this post, I’ll explain the new T-SQL analytic windowing functions. All of these functions operate using the windowing principles I explained in Part 1.

Eight New Analytic Functions

There are eight new analytic functions that have been added to T-SQL. All of them work in conjunction with an ordered window defined with an associated ORDER BY clause that can be optionally partitioned with a PARTITION BY clause and framed with a BETWEEN clause. The new functions are:

 • FIRST_VALUE
 • LAST_VALUE
 • LAG
 • LEAD
 • PERCENT_RANK
 • CUME_DIST
 • PERCENTILE_CONT
 • PERCENTILE_DISC

In the following code listing, the FIRST_VALUE, LAST_VALUE, LAG, and LEAD functions are used to analyze a set of orders at the product level.

DECLARE @Orders AS table(OrderDate date, ProductID int, Quantity int)
INSERT INTO @Orders VALUES
 ('2011-03-18', 142, 74),
 ('2011-04-11', 123, 95),
 ('2011-04-12', 101, 38),
 ('2011-05-21', 130, 12),
 ('2011-05-30', 101, 28),
 ('2011-07-25', 123, 57),
 ('2011-07-28', 101, 12)

SELECT
  OrderDate,
  ProductID,
  Quantity,
  WorstOn = FIRST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity),
  BestOn = LAST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
  PrevOn = LAG(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate),
  NextOn = LEAD(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate)
 FROM @Orders
 ORDER BY OrderDate

OrderDate   ProductID  Quantity  WorstOn     BestOn      PrevOn      NextOn
----------  ---------  --------  ----------  ----------  ----------  ----------
2011-03-18  142        74        2011-03-18  2011-03-18  NULL        NULL
2011-04-11  123        95        2011-07-25  2011-04-11  NULL        2011-07-25
2011-04-12  101        38        2011-07-28  2011-04-12  NULL        2011-05-30
2011-05-21  130        12        2011-05-21  2011-05-21  NULL        NULL
2011-05-30  101        28        2011-07-28  2011-04-12  2011-04-12  2011-07-28
2011-07-25  123        57        2011-07-25  2011-04-11  2011-04-11  NULL
2011-07-28  101        12        2011-07-28  2011-04-12  2011-05-30  NULL

In this query, four analytic functions specify an OVER clause that partitions the result set by ProductID. The product partitions defined for FIRST_VALUE and LAST_VALUE are sorted by Quantity, while the product partitions for LAG and LEAD are sorted by OrderDate. The full result set is sorted by OrderDate, so you need to visualize the sorted partition for each of the four functions to understand the output—the result set sequence is not the same as the row sequence used in the windowing functions.

FIRST_VALUE and LAST_VALUE

The WorstOn and BestOn columns use FIRST_VALUE and LAST_VALUE respectively to return the “worst” and “best” dates for the product in each partition. Performance is measured by quantity, so sorting each product’s partition by quantity will position the worst order at the first row in the partition and the best order at the last row in the partition. FIRST_VALUE and LAST_VALUE can return the value of any column (OrderDate, in this case), not just the aggregate column itself. For LAST_VALUE, it is also necessary to explicitly define a window over the entire partition with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Otherwise, as explained in my coverage of OVER clause enhancements in Part 1, the default window is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which frames (constrains) the window, and does not consider the remaining rows in the partition that are needed to obtain the highest quantity for LAST_VALUE.

In the output, notice that OrderDate, LowestOn, and HighestOn for the first order (product 142) are all the same value (3/18). This is because product 142 was only ordered once, so FIRST_VALUE and LAST_VALUE operate over a partition that has only this one row in it, with an OrderDate value of 3/18. The second row is for product 123, quantity 95, ordered on 4/11. Four rows ahead in the result set (not the partition) there is another order for product 123, quantity 57, placed on 7/25. This means that, for this product, FIRST_VALUE and LAST_VALUE operate over a partition that has these two rows in it, sorted by quantity. This positions the 7/25 order (quantity 57) first and the 4/11 (quantity 95) last within the partition. As a result, rows for product 123 report 7/25 for WorstDate and 4/11 for BestDate. The next order (product 101) appears two more times in the result set, creating a partition of three rows. Again, based on the Quantity sort of the partition, each row in the partition reports the product’s worst and best dates (which are 7/28 and 4/12, respectively).

LAG and LEAD

The PrevOn and NextOn columns use LAG and LEAD to return the previous and next date that each product was ordered. They specify an OVER clause that partitions by ProductId as before, but the rows in these partitions are sorted by OrderDate. Thus, the LAG and LEAD functions examine each product’s orders in chronological sequence, regardless of quantity. For each row in each partition, LAG is able to access previous (lagging) rows within the same partition. Similarly, LEAD can access subsequent (leading) rows within the same partition. The first parameter to LAG and LEAD specifies the column value to be returned from a lagging or leading row, respectively. The second parameter specifies the number of rows back or forward LAG and LEAD should seek within each partition, relative to the current row. The query passes OrderDate and 1 as parameters to LAG and LEAD, using product partitions that are ordered by date. Thus, the query returns the most recent past date, and nearest future date, that each product was ordered.

Because the first order’s product (142) was only ordered once, its single-row partition has no lagging or leading rows, and so LAG and LEAD both return NULL for PrevOn and NextOn. The second order (on 4/11) is for product 123, which was ordered again on 7/25, creating a partition with two rows sorted by OrderDate, with the 4/11 order positioned first and the 7/25 order positioned last within the partition. The first row in a multi-row window has no lagging rows, but at least one leading row. Similarly, the last order in a multi-row window has at least one lagging row, but no leading rows. As a result, the first order (4/11) reports NULL and 7/25 for PrevOn and NextOn (respectively), and the second order (7/25) reports 4/11 and NULL for PrevOn and NextOn (respectively). Product 101 was ordered three times, which creates a partition of three rows. In this partition, the second row has both a lagging row and a leading row. Thus, the three orders report PrevOn and NextOn values for product 101, respectively indicating NULL-5/30 for the first (4/12) order, 4/12-7/28 for the second (5/30) order, and 5/30-NULL for the third and last order.

The last functions to examine are PERCENT_RANK (rank distribution), CUME_DIST (cumulative distribution, or percentile), PERCENTILE_CONT (continuous percentile lookup), and PERCENTILE_DISC (discreet percentile lookup). The following queries demonstrate these functions, which are all closely related, by querying sales figures across each quarter of two years.

DECLARE @Sales table(Yr int, Qtr int, Amount money)
INSERT INTO @Sales VALUES
  (2010, 1, 5000), (2010, 2, 6000), (2010, 3, 7000), (2010, 4, 2000),
  (2011, 1, 1000), (2011, 2, 2000), (2011, 3, 3000), (2011, 4, 4000)

-- Distributed across all 8 quarters
SELECT
  Yr, Qtr, Amount,
  R = RANK() OVER(ORDER BY Amount),
  PR = PERCENT_RANK() OVER(ORDER BY Amount),
  CD = CUME_DIST() OVER(ORDER BY Amount)
 FROM @Sales
 ORDER BY Amount

-- Distributed (partitioned) by year with percentile lookups
SELECT
  Yr, Qtr, Amount,
  R = RANK() OVER(PARTITION BY Yr ORDER BY Amount),
  PR = PERCENT_RANK() OVER(PARTITION BY Yr ORDER BY Amount),
  CD = CUME_DIST() OVER(PARTITION BY Yr ORDER BY Amount),
  PD5 = PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr),
  PD6 = PERCENTILE_DISC(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr),
  PC5 = PERCENTILE_CONT(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr),
  PC6 = PERCENTILE_CONT(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr)
 FROM @Sales
 ORDER BY Yr, Amount

Yr    Qtr  Amount   R  PR                 CD
----  ---  -------  -  -----------------  -----
2011  1    1000.00  1  0                  0.125
2011  2    2000.00  2  0.142857142857143  0.375
2010  4    2000.00  2  0.142857142857143  0.375
2011  3    3000.00  4  0.428571428571429  0.5
2011  4    4000.00  5  0.571428571428571  0.625
2010  1    5000.00  6  0.714285714285714  0.75
2010  2    6000.00  7  0.857142857142857  0.875
2010  3    7000.00  8  1                  1

Yr    Qtr  Amount   R  PR                 CD    PD5      PD6      PC5   PC6
----  ---  -------  -  -----------------  ----  -------  -------  ----  ----
2010  4    2000.00  1  0                  0.25  5000.00  6000.00  5500  5800
2010  1    5000.00  2  0.333333333333333  0.5   5000.00  6000.00  5500  5800
2010  2    6000.00  3  0.666666666666667  0.75  5000.00  6000.00  5500  5800
2010  3    7000.00  4  1                  1     5000.00  6000.00  5500  5800
2011  1    1000.00  1  0                  0.25  2000.00  3000.00  2500  2800
2011  2    2000.00  2  0.333333333333333  0.5   2000.00  3000.00  2500  2800
2011  3    3000.00  3  0.666666666666667  0.75  2000.00  3000.00  2500  2800
2011  4    4000.00  4  1                  1     2000.00  3000.00  2500  2800

The new functions are all based on the RANK function introduced in SQL Server 2005. So both these queries also report on RANK, which will aid both in my explanation and your understanding of each of the new functions.

PERCENT_RANK and CUME_DIST

In the first query, PERCENT_RANK and CUME_DIST (aliased as PR and CD respectively) rank quarterly sales across the entire two year period. Look at the value returned by RANK (aliased as R). It ranks each row in the unpartitioned window (all eight quarters) by dollar amount. Both 2011Q2 and 2010Q4 are tied for $2,000 in sales, so RANK assigns them the same value (2). The next row break the tie, so RANK continues with 4, which accounts for the “empty slot” created by the two previous rows that were tied.

Now examine the values returned by PERCENT_RANK and CUME_DIST. Notice how they reflect the same information as RANK with decimal values ranging from 0 and 1. The only difference between the two is a slight variation in their formulaic implementation, such that PERCENT_RANK always starts with 0 while CUME_DIST always starts with a value greater than 0. Specifically, PERCENT_RANK returns (RANK – 1) / (N – 1) for each row, where N is the total number of rows in the window. This always returns 0 for the first (or only) row in the window. CUME_DIST returns RANK / N, which always returns a value greater than 0 for the first row in the window (which would be 1, if there’s only row). For windows with two or more rows, both functions return 1 for the last row in the window with decimal values distributed among all the other rows.

The second query examines the same sales figures, only this time the result set is partitioned by year. There are no ties within each year, so RANK assigns the sequential numbers 1 through 4 to each of the quarters, for 2010 and 2011, by dollar amount. You can see that PERCENT_RANK and CUME_DIST perform the same RANK calculations as explained for the first query (only, again, partitioned by year this time).

PERCENTILE_DISC and PERCENTILE_CONT

This query also demonstrates PERCENTILE_DISC and PERCENTILE_CONT. These very similar functions each accept a percentile parameter (the desired CUME_DIST value) and “reach in” to the window for the row at or near that percentile. The code demonstrates by calling both functions twice, once with a percentile parameter value of .5 and once with .6, returning columns aliased as PD5, PD6, PC5, and PC6. Both functions examine the CUME_DIST value for each row in the window to find the one nearest to .5 and .6. The subtle difference between them is that PERCENTILE_DISC will return a precise (discreet) value from the row with the matching percentile (or greater), while PERCENTILE_CONT interpolates a value based on a continuous range. Specifically, PERCENTILE_CONT returns a value ranging from the row matching the specified percentile—or a calculated value higher than that (based on the specified percentile) if there is no exact match—and the row with the next higher percentile in the window. This explains the values they return in this query.

Notice that these functions define their window ordering using ORDER BY in a WITHIN GROUP clause rather than in the OVER clause. Thus, you do not (and cannot) specify ORDER BY in the OVER clause. The OVER clause is still required, however, so OVER (with empty parentheses) must be specified even if you don’t want to partition using PARTITION BY.

For the year 2010, the .5 percentile (CUME_DIST value) is located exactly on quarter 1, which had $5,000 in sales. Thus PERCENTILE_DISC(.5) returns 5000. There is no row in the window with a percentile of .6, so PERCENTILE_DISC(.6) matches up against the first row with a percentile greater than or equal to .6, which is the row for quarter 2 with $6,000 in sales, and thus returns 6000. In both cases, PERCENTILE_DISC returns a discreet value from a row in the window at or greater than the specified percentile. The same calculations are performed for 2011, returning 2000 for PERCENTILE_DISC(.5) and 3000 for PERCENTILE_DISC(.6), corresponding to the $2,000 in sales for quarter 2 (percentile .5) and the $3,000 in sales for quarter 3 (percentile .75)

As I stated, PERCENTILE_CONT is very similar. It takes the same percentile parameter to find the row in the window matching that percentile. If there is no exact match, the function calculates a value based on the scale of percentiles distributed across the entire window, rather than looking ahead to the row having the next greater percentile value, as PERCENTILE_DISC does. Then it returns the median between that value and the value found in the row with the next greater percentile. For 2010, the .5 percentile matches up with 5000 (as before). The next percentile in the window is for .75 for 6000. The median between 5000 and 6000 is 5500 and thus, PERCENTILE_CONT(.5) returns 5500. There is no row in the window with a percentile of .6, so PERCENTILE_CONT(.6) calculates what the value for .6 would be (somewhere between 5000 and 6000, a bit closer to 5000) and then calculates the median between that value and the next percentile in the window (again, .75 for 6000). Thus, PERCENTILE_CONT(.6) returns 5800; slightly higher than the 5500 returned for PERCENTILE_CONT(.5).

Conclusion

This post explained the eight new analytic functions added to T-SQL in SQL Server 2012. These new functions, plus the running and sliding aggregation capabilities covered in Part 1, greatly expand the windowing capabilities of the OVER clause available since SQL Server 2005.

SQL Server 2012 Windowing Functions Part 1 of 2: Running and Sliding Aggregates

The first windowing capabilities appeared in SQL Server 2005 with the introduction of the OVER clause and a set of ranking functions (ROW_NUMBER, RANK, DENSE_RANK, and NTILE). The term “window,” as it is used here, refers to the scope of visibility from one row in a result set relative to neighboring rows in the same result set. By default, OVER produces a single window over the entire result set, but its associated PARTITION BY clause lets you divide the result set up into distinct windows—one per partition. Furthermore, its associated ORDER BY clause enables cumulative calculations within each window.

In addition to the four ranking functions, the OVER clause can be used with the traditional aggregate functions (SUM, COUNT, MIN, MAX, AVG). This is extremely useful, because it allows you to calculate aggregations without being forced to summarize all the detail rows with a GROUP BY clause. However, prior to SQL Server 2012, running and sliding calculations with an associated ORDER BY clause was supported only for the ranking functions. Using ORDER BY with OVER for any of the aggregate functions was not allowed. This prevents running and sliding aggregations, severely limited the windowing capability of OVER since its introduction in SQL Server 2005.

Fortunately, SQL Server 2012 finally addresses this shortcoming. In this blog post, the first in a two-part article, I’ll show you how to use OVER/ORDER BY with all the traditional aggregate functions in SQL Server 2012 to provide running aggregates within ordered windows and partitions. I’ll also show you how to frame windows using the ROWS or RANGE clause, which adjusts the size and scope of the window and enables sliding aggregations. SQL Server 2012 also introduces eight new analytic functions that are designed specifically to work with ordered (and optionally partitioned) windows using the OVER clause. I will cover those new analytic functions in Part 2.

To demonstrate running and sliding aggregates, create a table and populate it with sample financial transactions for several different accounts, as shown below. (Note the use of the DATEFROMPARTS function, also new in SQL Server 2012, which is used to construct a date value from year, month, and day parameters.)

CREATE TABLE TxnData (AcctId int, TxnDate date, Amount decimal)
GO

INSERT INTO TxnData (AcctId, TxnDate, Amount) VALUES
  (1, DATEFROMPARTS(2011, 8, 10), 500),  -- 5 transactions for acct 1
  (1, DATEFROMPARTS(2011, 8, 22), 250),
  (1, DATEFROMPARTS(2011, 8, 24), 75),
  (1, DATEFROMPARTS(2011, 8, 26), 125),
  (1, DATEFROMPARTS(2011, 8, 28), 175),
  (2, DATEFROMPARTS(2011, 8, 11), 500),  -- 8 transactions for acct 2
  (2, DATEFROMPARTS(2011, 8, 15), 50),
  (2, DATEFROMPARTS(2011, 8, 22), 5000),
  (2, DATEFROMPARTS(2011, 8, 25), 550),
  (2, DATEFROMPARTS(2011, 8, 27), 105),
  (2, DATEFROMPARTS(2011, 8, 27), 95),
  (2, DATEFROMPARTS(2011, 8, 29), 100),
  (2, DATEFROMPARTS(2011, 8, 30), 2500),
  (3, DATEFROMPARTS(2011, 8, 14), 500),  -- 4 transactions for acct 3
  (3, DATEFROMPARTS(2011, 8, 15), 600),
  (3, DATEFROMPARTS(2011, 8, 22), 25),
  (3, DATEFROMPARTS(2011, 8, 23), 125)

Running Aggregations

Used by itself, the OVER clause operates over a window that encompasses the entire result set of a query. Windows can be partitioned in your queries using OVER with PARTITION BY, enabling partition-level aggregations to be calculated for each window. And with SQL Server 2012, an ORDER BY clause can also be specified with OVER to achieve row-level running aggregations within each window. The following code demonstrates the use of OVER with ORDER BY to produce running aggregations:

SELECT AcctId, TxnDate, Amount,
  RAvg = AVG(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate),
  RCnt = COUNT(*)    OVER (PARTITION BY AcctId ORDER BY TxnDate),
  RMin = MIN(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate),
  RMax = MAX(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate),
  RSum = SUM(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate)
 FROM TxnData
 ORDER BY AcctId, TxnDate

AcctId TxnDate    Amount RAvg        RCnt RMin RMax RSum
------ ---------- ------ ----------- ---- ---- ---- ----
1      2011-08-10 500    500.000000  1    500  500  500
1      2011-08-22 250    375.000000  2    250  500  750
1      2011-08-24 75     275.000000  3    75   500  825
1      2011-08-26 125    237.500000  4    75   500  950
1      2011-08-28 175    225.000000  5    75   500  1125
2      2011-08-11 500    500.000000  1    500  500  500
2      2011-08-15 50     275.000000  2    50   500  550
2      2011-08-22 5000   1850.000000 3    50   5000 5550
 :

The results of this query are partitioned (windowed) by account. Within each window, the account’s running averages, counts, minimum/maximum values, and sums are ordered by transaction date, showing the chronologically accumulated values for each account. No ROWS clause or RANGE clause is specified, so ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is assumed by default. This yields a window frame size that spans from the beginning of the partition (the first row of each account) through the current row. When the account number changes, the previous window is “closed” and new calculations start running for a new window over the next account number.

Sliding Aggregations

You can narrow each account’s window by framing it with a ROWS BETWEEN n PRECEDING AND CURRENT ROW clause within the OVER clause. This enables sliding calculations, as demonstrated by this slightly modified version of the previous query:

SELECT AcctId, TxnDate, Amount,
  SAvg = AVG(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
  SCnt = COUNT(*)    OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS 2 PRECEDING),
  SMin = MIN(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS 2 PRECEDING),
  SMax = MAX(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS 2 PRECEDING),
  SSum = SUM(Amount) OVER (PARTITION BY AcctId ORDER BY TxnDate ROWS 2 PRECEDING)
 FROM TxnData
 ORDER BY AcctId, TxnDate

AcctId TxnDate    Amount SAvg        SCnt SMin SMax SSum
------ ---------- ------ ----------- ---- ---- ---- ----
1      2011-08-10 500    500.000000  1    500  500  500
1      2011-08-22 250    375.000000  2    250  500  750
1      2011-08-24 75     275.000000  3    75   500  825
1      2011-08-26 125    150.000000  3    75   250  450
1      2011-08-28 175    125.000000  3    75   175  375
2      2011-08-11 500    500.000000  1    500  500  500
2      2011-08-15 50     275.000000  2    50   500  550
2      2011-08-22 5000   1850.000000 3    50   5000 5550
 :

This query specifies ROWS BETWEEN 2 PRECEDING AND CURRENT ROW in the OVER clause for the RAvg column, overriding the default window size. Specifically, it frames the window within each account’s partition to a maximum of three rows: the current row, the row before it, and one more row before that one. Once the window expands to three rows, it stops growing and starts sliding down the subsequent rows until a new partition (the next account) is encountered. The BETWEEN…AND CURRENT ROW keywords that specify the upper bound of the window are assumed default, so to reduce code clutter, the other column definitions specify just the lower bound of the window with the shorter variation ROWS 2 PRECEDING.

Notice how the window “slides” within each account. For example, the sliding maximum for account 1 drops from 500 to 250 in the fourth row, because 250 is the largest value in the window of three rows that begins two rows earlier—and the 500 from the very first row is no longer visible in that window. Similarly, the sliding sum for each account is based on the defined window. Thus, the sliding sum of 375 on the last row of account 1 is the total sum of that row (175) plus the two preceding rows (75 + 125) only—not the total sum for all transactions in the entire account, as the running sum had calculated.

Using RANGE versus ROWS

Finally, RANGE can be used instead of ROWS to handle “ties” within a window. While ROWS treats each row in the window distinctly, RANGE will merge rows containing duplicate ORDER BY values, as demonstrated by the following query:

SELECT AcctId, TxnDate, Amount,
 SumByRows = SUM(Amount) OVER (ORDER BY TxnDate ROWS UNBOUNDED PRECEDING),
 SumByRange = SUM(Amount) OVER (ORDER BY TxnDate RANGE UNBOUNDED PRECEDING)
 FROM TxnData
 WHERE AcctId = 2
 ORDER BY TxnDate

AcctId TxnDate    Amount SumByRows SumByRange
------ ---------- ------ --------- ----------
2      2011-08-11 500    500       500
2      2011-08-15 50     550       550
2      2011-08-22 5000   5550      5550
2      2011-08-25 550    6100      6100
2      2011-08-27 105    6205      6300
2      2011-08-27 95     6300      6300
2      2011-08-29 100    6400      6400
2      2011-08-30 2500   8900      8900

In this result set, ROWS and RANGE both return the same values, with the exception of the fifth row. Because the fifth and sixth rows are both tied for the same date (8/27/2011), RANGE returns the combined running sum for both rows. The seventh row (for 8/29/2011) breaks the tie, and ROWS “catches up” with RANGE to return running totals for the rest of the window.

Conclusion

Windowing functions using the OVER clause have been greatly enhanced in SQL Server 2012. In addition to the 4 ranking functions, running and sliding calculations with OVER/ORDER BY is now supported for all the traditional aggregate functions as well. SQL Server 2012 also introduces eight new analytic functions that are designed specifically to work with ordered (and optionally partitioned) windows using the OVER clause. Stay tuned for Part 2, which will show you how to use these new analytic windowing functions.

Follow

Get every new post delivered to your Inbox.

Join 50 other followers