Using SQL Server 2008 Change Data Capture

Change Data Capture (CDC) is a powerful new feature in SQL Server 2008 (Enterprise edition only) that lets you easily and quickly capture all data changes made to tables in your database (it’s great for incrementally updating large data warehouses). Without resorting to triggers or specialized stored procedures, and without requiring any programming or code changes to your application, CDC will create change tables, monitor your database for all DML operations (inserts, updates, and deletes), and record the modified data into the change tables. CDC also provides special table-valued functions (TVFs) that enable easy querying of the change tables.

In this post, I’ll explain everything you need to know to use CDC, including a complete walkthough.

CDC Architecture

CDC is driven by a SQL Server Agent job that recognizes changes by monitoring the SQL Server transaction log. This provides much better performance than using triggers—and there’s no code to write or maintain with CDC (the tradeoff is somewhat more latency, which is often perfectly acceptable).

The following illustration taken from SQL Server Books Online depicts a high-level view of CDC architecture:

The first time sp_cdc_enable_table is executed on any table in a database, SQL Server also creates two SQL Server Agent jobs. The first is a change-capture job, which performs the actual transaction log monitoring to apply changes on the tracked table to the corresponding change table. The second is a cleanup job, which deletes rows from change tables after a configurable interval (three days, by default) and removes all CDC artifacts if the tracked table is dropped. Therefore, SQL Server Agent must be running the first time this procedure is run to CDC-enable a table on any database in the server instance. Subsequently, if SQL Server Agent stops running, changes to tracked tables will accumulate in the transaction log but not be applied to the change tables until SQL Server Agent is restarted.

Enabling CDC

Several new system stored procedures and TVFs are provided to enable, monitor, and consume database changes. To begin, you execute the sp_cdc_enable_db procedure to enable CDC on the current database. (You must be in the sysadmin role to do this.) When you run this procedure, a new cdc user, cdc schema, and CDC_admin role are created. These names are hard-coded, so in the event that you already have a user or schema named cdc, you will need to rename it before using CDC.

Once the database is CDC-enabled, you enable CDC on a given table by executing sp_cdc_enable_table. (You must be in the db_owner role to do this.) When you do that, several objects are created in the cdc schema: a change table and at least one (but possibly two) TVFs. Let’s look at each of these objects in turn.

When CDC is enabled on a table, SQL Server creates a change table in the cdc schema corresponding to the table on which CDC is being enabled. The change table will be populated with change data automatically by CDC and is assigned a name based on both the schema and the table being tracked. For example, when you enable CDC on the Employee table in the dbo schema (as we’ll do shortly), SQL Server creates a corresponding change table named cdc.dbo_Employee_CT that will record all changes to the dbo.Employee table. The schema of the tracked table (dbo in this case) is part of the change table name so that same-named tables from different schemas can all be unambiguously tracked in the cdc schema.

The sp_cdc_enable_table stored procedure has several optional parameters that give you a lot of flexibility. You can, among other options, specify your own name for the change table (as long as it’s unique in the database), a role that a user must belong to in order to query changes (if not in sysadmin or db_owner), which columns of the table should be tracked (you don’t need to track all of them), the filegroup on which to create the change table, and whether the SWITCH_PARTITION option of ALTER TABLE can be executed against the tracked table (which has very important implications). Consult SQL Server Books Online for more details on sp_cdc_enable_table parameters.

When you no longer require CDC on a particular table, you can call the sp_cdc_disable_table stored procedure on the table. This procedure drops the change table and the TVFs and updates the system metadata to reflect that the table is no longer tracked. When you no longer require CDC on the database, call the sp_cdc_disable_db stored procedure to completely disable CDC for the entire database.

Querying Change Data

While it is certainly possible to query the change table directly for change data, you will not normally do so. Instead, you will call a special TVF to query the change table for you. This TVF is also created for you by SQL Server automatically when the change table is created, and—like the change table—the TVF is also created in the cdc schema with a name based on the schema and table name of the tracked table. So again, if we’re tracking the dbo.Employee table, SQL Server creates a TVF named cdc.fn_cdc_get_all_changes_dbo_Employee that accepts parameters to select all changes that occur to dbo.Employee between any two desired points in time.

If you specify @supports_net_changes=1 when calling sp_cdc_enable_table, a second TVF is created for the change table as well. Like the first TVF, this one allows you to select changes between any two points in time, except that this TVF returns just the net (final) changes that occurred during that time frame. This means, for example, that if a row was added and then deleted within the time frame being queried using this second TVF, data for that row would not be returned—whereas the first TVF would return data that reflects both the insert and the delete. Similarly, if a row was updated several times within the same time frame, the first TVF will return each version while the second one will only return the last (final) version. This second TVF is named in a similar fashion as the first, except using the word net instead of all. So for dbo.Employee, this TVF is named cdc.fn_cdc_get_net_changes_dbo_Employee.

Neither of these TVFs accepts start and end times directly but instead require the range to be expressed as log sequence numbers (LSNs) by first calling sys.fn_cdc_map_time_to_lsn. So to query between two points in time, you call sys.fn_cdc_map_time_to_lsn twice—once for the start time and once for the end time—and then use the LSN values returned by this function as input values to the TVFs for querying change data.

CDC Walkthrough

Enough talking the talk. Now let’s walk the walk through the code for a complete example of using CDC.

CREATE DATABASE CDCDemo
GO

USE CDCDemo
GO

EXEC sp_cdc_enable_db

SELECT name, is_cdc_enabled FROM sys.databases
SELECT * FROM sys.schemas WHERE name = 'cdc'
SELECT * FROM sys.database_principals WHERE name = 'cdc'

After creating our sample database CDCDemo, we enable CDC on that database by calling EXEC sp_cdc_enable_db. The SELECT queries that follow demonstrate how to retrieve various kinds of CDC-related information. The first SELECT query shows how the is_cdc_enabled column in sys.databases returns true (1) or false (0), making it easy to find out which databases are CDC-enabled and which aren’t. The next two SELECT queries show how the new cdc schema and user can be found in sys.schemas and sys.database_principals.

CREATE TABLE Employee(
 EmployeeId    int NOT NULL PRIMARY KEY,
 EmployeeName  varchar(100) NOT NULL,
 EmailAddress  varchar(200) NOT NULL)

-- Enable CDC on the table (SQL Server Agent *should* be running when you run this)
EXEC sp_cdc_enable_table
 @source_schema = N'dbo',
 @source_name = N'Employee',
 @role_name = N'CDC_admin',
 @capture_instance = N'dbo_Employee',
 @supports_net_changes = 1

-- Show CDC-enabled tables
SELECT name, is_tracked_by_cdc FROM sys.tables

This code creates the Employee table (which has only three columns to keep our example simple). CDC is then enabled on the Employee table by calling EXEC sp_cdc_enable_table and passing parameters that identify the Employee table in the dbo schema for change capture. (Remember that SQL Server Agent must be running at this point.) The next SELECT statement shows how to query the is_tracked_by_cdc column in sys.tables to find out which tables are CDC-enabled and which aren’t.

Now let’s insert, update, and delete some data, and watch CDC in action:

-- Insert John, Dan, Jay, and Jeff
INSERT INTO Employee VALUES(1, 'John Smith', 'john.smith@ourcorp.com')
INSERT INTO Employee VALUES(2, 'Dan Park', 'dan.park@ourcorp.com')
INSERT INTO Employee VALUES(3, 'Jay Hamlin', 'jay.hamlin@ourcorp.com')
INSERT INTO Employee VALUES(4, 'Jeff Hay', 'jeff.hay@ourcorp.com')

-- Results from insert
WAITFOR DELAY '00:00:05' -- Let CDC catch up
SELECT * FROM Employee
SELECT * FROM cdc.dbo_employee_ct

-- Delete Jeff
DELETE Employee WHERE EmployeeId = 4

-- Results from delete
WAITFOR DELAY '00:00:05' -- Let CDC catch up
SELECT * FROM Employee
SELECT * FROM cdc.dbo_employee_ct

-- Update Dan and Jay
UPDATE Employee SET EmployeeName = 'Dan P. Park' WHERE EmployeeId = 2
UPDATE Employee SET EmployeeName = 'Jay K. Hamlin' WHERE EmployeeId = 3

-- Results from update
WAITFOR DELAY '00:00:05' -- Let CDC catch up
SELECT * FROM Employee
SELECT * FROM cdc.dbo_employee_ct

The preceding code performs a mix of INSERT, UPDATE, and DELETE operations against the Employee table to simulate database activity and engage the capture process. These operations are accompanied by SELECT statements that query the change table cdc.dbo_employee_ct (deliberate WAITFOR delays are strategically placed to give the CDC job a chance to pick up the changes from the transaction log and dump them into the change table). This is done purely to demonstrate that change data for the Employee table is being captured to the change table. However (as mentioned), you should normally not query the change tables directly in this manner and should instead use the generated TVF(s) to extract change information about the Employee table, as demonstrated by the following code:

-- To access change data, use the CDC TVFs, not the change tables directly
DECLARE @begin_time datetime
DECLARE @end_time datetime
DECLARE @from_lsn binary(10)
DECLARE @to_lsn binary(10)
SET @begin_time = GETDATE() - 1
SET @end_time = GETDATE()

-- Map the time interval to a CDC LSN range
SELECT @from_lsn =
 sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time)

SELECT @to_lsn =
 sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time)

SELECT @begin_time AS BeginTime, @end_time AS EndTime
SELECT @from_lsn AS FromLSN, @to_lsn AS ToLSN

-- Return the changes occurring within the query window.

-- First, all changes that occurred...
SELECT *
 FROM cdc.fn_cdc_get_all_changes_dbo_employee(@from_lsn, @to_lsn, N'all')

-- Then, net changes, that is, final state...
SELECT *
 FROM cdc.fn_cdc_get_net_changes_dbo_employee(@from_lsn, @to_lsn, N'all')

Recall that enabling CDC on the Employee table creates a TVF for retrieving all changes made to the table between any two points in time. Recall too that by specifying @supports_net_changes = 1, this also creates a second TVF for retrieving only the net changes made between any two points in time. The difference between all changes and net changes is clearly illustrated when we call both of these TVFs and compare their results.

To call either of the generated TVFs, you need to provide a value range that defines the window of time during which the change data you want returned was captured. As already explained, this range is expressed using LSN values, which you can obtain by calling sys.fn_cdc_map_time_to_lsn and passing in the desired start and end points in time. So first we establish a time range for the past 24 hours, which we obtain by assigning GETDATE() – 1 and GETDATE() to the start and end time variables. Then we call sys.fn_cdc_map_time_to_lsn on the start and end time variables to obtain the LSN values corresponding to the last 24 hours. (Note that the starting LSN gets adjusted automatically to compensate for the fact there are no LSNs from 24 hours ago, as does the ending LSN, since there might not be any from a moment ago either.) We then issue two SELECT statements so that we can view the time and LSN range values. Equipped with the LSN range values, we issue two more SELECT statements. The first statement queries the range against the all changes TVF, and the second statement queries the range against the net changes TVF. Comparing the results of these two queries clearly illustrates the difference between the TVFs, as shown here:

BeginTime               EndTime
----------------------- -----------------------
2010-01-08 23:42:55.567 2010-01-09 23:42:55.567

(1 row(s) affected)

FromLSN                ToLSN
---------------------- ----------------------
0x0000001A0000001E0039 0x00000020000000A50001

(1 row(s) affected)

__$start_lsn           __$seqval              __$operation __$update_mask EmployeeId EmployeeName    EmailAddress
---------------------- ---------------------- ------------ -------------- ---------- --------------- -----------------------
0x0000001E0000007C0013 0x0000001E0000007C0012 2            0x07           1          John Smith      john.smith@ourcorp.com
0x0000001E000000800003 0x0000001E000000800002 2            0x07           2          Dan Park        dan.park@ourcorp.com
0x0000001E000000810003 0x0000001E000000810002 2            0x07           3          Jay Hamlin      jay.hamlin@ourcorp.com
0x0000001E000000820003 0x0000001E000000820002 2            0x07           4          Jeff Hay        jeff.hay@ourcorp.com
0x0000001E000000850004 0x0000001E000000850002 1            0x07           4          Jeff Hay        jeff.hay@ourcorp.com
0x0000001E000001AC0004 0x0000001E000001AC0002 4            0x02           2          Dan P. Park     dan.park@ourcorp.com
0x0000001E000001AE0004 0x0000001E000001AE0002 4            0x02           3          Jay K. Hamlin   jay.hamlin@ourcorp.com

(7 row(s) affected)

__$start_lsn           __$operation __$update_mask EmployeeId EmployeeName      EmailAddress
---------------------- ------------ -------------- ---------- ----------------- --------------------------
0x0000001E0000007C0013 2            NULL           1          John Smith        john.smith@ourcorp.com
0x0000001E000001AC0004 2            NULL           2          Dan P. Park       steven.jones@ourcorp.com
0x0000001E000001AE0004 2            NULL           3          Jay K. Hamlin     jay.hamlin@ourcorp.com

(3 row(s) affected)

The first result set includes all the information about all changes made during the specified LSN range, including all interim changes. Thus, the information returned from the first all changes TVF shows every stage of change, or seven changes in total. In our scenario, John was inserted once and then never changed. So only his insert (__$operation value 2) is shown. Dan and Jay were inserted (__$operation value 2) and updated (__$operation value 4), so both changes (insert and update) are returned for each of them. Jeff, on the other hand, was deleted (__$operation value 1) after being inserted, and so both changes (insert and delete) are returned for James. Note also that this TVF also returns a bitmask field in column __$update_mask that you can parse to determine, on a row-by-row basis, which fields were changed and which fields weren’t.

The second result set includes only the final changes made during the specified LSN range. So for the same LSN range, we receive only three change records from the second net changes TVF, each of which provides the final column values in the specified LSN range. John appears only once as in the previous query, since he was inserted only once and never modified or deleted within the LSN range. However, although Dan and Jay were inserted and updated, they each appear only once (with their final values for the LSN range), and not twice as in the previous query. And since Jeff was inserted and deleted within the window of time specified by the LSN range, no change data for Jeff is returned at all by the net changes TVF.

Summary

CDC can be used to address many common scenarios for capturing change data; auditing, history, data warehousing, and so on. The appeal to using CDC is its automatic and transparent operation over any existing database without requiring any changes to database schema or application code. It’s easy and straightforward, and definitely worth investigating!

Windows Virtual PC Quirks and Workarounds

If you’re moving from Microsoft Virtual PC 2007 to Windows Virtual PC on Windows 7, get ready for a few changes, plus some things that just don’t work quite right, especially with undo disks. After experiencing the pain of sorting it all out, I thought I’d write this post in the hopes that others can be spared the same grief.

Don’t get me wrong. Overall, I find Windows Virtual PC has some very nice enhancements over Microsoft Virtual PC 2007—particularly the new Integration Features, which supports guest machine access to host machine resources via a network shares (replacing Virtual Machine Additions that supported drag-and-drop between host and guest machines). But here are a few things that just might trip you up at first…

Shutdown options have been reorganized

I find the new arrangement annoying, despite claims that it’s now “less confusing” (http://blogs.msdn.com/virtual_pc_guy/archive/2009/08/18/windows-virtual-pc-and-undo-disks.aspx). In any case, you won’t find the option to commit changes to the undo disk when shutting down the virtual machine. Instead, you can choose only to hibernate, shut down (properly, keep undo disk), turn off (pull the plug on the virtual machine, keep undo disk), or turn off and discard changes (pull the plug and delete undo disk), and to make that choice your default for the future:

To commit the undo disk changes, you need to revisit the Settings for the virtual machine, open the Undo Disk settings page, and finally click Apply Changes:

Pain in the butt, if you ask me, but such is life. According to Microsoft, this is “easier,” so who am I to argue?

Finding the Undo disk location

According to the docs, the .vud undo disk file is created and stored in the same folder as the virtual machine configuration file. That’s a bit confusing, because there are actually two virtual machine configuration files: the .vmcx file that is visible in x:\Users\YourName\Virtual Machines, and the .vmc file that is stored in the hidden folder x:\Users\YourName\AppData\Local\Microsoft\Windows Virtual PC\Virtual Machines, and the undo disk is stored in the latter (hidden) location. If you open the .vmcx file in Notepad, you’ll see that it refers to the .vmc file in the hidden folder:

For example, after starting virtual machine SQL08DEV with undo disks enabled, you won’t find the .vud (undo disk) file alongside the .vmcx file in x:\Users\YourName\Virtual Machines:

Before banging your head against the wall or jumping to the conclusion that Microsoft’s documentation is wrong (and I did both), just navigate to the real (hidden) directory and you’ll find it there alongside the .vmc file (and the .vmc.vpcbackup [backup] and .vsv [saved state] files):

Undo disks won’t merge if vmcx/vhd files are on separate drives

In Microsoft Virtual PC 2007, you gain superb performance by running your virtual hard disk (.vhd file) on an external hard drive (or even a flash drive), while the VM configuration (.vmc file) remains on the system disk. An undo disk automatically gets created in the same folder as the .vmc file (the system disk). The VM operates in this mode by exclusively reading from the external disk’s .vhd file, while simultaneously writing to the undo disk’s .vud file on the system disk—ipso facto, super-fast.

While this mode is still supported in Windows Virtual PC on Windows 7, there seems to be a bug that prevents you from eventually committing the undo disk changes back to the .vhd file on the external drive. No matter what—even after rebooting—Windows Virtual PC insists that the parent disk is inaccessible, and refuses to perform the merge. Of course if you don’t mind discarding the undo disk, this isn’t a problem (discarding the undo disk is a wonderful strategy for rolling back to a consistent state that sets the stage for a predictable and repeatable environment for presentations, development, or whatever). But when it comes time to merge, you’re hosed! Until they fix this, I’m forced to keep my .vhd files on my system disk (performance isn’t too bad though, on my screaming new Dell M6400 Intel Core Quad 12GB RAID 1 monster). Of course, if you’re not uncomfortable directly editing the XML in the vmc file, you may be able to shuffle the vhd back and forth between external and system disks. Hardly an elegant solution, and certainly not one that I’d recommend.

Integration Features won’t work if the virtual machine user account has no password

Integration Features in Windows Virtual PC has some nice advantages over the VM Additions in Microsoft Virtual PC 2007, but you will run into a problem enabling Integration Features if your virtual machine logs in with an account that has no password. That shouldn’t be a problem in most cases, since user accounts usually have assigned passwords, even on virtual machines. But some virtual machines, such as those used for demos, may have empty passwords. Unfortunately, the error message that Windows Virtual PC throws at you when you try to enable Integration Features on such virtual machines is anything but helpful. It would be nice to get a simple message telling you to just assign a password to your virtual machine user account. But instead, you simply get a failed authentication message with no indication that all will be fine if you just assign the missing password. You can try till you’re blue in the face (like I did), or just assign the password and move on.

Summary

Virtualization is awesome, and overall, Microsoft has done a good job of it with their Virtual PC products. Still, nothing in this life perfect—including software—so you need to watch out for these pitfalls when moving to Windows Virtual PC on Windows 7.