Change Data Capture (CDC) is a powerful new feature in SQL Server 2008 (Enterprise edition only) that lets you easily and quickly capture all data changes made to tables in your database (it’s great for incrementally updating large data warehouses). Without resorting to triggers or specialized stored procedures, and without requiring any programming or code changes to your application, CDC will create change tables, monitor your database for all DML operations (inserts, updates, and deletes), and record the modified data into the change tables. CDC also provides special table-valued functions (TVFs) that enable easy querying of the change tables.
In this post, I’ll explain everything you need to know to use CDC, including a complete walkthough.
CDC is driven by a SQL Server Agent job that recognizes changes by monitoring the SQL Server transaction log. This provides much better performance than using triggers—and there’s no code to write or maintain with CDC (the tradeoff is somewhat more latency, which is often perfectly acceptable).
The following illustration taken from SQL Server Books Online depicts a high-level view of CDC architecture:
The first time sp_cdc_enable_table is executed on any table in a database, SQL Server also creates two SQL Server Agent jobs. The first is a change-capture job, which performs the actual transaction log monitoring to apply changes on the tracked table to the corresponding change table. The second is a cleanup job, which deletes rows from change tables after a configurable interval (three days, by default) and removes all CDC artifacts if the tracked table is dropped. Therefore, SQL Server Agent must be running the first time this procedure is run to CDC-enable a table on any database in the server instance. Subsequently, if SQL Server Agent stops running, changes to tracked tables will accumulate in the transaction log but not be applied to the change tables until SQL Server Agent is restarted.
Several new system stored procedures and TVFs are provided to enable, monitor, and consume database changes. To begin, you execute the sp_cdc_enable_db procedure to enable CDC on the current database. (You must be in the sysadmin role to do this.) When you run this procedure, a new cdc user, cdc schema, and CDC_admin role are created. These names are hard-coded, so in the event that you already have a user or schema named cdc, you will need to rename it before using CDC.
Once the database is CDC-enabled, you enable CDC on a given table by executing sp_cdc_enable_table. (You must be in the db_owner role to do this.) When you do that, several objects are created in the cdc schema: a change table and at least one (but possibly two) TVFs. Let’s look at each of these objects in turn.
When CDC is enabled on a table, SQL Server creates a change table in the cdc schema corresponding to the table on which CDC is being enabled. The change table will be populated with change data automatically by CDC and is assigned a name based on both the schema and the table being tracked. For example, when you enable CDC on the Employee table in the dbo schema (as we’ll do shortly), SQL Server creates a corresponding change table named cdc.dbo_Employee_CT that will record all changes to the dbo.Employee table. The schema of the tracked table (dbo in this case) is part of the change table name so that same-named tables from different schemas can all be unambiguously tracked in the cdc schema.
The sp_cdc_enable_table stored procedure has several optional parameters that give you a lot of flexibility. You can, among other options, specify your own name for the change table (as long as it’s unique in the database), a role that a user must belong to in order to query changes (if not in sysadmin or db_owner), which columns of the table should be tracked (you don’t need to track all of them), the filegroup on which to create the change table, and whether the SWITCH_PARTITION option of ALTER TABLE can be executed against the tracked table (which has very important implications). Consult SQL Server Books Online for more details on sp_cdc_enable_table parameters.
When you no longer require CDC on a particular table, you can call the sp_cdc_disable_table stored procedure on the table. This procedure drops the change table and the TVFs and updates the system metadata to reflect that the table is no longer tracked. When you no longer require CDC on the database, call the sp_cdc_disable_db stored procedure to completely disable CDC for the entire database.
Querying Change Data
While it is certainly possible to query the change table directly for change data, you will not normally do so. Instead, you will call a special TVF to query the change table for you. This TVF is also created for you by SQL Server automatically when the change table is created, and—like the change table—the TVF is also created in the cdc schema with a name based on the schema and table name of the tracked table. So again, if we’re tracking the dbo.Employee table, SQL Server creates a TVF named cdc.fn_cdc_get_all_changes_dbo_Employee that accepts parameters to select all changes that occur to dbo.Employee between any two desired points in time.
If you specify @supports_net_changes=1 when calling sp_cdc_enable_table, a second TVF is created for the change table as well. Like the first TVF, this one allows you to select changes between any two points in time, except that this TVF returns just the net (final) changes that occurred during that time frame. This means, for example, that if a row was added and then deleted within the time frame being queried using this second TVF, data for that row would not be returned—whereas the first TVF would return data that reflects both the insert and the delete. Similarly, if a row was updated several times within the same time frame, the first TVF will return each version while the second one will only return the last (final) version. This second TVF is named in a similar fashion as the first, except using the word net instead of all. So for dbo.Employee, this TVF is named cdc.fn_cdc_get_net_changes_dbo_Employee.
Neither of these TVFs accepts start and end times directly but instead require the range to be expressed as log sequence numbers (LSNs) by first calling sys.fn_cdc_map_time_to_lsn. So to query between two points in time, you call sys.fn_cdc_map_time_to_lsn twice—once for the start time and once for the end time—and then use the LSN values returned by this function as input values to the TVFs for querying change data.
Enough talking the talk. Now let’s walk the walk through the code for a complete example of using CDC.
CREATE DATABASE CDCDemo GO USE CDCDemo GO EXEC sp_cdc_enable_db SELECT name, is_cdc_enabled FROM sys.databases SELECT * FROM sys.schemas WHERE name = 'cdc' SELECT * FROM sys.database_principals WHERE name = 'cdc'
After creating our sample database CDCDemo, we enable CDC on that database by calling EXEC sp_cdc_enable_db. The SELECT queries that follow demonstrate how to retrieve various kinds of CDC-related information. The first SELECT query shows how the is_cdc_enabled column in sys.databases returns true (1) or false (0), making it easy to find out which databases are CDC-enabled and which aren’t. The next two SELECT queries show how the new cdc schema and user can be found in sys.schemas and sys.database_principals.
CREATE TABLE Employee( EmployeeId int NOT NULL PRIMARY KEY, EmployeeName varchar(100) NOT NULL, EmailAddress varchar(200) NOT NULL) -- Enable CDC on the table (SQL Server Agent *should* be running when you run this) EXEC sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'Employee', @role_name = N'CDC_admin', @capture_instance = N'dbo_Employee', @supports_net_changes = 1 -- Show CDC-enabled tables SELECT name, is_tracked_by_cdc FROM sys.tables
This code creates the Employee table (which has only three columns to keep our example simple). CDC is then enabled on the Employee table by calling EXEC sp_cdc_enable_table and passing parameters that identify the Employee table in the dbo schema for change capture. (Remember that SQL Server Agent must be running at this point.) The next SELECT statement shows how to query the is_tracked_by_cdc column in sys.tables to find out which tables are CDC-enabled and which aren’t.
Now let’s insert, update, and delete some data, and watch CDC in action:
-- Insert John, Dan, Jay, and Jeff INSERT INTO Employee VALUES(1, 'John Smith', 'firstname.lastname@example.org') INSERT INTO Employee VALUES(2, 'Dan Park', 'email@example.com') INSERT INTO Employee VALUES(3, 'Jay Hamlin', 'firstname.lastname@example.org') INSERT INTO Employee VALUES(4, 'Jeff Hay', 'email@example.com') -- Results from insert WAITFOR DELAY '00:00:05' -- Let CDC catch up SELECT * FROM Employee SELECT * FROM cdc.dbo_employee_ct -- Delete Jeff DELETE Employee WHERE EmployeeId = 4 -- Results from delete WAITFOR DELAY '00:00:05' -- Let CDC catch up SELECT * FROM Employee SELECT * FROM cdc.dbo_employee_ct -- Update Dan and Jay UPDATE Employee SET EmployeeName = 'Dan P. Park' WHERE EmployeeId = 2 UPDATE Employee SET EmployeeName = 'Jay K. Hamlin' WHERE EmployeeId = 3 -- Results from update WAITFOR DELAY '00:00:05' -- Let CDC catch up SELECT * FROM Employee SELECT * FROM cdc.dbo_employee_ct
The preceding code performs a mix of INSERT, UPDATE, and DELETE operations against the Employee table to simulate database activity and engage the capture process. These operations are accompanied by SELECT statements that query the change table cdc.dbo_employee_ct (deliberate WAITFOR delays are strategically placed to give the CDC job a chance to pick up the changes from the transaction log and dump them into the change table). This is done purely to demonstrate that change data for the Employee table is being captured to the change table. However (as mentioned), you should normally not query the change tables directly in this manner and should instead use the generated TVF(s) to extract change information about the Employee table, as demonstrated by the following code:
-- To access change data, use the CDC TVFs, not the change tables directly DECLARE @begin_time datetime DECLARE @end_time datetime DECLARE @from_lsn binary(10) DECLARE @to_lsn binary(10) SET @begin_time = GETDATE() - 1 SET @end_time = GETDATE() -- Map the time interval to a CDC LSN range SELECT @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time) SELECT @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time) SELECT @begin_time AS BeginTime, @end_time AS EndTime SELECT @from_lsn AS FromLSN, @to_lsn AS ToLSN -- Return the changes occurring within the query window. -- First, all changes that occurred... SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_employee(@from_lsn, @to_lsn, N'all') -- Then, net changes, that is, final state... SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_employee(@from_lsn, @to_lsn, N'all')
Recall that enabling CDC on the Employee table creates a TVF for retrieving all changes made to the table between any two points in time. Recall too that by specifying @supports_net_changes = 1, this also creates a second TVF for retrieving only the net changes made between any two points in time. The difference between all changes and net changes is clearly illustrated when we call both of these TVFs and compare their results.
To call either of the generated TVFs, you need to provide a value range that defines the window of time during which the change data you want returned was captured. As already explained, this range is expressed using LSN values, which you can obtain by calling sys.fn_cdc_map_time_to_lsn and passing in the desired start and end points in time. So first we establish a time range for the past 24 hours, which we obtain by assigning GETDATE() – 1 and GETDATE() to the start and end time variables. Then we call sys.fn_cdc_map_time_to_lsn on the start and end time variables to obtain the LSN values corresponding to the last 24 hours. (Note that the starting LSN gets adjusted automatically to compensate for the fact there are no LSNs from 24 hours ago, as does the ending LSN, since there might not be any from a moment ago either.) We then issue two SELECT statements so that we can view the time and LSN range values. Equipped with the LSN range values, we issue two more SELECT statements. The first statement queries the range against the all changes TVF, and the second statement queries the range against the net changes TVF. Comparing the results of these two queries clearly illustrates the difference between the TVFs, as shown here:
BeginTime EndTime ----------------------- ----------------------- 2010-01-08 23:42:55.567 2010-01-09 23:42:55.567 (1 row(s) affected) FromLSN ToLSN ---------------------- ---------------------- 0x0000001A0000001E0039 0x00000020000000A50001 (1 row(s) affected) __$start_lsn __$seqval __$operation __$update_mask EmployeeId EmployeeName EmailAddress ---------------------- ---------------------- ------------ -------------- ---------- --------------- ----------------------- 0x0000001E0000007C0013 0x0000001E0000007C0012 2 0x07 1 John Smith firstname.lastname@example.org 0x0000001E000000800003 0x0000001E000000800002 2 0x07 2 Dan Park email@example.com 0x0000001E000000810003 0x0000001E000000810002 2 0x07 3 Jay Hamlin firstname.lastname@example.org 0x0000001E000000820003 0x0000001E000000820002 2 0x07 4 Jeff Hay email@example.com 0x0000001E000000850004 0x0000001E000000850002 1 0x07 4 Jeff Hay firstname.lastname@example.org 0x0000001E000001AC0004 0x0000001E000001AC0002 4 0x02 2 Dan P. Park email@example.com 0x0000001E000001AE0004 0x0000001E000001AE0002 4 0x02 3 Jay K. Hamlin firstname.lastname@example.org (7 row(s) affected) __$start_lsn __$operation __$update_mask EmployeeId EmployeeName EmailAddress ---------------------- ------------ -------------- ---------- ----------------- -------------------------- 0x0000001E0000007C0013 2 NULL 1 John Smith email@example.com 0x0000001E000001AC0004 2 NULL 2 Dan P. Park firstname.lastname@example.org 0x0000001E000001AE0004 2 NULL 3 Jay K. Hamlin email@example.com (3 row(s) affected)
The first result set includes all the information about all changes made during the specified LSN range, including all interim changes. Thus, the information returned from the first all changes TVF shows every stage of change, or seven changes in total. In our scenario, John was inserted once and then never changed. So only his insert (__$operation value 2) is shown. Dan and Jay were inserted (__$operation value 2) and updated (__$operation value 4), so both changes (insert and update) are returned for each of them. Jeff, on the other hand, was deleted (__$operation value 1) after being inserted, and so both changes (insert and delete) are returned for James. Note also that this TVF also returns a bitmask field in column __$update_mask that you can parse to determine, on a row-by-row basis, which fields were changed and which fields weren’t.
The second result set includes only the final changes made during the specified LSN range. So for the same LSN range, we receive only three change records from the second net changes TVF, each of which provides the final column values in the specified LSN range. John appears only once as in the previous query, since he was inserted only once and never modified or deleted within the LSN range. However, although Dan and Jay were inserted and updated, they each appear only once (with their final values for the LSN range), and not twice as in the previous query. And since Jeff was inserted and deleted within the window of time specified by the LSN range, no change data for Jeff is returned at all by the net changes TVF.
CDC can be used to address many common scenarios for capturing change data; auditing, history, data warehousing, and so on. The appeal to using CDC is its automatic and transparent operation over any existing database without requiring any changes to database schema or application code. It’s easy and straightforward, and definitely worth investigating!