SQL Server 2016 Dynamic Data Masking (DDM)

Introducing Dynamic Data Masking (DDM)

In this blog post, I’ll show you how to shield sensitive data from unauthorized users using Dynamic Data Masking, or DDM.

DDM lets you hide data, not by encrypting it, but by masking it. So there are no data changes in your tables. Rather, SQL Server automatically hides the actual data from all query results for users that don’t have permission to see it.

For example, take these query results:

MemberID    FirstName    LastName      Phone        Email
----------- ------------ ------------- ------------ --------------------------
1           Roberto      Tamburello    555.123.4567 RTamburello@contoso.com
2           Janice       Galvin        555.123.4568 JGalvin@contoso.com.co
3           Dan          Mu            555.123.4569 ZMu@contoso.net
4           Jane         Smith         454.222.5920 Jane.Smith@hotmail.com
5           Danny        Jones         674.295.7950 Danny.Jones@hotmail.com

With DDM, you can serve up the same results with the FirstName, Phone, and Email columns masked as follows:

MemberID    FirstName    LastName      Phone        Email
----------- ------------ ------------- ------------ --------------------------
1           Ro...to      Tamburello    xxxx         RXXX@XXXX.com
2           Ja...ce      Galvin        xxxx         JXXX@XXXX.com
3           ...          Mu            xxxx         ZXXX@XXXX.com
4           ...          Smith         xxxx         JXXX@XXXX.com
5           Da...ny      Jones         xxxx         DXXX@XXXX.com

DDM has four pre-defined masking functions:

default – You can completely hide data with the default function (yes, the function is literally named default). It masks the entire column value returned from the database, so that it’s completely hidden in the results, and it works with virtually any data type.

partial – The partial function lets you reveal some, but not all, of the underlying data, and it works only with string types. With partial, you can show any number of characters at the beginning of a string, at the end of a string, or at both the beginning and the end of a string. The entire middle portion of the string is hidden, and gets replaced by a custom mask that you supply.

email – The email function is a bit strange, because it doesn’t really offer anything that you can’t achieve with the partial function. It’s actually just a convenient shorthand for the partial function that exposes only the first character of a string, and masks the rest with XXX@XXXX.com. In no way does the email function examine the string that it’s masking to see if it’s actually formatted as an email address; so any column you use this function with is going to look like an email address in your query results, regardless.

random – Finally, the random function is available for numeric columns. Like the default function, it completely hides the underlying value, but unlike default – which hides numeric columns by always masking them with a zero – the random function lets you supply a range of numbers from which a value is randomly chosen every time the data is queried.

As I said, DDM does not physically modify any data in the database. Instead, it masks the data on the fly, as it is queried by users that lack the permission to see the real data. This is a huge win for many common scenarios involving sensitive data. For example, in the healthcare industry, there are strict regulations around the sharing of so-called PHI, or protected health information. These regulations often make it hard to give a developer access to a decent sampling of live production data. DDM helps solve this problem, because administrators can now give developers access to production data, with all the sensitive personal data masked from view. This process is often referred to as “anonymizing” the data.

At the same time, because everything is handled internally by SQL Server, there is no additional development effort needed at the application level; there’s no extra code to write, you just define your masks, and you’re done.

Masking Table Columns

DDM is very easy to use. When you create a table with columns that you’d like to mask, you simply include some additional MASKED WITH syntax, to tell SQL Server how to apply the masking:

CREATE TABLE Customer(
  FirstName varchar(20)
    MASKED WITH (FUNCTION='partial(1, "...", 0)'),
  LastName varchar(20),
  Phone varchar(12)
    MASKED WITH (FUNCTION='default()'),
  Email varchar(200)
    MASKED WITH (FUNCTION='email()'),
  Balance money
    MASKED WITH (FUNCTION='random(1000, 5000)'))

In this example, we’re using the partial function to partially mask the first name column. Specifically, the first parameter reveals just the first character of the first name, the second parameter is the custom mask to follow the first character with three dots, and the last parameter tells SQL Server to reveal none of the end characters of the first name. Using the default function for the phone column completely hides the phone number, the email function reveals the first character of the email column, followed by the mask XXX@XXXX.com, and the random function is being used here to mask the Balance column with random numbers between 1,000 and 5,000.

If you already have a table with columns that you’d like to mask, it’s just as easy. Simply use the ADD MASKED WITH syntax with an ALTER TABLE, ALTER COLUMN statement, like so:

ALTER TABLE Customer
  ALTER COLUMN LastName
    ADD MASKED WITH (FUNCTION='default()')
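
To see the masks in action, you can impersonate a user that lacks permission to see the real data. Here is a minimal sketch, assuming the Customer table has been created and altered as shown above. The MaskedReader user and the sample row are my own inventions for illustration, not part of any real schema.

```sql
-- Sample row (hypothetical data), inserted by a privileged user
INSERT INTO Customer (FirstName, LastName, Phone, Email, Balance)
VALUES ('Roberto', 'Tamburello', '555.123.4567', 'RTamburello@contoso.com', 1500.00);

-- MaskedReader is a hypothetical user with SELECT permission only
CREATE USER MaskedReader WITHOUT LOGIN;
GRANT SELECT ON Customer TO MaskedReader;

-- A privileged user sees the real values
SELECT * FROM Customer;

-- Impersonate the restricted user; the same query should now come back masked:
-- FirstName as R..., LastName and Phone as xxxx, Email as RXXX@XXXX.com,
-- and Balance as some number in the 1000 to 5000 range
EXECUTE AS USER = 'MaskedReader';
SELECT * FROM Customer;
REVERT;
```

Creating the user WITHOUT LOGIN keeps the test contained to the current database, and REVERT switches back to your own security context when you’re done.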

Masking Different Data Types

The way a column gets masked by DDM depends on two things:

  • the masking function that you use
  • the data type of the column that you’re masking

[Table: DDM masking functions and the data types they support]

The default function is the only function that works with virtually all data types. In the case of a string column, it uses a hardcoded mask of four lower-case x’s, which is effectively the same as supplying a mask of four lower-case x’s to the partial function, without revealing any starting or ending characters. In the case of the other data types, DDM masks the column using an appropriate replacement value for that type; for example, using a zero for numeric data types, or January 1, 1900 for a date type. The default function can also be used to mask many of the more specialized data types, such as XML, binary, and spatial columns.

The partial function works only with string columns; meaning varchar, char, and text columns, as well as their Unicode counterparts. This function accepts the three parameters I described earlier, giving you control over how much or how little gets exposed from the start and end of the string, and the custom mask to embed in the middle.

The email function also works only with string columns, and simply reveals just the first character of the string, followed by the mask XXX@XXXX.com, using upper-case X’s.

And finally, the random function works only with numeric columns, meaning for example int, bigint, smallint, money, decimal, and even bit. Use the random function instead of the default function to mask numeric columns, when you’d like to manufacture values that are semi-realistic, and not just zeros.
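
The interplay between function and data type is easiest to see with a small experiment. The following is only a sketch; the MaskDemo table, its columns, and the TypeDemoUser user are hypothetical names I made up for this test.

```sql
-- Hypothetical table that applies masks to several different data types
CREATE TABLE MaskDemo(
  Id int IDENTITY PRIMARY KEY,
  Name varchar(20)
    MASKED WITH (FUNCTION='default()'),
  Quantity int
    MASKED WITH (FUNCTION='default()'),
  OrderDate date
    MASKED WITH (FUNCTION='default()'),
  Score smallint
    MASKED WITH (FUNCTION='random(1, 100)'));

INSERT INTO MaskDemo (Name, Quantity, OrderDate, Score)
VALUES ('Widget', 42, '2016-06-01', 87);

-- Hypothetical user that can read the table, but has no UNMASK permission
CREATE USER TypeDemoUser WITHOUT LOGIN;
GRANT SELECT ON MaskDemo TO TypeDemoUser;

-- Per the rules described above, Name should come back as xxxx,
-- Quantity as 0, OrderDate as 1900-01-01, and Score as a random
-- value between 1 and 100 that changes from query to query
EXECUTE AS USER = 'TypeDemoUser';
SELECT Name, Quantity, OrderDate, Score FROM MaskDemo;
REVERT;
```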

Discovering Masked Columns

To find out which columns in which tables are being masked, you can query sys.columns, which now includes an is_masked column to tell you if a column is being masked.

Or, it’s even easier to query the new sys.masked_columns view, which internally queries from sys.columns and filters to return only the masked columns (that is, where is_masked is set to 1, for true). This view also exposes a masking_function column to tell you the function being used to mask each column:

SELECT
  t.name AS TableName,
  mc.name AS ColumnName,
  mc.masking_function AS MaskingFunction
FROM
  sys.masked_columns AS mc
  INNER JOIN sys.tables AS t ON mc.[object_id] = t.[object_id]
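
For comparison, here is a sketch of the equivalent lookup that goes directly against sys.columns and applies the is_masked filter by hand. The column aliases are just my own choices.

```sql
-- List every masked column by filtering sys.columns on is_masked
SELECT
  t.name AS TableName,
  c.name AS ColumnName
FROM
  sys.columns AS c
  INNER JOIN sys.tables AS t ON c.[object_id] = t.[object_id]
WHERE
  c.is_masked = 1
ORDER BY
  t.name,
  c.name;
```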

Mask Permissions

Dynamic data masking is based purely on the permissions that are either granted to a given user, or not.

So first, no special permission is actually required to create a new table and define it with masked columns, beyond the usual permissions needed to create a table. As for existing tables, the ALTER ANY MASK permission, along with ALTER permission on the table, is required for a user to add a mask to an unmasked column, or to change or remove the mask of an already masked column.
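
As a rough sketch of how this plays out, the following grants the mask-management permissions to a user, who then removes the mask on a column and defines a different one. It assumes the Customer table from earlier in this post; MaskAdmin is a hypothetical user name.

```sql
-- MaskAdmin is a hypothetical user that will manage masks
CREATE USER MaskAdmin WITHOUT LOGIN;
GRANT ALTER ANY MASK TO MaskAdmin;
GRANT ALTER ON Customer TO MaskAdmin;

EXECUTE AS USER = 'MaskAdmin';

-- Remove the existing mask from the Phone column
ALTER TABLE Customer
  ALTER COLUMN Phone
    DROP MASKED;

-- Define a new mask that reveals only the last four characters
ALTER TABLE Customer
  ALTER COLUMN Phone
    ADD MASKED WITH (FUNCTION='partial(0, "xxx.xxx.", 4)');

REVERT;
```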

The UNMASK permission is the big one, because it effectively ignores any masking defined for any columns. This is the permission that you want to be certain not to grant to users that should only view masked data; for example, you would be sure not to grant developers the UNMASK permission when supplying production data for them to use as sample data.
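
Here is a minimal sketch of granting and revoking UNMASK, assuming the Customer table from earlier in this post. ReportUser is a hypothetical user name.

```sql
-- ReportUser is a hypothetical user with SELECT permission only
CREATE USER ReportUser WITHOUT LOGIN;
GRANT SELECT ON Customer TO ReportUser;

-- Without UNMASK, the query results are masked
EXECUTE AS USER = 'ReportUser';
SELECT FirstName, Phone, Email FROM Customer;
REVERT;

-- With UNMASK, the same user sees the real data
GRANT UNMASK TO ReportUser;
EXECUTE AS USER = 'ReportUser';
SELECT FirstName, Phone, Email FROM Customer;
REVERT;

-- Take the permission away again, and the masks are back in force
REVOKE UNMASK FROM ReportUser;
```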

No special permission is needed to insert or update data in a masked column. So DDM effectively behaves like a write-only feature, in the sense that a user has the ability to write data that they themselves will not be able to read back unless they also possess the UNMASK permission.
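
This write-only behavior can be sketched as follows, again assuming the Customer table from earlier in this post. DataEntryUser and the sample values are hypothetical.

```sql
-- DataEntryUser is a hypothetical user that can read and write, but not unmask
CREATE USER DataEntryUser WITHOUT LOGIN;
GRANT SELECT, INSERT, UPDATE ON Customer TO DataEntryUser;

EXECUTE AS USER = 'DataEntryUser';

-- Writing to masked columns requires nothing beyond the usual permissions
INSERT INTO Customer (FirstName, LastName, Phone, Email, Balance)
VALUES ('Jane', 'Smith', '454.222.5920', 'Jane.Smith@hotmail.com', 2500.00);

-- The WHERE clause is evaluated against the stored value, not the mask
UPDATE Customer
SET Phone = '454.222.0000'
WHERE Email = 'Jane.Smith@hotmail.com';

-- Reading the row back returns masked values, even to the user who wrote it
SELECT FirstName, Phone, Email, Balance FROM Customer;

REVERT;
```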

DDM Limitations and Considerations

There are a few things to keep in mind when you’re working with DDM. Although DDM does support most data types – even some of the highly specialized data types that are very often not supported by other SQL Server features – some columns cannot be masked. So while DDM can mask BLOB data stored in varbinary(max) columns, it cannot mask those columns if they are also decorated with the FILESTREAM attribute, which enables highly scalable BLOB storage in SQL Server.

You also cannot mask sparse columns that are part of a COLUMN_SET, or computed columns, although you can still create computed columns that are based on masked columns, in which case the computed column value will get masked as a result.
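
To try the computed column case for yourself, here is a small sketch that assumes the Customer table from earlier in this post. The ContactLabel column and the ComputedDemoUser user are hypothetical names.

```sql
-- Hypothetical computed column built from two masked columns
ALTER TABLE Customer
  ADD ContactLabel AS (FirstName + ' <' + Email + '>');

-- Hypothetical user with SELECT permission, but no UNMASK permission
CREATE USER ComputedDemoUser WITHOUT LOGIN;
GRANT SELECT ON Customer TO ComputedDemoUser;

-- ContactLabel has no mask of its own, but because it is derived from
-- masked columns, this user should not see the real values in it
EXECUTE AS USER = 'ComputedDemoUser';
SELECT FirstName, Email, ContactLabel FROM Customer;
REVERT;
```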

Keys for FULLTEXT indexes can’t be masked, and finally, columns that have been encrypted using the new Always Encrypted feature in SQL Server 2016 (which I’ll cover in a future blog post) cannot be masked.

It’s also important to remember that there is no way to ever derive the unmasked data once it has been masked. So even though SQL Server doesn’t actually modify the underlying data for masked columns, an ETL process, for example, that queries SQL Server and receives masked data will wind up loading the target system with that masked data, and the target system will have no means of ever knowing what the unmasked data is.
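
You can simulate this ETL pitfall with a simple copy. This is only a sketch, assuming the Customer table from earlier in this post; the CustomerExport table and the EtlUser user are hypothetical stand-ins for a real target system and ETL account.

```sql
-- Hypothetical target table, standing in for the ETL destination
CREATE TABLE CustomerExport(
  FirstName varchar(20),
  Phone varchar(12),
  Email varchar(200));

-- EtlUser is a hypothetical account that lacks the UNMASK permission
CREATE USER EtlUser WITHOUT LOGIN;
GRANT SELECT ON Customer TO EtlUser;
GRANT INSERT ON CustomerExport TO EtlUser;

-- The copy runs under the restricted account, so it reads masked values
EXECUTE AS USER = 'EtlUser';
INSERT INTO CustomerExport (FirstName, Phone, Email)
SELECT FirstName, Phone, Email
FROM Customer;
REVERT;

-- Even a privileged user now finds only the masked values in the target,
-- because the masks were physically written to CustomerExport
SELECT * FROM CustomerExport;
```

If the target system needs the real data, the account that runs the extract has to be granted UNMASK.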

 
