I’ve just returned from VSLive! Orlando and gave a great full day workshop on SQL Server 2008 R2 with Andrew Brust. Once again, I find that FILESTREAM stands out as one of my favorite new features in the relational database engine. So, I’ve decided to post a 3-part series on this awesome new feature. In this first post, I’ll explain FILESTREAM and how it works. In part two, I’ll walk you through the steps to enable and use FILESTREAM. In my third and last post of this series, I’ll show you how to maximize FILESTREAM performance using the OpenSqlFilestream feature exposed by the SQL Server Native Client API.
The Need To Stream
Storing unstructured BLOB (large binary object) data that is somehow associated with structured data in a relational database is certainly not a new problem. But these days, the proliferation of unstructured content is accelerating rapidly, and increasingly, we find ourselves dealing with more and more Word documents, Excel spreadsheets, PDF files, emails with attachments, audio/video files, and the list goes on and on. There has been no native solution in SQL Server for handling BLOB storage, until now.
Before FILESTREAM in SQL Server 2008, we’ve had basically two choices for handling BLOBs: Store them in the database as varbinary(max) columns (or, in older versions of SQL Server, using the image data type), or store them outside the database (typically, in the file system or dedicated BLOB store). Both of these approaches have advantages and disadvantages, making neither of them ideal.
Pros and Cons of Traditional BLOB Storage Techniques
Storing BLOBs in the database is fine if your BLOB requirements are modest. Just by using the varbinary(max) data type, you’ll be able to easily embed and extract byte arrays representing BLOBs directly into and out of your database tables. You’ll also enjoy transactional integrity over BLOB updates, as well as integrated management benefits (for example, a single database backup includes the BLOB data). However, if you have many BLOBs, and/or they are very large, then a disproportionate amount of disk space allocated to your structured relational tables will be consumed by BLOB data. Query performance will suffer as a result, and in most cases the performance hit will be so bad that you’ll need to consider a different option.
To maximize database performance, you can instead store the BLOBs outside the database. Files belong in the file system, right? Structured relational data belongs in tables, right? Right! And if you’ve ever had to build a system that associates files in the file system with rows in database tables, you know the pain involved in implementing this solution. Because SQL Server has no awareness of the coupling between the table data and the BLOB data in the file system, it’s your job to establish and maintain those references. When a row is deleted from the table, your application needs to determine which files in the file system store data for that deleted row, and then delete those files as well. You also lose transactional scope, since rolling back a database transaction will in no way undo changes made to the file system. It’s now also another administrative burden to back up your data, since a database backup is no longer a “complete” backup of your database, without you also performing a separate file system backup for the BLOB data.
In SQL Server 2008, the new FILESTREAM feature gives us the best of both worlds. The beauty of FILESTREAM is its transparency. Contrary to what you may have heard or read, FILESTREAM is not a new data type in SQL Server 2008. Rather, it is an attribute that you apply to the varbinary(max) data type–which (as mentioned) is the same data type traditionally used to store BLOBs directly in your tables. From your perspective, a varbinary(max) FILESTREAM column is no different than a conventional varbinary(max) column. So you can embed/extract byte arrays to/from the varbinary(max) column, or even inline small binary streams in T-SQL, just as you did before. Updates made to BLOB data in a varbinary(max) FILESTREAM column while inside a database transaction will roll back when the database transaction rolls back. And a single database backup protects the entire database, including its BLOB data.
The magic? Behind the scenes, simply applying the FILESTREAM attribute to the varbinary(max) data type for defining a BLOB columns enables a highly efficient storage abstraction layer over that varbinary(max) column.
Specifically, SQL Server will store and retrieve BLOB data assigned to varbinary(max) FILESTREAM columns to and from the file system behind the scenes automatically.
How Does FILESTREAM Work?
FILESTREAM works by integrating with the NTFS file system. As you’ll see in part 2 of this blog post series, FILESTREAM storage for a database is defined in the CREATE DATABASE statement simply by declaring another file group that points to a physical location in the file system. Because it’s just another file group, SQL Server now considers the data in the file system to be an integral part of the overall database. So, unless you explicitly exclude the file group when backing up the database, the BLOB data in the file system will be included in your normal full and incremental database backups.
Because SQL Server is now intimately involved with the file system behind the scenes, you enjoy the performance benefit of keeping BLOB storage outside your structured table file groups, without worrying about programmatically maintaining the link references between rows in tables and corresponding files in the file system. And because the NTFS file system is a transactional file system, SQL Server is able to transactionalize BLOB updates made to the file system. When you initiate a database transaction and then proceed to update BLOB data stored in varbinary(max) FILESTREAM columns, SQL Server will initiate an NTFS transaction over the file system. When you commit or roll back your database transaction, SQL Server will then commit or roll back the NTFS file system transaction for you automatically.
If all of this is starting to get you excited, then that’s the point! Be sure to stay tuned for part 2, which will show you how to enable and use FILESTREAM, and part 3, which explores the OpenSqlFilestream feature that delivers ultra-high streaming performance against varbinary(max) FILESTREAM columns.