SQL Server 2012 simplifies the processing of unstructured data found in files. The FileTable enhancements to FILESTREAM allow documents to be known to SQL Server full-text search by simply copying them to a FileTable network share. This is much more convenient than it was in SQL Server 2008 R2 where the files had to be explicitly loaded into SQL Server as blobs before they could be processed by full-text search.
Although I’m a full-time IT consultant, I spend most of my spare time pursuing graduate studies in nursing. I’m always looking for ways to use technology to improve my efficiency as a student and also improve my effectiveness as a consultant. Some of the technical tips and tricks I’ve figured out for school have been useful to my clients in corporate environments. In this blog post, I’ve taken PowerPoint slides from my pharmacology course to evaluate the usefulness of using FileTable to process unstructured data found in the same kinds of files used in corporate environments.
Step 1. Access your SQL Server instance’s properties to enable FILESTREAM.
The first checkbox is to enable FILESTREAM in SQL Server. The second checkbox is to allow Windows to read and write FILESTREAM data – in other words, make your files accessible to Windows. The third checkbox is to allow remote users to access your FILESTREAM data. This is the option for real world usage. Typically a SQL Server is on a remote machine, not locally installed.
Figure 2. Check everything if you want maximum functionality.
Changing FILESTREAM settings requires a restart of the SQL Server service.
You must create a FILESTREAM enabled database or add FILESTREAM to an existing database. For this blog post, I chose to create a new database.
CREATE DATABASE School
PRIMARY (NAME = School, FILENAME = 'd:\school\dbFiles\school.mdf')
CONTAINS FILESTREAM(NAME = schoolFilestream, FILENAME = 'd:\school\schoolFilestream')
LOG ON (NAME = SchoolLog, FILENAME = 'd:\school\dbFiles\school.ldf');
The folder for the data and log files must exist. The subfolder for the FILESTREAM must not exist and will be created when the database is created.
Figure 3. Folder structure showing that D:\school\dbFiles exists and D:\school\schoolFilestream does not exist.
After creating the database with the above T-SQL statement, the D:\school\schoolFilestream folder is created as shown below. As the dialog box shows, it is considered a system folder by Windows.
Figure 4. Click Continue to access the new folder and its contents.
Figure 5. The new D:\school\schoolFilestream folder and its contents.
Figure 6. FILESTREAM Directory Name (which must NOT be a full path) and Non-Transacted Access options.
The options specified created a folder fileContainer under a new mssqlserver share as shown below.
Figure 7. Network showing the fileContainer FILESTREAM directory name under the default share name of mssqlserver.
At this point the fileContainer folder is empty. Time to create a FileTable. Notice that the CREATE TABLE statement does not specify any columns. The default is to create a folder with the same name as the FileTable, but the example shows how to use FILETABLE_DIRECTORY to override the default and use pharmFiles for the folder name. Notice that the WITH clause is also used to create a unique constraint.
CREATE TABLE pharm AS FileTable
WITH (FILETABLE_DIRECTORY = N'pharmFiles'
,FILETABLE_STREAMID_UNIQUE_CONSTRAINT_NAME = ui_file_stream);
Figure 8. The CREATE TABLE statement created a new folder pharmFiles.
Within SSMS, you can see that a FileTable has predefined columns.
Figure 9. FileTable columns are predefined.
Now it is time to copy all of the pharmacology PowerPoint files to the pharmFiles folder on the network share.
Figure 10. PowerPoint files added to the previously empty pharmFiles folder.
Let’s examine the contents of the pharm FileTable after the file copy.
Figure 11. Notice that SQL Server recognizes all of the files that were copied to the pharmFiles folder on the mssqlserver share.
Notice that the files are of type pptx. SQL Server 2012 by default does not index pptx files. Use the following query to find out what file types are indexed.
SELECT * FROM sys.fulltext_document_types
ORDER BY document_type;
It is necessary to add an IFilter, which you will probably find is more than a one step process. As of the date of publication, the latest and greatest IFilter pack is the Microsoft Office 2010 Filter Pack (a.k.a. Filter Pack 2.0). It’s possible that when you read this, a newer version may be available, so do your due diligence and check. After installing the Filter Pack, you may need to install a service pack. There’s more than one service pack that has been released for Filter Pack 2.0. I will not provide any links because I don’t want to provide you with outdated information. Microsoft recommends that you obtain service packs by running Windows Update, which is what I did. As a best practice, after installing something from Windows Update, you should check for additional updates that could be required after what you just finished installing. I had three consecutive updates to do. You could have more or less than that.
The Office Filter Pack does not have a dependency on Microsoft Office. You don’t need to have Office installed on your SQL Server machine for the Filter Pack to work. It’s not a good idea to have Office installed on a server.
After installing and updating the Filter Pack, run the following command to load the new filters.
exec sp_fulltext_service 'load_os_resources', 1;
Update the system metadata about what filters are installed.
exec sp_fulltext_service 'update_languages';
Restart the filter daemon.
exec sp_fulltext_service 'restart_all_fdhosts’;
Confirm that the new filters were installed. SELECT
* FROM sys.fulltext_document_types ORDER BY document_type
Create a fulltext catalog and a fulltext index.
CREATE FULLTEXT CATALOG pharmCat AS DEFAULT;
CREATE FULLTEXT INDEX ON pharm (file_stream TYPE COLUMN file_type) KEY INDEX ui_file_stream;
Here’s a query to find out what files in the pharm folder contain the drug name Vancomycin.
WHERE CONTAINS (file_stream, 'Vancomycin');
A single row is returned because only one file, Anti-infective.pptx mentions that drug. A future post will address using FileTable to make sense of unstructured data.