Full-Text Search with PDF in Microsoft SQL Server

Last week I was given an interesting task: searching for input text in PDF files stored in a database as FILESTREAM data. The implementation took me some time, so I decided to share it with other developers.

Here we are going to use SQL Server 2008 R2 (x64 Developer Edition), an external driver from Adobe, Full-Text Search, and FILESTREAM. Because this seems a little complicated, let's make the topic clear and go through it step by step.

1) Enable FileStream - this part is pretty easy. Check whether FILESTREAM is already enabled on your SQL Server instance; if not, enable it as shown in the picture below.

Picture 1. Enable filestream in SQL Server instance.
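
The same setting can also be checked and changed from T-SQL. The sketch below assumes FILESTREAM has already been enabled for the Windows service in SQL Server Configuration Manager (the dialog in Picture 1). Access level 2 is an assumption; use 1 if only T-SQL access is needed.

```sql
-- 0 = disabled, 1 = T-SQL access only, 2 = T-SQL and Win32 streaming access
SELECT SERVERPROPERTY('FilestreamConfiguredLevel') AS ConfiguredLevel,
       SERVERPROPERTY('FilestreamEffectiveLevel')  AS EffectiveLevel;

-- Enable FILESTREAM for the instance (level 2 is an assumption; pick what you need)
EXEC sp_configure 'filestream access level', 2;
RECONFIGURE;
```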

2) Create a SQL table to store files - mainly PDF files will be stored, but other file types are fine too. Our table DocumentFiles will be created in the dbo schema with a single-column primary key whose default value is a sequential GUID. The important thing is that the table contains the FileStream_Id and FileSource columns, which are required for FILESTREAM. Also, don't miss the Extension column, because we are going to need it for Full-Text Search.

Code Snippet
    CREATE TABLE dbo.DocumentFiles
        (
        DocumentId uniqueidentifier PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
        AddDate datetime NOT NULL,
        Name nvarchar(50) NOT NULL,
        -- File extension with the leading dot, e.g. '.pdf'; used later by Full-Text Search
        Extension nvarchar(10) NOT NULL,
        Description nvarchar(1000) NULL,
        -- A table with FILESTREAM data needs a unique, non-null ROWGUIDCOL column
        FileStream_Id uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        -- The FILESTREAM attribute requires a FILESTREAM filegroup in the database
        FileSource varbinary(MAX) FILESTREAM NOT NULL DEFAULT (0x)
        )  ON [PRIMARY];
    -- Add default add date for document
    ALTER TABLE dbo.DocumentFiles ADD CONSTRAINT
        DF_DocumentFiles_AddDate DEFAULT SYSDATETIME() FOR AddDate;
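
If the FileSource column is declared with the FILESTREAM attribute, the database must already have a FILESTREAM filegroup when the table is created. A minimal sketch of adding one is shown below; the database name DocumentsDb, the filegroup name, and the folder are hypothetical placeholders, not values from this article.

```sql
-- Hypothetical database name, filegroup name and folder; adjust to your environment.
ALTER DATABASE [DocumentsDb]
    ADD FILEGROUP [DocumentsFileStreamGroup] CONTAINS FILESTREAM;

-- For a FILESTREAM filegroup, FILENAME is a folder. Its parent (C:\Data) must
-- already exist, while the last folder (DocumentFilesData) must not exist yet.
ALTER DATABASE [DocumentsDb]
    ADD FILE (NAME = N'DocumentFilesData', FILENAME = N'C:\Data\DocumentFilesData')
    TO FILEGROUP [DocumentsFileStreamGroup];
```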

3) Install an additional component for PDF file support - by default, PDF files are not supported in SQL Server. To check whether PDF support is installed, execute the following T-SQL command:

Code Snippet
    SELECT document_type, path FROM sys.fulltext_document_types WHERE document_type = '.pdf';

If the query returns no rows, you have to install PDF support for SQL Server from here (version for x64). When the installation completes, restart your SQL Server instance so that PDF support shows up in sys.fulltext_document_types, and then verify that the extension is in the list of supported types.

Picture 2. Properly installed PDF extension.
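
If the .pdf row still does not appear after the restart, a commonly needed extra step is to let the full-text engine load filters registered with the operating system. The sketch below is an assumption about your setup rather than part of the original steps; note that turning off signature verification lowers security and is only needed when the installed filter is not accepted as signed.

```sql
-- Let the full-text engine load filters and word breakers registered with the OS
EXEC sp_fulltext_service 'load_os_resources', 1;
-- Assumption: the installed filter fails signature verification, so skip the check
EXEC sp_fulltext_service 'verify_signature', 0;

-- Restart the SQL Server instance, then run the check again
SELECT document_type, path FROM sys.fulltext_document_types WHERE document_type = '.pdf';
```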

4) Create a Full-Text Search (FTS) index on the DocumentFiles table.

The T-SQL query below enables FTS on the database and then creates a full-text catalog named Ducuments_Catalog (the name is spelled this way in the scripts), which is required before an FTS index can be created on any table in the database.

Code Snippet
    -- Enable full-text indexing for the current database
    EXEC sp_fulltext_database 'enable';
    GO
    -- Create the catalog only if it does not exist yet
    IF NOT EXISTS (SELECT TOP 1 1 FROM sys.fulltext_catalogs WHERE name = 'Ducuments_Catalog')
    BEGIN
        EXEC sp_fulltext_catalog 'Ducuments_Catalog', 'create';
    END
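
Microsoft's documentation marks the sp_fulltext_* procedures as deprecated, and the same catalog can be created with the CREATE FULLTEXT CATALOG statement. A minimal sketch of the equivalent, keeping the catalog name used in this article:

```sql
-- Same effect as sp_fulltext_catalog ... 'create', using the DDL statement
IF NOT EXISTS (SELECT 1 FROM sys.fulltext_catalogs WHERE name = N'Ducuments_Catalog')
BEGIN
    CREATE FULLTEXT CATALOG [Ducuments_Catalog];
END
```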

Now it's time to create the full-text index. But first, a small digression: when you use Entity Framework Code-First, the name of the primary key in a table can vary each time the table is created. This is a problem because creating an FTS index on a table requires the name of its primary key index. The query below retrieves the primary key name from the system tables and passes it to the subsequent calls.

The other important thing, as I wrote above, is the Extension column. The file extension has to be stored there in the format '.pdf', with the leading dot. This is required because SQL Server uses it to determine which Full-Text Search driver should be used. Our newly created index has change tracking set to AUTO, so each time a document is added to the table, it is automatically added to the index. If you want to decide on your own when documents are updated in the index, set the change tracking mode to MANUAL.


Code Snippet
    -- Look up the primary key index name, which EF Code-First may generate differently each time
    DECLARE @indexName nvarchar(255) = (SELECT TOP 1 i.name FROM sys.indexes i
                                        JOIN sys.tables t ON i.object_id = t.object_id
                                        WHERE t.name = 'DocumentFiles' AND i.is_primary_key = 1);

    PRINT @indexName;

    -- Create the full-text index in the catalog, keyed on the primary key index
    EXEC sp_fulltext_table 'DocumentFiles', 'create', 'Ducuments_Catalog', @indexName;
    -- Index FileSource with the neutral language (0); Extension identifies the document type
    EXEC sp_fulltext_column 'DocumentFiles', 'FileSource', 'add', 0, 'Extension';
    EXEC sp_fulltext_table 'DocumentFiles', 'activate';
    EXEC sp_fulltext_catalog 'Ducuments_Catalog', 'start_full';

    ALTER FULLTEXT INDEX ON [dbo].[DocumentFiles] ENABLE;
    ALTER FULLTEXT INDEX ON [dbo].[DocumentFiles] SET CHANGE_TRACKING = AUTO;
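
As an alternative to the stored procedure calls (run one or the other, not both), the index can be created with the CREATE FULLTEXT INDEX statement. Because KEY INDEX expects a literal name, this sketch builds the statement dynamically; looking the name up through sys.key_constraints is my assumption of a convenient way to find the primary key, not the author's method.

```sql
-- Find the primary key constraint name (it is also the name of its index)
DECLARE @pkName sysname =
    (SELECT kc.name
     FROM sys.key_constraints AS kc
     WHERE kc.parent_object_id = OBJECT_ID(N'dbo.DocumentFiles')
       AND kc.type = 'PK');

-- Only create the index if the table has a primary key and no full-text index yet
IF @pkName IS NOT NULL
   AND NOT EXISTS (SELECT 1 FROM sys.fulltext_indexes
                   WHERE object_id = OBJECT_ID(N'dbo.DocumentFiles'))
BEGIN
    -- KEY INDEX must be a literal name, so the statement is built as a string
    DECLARE @sql nvarchar(max) =
        N'CREATE FULLTEXT INDEX ON dbo.DocumentFiles
              (FileSource TYPE COLUMN Extension LANGUAGE 0)
          KEY INDEX ' + QUOTENAME(@pkName) + N'
          ON [Ducuments_Catalog]
          WITH CHANGE_TRACKING AUTO;';
    EXEC sys.sp_executesql @sql;
END
```

To control index updates yourself, replace AUTO with MANUAL and push pending changes with ALTER FULLTEXT INDEX ON dbo.DocumentFiles START UPDATE POPULATION.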


With section four completed, it's time to test our solution. To insert a file into the DocumentFiles table, first insert the simple data (INSERT INTO ...) for all columns except FileStream_Id and FileSource. Then upload the file to FILESTREAM, either directly from SQL Server as presented here or from C# code.
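
For a quick test directly from SQL Server, the row and the file content can also be inserted in one statement with OPENROWSET(BULK ...). This is a sketch under assumptions: the file path and document name are hypothetical, the path is read on the server machine by the SQL Server service account, and the caller needs permission for bulk operations. FileStream_Id is supplied explicitly so the statement works whether or not the column has a default.

```sql
-- Hypothetical file path and document name; AddDate is filled by its default
INSERT INTO dbo.DocumentFiles (Name, Extension, Description, FileStream_Id, FileSource)
SELECT N'sample', N'.pdf', N'Sample PDF document', NEWID(), f.BulkColumn
FROM OPENROWSET(BULK N'C:\Temp\sample.pdf', SINGLE_BLOB) AS f;
```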

Once the data is inserted, you can query it like any other full-text data, using a query such as the one below.
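
Indexing runs in the background, so a search executed right after an insert may not find the new document yet. The following sketch checks the population state using the built-in FULLTEXTCATALOGPROPERTY and OBJECTPROPERTYEX functions; the column aliases are only illustrative.

```sql
-- Catalog level: PopulateStatus 0 means idle, other values mean a non-idle state
SELECT FULLTEXTCATALOGPROPERTY('Ducuments_Catalog', 'PopulateStatus') AS CatalogPopulateStatus,
       FULLTEXTCATALOGPROPERTY('Ducuments_Catalog', 'ItemCount')      AS IndexedItemCount;

-- Table level: processed documents and rows the full-text engine failed to index
SELECT OBJECTPROPERTYEX(OBJECT_ID(N'dbo.DocumentFiles'), 'TableFulltextPopulateStatus') AS TablePopulateStatus,
       OBJECTPROPERTYEX(OBJECT_ID(N'dbo.DocumentFiles'), 'TableFulltextDocsProcessed')  AS DocsProcessed,
       OBJECTPROPERTYEX(OBJECT_ID(N'dbo.DocumentFiles'), 'TableFulltextFailCount')      AS FailedDocs;
```

A non-zero FailedDocs value for PDF rows usually points back to step 3, that is, the PDF filter not being loaded.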


Code Snippet
    SELECT d.* FROM dbo.DocumentFiles d
    -- CONTAINS matches indexed words; LIKE-style '%' characters are not wildcards here
    WHERE CONTAINS(d.FileSource, 'Word');
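
CONTAINS has its own search syntax rather than LIKE-style patterns. A few variations are sketched below; the search terms are hypothetical, and the queries assume the DocumentFiles table and full-text index created earlier.

```sql
-- Prefix search: documents with words starting with "Word" (Word, Words, Wording, ...)
SELECT d.DocumentId, d.Name
FROM dbo.DocumentFiles d
WHERE CONTAINS(d.FileSource, N'"Word*"');

-- Boolean search: both terms must occur in the document
SELECT d.DocumentId, d.Name
FROM dbo.DocumentFiles d
WHERE CONTAINS(d.FileSource, N'"SQL" AND "Server"');

-- Ranked search: CONTAINSTABLE returns the key of each match and its relevance rank
SELECT d.DocumentId, d.Name, k.[RANK]
FROM CONTAINSTABLE(dbo.DocumentFiles, FileSource, N'"Word*"') AS k
JOIN dbo.DocumentFiles d ON d.DocumentId = k.[KEY]
ORDER BY k.[RANK] DESC;
```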


Thank you.
