Replacing/removing HTML Entities in database

Sometimes the solutions we build store data from external suppliers. The data is saved in a database and then presented to end users. A problem arises when a supplier uses an XML-based technology to send data packages to us. The XML standard does not allow certain special characters to appear literally in text (attribute or element values), so each occurrence of such a character is encoded as an entity. We should all know that and decode those entities before writing the data to the database or presenting it to the user. In practice, though, many things can go wrong, and encoded entities may end up in some rows of our product database.

Once you recognize the problem there are three things you can try, but only two of them work. The first idea (the wrong one) is to create a CLR stored procedure or function that calls System.Web.HttpUtility.HtmlDecode. The problem is that you can't add a reference to System.Web in a SQL CLR project, so this idea cannot be carried out.

The second idea is to create a stand-alone console application and implement the whole algorithm inside it. The application has to select all records from the database (for example 10 GB of data), perform the replacement on every single row, and update it. This is very quick to implement, but it is inefficient and hurts server performance even if you schedule the task at night.

The third idea is to create a T-SQL query (or stored procedure) that replaces each occurrence of a special character in each record. The drawback is that you need to build a whole translation dictionary that maps each encoded special character (key) to its normal counterpart (value).

An example of such a dictionary may look like this:


DECLARE @Dictionary TABLE --creating a table variable
(
    [key] nvarchar(10), --encoded entity to search for
    [value] nvarchar(50) --plain character that replaces it
)
--two examples
INSERT INTO @Dictionary([key], [value]) VALUES ('&amp;','&'),('&#039;','''')
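
The order in which the pairs are applied starts to matter once the dictionary grows. Here is a small sketch, assuming a hypothetical sample string in which the supplier meant the literal text &#039; (so its ampersand arrived encoded as &amp;). Decoding &amp; first turns that literal into an entity that is then decoded a second time, while decoding &amp; last leaves it intact.

```sql
-- Hypothetical sample value: the supplier meant the literal text &#039;
DECLARE @Sample nvarchar(100) = N'Tom &amp;#039; Jerry';

-- '&amp;' decoded first: the text ends up decoded twice
SELECT REPLACE(REPLACE(@Sample, '&amp;', '&'), '&#039;', '''') AS AmpFirst;  -- Tom ' Jerry

-- '&amp;' decoded last: the literal survives
SELECT REPLACE(REPLACE(@Sample, '&#039;', ''''), '&amp;', '&') AS AmpLast;   -- Tom &#039; Jerry
```

A table variable has no inherent row order, so if this case matters for your data, a cursor over the dictionary would need an explicit ordering column and an ORDER BY clause.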

After creating the dictionary table variable, you can create a cursor (e.g. 'HtmlReplace') over the table you want to update. Inside this cursor you need to walk through the whole dictionary, so you create a nested cursor (e.g. 'TempCursor'). As you can see, we process every single row of our huge table, and for each row we walk through each key-value pair in the dictionary and replace every occurrence of the key with its value. Once the whole row has been processed, we do a simple update.

For example, say we have a single table named dbo.[Object] with only two columns, 'Code' and 'Desc'. Many rows in the 'Desc' column contain encoded XML special characters. Using the T-SQL query presented below, we can replace them with their plain counterparts.


DECLARE @Code nvarchar(50)
DECLARE @Text nvarchar(max)
--temporary dictionary values
DECLARE @key nvarchar(10)
DECLARE @value nvarchar(50)

DECLARE HtmlReplace CURSOR FOR
SELECT [Code],[Desc] FROM dbo.[Object]

OPEN HtmlReplace

FETCH NEXT FROM HtmlReplace
INTO @Code, @Text

WHILE @@FETCH_STATUS = 0 --this cursor runs over the dbo.Object table
BEGIN
    PRINT 'Before:' + @Text
    DECLARE TempCursor CURSOR FOR
    SELECT [key],[value] FROM @Dictionary

    OPEN TempCursor
    FETCH NEXT FROM TempCursor
    INTO @key, @value

    WHILE @@FETCH_STATUS = 0 --this cursor iterates over the dictionary table
    BEGIN
        --REPLACE substitutes every occurrence of the current key in one call
        SET @Text = REPLACE(@Text,@key,@value)
        FETCH NEXT FROM TempCursor
        INTO @key, @value
    END
    CLOSE TempCursor
    DEALLOCATE TempCursor
    PRINT 'After:' + @Text
    --final update
    UPDATE dbo.[Object] SET [Desc] = @Text WHERE [Code] = @Code

    FETCH NEXT FROM HtmlReplace
    INTO @Code, @Text
END
CLOSE HtmlReplace
DEALLOCATE HtmlReplace
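
For a short, fixed list of entities, the same cleanup can probably be done without cursors, as a single set-based UPDATE with nested REPLACE calls. This is an alternative sketch and not the cursor query above. It assumes the dbo.[Object] table from the example and only the two entities from the sample dictionary; the LIKE filter is an added assumption that skips rows which cannot contain an entity.

```sql
-- Set-based sketch: one UPDATE instead of two nested cursors.
-- Assumes the dbo.[Object] table from the example and a fixed entity list.
UPDATE dbo.[Object]
SET [Desc] = REPLACE(REPLACE([Desc], '&#039;', ''''), '&amp;', '&')  -- '&amp;' is decoded last
WHERE [Desc] LIKE '%&%;%';  -- only rows with an ampersand followed by a semicolon
```

The trade-off is that every new entity means one more nested REPLACE, so this form suits a handful of entities better than a long dictionary.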


This is only an idea, so if you want you can extend this query into a user-defined function.
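
One way such a function might look is sketched below. The name dbo.DecodeHtmlEntities, the [ord] ordering column and the three extra entities (&quot;, &lt;, &gt;) are assumptions made for illustration and are not part of the query above. The dictionary lives inside the function because a function cannot see a table variable declared in the calling batch.

```sql
-- Hypothetical scalar function; name, [ord] column and entity list are assumptions.
CREATE FUNCTION dbo.DecodeHtmlEntities (@Text nvarchar(max))
RETURNS nvarchar(max)
AS
BEGIN
    DECLARE @Dictionary TABLE
    (
        [ord]   int PRIMARY KEY,   -- explicit replacement order
        [key]   nvarchar(10),      -- encoded entity
        [value] nvarchar(50)       -- plain character
    );

    INSERT INTO @Dictionary ([ord], [key], [value])
    VALUES (1, '&#039;', ''''),
           (2, '&quot;', '"'),
           (3, '&lt;',   '<'),
           (4, '&gt;',   '>'),
           (5, '&amp;',  '&');     -- decoded last to avoid decoding twice

    DECLARE @ord int = 1;
    DECLARE @max int = (SELECT MAX([ord]) FROM @Dictionary);
    DECLARE @key nvarchar(10), @value nvarchar(50);

    -- Apply the pairs one by one in [ord] order
    WHILE @ord <= @max
    BEGIN
        SELECT @key = [key], @value = [value] FROM @Dictionary WHERE [ord] = @ord;
        SET @Text = REPLACE(@Text, @key, @value);
        SET @ord = @ord + 1;
    END

    RETURN @Text;
END
GO

-- Usage against the example table
UPDATE dbo.[Object]
SET [Desc] = dbo.DecodeHtmlEntities([Desc])
WHERE [Desc] LIKE '%&%;%';
```

A scalar function is called once per row, so on a very large table it may still be slow. Its main benefit is that the dictionary is kept in one place and can be reused from any query.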

Thank you
