
Replacing/removing HTML Entities in database

Sometimes the solutions we create store data from external suppliers. This data is saved in a database and then presented to the end user. The problem occurs when a supplier uses an XML-based technology to send data packages to us. The XML standard does not allow certain special characters to appear literally in text (an attribute value or a node value), so each occurrence of such a character is encoded as an entity. Each of us should know that and try to decode those entities before writing the data to the database and/or presenting it to the user. But in reality many things can go wrong, and in some rows of our product database these encoded sequences may appear.

When you recognize the problem, you can do three things, but only two of them will work. The first idea (the wrong one) is to create a CLR stored procedure or function that calls System.Web.HttpUtility.HtmlDecode. The problem is that you can't add a reference to System.Web in SQL CLR projects, so this idea cannot be carried out.

The second idea is to create a standalone console application and implement the whole algorithm inside it. The algorithm has to select all records from the database (for example, 10 GB of data), perform the replacement in every single row, and write the row back. This is very quick to implement, but it is not efficient and hurts server performance even if you schedule the task to run at night.

The third idea is to create a T-SQL query (stored procedure) that replaces each occurrence of the special characters in each record. The drawback is that you need to build a whole translation dictionary that maps each encoded special character (key) to its normal counterpart (value).

An example of such a dictionary might look like this:


DECLARE @Dictionary TABLE -- table variable holding the translation dictionary
(
    [key] nvarchar(10),  -- encoded entity to search for
    [value] nvarchar(50) -- plain character that replaces it
)
-- two example entries
INSERT INTO @Dictionary([key], [value]) VALUES ('&amp;','&'),('&#039;','''')
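The XML specification predefines five entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot; and &amp;apos;), and apostrophes often also arrive as numeric references. A fuller dictionary could be sketched as below. The name @FullDictionary and the [ord] column are assumptions added for illustration, not part of the original script: [ord] lets a reader of the table process &amp;amp; last, so that text such as '&amp;amp;lt;' is not decoded twice.

```sql
-- Hypothetical extended dictionary; declared under its own name so it does not clash with @Dictionary
DECLARE @FullDictionary TABLE
(
    [ord]   int NOT NULL PRIMARY KEY, -- assumed processing order (not in the original script)
    [key]   nvarchar(10) NOT NULL,    -- encoded entity to search for
    [value] nvarchar(50) NOT NULL     -- plain character that replaces it
);

INSERT INTO @FullDictionary ([ord], [key], [value])
VALUES (1, N'&lt;',   N'<'),
       (2, N'&gt;',   N'>'),
       (3, N'&quot;', N'"'),
       (4, N'&apos;', N''''),
       (5, N'&#039;', N''''),
       (6, N'&#39;',  N''''),
       (7, N'&amp;',  N'&');  -- decoded last on purpose

-- Any cursor over this table would need ORDER BY [ord] to honour the order
SELECT [ord], [key], [value]
FROM @FullDictionary
ORDER BY [ord];
```

Real supplier data may contain other entities (for example HTML-only names), so the list should be treated as a starting point rather than a complete one.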

After creating the dictionary table variable, you can create a cursor (e.g. 'HtmlReplace') over the table you want to update. Inside this cursor you need to go through the whole dictionary, so you have to create a nested cursor (e.g. 'TempCursor'). As you can see, we process every single row of our huge table, and for each row we walk through every key-value pair in the dictionary, replacing every occurrence of the key with the value. After the whole row has been processed, we can do a simple update.

For example, suppose we have a single table named dbo.Object with only two columns, 'Code' and 'Desc'. In the 'Desc' column many rows contain encoded XML special characters. Using the T-SQL query presented below, we can replace them with their counterparts.


-- Run this in the same batch as the @Dictionary declaration shown earlier
DECLARE @Code nvarchar(50)
DECLARE @Text nvarchar(max)
-- current dictionary entry
DECLARE @key nvarchar(10)
DECLARE @value nvarchar(50)

DECLARE HtmlReplace CURSOR FOR
    SELECT [Code], [Desc] FROM dbo.[Object]

OPEN HtmlReplace

FETCH NEXT FROM HtmlReplace INTO @Code, @Text

WHILE @@FETCH_STATUS = 0 -- outer cursor: iterates over the rows of dbo.[Object]
BEGIN
    PRINT 'Before: ' + @Text

    DECLARE TempCursor CURSOR FOR
        SELECT [key], [value] FROM @Dictionary

    OPEN TempCursor
    FETCH NEXT FROM TempCursor INTO @key, @value

    WHILE @@FETCH_STATUS = 0 -- inner cursor: iterates over the dictionary entries
    BEGIN
        -- CHARINDEX does a literal search, so keys containing %, _ or [ are not treated as wildcards.
        -- The loop repeats in case a replacement exposes a new occurrence of the key;
        -- a value must never contain its own key, or this loop will never end.
        WHILE CHARINDEX(@key, @Text) > 0
        BEGIN
            SET @Text = REPLACE(@Text, @key, @value)
        END
        FETCH NEXT FROM TempCursor INTO @key, @value
    END
    CLOSE TempCursor
    DEALLOCATE TempCursor

    PRINT 'After: ' + @Text
    -- final update of the processed row
    UPDATE dbo.[Object] SET [Desc] = @Text WHERE [Code] = @Code

    FETCH NEXT FROM HtmlReplace INTO @Code, @Text
END
CLOSE HtmlReplace
DEALLOCATE HtmlReplace
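Before running a script like the one above on a large table, it can help to see which rows look affected at all. The sketch below uses a hypothetical table variable @Object with made-up rows as a stand-in for dbo.[Object], so that it runs on its own. The LIKE pattern is only a rough heuristic: it matches any text with an '&' followed later by a ';', so it can also flag text that contains no real entity.

```sql
-- Hypothetical stand-in for dbo.[Object] so the example is self-contained
DECLARE @Object TABLE
(
    [Code] nvarchar(50) PRIMARY KEY,
    [Desc] nvarchar(max)
);

INSERT INTO @Object ([Code], [Desc])
VALUES (N'A1', N'Tom &amp; Jerry'),
       (N'A2', N'Plain text'),
       (N'A3', N'It&#039;s here');

-- Rows whose text contains an '&' followed somewhere later by a ';'
SELECT [Code], [Desc]
FROM @Object
WHERE [Desc] LIKE N'%&%;%';
-- returns rows A1 and A3
```

The same predicate could be added as a WHERE clause to the SELECT of the outer cursor, so that rows without any candidate entity are never fetched or updated.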


This is only an idea... so if you want, you can extend this query into a user-defined function.
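One possible shape for such a function is sketched below. It is not the cursor script wrapped as-is: instead of reading a dictionary table it hard-codes a handful of entities as nested REPLACE calls, and the name dbo.ReplaceHtmlEntities is an assumption made up for this example. GO is the batch separator understood by tools such as SSMS and sqlcmd; it is needed because CREATE FUNCTION must be the first statement in its batch.

```sql
-- Hypothetical scalar function; the name and the entity list are assumptions
CREATE FUNCTION dbo.ReplaceHtmlEntities (@Text nvarchar(max))
RETURNS nvarchar(max)
AS
BEGIN
    -- The innermost REPLACE runs first; &amp; is handled by the outermost call,
    -- so it is decoded last and '&amp;lt;' becomes '&lt;' rather than '<'
    RETURN REPLACE(
               REPLACE(
                   REPLACE(
                       REPLACE(
                           REPLACE(@Text, N'&#039;', N''''),
                           N'&quot;', N'"'),
                       N'&lt;', N'<'),
                   N'&gt;', N'>'),
               N'&amp;', N'&');
END;
GO

-- Usage on a literal
SELECT dbo.ReplaceHtmlEntities(N'Tom &amp; Jerry&#039;s &lt;show&gt;') AS Decoded;
-- returns: Tom & Jerry's <show>
```

With a function like this, the whole table could be processed by one set-based statement such as UPDATE dbo.[Object] SET [Desc] = dbo.ReplaceHtmlEntities([Desc]), at the cost of keeping the entity list in code instead of in a dictionary table.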

Thank You
