
Replacing/removing HTML Entities in database

Sometimes the solutions we create store data from external suppliers. This data is saved in a database and then presented to the end user. The problem occurs when our suppliers use an XML-based technology to send data packages to us. The XML standard does not allow certain special characters in text (an attribute value or node content), so each occurrence of such a character is encoded as an entity. We should all know this and decode those entities before writing the data to the database and/or presenting it to the user. In practice, though, many things can go wrong, and the encoded entities may end up in some rows of our product database.

Once you recognize the problem you can do three things, but only two of them are correct. The first idea (the wrong one) is to create a CLR stored procedure or function that calls System.Web.HttpUtility.HtmlDecode. The problem is that you can't add a reference to System.Web in a CLR project! So this idea cannot be carried out.

The second idea is to create a standalone console application and implement the whole algorithm inside it. The algorithm has to select all records from the database (for example, 10 GB of data), perform the replacement on every single row, and update it. This is very quick to implement, but it is not efficient and it hurts server performance even if you schedule the task to run at night.

The third idea is to create a T-SQL query (stored procedure) that replaces each occurrence of the special characters in each record. The problem is that you need to implement the whole translation dictionary, mapping each encoded special character (key) to its normal counterpart (value).

Such a dictionary might look like this:


DECLARE @Dictionary TABLE --creating a table variable
(
    [key] nvarchar(10), --special character pattern
    [value] nvarchar(50) --normal sign
)
--two example entries
INSERT INTO @Dictionary([key], [value]) VALUES ('&amp;','&'),('&#039;','''')
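
A fuller dictionary would probably start from the five entities predefined by XML. The sketch below is only an illustration: the variable name @XmlEntities and the [seq] column are assumptions, not part of the original script. The [seq] column is there so the pairs can be read with ORDER BY [seq] and the ampersand entity decoded last; decoding it first could turn a doubly encoded "less than" entity into a singly encoded one and then into '<', which is one decoding step too many.

```sql
DECLARE @XmlEntities TABLE
(
    [seq]   int          NOT NULL, -- hypothetical processing order
    [key]   nvarchar(10) NOT NULL, -- encoded entity
    [value] nvarchar(50) NOT NULL  -- decoded character
);

INSERT INTO @XmlEntities ([seq], [key], [value]) VALUES
    (1, N'&lt;',   N'<'),
    (2, N'&gt;',   N'>'),
    (3, N'&quot;', N'"'),
    (4, N'&apos;', N''''),
    (5, N'&#039;', N''''), -- numeric form of the apostrophe
    (6, N'&amp;',  N'&');  -- last, so freshly decoded ampersands are not decoded again

-- Read the pairs in a fixed order; a table variable has no inherent row order.
SELECT [key], [value] FROM @XmlEntities ORDER BY [seq];
```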

After creating the dictionary table variable, you can create a cursor (e.g. 'HtmlReplace') over the table you want to update. Inside this cursor you need to fetch the whole dictionary, so you have to create a nested cursor (e.g. 'TempCursor'). As you can see, we process every single row of our huge table, and for each row we walk through every key-value pair in the dictionary, replacing every occurrence of the key with the corresponding value. Once the whole row has been processed, we can do a simple update.

For example, say we have a single table named dbo.Object with only two columns, 'Code' and 'Desc'. In the 'Desc' column many rows contain encoded XML special characters. Using the T-SQL query presented below, we can replace them with their counterparts.


DECLARE @Code nvarchar(50)
DECLARE @Text nvarchar(max)
DECLARE @CurrentPattern nvarchar(12) --'%' + key (up to 10 characters) + '%'
--temporary dictionary values
DECLARE @key nvarchar(10)
DECLARE @value nvarchar(50)

DECLARE HtmlReplace CURSOR FOR
SELECT [Code],[Desc] FROM dbo.[Object]

OPEN HtmlReplace

FETCH NEXT FROM HtmlReplace
INTO @Code, @Text

WHILE @@FETCH_STATUS = 0 --this cursor runs over the dbo.Object table
BEGIN
    PRINT 'Before:' + @Text
    DECLARE TempCursor CURSOR FOR
    SELECT [key],[value] FROM @Dictionary

    OPEN TempCursor
    FETCH NEXT FROM TempCursor
    INTO @key, @value

    WHILE @@FETCH_STATUS = 0 --this cursor iterates over the dictionary table, replacing each occurrence of the current key
    BEGIN
        SET @CurrentPattern = '%'+@key+'%';
        --REPLACE substitutes every occurrence in one pass; the loop only repeats
        --when a replacement re-forms the key (e.g. double-encoded text).
        --A value must never contain its own key, or this loop will not end.
        WHILE((SELECT PATINDEX (@CurrentPattern, @Text) )>0)
        BEGIN
            SET @Text = REPLACE(@Text,@key,@value)
        END
        FETCH NEXT FROM TempCursor
        INTO @key, @value
    END
    CLOSE TempCursor
    DEALLOCATE TempCursor
    PRINT 'After:' + @Text
    --final update
    UPDATE dbo.[Object] SET [Desc] = @Text WHERE [Code] = @Code

    FETCH NEXT FROM HtmlReplace
    INTO @Code, @Text
END
CLOSE HtmlReplace
DEALLOCATE HtmlReplace


This is only an idea... so if you want, you can extend this query into a user-defined function.
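
One way such a function might look is sketched below. It is a minimal sketch under stated assumptions, not a finished implementation: the function name dbo.DecodeXmlEntities is hypothetical, the entity list covers only the five predefined XML entities plus the numeric apostrophe, and the 'Desc' column is assumed to be an nvarchar type. Instead of a dictionary table it uses nested REPLACE calls, which are evaluated from the inside out, so the ampersand entity is decoded last.

```sql
-- Hypothetical function name; adjust the schema and entity list to your data.
CREATE FUNCTION dbo.DecodeXmlEntities (@Text nvarchar(max))
RETURNS nvarchar(max)
AS
BEGIN
    -- Innermost REPLACE runs first; &amp; is handled by the outermost call.
    RETURN REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
               @Text,
               N'&lt;',   N'<'),
               N'&gt;',   N'>'),
               N'&quot;', N'"'),
               N'&apos;', N''''),
               N'&#039;', N''''),
               N'&amp;',  N'&');
END;
GO -- batch separator for SSMS/sqlcmd; CREATE FUNCTION must start its own batch

-- Set-based usage: only rows that might contain an entity are updated.
UPDATE dbo.[Object]
SET [Desc] = dbo.DecodeXmlEntities([Desc])
WHERE [Desc] LIKE N'%&%;%';
```

Unlike the cursor version, this sketch makes a single decoding pass per row, so double-encoded text is decoded only one level. Whether that is the desired behaviour depends on how the data got encoded in the first place.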

Thank You
