Replacing/removing HTML Entities in database

Sometimes the solutions we create store data received from external suppliers. This data is stored in a database and then presented to the end user. A problem occurs when our suppliers use XML-based technology to send data packages to us. The XML standard does not allow some special characters to appear literally in text (attribute or node values), so each occurrence of such a character is encoded as an entity; for example, & arrives as &amp;. We should all be aware of this and decode those entities before writing the data to the database and/or presenting it to the user. In practice, however, many things can go wrong, and encoded entities may end up in some rows of our product database.

Once you recognize the problem, you can do three things, but only two of them are workable. The first idea (the wrong one) is to create a CLR stored procedure or function that calls System.Web.HttpUtility.HtmlDecode. The problem is that you can't add a reference to System.Web in SQL CLR projects, so this idea cannot be carried out.

The second idea is to create a standalone console application and implement the whole algorithm inside it. The algorithm has to select all records from the database (for example, 10 GB of data), perform the replacement on every single row, and write the row back. This is very quick to implement, but it is not efficient and hurts server performance even if you schedule the task to run at night.

The third idea is to create a T-SQL query (stored procedure) that replaces each occurrence of the special characters in each record. The drawback is that you need to implement a whole translation dictionary that maps each encoded special character (key) to its normal counterpart (value).

An example of such a dictionary may look like this:


DECLARE @Dictionary TABLE -- creating a table variable
(
    [key] nvarchar(10),  -- encoded entity to search for
    [value] nvarchar(50) -- plain character that replaces it
)
-- two example entries
INSERT INTO @Dictionary([key], [value]) VALUES (N'&amp;', N'&'), (N'&#039;', N'''')

After creating the dictionary table variable, you can create a cursor (e.g. 'HtmlReplace') over the table you want to update. Inside this cursor you need to go through the whole dictionary, so you have to create a nested cursor (e.g. 'TempCursor'). As you can see, we process every single row of our huge table, and for each row we walk through every key-value pair in the dictionary, replacing every occurrence of the key with the corresponding value. After the whole row has been processed, we do a simple update.

For example, suppose we have a single table named dbo.Object with only two columns, 'Code' and 'Desc'. In the 'Desc' column many rows contain encoded XML special characters. Using the T-SQL query presented below, we can replace them with their plain counterparts.


-- Run this in the same batch as the @Dictionary declaration (table variables are batch-scoped).
DECLARE @Code nvarchar(50)
DECLARE @Text nvarchar(max)
-- temporary dictionary values
DECLARE @key nvarchar(10)
DECLARE @value nvarchar(50)

DECLARE HtmlReplace CURSOR FOR
SELECT [Code], [Desc] FROM dbo.[Object]

OPEN HtmlReplace

FETCH NEXT FROM HtmlReplace
INTO @Code, @Text

WHILE @@FETCH_STATUS = 0 -- this cursor runs over the dbo.Object table
BEGIN
    PRINT 'Before:' + @Text
    DECLARE TempCursor CURSOR FOR
    SELECT [key], [value] FROM @Dictionary

    OPEN TempCursor
    FETCH NEXT FROM TempCursor
    INTO @key, @value

    WHILE @@FETCH_STATUS = 0 -- this cursor iterates over the dictionary table
    BEGIN
        -- REPLACE already substitutes every occurrence of the key,
        -- so no inner PATINDEX loop is needed (it could also loop forever
        -- if a value contained its own key)
        SET @Text = REPLACE(@Text, @key, @value)
        FETCH NEXT FROM TempCursor
        INTO @key, @value
    END
    CLOSE TempCursor
    DEALLOCATE TempCursor
    PRINT 'After:' + @Text
    -- final update
    UPDATE dbo.[Object] SET [Desc] = @Text WHERE [Code] = @Code

    FETCH NEXT FROM HtmlReplace
    INTO @Code, @Text
END
CLOSE HtmlReplace
DEALLOCATE HtmlReplace
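
Before running a script like this against a large table, it may be worth checking how many rows contain encoded entities at all. The query below is only a minimal sketch: it assumes the same dbo.[Object] table and looks only for the two entities from the example dictionary, so the patterns would need to be extended to match the real dictionary.

```sql
-- Preview the rows the replacement would touch.
-- Assumption: only the two example entities (&amp; and &#039;) are of interest.
-- In a LIKE pattern, &, # and ; are ordinary characters; only %, _ and [ ] are wildcards.
SELECT [Code], [Desc]
FROM dbo.[Object]
WHERE [Desc] LIKE N'%&amp;%'
   OR [Desc] LIKE N'%&#039;%';
```

If this returns only a small fraction of the table, the cursor's SELECT could be restricted with the same WHERE clause so that clean rows are never fetched or updated.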


This is only an idea, so if you want, you can extend this query into a user-defined function.
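
One way the user-defined function could look is sketched below. The function name dbo.DecodeXmlEntities is hypothetical, and the list of entities (the five predefined XML entities plus &#039;) is an assumption; it should be adjusted to whatever the suppliers actually send. Note that this sketch swaps the cursor and the dictionary table for a fixed chain of REPLACE calls, because a table variable declared outside a function is not visible inside it.

```sql
-- Hypothetical scalar function; name and entity list are assumptions.
-- GO is the batch separator used by SSMS and sqlcmd; CREATE FUNCTION
-- must be the first statement in its batch.
CREATE FUNCTION dbo.DecodeXmlEntities (@Text nvarchar(max))
RETURNS nvarchar(max)
AS
BEGIN
    -- &amp; is decoded last, so that a literal '&amp;#039;' becomes
    -- '&#039;' and is not decoded a second time into an apostrophe.
    SET @Text = REPLACE(@Text, N'&#039;', N'''');
    SET @Text = REPLACE(@Text, N'&apos;', N'''');
    SET @Text = REPLACE(@Text, N'&quot;', N'"');
    SET @Text = REPLACE(@Text, N'&lt;', N'<');
    SET @Text = REPLACE(@Text, N'&gt;', N'>');
    SET @Text = REPLACE(@Text, N'&amp;', N'&');
    RETURN @Text;
END;
GO

-- Set-based usage: one UPDATE instead of the row-by-row cursor.
-- The WHERE clause skips rows with no '&...;' sequence at all.
UPDATE dbo.[Object]
SET [Desc] = dbo.DecodeXmlEntities([Desc])
WHERE [Desc] LIKE N'%&%;%';
```

With the function in place, the same call can also be used in a SELECT to decode values on the way out, without modifying the stored data.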

Thank You
