One of my team mates was recently faced with an interesting problem – a few records were inserted/updated into the database via some backend data append and bulk insert tasks. These records apparently had some special characters which was causing the application to crash. While a application code fix was being developed, tested and deployed it was essential to be able to identify the bad data and clean it up as a stop-gap arrangement. The problem was that the offending characters were not available on the keyboard! (i.e. they were “special” characters). I was approached to help with this task and this is an account of a solution that I came up with.
Demo
To begin with the demo, let us assume that we have a table with a text column that has both – “bad” and “good” data. To simulate the bad data, I am using special characters from the ASCII table which are not generally visible on the console (e.g. a vertical tab and others). Characters like Space and Horizontal tabs are quite normal in almost all documents and are therefore considered to be valid characters for the purposes of this demo.
Once the table has been created and test data created, we go about the task of identifying the offending records. To do so, we simply build a lookup table which would help us create a string of “bad” characters which can be used as part of a regular expression.
That’s it! Once the regular expression is ready, it’s a matter of writing a simple SELECT statement to identify the records matching the expression.
The complete script, with the output is shown below:
USE tempdb;
GO
–Create the test table
IF OBJECT_ID(‘tempdb..#hiddenCharacterValues’,’U’) IS NOT NULL
DROP TABLE #hiddenCharacterValues;
GOCREATE TABLE #hiddenCharacterValues
(RecordId INT IDENTITY(1,1),
RecordValue VARCHAR(25)
);
GO–Insert some test data into the table
INSERT INTO #hiddenCharacterValues
(RecordValue)
VALUES (‘StandardSpace’ + CHAR(32)), –Good
(‘GroupSeparator’ + CHAR(29)), –Bad
(‘HorizontalTab’ + CHAR(9)), –Good
(‘VerticalTab’ + CHAR(11)), –Bad
(‘ExtendedASCII’ + CHAR(176)); –Bad
GO–Build a comma separated list of known “bad characters”
DECLARE @regExString VARCHAR(MAX);
DECLARE @invalidCharLookUp TABLE (InvalidChars VARCHAR(2));
INSERT INTO @invalidCharLookUp (InvalidChars)
VALUES (CHAR(11)),
(CHAR(29)),
(CHAR(176));SELECT @regExString = COALESCE(@regExString,”)
+ iclu.InvalidChars
FROM @invalidCharLookUp AS iclu;–Finally select records which contain these characters
SELECT hcv.RecordId,
hcv.RecordValue
FROM #hiddenCharacterValues AS hcv
WHERE hcv.RecordValue LIKE ‘%[‘ + @regExString + ‘]%’;
GO
Now that the records are identified, they can either be manually updated or deleted (as required by the business).
Related Posts
- Using Regular Expressions with CHECK constraints [Link]
Are there any other methods that you would use (or are using) to address such an issue? Please do let me know.
Until we meet next time,
