#0314-SQL Server-Identify records with special characters using regular expressions

One of my team mates was recently faced with an interesting problem – a few records were inserted/updated into the database via some backend data append and bulk insert tasks. These records apparently had some special characters which was causing the application to crash. While a application code fix was being developed, tested and deployed it was essential to be able to identify the bad data and clean it up as a stop-gap arrangement. The problem was that the offending characters were not available on the keyboard! (i.e. they were “special” characters). I was approached to help with this task and this is an account of a solution that I came up with.

Demo

To begin with the demo, let us assume that we have a table with a text column that has both – “bad” and “good” data. To simulate the bad data, I am using special characters from the ASCII table which are not generally visible on the console (e.g. a vertical tab and others). Characters like Space and Horizontal tabs are quite normal in almost all documents and are therefore considered to be valid characters for the purposes of this demo.

Once the table has been created and test data created, we go about the task of identifying the offending records. To do so, we simply build a lookup table which would help us create a string of “bad” characters which can be used as part of a regular expression.

That’s it! Once the regular expression is ready, it’s a matter of writing a simple SELECT statement to identify the records matching the expression.

The complete script, with the output is shown below:

USE tempdb;

GO

–Create the test table

IF OBJECT_ID(‘tempdb..#hiddenCharacterValues’,’U’) IS NOT NULL

    DROP TABLE #hiddenCharacterValues;

GO
CREATE TABLE #hiddenCharacterValues

    (RecordId INT IDENTITY(1,1),

     RecordValue VARCHAR(25)

    );

GO
–Insert some test data into the table

INSERT INTO #hiddenCharacterValues

       (RecordValue)

VALUES (‘StandardSpace’  + CHAR(32)),   –Good

       (‘GroupSeparator’ + CHAR(29)),   –Bad

       (‘HorizontalTab’  + CHAR(9)),    –Good

       (‘VerticalTab’    + CHAR(11)),   –Bad

       (‘ExtendedASCII’  + CHAR(176));  –Bad

GO
–Build a comma separated list of known “bad characters”

DECLARE @regExString VARCHAR(MAX);

DECLARE @invalidCharLookUp TABLE (InvalidChars VARCHAR(2));

INSERT INTO @invalidCharLookUp (InvalidChars)

VALUES (CHAR(11)),

       (CHAR(29)),

       (CHAR(176));
SELECT @regExString = COALESCE(@regExString,”)

                      + iclu.InvalidChars

FROM @invalidCharLookUp AS iclu;
–Finally select records which contain these characters

SELECT hcv.RecordId,

       hcv.RecordValue

FROM #hiddenCharacterValues AS hcv

WHERE hcv.RecordValue LIKE ‘%[‘ + @regExString + ‘]%’;

GO

Now that the records are identified, they can either be manually updated or deleted (as required by the business).

Using Regular Expressions with CHECK constraints [Link]

Are there any other methods that you would use (or are using) to address such an issue? Please do let me know.

Until we meet next time,

Be courteous. Drive responsibly.

3 thoughts on “#0314-SQL Server-Identify records with special characters using regular expressions”

RDW2 January 12, 2014 at 4:20 AM

You don’t really need the table of invalid characters. I’ve created a CLR UDF for validating data that uses a RegEx to simply omit any unacceptable characters and then compare that result to the string that was passed in. If the RegEx result is different from the original string, then the data is invalid. I also created another CLR UDF that uses the same RegEx technique to simply scrub and then return the string that is passed in.

In both cases, I actually created 2 CLR UDFs, one that accepted a RegEx string as well as the string to be validated/scrubbed and one that accepted a”human consumable” name for a predefined RegEx (e.g. “NUMERIC_ONLY” to omit any characters not in the set of 0-9 and “,” or to flag as invalid any strings not consisting of only those characters). The set of predefined RegEx strings makes it easier for developers or users to create SQL statements to check for funky data.

LikeLike

Reply ↓

Nakul Vachhrajani January 12, 2014 at 9:25 AM

@RDW2: If I had to implement this on a permanent basis, I would definitely go down the path of creating a CLR function where all inputs can be validated from a business perspective done within a single assembly.

In this particular case, the error was causing the system to error out and a one-off custom solution was to be developed (while the application was being patched to actually work with these characters) which is why a pure T-SQL solution made more sense from a manageability perspective.

LikeLike

Reply ↓

RDW2 January 12, 2014 at 8:40 PM

Nakul,

You sort of missed my point. I implemented the CLR UDF in just such a scenario as you decribed but as a permanent means of implementing temporary. emergency fixes using T-SQL. I added the CLR UDFs that have predefined RegEx strings as a part of implementing that actual solution to the problem. By having the temporary UDF’s available, it is very easy to whip up the T-SQL statements to identify the issues . . . without creating any addition temporary (or permanent) look up tables.

LikeLike

Reply ↓

Let me know what you think about this post by leaving your feedback here! Cancel reply

RDW2 January 12, 2014 at 4:20 AM

You don’t really need the table of invalid characters. I’ve created a CLR UDF for validating data that uses a RegEx to simply omit any unacceptable characters and then compare that result to the string that was passed in. If the RegEx result is different from the original string, then the data is invalid. I also created another CLR UDF that uses the same RegEx technique to simply scrub and then return the string that is passed in.

In both cases, I actually created 2 CLR UDFs, one that accepted a RegEx string as well as the string to be validated/scrubbed and one that accepted a”human consumable” name for a predefined RegEx (e.g. “NUMERIC_ONLY” to omit any characters not in the set of 0-9 and “,” or to flag as invalid any strings not consisting of only those characters). The set of predefined RegEx strings makes it easier for developers or users to create SQL statements to check for funky data.

LikeLike

Reply ↓
Nakul Vachhrajani January 12, 2014 at 9:25 AM

@RDW2: If I had to implement this on a permanent basis, I would definitely go down the path of creating a CLR function where all inputs can be validated from a business perspective done within a single assembly.

In this particular case, the error was causing the system to error out and a one-off custom solution was to be developed (while the application was being patched to actually work with these characters) which is why a pure T-SQL solution made more sense from a manageability perspective.

LikeLike

Reply ↓
RDW2 January 12, 2014 at 8:40 PM

Nakul,

You sort of missed my point. I implemented the CLR UDF in just such a scenario as you decribed but as a permanent means of implementing temporary. emergency fixes using T-SQL. I added the CLR UDFs that have predefined RegEx strings as a part of implementing that actual solution to the problem. By having the temporary UDF’s available, it is very easy to whip up the T-SQL statements to identify the issues . . . without creating any addition temporary (or permanent) look up tables.

LikeLike

Reply ↓

This site uses Akismet to reduce spam. Learn how your comment data is processed.

SQLTwins by Nakul Vachhrajani

SQL Server tips and experiences dedicated to my twin daughters.

#0314-SQL Server-Identify records with special characters using regular expressions

Demo

Related Posts

3 thoughts on “#0314-SQL Server-Identify records with special characters using regular expressions”

Let me know what you think about this post by leaving your feedback here! Cancel reply

Demo

Related Posts

Spread the word around! Share via:

Related

3 thoughts on “#0314-SQL Server-Identify records with special characters using regular expressions”

Let me know what you think about this post by leaving your feedback here! Cancel reply