Most enterprise products need to be able to audit data modifications made within the application. An important aspect of auditing is identifying changes made outside the application as well, i.e. updates made directly against the data over a connection to the SQL Server instance. I recently authored a series of articles on SQLServerCentral.com about the data change and tamper detection mechanisms available in Microsoft SQL Server. These are:
- An in-depth look at change detection in SQL Server – Part 01 [Link]
- An in-depth look at change detection in SQL Server – Part 02 [Link]
- HASHBYTES: Is CHECKSUM really required? [Link]
Soon after the publication of these articles, a coincidence at work prompted me to write this post.
One of our application modules used HASHBYTES for detecting data changes. We were receiving sporadic reports from the quality assurance engineers about auditing errors. As we tracked the issue down using SQL Server Profiler, we noticed that the underlying error was a “simple” string truncation error:
Msg 8152, Level 16, State 10, Line 5
String or binary data would be truncated.
Root Cause
The following line from the Books Online documentation for HASHBYTES [Link] caught our attention:
Allowed input values are limited to 8000 bytes.
As soon as we read it, things became clear. A quick look at the schema showed that the underlying field being checked for changes was an NVARCHAR(MAX) field, and users were attempting to enter data longer than 4000 characters (i.e. more than 8000 bytes), which caused the issue. As long as the data remained at or below 4000 characters, no error was encountered.
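Because NVARCHAR stores 2 bytes per character, the 4000-character threshold maps exactly onto the documented 8000-byte limit. As a quick sketch (variable names are mine, for illustration only), DATALENGTH confirms the byte math:

USE tempdb;
GO
--NVARCHAR stores 2 bytes per character, so 4000 characters sit exactly at
--the 8000-byte limit and 4001 characters exceed it
DECLARE @tFourThousandChars    NVARCHAR(MAX) = CAST(REPLICATE('a',4000) AS NVARCHAR(MAX));
DECLARE @tFourThousandOneChars NVARCHAR(MAX) = CAST(REPLICATE('a',4001) AS NVARCHAR(MAX));

SELECT DATALENGTH(@tFourThousandChars)    AS BytesAt4000Characters,  --8000
       DATALENGTH(@tFourThousandOneChars) AS BytesAt4001Characters;  --8002
GO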
A demo
The script below demonstrates the expected behaviour of HASHBYTES when the input string is within the 8000-byte limit. For the sake of brevity, I am using the REPLICATE function to create the string.
USE tempdb;
GO
--Declare an NVARCHAR(MAX) variable
DECLARE @tHashInputs NVARCHAR(MAX);

--Create a 4000 character string
SELECT @tHashInputs = CAST(REPLICATE('a',4000) AS NVARCHAR(MAX));

--Attempt to generate the HASHBYTES value for this string
SELECT HASHBYTES('MD5',ISNULL(@tHashInputs,''));
GO
Now, let me alter the script so that the string being evaluated by HASHBYTES exceeds the 8000-byte limit, i.e. is 4001 characters long for the NVARCHAR data type:
USE tempdb;
GO
--Declare an NVARCHAR(MAX) variable
DECLARE @tHashInputs NVARCHAR(MAX);

--Create a 4001 character string
SELECT @tHashInputs = CAST(REPLICATE('a',4001) AS NVARCHAR(MAX));

--Attempt to generate the HASHBYTES value for this string
SELECT HASHBYTES('MD5',ISNULL(@tHashInputs,''));
GO
The following error is encountered upon executing the batch:
Msg 8152, Level 16, State 10, Line 5
String or binary data would be truncated.
As can be seen from the example above, HASHBYTES will fail for data beyond the documented 8000-byte limit. Hence, areas like free-form comments may not be suitable candidates for HASHBYTES.
Solution
The only workarounds we currently know of are:
- Split the string into multiple parts before generating the HASHBYTES value (a minimal sketch of this approach follows this list)
- Opt for another change detection mechanism
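Here is a minimal sketch of the splitting approach, assuming MD5 and a 4000-character chunk size. This illustrates the idea only and is not the exact implementation used in our application: hash the string in 4000-character chunks, accumulate the per-chunk hashes, and then hash the accumulated buffer to arrive at a single value.

USE tempdb;
GO
--Illustrative sketch only: hash an NVARCHAR(MAX) value in 4000-character chunks
DECLARE @tHashInputs NVARCHAR(MAX) = CAST(REPLICATE('a',4001) AS NVARCHAR(MAX));

DECLARE @chunkSize INT = 4000;            --4000 NVARCHAR characters = 8000 bytes
DECLARE @position  INT = 1;
DECLARE @combined  VARBINARY(MAX) = 0x;   --accumulates the per-chunk MD5 hashes (16 bytes each)

WHILE @position <= LEN(@tHashInputs)
BEGIN
    --Each chunk is at most 8000 bytes, so HASHBYTES accepts it
    SET @combined = @combined + HASHBYTES('MD5', SUBSTRING(@tHashInputs, @position, @chunkSize));
    SET @position = @position + @chunkSize;
END;

--Hash the accumulated chunk hashes to produce a single change detection value
SELECT HASHBYTES('MD5', @combined) AS ChunkedHashValue;
GO

A string differing in any chunk produces a different per-chunk hash and therefore a different combined hash; the trade-off is the extra procedural code and, for extremely long strings (more than 500 chunks with MD5), the accumulated buffer itself approaching the 8000-byte limit.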
Have you ever encountered such a situation before? If yes, what solution did you apply to overcome it?
References:
- HASHBYTES – Books Online page [Link]
- An in-depth look at change detection in SQL Server – Part 01 [Link]
- An in-depth look at change detection in SQL Server – Part 02 [Link]
- HASHBYTES: Is CHECKSUM really required? [Link]
Until we meet next time,
Hi, I do not have much of an idea about HASHBYTES, but if we use the VARCHAR data type instead of NVARCHAR, it allows the longer string:
USE tempdb;
GO
--Declare a VARCHAR(MAX) variable
DECLARE @tHashInputs VARCHAR(MAX);

--Create a 4001 character string
SELECT @tHashInputs = CAST(REPLICATE('a',4001) AS VARCHAR(MAX));

--Attempt to generate the HASHBYTES value for this string
SELECT HASHBYTES('MD5',ISNULL(@tHashInputs,''));
GO
But if NVARCHAR is necessary for HASHBYTES, then splitting the string into multiple parts is the better solution.
@riks: Thank you for your comments. The prime difference between VARCHAR and NVARCHAR is that NVARCHAR allows us to store Unicode data. Since our application needs to store Unicode data, NVARCHAR was the data type of choice.
In case the application can use VARCHAR, you are absolutely correct: HASHBYTES will be able to go up to 8000 characters without encountering an error. The reason is that NVARCHAR occupies 2 bytes of storage per character, as opposed to VARCHAR, which occupies only 1 byte per character.
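To illustrate that point (a quick sketch added here for clarity, not part of the original exchange): with VARCHAR, one character is one byte, so HASHBYTES succeeds at exactly 8000 characters and fails at 8001.

USE tempdb;
GO
--Illustration: VARCHAR stores 1 byte per character, so the HASHBYTES limit
--is reached at 8000 characters rather than 4000
DECLARE @tVarcharInput VARCHAR(MAX);

SELECT @tVarcharInput = REPLICATE(CAST('a' AS VARCHAR(MAX)), 8000);
SELECT HASHBYTES('MD5', @tVarcharInput);   --succeeds: exactly 8000 bytes

SELECT @tVarcharInput = REPLICATE(CAST('a' AS VARCHAR(MAX)), 8001);
SELECT HASHBYTES('MD5', @tVarcharInput);   --fails with Msg 8152 (string truncation)
GO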
@Nakul: Thanks for the reply. Is the data type application specific? Can you give an example of an application that requires a Unicode data type and another that works with a non-Unicode data type?