Tag Archives: Data Quality


#0380 – SQL Server – Basics – Specify Scale and Precision when defining Decimal and Numeric datatypes


I had an interesting conversation during a code review that I was asked to conduct for a simple query that a team had written to support their project monitoring. The team specialized in quality assurance and had minimal development experience. They had used variables of decimal data types in their script, but the variables were declared without any precision or scale. When I commented on the variable declarations, I was asked the question:

Does it make a difference if we do not specify scale and precision when defining variables of decimal or numeric datatypes?

I often look forward to such encounters for two reasons:

  1. Answering such questions reinforces my own concepts and learning
  2. It helps me contribute to the overall community by writing a blog about my experience

When the question was asked, I honestly admitted that I did not have a specific answer beyond it being the best practice from a long-term maintainability standpoint. After lunch, I ran a small test, which I showed the team and will be presenting today.

The Problem

In the script below, I take a decimal variable (declared without a fixed scale or precision) with value (20.16) and multiply it by a constant number (100) and then by another constant decimal (100.0). If one uses a calculator, the expected result is:

  • 20.16 * 100 = 2016
  • 20.16 * 100.0 = 2016

Expected results when we multiply a Decimal with another number using the calculator

However, when we perform the same test via SQL Server, we are in for a surprise:

DECLARE @dVal1 DECIMAL = 20.16;

SELECT (@dVal1 * 100)   AS DecimalMultipliedByAnInteger, 
       (@dVal1 * 100.0) AS DecimalMultipliedByADecimal;
GO

As can be seen from the results below, we do not get the expected results; instead, we find that the decimal value was rounded off before the multiplication took place.

Although the test input value is declared as a decimal, the result appears to be based only on the integer part of the input; the fractional part is discarded.

Root Cause

The reason behind this behaviour is hidden in the following lines of the SQL Server documentation on MSDN (formerly known as "Books Online") for the decimal and numeric data types, available here: https://msdn.microsoft.com/en-us/library/ms187746.aspx.

…s (scale)
The number of decimal digits that will be stored to the right of the decimal point….Scale can be specified only if precision is specified. The default scale is 0…

The real reason however is a few lines below – rounding.

Converting decimal and numeric Data

…By default, SQL Server uses rounding when converting a number to a decimal or numeric value with a lower precision and scale….

What SQL Server appears to be doing here is that when a variable of DECIMAL datatype is declared without a precision and scale value, the scale is taken to be zero (0). Hence, the test value of 20.16 is rounded to the nearest integer, 20.
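As a quick check (a minimal sketch, not part of the original post), SQL_VARIANT_PROPERTY can be used to inspect the precision and scale that SQL Server actually assigned; a bare DECIMAL reports the default precision of 18 and a scale of 0:

DECLARE @dVal1 DECIMAL = 20.16;

SELECT SQL_VARIANT_PROPERTY(@dVal1, 'Precision') AS AssignedPrecision, --18, the default
       SQL_VARIANT_PROPERTY(@dVal1, 'Scale')     AS AssignedScale,     --0, so 20.16 is rounded to 20
       @dVal1                                    AS StoredValue;
GO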

To confirm that rounding is indeed taking place, I swapped the digits in the input value from 20.16 to 20.61 and re-ran the same test.

DECLARE @dVal1 DECIMAL = 20.61;

SELECT (@dVal1 * 100)   AS DecimalMultipliedByAnInteger, 
       (@dVal1 * 100.0) AS DecimalMultipliedByADecimal;
GO

Now, the result was 2100 instead of 2000 because the input test value of 20.61 was rounded to 21 before the multiplication took place.

Because the test input value was declared as a decimal without precision and scale, rounding took place, resulting in a different result.


By this time, my audience was in awe as they realized the impact this behaviour would have had on their project monitoring numbers.

The Summary – A Best Practice

We can summarize the learning into a single sentence:

It is a best practice for ensuring data quality to always specify a precision and scale when working with variables of the numeric or decimal data types.

To confirm, here’s a version of the same test as we saw earlier. The only difference is that this time, we have explicitly specified the precision and scale on our input values.

DECLARE @dVal1 DECIMAL(19,4) = 20.16;

SELECT (@dVal1 * 100)   AS DecimalMultipliedByAnInteger, 
       (@dVal1 * 100.0) AS DecimalMultipliedByADecimal;
GO

When we look at the results, we see that the output is exactly what we wanted to see, i.e. 2016.

Because the test input value was declared as a decimal with precision and scale, no rounding took place and we got the expected result, i.e. 2016.



Until we meet next time,
Be courteous. Drive responsibly.


#0374 – SQL Server – Removing string terminators (“\0”) using REPLACE during string concatenation


Recently, I was called upon to troubleshoot a strange behaviour demonstrated by a data conditioning script that involved string concatenation. The script ran fine without any errors, but it did not appear to be concatenating the strings.

The entire script for this post is shared towards the end of the post. Because the script involves creating specific data which would give away the root cause, I will not be presenting the snippets beforehand.

Assume that we have a simple table with "FirstName" and "LastName" columns, plus a "FullName" column to be populated:

Simple table with test data for string concatenation demo


The data conditioning script involved populating the “FullName” column in the table with a simple combination of the First and the Last names.

Output of String Concatenation Script demonstrating the problem


The Problem: If we look at the output carefully, there is a problem with the FullName for rows 2 and 4. Although the LastName is present, only the FirstName appears in the concatenation result.

The Theory:

After about an hour of troubleshooting, we decided to check the lengths of the strings in the table, and that's when we struck gold. Although we could "see" only a couple of characters, the reported length was higher than what we expected.

As can be seen from the screenshot below, although the FirstName "John" visibly has 4 characters, LEN reports a length of 5. Similarly, although the FullName shows up as "John", LEN reports a length of 9.

Screenshot showing the length of the strings in the table


This is the moment when the light bulb went off and we realized what was going on. The strings were inserted by a legacy application based on C/C++ code. In such legacy applications, strings must be explicitly terminated by appending the string termination character, CHAR(0). That would very well account for the presence of an additional character in the FirstName.

In order to explain the mystery behind the length of the FullName, let’s walk through the concatenation of a record.

  • FirstName = John, 4 characters
  • LastName = Doe, 3 characters
  • Expected length of FullName = First Name + a space + Last Name = 4 + 1 + 3 = 8 characters
  • Actual length of FullName = 9 characters

The difference is explained once we accept that the concatenation did actually happen – we simply cannot see the LastName part of the string because the client encounters the string termination character and stops displaying any further characters.
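To see the invisible character at work in isolation, here is a minimal sketch (not part of the original script) comparing lengths with and without an embedded CHAR(0):

SELECT 'John' + CHAR(0) + ' Doe'      AS ConcatenatedValue,   --typically displays as just "John"
       LEN('John' + CHAR(0) + ' Doe') AS ActualLength,        --9: the terminator and " Doe" are still there
       LEN('John' + ' Doe')           AS ExpectedLength;      --8
GO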

In order to confirm our theory, we replaced the string termination character with a hyphen (-), using the REPLACE function.

UPDATE pt
SET pt.FullName = REPLACE(pt.FullName COLLATE Latin1_General_BIN, CHAR(0) COLLATE Latin1_General_BIN, '-')
FROM @personTable AS pt;

In order for this to work in all environments, we changed the collation to binary when performing the replace (because ultimately, string terminators are just a set of bits when performing binary manipulation).

Finding and Replacing the String Terminator in a string.


As can be seen from the screenshot above, the REPLACE was successful, and we were able to see the entire string.

In Conclusion

  • When working with data created by legacy code, it is useful to understand how that code works. In this case, we realized that the string terminator was causing the problem and were able to overcome it – but it could have led to hours of troubleshooting (the option of re-creating the data manually was also put on the table)
  • SQL Server and T-SQL can be trusted when it comes to data manipulation. Almost always, it's the system or the human element that is missing something critical

Until we meet next time,
Be courteous. Drive responsibly.

Script for this post:

USE tempdb;
GO
--Creating the sample table
DECLARE @personTable TABLE (FirstName VARCHAR(50),
                            LastName  VARCHAR(50),
                            FullName  VARCHAR(100)
                           );

--Insert some test data
INSERT INTO @personTable (FirstName, LastName)
VALUES ('Nakul','Vachhrajani'),
       ('John' + CHAR(0),'Doe'),
       ('Jack','Smith'),
       ('FirstName' + CHAR(0),'LastName');

--Check out the data
SELECT pt.FirstName,
       pt.LastName
FROM @personTable AS pt;

--Perform the string concatenation
UPDATE pt
SET pt.FullName = pt.FirstName + ' ' + pt.LastName
FROM @personTable AS pt;


--Check out the data
SELECT pt.FirstName,
       pt.LastName,
       pt.FullName
FROM @personTable AS pt;

--Checking the length of the data
SELECT pt.FirstName,
       LEN(pt.FirstName) AS FirstNameLength,
       pt.LastName,
       LEN(pt.LastName) AS LastNameLength,
       pt.FullName,
       LEN(pt.FullName) AS FullNameLength
FROM @personTable AS pt;

--Confirming presence of string termination characters
SELECT pt.FirstName,
       LEN(pt.FirstName) AS FirstNameLength,
       CHARINDEX(CHAR(0),pt.FirstName,1) AS LocationOfStringTerminator,
       pt.LastName,
       LEN(pt.LastName) AS LastNameLength,
       pt.FullName,
       LEN(pt.FullName) AS FullNameLength,
       CHARINDEX(CHAR(0),pt.FullName,1) AS LocationOfStringTerminatorInFullName
FROM @personTable AS pt;

--Replace the string termination character with a hyphen
UPDATE pt
SET pt.FullName = REPLACE(pt.FullName COLLATE Latin1_General_BIN, CHAR(0) COLLATE Latin1_General_BIN, '-')
FROM @personTable AS pt;

--Confirming that string termination characters are no longer present
SELECT pt.FirstName,
       LEN(pt.FirstName) AS FirstNameLength,
       CHARINDEX(CHAR(0),pt.FirstName,1) AS LocationOfStringTerminator,
       pt.LastName,
       LEN(pt.LastName) AS LastNameLength,
       pt.FullName,
       LEN(pt.FullName) AS FullNameLength,
       CHARINDEX(CHAR(0),pt.FullName,1) AS LocationOfStringTerminatorInFullName
FROM @personTable AS pt;
GO

#0371 – SQL Server – Best Practices – Always use precise decimal data-types


I recently came across a concrete example of something that has always been held up as a best practice – when working with values that are sensitive to decimal precision, always use precise data types like DECIMAL instead of approximate ones like FLOAT or REAL.

The following example clearly demonstrates that using a FLOAT in mathematical calculations may change the outcome of the computation. For financial calculations, this would mean incorrect reporting of revenues.

USE tempdb;
GO
DECLARE @floatValue FLOAT = 1.5,
        @decimalValue DECIMAL(5,2) = 1.5,
        @multiplicationFactor INT = 159;

SELECT @floatValue AS FloatValue,
       @decimalValue AS DecimalValue,
       (@multiplicationFactor*(@floatValue/100)) AS FloatBasedValue,
       (@multiplicationFactor*(@decimalValue/100)) AS DecimalBasedValue,
       ROUND((@multiplicationFactor*(@floatValue/100)),2) AS RoundedFloatBasedValue,
       ROUND((@multiplicationFactor*(@decimalValue/100)),2) AS RoundedDecimalBasedValue;

The output of this simple computation is provided in the screenshot below:

The result shows a rounding error when a value based on a FLOAT is rounded to 2 decimals versus when a fixed-precision DECIMAL is rounded – the impact of using imprecise datatypes in mathematical computations

As can clearly be seen, the computations performed using an imprecise datatype (FLOAT) result in a rounding error. When compounded over multiple derived calculations, this can have a serious impact on the end result.
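To illustrate the compounding effect, here is a minimal sketch (the iteration count and values are assumptions, not from the original post) that accumulates the rounded FLOAT-based and DECIMAL-based results over 1,000 iterations:

DECLARE @i INT = 1,
        @floatTotal FLOAT = 0,
        @decimalTotal DECIMAL(19,4) = 0;

WHILE @i <= 1000
BEGIN
    --The same per-iteration rounding difference, applied 1,000 times
    SET @floatTotal   = @floatTotal   + ROUND(159 * (CAST(1.5 AS FLOAT)        / 100), 2);
    SET @decimalTotal = @decimalTotal + ROUND(159 * (CAST(1.5 AS DECIMAL(5,2)) / 100), 2);
    SET @i = @i + 1;
END

--The totals drift apart even though each iteration differed by only 0.01
SELECT @floatTotal AS FloatTotal, @decimalTotal AS DecimalTotal;
GO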

The conclusion therefore is simply that it is always a best practice to use precise datatypes when working with financial calculations.


Until we meet next time,
Be courteous. Drive responsibly.


#0364 – SQL Server – Using Primary Key, Check and other Constraints with User Defined Table Types (UDTT)


As database systems and their interactions and interfaces become more complex, a large amount of complex data is exchanged across system boundaries. Table-valued parameters are frequently used to exchange this data. If the structure of the data being exchanged is known, systems (working with SQL Server 2008 and above) will often use user-defined table types (UDTTs).

Depending upon the domain, data being exchanged across system boundaries may need to conform to a number of constraints, especially check and unique constraints. My encounter with UDTTs was informal – I learnt about them along the way and never explored them fully. Recently, I found myself wondering whether I could use constraints with UDTTs.

And so, I decided to conduct a small experiment. I created the following UDTT. As can be seen from the definition, it includes the following constraints:

  • Primary Key constraint
  • Unique constraint
  • Default constraint
  • Check constraint

USE tempdb;
GO
CREATE TYPE dbo.UDTTDDLTest AS TABLE
    ( RecordId     INT         NOT NULL,
      RecordValue  VARCHAR(50) NOT NULL UNIQUE,
      RecordStatus TINYINT     NOT NULL DEFAULT(2),
      PRIMARY KEY CLUSTERED (RecordId),
      CHECK (RecordStatus >= 1 AND RecordStatus <= 3)
    );
GO

SQL Server allowed creation of the type without any issues. I then ran the following code, which attempts to insert various combinations of data into the user-defined table type.

USE tempdb;
GO

SET NOCOUNT ON;

DECLARE @myVariable AS dbo.UDTTDDLTest;

--Normal insertion
INSERT INTO @myVariable (RecordId, RecordValue, RecordStatus)
VALUES (1,
        'SQL Twins',
        1
       );


SELECT * FROM @myVariable;

--Checking effect of default constraints
INSERT INTO @myVariable (RecordId, RecordValue)
VALUES (2,
        'SQL Twins with Default'
       );

SELECT * FROM @myVariable;

--Checking effect of primary key constraint
INSERT INTO @myVariable (RecordId, RecordValue, RecordStatus)
VALUES (2,
        'Oops...duplicate key',
        1
       );

SELECT * FROM @myVariable;

--Checking effect of check constraint
INSERT INTO @myVariable (RecordId, RecordValue, RecordStatus)
VALUES (3,
        'Bad Status!',
        4
       );

SELECT * FROM @myVariable;

--Checking effect of unique constraint. The value "SQL Twins" is already in the UDTT,
--so if the constraint is in effect, we will get an error
INSERT INTO @myVariable (RecordId, RecordValue, RecordStatus)
VALUES (4,
        'SQL Twins',
        1
       );


SELECT * FROM @myVariable;
GO

As can be seen from the output below, SQL Server successfully prevented me from entering data that violated any one of the constraints.

Msg 2627, Level 14, State 1, Line 38
Violation of PRIMARY KEY constraint ‘PK__#B40743E__FBDF78E9F9E4365B’. Cannot insert duplicate key in object ‘dbo.@myVariable’. The duplicate key value is (2).
The statement has been terminated.
Msg 547, Level 16, State 0, Line 47
The INSERT statement conflicted with the CHECK constraint “CK__#B40743ED__Recor__B5EF8C5F”. The conflict occurred in database “tempdb”, table “@myVariable”.
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 57
Violation of UNIQUE KEY constraint ‘UQ__#B40743E__4D9E41B64A9AE451’. Cannot insert duplicate key in object ‘dbo.@myVariable’. The duplicate key value is (SQL Twins).
The statement has been terminated.

Records That Inserted Successfully

Constraint Violation Errors on UDTT

Summary

Constraints (Primary Key, Default, Unique and Check constraints) are supported by User Defined Table Types (UDTTs) in Microsoft SQL Server and can be used to ensure the quality of data being exchanged via UDTTs.
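As a follow-up (a minimal sketch; the procedure name is hypothetical and not from the original post), the same constrained UDTT can be used as a READONLY table-valued parameter, so the constraints also guard data crossing a procedure boundary:

CREATE PROCEDURE dbo.ProcessRecords
    @records dbo.UDTTDDLTest READONLY --table-valued parameters must be READONLY
AS
BEGIN
    SET NOCOUNT ON;

    --Any row that reaches this point has already passed the primary key,
    --unique, default and check constraints defined on the type
    SELECT RecordId,
           RecordValue,
           RecordStatus
    FROM @records;
END
GO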

Further Reading

  • Passing Table valued parameters to a table valued function – Underappreciated features of Microsoft SQL Server [Link]
  • SQL Server 2000-Alternate to Table Valued Parameters-Share Temporary Tables with a stored procedure [Link]
  • User Defined Table Types [Link]

Until we meet next time,
Be courteous. Drive responsibly.

SQL Server – Experiments with BIT data type – Is it really Boolean? Can values other than 0 and 1 be assigned?


Recently at the office, we ended up discussing the BIT data type (http://msdn.microsoft.com/en-us/library/ms177603.aspx). While we all understand that BIT is supposed to represent a Boolean value, we were not quite convinced. Boolean values are two-state – they can only be TRUE or FALSE (or 1 and 0). However, we have seen BIT be NULL a couple of times. Therefore, the questions in our minds were:

  1. Is BIT really a two-state data type, or is it tri-state? (For those not familiar with the world of electronics, tri-state is an undefined state – neither ON nor OFF)
  2. Do we need to use only a 0 or a 1 to initialize the BIT data type?
  3. Can we use ‘TRUE’ and ‘FALSE’ with the BIT datatype?

Finally, we decided to perform a little experiment. We prepared a small script covering the possible scenarios and checked their outputs. The script is available for your reference below:

--Declare the variable
DECLARE @bit BIT;

--Check for default value
SELECT @bit AS DefaultValue;

--Set to a positive value, other than 0 or 1
SET @bit = 99;
SELECT @bit AS [PositiveValue];

--Set to a negative value
SET @bit = -99;
SELECT @bit AS [NegativeValue];

--Set to a decimal value > 0 and < 1
SET @bit = 0.25;
SELECT @bit AS [DecimalValue025];

--Set to a decimal value > 0 and < 1
SET @bit = 0.50;
SELECT @bit AS [DecimalValue050];

--Set to a decimal value > 0 and < 1
SET @bit = 0.75;
SELECT @bit AS [DecimalValue075];

--Set to a string value - TRUE
SET @bit = 'TRUE';
SELECT @bit AS [StringTRUE],
       CASE
           WHEN @bit = 'TRUE'
           THEN 'Yes'
           ELSE 'No'
       END AS  [StringComparison];

--Set to a string value - FALSE
SET @bit = 'FALSE';
SELECT @bit AS [StringFALSE],
       CASE
           WHEN @bit = 'TRUE'
           THEN 'Yes'
           ELSE 'No'
       END AS  [StringComparison];

The results that came up resolved all our questions.

Experimenting with the BIT datatype in SQL Server

Conclusion

The above experiment helped us conclude that:

  1. BIT values are not strictly Boolean; they are tri-state, with NULL, 0 and 1 as possible values
  2. If left unassigned, the value is NULL
  3. Any value other than 0 or NULL is stored as 1 (even negative or fractional values)
  4. The string values 'TRUE' and 'FALSE' can be used with the BIT data type
  5. Because the default value of BIT is NULL, always assign your BIT variables, as shown in the sketch below (for that matter, always assign a default value to any variable!)
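A minimal sketch of that defensive habit (the variable name is illustrative, not from the original post):

--Initialize BIT variables at declaration so they never sit in the NULL "third state"
DECLARE @isProcessed BIT = 0;

--If a NULL could still sneak in, read defensively with ISNULL
SELECT ISNULL(@isProcessed, 0) AS SafeValue;
GO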

I trust you found the above experiment interesting. I would welcome your ideas for future experiments.

Until we meet next time,
Be courteous. Drive responsibly.