Tag Archives: Best Practices

Articles related to performance and usability best practices in Microsoft SQL Server.

#0385 – SQL Server – Query Development and NULL values


Software Design, Assumptions and their side-effects

A few days ago, I tweeted a news piece that came to my attention about the name “NULL” causing computing systems to malfunction. When I shared it with my colleagues, a healthy discussion took place on how software architectures should handle NULL.

The key focus of the discussion was, of course, that we need to be more globally aware and avoid generalizations when defining business requirements (e.g. just because surnames longer than 10 characters are uncommon in one country does not mean there are no people with longer surnames, as highlighted here).

We also talked about detecting and manipulating NULL values. In the world of databases, joins between tables are a common scenario, and it is interesting to study the impact that NULL has on the JOIN operator in a T-SQL query.

NULL values and JOINs

I encountered a scenario involving joins on NULL-able columns during a unit test I was executing. Rows were disappearing from the result set, and initially I was unable to figure out why. Allow me to explain with a demo.

I have a simple table with software metrics like lines of code (LoC), story points, etc. for various modules. I also have a look-up table that gives me details of the measurement units, including provision to handle unknown sizing units (which was possible in our case for modules developed in fairly uncommon technologies).

When I joined the metrics table to the sizing units table, the metrics for modules developed in uncommon technologies did not appear!

DECLARE @someMetrics TABLE
         (ModuleCode INT         NOT NULL,
          SizeValue  INT             NULL,
          SizeUnit   VARCHAR(10)     NULL
         );

DECLARE @unitMaster TABLE
        (SizeUnit        VARCHAR(10) NULL,
         UnitDescription VARCHAR(50) NULL
        );

INSERT INTO @unitMaster (SizeUnit,
                         UnitDescription
                        )
VALUES (NULL , 'Unit Unknown'),
       ('LOC', 'Lines of Code'),
       ( 'SP', 'Story Points');

INSERT INTO @someMetrics (ModuleCode,
                          SizeValue,
                          SizeUnit
                         )
VALUES (1, 1200, 'LOC'),
       (3, 3, 'SP' ),
       (6, 32, NULL ),
       (7, 2500, 'LOC');

--Problem
SELECT sm.ModuleCode,
       sm.SizeValue,
       um.UnitDescription
FROM @someMetrics AS sm
INNER JOIN @unitMaster AS um ON sm.SizeUnit = um.SizeUnit;

Metrics for NULL sizing units were filtered out in the join process: the join operator eliminates rows with NULL values in the join predicates.

The Root Cause

What is happening is that, by default, NULL compared with any value (NULL or otherwise) yields NULL. This is expected: when one of the two values in a comparison is unknown, there is no way to ascertain the result of the comparison, and hence the result itself has to be unknown, i.e. NULL.

In the case of a join, SQL Server fetches all rows for which the join condition evaluates to true; rows for which the condition evaluates to anything other than true are filtered out. This is exactly what was happening in my case – since the columns used in the join predicate contained NULL values, the comparison yielded NULL (a non-true result) and those rows were filtered out.
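
As a quick illustration (a minimal check, independent of the demo tables above), a comparison against NULL never evaluates to true:

--NULL compared with anything (including another NULL) does not evaluate to TRUE
SELECT CASE WHEN NULL = NULL  THEN 'True' ELSE 'Not True' END AS NullComparedWithNull,
       CASE WHEN NULL = 'LOC' THEN 'True' ELSE 'Not True' END AS NullComparedWithValue;
--Both columns return 'Not True'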

The Fix

Option 01: The ideal fix is to ensure that the data never contains NULLs. Instead of NULL, a default value (anything suitable to the domain) can be used to indicate that the source system did not supply any value.
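
As a minimal sketch of this approach for the demo tables above (assuming a sentinel value of 'UNK' is acceptable in the domain), the columns can be declared NOT NULL with a default:

--Sketch: disallow NULLs and use a sentinel value ('UNK' is an assumed placeholder)
DECLARE @unitMaster TABLE
        (SizeUnit        VARCHAR(10) NOT NULL DEFAULT ('UNK'),
         UnitDescription VARCHAR(50) NOT NULL
        );

DECLARE @someMetrics TABLE
         (ModuleCode INT         NOT NULL,
          SizeValue  INT             NULL,
          SizeUnit   VARCHAR(10) NOT NULL DEFAULT ('UNK')
         );

INSERT INTO @unitMaster (SizeUnit, UnitDescription)
VALUES ('UNK', 'Unit Unknown'),
       ('LOC', 'Lines of Code'),
       ('SP' , 'Story Points');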

Option 02: In case changing the source data is not feasible, the query needs to be updated to replace the NULL values with a chosen default, as shown below.

DECLARE @someMetrics TABLE
         (ModuleCode INT         NOT NULL,
          SizeValue  INT             NULL,
          SizeUnit   VARCHAR(10)     NULL
         );

DECLARE @unitMaster TABLE
        (SizeUnit        VARCHAR(10) NULL,
         UnitDescription VARCHAR(50) NULL
        );

INSERT INTO @unitMaster (SizeUnit,
                         UnitDescription
                        )
VALUES (NULL , 'Unit Unknown'),
       ('LOC', 'Lines of Code'),
       ( 'SP', 'Story Points');

INSERT INTO @someMetrics (ModuleCode,
                          SizeValue,
                          SizeUnit
                         )
VALUES (1, 1200, 'LOC'),
       (3, 3, 'SP' ),
       (6, 32, NULL ),
       (7, 2500, 'LOC');

--One way to resolve the problem
SELECT sm.ModuleCode,
       sm.SizeValue,
       um.UnitDescription
FROM @someMetrics AS sm
INNER JOIN @unitMaster AS um
      ON ISNULL(sm.SizeUnit,'NULL') = ISNULL(um.SizeUnit,'NULL');

Using a chosen default instead of NULL in the join predicate

Conclusion

Irrespective of the business design, it is extremely important to have a strategy for handling NULL values during system design. Most systems lack consistency in the way they allow for and handle NULL values.

Bringing in a set of standardized design practices for handling NULL values makes system behaviour easier to predict and also helps any re-engineering efforts that may be required in the future.

Further Reading

  • Why is it not a good idea to implement NOT NULL check as a CHECK constraint? [Blog Link here]
  • SQL Server – CONCAT_NULL_YIELDS_NULL property [Blog Link here]
  • Fun with temporary tables – Impact of ANSI_NULL_DFLT_ON [Blog Link here]

Until we meet next time,

Be courteous. Drive responsibly.


#0384 – SQL Server – Basics – Implicit Conversion – Blank strings ('') are treated as integers


On one of the forums, I encountered an interesting question the other day. The person asking the question had a table with integers, NULL values and blank strings. In order to clean up this data, they needed to distinctly identify each of these cases, but the simple queries below seemed to return the same results.

USE tempdb;
GO
SELECT sv.RowValue,
       sv.RowValueInWords
FROM dbo.SomeValues AS sv
WHERE sv.RowValue >= 0;
GO

USE tempdb;
GO
SELECT sv.RowValue,
       sv.RowValueInWords
FROM dbo.SomeValues AS sv
WHERE sv.RowValue >= '';
GO

Behind the scenes

This behaviour is not a bug; it is by design. When faced with conflicting data types, SQL Server performs an implicit conversion to arrive at a common data type that can be used to compare the two values.

Implicit conversion is done by following the rules of data type precedence, which dictates the sequence of implicit conversion, i.e. a data type with lower precedence is converted to a data type that has higher precedence. If such implicit conversion is not possible, and the user has not specified an explicit conversion, an error is returned.

Per the rules of data type precedence, character data types (VARCHAR/NVARCHAR/TEXT/NTEXT) have a lower precedence than integers (INT). Hence, when comparing characters and numbers, SQL Server will always implicitly convert the string to a number. A blank string ('') is therefore treated as a zero (0) when implicit conversion takes place.
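
A quick way to confirm this conversion directly (a minimal check, separate from the demo that follows):

--A blank string implicitly converts to the integer 0
SELECT CONVERT(INT, '') AS BlankStringAsInt,
       CASE WHEN 0 = '' THEN 'Equal' ELSE 'Not Equal' END AS ZeroComparedWithBlank;
--Returns 0 and 'Equal'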

Allow me to present this with a demo.

A demo

In the example below, I recreate the scenario referenced in the forum post – I have a table with numbers, zero and NULLs.

USE tempdb;
GO
SET NOCOUNT ON;
GO
--Safety Check
IF OBJECT_ID('dbo.SomeValues','U') IS NOT NULL
BEGIN
    DROP TABLE dbo.SomeValues;
END
GO

--Creating the test table
CREATE TABLE dbo.SomeValues 
        (RowId           INT         NOT NULL IDENTITY(1,1),
         RowValue        INT             NULL,
         RowValueInWords VARCHAR(50) NOT NULL
        );
GO

--Inserting values in the test table
INSERT INTO dbo.SomeValues(RowValue, RowValueInWords)
VALUES (0,    'Zero'),
       (NULL, 'NULL'),
       (1,    'One'),
       (2,    'Two'),
       (3,    'Three'),
       (4,    'Four'),
       (5,    'Five');
GO

Now let me run the query presented in the forum post.

USE tempdb;
GO
SELECT 'Query with Zero' AS QueryCondition,
       sv.RowValue,
       sv.RowValueInWords
FROM dbo.SomeValues AS sv
WHERE sv.RowValue >= 0;
GO

USE tempdb;
GO
SELECT 'Query with Blank' AS QueryCondition,
       sv.RowValue,
       sv.RowValueInWords
FROM dbo.SomeValues AS sv
WHERE sv.RowValue >= '';
GO

SQL Server returned the same results because the blank string was implicitly converted to a zero (0).

A Blank string is treated as a zero (0) when used in a query filter on an integer value due to implicit conversion.

Implicit Conversion may cause query filters to behave unexpectedly

Now, I will try to insert a blank string into the test table. We can see that the value is inserted successfully.

--Let's try to insert a blank value
USE tempdb;
GO
INSERT INTO dbo.SomeValues (RowValue, RowValueInWords)
VALUES ('','Blank');
GO
/***********
RESULTS
***********/
--Command(s) completed successfully.

The question now is – what did SQL Server insert? If a blank value was indeed stored, it would violate the rules of the data type enforced by the column definition. Hence, I will now select the values from the table to see what was inserted.

USE tempdb;
GO
--See what was stored
SELECT sv.RowValue,
       sv.RowValueInWords
FROM dbo.SomeValues AS sv;
GO

A blank string is inserted into the table as a zero due to implicit conversion

We can see from the results above that SQL Server performed an implicit conversion during the insert and stored a zero in the table. If the calling application later tries to validate the stored data, it will detect a mismatch between the expected and the actual values.

Summary

Implicit conversion is a boon if used wisely, but in most cases it can (and will) catch poorly written code and unsuspecting developers by surprise. Almost all operations in SQL Server are affected by implicit conversion as I have explored in the past with the following posts:

  • Best Practices – Avoid Implicit conversions as they may cause unexpected issues (Part 01) [Blog Link]
  • Best Practices – Avoid implicit conversions (Part 02) – Implicit conversions, COALESCE and ISNULL [Blog Link]
  • Msg 206; Operand Type Clash; Return type of a CASE expression follows datatype precedence [Blog Link]
  • Data Type Precedence in SQL Server [MSDN Link]

Until we meet next time,
Be courteous. Drive responsibly.


#0383 – SQL Server – Basics – Msg 1013 – The objects in the FROM clause have the same exposed names. Use correlation names to distinguish them.


When working with complex queries (those with more than a handful of joins), one thing that novice developers find difficult is keeping track of where they are and which objects have already been referenced in the query.

One of the developers just walked up to me the other day with the following error (formatted for the sake of brevity):

Msg 1013, Level 16, State 1, Line 3
The objects "Person.Person" and "Person.Person" in the 
FROM clause have the same exposed names. 
Use correlation names to distinguish them.

Now, I did not recollect ever having seen the error before, so my first response was to take a look at the query. As soon as I did, I immediately realized the problem – the query had a self-join, and no object aliases were used to distinguish between the two instances of the same object!

USE AdventureWorks2014;
GO
SELECT *
FROM Person.Person 
INNER JOIN Person.Person ON BusinessEntityID = BusinessEntityID;
GO

Msg 1013 indicating duplicate exposed names when the same object is referenced again in the query without an alias.

Lesson Learned

Once the objects in the query were given aliases, the error was resolved.
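
For reference, a minimal sketch of the corrected query (assuming the intent was a self-join on BusinessEntityID):

USE AdventureWorks2014;
GO
SELECT *
FROM Person.Person AS p1
INNER JOIN Person.Person AS p2 ON p1.BusinessEntityID = p2.BusinessEntityID;
GO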

However, this incident reinforces what I have always practiced and believed to be a best practice – always ensure that the tables, sub-queries and other constructs referenced in a query are given proper aliases.

Using proper aliases ensures that, as a developer, reviewer or support engineer, we always know exactly which instance of an object is being referred to.

Until we meet next time,
Be courteous. Drive responsibly.


#0381 – SQL Server – Table design – Is it better to use NEWID or NEWSEQUENTIALID when defining the key as a UNIQUEIDENTIFIER?


Database schema design involves defining the clustered keys (generally the primary key) on a table, and one of the main decisions to be taken is whether to base the clustered key on a UNIQUEIDENTIFIER/ROWGUID or on an INTEGER.

Generally, an INTEGER is the default choice for the clustered key column. Compared to an INTEGER (4 bytes), a GUID takes up a lot more space (16 bytes stored, 36 characters in its string form). Hence, the decision to use a UNIQUEIDENTIFIER/ROWGUID depends a lot upon the application:

  • Whether the amount of data to be stored in the table will ever exceed the limits of an integer key value
  • The code that is going to consume the data (it helps if the underlying storage uses keys in the same format as the code, as in the case of the .NET Entity Framework)
  • The nature of integration/DR implemented (e.g. external Id keys used in the integrated system, replication, etc.)

If the requirements do require that UNIQUEIDENTIFIER is to be used, the next question is:

Which is better to use as the default value for a GUID key: NEWID() or NEWSEQUENTIALID()?

NEWSEQUENTIALID() and NEWID() both generate GUIDs, however NEWSEQUENTIALID() has certain advantages:

  • NEWID() generates random values, so inserts land at random positions across the clustered index, whereas NEWSEQUENTIALID() generates ever-increasing values. Hence, inserts with NEWSEQUENTIALID() are faster
  • Because NEWSEQUENTIALID() values are sequential, they help fill the data pages completely before new pages are allocated

In other words,

NEWID() generates more fragmentation when compared to NEWSEQUENTIALID()

NEWID() generates more fragments

To test this, I ran the following test:

  1. Create two tables – one with NEWID() as the default value for the key and one with NEWSEQUENTIALID()
  2. Ensure that MAXDOP is set to 0 so that I can insert data into the tables in parallel (to simulate inputs into a table from multiple threads)
  3. Repeat this process multiple times
  4. Once the insert is complete, check the average fragmentation on the tables

Allow me to run through the test step-by-step.

The script below creates two tables:

USE tempdb;
GO
SET NOCOUNT ON;
GO
--Safety Check
IF OBJECT_ID('dbo.SequentialIdCheck','U') IS NOT NULL
BEGIN
   DROP TABLE dbo.SequentialIdCheck;
END
GO

IF OBJECT_ID('dbo.NonSequentialIdCheck','U') IS NOT NULL
BEGIN
   DROP TABLE dbo.NonSequentialIdCheck;
END
GO

--Create the tables (SequentialId Check)
CREATE TABLE dbo.SequentialIdCheck (RowId UNIQUEIDENTIFIER NOT NULL 
                                          CONSTRAINT df_SequentialRowId 
                                          DEFAULT NEWSEQUENTIALID(),
                                    ObjectId INT NOT NULL,
                                    RowValue NVARCHAR(200) NULL,
                                    CONSTRAINT pk_SequentialIdCheck
                                    PRIMARY KEY CLUSTERED (RowId)
                                   );
GO

--Create the tables (Non SequentialId Check)
CREATE TABLE dbo.NonSequentialIdCheck (RowId UNIQUEIDENTIFIER NOT NULL
                                             CONSTRAINT df_NonSequentialRowId 
                                             DEFAULT NEWID(),
                                       ObjectId INT NOT NULL,
                                       RowValue NVARCHAR(200) NULL,
                                       CONSTRAINT pk_NonSequentialIdCheck
                                       PRIMARY KEY CLUSTERED (RowId)
                                      );
GO

Now, I will ensure that the max degree of parallelism is set to 0, i.e. parallelism is not restricted.

sp_configure 'show advanced options',1
RECONFIGURE
GO
sp_configure 'max degree of parallelism',0
RECONFIGURE
GO

Next, I will insert some test data. I will repeat the insert multiple times to simulate an extremely large data-set over multiple inserts.

USE tempdb;
GO
--Insert some test data
--Run the insert 5 times
INSERT INTO dbo.SequentialIdCheck (ObjectId, RowValue)
SELECT sao1.[object_id] AS [ObjectId],
       sao1.[name] AS [RowValue]
FROM sys.all_objects AS sao1
CROSS JOIN sys.all_objects AS sao2;

INSERT INTO dbo.NonSequentialIdCheck (ObjectId, RowValue)
SELECT sao1.[object_id] AS [ObjectId],
       sao1.[name] AS [RowValue]
FROM sys.all_objects AS sao1
CROSS JOIN sys.all_objects AS sao2;
GO 5
Beginning execution loop
Batch execution completed 5 times.

Finally, I check the average fragmentation on the tables by checking the fragmentation of the clustered index.

USE tempdb;
GO
--Check the fragmentation
SELECT OBJECT_NAME(ps.[object_id]) AS ObjectName,
       ps.[index_type_desc],
       ps.[avg_fragmentation_in_percent],
       ps.[fragment_count],
       ps.[page_count],
       ps.[database_id],
       ps.[object_id]
FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'),
                                    OBJECT_ID('dbo.SequentialIdCheck'),
                                    DEFAULT,
                                    DEFAULT,
                                    DEFAULT
                                   ) AS ps;
GO

SELECT OBJECT_NAME(ps.[object_id]) AS ObjectName,
       ps.[index_type_desc],
       ps.[avg_fragmentation_in_percent],
       ps.[fragment_count],
       ps.[page_count],
       ps.[database_id],
       ps.[object_id]
FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'),
                                    OBJECT_ID('dbo.NonSequentialIdCheck'),
                                    DEFAULT,
                                    DEFAULT,
                                    DEFAULT
                                   ) AS ps;
GO
Comparison between the index physical stats when the table uses NEWSEQUENTIALID() v/s NEWID(). We can see that NEWID() results in higher fragmentation and consumes higher number of pages on disk when compared to NEWSEQUENTIALID().

NEWID() results in higher fragmentation and higher pages consumed on disk.

As can be seen from the screenshot above, the table with NEWID() as the default key value has a much higher fragmentation value than the table with NEWSEQUENTIALID().

We also see that the table with NEWID() as the default key value has consumed many more pages – resulting in more space being occupied on disk.

The underlying reason is that NEWSEQUENTIALID() generates GUIDs in an increasing sequence, whereas the GUIDs generated by NEWID() are random. When used as a clustered key, sequential values mean new rows are appended at the end of the index, which fills pages completely and reduces fragmentation.

Conclusion

In most cases, an INTEGER based key on a table is sufficient. However, when a GUID is required by design, it is important to keep in mind that using NEWID() causes more fragmentation in the underlying data, resulting in poorer system performance. Because non-sequential GUIDs cause fragmentation, they are (generally) not a good choice for a clustered index key unless the business/application design requires it.

If NEWSEQUENTIALID() is to be used, please keep in mind that the keys have to be generated by the database engine (NEWSEQUENTIALID() can only be used in a DEFAULT constraint), which can make things tricky with Entity Framework, where the code may need the key value in order to instantiate an entity.
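
If the application does need the generated value back, one option (a minimal sketch against the demo table above) is to capture it with the OUTPUT clause at insert time:

--Sketch: retrieve the engine-generated sequential GUID at insert time
DECLARE @generatedKeys TABLE (RowId UNIQUEIDENTIFIER NOT NULL);

INSERT INTO dbo.SequentialIdCheck (ObjectId, RowValue)
OUTPUT inserted.RowId INTO @generatedKeys (RowId)
VALUES (1, N'Sample row');

SELECT RowId FROM @generatedKeys;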

Further Reading

Until we meet next time,
Be courteous. Drive responsibly.


#0380 – SQL Server – Basics – Specify Scale and Precision when defining Decimal and Numeric datatypes


I had an interesting conversation during a code review that I was asked to conduct on a simple query a team had written to support their project monitoring. The team specialized in quality assurance and had minimal development experience. They had used variables of the decimal data type in their script, but these were declared without any precision or scale. When I gave a review comment on the variable declarations, I was asked the question:

Does it make a difference if we do not specify scale and precision when defining variables of decimal or numeric datatypes?

I often look forward to such encounters for two reasons:

  1. When I answer their questions, the process reinforces my concepts and learnings
  2. It helps me contribute to the overall community by writing a blog about my experience

When the question was asked, I honestly admitted that I did not have a specific answer other than that it was a best practice from a long-term maintainability standpoint. After lunch, I put together a small test which I showed the team and which I present below.

The Problem

In the script below, I take a decimal variable (declared without a fixed scale or precision) with value (20.16) and multiply it by a constant number (100) and then by another constant decimal (100.0). If one uses a calculator, the expected result is:

  • 20.16 * 100 = 2016
  • 20.16 * 100.0 = 2016

When we multiply a decimal (20.16) by another number (100) using a calculator, the result is as expected (2016).

However, when we perform the same test via SQL Server, we are in for a surprise:

DECLARE @dVal1 DECIMAL = 20.16;

SELECT (@dVal1 * 100)   AS DecimalMultipliedByAnInteger, 
       (@dVal1 * 100.0) AS DecimalMultipliedByADecimal;
GO

As can be seen from the results below, we do not get the expected results; instead, we find that the decimal value was rounded off before the multiplication took place.

Although the test input value is declared as a decimal, the result appears to be based only on the integer part of the input; the fractional part is lost.

Root Cause

The reason behind this behaviour is hidden in the following lines of the SQL Server online documentation on MSDN (formerly known as “Books Online”) for the decimal and numeric data types, available here: https://msdn.microsoft.com/en-us/library/ms187746.aspx.

…s (scale)
The number of decimal digits that will be stored to the right of the decimal point….Scale can be specified only if precision is specified. The default scale is 0…

The real reason, however, is a few lines below – rounding.

Converting decimal and numeric Data

…By default, SQL Server uses rounding when converting a number to a decimal or numeric value with a lower precision and scale….

What SQL Server is doing here is that when a variable of the DECIMAL datatype is declared without precision and scale, the defaults are used – a precision of 18 and a scale of zero (0). Hence, the test value of 20.16 is rounded to the nearest integer, 20.
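
We can confirm the default precision and scale of such a variable (a quick check, not part of the original review discussion) using SQL_VARIANT_PROPERTY:

--Inspect the metadata of a DECIMAL variable declared without precision and scale
DECLARE @dVal1 DECIMAL = 20.16;

SELECT SQL_VARIANT_PROPERTY(@dVal1, 'BaseType')  AS BaseType,    --decimal
       SQL_VARIANT_PROPERTY(@dVal1, 'Precision') AS [Precision], --18
       SQL_VARIANT_PROPERTY(@dVal1, 'Scale')     AS [Scale];     --0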

To confirm that rounding is indeed taking place, I swapped the digits in the input value from 20.16 to 20.61 and re-ran the same test.

DECLARE @dVal1 DECIMAL = 20.61;

SELECT (@dVal1 * 100)   AS DecimalMultipliedByAnInteger, 
       (@dVal1 * 100.0) AS DecimalMultipliedByADecimal;
GO

Now, the result was 2100 instead of 2000 (which truncation would have produced) because the input test value of 20.61 was rounded up to 21 before the multiplication took place.


Because the test input value was declared as a decimal without precision and scale, rounding took place, resulting in a different result.

By this time, my audience was awestruck as they realized the impact this behaviour would have had on their project monitoring numbers.

The Summary – A Best Practice

We can summarize the learning into a single sentence:

To ensure data quality, it is a best practice to always specify a precision and scale when working with variables of the numeric or decimal data types.

To confirm, here’s a version of the same test as we saw earlier. The only difference is that this time, we have explicitly specified the precision and scale on our input values.

DECLARE @dVal1 DECIMAL(19,4) = 20.16;

SELECT (@dVal1 * 100)   AS DecimalMultipliedByAnInteger, 
       (@dVal1 * 100.0) AS DecimalMultipliedByADecimal;
GO

When we look at the results, we see that the output is exactly what we wanted to see, i.e. 2016.

Because the test input value was declared as a decimal with precision and scale, no rounding took place and we got the expected result, i.e. 2016.

Because the test input value was declared as a decimal with precision and scale, no rounding took place and we got the expected result.

Further Reading

Until we meet next time,
Be courteous. Drive responsibly.