Tag Archives: Data Quality

#0406 – SQL Server – Remember that spaces and blank strings are the same


It was recently brought to my attention that a particular script was passing spaces when it should not. Here’s an example:

DECLARE @spaceCharacter NVARCHAR(1) = N' ';
DECLARE @blankCharacter NVARCHAR(1) = N'';

--Confirm that we are looking at different values
--The ASCII codes are different!
SELECT ASCII(@spaceCharacter) AS ASCIICodeForSpace,
       ASCII(@blankCharacter) AS ASCIICodeForBlankString;

--Compare a blank string with spaces
IF (@spaceCharacter = @blankCharacter)
    SELECT 'Yes' AS IsSpaceSameAsBlankString;
ELSE 
    SELECT 'No' AS IsSpaceSameAsBlankString;
GO

/* RESULTS
ASCIICodeForSpace ASCIICodeForBlankString
----------------- -----------------------
32                NULL

IsSpaceSameAsBlankString
------------------------
Yes
*/

01_Symptom

We then checked the LENGTH and DATALENGTH of both strings and noticed something interesting – the check on the LENGTH was trimming out trailing spaces whereas the check on the DATALENGTH was not.

DECLARE @spaceCharacter NVARCHAR(1) = N' ';
DECLARE @blankCharacter NVARCHAR(1) = N'';

--Check the length
SELECT LEN(@spaceCharacter) AS LengthOfSpace, 
       LEN(@blankCharacter) AS LengthOfBlankCharacter,
       DATALENGTH(@spaceCharacter) AS DataLengthOfSpace, 
       DATALENGTH(@blankCharacter) AS DataLengthOfBlankCharacter;
GO

/* RESULTS
LengthOfSpace LengthOfBlankCharacter DataLengthOfSpace DataLengthOfBlankCharacter
------------- ---------------------- ----------------- --------------------------
0             0                      2                 0
*/

02_LengthAndDataLength

Often, we loose sight of the most basic concepts – they hide in our subconscious. This behaviour of SQL Server is enforced by the SQL Standard (specifically SQL ’92) based on which most RDBMS systems are made of.

The ideal solution for an accurate string comparison was therefore to also compare the data length in addition to a normal string comparison.

DECLARE @spaceCharacter NVARCHAR(1) = N' ';
DECLARE @blankCharacter NVARCHAR(1) = N'';

--The Solution
IF (@spaceCharacter = @blankCharacter) 
   AND (DATALENGTH(@spaceCharacter) = DATALENGTH(@blankCharacter))
    SELECT 'Yes' AS IsSpaceSameAsBlankString;
ELSE 
    SELECT 'No' AS IsSpaceSameAsBlankString;
GO

/* RESULTS
IsSpaceSameAsBlankString
------------------------
No
*/

03_Solution

Further Reading

  • How SQL Server Compares Strings with Trailing Spaces [KB316626]

Until we meet next time,

Be courteous. Drive responsibly.

Advertisements
Checking the "Keep Identity" checkbox

#0390 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Keep Identity option


Recently, I started writing about the nuances of SSIS which most accidental SSIS developers may frequently get stumped by due to the differences in behaviour over conventional T-SQL.

The OLE DB Destination

In my previous post, I started to explore the OLE DB destination. In order to load data as quickly into the destination as possible, the  OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

  1. Keep Identity
  2. Keep NULLs
  3. Table Lock
  4. Check Constraints
  5. Rows per Batch
  6. Maximum Insert Commit Size

I looked at the “Keep NULLs” option earlier, and today I will go over the “Keep Identity” option.

The “Keep Identity” option, as the name suggests controls the ability of the OLE DB destination task to insert data into an IDENTITY column. IDENTITY columns are quite a commonly used functionality to provide an auto increment value in a table. When migrating data from one system to another, it may be required to preserve the identity values during the migration.

Allow me to demonstrate the switch in action. In order to setup the demo, I created a table in the [tempdb] database on my SQL Server instance with an IDENTITY column, and inserted some test data into it.

USE tempdb;
GO
--Safety Check
IF OBJECT_ID('dbo.KeepIdentityInOLEDB','U') IS NOT NULL
BEGIN
    DROP TABLE dbo.KeepIdentityInOLEDB;
END
GO

--Create table
CREATE TABLE dbo.KeepIdentityInOLEDB 
    ([IdentityId]       INT          NOT NULL IDENTITY(1,1),
     [ProductName]      VARCHAR(255)     NULL,
     [ManufacturerName] VARCHAR(255)     NULL
    );
GO

--Initial sample test Data
INSERT INTO dbo.[KeepIdentityInOLEDB] ([ProductName], [ManufacturerName])
SELECT [ProductList].[ProductName],
       [ProductList].[ManufacturerName] 
FROM (VALUES ('Windows'     , 'Microsoft'),
             ('SQL Server'  , NULL       ),
             ('VisualStudio','Microsoft'),
             ('MySQL'       , 'Oracle'   ),
             ('PeopleSoft'  , 'Oracle'   )
     ) AS [ProductList] ([ProductName], [ManufacturerName]);
GO

Default behaviour (Keep Identity unchecked)

When an OLE DB destination is configured to load a table with identity columns, the OLE DB destination fails validation if the identity column is mapped to the input.

Mapping identity column in the target table with a source value

Mapping identity column in the target table with a source value

Validation errors if identity columns are mapped and “Keep Identity” is unchecked (default)

Even if I forcibly try to run the package, the package fails.

0390-execution-failure-with-keep-identity-unchecked

Execution Failure when trying to execute the package with “Keep Identity” unchecked

Inserting data into Identity tables (Keep Identity checked)

If I check the “Keep Identity” checkbox, the validation error disappears and the package runs successfully.

Checking the “Keep Identity” checkbox

Successful package execution with “Keep Identity” checked

When moving data from one system into another where the identity values can uniquely refer to the records being inserted, this option works perfectly.

However, keep in mind a side-effect of this checkbox. Let me take a look at the data that was inserted into the test table.

Inserting data with “Keep Identity” checked may cause duplicates in the identity column

The “Keep Identity” checkbox checked is like setting the IDENTITY_INSERT option to ON when inserting data into a table using T-SQL. Only difference is that the option is set to OFF once the package execution completes, and hence no explicit handling is required.

Further Reading

I have written a series of posts on T-SQL behaviour around identity values, and I trust you would find them interesting.

  • An introduction to IDENTITY columns [Link]
  • What are @@IDENTITY, SCOPE_IDENTITY, IDENT_CURRENT and $IDENTITY? [Link]
  • Myths – IDENTITY columns cannot have holes or “gaps” [Link]
  • Myths – Values cannot explicitly inserted into IDENTITY columns [Link]
  • Myths – Duplicate Values cannot exist IDENTITY columns [Link]
  • Myths – The value for SEED and INCREMENT must be 1 [Link]
  • Myths – IDENTITY values (including SEED & INCREMENT) cannot be negative [Link]
  • Myths – IDENTITY columns cannot be added to existing tables – Part 01 [Link]
  • Myths – IDENTITY columns cannot be added to existing tables – Part 02 [Link]
  • Myths – IDENTITY columns cannot have constraints defined on them [Link]
  • Myths – IDENTITY columns do not propagate via SELECT…INTO statements [Link]
  • Use IDENTITY() Function to change the Identity specification in a SELECT…INTO statement [Link]

Until we meet next time,

Be courteous. Drive responsibly.

OLE DB Destination Configuration. Notice the "Keep nulls" switch is unchecked.

#0389 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Keep NULLs option


SQL Server Integration Services (SSIS) are typically called upon when integrating systems exchange data from one source to a given destination. The reason I use the term “source” and “destination” instead of a “database” because either of the two can be something other than a database (a flat file, some web-service, a script task, etc). This is possible because SSIS is more like any other .net framework based programming language (C# or VB.net).

OLE DB destination

Because one would commonly have either a Microsoft Access or a Microsoft SQL Server on at least one side of the integration, the most common source & destinations used in a SSIS-based data solution are the OLE DB Source and the OLE DB Destination. The OLE DB destination allows you to load data to a table, a view or even a SQL command (e.g. the results of a statement execution).

In order to load data as quickly into the destination as possible, the  OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

  1. Keep Identity
  2. Keep NULLs
  3. Table Lock
  4. Check Constraints
  5. Rows per Batch
  6. Maximum Insert Commit Size

We will look at each option in detail over the next couple of weeks.

Keep NULLs option

The Keep NULLs option is normally something that most accidental SSIS developers do not pay much attention to. It comes unchecked by default and it left unchecked. However, the state of this checkbox can have a significant impact on the completeness and quality of data being inserted into the destination database.

To clarify, allow me to explain the functionality of this checkbox:

  1. Checked – If a column in the source data has NULL values, keep them as-is
  2. Unchecked – If a column in the source data has NULL values, try to replace them with the default values as defined by the destination DB

The state of this checkbox typically does not make much of a difference because in most cases, the domain and business rules in both the systems involved would be similar. Thus, the if a column in one system allows a NULL value, other systems in the same domain would also allow a NULL (e.g. in most enrollment forms, the last name would generally be mandatory but the first name is not). However, legacy systems (which have been around since decades) would have accumulated a lot data that does not conform to newer domain practices, causing issues during migration. This is when the “Keep Nulls” checkbox comes into action.

In the case I am going to present today, I have a set of Product Names and their corresponding Manufacturers. In a few of these cases, I don’t know the manufacturer and have therefore kept it blank.

USE tempdb;
GO
--Test Data
SELECT [ProductList].[ProductName],
       [ProductList].[ManufacturerName] 
FROM (VALUES ('Windows'     , 'Microsoft'),
             ('SQL Server'  , NULL       ),
             ('VisualStudio','Microsoft'),
             ('MySQL'       , 'Oracle'   ),
             ('PeopleSoft'  , 'Oracle'   )
     ) AS [ProductList] ([ProductName], [ManufacturerName]);
GO
Sample data with some NULL values

Sample data with some NULL values

For the sake of this demo, I have used this query as my source in the test SSIS package. Below is a screenshot of my data flow task.

Using a test data query in the OLE DB source command

Using a test data query in the OLE DB source command

I directly take this dataset as input to the OLE DB destination. The OLE DB destination is configured to a test table ([dbo].[KeepNullsInOLEDB]) with the following table definition.

USE [tempdb];
GO
--Safety Check
IF OBJECT_ID('dbo.KeepNullsInOLEDB','U') IS NOT NULL
BEGIN
    DROP TABLE dbo.KeepNullsInOLEDB;
END
GO

--Create table
CREATE TABLE dbo.KeepNullsInOLEDB 
        ([ProductName]      VARCHAR(255) NULL,
         [ManufacturerName] VARCHAR(255) NULL     
                            CONSTRAINT df_KeepNullsInOLEDB_ManufacturerName 
                            DEFAULT ('Microsoft')
        );
GO

OLE DB Destination Configuration. Notice the “Keep nulls” switch is unchecked.

After executing the package, I query the [dbo].[KeepNullsInOLEDB] table in the destination database, and compare with the source data.

Values inserted into the destination table. Notice the default value from table definition is used.

Values inserted into the destination table. Notice the default value from table definition is used.

As can be seen from the screenshot, the [ManufacturerName] for “SQL Server” is not NULL. It is instead set to “Microsoft” which is the default value as set in the default constraint on the destination table.

The data inserted in the destination table changes if the switch is kept checked in the OLE DB destination.

Notice how the value from the default constraint is not used when “Keep Nulls” is checked.

If the “Keep nulls” checkbox is checked, the default constraint on the target table is not used – thereby maintaining the same data as the source.

Summary

Depending upon the business requirements, it may be critical to migrate data from a source to a destination “as-is”, without the application of default constraints. In such situations, the “Keep nulls” switch on the OLE DB destination (“Fast Load” mode) needs to be checked.

If the “Keep nulls” switch is unchecked, the default constraints from the target table definition come into effect.

In my future posts, I will take a look at the other switches on the OLE DB Fast Load mode.

Until we meet next time,

Be courteous. Drive responsibly.

Data Flow Task used to demonstrate case-sensitivity of Lookup transformation

#0388 – SQL Server – SSIS – Lookup transformations are case-sensitive


I have been working with SQL Server Integration Services (SSIS) recently. In many ways, SSIS is counter-intuitive if you have been working with the database engine for long (more than a decade in my case). Working with SSIS is more like working with any other .net framework based programming language (C# or VB.net). Over the next couple of days, I will be writing about some of the salient aspects of SSIS which should be kept in mind in case you are working on multiple SQL Server technologies.

Lookup Transformations – A key to successful system integrations

Cross-referencing of Enumerations

One of the key challenges for any system integration is to ensure that the enumerations  and “default” values used in the source system (e.g. sales statuses, product categories, etc) align between the “source” & “target” systems.

Once the values aligned during business, high-level and low-level designs, implementation of this cross-referencing in SQL Server Integration Services (SSIS) is done by a data flow component called the “Lookup Transformation“. The Lookup transformation effectively performs a join between the input data with a reference data set. If values match, they are available in what is called the “Match Output” whereas values that do not match can be made available as a “No Match Output”. However, this comes with a tiny counter-intuitive behaviour that I learnt about the hard way.

The lookup performed by the lookup transformation is case-sensitive.

Demo

In order to demonstrate the case-sensitivity of lookup transformations, I have developed a SSIS package that does the following in a simple data-flow task:

  1. Get some static data from an OLEDB data source, basically some rows with text representation of numbers (One, Two, Three, and so on)
  2. The Lookup transform has a static mapping between the numeric and text values of various numbers – 1 through 10
  3. As the input data passes through the lookup transformation, we try to map the text values in the source data with the values available in the lookup transformation so that we can get the appropriate numeric representation
    • In my demo, records that find a valid lookup are written to a recordset destination (it could be any valid destination), whereas records that do not match are written to another destination
    • I have placed data viewers on the output pipelines to visually see the data being moved, which is what I will show below

The query used to generate the static data in the OLE DB source is provided below.

SELECT srcValues.RowName,
       srcValues.RowValue
FROM (VALUES ('Row1','One'),
             ('Row2','Two'),
             ('Row3','three'),
             ('Row4','Four'),
             ('Row5','Five'),
             ('Row6','Six'),
             ('Row7','seven'),
             ('Row8','eight'),
             ('Row9','Nine'),
             ('Row10','Ten')
     ) AS srcValues (RowName, RowValue);

The query used to generate the lookup values for the lookup transform is provided below:

SELECT lookUpValues.Id,
       lookUpValues.RowValue
FROM (VALUES (1, 'One'),
             (2, 'Two'),
             (3, 'Three'),
             (4, 'Four'),
             (5, 'Five'),
             (6, 'Six'),
             (7, 'Seven'),
             (8, 'Eight'),
             (9, 'Nine'),
             (10, 'Ten')
     ) AS lookUpValues (Id, RowValue);

Observe that in the static source data, not all values have a consistent case – some are in sentence case, whereas some are in small case.

The screenshots below show the overall setup of the SSIS package.

Data Flow Task used to demonstrate case-sensitivity of Lookup transformation

Data Flow Task used to demonstrate case-sensitivity of Lookup transformation

LookUp Transformation - General configuration (Notice redirection to no match output)

LookUp Transformation – General configuration (Notice redirection to no match output)

Lookup Transformation - Connection tab showing reference values

Lookup Transformation – Connection tab showing reference values

Lookup Transformation - Columns configuration showing "RowValue" used for matching whereas the reference Id is fetched to include in output

Lookup Transformation – Columns configuration

Notice here that we have used the text value from the source data (“RowValue” column) for matching/lookup to the reference data set. The reference Id is fetched to include in output.

If a match is found the “Match Output” should contain the matching row from the source combined with the Id from the reference/lookup data. This is seen in the data viewer output below.

Lookup Transformation - Match Output (source rows with the Id from the reference data)

Lookup Transformation – Match Output (source rows with the Id from the reference data)

If a match is not found (which would be the case for the values with lower case in the source data), the “No Match Output” will contain the row from the source data that failed the lookup (since failures were redirected to the “No Match” output in the general configuration). Notice we do not get the Id from the reference because no match to the reference was found.

Lookup Transformation - No Match Output

Lookup Transformation – No Match Output

Summary

When working with a case insensitive database, we often tend to take data quality with respect to case of the data for granted. Having data with inconsistent case has multiple repercussions (especially with data grouping in front end applications), but the biggest negative impact due to inconsistent case of text data is the inaccurate cross-referencing during a master data cleanup, system integration or data migration exercise.

Call to action

Do take a few cycles in your development to take a look at your data quality, and if necessary, implement required data cleansing to ensure that your lookup data, enumerations and master data are using a case that is correct and consistent with the domain and business requirements.

Until we meet next time,

Be courteous. Drive responsibly.

#0385 – SQL Server – Query Development and NULL values


Software Design, Assumptions and their side-effects

A few days ago, I tweeted a news piece that came to my attention on the name “NULL” causing computing systems to malfunction. When I shared it with my colleagues, a healthy discussion on how should software architectures handle NULL took place.

The key focus of discussion was of course that we need to be more globally aware and avoid generalization when defining business requirements (e.g. if one country does not commonly have people with surnames greater than 10 characters in length, doesn’t mean there aren’t people with longer surnames as highlighted here).

We also talked about detection of NULL values and manipulating them. In the world of databases, joins between tables are a common scenario and it is interesting to study the impact that NULL has on the JOIN operator in a T-SQL query.

NULL values and JOINs

I had encountered a scenario involving joining on NULL-able columns which I had encountered during a unit test I was executing. What happened was that rows were disappearing and initially I was unable to figure out why. Allow me to explain with a demo.

I have a simple table with software metrics like lines of code (LoC), story points, etc. for various modules. I also have a look-up table that gives me details of the measurement units, including provision to handle unknown sizing units (which was possible in our case for modules developed in fairly uncommon technologies).

When I joined between the metrics and the sizing units table, what happened was that metrics for the project in uncommon technologies did not appear!

DECLARE @someMetrics TABLE
         (ModuleCode INT         NOT NULL,
          SizeValue  INT             NULL,
          SizeUnit   VARCHAR(10)     NULL
         );

DECLARE @unitMaster TABLE
        (SizeUnit        VARCHAR(10) NULL,
         UnitDescription VARCHAR(50) NULL
        );

INSERT INTO @unitMaster (SizeUnit,
                         UnitDescription
                        )
VALUES (NULL , 'Unit Unknown'),
       ('LOC', 'Lines of Code'),
       ( 'SP', 'Story Points');

INSERT INTO @someMetrics (ModuleCode,
                          SizeValue,
                          SizeUnit
                         )
VALUES (1, 1200, 'LOC'),
       (3, 3, 'SP' ),
       (6, 32, NULL ),
       (7, 2500, 'LOC');

--Problem
SELECT sm.ModuleCode,
       sm.SizeValue,
       um.UnitDescription
FROM @someMetrics AS sm
INNER JOIN @unitMaster AS um ON sm.SizeUnit = um.SizeUnit;
The join operator will eliminate rows with null values in the join predicates

Metrics for NULL sizing units were filtered in the join process

The Root Cause

What is happening is that by default NULL compared with any value (NULL or otherwise) yields NULL. This is expected because when one of two values in a comparison is an unknown, there is no way to ascertain the result of the comparison and hence it has to be unknown, i.e. NULL.

In case of a join, the way SQL Server behaves is that all rows that match a given condition are fetched. Rows with conditions evaluating to “anything other than true” are filtered out. This is exactly what was happening in my case – since the columns used in the joined predicate had NULL values, the comparison yielded NULL (a non-true result) and hence these rows were filtered out.

The Fix

Option 01: The ideal fix is to ensure that the data never contains NULLs. Instead of NULL, a default value (anything suitable to the domain) can be used to indicate that the source system did not supply any value.

Option 02: In case changing the source data is not feasible, the query needs to be updated to substitute the NULL values for a chosen default as shown below.

DECLARE @someMetrics TABLE
         (ModuleCode INT         NOT NULL,
          SizeValue  INT             NULL,
          SizeUnit   VARCHAR(10)     NULL
         );

DECLARE @unitMaster TABLE
        (SizeUnit        VARCHAR(10) NULL,
         UnitDescription VARCHAR(50) NULL
        );

INSERT INTO @unitMaster (SizeUnit,
                         UnitDescription
                        )
VALUES (NULL , 'Unit Unknown'),
       ('LOC', 'Lines of Code'),
       ( 'SP', 'Story Points');

INSERT INTO @someMetrics (ModuleCode,
                          SizeValue,
                          SizeUnit
                         )
VALUES (1, 1200, 'LOC'),
       (3, 3, 'SP' ),
       (6, 32, NULL ),
       (7, 2500, 'LOC');

--One way to resolve the problem
SELECT sm.ModuleCode,
       sm.SizeValue,
       um.UnitDescription
FROM @someMetrics AS sm
INNER JOIN @unitMaster AS um
      ON ISNULL(sm.SizeUnit,'NULL') = ISNULL(um.SizeUnit,'NULL');
Using a chosen default instead of NULL in the join predicate

Using a chosen default instead of NULL in the join predicate

Conclusion

Irrespective of the business design, it is extremely important to have a strategy on handling of NULL values during system design. Most systems lack consistency in the way they allow for and handle NULL values.

Bringing in a set of standardized design practices on handling of NULL values makes it easier to predict system behaviour and also helps in re-engineering efforts whenever required to do so in the future.

Further Reading

  • Why is it not a good idea to implement NOT NULL check as a CHECK constraint? [Blog Link here]
  • SQL Server – CONCAT_NULL_YIELDS_NULL property [Blog Link here]
  • Fun with temporary tables – Impact of ANSI_NULL_DFLT_ON [Blog Link here]

Until we meet next time,

Be courteous. Drive responsibly.