Tag Archives: Data Quality

#0406 – SQL Server – Remember that spaces and blank strings are the same

#0390 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Keep Identity option

Recently, I started writing about the nuances of SSIS which most accidental SSIS developers may frequently get stumped by due to the differences in behaviour over conventional T-SQL.

The OLE DB Destination

In my previous post, I started to explore the OLE DB destination. In order to load data as quickly into the destination as possible, the OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

Keep Identity
Keep NULLs
Table Lock
Check Constraints
Rows per Batch
Maximum Insert Commit Size

I looked at the “Keep NULLs” option earlier, and today I will go over the “Keep Identity” option.

The “Keep Identity” option, as the name suggests controls the ability of the OLE DB destination task to insert data into an IDENTITY column. IDENTITY columns are quite a commonly used functionality to provide an auto increment value in a table. When migrating data from one system to another, it may be required to preserve the identity values during the migration.

Allow me to demonstrate the switch in action. In order to setup the demo, I created a table in the [tempdb] database on my SQL Server instance with an IDENTITY column, and inserted some test data into it.

USE tempdb;
GO
--Safety Check
IF OBJECT_ID('dbo.KeepIdentityInOLEDB','U') IS NOT NULL
BEGIN
    DROP TABLE dbo.KeepIdentityInOLEDB;
END
GO

--Create table
CREATE TABLE dbo.KeepIdentityInOLEDB 
    ([IdentityId]       INT          NOT NULL IDENTITY(1,1),
     [ProductName]      VARCHAR(255)     NULL,
     [ManufacturerName] VARCHAR(255)     NULL
    );
GO

--Initial sample test Data
INSERT INTO dbo.[KeepIdentityInOLEDB] ([ProductName], [ManufacturerName])
SELECT [ProductList].[ProductName],
       [ProductList].[ManufacturerName] 
FROM (VALUES ('Windows'     , 'Microsoft'),
             ('SQL Server'  , NULL       ),
             ('VisualStudio','Microsoft'),
             ('MySQL'       , 'Oracle'   ),
             ('PeopleSoft'  , 'Oracle'   )
     ) AS [ProductList] ([ProductName], [ManufacturerName]);
GO

Default behaviour (Keep Identity unchecked)

When an OLE DB destination is configured to load a table with identity columns, the OLE DB destination fails validation if the identity column is mapped to the input.

Mapping identity column in the target table with a source value

Validation errors if identity columns are mapped and “Keep Identity” is unchecked (default)

Even if I forcibly try to run the package, the package fails.

0390-execution-failure-with-keep-identity-unchecked

Execution Failure when trying to execute the package with “Keep Identity” unchecked

Inserting data into Identity tables (Keep Identity checked)

If I check the “Keep Identity” checkbox, the validation error disappears and the package runs successfully.

Checking the “Keep Identity” checkbox

Successful package execution with “Keep Identity” checked

When moving data from one system into another where the identity values can uniquely refer to the records being inserted, this option works perfectly.

However, keep in mind a side-effect of this checkbox. Let me take a look at the data that was inserted into the test table.

Inserting data with “Keep Identity” checked may cause duplicates in the identity column

The “Keep Identity” checkbox checked is like setting the IDENTITY_INSERT option to ON when inserting data into a table using T-SQL. Only difference is that the option is set to OFF once the package execution completes, and hence no explicit handling is required.

#0389 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Keep NULLs option

6 Replies

SQL Server Integration Services (SSIS) are typically called upon when integrating systems exchange data from one source to a given destination. The reason I use the term “source” and “destination” instead of a “database” because either of the two can be something other than a database (a flat file, some web-service, a script task, etc). This is possible because SSIS is more like any other .net framework based programming language (C# or VB.net).

OLE DB destination

Because one would commonly have either a Microsoft Access or a Microsoft SQL Server on at least one side of the integration, the most common source & destinations used in a SSIS-based data solution are the OLE DB Source and the OLE DB Destination. The OLE DB destination allows you to load data to a table, a view or even a SQL command (e.g. the results of a statement execution).

In order to load data as quickly into the destination as possible, the OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

Keep Identity
Keep NULLs
Table Lock
Check Constraints
Rows per Batch
Maximum Insert Commit Size

We will look at each option in detail over the next couple of weeks.

Keep NULLs option

The Keep NULLs option is normally something that most accidental SSIS developers do not pay much attention to. It comes unchecked by default and it left unchecked. However, the state of this checkbox can have a significant impact on the completeness and quality of data being inserted into the destination database.

To clarify, allow me to explain the functionality of this checkbox:

Checked – If a column in the source data has NULL values, keep them as-is
Unchecked – If a column in the source data has NULL values, try to replace them with the default values as defined by the destination DB

The state of this checkbox typically does not make much of a difference because in most cases, the domain and business rules in both the systems involved would be similar. Thus, the if a column in one system allows a NULL value, other systems in the same domain would also allow a NULL (e.g. in most enrollment forms, the last name would generally be mandatory but the first name is not). However, legacy systems (which have been around since decades) would have accumulated a lot data that does not conform to newer domain practices, causing issues during migration. This is when the “Keep Nulls” checkbox comes into action.

In the case I am going to present today, I have a set of Product Names and their corresponding Manufacturers. In a few of these cases, I don’t know the manufacturer and have therefore kept it blank.

USE tempdb;
GO
--Test Data
SELECT [ProductList].[ProductName],
       [ProductList].[ManufacturerName] 
FROM (VALUES ('Windows'     , 'Microsoft'),
             ('SQL Server'  , NULL       ),
             ('VisualStudio','Microsoft'),
             ('MySQL'       , 'Oracle'   ),
             ('PeopleSoft'  , 'Oracle'   )
     ) AS [ProductList] ([ProductName], [ManufacturerName]);
GO

Sample data with some NULL values

For the sake of this demo, I have used this query as my source in the test SSIS package. Below is a screenshot of my data flow task.

Using a test data query in the OLE DB source command

I directly take this dataset as input to the OLE DB destination. The OLE DB destination is configured to a test table ([dbo].[KeepNullsInOLEDB]) with the following table definition.

USE [tempdb];
GO
--Safety Check
IF OBJECT_ID('dbo.KeepNullsInOLEDB','U') IS NOT NULL
BEGIN
    DROP TABLE dbo.KeepNullsInOLEDB;
END
GO

--Create table
CREATE TABLE dbo.KeepNullsInOLEDB 
        ([ProductName]      VARCHAR(255) NULL,
         [ManufacturerName] VARCHAR(255) NULL     
                            CONSTRAINT df_KeepNullsInOLEDB_ManufacturerName 
                            DEFAULT ('Microsoft')
        );
GO

OLE DB Destination Configuration. Notice the “Keep nulls” switch is unchecked.

After executing the package, I query the [dbo].[KeepNullsInOLEDB] table in the destination database, and compare with the source data.

Values inserted into the destination table. Notice the default value from table definition is used.

As can be seen from the screenshot, the [ManufacturerName] for “SQL Server” is not NULL. It is instead set to “Microsoft” which is the default value as set in the default constraint on the destination table.

The data inserted in the destination table changes if the switch is kept checked in the OLE DB destination.

Notice how the value from the default constraint is not used when “Keep Nulls” is checked.

If the “Keep nulls” checkbox is checked, the default constraint on the target table is not used – thereby maintaining the same data as the source.

Summary

Depending upon the business requirements, it may be critical to migrate data from a source to a destination “as-is”, without the application of default constraints. In such situations, the “Keep nulls” switch on the OLE DB destination (“Fast Load” mode) needs to be checked.

If the “Keep nulls” switch is unchecked, the default constraints from the target table definition come into effect.

In my future posts, I will take a look at the other switches on the OLE DB Fast Load mode.

Until we meet next time,

Be courteous. Drive responsibly.

#0388 – SQL Server – SSIS – Lookup transformations are case-sensitive

Lookup Transformations – A key to successful system integrations

Cross-referencing of Enumerations

One of the key challenges for any system integration is to ensure that the enumerations and “default” values used in the source system (e.g. sales statuses, product categories, etc) align between the “source” & “target” systems.

Once the values aligned during business, high-level and low-level designs, implementation of this cross-referencing in SQL Server Integration Services (SSIS) is done by a data flow component called the “Lookup Transformation“. The Lookup transformation effectively performs a join between the input data with a reference data set. If values match, they are available in what is called the “Match Output” whereas values that do not match can be made available as a “No Match Output”. However, this comes with a tiny counter-intuitive behaviour that I learnt about the hard way.

The lookup performed by the lookup transformation is case-sensitive.

Demo

In order to demonstrate the case-sensitivity of lookup transformations, I have developed a SSIS package that does the following in a simple data-flow task:

Get some static data from an OLEDB data source, basically some rows with text representation of numbers (One, Two, Three, and so on)
The Lookup transform has a static mapping between the numeric and text values of various numbers – 1 through 10
As the input data passes through the lookup transformation, we try to map the text values in the source data with the values available in the lookup transformation so that we can get the appropriate numeric representation
- In my demo, records that find a valid lookup are written to a recordset destination (it could be any valid destination), whereas records that do not match are written to another destination
- I have placed data viewers on the output pipelines to visually see the data being moved, which is what I will show below

The query used to generate the static data in the OLE DB source is provided below.

SELECT srcValues.RowName,
       srcValues.RowValue
FROM (VALUES ('Row1','One'),
             ('Row2','Two'),
             ('Row3','three'),
             ('Row4','Four'),
             ('Row5','Five'),
             ('Row6','Six'),
             ('Row7','seven'),
             ('Row8','eight'),
             ('Row9','Nine'),
             ('Row10','Ten')
     ) AS srcValues (RowName, RowValue);

The query used to generate the lookup values for the lookup transform is provided below:

SELECT lookUpValues.Id,
       lookUpValues.RowValue
FROM (VALUES (1, 'One'),
             (2, 'Two'),
             (3, 'Three'),
             (4, 'Four'),
             (5, 'Five'),
             (6, 'Six'),
             (7, 'Seven'),
             (8, 'Eight'),
             (9, 'Nine'),
             (10, 'Ten')
     ) AS lookUpValues (Id, RowValue);

Observe that in the static source data, not all values have a consistent case – some are in sentence case, whereas some are in small case.

The screenshots below show the overall setup of the SSIS package.

Data Flow Task used to demonstrate case-sensitivity of Lookup transformation

LookUp Transformation – General configuration (Notice redirection to no match output)

Lookup Transformation – Connection tab showing reference values

Lookup Transformation – Columns configuration

Notice here that we have used the text value from the source data (“RowValue” column) for matching/lookup to the reference data set. The reference Id is fetched to include in output.

If a match is found the “Match Output” should contain the matching row from the source combined with the Id from the reference/lookup data. This is seen in the data viewer output below.

Lookup Transformation – Match Output (source rows with the Id from the reference data)

If a match is not found (which would be the case for the values with lower case in the source data), the “No Match Output” will contain the row from the source data that failed the lookup (since failures were redirected to the “No Match” output in the general configuration). Notice we do not get the Id from the reference because no match to the reference was found.

Lookup Transformation – No Match Output

Summary

When working with a case insensitive database, we often tend to take data quality with respect to case of the data for granted. Having data with inconsistent case has multiple repercussions (especially with data grouping in front end applications), but the biggest negative impact due to inconsistent case of text data is the inaccurate cross-referencing during a master data cleanup, system integration or data migration exercise.

Call to action

Do take a few cycles in your development to take a look at your data quality, and if necessary, implement required data cleansing to ensure that your lookup data, enumerations and master data are using a case that is correct and consistent with the domain and business requirements.

Until we meet next time,

Be courteous. Drive responsibly.

#0385 – SQL Server – Query Development and NULL values

Software Design, Assumptions and their side-effects

A few days ago, I tweeted a news piece that came to my attention on the name “NULL” causing computing systems to malfunction. When I shared it with my colleagues, a healthy discussion on how should software architectures handle NULL took place.

The key focus of discussion was of course that we need to be more globally aware and avoid generalization when defining business requirements (e.g. if one country does not commonly have people with surnames greater than 10 characters in length, doesn’t mean there aren’t people with longer surnames as highlighted here).

We also talked about detection of NULL values and manipulating them. In the world of databases, joins between tables are a common scenario and it is interesting to study the impact that NULL has on the JOIN operator in a T-SQL query.

NULL values and JOINs

I had encountered a scenario involving joining on NULL-able columns which I had encountered during a unit test I was executing. What happened was that rows were disappearing and initially I was unable to figure out why. Allow me to explain with a demo.

I have a simple table with software metrics like lines of code (LoC), story points, etc. for various modules. I also have a look-up table that gives me details of the measurement units, including provision to handle unknown sizing units (which was possible in our case for modules developed in fairly uncommon technologies).

When I joined between the metrics and the sizing units table, what happened was that metrics for the project in uncommon technologies did not appear!

DECLARE @someMetrics TABLE
         (ModuleCode INT         NOT NULL,
          SizeValue  INT             NULL,
          SizeUnit   VARCHAR(10)     NULL
         );

DECLARE @unitMaster TABLE
        (SizeUnit        VARCHAR(10) NULL,
         UnitDescription VARCHAR(50) NULL
        );

INSERT INTO @unitMaster (SizeUnit,
                         UnitDescription
                        )
VALUES (NULL , 'Unit Unknown'),
       ('LOC', 'Lines of Code'),
       ( 'SP', 'Story Points');

INSERT INTO @someMetrics (ModuleCode,
                          SizeValue,
                          SizeUnit
                         )
VALUES (1, 1200, 'LOC'),
       (3, 3, 'SP' ),
       (6, 32, NULL ),
       (7, 2500, 'LOC');

--Problem
SELECT sm.ModuleCode,
       sm.SizeValue,
       um.UnitDescription
FROM @someMetrics AS sm
INNER JOIN @unitMaster AS um ON sm.SizeUnit = um.SizeUnit;

The join operator will eliminate rows with null values in the join predicates

Metrics for NULL sizing units were filtered in the join process

The Root Cause

What is happening is that by default NULL compared with any value (NULL or otherwise) yields NULL. This is expected because when one of two values in a comparison is an unknown, there is no way to ascertain the result of the comparison and hence it has to be unknown, i.e. NULL.

In case of a join, the way SQL Server behaves is that all rows that match a given condition are fetched. Rows with conditions evaluating to “anything other than true” are filtered out. This is exactly what was happening in my case – since the columns used in the joined predicate had NULL values, the comparison yielded NULL (a non-true result) and hence these rows were filtered out.

The Fix

Option 01: The ideal fix is to ensure that the data never contains NULLs. Instead of NULL, a default value (anything suitable to the domain) can be used to indicate that the source system did not supply any value.

Option 02: In case changing the source data is not feasible, the query needs to be updated to substitute the NULL values for a chosen default as shown below.

DECLARE @someMetrics TABLE
         (ModuleCode INT         NOT NULL,
          SizeValue  INT             NULL,
          SizeUnit   VARCHAR(10)     NULL
         );

DECLARE @unitMaster TABLE
        (SizeUnit        VARCHAR(10) NULL,
         UnitDescription VARCHAR(50) NULL
        );

INSERT INTO @unitMaster (SizeUnit,
                         UnitDescription
                        )
VALUES (NULL , 'Unit Unknown'),
       ('LOC', 'Lines of Code'),
       ( 'SP', 'Story Points');

INSERT INTO @someMetrics (ModuleCode,
                          SizeValue,
                          SizeUnit
                         )
VALUES (1, 1200, 'LOC'),
       (3, 3, 'SP' ),
       (6, 32, NULL ),
       (7, 2500, 'LOC');

--One way to resolve the problem
SELECT sm.ModuleCode,
       sm.SizeValue,
       um.UnitDescription
FROM @someMetrics AS sm
INNER JOIN @unitMaster AS um
      ON ISNULL(sm.SizeUnit,'NULL') = ISNULL(um.SizeUnit,'NULL');

Using a chosen default instead of NULL in the join predicate

Conclusion

Irrespective of the business design, it is extremely important to have a strategy on handling of NULL values during system design. Most systems lack consistency in the way they allow for and handle NULL values.

Bringing in a set of standardized design practices on handling of NULL values makes it easier to predict system behaviour and also helps in re-engineering efforts whenever required to do so in the future.

SQLTwins by Nakul Vachhrajani

SQL Server tips and experiences dedicated to my twin daughters.

Tag Archives: Data Quality

#0406 – SQL Server – Remember that spaces and blank strings are the same

Further Reading

#0390 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Keep Identity option

The OLE DB Destination

Default behaviour (Keep Identity unchecked)

Inserting data into Identity tables (Keep Identity checked)

Further Reading

#0389 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Keep NULLs option

OLE DB destination

Keep NULLs option

Summary

#0388 – SQL Server – SSIS – Lookup transformations are case-sensitive

Lookup Transformations – A key to successful system integrations

Cross-referencing of Enumerations

Demo

Summary

Call to action

#0385 – SQL Server – Query Development and NULL values

Software Design, Assumptions and their side-effects

NULL values and JOINs

The Root Cause

The Fix

Conclusion

Further Reading