Tag Archives: Tips

General Microsoft SQL Server tips

#0393 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Rows per batch and Max. Insert Commit Size options


Developing SSIS packages is quite easy – it’s mostly drag and drop with some minor configuration. However, when it comes to tuning the package, one needs to understand the finer points of each task on the control flow. Each task has some options that help improve the performance of the data flow, while others help regulate the quality of the data being migrated.

The OLE DB Destination

I have been writing about the OLE DB destination over the last couple of posts. To load data into the destination as quickly as possible, the OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

  1. Keep Identity
  2. Keep NULLs
  3. Table Lock
  4. Check Constraints
  5. Rows per Batch
  6. Maximum Insert Commit Size

When we use the fast load options of the OLE DB destination, we are essentially using the BULK INSERT T-SQL command. This is why we get almost all the options of BULK INSERT on the OLE DB destination. Today, I will take a look at the last two options, which can significantly improve data load performance on a system with a slow I/O subsystem: “Rows per Batch” and “Maximum Insert Commit Size”.
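
For reference, here is a minimal sketch of how these two knobs surface on the underlying BULK INSERT command (the data file path is hypothetical; in SSIS, the OLE DB destination sets these options for you). BATCHSIZE roughly corresponds to “Maximum insert commit size”, and ROWS_PER_BATCH to “Rows per Batch”:

BULK INSERT [dbo].[SalesOrderDetail]
FROM 'C:\Data\SalesOrderDetail.dat'  --hypothetical data file
WITH (
      BATCHSIZE      = 10000,    --rows committed per transaction ("Maximum insert commit size")
      ROWS_PER_BATCH = 35181930  --hint about the total number of rows ("Rows per Batch")
     );
GO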

My test instance is on my primary development environment, and hence even this reasonably large load completes in a couple of minutes. However, on a slow I/O subsystem, the impact of these options will be much higher.

Creating the package with logging for comparing execution time

I have created a simple package that creates a table and inserts data into it. The table is identical to the [Sales].[SalesOrderDetail] table in the [AdventureWorks2014] sample database. The table creation script used in the Execute SQL task on the package is provided below:

USE [tempdb];
GO

IF OBJECT_ID('[dbo].[SalesOrderDetail]','U') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[SalesOrderDetail];
END
GO

CREATE TABLE [dbo].[SalesOrderDetail]
   (
    [SalesOrderID]           [INT]              NOT NULL,
    [SalesOrderDetailID]     [INT]              NOT NULL,
    [CarrierTrackingNumber]  [NVARCHAR](25)         NULL,
    [OrderQty]               [SMALLINT]         NOT NULL,
    [ProductID]              [INT]              NOT NULL,
    [SpecialOfferID]         [INT]              NOT NULL,
    [UnitPrice]              [MONEY]            NOT NULL,
    [UnitPriceDiscount]      [MONEY]            NOT NULL 
                             CONSTRAINT [DF_sodUnitPriceDiscount]  DEFAULT ((0.0)),
    [LineTotal]              DECIMAL(38, 6),
    [rowguid]                [UNIQUEIDENTIFIER] NOT NULL,
    [ModifiedDate]           [DATETIME]         NOT NULL 
                             CONSTRAINT [DF_sodModifiedDate]  DEFAULT (GETDATE())
   ) ON [PRIMARY]
GO

Once the table is created, the package “flows” to the Data Flow Task. Inside the data flow, I select 35,181,930 records (121,317 rows in [AdventureWorks2014].[Sales].[SalesOrderDetail] × 290 rows in [AdventureWorks2014].[HumanResources].[Employee]) via a CROSS JOIN in an OLE DB source, and pump them into the newly created target table via an OLE DB destination with the “Table Lock” option checked (default).

I then configure logging on the package to log the package activity for the Data Flow Task for the OnError, OnPreExecute and OnPostExecute events (Configuring package logging is out of scope for this blog post).

The script used in the OLEDB source is presented here:

SELECT sod.SalesOrderID,
       sod.SalesOrderDetailID,
       sod.CarrierTrackingNumber,
       sod.OrderQty,
       sod.ProductID,
       sod.SpecialOfferID,
       sod.UnitPrice,
       sod.UnitPriceDiscount,
       sod.LineTotal,
       sod.rowguid,
       sod.ModifiedDate
FROM Sales.SalesOrderDetail AS sod
CROSS JOIN HumanResources.Employee AS hre;
GO

Screenshots showing the package configuration described above are shown below.

OLEDB Destination – Package Overview to test Rows/Batch and Max. Insert Commit Size

Please note that between each run, the data buffers were cleaned and the procedure cache was cleared out to get a “cold” state performance of the database engine.

CHECKPOINT;              --flush dirty pages to disk
DBCC DROPCLEANBUFFERS;   --remove clean pages from the buffer pool (cold data cache)
DBCC FREEPROCCACHE;      --clear the plan (procedure) cache
GO

Once the package is executed, I will compare the difference between the OnPreExecute and OnPostExecute times for various configurations of the “Rows per batch” and “Max. insert commit size” options to get an idea of the performance difference between them.

NOTE: The packages were executed after it was confirmed that the databases involved (in this case tempdb) had grown sufficiently to accommodate the inserted data.

Comparing package execution performance

Because I had turned on logging on the SSIS package, I ran the following query against the [dbo].[sysssislog] table, which gives me the time difference (in seconds) between the “OnPreExecute” and “OnPostExecute” events for each configuration. The query and the results are available below:

USE [tempdb];
GO
SELECT [PivotedData].[source],
       [PivotedData].[OnPreExecute],
       [PivotedData].[OnPostExecute],
       DATEDIFF(SECOND,[PivotedData].[OnPreExecute],[PivotedData].[OnPostExecute]) AS [ExecutionTime]
FROM (SELECT [sl].[source],
             [sl].[event],
             [sl].[starttime]
      FROM [dbo].[sysssislog] AS [sl] WITH (NOLOCK)
      WHERE ([sl].[event] = 'OnPreExecute' 
             OR 
             [sl].[event] = 'OnPostExecute'
            )
        AND [sl].[source] LIKE 'OLEDB%'
     ) AS SourceData
PIVOT (MAX([SourceData].[starttime])
       FOR [SourceData].[event] IN ([OnPreExecute], [OnPostExecute])
      ) AS PivotedData;
GO
OLEDB Destination – Performance impact of adjusting Rows/Batch and Max. Insert Commit Size

As can be seen from the screenshot of the results, the tasks with controlled batch sizes and rows/batch performed better even on my development environment. I have seen considerable improvements on systems with poor I/O performance.

Conclusion

The fast load options on the OLE DB destination are, therefore, a very powerful way of tuning data inserts into a destination SQL Server database.

There is no magic bullet for ensuring SSIS performance, but a thorough evaluation of, and appropriate adjustments to, the OLEDB destination options based on business rules, overall system configuration and the nature of the workload is sure to get optimal performance from your packages.

Until we meet next time,

Be courteous. Drive responsibly.

#0392 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Check Constraints option


Developing SSIS packages is quite easy – it’s mostly drag and drop with some minor configuration, which is a real boon for someone who is new to SSIS. However, when it comes to tuning the package, one needs to understand the finer points of each task on the control flow. Each task has some options that help improve the performance of the data flow, while others help regulate the quality of the data being migrated.

The OLE DB Destination

I have written about the OLE DB destination in earlier posts. To load data into the destination as quickly as possible, the OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

  1. Keep Identity
  2. Keep NULLs
  3. Table Lock
  4. Check Constraints
  5. Rows per Batch
  6. Maximum Insert Commit Size

Today, I will take a look at the “Check Constraints” option which controls the quality of the data “flowing” through the data flow task.

Leveraging Constraints in SQL Server

Before I go ahead and demonstrate the effect of the “Check Constraints” checkbox on the OLE DB Destination, I will reiterate the use of constraints in Microsoft SQL Server table design. SQL Server supports multiple types of constraints that help maintain data quality – both referential and domain integrity:

  • NOT NULL constraint – Prevents NULL values in a column
  • UNIQUE constraint – Ensures values in a column are unique
  • PRIMARY KEY – Uniquely identifies a row in a table
  • FOREIGN KEY – Identifies and enforces relationships between tables
  • CHECK constraint – Checks the values being inserted against a defined set of business rules for valid data range values in the column
  • DEFAULT constraint – Ensures that when an explicit value is not specified by the client, a default value is used so as not to break logical integrity of the data

Constraints that enforce referential and domain integrity are a physical implementation of the entity relationship and logical database design. As businesses grow and system architectures evolve, valid values for various enumerations also change and evolve. Values valid for a domain are enforced in the physical design via CHECK constraints.

Historical data may or may not conform to the values enforced by current CHECK constraints. For example, an expense workflow may have multiple stages which may not have existed in the past. While current data in the transaction systems would have been updated to conform to the new enumerations, the historical data may still be as-is, i.e. valid in the past, but invalid today.

Check constraints and OLE DB Destination in SSIS

When moving such data over to a warehouse, the transfer is a bulk data movement. By default, Microsoft SQL Server does not check CHECK constraints when loading data in bulk. SSIS allows you to control this behaviour when using the OLE DB destination.
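
For reference, this is the same switch that the underlying T-SQL bulk load exposes. A minimal sketch with a hypothetical data file:

BULK INSERT dbo.UserRegistration
FROM 'C:\Data\UserRegistration.dat'  --hypothetical data file
WITH (CHECK_CONSTRAINTS);  --without this option, CHECK constraints are not validated during the bulk load
GO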

Demo

For this demo, I have a fairly simple scenario – a user registration table that enforces the following simple rules:

  • A basic check for the validity of E-mail address
  • The user registering must be 18 years of age or older

These checks are enforced by using CHECK constraints, and the DDL is provided below for your kind reference.

USE tempdb;
GO

--Safety Check
IF OBJECT_ID('dbo.UserRegistration','U') IS NOT NULL
BEGIN
    DROP TABLE dbo.UserRegistration;
END
GO

CREATE TABLE dbo.UserRegistration 
            (UserId         INT         NOT NULL IDENTITY(1,1),
             UserName       VARCHAR(20) NOT NULL,
             UserPassword   VARCHAR(20) NOT NULL,
             UserEmail      VARCHAR(50)     NULL,
             UserBirthDate  DATE        NOT NULL
            );
GO

--Primary Key
ALTER TABLE dbo.UserRegistration
    ADD CONSTRAINT pk_UserRegistrationUserId PRIMARY KEY CLUSTERED (UserId);
GO

--User must have a valid E-mail 
--(basic checking done here)
ALTER TABLE dbo.UserRegistration
    ADD CONSTRAINT chk_UserRegistrationUserEmail 
        CHECK (UserEmail LIKE '[a-z,0-9,_,-]%@[a-z,0-9,_,-]%.[a-z][a-z]%');
GO

--User must be at least 18 years in age
ALTER TABLE dbo.UserRegistration
    ADD CONSTRAINT chk_UserRegistrationBirthDate 
        CHECK (DATEDIFF(YEAR,UserBirthDate,GETDATE()) >= 18);
GO
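
As a quick sanity check (with hypothetical values), a plain T-SQL INSERT against this table is rejected when it violates one of these rules:

--This row violates chk_UserRegistrationBirthDate (user is under 18)
--and fails with a CHECK constraint violation error (Msg 547).
INSERT INTO dbo.UserRegistration (UserName, UserPassword, UserEmail, UserBirthDate)
VALUES ('SQLTwinsX', 'SQLTwins@PwdX', 'someone@somedomain.com', '2016-01-01');
GO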

In my SSIS package, I have used a standard data flow task. The OLE DB source uses a query that creates some valid and invalid data which I would like to insert into the destination [dbo].[UserRegistration] table. Below is the T-SQL query used for the source and screenshots of my data flow task.

SELECT TestData.UserName, 
       TestData.UserPassword, 
       TestData.UserEmail, 
       TestData.UserBirthDate
FROM 
(VALUES --Valid Data
    ('SQLTwins1','SQLTwins@Pwd1','validEmail1@somedomain.com','1960-01-01'),
    ('SQLTwins2','SQLTwins@Pwd2','validEmail2@somedomain.com','1970-01-01'),
    --Invalid Email
    ('SQLTwins3','SQLTwins@Pwd3','invalidEmail@somedomain'   ,'1970-01-01'),
    --Invalid BirthDate
    ('SQLTwins4','SQLTwins@Pwd4','validEmail4@somedomain.com','2016-01-01'),
    --Invalid Email & BirthDate
    ('SQLTwins5','SQLTwins@Pwd5','invalidEmail'              ,'2016-01-01')
) AS TestData (UserName, UserPassword, UserEmail, UserBirthDate);
OLE DB Source for checking bypass of the check constraints during bulk inserts

OLE DB Destination with Fast Load options, notice the “Check constraints” is checked by default

Notice that, by default, the “Check Constraints” checkbox is checked. When I run the package with the checkbox checked, the package fails during execution.

SSIS Package failure when one or more input values violate CHECK constraints

Looking at the package progress log, I can confirm that the package failed because the data violated check constraints when inserting into the table. The error has been formatted below to enhance readability.

[Insert into UserRegistration table [2]] Error: SSIS Error Code DTS_E_OLEDBERROR.  
An OLE DB error has occurred. Error code: 0x80004005.

An OLE DB record is available.  Source: "Microsoft SQL Server Native Client 11.0"  
Hresult: 0x80004005  Description: "The statement has been terminated.".

An OLE DB record is available.  Source: "Microsoft SQL Server Native Client 11.0"  
Hresult: 0x80004005  
Description: "The INSERT statement conflicted with the CHECK constraint "chk_UserRegistrationUserEmail". 
             The conflict occurred in database "tempdb", table "dbo.UserRegistration", column 'UserEmail'.".

I now edit the package to uncheck the “Check Constraints” checkbox on the OLE DB Destination.

Unchecking the “Check constraints” checkbox on the OLE DB Destination

Running the package again results in success.

Successful Package Execution with “Check constraints” unchecked, i.e. data violating check constraints was successfully inserted

I now validate the data in the [dbo].[UserRegistration] table. I can see that data violating check constraints was successfully inserted.

USE [tempdb];
GO
SELECT [ur].[UserId],
       [ur].[UserName],
       [ur].[UserPassword],
       [ur].[UserEmail],
       [ur].[UserBirthDate]
FROM [dbo].[UserRegistration] AS [ur];
GO
Confirmation that constraints were not checked during insertion of data by the SSIS package

Important Notes

  • Unchecking “Check Constraints” only turns off the checks for check constraints
  • Referential and other integrity constraints (NOT NULL, Foreign Key, Unique Key, etc) continue to be enforced

Conclusion

The OLE DB destination task is a very powerful way to load data into a SQL Server table in a short duration of time. At the same time, it can also cause bad data to be inserted into the destination if not used wisely.

Turn the Check Constraints option off to optimize the data load when you are sure that:

  • Source data has expected discrepancies which are acceptable to the business/domain OR source data is correct
  • Re-validation of the check constraints will be done as a post-migration process to make all constraints trusted again; bulk loading without checking constraints marks them as not trusted (a re-validation sketch follows this list)
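
A minimal sketch of such a post-load re-validation in T-SQL:

--Re-check all existing rows and mark the constraints on the table as trusted again.
--This statement fails if any row still violates a constraint.
ALTER TABLE dbo.UserRegistration WITH CHECK CHECK CONSTRAINT ALL;
GO

--Verify: is_not_trusted should now be 0 for the CHECK constraints
SELECT [name],
       [is_not_trusted]
FROM sys.check_constraints
WHERE [parent_object_id] = OBJECT_ID('dbo.UserRegistration','U');
GO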

Until we meet next time,

Be courteous. Drive responsibly.

#0391 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Performance – Table Lock option


Developing SSIS packages is quite easy – it’s mostly drag and drop with some minor configuration, which is a real boon for someone who is new to SSIS. However, when it comes to tuning the package, one needs to understand the finer points of each task on the control flow.

The OLE DB Destination

In one of my previous posts, I started to explore the OLE DB destination. To load data into the destination as quickly as possible, the OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

  1. Keep Identity
  2. Keep NULLs
  3. Table Lock
  4. Check Constraints
  5. Rows per Batch
  6. Maximum Insert Commit Size

I looked at the “Keep NULLs” and the “Keep Identity” options earlier, and today I will go over the “Table Lock” option.

Because the option is part of the OLE DB destination task, the “Table Lock” option indicates whether the database engine should request a wider lock on the entire table (i.e. use TABLOCK), rather than acquiring granular locks on each row/page and relying on lock escalation to eventually lock the table.

Theoretically, when moving extremely large amounts of data relative to the number of records already in the destination table, the database engine would deem the granular locks (at the row/page level) too cost-prohibitive and escalate to a table lock anyway. In such cases, it is better to request the wider TABLOCK up front.
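
The same hint can be requested on a plain T-SQL bulk load as well. A minimal sketch, assuming the [dbo].[SalesOrderDetail] test table created later in this post and the AdventureWorks2014 source:

USE [tempdb];
GO
--Request a table-level lock up front instead of relying on lock escalation;
--the column order of the source matches the target, so no column list is needed.
INSERT INTO [dbo].[SalesOrderDetail] WITH (TABLOCK)
SELECT sod.*
FROM [AdventureWorks2014].[Sales].[SalesOrderDetail] AS sod;
GO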

Allow me to present a brief demo.

Creating the package with logging for comparing execution time

As always, I have created a simple package that creates a table and inserts data into it. The table is identical to the [Sales].[SalesOrderDetail] table in the [AdventureWorks2014] sample database. The table creation script used in the Execute SQL task on the package is provided below:

USE [tempdb];
GO

IF OBJECT_ID('[dbo].[SalesOrderDetail]','U') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[SalesOrderDetail];
END
GO

CREATE TABLE [dbo].[SalesOrderDetail]
   (
    [SalesOrderID]           [INT]              NOT NULL,
    [SalesOrderDetailID]     [INT]              NOT NULL,
    [CarrierTrackingNumber]  [NVARCHAR](25)         NULL,
    [OrderQty]               [SMALLINT]         NOT NULL,
    [ProductID]              [INT]              NOT NULL,
    [SpecialOfferID]         [INT]              NOT NULL,
    [UnitPrice]              [MONEY]            NOT NULL,
    [UnitPriceDiscount]      [MONEY]            NOT NULL 
                             CONSTRAINT [DF_sodUnitPriceDiscount]  DEFAULT ((0.0)),
    [LineTotal]              DECIMAL(38, 6),
    [rowguid]                [UNIQUEIDENTIFIER] NOT NULL,
    [ModifiedDate]           [DATETIME]         NOT NULL 
                             CONSTRAINT [DF_sodModifiedDate]  DEFAULT (GETDATE())
   ) ON [PRIMARY]
GO

Once the table is created, the package “flows” to the Data Flow Task. Inside the data flow, I select the data from the [AdventureWorks2014].[Sales].[SalesOrderDetail] table multiplied 4 times (121,317 × 4 = 485,268 rows) using an OLE DB source and pump it to the newly created target table via an OLE DB destination with the “Table Lock” option checked (default).

I then configure logging on the package to log the package activity for the Data Flow Task for the OnError, OnPreExecute and OnPostExecute events (Configuring package logging is out of scope for this blog post).

The script used in the OLEDB source is presented here:

USE AdventureWorks2014;
GO
SELECT sod.SalesOrderID,
       sod.SalesOrderDetailID,
       sod.CarrierTrackingNumber,
       sod.OrderQty,
       sod.ProductID,
       sod.SpecialOfferID,
       sod.UnitPrice,
       sod.UnitPriceDiscount,
       sod.LineTotal,
       sod.rowguid,
       sod.ModifiedDate
FROM Sales.SalesOrderDetail AS sod
CROSS JOIN (VALUES (1), (2), (3), (4)) AS Combinations (ComboValue)
GO

A set of screenshots showing the package configuration described above is provided below.

An “Execute SQL Task” creates the test table

The OLE DB destination in “Fast Load” configuration with “Table Lock” checked (default)

Package Log configuration enabled for the OLEDB destination

Event configuration done to log task activities as part of package logging

Once the package is executed, I will compare the difference between the OnPreExecute and OnPostExecute times for both configurations of the “Table Lock” option to get an idea of the performance difference between them.

NOTE: The packages were executed after it was confirmed that the databases involved (in this case tempdb) had grown sufficiently to accommodate the inserted data.

Confirming that we are indeed taking a Table Lock

While the package is executing in SQL Server Data Tools (SSDT, erstwhile BIDS), I run the following query on the SQL Server to check the locks occupied on our test table.

USE [tempdb]
GO
SELECT tl.resource_associated_entity_id AS ObjectId,
       OBJECT_NAME(tl.resource_associated_entity_id) AS ObjectName,
       tl.request_mode AS LockRequestMode
FROM [sys].[dm_tran_locks] AS tl
WHERE tl.resource_database_id = DB_ID()
  AND tl.resource_associated_entity_id = OBJECT_ID('[dbo].[SalesOrderDetail]','U');
GO

Based on the results, we can confirm that a Bulk Update (BU) lock has indeed been requested and granted on the table – indicating that the TABLOCK option was used as part of the insert.

Bulk Update (BU) lock used on the table, indicating use of TABLOCK (Table Lock)

Running the package with “Table Lock” turned Off

If I check the locks on the table while running the package with the “Table Lock” option unchecked, I see that “Intent Exclusive” (IX) and “Exclusive” (X) locks are used instead. This indicates that SQL Server is taking exclusive locks on lower-level allocation units (pages/rows).

When “Table Lock” is unchecked, an Intent Exclusive Lock is acquired on the table, indicating granular exclusive locks in use

Please note that between each run, the data buffers were cleaned and the procedure cache was cleared out to get a “cold” state performance of the database engine.

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
GO

Comparing package execution performance

Because I had turned on logging on the SSIS package, I ran the following query against the [dbo].[sysssislog] table, which gives me the time difference (in seconds) between the “OnPreExecute” and “OnPostExecute” events for both runs. The query and the results are available below:

USE [tempdb];
GO
SELECT [PivotedTbl].[source],
       DATEDIFF(SECOND, 
                [PivotedTbl].[OnPreExecute], 
                [PivotedTbl].[OnPostExecute]
               ) AS PackageRunTimeInSeconds
FROM (SELECT sl.[source],
             sl.[event],
             sl.[starttime]
      FROM [dbo].[sysssislog] AS sl
     ) AS [SourceTbl]
PIVOT (MAX([SourceTbl].[starttime]) 
       FOR [SourceTbl].[event] IN ([OnPreExecute],[OnPostExecute])
      ) AS [PivotedTbl];
GO
Execution time summary showing data flow with OLE DB destination using Table locks completes faster than one using granular locks

We can clearly see that the table load with “Table lock” checkbox turned on is comparatively faster.

Summary

The OLE DB destination task is a very powerful way to load data into a SQL Server table. It also provides the flexibility to boost the rate of data insertion when used wisely.

  • When loading data into an empty table or when system is under an outage window, evaluate keeping the “Table Lock” checkbox checked
  • When it is important to keep the tables accessible during the data load, evaluate keeping the “Table Lock” checkbox unchecked
    • This will ensure that if possible, the SELECT queries are not blocked from being executed as long as they do not refer to the same page(s) being inserted/updated
  • Using a higher-level lock (requesting a table lock/TABLOCK) reduces the data “load” time due to the reduced overhead of maintaining granular locks, and can be used as a means to speed up data inserts for large data sets

References:

  • Lock Modes in SQL Server [MSDN Link]
  • Lock Compatibility [MSDN Link]
  • Performance – Best Practice – Create Index Before or After a data insert? [Blog Link]
  • Performance – Best Practice – Create Index Before or After a data insert – I/O, Fragmentation, CPU and Elapsed Time [Blog Link]

Until we meet next time,

Be courteous. Drive responsibly.

Checking the "Keep Identity" checkbox

#0390 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Keep Identity option


Recently, I started writing about the nuances of SSIS which most accidental SSIS developers may frequently be stumped by, owing to the differences in behaviour compared to conventional T-SQL.

The OLE DB Destination

In my previous post, I started to explore the OLE DB destination. To load data into the destination as quickly as possible, the OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

  1. Keep Identity
  2. Keep NULLs
  3. Table Lock
  4. Check Constraints
  5. Rows per Batch
  6. Maximum Insert Commit Size

I looked at the “Keep NULLs” option earlier, and today I will go over the “Keep Identity” option.

The “Keep Identity” option, as the name suggests, controls the ability of the OLE DB destination task to insert data into an IDENTITY column. IDENTITY columns are a commonly used mechanism to provide an auto-incrementing value in a table. When migrating data from one system to another, it may be necessary to preserve the identity values during the migration.

Allow me to demonstrate the switch in action. In order to setup the demo, I created a table in the [tempdb] database on my SQL Server instance with an IDENTITY column, and inserted some test data into it.

USE tempdb;
GO
--Safety Check
IF OBJECT_ID('dbo.KeepIdentityInOLEDB','U') IS NOT NULL
BEGIN
    DROP TABLE dbo.KeepIdentityInOLEDB;
END
GO

--Create table
CREATE TABLE dbo.KeepIdentityInOLEDB 
    ([IdentityId]       INT          NOT NULL IDENTITY(1,1),
     [ProductName]      VARCHAR(255)     NULL,
     [ManufacturerName] VARCHAR(255)     NULL
    );
GO

--Initial sample test Data
INSERT INTO dbo.[KeepIdentityInOLEDB] ([ProductName], [ManufacturerName])
SELECT [ProductList].[ProductName],
       [ProductList].[ManufacturerName] 
FROM (VALUES ('Windows'     , 'Microsoft'),
             ('SQL Server'  , NULL       ),
             ('VisualStudio','Microsoft'),
             ('MySQL'       , 'Oracle'   ),
             ('PeopleSoft'  , 'Oracle'   )
     ) AS [ProductList] ([ProductName], [ManufacturerName]);
GO

Default behaviour (Keep Identity unchecked)

When an OLE DB destination is configured to load a table with identity columns, the OLE DB destination fails validation if the identity column is mapped to the input.

Mapping identity column in the target table with a source value

Validation errors if identity columns are mapped and “Keep Identity” is unchecked (default)

Even if I forcibly try to run the package, the package fails.

Execution Failure when trying to execute the package with “Keep Identity” unchecked

Inserting data into Identity tables (Keep Identity checked)

If I check the “Keep Identity” checkbox, the validation error disappears and the package runs successfully.

Checking the “Keep Identity” checkbox

Successful package execution with “Keep Identity” checked

When moving data from one system into another where the identity values can uniquely refer to the records being inserted, this option works perfectly.

However, keep in mind a side-effect of this checkbox. Let me take a look at the data that was inserted into the test table.

Inserting data with “Keep Identity” checked may cause duplicates in the identity column

Checking the “Keep Identity” checkbox is like setting the IDENTITY_INSERT option to ON when inserting data into a table using T-SQL. The only difference is that the option is set back to OFF once the package execution completes, and hence no explicit handling is required.
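
For reference, a minimal T-SQL sketch of the equivalent behaviour (the bulk load analog of this checkbox is the KEEPIDENTITY option):

SET IDENTITY_INSERT dbo.KeepIdentityInOLEDB ON;
GO
--An explicit column list is mandatory while IDENTITY_INSERT is ON.
--Because [IdentityId] has no unique constraint here, duplicates are possible.
INSERT INTO dbo.KeepIdentityInOLEDB ([IdentityId], [ProductName], [ManufacturerName])
VALUES (1, 'Windows', 'Microsoft');
GO
SET IDENTITY_INSERT dbo.KeepIdentityInOLEDB OFF;
GO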

Further Reading

I have written a series of posts on T-SQL behaviour around identity values, and I trust you would find them interesting.

  • An introduction to IDENTITY columns [Link]
  • What are @@IDENTITY, SCOPE_IDENTITY, IDENT_CURRENT and $IDENTITY? [Link]
  • Myths – IDENTITY columns cannot have holes or “gaps” [Link]
  • Myths – Values cannot be explicitly inserted into IDENTITY columns [Link]
  • Myths – Duplicate Values cannot exist in IDENTITY columns [Link]
  • Myths – The value for SEED and INCREMENT must be 1 [Link]
  • Myths – IDENTITY values (including SEED & INCREMENT) cannot be negative [Link]
  • Myths – IDENTITY columns cannot be added to existing tables – Part 01 [Link]
  • Myths – IDENTITY columns cannot be added to existing tables – Part 02 [Link]
  • Myths – IDENTITY columns cannot have constraints defined on them [Link]
  • Myths – IDENTITY columns do not propagate via SELECT…INTO statements [Link]
  • Use IDENTITY() Function to change the Identity specification in a SELECT…INTO statement [Link]

Until we meet next time,

Be courteous. Drive responsibly.

#0389 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Keep NULLs option


SQL Server Integration Services (SSIS) is typically called upon when integrating systems that exchange data from one source to a given destination. The reason I use the terms “source” and “destination” instead of “database” is that either of the two can be something other than a database (a flat file, a web service, a script task, etc.). This is possible because SSIS is built on the .NET framework, much like any other .NET-based programming language (C# or VB.NET).

OLE DB destination

Because one would commonly have either Microsoft Access or Microsoft SQL Server on at least one side of the integration, the most common source and destination used in an SSIS-based data solution are the OLE DB Source and the OLE DB Destination. The OLE DB destination allows you to load data into a table, a view or even a SQL command (e.g. the results of a statement execution).

To load data into the destination as quickly as possible, the OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

  1. Keep Identity
  2. Keep NULLs
  3. Table Lock
  4. Check Constraints
  5. Rows per Batch
  6. Maximum Insert Commit Size

We will look at each option in detail over the next couple of weeks.

Keep NULLs option

The Keep NULLs option is normally something that most accidental SSIS developers do not pay much attention to. It comes unchecked by default and is typically left unchecked. However, the state of this checkbox can have a significant impact on the completeness and quality of the data being inserted into the destination database.

To clarify, allow me to explain the functionality of this checkbox (a short T-SQL sketch follows the list):

  1. Checked – If a column in the source data has NULL values, keep them as-is
  2. Unchecked – If a column in the source data has NULL values, try to replace them with the default values as defined by the destination DB
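
A minimal T-SQL sketch of the two behaviours, using the [dbo].[KeepNullsInOLEDB] demo table that is created later in this post:

--“Keep NULLs” checked: the NULL from the source is preserved as-is
INSERT INTO dbo.KeepNullsInOLEDB ([ProductName], [ManufacturerName])
VALUES ('SQL Server', NULL);

--“Keep NULLs” unchecked: the destination's default constraint supplies the value
INSERT INTO dbo.KeepNullsInOLEDB ([ProductName], [ManufacturerName])
VALUES ('SQL Server', DEFAULT);  --inserts 'Microsoft'
GO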

The state of this checkbox typically does not make much of a difference because, in most cases, the domain and business rules in both systems involved would be similar. Thus, if a column in one system allows a NULL value, other systems in the same domain would also allow a NULL (e.g. in most enrollment forms, the last name would generally be mandatory but the first name is not). However, legacy systems (which have been around for decades) would have accumulated a lot of data that does not conform to newer domain practices, causing issues during migration. This is when the “Keep Nulls” checkbox comes into action.

In the case I am going to present today, I have a set of Product Names and their corresponding Manufacturers. In a few of these cases, I don’t know the manufacturer and have therefore kept it blank.

USE tempdb;
GO
--Test Data
SELECT [ProductList].[ProductName],
       [ProductList].[ManufacturerName] 
FROM (VALUES ('Windows'     , 'Microsoft'),
             ('SQL Server'  , NULL       ),
             ('VisualStudio','Microsoft'),
             ('MySQL'       , 'Oracle'   ),
             ('PeopleSoft'  , 'Oracle'   )
     ) AS [ProductList] ([ProductName], [ManufacturerName]);
GO
Sample data with some NULL values

For the sake of this demo, I have used this query as my source in the test SSIS package. Below is a screenshot of my data flow task.

Using a test data query in the OLE DB source command

I directly take this dataset as input to the OLE DB destination. The OLE DB destination is configured to a test table ([dbo].[KeepNullsInOLEDB]) with the following table definition.

USE [tempdb];
GO
--Safety Check
IF OBJECT_ID('dbo.KeepNullsInOLEDB','U') IS NOT NULL
BEGIN
    DROP TABLE dbo.KeepNullsInOLEDB;
END
GO

--Create table
CREATE TABLE dbo.KeepNullsInOLEDB 
        ([ProductName]      VARCHAR(255) NULL,
         [ManufacturerName] VARCHAR(255) NULL     
                            CONSTRAINT df_KeepNullsInOLEDB_ManufacturerName 
                            DEFAULT ('Microsoft')
        );
GO

OLE DB Destination Configuration. Notice the “Keep nulls” switch is unchecked.

After executing the package, I query the [dbo].[KeepNullsInOLEDB] table in the destination database, and compare with the source data.

Values inserted into the destination table. Notice the default value from table definition is used.

As can be seen from the screenshot, the [ManufacturerName] for “SQL Server” is not NULL. It is instead set to “Microsoft” which is the default value as set in the default constraint on the destination table.

The data inserted into the destination table changes if the checkbox is checked in the OLE DB destination.

Notice how the value from the default constraint is not used when “Keep Nulls” is checked.

If the “Keep nulls” checkbox is checked, the default constraint on the target table is not used – thereby maintaining the same data as the source.

Summary

Depending upon the business requirements, it may be critical to migrate data from a source to a destination “as-is”, without the application of default constraints. In such situations, the “Keep nulls” switch on the OLE DB destination (“Fast Load” mode) needs to be checked.

If the “Keep nulls” switch is unchecked, the default constraints from the target table definition come into effect.
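
For completeness, the checked state corresponds to the KEEPNULLS option of the underlying T-SQL bulk load. A minimal sketch with a hypothetical data file:

BULK INSERT dbo.KeepNullsInOLEDB
FROM 'C:\Data\ProductList.dat'  --hypothetical data file
WITH (KEEPNULLS);  --preserve NULLs from the file instead of applying column defaults
GO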

In my future posts, I will take a look at the other switches on the OLE DB Fast Load mode.

Until we meet next time,

Be courteous. Drive responsibly.