Tag Archives: Series

Articles that are too big to fit into one part.

Data Flow Task used to demonstrate case-sensitivity of Lookup transformation

#0388 – SQL Server – SSIS – Lookup transformations are case-sensitive


I have been working with SQL Server Integration Services (SSIS) recently. In many ways, SSIS is counter-intuitive if you have been working with the database engine for long (more than a decade in my case). Working with SSIS is more like working with any other .net framework based programming language (C# or VB.net). Over the next couple of days, I will be writing about some of the salient aspects of SSIS which should be kept in mind in case you are working on multiple SQL Server technologies.

Lookup Transformations – A key to successful system integrations

Cross-referencing of Enumerations

One of the key challenges for any system integration is to ensure that the enumerations  and “default” values used in the source system (e.g. sales statuses, product categories, etc) align between the “source” & “target” systems.

Once the values aligned during business, high-level and low-level designs, implementation of this cross-referencing in SQL Server Integration Services (SSIS) is done by a data flow component called the “Lookup Transformation“. The Lookup transformation effectively performs a join between the input data with a reference data set. If values match, they are available in what is called the “Match Output” whereas values that do not match can be made available as a “No Match Output”. However, this comes with a tiny counter-intuitive behaviour that I learnt about the hard way.

The lookup performed by the lookup transformation is case-sensitive.

Demo

In order to demonstrate the case-sensitivity of lookup transformations, I have developed a SSIS package that does the following in a simple data-flow task:

  1. Get some static data from an OLEDB data source, basically some rows with text representation of numbers (One, Two, Three, and so on)
  2. The Lookup transform has a static mapping between the numeric and text values of various numbers – 1 through 10
  3. As the input data passes through the lookup transformation, we try to map the text values in the source data with the values available in the lookup transformation so that we can get the appropriate numeric representation
    • In my demo, records that find a valid lookup are written to a recordset destination (it could be any valid destination), whereas records that do not match are written to another destination
    • I have placed data viewers on the output pipelines to visually see the data being moved, which is what I will show below

The query used to generate the static data in the OLE DB source is provided below.

SELECT srcValues.RowName,
       srcValues.RowValue
FROM (VALUES ('Row1','One'),
             ('Row2','Two'),
             ('Row3','three'),
             ('Row4','Four'),
             ('Row5','Five'),
             ('Row6','Six'),
             ('Row7','seven'),
             ('Row8','eight'),
             ('Row9','Nine'),
             ('Row10','Ten')
     ) AS srcValues (RowName, RowValue);

The query used to generate the lookup values for the lookup transform is provided below:

SELECT lookUpValues.Id,
       lookUpValues.RowValue
FROM (VALUES (1, 'One'),
             (2, 'Two'),
             (3, 'Three'),
             (4, 'Four'),
             (5, 'Five'),
             (6, 'Six'),
             (7, 'Seven'),
             (8, 'Eight'),
             (9, 'Nine'),
             (10, 'Ten')
     ) AS lookUpValues (Id, RowValue);

Observe that in the static source data, not all values have a consistent case – some are in sentence case, whereas some are in small case.

The screenshots below show the overall setup of the SSIS package.

Data Flow Task used to demonstrate case-sensitivity of Lookup transformation

Data Flow Task used to demonstrate case-sensitivity of Lookup transformation

LookUp Transformation - General configuration (Notice redirection to no match output)

LookUp Transformation – General configuration (Notice redirection to no match output)

Lookup Transformation - Connection tab showing reference values

Lookup Transformation – Connection tab showing reference values

Lookup Transformation - Columns configuration showing "RowValue" used for matching whereas the reference Id is fetched to include in output

Lookup Transformation – Columns configuration

Notice here that we have used the text value from the source data (“RowValue” column) for matching/lookup to the reference data set. The reference Id is fetched to include in output.

If a match is found the “Match Output” should contain the matching row from the source combined with the Id from the reference/lookup data. This is seen in the data viewer output below.

Lookup Transformation - Match Output (source rows with the Id from the reference data)

Lookup Transformation – Match Output (source rows with the Id from the reference data)

If a match is not found (which would be the case for the values with lower case in the source data), the “No Match Output” will contain the row from the source data that failed the lookup (since failures were redirected to the “No Match” output in the general configuration). Notice we do not get the Id from the reference because no match to the reference was found.

Lookup Transformation - No Match Output

Lookup Transformation – No Match Output

Summary

When working with a case insensitive database, we often tend to take data quality with respect to case of the data for granted. Having data with inconsistent case has multiple repercussions (especially with data grouping in front end applications), but the biggest negative impact due to inconsistent case of text data is the inaccurate cross-referencing during a master data cleanup, system integration or data migration exercise.

Call to action

Do take a few cycles in your development to take a look at your data quality, and if necessary, implement required data cleansing to ensure that your lookup data, enumerations and master data are using a case that is correct and consistent with the domain and business requirements.

Until we meet next time,

Be courteous. Drive responsibly.

Advertisements

#0276 – SQL Server – IDENTITY columns – Use IDENTITY() Function to change the Identity specification in a SELECT…INTO statement


We know that copying a table into another using the SELECT…INTO statement copies over the IDENTITY property also. However, it is possible that one might want to use an IDENTITY specification different from the source table in this process.


This can be achieved using the IDENTITY() function, which allows us to create a new column for the destination table when used with the SELECT…INTO clause. The function accepts upto 3 parameters – the data-type, the Identity seed and the Identity increment values for the new column.


The process described in this post can also be used to define an IDENTITY column in the destination table when the source table does not have one.


IDENTITY() – Demo


The script below demonstrates the usage of the IDENTITY column. I do not have an IDENTITY column on the source table. During the SELECT…INTO process, I wanted to create the new table with an IDENTITY specification in-place. To do so, I used the IDENTITY() function.

USE tempdb;
GO

SET NOCOUNT ON;

--1. Prepare the environment
--   Create a table, and add some test data into it

--Safety Check
IF OBJECT_ID('tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

--Create a table, notice that there are NO Identity columns on this table
CREATE TABLE dbo.tIdentity ( IdentityValue VARCHAR(10) );
GO

--Add some test data
INSERT INTO dbo.tIdentity (IdentityValue)
VALUES ('One'),
       ('Two'),
       ('Three'),
       ('Four'),
       ('Five');
GO

--2. Use SELECT..INTO to bulk insert data to a new table
SELECT IDENTITY(INT,100,1) AS DestinationId,
       SourceTable.IdentityValue AS DestinationValue
INTO dbo.DestinationTable
FROM dbo.tIdentity AS SourceTable ;

--2a. Fetch the value inserted into the destination table
SELECT DestinationTable.DestinationId,
       DestinationTable.DestinationValue
FROM dbo.DestinationTable;

--3. Check the properties of the new column - is it an IDENTITY column?
SELECT sc.name AS ColumnName,
       OBJECT_NAME(sc.object_id) AS TableName,
       sic.seed_value AS IdentitySeed,
       sic.increment_value AS IdentityIncrement,
       sic.is_nullable AS IsNullable,
       sic.last_value AS LastIdentityValueUsed
FROM sys.columns AS sc
INNER JOIN sys.identity_columns AS sic ON sc.object_id = sic.object_id
                                      AND sc.column_id = sic.column_id
WHERE ( sc.object_id = OBJECT_ID(N'dbo.DestinationTable',N'U') OR
        sc.object_id = OBJECT_ID(N'dbo.tIdentity',N'U')
      )
  AND sc.is_identity = 1;

--4. Cleanup
IF OBJECT_ID('dbo.tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

IF OBJECT_ID('dbo.DestinationTable','U') IS NOT NULL
    DROP TABLE dbo.DestinationTable;
GO

/**********************************************
               RESULTS
**********************************************/
/*
DestinationId DestinationValue
------------- ----------------
100           One
101           Two
102           Three
103           Four
104           Five

ColumnName     TableName         IdentitySeed  IdentityIncrement  IsNullable LastIdentityValueUsed
-------------- ----------------- ------------- ------------------ ---------- ----------------------
DestinationId  DestinationTable  100           1                  0          104
*/

Using IDENTITY() without SELECT…INTO – Msg 177


IDENTITY() does appear to be a very useful function, but please be aware that it cannot be used outside the SELECT…INTO clause. This can be confirmed by a simple modification to the SELECT query used in the example above:

SELECT IDENTITY(INT,100,1) AS DestinationId,
       SourceTable.IdentityValue AS DestinationValue
FROM dbo.tIdentity AS SourceTable ;

Executing this statement (assuming that the dependent tables and data is in-place) returns the following error:


Msg 177, Level 15, State 1, Line 16
The IDENTITY function can only be used when the SELECT statement has an INTO clause.


Conclusion


As can be seen from the scripts above, the IDENTITY() function is very useful to define domain/product-specific IDENTITY specification when importing data from one table to another.


Until we meet next time,


Be courteous. Drive responsibly.

#0275 – SQL Server – IDENTITY columns – Myths – IDENTITY columns do not propagate via SELECT…INTO statements


As I was writing the series on myths around IDENTITY columns, I started to wonder whether copying a table into another using the SELECT…INTO statement copies over the IDENTITY property also.


The simplest way to answer this question was to run a simple test, which is shown below:



  1. Create a table with an IDENTITY column defined, and insert some test data
  2. Use SELECT…INTO to select data from the source table and pump it into another table

    • For clarity, I will be using different column names between the source and the destination columns

  3. Use the catalog views – sys.columns and sys.identity_columns to confirm whether the new table was created with the IDENTITY column in place or not

USE tempdb;
GO

SET NOCOUNT ON;

--1. Prepare the environment
--   Create a table, and add some test data into it

--Safety Check
IF OBJECT_ID('tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

--Create a table
CREATE TABLE dbo.tIdentity (IdentityId INT IDENTITY(1,1),
                            IdentityValue VARCHAR(10)
                           );
GO

--Add some test data
INSERT INTO dbo.tIdentity (IdentityValue)
VALUES ('One'),
       ('Two'),
       ('Three');
GO

--2. Use SELECT..INTO to bulk insert data to a new table
SELECT SourceTable.IdentityId AS DestinationId,
       SourceTable.IdentityValue AS DestinationValue
INTO dbo.DestinationTable
FROM dbo.tIdentity AS SourceTable

--3. Check the properties of the new column - is it an IDENTITY column?
SELECT sc.name AS ColumnName,
       OBJECT_NAME(sc.object_id) AS TableName,
       sic.seed_value AS IdentitySeed,
       sic.increment_value AS IdentityIncrement,
       sic.is_nullable AS IsNullable,
       sic.last_value AS LastIdentityValueUsed
FROM sys.columns AS sc
INNER JOIN sys.identity_columns AS sic ON sc.object_id = sic.object_id
                                      AND sc.column_id = sic.column_id
WHERE ( sc.object_id = OBJECT_ID(N'dbo.DestinationTable',N'U') OR
        sc.object_id = OBJECT_ID(N'dbo.tIdentity',N'U')
      )
  AND sc.is_identity = 1;

--4. Cleanup
IF OBJECT_ID('dbo.tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

IF OBJECT_ID('dbo.DestinationTable','U') IS NOT NULL
    DROP TABLE dbo.DestinationTable;
GO

/**********************************************
               RESULTS
**********************************************/
/*
ColumnName     TableName         IdentitySeed IdentityIncrement IsNullable LastIdentityValueUsed
-------------- ----------------- ------------ ----------------- ---------- ---------------------
IdentityId     tIdentity              1              1               0             3
DestinationId  DestinationTable       1              1               0             3
*/

As can be seen from the experiment, the IDENTITY property propagates from one table to another via the SELECT…INTO clause.


Until we meet next time,


Be courteous. Drive responsibly.

#0274 – SQL Server – IDENTITY columns – Myths – IDENTITY columns cannot have constraints defined on them


In the first part of this series on IDENTITY columns, I had mentioned that IDENTITY columns must be NOT NULL and must not have a DEFAULT constraint defined on them. Often, this is interpreted in haste as IDENTITY columns cannot have any constraints defined on them.


This is a myth. Today’s post will attempt to clear out this confusion.


Defining a DEFAULT constraint on an IDENTITY column – Msg 1754


The script below is quite simple – it attempts to create a table with an IDENTITY column and then goes on to use the ALTER TABLE statement to add a default constraint to the IDENTITY column.

USE tempdb;
GO

SET NOCOUNT ON;

--1. Prepare the environment
--   Create a table, and add some test data into it

--Safety Check
IF OBJECT_ID('tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

--Create a table, this time WITHOUT a Clustered Key
CREATE TABLE dbo.tIdentity (RecordId INT IDENTITY (1,1),
                            IdentityValue VARCHAR(20)
                           );
GO

--2. Add a default constraint
ALTER TABLE dbo.tIdentity
    ADD CONSTRAINT df_DefaultRecordId DEFAULT(0) FOR RecordId;
GO

--3. Cleanup
IF OBJECT_ID('tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

Executing this script returns an error as shown below:


Msg 1754, Level 16, State 0, Line 3
Defaults cannot be created on columns with an IDENTITY attribute. Table ‘tIdentity’, column ‘RecordId’.
Msg 1750, Level 16, State 0, Line 3
Could not create constraint. See previous errors.


This example settles one part of the puzzle – IDENTITY columns cannot have DEFAULT constraints on them.


Defining other constraints (CHECK/PRIMARY KEY) on an IDENTITY column


The script below creates a table with an IDENTITY column and defines a primary key constraint on the IDENTITY column. Later, the script uses the ALTER TABLE statement to add a CHECK constraint on the IDENTITY value.


Because the DDLs execute successfully, we will insert some test data to test that the constraints are indeed in effect.

USE tempdb ;
GO

SET NOCOUNT ON ;

--1. Prepare the environment
--   Create a table, and add some test data into it

--Safety Check
IF OBJECT_ID('tIdentity', 'U') IS NOT NULL 
    DROP TABLE dbo.tIdentity ;
GO

--Create a table, with a primary key constraint on the IDENTITY column
CREATE TABLE dbo.tIdentity
    (
      RecordId INT IDENTITY(1, 1),
      IdentityValue VARCHAR(20),
      CONSTRAINT pk_tIdentityRecordId PRIMARY KEY CLUSTERED ( RecordId )
    ) ;
GO

--2. Alter the table to add a CHECK constraint
--   Now, we have two constraints on the IDENTITY column:
--   a. A Primary Key constraint
--   b. A Check constraint
ALTER TABLE dbo.tIdentity
ADD CONSTRAINT chk_MaxRecordId CHECK ( RecordId <= 5 ) ;
GO

--2. Add some test data
--   Since we are adding 5 records, the statement will succeeed
INSERT  INTO dbo.tIdentity ( IdentityValue )
VALUES  ( 'One' ),
        ( 'Two' ),
        ( 'Three' ),
        ( 'Four' ),
        ( 'Five' ) ;
GO

--2b.One may also want to check the value of IDENT_CURRENT
PRINT 'CurrentIdentityValue (Before Constraint violation): '
    + CAST(IDENT_CURRENT('dbo.tIdentity') AS VARCHAR(10)) ;
GO

--3. Add some more test data
--   This will fail, because the identity value will now exceed the values 
--   allowed by the CHECK constraint
INSERT  INTO dbo.tIdentity ( IdentityValue )
VALUES  ( 'Six' ) ;
GO

--3b.One may also want to check the value of IDENT_CURRENT
PRINT 'CurrentIdentityValue (After Constraint violation): '
    + CAST(IDENT_CURRENT('dbo.tIdentity') AS VARCHAR(10)) ;
GO

--4. Cleanup
IF OBJECT_ID('tIdentity', 'U') IS NOT NULL 
    DROP TABLE dbo.tIdentity ;
GO

The resulting messages are shown below:


CurrentIdentityValue (Before Constraint violation): 5
Msg 547, Level 16, State 0, Line 5
The INSERT statement conflicted with the CHECK constraint “chk_MaxRecordId”. The conflict occurred in database “tempdb”, table “dbo.tIdentity”, column ‘RecordId’.
The statement has been terminated.

CurrentIdentityValue (After Constraint violation): 6


As can be seen from the resulting messages shown above, the CHECK constraint on the IDENTITY column was in effect resulting in the constraint violation error.


Observation!


Observe that the current value of the IDENTITY column (obtained via IDENT_CURRENT) has incremented even though the value was not inserted. This is because of the failed insert, which can leave holes in the IDENTITY sequence.


Conclusion


As we can see from the examples above, IDENTITY columns cannot have a DEFAULT constraint defined on them – other constraints can still be defined on an IDENTITY column.


Until we meet next time,


Be courteous. Drive responsibly.

#0273 – SQL Server – IDENTITY columns – Myths – IDENTITY columns cannot be added to existing tables – Part 02


As seen in the first part of this post, we can add IDENTITY columns to existing tables without any issue. IDENTITY columns cannot be NULL. Hence, when we add a new column as an IDENTITY column, every record existing in the table has to be assigned an IDENTITY value. So, the question arises that which record should be the first one to be assigned with the Identity seed, which record will be the second record and so on. In short:



How existing records are assigned the IDENTITY value?


We saw in the first part of the post that because a table was ordered via a clustered primary key, the IDENTITY specification was directly applied in that order. However, when the table is a heap (i.e. does not have a clustered index defined on it and is therefore unordered), the IDENTITY values are assigned based on the order in which records were inserted into the table.


The script and it’s results below prove this point.

USE tempdb;
GO

SET NOCOUNT ON;

--1. Prepare the environment
--   Create a table, and add some test data into it

--Safety Check
IF OBJECT_ID('tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

--Create a table, this time WITHOUT a Clustered Key
CREATE TABLE dbo.tIdentity (RecordId INT,
                            IdentityValue VARCHAR(20),
                            CONSTRAINT pk_tIdentityRecordId PRIMARY KEY NONCLUSTERED (RecordId)
                           );
GO

--2. Add some test data
INSERT INTO dbo.tIdentity (RecordId, IdentityValue)
VALUES (3, 'Three'),
       (5, 'Five'),
       (1, 'One'),
       (0, 'Zero'),
       (4, 'Four');
GO

--2b. As a test, inserting some interleaving data separately
INSERT INTO dbo.tIdentity (RecordId, IdentityValue)
VALUES (2, 'Two'),
       (6, 'Six');
GO

--3. Check the values inserted into the table
--   Observe the order of the records that come up in the results
SELECT RecordId, IdentityValue
FROM dbo.tIdentity;
GO

--4. Alter the table to add an IDENTITY Column
ALTER TABLE dbo.tIdentity
    ADD IdentityId INT IDENTITY(1,1);
GO    

--4. Check the values inserted into the table
--   Observe the order of the records that come up in the results
--   v/s the value of the IdentityId
SELECT RecordId, IdentityId, IdentityValue
FROM dbo.tIdentity;
GO

----4b.One may also want to check the value of IDENT_CURRENT
--SELECT IDENT_CURRENT('dbo.tIdentity') AS IdentityValueAfterDBCC;
--GO

/**********************************************
               RESULTS
**********************************************/
/*
RecordId    IdentityValue
----------- --------------
3           Three
5           Five
1           One
0           Zero
4           Four
2           Two
6           Six

RecordId    IdentityId  IdentityValue
----------- ----------- --------------
3           4           Three
5           6           Five
1           2           One
0           1           Zero
4           5           Four
2           3           Two
6           7           Six
*/

Now, let us define a CLUSTERED INDEX on this table such that the records are re-arranged. Once the clustered index is defined, we will insert a couple of records and then look at the assignment of the IDENTITY column. We can see that IDENTITY values continue to be assigned in the order in which records are inserted, however, the records are now ordered based on the clustered index specification.

USE tempdb;
GO

-- 5. Define a clustered index for the Primary Key
ALTER TABLE dbo.tIdentity
    DROP CONSTRAINT pk_tIdentityRecordId;
GO

ALTER TABLE dbo.tIdentity
    ADD CONSTRAINT pk_tIdentityRecordId PRIMARY KEY CLUSTERED (RecordId);
GO

--6. As a test, inserting some interleaving data separately
INSERT INTO dbo.tIdentity (RecordId, IdentityValue)
VALUES (-1, 'Minus One'),
       (7, 'Seven'),
       (8, 'Eight');
GO

--7. Check the values inserted into the table
--   Observe the order of the records that come up in the results
--   v/s the value of the IdentityId
SELECT RecordId, IdentityId, IdentityValue
FROM dbo.tIdentity;
GO

----7b.One may also want to check the value of IDENT_CURRENT
--SELECT IDENT_CURRENT('dbo.tIdentity') AS IdentityValueAfterDBCC;
--GO  

--4. Cleanup
IF OBJECT_ID('tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

/**********************************************
               RESULTS
**********************************************/
/*
RecordId    IdentityId  IdentityValue
----------- ----------- --------------
-1          8           Minus One
0           1           Zero
1           2           One
2           3           Two
3           4           Three
4           5           Four
5           6           Five
6           7           Six
7           9           Seven
8           10          Eight
*/

Conclusion (Part 02)


The behavior of IDENTITY columns and of SQL Server seen in this post is quite interesting – IDENTITY values are always generated in the order in which the records are inserted. When adding IDENTITY column to an existing table, this is true only if the table is not ordered, i.e. does not have a clustered index. Once the IDENTITY values have been assigned to all existing records in the table, all future values will be generated in the order of insertion.


Until we meet next time,


Be courteous. Drive responsibly.