Tag Archives: Series

Articles that are too big to fit into one part.

#0421 – SQL Server – SSIS – Wild card search to find if a file exists in a folder or directory


I recently answered a question on a forum which I believe will be useful to many readers.

SSIS packages are widely used to import data from and export data to files. A common requirement in such scenarios is to check whether files with a certain kind of name exist in a particular folder or directory, i.e. to perform a wild-card search in a directory.

The Solution

This can be achieved by using the EnumerateFiles() method of the System.IO.Directory class in the SSIS Script task. Here’s the sample package:

In a folder, I have a set of files, some with similar names (which we will search for from the SSIS package).

Files existing in the directory to be searched

The SSIS package has three (3) variables:

Variable Name    Configuration on the Script Task    Description
SearchPath       ReadOnly, Input                     The path to be searched
SearchPattern    ReadOnly, Input                     The pattern to be searched
FileExists       ReadWrite, Output                   A boolean indicating to downstream processes whether files were found

Table 1: Variables on the SSIS package
Screenshot showing the variables and their configuration on the script task

The script is quite a simple implementation, as shown below:

public void Main()
{
    // SQLTwins: SSIS: Blog #0421

    string searchPath = Dts.Variables["User::SearchPath"].Value.ToString();
    string searchPattern = Dts.Variables["User::SearchPattern"].Value.ToString();

    // EnumerateFiles() lazily enumerates the files matching the wild-card pattern
    System.Collections.Generic.List<string> searchResults = System.IO.Directory.EnumerateFiles(searchPath, searchPattern).ToList();

    // Assign the output variable explicitly so that it is false when no files match
    // (rather than retaining a stale value from a previous execution)
    Dts.Variables["User::FileExists"].Value = searchResults.Any();

    Dts.TaskResult = (int)ScriptResults.Success;
}

Ensure that you have the following directive in the “namespaces” section of your script:

using System.Linq;

Here’s the script in action:

Screenshot showing the script in debug mode indicating that files were found matching the pattern.

As you can see, the script can help perform a wild-card search in a given folder or directory.
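As a closing note, if there is a chance that the search path itself may not exist at run time, a slightly more defensive variant of the script (a sketch reusing the same Script Task scaffolding and the same variables) guards against an unhandled System.IO.DirectoryNotFoundException and stops enumerating at the first match:

public void Main()
{
    // SQLTwins: SSIS: Blog #0421 (defensive variant - a sketch, not the original script)

    string searchPath = Dts.Variables["User::SearchPath"].Value.ToString();
    string searchPattern = Dts.Variables["User::SearchPattern"].Value.ToString();

    // Directory.Exists() guards against a missing folder; Any() stops
    // enumerating at the first match instead of materializing a full list.
    bool fileExists = System.IO.Directory.Exists(searchPath)
                      && System.IO.Directory.EnumerateFiles(searchPath, searchPattern).Any();

    Dts.Variables["User::FileExists"].Value = fileExists;

    Dts.TaskResult = (int)ScriptResults.Success;
}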

Further Reading:

  • File System errors when trying to move and rename a file [Blog Link]
  • Adding Date & Time to file names after processing [Blog Link]
  • VSTA Errors when working with SSIS packages [Blog Link]
  • System.IO.Directory.EnumerateFiles [MSDN Link]

Until we meet next time,

Be courteous. Drive responsibly.


#0416 – SQL Server – Msg 8101 – Use column lists when working with IDENTITY columns


I have often written about IDENTITY columns on my blog. Identity columns, most commonly used to implement auto-incrementing keys, have been around for more than a decade now. Yet, I often see teams run into interesting issues with them, especially when data is being migrated from one system to another.

Today’s post is based on one such incident that came to my attention.

The team was trying to migrate data from one table to another as part of an exercise to restructure the database for better efficiency. When moving the data, they were using SET IDENTITY_INSERT ON in order to explicitly insert values into the identity column. However, they were running into an error.

Msg 8101, Level 16, State 1, Line 24
An explicit value for the identity column in table 'dbo.tIdentity' can only be specified when a column list is used and IDENTITY_INSERT is ON.

Here is a simulation of what they were doing:

USE tempdb;
GO
SET NOCOUNT ON;
--Prepare the environment
--   Create a table, and add some test data into it
--Safety Check
IF OBJECT_ID('tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO
--Create a table
CREATE TABLE dbo.tIdentity (IdentityId INT IDENTITY(1,1),
                            IdentityValue VARCHAR(10)
                           );
GO
--Turn IDENTITY_INSERT ON to insert explicit IDENTITY values
SET IDENTITY_INSERT dbo.tIdentity ON;
GO
--NOTICE: No column list has been supplied in the INSERT
INSERT INTO dbo.tIdentity
VALUES (1, 'One'),
       (2, 'Two');
GO

--RESULTS
--Msg 8101, Level 16, State 1, Line 24
--An explicit value for the identity column in table 'dbo.tIdentity' can only be specified when a column list is used and IDENTITY_INSERT is ON.

The Solution

Let’s re-read the error. It clearly indicates what the issue is – if we need to insert explicit values into identity columns, we need to explicitly use column lists in our INSERT statements, as shown below.

USE tempdb;
GO
SET NOCOUNT ON;
--Prepare the environment
--Create a table, and add some test data into it
--Safety Check
IF OBJECT_ID('tIdentity','U') IS NOT NULL
    DROP TABLE dbo.tIdentity;
GO

--Create a table
CREATE TABLE dbo.tIdentity (IdentityId INT IDENTITY(1,1),
                            IdentityValue VARCHAR(10)
                           );
GO

--Turn IDENTITY_INSERT ON to insert explicit IDENTITY values
SET IDENTITY_INSERT dbo.tIdentity ON;
GO

--NOTE: Column list has been supplied in the INSERT,
--      so, no errors will be encountered    
INSERT INTO dbo.tIdentity ([IdentityId], [IdentityValue])
VALUES (1, 'One'),
       (2, 'Two');
GO

--Confirm that data has been inserted
SELECT IdentityId,
       IdentityValue
FROM dbo.tIdentity;
GO

--Now that data has been inserted, turn OFF IDENTITY_INSERT
SET IDENTITY_INSERT dbo.tIdentity OFF;
GO

-----------------------------------------------------------------
--RESULTS
----------
--IdentityId  IdentityValue
--1           One
--2           Two
-----------------------------------------------------------------

Hope you will find this helpful.

Until we meet next time,

Be courteous. Drive responsibly.

#0400 – SQL Server – SSIS – Using the SQL Server Destination


SSIS packages are quite easy to get started with – it’s mostly dragging and dropping various containers and tasks, and setting up connections. Ensuring that the components work optimally requires using the right mix of tasks based on the scenario at hand.

SSIS packages often connect to remote data sources and destinations. However, there are cases where the destination is a Microsoft SQL Server instance, the package is required to run on the same server where the instance is hosted, and granular grouping is not needed. Such situations may include data import into a staging area during migrations or as part of an ETL process.

In such situations, the SQL Server destination may prove to be a better option as compared to the OLE DB destination.

Generally, we would have a data pipeline with an OLE DB destination at the receiving end. The setup for using the SQL Server destination is extremely simple – the only change is replacing the OLE DB destination with the SQL Server destination. The SQL Server destination performs bulk inserts into the destination SQL Server, leveraging a shared-memory connection to SQL Server over the existing OLE DB connection manager.

The screenshots below indicate the simplicity of using the SQL Server destination.


Adding the SQL Server destination to a data flow


Selecting a connection manager


The “Advanced” tab of the SQL Server destination

The Advanced tab (see above) has a host of options to improve the performance and control the behaviour of the bulk inserts made by the SQL Server destination; a short code sketch relating these options to their ADO.NET counterparts follows the list.

  • Keep Identity – controls whether explicit values are inserted into an identity column
  • Keep Nulls – controls whether NULLs should be inserted instead of using the default values defined on the column
  • Table Lock – allows taking a table-level lock during the bulk insert
  • Check Constraints – controls whether constraints are checked during the insert or not
  • Fire Triggers – controls whether or not to fire DML triggers defined on the table
  • First Row – specifies the first row to insert. By default, all rows are inserted
  • Last Row – specifies the last row to insert. By default, all rows are inserted
  • Maximum number of errors – controls the number of errors that can occur before the bulk insert operation stops
  • Timeout – controls the bulk insert operation timeout
  • Order Columns – allows the user to specify the sort order on one or more columns
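For readers who know the ADO.NET bulk copy API, most of these switches map almost one-to-one to the SqlBulkCopyOptions flags used with the SqlBulkCopy class. Below is a minimal, self-contained sketch of that mapping (the connection string, table name and columns are illustrative assumptions, not part of any package shown here):

using System;
using System.Data;
using System.Data.SqlClient;

class BulkLoadSketch
{
    static void Main()
    {
        // Illustrative connection string: a local instance, mirroring the
        // SQL Server destination's requirement of running on the target server.
        const string connectionString = @"Data Source=.;Initial Catalog=StagingDB;Integrated Security=SSPI;";

        // Sample in-memory data to load into a hypothetical staging table.
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Value", typeof(string));
        table.Rows.Add(1, "One");
        table.Rows.Add(2, "Two");

        // These flags correspond to Keep Identity, Keep Nulls, Table Lock,
        // Check Constraints and Fire Triggers on the Advanced tab.
        const SqlBulkCopyOptions options = SqlBulkCopyOptions.KeepIdentity
                                           | SqlBulkCopyOptions.KeepNulls
                                           | SqlBulkCopyOptions.TableLock
                                           | SqlBulkCopyOptions.CheckConstraints
                                           | SqlBulkCopyOptions.FireTriggers;

        using (var bulkCopy = new SqlBulkCopy(connectionString, options))
        {
            bulkCopy.DestinationTableName = "dbo.StagingTable"; // hypothetical target
            bulkCopy.BulkCopyTimeout = 30;                      // analogous to the Timeout option
            bulkCopy.WriteToServer(table);
        }

        Console.WriteLine("Bulk load completed.");
    }
}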

Summary

The SQL Server Destination is recommended instead of the OLE DB destination if the SSIS package is to be executed on the same machine/server where the target Microsoft SQL Server instance is located. Below are the finer points about the SQL Server destination:

  1. The SSIS package must be executed on the same server where the Microsoft SQL Server instance is located
  2. The Shared Memory protocol must be enabled for the instance (via the SQL Server Configuration Manager)
    • Warning: This may need local security policy updates if User Account Control (UAC) is configured
  3. The SQL Server destination:
    • Only works with OLE DB connection managers (ODBC is not supported)
    • Supports only one input
    • Does not support an error output
    • Performs a bulk insert of the data
    • Allows leveraging the fast load options of the OLE DB connection


Until we meet next time,

Be courteous. Drive responsibly.

#0399 – SQL Server – SSIS – Debugging – Exploring the Data Viewer


I often get questions regarding debugging of SSIS packages with the most common scenario being a need to “see” the data flowing down the data pipeline. SQL Server Integration Services (SSIS) packages are explored and edited visually within the SQL Server Data Tools (SSDT) and hence, the mechanism to troubleshoot an SSIS package when developing it also has to be visual.

The Data Viewer allows a developer to pause the data flow in a data flow task and look at the data as it is at that point in the pipeline. Today, I will introduce you to the Data Viewer.


Sample SSIS package to demonstrate Data Viewer on the Data Flow Pipeline

The data flow is quite simply taking all the records from the [HumanResources].[Employee] table of the sample [AdventureWorks2014] database and putting them into an object variable via the Recordset destination (I am using the Recordset destination for the demo here for the sake of simplicity).
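As an aside, if a downstream task later needs to consume what the Recordset destination captured, a Script Task can hydrate the object variable back into a DataTable. Here is a minimal sketch of that pattern (the variable name User::EmployeeRecordset is an assumption made for this demo):

public void Main()
{
    // Hydrate the ADO recordset stored by the Recordset destination into a
    // DataTable (User::EmployeeRecordset is an assumed variable name).
    var employees = new System.Data.DataTable();

    using (var adapter = new System.Data.OleDb.OleDbDataAdapter())
    {
        // OleDbDataAdapter.Fill has an overload that accepts an ADO Recordset object.
        adapter.Fill(employees, Dts.Variables["User::EmployeeRecordset"].Value);
    }

    bool fireAgain = true;
    Dts.Events.FireInformation(0, "RecordsetReader",
        "Rows captured: " + employees.Rows.Count, string.Empty, 0, ref fireAgain);

    Dts.TaskResult = (int)ScriptResults.Success;
}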

Using the Data Viewer

If I want to study the data flowing through the data pipeline, all I need to do is right-click on the data flow path and choose “Enable Data Viewer” (a data flow path being the connection between a source and a transformation, between two transformations or a transformation and a destination).


Enabling the Data Viewer on a Data Flow Pipeline


Magnifying glass indicates Data Viewer is enabled

As can be seen from the screen grab above, a magnifying glass icon now appears on the data flow path, indicating that the data flow is configured for viewing.

If I execute the SSIS package at this point, I see that the data flow pauses just before it starts writing to the destination and opens a new grid window. The window resembles a normal dataset viewer (if you are familiar with developing C# or ASP.NET applications in Visual Studio, you would feel right at home!) which contains the data flowing through the data pipeline. You can even copy this grid to Excel or any other file for further research (Tip!).


Data Viewer showing data flowing through the data path

Once you have studied the data flowing through, you can choose to either stop the transaction (by stopping execution of the package) or allow the package to execute through by clicking the “green” arrow on the data viewer.


Allowing the data flow to continue down the pipeline by allowing the package to resume execution

Filtering columns displayed on the Data Viewer

Sometimes, we may not want to sift through all the columns in the data pipeline. If we know that a particular column is causing some problems, we may just want to monitor that column. One can explicitly choose which columns should be displayed on the data viewer by going into the data flow properties.

  1. Right-click on the data flow path and choose “Edit”
  2. Go to “Data Viewer”
  3. Use the arrow buttons to selectively choose which columns to display in the data viewer


Selectively choosing columns visible on the Data Viewer

Summary

The Data Viewer can be used for design time troubleshooting of an SSIS package, allowing developers to pause the data flow and monitor the data flowing down the data pipeline.

  • You can have multiple data viewers in your data flow task so that you can monitor each part of the data flow
  • You can control the columns seen in the data viewer so that you can focus on the fields of interest
  • You can copy this data for further research/maintaining a record
  • If your data flow uses transformation components that are only available in 32-bit versions, you may need to turn off “Run64BitRuntime” under the Debug options (Debug -> Solution Properties -> Configuration Properties -> Debugging)


Until we meet next time,

Be courteous. Drive responsibly.


#0398 – SQL Server – SSIS – Performing a DISTINCT using the SORT transformation


Developing SSIS packages is quite easy – it’s mostly drag and drop and some minor configuration. Choosing the right transformations and components and making them work optimally is a task that is heavily dependent upon the use case at hand.

Normally, when working with relational data sources, I am averse to using the SORT transformation within a data flow because it is a blocking transformation. Blocking transformations not only consume memory, they also act as roadblocks in what should be an uninterrupted data pipeline. The performance tip (!), therefore, is to do any required sorting as part of the data source query itself, so that the data pipeline receives presorted data.

However, presorting the data may not be possible in all cases, for example when you are receiving data from a web-service, an XML data source or any other third-party custom component. A SORT transformation becomes essential in such scenarios, but one of the areas where I have seen most people stumble is that a custom query is later used to remove unnecessary records from the set. The custom query is not required in most cases, and I will be demonstrating how to configure the SORT transformation to also perform a DISTINCT.

Demo

Assume that:

  • You just need a list of unique patrons who visited a local library this year in order of their registration Id for sending out some promotional mails
  • The promotions need to be sent irrespective of the visit date and number of visits within the year
  • The input (log of library visits) is coming from a web-service of some sort over which you, as an integration team member, have no control
  • Following is a sample of the test data that comes over from the web-service

NOTE: For the purposes of this demo, I will use a T-SQL SELECT to generate my test data, but I would request the reader to assume that the data is coming from a non-relational source.

SELECT LibraryVisitLog.PatronId,
       LibraryVisitLog.PatronName,
       LibraryVisitLog.VisitDate
FROM (VALUES (1201,'John Doe',  '2017-01-05'),
             (1180,'John Smith','2017-01-05'),
             (1201,'John Doe',  '2017-01-06'),
             (1201,'John Doe',  '2017-01-07'),
             (1201,'John Doe',  '2017-01-08'),
             (1201,'John Doe',  '2017-01-09'),
             (1201,'John Doe',  '2017-01-10'),
             (1280,'Jane Doe',  '2017-01-10'),
             (1201,'John Doe',  '2017-01-11'),
             (1201,'John Doe',  '2017-01-12'),
             (1210,'Jack Smith','2017-01-12'),
             (1201,'John Doe',  '2017-01-13'),
             (1201,'John Doe',  '2017-01-14'),
             (1180,'John Smith','2017-01-14')
     ) AS LibraryVisitLog (PatronId, PatronName, VisitDate)

Sample Test Data to be used for the demo


We only want information about the following four (4) unique patrons, ordered by their Id, to go down the pipeline.

Patron Id   Patron Name
1180        John Smith
1201        John Doe
1210        Jack Smith
1280        Jane Doe

Configuring the SSIS package to achieve this is quite simple. As shown in the image below, I have a SORT transformation between my data source and target.

Overview of the sample SSIS package using the SORT operator


The SORT transformation is configured to:

  • Ignore the VisitDate (as established in the requirements above)
  • SORT based on the PatronId
  • Pass the PatronName through the pipeline


Configuring the SORT operator to perform a DISTINCT on the output rows

In addition, notice that I have checked the “Remove rows with duplicate sort values” checkbox in the bottom left corner. This switch controls whether the transformation copies duplicate rows to the transformation output or creates a single, unique entry for all duplicates.
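To make the semantics of this checkbox concrete, here is a small LINQ sketch (my own illustration, not how SSIS implements the transformation internally) that mirrors the configuration above: rows are keyed on PatronId, a single row is kept per key, VisitDate is ignored, and the output is sorted:

using System;
using System.Linq;

class SortDistinctSketch
{
    static void Main()
    {
        // A few rows from the sample visit log
        var visits = new[]
        {
            new { PatronId = 1201, PatronName = "John Doe",   VisitDate = "2017-01-05" },
            new { PatronId = 1180, PatronName = "John Smith", VisitDate = "2017-01-05" },
            new { PatronId = 1201, PatronName = "John Doe",   VisitDate = "2017-01-06" },
            new { PatronId = 1280, PatronName = "Jane Doe",   VisitDate = "2017-01-10" },
            new { PatronId = 1210, PatronName = "Jack Smith", VisitDate = "2017-01-12" }
        };

        var uniquePatrons = visits
            .GroupBy(v => v.PatronId)   // duplicate sort-key values fall into one group
            .Select(g => g.First())     // keep a single, unique entry per key
            .OrderBy(p => p.PatronId);  // sorted output; VisitDate plays no part

        foreach (var p in uniquePatrons)
        {
            Console.WriteLine("{0} {1}", p.PatronId, p.PatronName);
        }
    }
}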

To actually see the data flowing through the pipeline, I have enabled Data Viewers at each stage of the data flow. You can see the actual data in the screenshots provided below:


Demonstrating the ability of a SORT transformation to perform a DISTINCT on the data


Data Viewer showing input data


Data Viewer showing sorted and Distinct data

Summary

With respect to the SORT transformation, and striking a balance between performance and the limitations of the data sources used by an SSIS package:

  • It is a best practice to pre-sort the data at the data source before sending it across a data flow pipeline
  • In case pre-sorting is not possible, but sorting is required, one can use the SORT transformation
  • Leverage the SORT transformation to filter the sorted output down to unique records – custom logic is not required to do so


Until we meet next time,

Be courteous. Drive responsibly.