Tag Archives: Performance Tuning

Experiences, ideas and tips around Microsoft SQL Server performance tuning.

#0415 – SQL Server – Performance Tuning – Use STRING_AGG to generate comma separated strings


With more and more data being exchanged over APIs, generating comma-separated strings is becoming a much more common requirement.

A few years ago, I wrote about two different ways to generate comma-separated strings. The approach I find most commonly in use when generating comma-separated values from a table is the intermediate conversion to XML. This, however, is a very costly mechanism and, depending upon the amount of data involved, the query can potentially take minutes to run.
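
For reference, the XML-based approach typically looks something like the sketch below. This is only the general pattern (using a small table variable purely for illustration), not the exact code from that earlier post:

--The classic FOR XML PATH + STUFF pattern, shown here purely for comparison
DECLARE @NamesTable TABLE ([Id] INT, [Name] NVARCHAR(50));
INSERT INTO @NamesTable
VALUES (1, 'A'), (2, 'D'), (2, 'C'), (3, 'E'), (3, 'H'), (3, 'G');

SELECT STUFF((SELECT ',' + tbl.[Name]
              FROM @NamesTable AS tbl
              FOR XML PATH('')
             ), 1, 1, '') AS [CommaSeparatedString];
--Returns: A,D,C,E,H,G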

SQL Server 2017 brings a new aggregate function that can be used to generate comma-separated values extremely fast. The function is STRING_AGG().

Here’s a sample of its usage:


 --WARNING: THIS SCRIPT IS PROVIDED AS-IS AND WITHOUT
-- WARRANTY.
-- FOR DEMONSTRATION PURPOSES ONLY
--Step 01: Generate Temp table to store source data
DECLARE @NamesTable TABLE ([Id] INT,
[Name] NVARCHAR(50)
);
--Step 02: Generate test data
INSERT INTO @NamesTable
VALUES (1, 'A'),
(2, 'D'),
(2, 'C'),
(3, 'E'),
(3, 'H'),
(3, 'G');
--Step 03: Using STRING_AGG to generate comma-separated strings
SELECT STRING_AGG(tbl.Name, ',') AS [CommaSeparatedString]
FROM @NamesTable AS tbl;
GO
/* RESULTS
CommaSeparatedString
A,D,C,E,H,G
*/

Advantages of STRING_AGG:

  • Can be used just like any other aggregate function in a query (see the GROUP BY sketch after this list)
  • Works with any user-supplied separator – it doesn’t necessarily have to be a comma
  • No manual clean-up step required – a trailing separator is not added at the end of the concatenated string
  • STRING_AGG() is significantly faster than the XML-based methods
  • Works under any database compatibility level, as long as the version is SQL Server 2017 (or higher) or Azure SQL Database
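
For instance, since STRING_AGG() behaves like any other aggregate, it can be combined with GROUP BY. Here is a minimal sketch reusing the same sample data as above (the grouped output shown in the comment is based on that data; ordering within a group is not guaranteed):

--A minimal sketch: one comma-separated string per [Id], reusing the sample data above
DECLARE @NamesTable TABLE ([Id] INT, [Name] NVARCHAR(50));
INSERT INTO @NamesTable
VALUES (1, 'A'), (2, 'D'), (2, 'C'), (3, 'E'), (3, 'H'), (3, 'G');

SELECT tbl.[Id],
       STRING_AGG(tbl.[Name], ',') AS [CommaSeparatedString]
FROM @NamesTable AS tbl
GROUP BY tbl.[Id];
--Expected: Id 1 -> A; Id 2 -> D,C; Id 3 -> E,H,G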

Here’s an example of how STRING_AGG can be used with any separator:

 --WARNING: THIS SCRIPT IS PROVIDED AS-IS AND WITHOUT
-- WARRANTY.
-- FOR DEMONSTRATION PURPOSES ONLY
--Step 01: Generate Temp table to store source data
DECLARE @NamesTable TABLE ([Id] INT,
[Name] NVARCHAR(50)
);
--Step 02: Generate test data
INSERT INTO @NamesTable
VALUES (1, 'A'),
(2, 'D'),
(2, 'C'),
(3, 'E'),
(3, 'H'),
(3, 'G');
--Step 03: Using STRING_AGG to generate comma-separated strings
SELECT STRING_AGG(tbl.Name, '-*-') AS [CustomSeparatorString]
FROM @NamesTable AS tbl;
GO
/* RESULTS
CustomSeparatorString
A-*-D-*-C-*-E-*-H-*-G
*/

A minor challenge

As with every new feature, there is a small usability challenge with STRING_AGG: one cannot use keywords like DISTINCT to ensure that only distinct values are used when generating the comma-separated string. There is, however, an Azure feedback item open where you can cast your vote if you feel this feature would be useful.
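
Until such support arrives, one possible workaround (a minimal sketch of my own, not from the original post) is to de-duplicate the values in a derived table before applying STRING_AGG():

--A minimal sketch: remove duplicates in a derived table before aggregating
DECLARE @NamesTable TABLE ([Id] INT, [Name] NVARCHAR(50));
INSERT INTO @NamesTable
VALUES (1, 'A'), (2, 'A'), (2, 'C'), (3, 'C');

SELECT STRING_AGG(dedup.[Name], ',') AS [DistinctCommaSeparatedString]
FROM (SELECT DISTINCT tbl.[Name]
      FROM @NamesTable AS tbl
     ) AS dedup;
--Returns: A,C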

Further Reading

  • Different ways to generate a comma-separated string from a table [Blog Link]
  • STRING_AGG() Aggregate Function [MSDN BOL]

Until we meet next time,

Be courteous. Drive responsibly.


#0400 – SQL Server – SSIS – Using the SQL Server Destination


SSIS packages are quite easy to get started with – it’s mostly dragging and dropping various containers and tasks, and setting up connections. Ensuring that the components work optimally, however, requires using the right mix of tasks for the scenario at hand.

Often, SSIS packages connect to remote data sources and destinations. However, there are cases where the destination is a Microsoft SQL Server instance, the package can be run on the same server that hosts the instance, and granular error handling is not required. Such situations may include data imports into a staging area during migrations or as part of an ETL.

In such situations, the SQL Server destination may prove to be a better option as compared to the OLE DB destination.

Generally, we would have a data pipeline with an OLE DB destination on the receiving end. The setup for using the SQL Server destination is extremely simple – the only change is replacing the OLE DB destination with the SQL Server destination. The SQL Server destination performs bulk inserts into the destination SQL Server instance while leveraging a shared memory connection to SQL Server over the existing OLE DB connection manager.

The screenshots below indicate the simplicity of using the SQL Server destination.


Adding the SQL Server destination to a data flow


Selecting a connection manager


The “Advanced” tab of the SQL Server destination

The Advanced tab (see above) has a host of options to improve the performance and control the behaviour of the bulk inserts made by the SQL Server destination (see the T-SQL sketch after the list below).

  • Keep Identity – controls whether to insert values into an identity column
  • Keep Nulls – controls whether NULLs should be inserted instead of using the default values defined on the column
  • Table Lock – allows taking a higher-level table lock during the bulk insert
  • Check Constraints – controls whether constraints should be checked during the insert or not
  • Fire Triggers – controls whether or not to fire DML triggers defined on the table
  • First Row – specifies the first row to insert. By default all rows are inserted
  • Last Row – specifies the last row to insert. By default all rows are inserted
  • Maximum number of errors – controls the number of errors before the bulk insert operation stops
  • Timeout – controls the bulk insert operation timeout
  • Order Columns – Allows a user to specify the sort order on one or more columns
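
For readers who think in T-SQL terms, these options correspond closely to the arguments of the BULK INSERT statement. The sketch below is purely illustrative – the table name, file path, column name and option values are all assumptions, not part of the package shown above:

--Illustrative only: hypothetical table name, file path and option values
BULK INSERT [dbo].[TargetTable]
FROM 'C:\Import\TargetTable.dat'       --assumed source file
WITH (KEEPIDENTITY,                    --Keep Identity
      KEEPNULLS,                       --Keep Nulls
      TABLOCK,                         --Table Lock
      CHECK_CONSTRAINTS,               --Check Constraints
      FIRE_TRIGGERS,                   --Fire Triggers
      FIRSTROW = 1,                    --First Row
      LASTROW = 0,                     --Last Row (0 = through the end of the file)
      MAXERRORS = 10,                  --Maximum number of errors
      ORDER ([Id] ASC)                 --Order Columns
     );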

Summary

The SQL Server Destination is recommended instead of the OLE DB destination if the SSIS package is to be executed on the same machine/server where the target Microsoft SQL Server instance is located. Below are the finer points about the SQL Server destination:

  1. The SSIS package must be executed on the same server where the Microsoft SQL Server instance is located
  2. The Shared Memory protocol for data exchange must be enabled for the instance (from the SQL Server Configuration Manager); a quick way to verify the transport in use is sketched after this list
    • Warning: This may need local security policy updates if User Access Control (UAC) is configured
  3. SQL Server destination
    • Only works with OLE DB connection managers (ODBC is not supported)
    • Supports only one input
    • Does not support an error output
    • Performs bulk insert of data
    • Allows leveraging of fast load options of the OLE DB connection
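
As a quick, hedged check (not from the original post), the transport actually used by a connection can be confirmed from the instance itself; when connecting locally with Shared Memory enabled, net_transport reports “Shared memory”:

--Check the network transport used by the current connection
SELECT c.session_id,
       c.net_transport
FROM sys.dm_exec_connections AS c
WHERE c.session_id = @@SPID;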

Further Reading

Until we meet next time,

Be courteous.  Drive responsibly.

 


#0393 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Rows per batch and Max. Insert Commit Size options


Developing SSIS packages is quite easy – it’s mostly drag and drop and some minor configuration. However, when it comes to tuning the package, one needs to understand the finer points of each task on the control flow. On each task, there are some options that help improve the performance of the data flow, whereas others help regulate the quality of the data being migrated.

The OLE DB Destination

I have been writing about the OLE DB destination in the last couple of posts. In order to load data as quickly into the destination as possible, the  OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

  1. Keep Identity
  2. Keep NULLs
  3. Table Lock
  4. Check Constraints
  5. Rows per Batch
  6. Maximum Insert Commit Size

When we use the fast load options of the OLE DB destination, we are essentially using the BULK INSERT T-SQL command. This is the reason we get almost all the options of BULK INSERT on the OLE DB destination. Today, I will take a look at the last two options, which are the secret behind significantly improving data load performance on a system with a slow I/O subsystem: “Rows per Batch” and “Maximum Insert Commit Size”.
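
In BULK INSERT terms, these two settings roughly correspond to the ROWS_PER_BATCH and BATCHSIZE arguments. The snippet below is a hedged illustration of that correspondence only – the file path and the values are assumptions:

--Illustrative sketch: hypothetical file, showing the two related arguments
BULK INSERT [dbo].[SalesOrderDetail]
FROM 'C:\Import\SalesOrderDetail.dat'   --assumed source file
WITH (TABLOCK,
      ROWS_PER_BATCH = 50000,   --comparable to "Rows per batch" (a cardinality hint)
      BATCHSIZE = 50000         --rows committed per transaction, comparable to
                                --"Maximum insert commit size"
     );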

My test instance is on my primary development environment, and hence even this reasonably large load completes in a couple of minutes. However, on a slow I/O sub-system, the impact of these options will be much higher.

Creating the package with logging for comparing execution time

I have created a simple package that creates a table and inserts data into it. The table is identical to the [Sales].[SalesOrderDetail] table in the [AdventureWorks2014] sample database. The table creation script used in the Execute SQL task on the package is provided below:

USE [tempdb];
GO

IF OBJECT_ID('[dbo].[SalesOrderDetail]','U') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[SalesOrderDetail];
END
GO

CREATE TABLE [dbo].[SalesOrderDetail]
   (
    [SalesOrderID]           [INT]              NOT NULL,
    [SalesOrderDetailID]     [INT]              NOT NULL,
    [CarrierTrackingNumber]  [NVARCHAR](25)         NULL,
    [OrderQty]               [SMALLINT]         NOT NULL,
    [ProductID]              [INT]              NOT NULL,
    [SpecialOfferID]         [INT]              NOT NULL,
    [UnitPrice]              [MONEY]            NOT NULL,
    [UnitPriceDiscount]      [MONEY]            NOT NULL 
                             CONSTRAINT [DF_sodUnitPriceDiscount]  DEFAULT ((0.0)),
    [LineTotal]              DECIMAL(38, 6),
    [rowguid]                [UNIQUEIDENTIFIER] NOT NULL,
    [ModifiedDate]           [DATETIME]         NOT NULL 
                             CONSTRAINT [DF_sodModifiedDate]  DEFAULT (GETDATE())
   ) ON [PRIMARY]
GO

Once the table is created, the package “flows” to the Data Flow Task. Inside the data flow, I simply select about 35,181,930 records (by using a CROSS JOIN between the [AdventureWorks2014].[Sales].[SalesOrderDetail] and [AdventureWorks2014].[HumanResources].[Employee] tables) with an OLE DB source and pump them into the newly created target table via an OLE DB destination with the “Table Lock” option checked (default).

I then configure logging on the package to log the package activity for the Data Flow Task for the OnError, OnPreExecute and OnPostExecute events (Configuring package logging is out of scope for this blog post).

The script used in the OLEDB source is presented here:

SELECT sod.SalesOrderID,
       sod.SalesOrderDetailID,
       sod.CarrierTrackingNumber,
       sod.OrderQty,
       sod.ProductID,
       sod.SpecialOfferID,
       sod.UnitPrice,
       sod.UnitPriceDiscount,
       sod.LineTotal,
       sod.rowguid,
       sod.ModifiedDate
FROM Sales.SalesOrderDetail AS sod
CROSS JOIN HumanResources.Employee AS hre;
GO

Screenshots showing the package configuration described above are shown below.


OLEDB Destination – Package Overview to test Rows/Batch and Max. Insert Commit Size

Please note that between each run, the data buffers were cleaned and procedure cache was cleared out to get a “cold” state performance of the database engine.

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
GO

Once the package is executed, I will compare the difference between the OnPreExecute and OnPostExecute times for various configurations of the “Rows per batch” and “Max. insert commit size” options to get an idea of the performance difference between them.

NOTE: The packages were executed after it was confirmed that the databases involved (in this case tempdb) had grown sufficiently to accommodate the inserted data.
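
If you wish to reproduce the test, pre-growing the data file ahead of the run keeps auto-growth events out of the timings. The sketch below is hypothetical – the logical file name (‘tempdev’ on a default installation) and the size are assumptions; adjust them to your instance:

--Hypothetical pre-growth of the tempdb data file; adjust the logical name and size
USE [master];
GO
ALTER DATABASE [tempdb]
MODIFY FILE (NAME = N'tempdev', SIZE = 10GB);
GO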

Comparing package execution performance

Because I had turned on logging on the SSIS package, I ran the following query against the [dbo].[sysssislog] table, which gives me the time difference (in seconds) between the “OnPreExecute” and “OnPostExecute” events for the various configurations. The query and the results are available below:

USE [tempdb];
GO
SELECT [PivotedData].[source],
       [PivotedData].[OnPreExecute],
       [PivotedData].[OnPostExecute],
       DATEDIFF(SECOND,[PivotedData].[OnPreExecute],[PivotedData].[OnPostExecute]) AS [ExecutionTime]
FROM (SELECT [sl].[source],
             [sl].[event],
             [sl].[starttime]
      FROM [dbo].[sysssislog] AS [sl] WITH (NOLOCK)
      WHERE ([sl].[event] = 'OnPreExecute' 
             OR 
             [sl].[event] = 'OnPostExecute'
            )
        AND [sl].[source] LIKE 'OLEDB%'
     ) AS SourceData
PIVOT (MAX([SourceData].[starttime])
       FOR [SourceData].[event] IN ([OnPreExecute], [OnPostExecute])
      ) AS PivotedData;
GO


OLEDB Destination – Performance impact of adjusting Rows/Batch and Max. Insert Commit Size

As can be seen from the screenshot of the results, the tasks with controlled batch sizes and rows/batch had better performance even on my development environment. I have seen a considerable improvement on systems with poor I/O performance.

Conclusion

The fast load options of the OLE DB destination are, therefore, a very powerful way of tuning data inserts into a destination SQL Server database.

There is no magic bullet to ensuring SSIS performance, but a thorough evaluation of and appropriate adjustments to the OLEDB destination options based on business rules, overall system configuration and the nature of the workload is sure to get the optimal performance from your packages.

Further Reading

Until we meet next time,

Be courteous. Drive responsibly.


#0391 – SQL Server – SSIS – OLE DB Destination – Table Fast Load – Performance – Table Lock option


Developing SSIS packages is quite easy – it’s mostly drag and drop and some minor configuration, which is a real boon for someone who is new to SSIS. However, when it comes to tuning the package, one needs to understand the finer points of each task on the control flow.

The OLE DB Destination

In one of my previous posts, I started to explore the OLE DB destination. In order to load data as quickly into the destination as possible, the  OLE DB destination allows us to use a “Fast Load” mode. The “Fast Load” option allows the data team to configure various options that affect the speed of the data load:

  1. Keep Identity
  2. Keep NULLs
  3. Table Lock
  4. Check Constraints
  5. Rows per Batch
  6. Maximum Insert Commit Size

I looked at the “Keep NULLs” and the “Keep Identity” options earlier, and today I will go over the “Table Lock” option.

As part of the OLE DB destination task, the “Table Lock” option indicates whether the database engine should request a wider lock on the entire table (i.e. use TABLOCK), rather than acquiring granular locks on each row/page and then relying on lock escalation to lock the table.

Theoretically, when moving extremely large amounts of data relative to the number of records already in the destination table, the database engine would deem the granular locks (at the row/page level) too cost-prohibitive and escalate to a table lock anyway. In such a case, it is better to specify the wider TABLOCK up front.
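
In plain T-SQL terms, checking the box is comparable to adding a TABLOCK hint on the insert. The statement below is only a hedged illustration of that idea (with placeholder table names), not the mechanism the package actually uses:

--Illustrative only: requesting a table-level lock up front for a large insert
INSERT INTO [dbo].[TargetTable] WITH (TABLOCK)
SELECT src.*
FROM [dbo].[SourceTable] AS src;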

Allow me to present a brief demo.

Creating the package with logging for comparing execution time

As always, I have created a simple package that creates a table and inserts data into it. The table is identical to the [Sales].[SalesOrderDetail] table in the [AdventureWorks2014] sample database. The table creation script used in the Execute SQL task on the package is provided below:

USE [tempdb];
GO

IF OBJECT_ID('[dbo].[SalesOrderDetail]','U') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[SalesOrderDetail];
END
GO

CREATE TABLE [dbo].[SalesOrderDetail]
   (
    [SalesOrderID]           [INT]              NOT NULL,
    [SalesOrderDetailID]     [INT]              NOT NULL,
    [CarrierTrackingNumber]  [NVARCHAR](25)         NULL,
    [OrderQty]               [SMALLINT]         NOT NULL,
    [ProductID]              [INT]              NOT NULL,
    [SpecialOfferID]         [INT]              NOT NULL,
    [UnitPrice]              [MONEY]            NOT NULL,
    [UnitPriceDiscount]      [MONEY]            NOT NULL 
                             CONSTRAINT [DF_sodUnitPriceDiscount]  DEFAULT ((0.0)),
    [LineTotal]              DECIMAL(38, 6),
    [rowguid]                [UNIQUEIDENTIFIER] NOT NULL,
    [ModifiedDate]           [DATETIME]         NOT NULL 
                             CONSTRAINT [DF_sodModifiedDate]  DEFAULT (GETDATE())
   ) ON [PRIMARY]
GO

Once the table is created, the package “flows” to the Data Flow Task. Inside the data flow, I simply select about 4 times the data from the [AdventureWorks2014].[Sales].[SalesOrderDetail] table (approximately 485,268 rows) using an OLE DB source and pump it to the newly created target table via an OLE DB destination with the “Table Lock” option checked (default).

I then configure logging on the package to log the package activity for the Data Flow Task for the OnError, OnPreExecute and OnPostExecute events (Configuring package logging is out of scope for this blog post).

The script used in the OLEDB source is presented here:

USE AdventureWorks2014;
GO
SELECT sod.SalesOrderID,
       sod.SalesOrderDetailID,
       sod.CarrierTrackingNumber,
       sod.OrderQty,
       sod.ProductID,
       sod.SpecialOfferID,
       sod.UnitPrice,
       sod.UnitPriceDiscount,
       sod.LineTotal,
       sod.rowguid,
       sod.ModifiedDate
FROM Sales.SalesOrderDetail AS sod
CROSS JOIN (VALUES (1), (2), (3), (4)) AS Combinations (ComboValue)
GO

A set of screenshots showing the package configuration described above is presented below.


An “Execute SQL Task” creates the test table


The OLE DB destination in “Fast Load” configuration with “Table Lock” checked (default)


Package Log configuration enabled for the OLEDB destination


Event configuration done to log task activities as part of package logging

Once the package is executed, I will compare the difference between the OnPreExecute and OnPostExecute times for both configurations of the “Table Lock” option to get an idea of the performance difference between them.

NOTE: The packages were executed after it was confirmed that the databases involved (in this case tempdb) had grown sufficiently to accommodate the inserted data.

Confirming that we are indeed taking a Table Lock

While the package is executing in SQL Server Data Tools (SSDT, erstwhile BIDS), I run the following query on the SQL Server to check the locks occupied on our test table.

USE [tempdb]
GO
SELECT tl.resource_associated_entity_id AS ObjectId,
       OBJECT_NAME(tl.resource_associated_entity_id) AS ObjectName,
       tl.request_mode AS LockRequestMode
FROM [sys].[dm_tran_locks] AS tl
WHERE tl.resource_database_id = DB_ID()
  AND tl.resource_associated_entity_id = OBJECT_ID('[dbo].[SalesOrderDetail]','U');
GO

Based on the results, we can confirm that a Bulk Update (BU) lock has indeed been requested and granted on the table – indicating that the TABLOCK option was used as part of the insert.


Bulk Update (BU) lock used on the table, indicating use of TABLOCK (Table Lock)

Running the package with “Table Lock” turned Off

If I check the locks on the table while running the package with the “Table Lock” option unchecked, I see that either an “Intent Exclusive” (IX) or an “Exclusive” (X) lock has been taken. This indicates that SQL Server is actually using exclusive locks on lower-level allocation units (pages/rows).


When “Table Lock” is unchecked, an Intent Exclusive Lock is acquired on the table, indicating granular exclusive locks in use

Please note that between each run, the data buffers were cleaned and procedure cache was cleared out to get a “cold” state performance of the database engine.

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
GO

Comparing package execution performance

Because I had turned on logging on the SSIS package, I ran the following query against the [dbo].[sysssislog] table which gives me the time difference (in seconds) between the “OnPreExecute” and “OnPostExecute” events for both the packages. The query and the results are available below:

USE [tempdb];
GO
SELECT [PivotedTbl].[source],
       DATEDIFF(SECOND, 
                [PivotedTbl].[OnPreExecute], 
                [PivotedTbl].[OnPostExecute]
               ) AS PackageRunTimeInSeconds
FROM (SELECT sl.[source],
             sl.[event],
             sl.[starttime]
      FROM [dbo].[sysssislog] AS sl
     ) AS [SourceTbl]
PIVOT (MAX([SourceTbl].[starttime]) 
       FOR [SourceTbl].[event] IN ([OnPreExecute],[OnPostExecute])
      ) AS [PivotedTbl];
GO


Execution time summary showing data flow with OLE DB destination using Table locks completes faster than one using granular locks

We can clearly see that the table load with “Table lock” checkbox turned on is comparatively faster.

Summary

The OLE DB destination task is a very powerful way to load data into a SQL Server table. It also provides the flexibility to boost the rate of data insertion when used wisely.

  • When loading data into an empty table or when system is under an outage window, evaluate keeping the “Table Lock” checkbox checked
  • When it is important to keep the tables accessible during the data load, evaluate keeping the “Table Lock” checkbox unchecked
    • This will ensure that, where possible, SELECT queries are not blocked from executing as long as they do not refer to the same page(s) being inserted/updated (a quick blocking check is sketched after this list)
  • Using a higher-level lock (i.e. requesting a Table Lock/TABLOCK) does reduce the data load time, due to the reduced overhead of maintaining granular locks, and can be used as a means to speed up data inserts for large data sets
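
A quick, hedged way of checking whether the load is blocking readers (not part of the original post) is to run the following while the package executes:

--Sessions currently waiting on another session; if the bulk-load session shows up
--as the blocker, readers are waiting on its locks
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;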

References:

  • Lock Modes in SQL Server [MSDN Link]
  • Lock Compatibility [MSDN Link]
  • Performance – Best Practice – Create Index Before or After a data insert? [Blog Link]
  • Performance – Best Practice – Create Index Before or After a data insert – I/O, Fragmentation, CPU and Elapsed Time [Blog Link]

Until we meet next time,

Be courteous. Drive responsibly.

#0378 – SQL Server – Performance – CASE evaluates all the input result expressions


Recently, I was asked to troubleshoot a performance issue with a stored procedure that was being used for reporting purposes. Looking at the execution plan, I realized that while the joins and the filters were as expected, the core bottleneck was the set of sub-queries in the CASE expression. In order to execute the query, SQL Server needs to evaluate all the input result expressions and then return the value in the output result set based on the switch (the WHEN expression).

If one of these input result expressions refers to a large table, or to a table that is locked, it can compromise the performance of the entire statement – even though the conditions are such that the table does not need to be accessed (which is what was happening in our case).

The script below demonstrates the behaviour with an example. In the script, the CASE expression returns the values from one of 3 tables in the AdventureWorks database – Production.Product, Person.Person and Sales.SalesOrderHeader.

USE AdventureWorks2012;
GO

DECLARE @caseSwitch INT = 1;

SELECT CASE @caseSwitch 
            WHEN 1 THEN (SELECT TOP 1 
                                pp.Name
                            FROM Production.Product AS pp
                        )
            WHEN 2 THEN (SELECT TOP 1 
                                per.LastName + ', ' + per.FirstName
                            FROM Person.Person AS per
                        )
            WHEN 3 THEN (SELECT TOP 1 
                                soh.Comment
                            FROM Sales.SalesOrderHeader AS soh
                        )
            ELSE 'Default Value'
       END;
GO

When we execute the script with the “Show Actual Execution Plan” (Ctrl + M) turned on, we can see that all three tables were accessed.


A CASE expression evaluates all the input result expressions

If this behaviour causes significant performance issues in the system, the solution is to re-engineer the way the data is queried so that the required set of data is staged into temporary tables, avoiding hits on the underlying tables.
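
As one minimal sketch of such a re-engineering – shown here as procedural branching rather than the temporary-table staging described above, purely for illustration – only the branch that the switch actually requires gets evaluated:

--Illustrative alternative: evaluate only the branch the switch actually requires
USE AdventureWorks2012;
GO
DECLARE @caseSwitch INT = 1;
DECLARE @result NVARCHAR(400);

IF @caseSwitch = 1
    SELECT TOP 1 @result = pp.Name FROM Production.Product AS pp;
ELSE IF @caseSwitch = 2
    SELECT TOP 1 @result = per.LastName + ', ' + per.FirstName FROM Person.Person AS per;
ELSE IF @caseSwitch = 3
    SELECT TOP 1 @result = soh.Comment FROM Sales.SalesOrderHeader AS soh;
ELSE
    SET @result = 'Default Value';

SELECT @result AS [Result];
GO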

Until we meet next time,
Be courteous. Drive responsibly.