Tips for SSIS SQL Server Integration Services

by Joakim Dalby

 

1. Introduction

Best practices for building SSIS Integration Services packages, in the form of tips shown by examples. In my article about SQL Server 2014 I describe deployment methods for SSIS projects and packages on different servers, using a parameter for the connection string and an Environment variable inside the SSISDB catalog:

SQL Server 2014, read chapters 3 and 8.

 

Download SSDT data tools for Visual Studio

 

Download the latest OLE DB driver, called Microsoft OLE DB Driver for SQL Server (MSOLEDBSQL), for the Native OLE DB connection in a Data Flow Task. Don’t use SQL Server Native Client 11.0 anymore:

https://aka.ms/downloadmsoledbsql

 

Read introductory text about:

Microsoft OLE DB Driver for SQL Server

 

Download the latest ODBC driver 18 for SQL Server:

https://aka.ms/downloadmsodbcsql

 

Download the latest Microsoft OLE DB Provider for Analysis Services MSOLAP to be able to process a cube from an SSIS package:

https://learn.microsoft.com/en-us/analysis-services

An Analysis Services Execute DDL Task, under the DDL Connection, can use a package variable with this expression:

RIGHT(@[System::PackageName], FINDSTRING(REVERSE(@[System::PackageName]), "_", 1) - 1)

as the Expression of @[User::CubeName], to do a full cube processing with:

"{\"refresh\": {\"type\": \"full\",\"objects\":[{\"database\":\""+@[User::CubeName]+"\"}]}}"

 

Avoid using SqlPackage.exe from the folder C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\Extensions\Microsoft\SQLDB\DAC. Instead install SqlPackage from here using the Windows (.NET Framework) DacFramework.msi, which installs to the folder

C:\Program Files\Microsoft SQL Server\160\DAC\bin.

Read more about SqlPackage.

 

Read about Deploy Integration Services  

 

SSIS Lookup performance tuning techniques, e.g. handling no-match rows with REPLACENULL(Customer_Key,-1).

 

In case you see one of these error messages:

·  Failed to save modifications to the server. Error returned: 'OLE DB or ODBC error: A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.; 08001; SQL Server Network Interfaces: The handle specified is invalid; 08001; Client unable to establish connection; 08001; SQL Server Network Interfaces: The handle specified is invalid; 08001. A connection could not be made to the data source with the Name of '<database>'.

·  Object reference not set to an instance of an object when executing an SSIS package.

I recommend updating to the latest OLE DB driver and the latest ODBC driver.

 

A solution in SQL Server Data Tools for Visual Studio can contain several projects for the different data layers in a data warehouse: SSIS packages, SSRS reports, SSAS olap cubes, SSAS tabular cubes, exports to Excel files and so on. It is almost like a .NET solution with projects that become different dll files, where the dll files normally belong together and are called from the same exe file. If you like to show the solution node in Solution Explorer: Tools, Options, Projects and Solutions and checkmark 'Always show solution'. I recommend an SSIS project per data layer in a data warehouse, e.g. for extraction from source systems, for archive or ods, for dsa, for edw and for each data mart.

 

The properties of an SSIS project are shown from the Solution Explorer pane by right-clicking the project name and selecting {Properties}. I always set ProtectionLevel to DontSaveSensitive so other developers can access the SSIS packages with no dependency on the creator user and a password. When a package is saved, sensitive password values will be removed. This means passwords have to be supplied to the package through a configuration file or by the user. All new IS packages inherit the protection level. More reading with different approaches.

 

An example of a SSIS design pattern parent-child.

 

SSIS vs Azure Data Factory vs Azure Databricks

 

Please remember to consider the MaxConcurrentExecutables property in the SSIS package, more at these links: Data Flow Performance Features and Performance Tuning Techniques.

When an SSIS package runs on a dedicated server and there are a lot of operations that run in parallel, it can be beneficial to increase this setting if some of the operations (threads) do a lot of waiting for external systems to reply. A Data Flow Task has the properties EngineThreads, DefaultMaxBufferRows and DefaultMaxBufferSize, and setting the property AutoAdjustBufferSize to True will improve SSIS data flow buffer performance.

 

When an OLE DB Source fetches data from a view in a database, use the 'SQL command' access mode with Select * From <view>; that gives better performance than the 'Table or view' access mode.

 

An SSIS project has Connection Managers where all connection strings to files and databases should be, so they can be reused in all the SSIS packages in the project.

Right-click Connection Managers and select {New Connection Manager}; for a connection to a SQL Server database select OLEDB, click [Add] and [New], type in the SQL Server name, normally use Windows Authentication, and select the database in the dropdown.

 

The OLEDB data provider used to be Native OLE DB\SQL Server Native Client 11.0. This connection is used in Control Flow in Execute SQL Task and in Data Flow Task in OLE DB Source, OLE DB Destination and Lookup. SQL Server Native Client has been removed from SQL Server 2022 (160) and SQL Server Management Studio 19. The SQL Server Native Client (SQLNCLI or SQLNCLI11) and the legacy Microsoft OLE DB Provider for SQL Server (SQLOLEDB) are not recommended for new development.

The OLEDB data provider must be Microsoft OLE DB Driver for SQL Server (MSOLEDBSQL) or the latest Microsoft ODBC Driver for SQL Server.

 

The ADO.NET connection manager uses the type .Net Providers\SqlClient Data Provider.

This connection is used in Execute T-SQL Statement Task, in Execute SQL Task and when a Script Component wants to read directly from SQL Server via SqlCommand.
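
A minimal hedged sketch of a Script Component (configured as a Source) reading directly from SQL Server via SqlCommand; the ADO.NET connection manager name AdoNetDb, the table dbo.CUSTOMER and the two output columns CustomerId and CustomerName on Output 0 are only example names, not from this article:

using System.Data.SqlClient;   // a namespace at the top of the script

public override void CreateNewOutputRows()
{
    // Acquire the ADO.NET connection manager added to the Script Component.
    SqlConnection connection = (SqlConnection)Connections.AdoNetDb.AcquireConnection(null);
    using (SqlCommand command = new SqlCommand("SELECT CustomerId, CustomerName FROM dbo.CUSTOMER", connection))
    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // One pipeline row per database row.
            Output0Buffer.AddRow();
            Output0Buffer.CustomerId = reader.GetInt32(0);
            Output0Buffer.CustomerName = reader.GetString(1);
        }
    }
    Connections.AdoNetDb.ReleaseConnection(connection);
}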

 

A connection manager used inside an SSIS package has a property Expression, so for a flat file whose name includes a date yyyy-mm-dd an expression can assign the right filename at runtime for a dynamic connection, like this with a double backslash in the file path:

"C:\\File\\Customer"+LEFT((DT_WSTR,30)GETDATE(),10)+".csv"

 

To get a " inside a "…string…", use a backslash before the ", e.g. in a variable assignment:

"<root xmlns:joda=\"www.joda.xml\"/>"

 

For a connection manager, e.g. a SQL Server database with SQL Server Authentication, you can assign server name, database name, username and password to package variables (User::) and add the variables to the connection manager property Expression, so you can use OLE DB Source or OLE DB Destination in the same way as with Windows Authentication. The best part is that it still works together with DontSaveSensitive; only the button [Test connection] does not work, but that is no problem. A config file can assign the variables at runtime on the production server etc.

 

Show SSIS Toolbox

In the menu bar select View, Other Windows and SSIS Toolbox, and you have it in the left side of your screen showing the tools or components, depending on whether you are in a control flow or a data flow canvas.

 

To delete a row in a component

In a component like 'Derived Column' or 'Union All', first left-click another row, then place the mouse over the row you want to delete, right-click and select {Delete} in the menu.

 

ID of package

Make sure the package ID GUID is unique if you are using a template or have copied a package: go to the property ID, click the dropdown and select Generate New ID. Get the ID via the system variable System::PackageID. There is also a System::PackageName.

More about System Variables in SSIS package.

 

Execute and format a SSIS package

Normally I run an SSIS package from design mode by clicking the Start button, but sometimes that will evaluate and syntax check all files in the project and in the solution. To avoid that it is better to close all tabs (or documents, as SQL Server Data Tools for Visual Studio calls them) and right-click the package and choose {Execute Package} like this:

 

 

Layout of a SSIS package

In case you like to organize the layout of an SSIS package, Format can help:

 

 

Some of the components

 

 

 

2. Control Flow and Data Flow components

An SSIS package has a 'Control Flow' canvas or design surface which uses 'Task' components to display the steps of a process and is the task workflow engine that coordinates the business process flow logic for a package. The 'Data Flow' is the data processing engine that handles data movement, transformation logic, data organization, and the extraction and commitment of the data to and from sources and destinations.

 

Parameter / Variable | Used in expression | Used in Dts.Variables[…]
Project parameter | @[$Project::parameter] | "$Project::parameter"
Package parameter | @[$Package::parameter] | "$Package::parameter"
Package variable | @[User::variable] | "User::variable"
Variable scoped to a Sequence Container | @[User::variable_in_scope] (a Data Flow can’t refer to it) | "User::variable"

 

Some of the components and their usage

 

  • Execute T-SQL Statement Task for a fixed sql statement like TRUNCATE TABLE, UPDATE or DELETE. If you like to use a T-SQL MERGE statement, it is also placed here. Sadly this task needs its own connection manager because it uses the ADO.NET connection manager type '.Net Providers\SqlClient Data Provider'.
  • Execute SQL Task for a fixed or dynamic sql statement, where a criteria depends on another task that has assigned an SSIS package variable, which can be used in a Where clause through a question mark ? parameter or as a parameter to execute a stored procedure. This task can also assign an SSIS package variable with a value from a Select statement. The connection manager is the same as for the 'Data Flow Task', called 'Native OLE DB\Microsoft OLE DB Driver for SQL Server' and no longer SQL Server Native Client 11.0, and it provides a ? parameter for each variable. When using an ADO.NET connection instead, it provides a named @parameter like in T-SQL, which is more self-explanatory but also uses an extra connection to the database beside the connection used inside the 'Data Flow Task'. This task can also return a recordset/result set from a sql select statement with rows and columns when the property ResultSet is "Full result set"; the result set is saved in a package variable with value type "Object" and afterwards used in a 'Foreach Loop Container'. I prefer a sql select statement placed in an 'OLE DB Source' where the recordset afterwards is saved in a package variable with value type "Object" via 'Recordset Destination', see example in chapter 12.
  • Sequence Container for grouping tasks that can run in parallel. Tasks are normally performed in a particular sequential order with the Precedence constraint green arrows between them. When a package has many tasks on different datasets, they can be grouped together in several Sequence Containers and each container can be performed in parallel. Please be careful with having many Sequence Containers doing tasks on many different datasets, because that will make your SSIS package complex at runtime with many open connections to the database and harder to maintain later. I recommend having many SSIS packages, like one for each dataset, and then a batch/main/master/autoexec SSIS package that calls the packages inside a 'Sequence Container' to make the packages run in parallel. There can be several Sequence Containers with a constraint green arrow between them to control the ETL process order.
  • Foreach Loop Container for looping over files in a folder or rows in a package user variable of type Object, and doing some tasks on each file or each row of data.
  • Data Flow Task for pulling or fetching data out of a database into an SSIS pipeline and for pushing or inserting modified data back to the database to store the changes. The pipeline engine manages the flow of data from the data source adapters to the data destination adapters and does the necessary transformations.
  • Recordset Destination to place data from a pipeline into an in-memory variable with data type Object that later can be used in a 'Foreach Loop Container'.
  • OLE DB Source pulls or fetches data out of a database into a pipeline using a Select statement with only the required columns, or when using a table (or view), go to the Columns page and checkmark the required columns. Sometimes go to the Columns page and type in a prefix s_ for the output columns, which makes it easier to merge, union or lookup to other tables with the same column names. At the Error Output page bad data can be redirected to a red pipeline to be taken care of, instead of being ignored or failing the task. 'Source' can also pass data to a 'Recordset Destination' that uses a package variable with data type Object, and the variable can be used in 'Control Flow' in a 'Foreach Loop Container' to loop and iterate over each record/row of the variable, where a column is assigned to another package variable with type Int32 (int), Int64 (bigint), String (nvarchar), Boolean (bit), Decimal and Double or a char (nchar(1)) matching the data type in the database table.
  • OLE DB Destination pushes or inserts modified data from a pipeline back to the database to store the changes using 'Table or view – fast load', with a checkmark in 'Table lock' and no checkmark in 'Constraints', because it costs performance to check data that we assume the ETL process already has checked. If another process, like a stored procedure for a report, is doing a select from the table at the same time the SSIS package is running, then leave 'Table lock' unchecked. Seldom a checkmark in 'Keep identity' and 'Keep nulls'. Set 'Rows per batch' to 100000000 and 'Maximum insert commit size' to 100000 to avoid the transaction log growing huge and to commit per batch by breaking the incoming rows into multiple chunks. See later about how to set a sort order. Remember to set the database property Recovery model to Simple.
  • Multicast copies or clones a pipeline into two or several similar pipelines to avoid reading the same data from the database or file with different 'Source' components.
  • Data Conversion converts a column in a pipeline, like data from a flat file, to a better data type. A source column with data type varchar (string DT_STR) going to a destination column as nvarchar gets the conversion data type set to »Unicode string (DT_WSTR)« with a typed-in length. The output alias column can have the prefix c_ or keep the same name as the source column, in which case the pipeline will prefix the column with »DataConversion«, and I use this column in the OLE DB Destination mapping.
  • Derived Column adds an extra column into a pipeline, or replaces the value of an existing column with a new value or an expression like TRIM(<columnname>), or changes a null value with REPLACENULL(<columnname>, -1), or assigns a value to an existing column in a pipeline, like if the column is null it becomes -1 else 1: ISNULL(<columnname>) ? -1 : 1, or does a calculation. An SSIS package variable can be an extra column too. A new derived column name can have the prefix d_.
  • Lookup adds an extra column into a pipeline from another table, like an Inner join. Normally use 'Full cache'. The output alias column can have the prefix l_. Remember to set 'Specify how to handle rows with no matching entries' to at least 'Ignore failure', else Lookup gives a failure when a row in the pipeline has a column value that does not exist in the lookup table or the cache. Check whether a column value exists in another table by using 'Redirect rows to no match output', where Lookup divides a pipeline into two pipelines called 'Lookup Match Output' (does exist) and 'Lookup No Match Output' (does not exist). A SQL statement can insert an inferred member and Lookup returns the new Identity number value.
  • Cache Transform to place data from a pipeline into an in-memory cache that later can be used in a Lookup for faster performance.
  • Merge Join combines two sorted pipelines into one pipeline: adding extra columns into a pipeline from another pipeline with a Left outer join, or adding extra rows into a pipeline from another pipeline with a Full outer join, or reducing rows in a pipeline based on another pipeline with an Inner join. Pipelines have to be sorted before they are used in a merge join.
  • Conditional Split divides a pipeline into two or multiple pipelines by criteria, like data is ok or has an error (right or wrong), or a pipeline into three pipelines by doing delta data detection for New rows, Changed rows and Deleted rows from a Full outer join merge to get data that does not exist.
  • Aggregate values in a pipeline to a new summarized pipeline. The output alias column can have the prefix a_.
  • Union All combines multiple pipelines into one pipeline.
  • Script Component does C# programming to change data in a pipeline and divide it into multiple pipelines.
  • OLE DB Command runs a sql Update or Delete statement for each row in a pipeline. That is not good for performance with a large number of rows in a pipeline. Seldom do a sql Insert into statement here, because OLE DB Destination is good for that through bulk insert. Like UPDATE T SET F = ?, D = GETDATE() WHERE Id = ?. A question mark ? input parameter refers to a column in the pipeline. The best practice is to have a staging table in the database, truncate it with an 'Execute T-SQL Statement Task', do the Derived Column and Lookup transformations on the data inside the pipeline and then push it into the staging table through 'OLE DB Destination'. Then do a real set-based sql Update by joining the staging table with the real table in another 'Execute T-SQL Statement Task'.
  • Package variable is useful for 'Control Flow' and for 'Data Flow', like a criteria for a sql Select-Where, an input or output parameter to a stored procedure, and a recordset or result set. A variable gets its default value from an expression, like a DateTime variable with the expression: DATEADD("m",-1,GETDATE()).
  • Package parameter is useful to receive data from a calling package, to use in an expression: !ISNULL(@[$Package::Period]) && @[$Package::Period] != "" in 'Control Flow', or as a criteria in a 'Data Flow Task'. I use a package parameter as a constant variable, like for the name of a table or a file or a value.
  • Project parameter is useful when multiple packages use the same value, like a path to a directory, together with a package variable in a concatenated string expression with double backslash:

"N'"+@[$Project::BackupDirectory]+"\\Backup_"+@[User::Name] +".bak'"

  • Expression Task assigns a variable in a package like this expression:

@[User::Counter] = @[User::Counter] + 1

Only one package variable can be assigned a value in an 'Expression Task', so with multiple variables I need multiple 'Expression Task's, or instead a 'Script Task' which can assign multiple variables in a C# program.

A dynamic sql statement can also be built into a variable and used in an 'OLE DB Source', e.g. to fetch delta data for an incremental load with Where Id > max value; see the sketch after the 'Script Task' item below.

  • Script Task does C# programming with package variables, like showing the value in debug runtime mode; remember to set the property ReadOnlyVariables: using System.Windows.Forms;    // a namespace at the top of the code

MessageBox.Show(Dts.Variables["User::CustomerName"].Value.ToString());   or

System.Windows.Forms.MessageBox.Show(Convert.ToString(Dts.Variables["User::Num"].Value));

if the variable can be null. Examples with the three types of variable:

MessageBox.Show(Dts.Variables["User::PackageVariable"].Value.ToString());

MessageBox.Show(Dts.Variables["$Package::PackageParameter"].Value.ToString());

MessageBox.Show(Dts.Variables["$Project::ProjectParameter"].Value.ToString());

A script can do calculations with programming or build a dynamic sql statement into a variable and use it in an 'Execute SQL Task' to store a value in a table with sql Insert or Update. In 'Execute SQL Task' set SQLSourceType to Variable and at SourceVariable pick a package variable like User::TSQLSaveCalculation when it is a dynamically built sql statement.
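
A small hedged C# sketch of such a 'Script Task' that both assigns more than one variable and builds a dynamic sql statement for an incremental load; the variable names User::MaxId, User::Counter and User::TSQLDelta are only examples and must be added to ReadOnlyVariables/ReadWriteVariables:

public void Main()
{
    // Read the last loaded Id and build a delta-load select statement into a variable.
    long maxId = Convert.ToInt64(Dts.Variables["User::MaxId"].Value);
    Dts.Variables["User::TSQLDelta"].Value =
        "SELECT Id, Amount, OrderDate FROM dbo.ORDERS WHERE Id > " + maxId.ToString();

    // Unlike an 'Expression Task', a 'Script Task' can assign several variables in one go.
    Dts.Variables["User::Counter"].Value = (int)Dts.Variables["User::Counter"].Value + 1;

    Dts.TaskResult = (int)ScriptResults.Success;
}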

  • DQS Cleansing Transform example with filter duplicates. Other examples.
  • Sort data in a pipeline can remove rows with duplicate sort values. Sort and Aggregate block the stream of data in the pipeline because they need all the data to do their operation, so they are not recommended for many rows. Therefore let the 'Source' sql statement have an ORDER BY, which is needed for 'Merge Join', 'Lookup' or to fit the primary clustered key in 'Destination'. Assign the sort order to the 'Source' by right-clicking {Show Advanced Editor}, tab 'Input and Output Properties', click 'OLE DB Source Output' and set the property IsSorted to True:

 

 

And open 'OLE DB Source Output' and open 'Output Columns' and select the first column from ORDER BY and set property SortKeyPosition to 1:

 

 

Continue with the next column from ORDER BY, which is set to 2, and so on. When a column is DESC, SortKeyPosition gets a minus in front as a negative value.

Read more. When the data source can’t do ORDER BY, like a flat file, but the data is delivered sorted, use SortKeyPosition on the columns that are actually sorted. 'Sort' removes duplicate values of the sort key from a flat file.

From an OLE DB Source, make a Script Component of type Transformation to filter duplicates in C#, and remember at the page Inputs and Outputs, at Output 0, to set the property ExclusionGroup from 0 to 1. Filter on CustomerName:

        using System;

        using System.Data;

        using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

        using Microsoft.SqlServer.Dts.Runtime.Wrapper;

        using System.Collections.Generic;

        [Microsoft.SqlServer.Dts.Pipeline.

          SSISScriptComponentEntryPointAttribute]

        public class ScriptMain : UserComponent

        {

          public HashSet<string> Customers = new HashSet<string>(); // CustomerName values seen so far (use HashSet<int> for an integer key)

 

          public override void Input0_ProcessInputRow(Input0Buffer Row)

          {

            if (Row.CustomerName_IsNull)

               return;

            if (!Customers.Contains(Row.CustomerName))

            {

               Customers.Add(Row.CustomerName);  // only unique values are added.

               Row.DirectRowToOutput0();           

            }

          }

}

  • OLE DB Destination extra options (target table)

OLE DB Destination has a property FastLoadOptions that normally shows TABLOCK, CHECK_CONSTRAINTS. Four extra options can be added there:

ROWS_PER_BATCH = <number>

<number> indicates that destination is configured to use batches of rows.

KILOBYTES_PER_BATCH = <number>

<number> specifies the size in kilobytes to insert.

FIRE_TRIGGERS

Specifies whether triggers fire on the insert table because of fast load/bulk.

ORDER(<columnname> ASC, <columnname> DESC)

Specifies how the inserted data is sorted. If the sort order is omitted, the insert operation assumes the data is unsorted. Performance can be improved if you use the ORDER option to sort the data according to the clustered index on the table. The ORDER hint needs to exactly match the clustering columns of the destination table. ORDER is especially good when the OLE DB Source is a flat file. E.g. Id is a clustered index on the destination table, and the FastLoadOptions property will have this statement:

TABLOCK,ROWS_PER_BATCH = 100000000,ORDER(Id)

Another example has no TABLOCK, because the pipeline would stop inserting data when someone selects from the table, and the table has a three column composite primary key that is non-clustered because there is a unique clustered index on an Identity column that just grows for each inserted row. Since the OLE DB Source has an Order By in the select statement and the property IsSorted is True, with column 1 = Date, 2 = SeqNo and 3 = DeviceNo, and I want that sort order for the OLE DB Destination table rows, I am using this FastLoadOptions:

ROWS_PER_BATCH = 100000000,ORDER(Date,SeqNo,DeviceNo)

Remember to set the database property Recovery model to Simple. Remember also to look at indexes to get faster load performance, like this:

* Disable/drop non-clustered indexes before bulk load.

* Rebuild/create non-clustered  indexes after bulk load.

The Data Loading Performance Guide

  • Restart Packages by Using Checkpoints read more.
  • Audit Transformation package system package variables read more.

 

 

3. Data type and Expression

Data Conversion does not use the SQL Server table data types, and SSIS package variables and SSIS project parameters have their own data types; the mapping is:

 

SQL Server data type | SSIS data type for derived expression | Variable/parameter data type | OLE DB input/output ? parameter
bigint | DT_I8 | Int64 | LARGE_INTEGER
binary | DT_BYTES | Object | n/a
bit | DT_BOOL | Boolean | VARIANT_BOOL
char | DT_STR | String | VARCHAR
date | DT_DBDATE | DateTime | DBDATE
datetime | DT_DATE | DateTime | DATE for input, DBTIMESTAMP for output
datetime2 | DT_DBTIMESTAMP2 | DateTime | DATE / DBTIMESTAMP
datetimeoffset | DT_DBTIMESTAMPOFFSET | Object | DBTIMESTAMPOFFSET
smallint | DT_I2 | Int16 | SHORT
decimal | DT_DECIMAL | Decimal | DECIMAL
float | DT_R8 | Double | FLOAT
int | DT_I4 | Int32 | LONG
image | DT_IMAGE | Object | n/a
money | DT_CY | Decimal | CURRENCY
smallmoney | DT_CY | Decimal | CURRENCY
nchar | DT_WSTR | String | NVARCHAR
nvarchar | DT_WSTR | String | NVARCHAR
ntext | DT_NTEXT | String | n/a
nvarchar(max) | DT_NTEXT | Object | n/a
numeric | DT_NUMERIC | Single | NUMERIC
real | DT_R4 | Single | FLOAT, DOUBLE
text | DT_TEXT | String | n/a
time nanoseconds | DT_DBTIME2 | DateTime | DBTIME2
time seconds | DT_DBTIME | DateTime | DBTIME
timestamp | DT_BYTES | DateTime | n/a
tinyint | DT_UI1 | Byte | BYTE
uniqueidentifier | DT_GUID | Object | GUID
varbinary | DT_BYTES | Object | n/a
varchar(max) | DT_TEXT | Object | n/a
varchar | DT_STR | String | VARCHAR
xml | DT_WSTR | String | n/a
single character char(1) | DT_UI2 | Char | VARCHAR
declarative NULL value |  | DBNull |
recordset / resultset |  | Object |

 

 

Script Task C# data types: DateTime, string, byte, short (Int16), int (Int32) and long (Int64).
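
A short hedged sketch of reading SSIS variables into matching C# types in a 'Script Task'; the variable names are only examples and must be added to ReadOnlyVariables first:

// Int32 -> int, Int64 -> long, DateTime -> DateTime, String -> string, Boolean -> bool
int year = (int)Dts.Variables["User::Year"].Value;
long rowCount = (long)Dts.Variables["User::RowCount"].Value;
DateTime endDate = (DateTime)Dts.Variables["User::EndDate"].Value;
string type = Dts.Variables["User::Type"].Value.ToString();
bool summertime = (bool)Dts.Variables["User::Summertime"].Value;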

 

(DT_WSTR, <length>) unicode string, length of characters, for nvarchar or nchar, e.g. (DT_WSTR,30) for nvarchar(30).

 

(DT_STR, <length>, <codepage>) ansi string with codepage 1252 for varchar or char, e.g. (DT_STR,30,1252) for varchar(30).

 

When you have a nvarchar column in a source table and you want only the first three letters and save it in a varchar column in a destination table, do this in a Derived Column to cast the value to codepage 1252 = ansi = varchar data type:

(DT_STR,3,1252)SUBSTRING(<columnname>,1,3)

Or save it in a nvarchar column in a destination table:

(DT_WSTR,3)SUBSTRING(<columnname>,1,3)

 

Illegal characters are replaced with an empty string, here for an ANSI varchar(100):

(DT_STR,100,1252)REPLACE(investmentdecisionwithinfirm,"�","")

 

ANSI has code page 1252 as standard for the SSIS pipeline to a SQL Server table.

UTF-8 has code page 65001 and is often the encoding of a csv file.

A Derived Column will be added for each column from a csv file to do a conversion to code page 1252 ANSI with a cast like this, for a column called clientidcode:

clientidcode_  = (DT_STR,100,1252)clientidcode

to a new derived column called clientidcode_.

  • DT_STR indicates a string corresponding to the data type varchar (DT_WSTR for a nvarchar).
  • 100 indicates the number of characters in the string corresponding to the data type varchar(100) in the input table columns.
  • 1252 indicates the ANSI encoding that the OLE DB Destination expects, whereby æøåü and other characters from the csv file are saved correctly in the input table in a SQL Server database.

 

(DT_NUMERIC,<precision>,<scale>): precision tells the max number of digits in the number, both left and right of the decimal point, and scale tells the number of digits right of the decimal point. Like (DT_NUMERIC,5,2) is max 999.99.

(DT_DECIMAL,2) cast string to decimal value with 2 digits after decimal point.

Read more about Integration Services Data Types

 

DateTransaction with data type date has derived column expression:

ISNULL(TransactionDatetime) ? (DT_DBDATE)"1900-01-01" : (DT_DBDATE)TransactionDatetime

 

DateTimeTransaction with data type datetime has derived column expression:

ISNULL(TransactionDatetime) ? (DT_DATE)"1900-01-01 00:00:00.000" : (DT_DATE)TransactionDatetime

 

Date with data type integer as yyyymmdd has derived column expression:

(DT_I4)((YEAR(OrderDate) * 10000) + (MONTH(OrderDate) * 100) + DAY(OrderDate))

 

(DT_I4)(REPLACE((DT_WSTR, 12) (DT_DBDATE)GETDATE(),"-",""))

 

(DT_WSTR,4)YEAR(GETDATE()) + RIGHT("0" + (DT_WSTR,2)MONTH(GETDATE()),2)

+ RIGHT("0" + (DT_WSTR,2)DAY(GETDATE()),2)

 

Active Player Day as CustomerId concat with yyyymmdd as bigint value:

(CustomerId * (DT_I8)100000000) +

((YEAR(PlayDate) * 10000) + (MONTH(PlayDate) * 100) + DAY(PlayDate))

 

Time_key with data type smallint, where a null value becomes -1 and the time 10:34:45 becomes 1034, has the derived column expression:

ISNULL(TransactionDatetime) ? (DT_I2)-1 : (DT_I2)DATEPART("hh",TransactionDatetime) * 100 + DATEPART("mi",TransactionDatetime)

 

(DT_DBTIME2, «scale») e.g. (DT_DBTIME2,2)TransactionDateTime

has between 0 and 7 digits specified for fractional seconds.

Read more about Cast

 

24-hour clock in the format HH:mm:ss.

12-hour clock in the format hh:mm:ss.

mm for minutes and MM for number of month.

string dt = DateTime.Now.ToString("yyyyMMddTHHmmss");

string dt = DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffZ");

Z indicates Zero time zone, i.e. that the time is in UTC time zone e.g.

2014-05-26T22:17:38.549Z

 

Variable name is case-sensitive like in C#.

Using a variable in script C# programming: Dts.Variables["User::Number"].Value;

Using a variable in an Expression: @[User::Name]

            MONTH(@[User::Birthdate]) == 5 ?  "May" : "Another month"

Using a package parameter in an Expression: @[$Package::Keyvalue]

Using a project parameter in an Expression: @[$Project::BackupDirectory]

Using a system variable in an Expression: @[System::PackageName]

 

Always cast a value, and return a null value cast to the right data type

 

 

Data type in SQL Server | Derived column
smallint | (DT_I2)0   (DT_I2)1   NULL(DT_I2)
int | (DT_I4)0   (DT_I4)1   NULL(DT_I4)
bigint | (DT_I8)0   (DT_I8)123456789123456789   NULL(DT_I8)
numeric(5,2) | (DT_NUMERIC,5,2)999.99   NULL(DT_NUMERIC,5,2)
decimal(18,2) | (DT_DECIMAL,2)999.99   NULL(DT_DECIMAL,2)
float | (DT_R8)999.99   NULL(DT_R8)
bit (also called boolean) | (DT_BOOL)1   (DT_BOOL)0   NULL(DT_BOOL)
datetime | (DT_DATE)"9999-12-31"   (DT_DATE)"2016-12-31 23:50:45"   NULL(DT_DATE)
datetime2 | NULL(DT_DBTIMESTAMP2)
nvarchar(3) | (DT_WSTR,3)"Big"   NULL(DT_WSTR,3)
varchar(5) | (DT_STR,5,1252)"Small"   NULL(DT_STR,5,1252)
money | (DT_CY)123.45   NULL(DT_CY)

 

An expression can cast or convert an integer value so it can be concatenated to a string, as several of the examples below show.

 

OrderDate is a datetime2(7) column to set to a fixed value:

(DT_DBTIMESTAMP2,7)"2013-05-27 16:42:37.4900000"

 

OrderDate is a datetime, cast it to a dbdate and to a date that has time 00:00:00

 

(DT_DBDATE)OrderDate results in a date.

 

ISNULL(OrderDate) ? NULL(DT_DBDATE) : (DT_DBDATE)(OrderDate)

 

(DT_DATE)(DT_DBDATE)OrderDate results in a datetime with 00:00:00.

 

If the column OrderDatetime is null then use a fixed date, else the column:

 

ISNULL(OrderDatetime) ? (DT_DATE)"2000-01-01 00:00:00.000" : (DT_DATE)OrderDatetime

 

Fetch out each part of a date as dd-mm-yyyy to become yyyy-mm-dd:

(DT_DATE)(SUBSTRING(OrderDatetime,7,4) + "-" +SUBSTRING(OrderDatetime,4,2) + "-" +

SUBSTRING(OrderDatetime,1,2))

 

Get OrderDate as the first day of the month:

 

(DT_DBDATE)((DT_WSTR,4)(YEAR(OrderDate)) + "-" +  (DT_WSTR,2)(MONTH(OrderDate)) + "-01")

 

(DT_WSTR,4) is the SSIS expression cast to a nvarchar(4).

 

(DT_WSTR, 10) (DT_DATE) DATEADD("dd", - @[$Package::DaysAgo], GETDATE())

 

Make a filename with number of current month:

 

"C:\\File\\Data_"+RIGHT("0"+ (DT_WSTR,2) MONTH(GETDATA()), 2)+".txt"

 

(DT_WSTR,2) is the SSIS expression cast to a nvarchar(2).

(DT_DECIMAL,2) cast string to decimal value with 2 digits after decimal point.

 

Get Filename from column PathFileName:

RIGHT([PathFileName],FINDSTRING(REVERSE([PathFileName]), "\\",1) - 1)

 

Get Folderpath from column PathFileName:

SUBSTRING([PathFileName], 1, LEN([PathFileName]) - FINDSTRING(REVERSE([PathFileName]), "\\", 1) + 1)

 

Get Foldername from column PathFileName:

TOKEN([PathFileName], "\\", TOKENCOUNT([PathFileName], "\\") - 1)

 

FINDSTRING(ReceiptNumber,"-",1) >= 1 ? ReceiptNumber :

    LEFT(TransactionIdentification,17) != LEFT(LinkTransactionIdentification,17) ?

      LEFT(LinkTransactionIdentification,17) : NULL(DT_WSTR,17)

 

An SSIS package parameter is called with a value or uses its default value:

  SourceFolderPathName = \\fileserver\Landingzone\CRM\SourceZipFiles\20220627

 

A SSIS package variable has an expression to fetch the date e.g.  20220627:

SourceFolderName = REVERSE(LEFT(REVERSE(@[$Package::SourceFolderPathName]),

                    FINDSTRING(REVERSE(@[$Package::SourceFolderPathName]), "\\", 1) - 1))

 

Guid variable with data type Object

Dts.Variables["User::Guid"].Value = "{" + Guid.NewGuid().ToString() + "}";

 

Expression is useful in Derived Column but can also be used in Send Mail Task for property Subject like "ETL done at: " + (DT_WSTR,30) @[System::StartTime] +"."

Or in Execute SQL Task at the property SqlStatementSource to make a dynamic sql statement, but be careful because it can be hard to read, understand and maintain. I think it is better to use a variable for dynamic sql, so I can show it in a messagebox to check syntax and correctness before it is sent to the database server to pull or fetch data to a result set inside Execute SQL Task or to a 'Source'. I can also put an expression on the Data Flow Task property Disable, in case a previous step has put a pipeline Row Count into a variable: @[User::NOBills] == 0,

or at the Precedence constraint green arrow to continue when @[User::NOBills] > 0, and a fx logo is shown.

 

Expression operators taken from C#:

 

==                           equivalent        same == same
!=                           inequality        old != new
>   >=                       greater than      10 > 5
<   <=                       less than         10 < 15
&&                           logical and       x >= 100 && x <= 200  (interval)
||                           logical or        x == 100 || x == 200
!                            logical not       !ISNULL(x)  (x has a value)
<boolean>?<true>:<false>     iif – inline if   ISNULL(x) ? -1 : 1
%                            modulus           25 % 2 is 1 for an odd number

 

SUBSTRING, UPPER, LOWER, TRIM, LEN, DATEDIFF, CEILING for rounding up, FLOOR for rounding down, and REPLACENULL(<columnname>, 0) to replace a null with zero in SSIS.

 

A SWITCH / CASE expression becomes an if-elseif-else statement like this:

(condition1) ? (true value1) : (condition2) ? (true value2) : (false value)

 

A boolean value is either True or False. A bit value is either 0 or 1.

ISNULL(Keep) ? NULL(DT_BOOL) : (DT_BOOL)(Keep == "false" ? 0 : 1)   anything else becomes true.

ISNULL(Keep) ? NULL(DT_BOOL) : (DT_BOOL)(Keep == "false" ? 0 : (Keep == "true" ? 1 : NULL(DT_BOOL)))   anything else becomes null.

 

ISNULL(Keep) ? NULL(DT_BOOL) : (DT_BOOL)(REPLACE(REPLACE(Keep,"false","0"),"true","1"))

 

Example of how to calculate number of days in a derived column expression:

 

ContiguousDayCounter=DATEDIFF("dd",(DT_DBDATE)"1979-12-31",(DT_DBDATE)TransactionDatetime)

 

Or a special number with a value like 12031-852356-21, up to 30 characters long:

 

SpecialNumber = (DT_WSTR,30)((DT_WSTR,10)DATEDIFF("dd",(DT_DBDATE)"1979-12-31", (DT_DBDATE)TransactionDatetime) + "-" + (DT_WSTR,12)SEQNO + "-" +

REPLACENULL(DEVICENO,"00"))

 

VB example in a Script Component that builds up a large string with errors:

Imports System.Text

Public Class ScriptMain

    Inherits UserComponent

  Dim errorMessages As StringBuilder = New StringBuilder()

  Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

      errorMessages.AppendLine(
                                String.Format("Key value {0} not found in the reference table.", Row.KEY))

  End Sub

End Class

 

Unnatural NULL Behaviour in SQL server

  • NULL is not equal to zero or blank.
  • NULL is not equal to NULL.
  • NULL is not »not equal« to NULL.
  • TRUE OR NULL => TRUE.
  • TRUE AND NULL => NULL.
  • SELECT F FROM T WHERE EXISTS (SELECT NULL) makes EXISTS TRUE.
  • NULL-valued integer column is different from a NULL-valued string column.
  • IS NOT NULL in a query to search a column that has a value or an empty string "" with a length 0.
  • A null string is different than an empty string "" with length 0, because a null string doesn't point to anything in memory.
  • Also called DBNull, vbNullString or Nothing like: If Not rst Is Nothing Then.

 

In SSIS C# Script component

Any equality or non-equality comparison with null (==, !=) works as expected, i.e.:

 

null == null         //true

null != null          //false

null == nonNull   //false

null != nonNull    //true

 

Any inequality comparison with null (>=, <, <=, >) returns false even if both operands are null, i.e.:

 

null >  anyValue  //false

null <= null        //false

 

IsNullOrEmpty function checks whether or not a string is empty or null. DBNull is not null (Nothing) but rather is a class that indicates a value from a database does not exist. IsDbNull checks for whether or not a value equals DBNull.

IIf(String.IsNullOrEmpty(DBNull.Value.ToString), "", DBNull.Value)
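
The same idea as a hedged C# sketch, turning DBNull into an empty string when reading from a SqlDataReader; the helper method is only an example:

// Returns "" when the database value is DBNull, otherwise the value as a string.
static string ReadOrEmpty(System.Data.SqlClient.SqlDataReader reader, string column)
{
    object value = reader[column];
    return value == DBNull.Value ? "" : Convert.ToString(value);
}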

 

Derived column

 

ISNULL(SalesPersonName) ? "Missing" : SalesPersonName

better with: REPLACENULL(SalesPersonName, "Missing")  (or use "N/A").

REPLACENULL(SalesAmount, (DT_CY)0)

REPLACENULL(SalesDate, (DT_DBDATE)"1900-01-01")

REPLACENULL(SalesDateTime, (DT_DATE)"1900-01-01 00:00:00.000")

 

In case of null value or a zero length empty string value:

(ISNULL(SalesPersonName) || TRIM(SalesPersonName) == "") ?

       "Missing" : TRIM(SalesPersonName)

 

ISNULL(DrawNumber) ? "-" : TRIM(DrawNumber)

 

Make sure of the data type, works also for decimal with 2 decimals:

ISNULL(Amount) ? (DT_NUMERIC,15,2)0 : (DT_NUMERIC,15,2)Amount

 

(DT_STR, 4, 1252)([Amount] < 300 ? "Low" : "High")  to a varchar(4) column.

(DT_WSTR, 4)([Amount] < 300 ? "Low" : "High")        to a nvarchar(4) column.

 

Making a derived column d_Number from a Number column nvarchar(50) that can contain null or an empty string, because a direct cast gives: failed because error code 0xC0049064 and failed with error code 0xC0209029:

TRIM(Number) == "" ? NULL(DT_I8) : (DT_I8)Number

 

Making a new Id as first 17 digits from a string and cast it to bigint:

(DT_I8)REPLACENULL(LEFT(LinkId,17), LEFT(Id,17))

 

But if both text columns can contain an empty string:

ISNULL(LinkId) || TRIM(LinkId) == "" ?

  (ISNULL(Id) || TRIM(Id) == "" ? NULL(DT_I8) : (DT_I8)LEFT(Id,17)) : (DT_I8)LEFT(LinkId,17)

 

Derived Column does not have a sql try_cast, but it has a button called Configure Error Output, and in the dialog box you can change both settings to Ignore failure when you like to cast or convert a text/string with digits to a real number with a data type like bigint:

 

NationalIdentification_Integer = (DT_I8)NationalIdentification_Text

 

If the column NationalIdentification_Text has 007 it becomes 7, and if it has JB007 it becomes NULL in the new derived column NationalIdentification_Integer in the pipeline, which is saved to a bigint column in the destination/target table.

 

A Script Component can use TryParse to check if LinkId or Id contains a value that is not a valid number, e.g. 101A or BC123456-7.
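
A minimal hedged sketch of such a Script Component transformation, assuming a string input column LinkId and a bigint output column LinkIdNumber added at the page 'Inputs and Outputs' (the column names are only examples):

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Take at most the first 17 characters and try to parse them as a bigint;
    // values like 101A or BC123456-7 become NULL instead of failing the pipeline.
    string candidate = Row.LinkId_IsNull ? "" : Row.LinkId.Trim();
    if (candidate.Length > 17)
        candidate = candidate.Substring(0, 17);

    long number;
    if (long.TryParse(candidate, out number))
        Row.LinkIdNumber = number;
    else
        Row.LinkIdNumber_IsNull = true;
}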

 

Compare two columns that might have a Null in a Conditional Split

I always replace Null with the character ¤ because it is very rare that a text in a column has only this value. Then I can compare columns that have a checkmark in 'Allow Nulls'.

 

For nvarchar I can replace Null to ¤ because columns are already a string:

REPLACENULL(s_Name,"¤") != REPLACENULL(d_Name,"¤")

 

For int, bigint, tinyint, smallint, decimal, float, real, numeric, money, and date, datetime and time, I first cast column to a string with 30 characters:

 

REPLACENULL((DT_WSTR,30)s_Number,"¤") != REPLACENULL((DT_WSTR,30)d_Number,"¤")

 

A CustomerNo is a bigint, and when it is null I want -1 instead of null:

 

REPLACENULL(CustomerNo, (DT_I8)-1)  or  (DT_I8)REPLACENULL(CustomerNo,-1)

 

If an OrderDate column in pipeline is null then change it to January 1, 1900:

 

REPLACENULL(OrderDate, (DT_DBDATE)"1900-01-01")

 

If an OrderDate column with data type datetime has time 00:00:00.000 on the date 2016-12-31, I cast the date literal with (DT_DATE) to make it 2016-12-31 00:00:00.000 so the two dates can be compared, then give true else false:

 

OrderDate == (DT_DATE)"2016-12-31" ? (DT_BOOL)1 : (DT_BOOL)0

 

If a ValidTo date column has no time, I cast the date with (DT_DBDATE):

 

(DT_DBDATE)ValidTo == (DT_DBDATE)"9999-12-31" ? (DT_BOOL)1 : (DT_BOOL)0

 

If divide by zero then let the result be 0 else calculate the ratio:

 

REPLACENULL(Quantity,0) == 0 ? 0.0 : Price / Quantity

 

Other string functions

Replace in a string, or if the column is an integer, cast it first:

REPLACE((DT_WSTR,8)NumberOfDays,"6","5")

 

Replace zero (0) with NULL:

REPLACE((DT_WSTR,8)<IntegerColumn>,"0",NULL(DT_I4))

 

Convert an empty string to NULL in a derived column as varchar:

TRIM(<TextColumn>) == "" ? (DT_STR, 4, 1252)NULL(DT_STR, 4, 1252) : <TextColumn>

 

Replace an integer column with NULL if the value is negative:

(<IntegerColumn> < 0) ? NULL(DT_I4) : <IntegerColumn>

 

Replace a double column with NULL if the value is negative:

(<DoubleColumn> < 0) ? NULL(DT_NUMERIC,3,2) : <DoubleColumn>

 

Replace a date column with NULL if the value is »in the future«:

(<DateColumn> > GETDATE()) ? NULL(DT_DATE) : <DateColumn>

 

LEFT(<string>, <number-of-chars>) A shorthand way of writing SUBSTRING(<string>, 1, <number-of-chars>)

TOKEN(<string>, <delimiter>, N) Returns the Nth token in <string> when it is split by <delimiter>

TOKENCOUNT(<string>, <delimiter>) Returns the total number of tokens in <string> as determined by <delimiter>

TOKEN(,,) will be particularly useful when manually parsing columns from a flat file.

When used in conjunction with TOKENCOUNT it can be used to return the last token in a string like so:

TOKEN(<TextColumn>, ",", TOKENCOUNT(<TextColumn>, ",") )

 

To compare with a string, use " around the text, like @[User::Name] == "Paris" in a 'Conditional Split'. Use [ ] around a column name that contains a space or starts with a digit.

 

Inside the Data Flow Task the Derived Column does the manipulation of data

When data from an OLE DB Source goes into a pipeline, the columns can be derived to new columns with new column names that exist in the table of the OLE DB Destination, or the column values can be replaced by a cleansing rule expression together with a ? condition (if-then-else) and a cast to get the right data type and length of text. Here are some examples:

 

 

Expression: 'Discount != 0 ? (DT_BOOL)1 : (DT_BOOL)0' means that when the pipeline column Discount has a value different from 0, a new column in the pipeline called DiscountStatus with data type bit (DT_BOOL in SSIS) is assigned the value 1 = true, else the value 0 = false, and saved into a column in the table of the 'OLE DB Destination'.

 

(DT_BOOL)([Discount] == "Y" ? 1 : 0) has the advantage of automatically setting the data type of the derived column correctly.

 

Expression: 'Quantity > 0 && Quantity < 50 ? Quantity : NULL(DT_I2)' means that when the pipeline column Quantity has a value between 1 and 49, the value will be passed through, but if the column value is less than or equal to 0 or greater than or equal to 50 it will be replaced with a null value, because these values are wrong from the source system and can give wrong summaries and averages in data warehouse reporting.

 

The new columns NumberOfAbroad and NumberOfDomestic get value 1 or 0 depending on the business logic rule for the column Place. Since a sql statement, a column in a report or a measure in a cube will do a summary and total of NumberOf, it is important that the derived column has the right data type to cover the range of the total value; therefore I cast the values 1 and 0 to DT_I4, which is the int data type in the table of the OLE DB Destination.

 

The SSIS package has an Int32 variable called Batchnumber and the variable has been assigned a value in the 'Control Flow'. The variable value can be added to the pipeline inside the 'Derived Column Transformation' with this syntax notation: @[User::Batchnumber] and saved into a column in the table of the 'OLE DB Destination'.

 

By editing the 'OLE DB Destination' I can click the New button and type in a good table name, to be created automatically when I click OK, so I will change [OLE DB Destination] to a table name like [Stage_OrderDetails], and the pipeline will map the columns from the source table and from the derived columns to the new destination table columns:

 

 

The new table will allow null in all columns, which can be good for a source input table, so normally there is no need for a checkmark at 'Check constraints' above.

 

Cast or convert Danish letters from code page 865 DOS ASCII to code page 1252 Windows ANSI to a varchar column

'Cannot be processed because more than one code page' means that data in the OLE DB Destination is using a different code page than 1252. Here is an example to cast or convert between code pages, using (DT_STR,50,1252) as a cast in front of the column name; it is important in the derived column to make a new column in the pipeline, like this where the column name is »by«, meaning city:

 

 

When the column is Unicode, nvarchar(50), then use (DT_WSTR,50).

 

 

4. Execute SQL Task to assign a variable to use in Control Flow

A SQL statement will query a database table and will only return one row and the two columns will be assigned to two SSIS package variables:

 

SELECT StatusId, StatusDateTime  -- int and datetime

FROM dbo.STATUS

WHERE Status = 0

 

In an SSIS package in Control Flow I right-click the canvas and select {Variables}, and in the window I click the small left side button [Add Variable] to insert a new row and type in the variable name and data type, where a sql int is a package Int32. Since my sql statement has two columns I also make two variables with a default value, but the default value will be overridden by the value from the sql statement:

 

 

In Control Flow I drag in an 'Execute SQL Task', edit it, choose a Connection and type in the sql statement at the property SQLStatement. I set the property ResultSet to 'Single row' and at the page 'Result Set' I click the [Add] button and override 'NewResultName' with the first column name, or just 0 for the first column and 1 for the second column. I also choose the right package variable:

 

 

To test if the assignment of the two variable has been done, I drag in a 'Script Task':

 

 

I edit the 'Script Task' and at the property ReadOnlyVariables I click […] and checkmark the two package user variables I like to use inside the script. I click the [Edit Script] button and type in a MessageBox line above the Success line like this:

 

MessageBox.Show(Dts.Variables["User::SID"].Value.ToString() + " - " +

Dts.Variables["User::SDATE"].Value.ToString());

 

Dts.TaskResult = (int)ScriptResults.Success;

 

I start the package and see the messagebox showing the values from the row in the status table in the database.

 

When the sql statement does not return a row, an error occurs while assigning a value to the variable: "Single Row result set is specified, but no rows were returned." Therefore I need to extend the sql statement so it will always return a row, and I will make a constraint so the Control Flow goes to another script.

 

In 'Execute SQL Task' I change the sql statement to check if there is a row, and if not I return a row with default values for the two columns:

 

IF EXISTS(SELECT 1

          FROM dbo.STATUS

          WHERE Status = 0)

   SELECT StatusId, StatusDateTime

   FROM dbo.STATUS

   WHERE Status = 0

ELSE

   SELECT 0 AS StatusId, '1900-01-01 00:00:00' AS StatusDateTime

 

I edit the green 'Precedence constraint' arrow between 'Execute SQL Task' and 'Script Task', because I can extend the constraint with a logical boolean expression, so I change 'Evaluation operation' from "Constraint" to "Expression and Constraint". In the expression I am using the SID variable with the C# notation for not equal:

@[User::SID] != 0

 

 

A fx logo is shown at the green 'Precedence constraint' arrow, telling that an expression is included in the criteria for success and continuation.

 

 

When the sql statement does not return a row, the 'Precedence constraint' will not continue and the SSIS package does not fail.

 

But if I want the 'Control Flow' to continue, I drag in another 'Script Task', rename it to 'No Row' and type in a MessageBox "No row". I edit the green 'Precedence constraint' arrow between 'Execute SQL Task' and 'No row' using: @[User::SID] == 0

 

 

The 'Control Flow' will always continue the tasks of the SSIS package, and the two variables have a value that can be used later in the package, like in a 'Data Flow Task'.

 

Assign a datetime variable for fixed value

 

 

Drag in an 'Execute SQL Task' and in edit type in a sql statement to get the date and time from the SQL Server; you can also use GETUTCDATE() for UTC time without summertime, like this for normal local time with summertime:

 

SELECT GETDATE() AS CurrentDateTime

 

Change ResultSet to 'Single row', select a Connection, and at the Result Set page type in the column name from the sql statement and select the variable:

 

 

Then you can use the package variable User::CurrentDateTime in a 'Script Task'

 

MessageBox.Show(Dts.Variables["User::CurrentDateTime"].Value.ToString());

 

Then you can use the package variable User::CurrentDateTime in an 'Execute SQL Task' as an input parameter to make a dynamic sql statement in the where part or in the set part of an update statement like this:

 

 

Please note to use DATE as the Data Type for the Input direction, because the OLE DB DATE also covers a time as a datetime. For the Output direction use DBTIMESTAMP.

 

If you typein an Expression for the variable in the variables window like this:

 

 

the variable CurrentDateTime will change value each time it is used, as if the expression is executed each time.

 

Execute SQL Task does not fail with no row

In case you do not like the sql statement starting with IF EXISTS(SELECT 1, the 'Execute SQL Task' can be changed to use 'Parameter Mapping', which means that the sql statement will have two output parameters, like this where a question mark is an output parameter:

 

SELECT ? = StatusId, ? = StatusDateTime

FROM dbo.STATUS

WHERE Status = 0

 

Other examples: SELECT TOP 1 ? = CustomerId   or   SELECT ? = MAX(System_key)

'Parameter Mapping' does not use the property ResultSet, therefore I will set it to 'None' and the 'Execute SQL Task' will not fail when there is no row.

 

At the page 'Parameter Mapping' I connect each variable with each output parameter, where Parameter Name 0 is the first ? as StatusId and the second ? is StatusDateTime. The Data Type for the output or input parameter is the OLE DB type as shown in chapter 3.

 

 

I still keep the two 'Precedence constraint' green arrows so the 'Control Flow' will do different tasks depending on a row or no row, because when there is no row the 'Parameter Mapping' will assign the SID variable the value 0.

 

Control flow depends on whether a value exists in a table

From a Result table I like to get a StatusId that has status true, and I like to check if that StatusId exists in a Status table. In an SSIS package I will use two 'Execute SQL Task' to get values from the database tables into SSIS package variables and let one variable ExistsStatus control the flow by two fx expressions:

  @[User::ExistsStatus] == True         // for Exists

  @[User::ExistsStatus] == False        // for Not exists

 

 

Get StatusId from Result

Variable @[User::StatusId] becomes the output parameter for the question mark at left side of equal sign:

 

SELECT TOP 1 ? = StatusId

FROM dbo.RESULT

WHERE Status = 1

 

In case there is no row then @[User::StatusId] get value 0.

 

Exists StatusId ?

Variable @[User::ExistsStatus] becomes the output parameter for the first question mark at the left side of the equal sign, and variable @[User::StatusId] becomes the input parameter of the second question mark at the right side of the equal sign.

 

SELECT ? = IIF(EXISTS(

               SELECT 1

               FROM dbo.STATUS

               WHERE StatusId = ?), 1, 0) -- 1 = True, 0 = False

 

Fx expressions

The left fx is using the logical expression: @[User::ExistsStatus] == True

and when the condition is fulfilled the left green 'Precedence constraint' will continue the control flow.

 

The right fx is using the logical expression: @[User::ExistsStatus] == False

and when the condition is fulfilled the right green 'Precedence constraint' will continue the control flow.

 

Insert into a table

'Execute SQL Task' can do an insert into sql statement with input parameter:

 

INSERT INTO dbo.RESULT(StatusId, Status)

VALUES(?,0)

 

Update of a table

'Execute SQL Task' can do an update sql statement with two input parameters:

 

UPDATE dbo.RESULT       UPDATE r

SET Status = ?          SET Status = ?

WHERE StatusId = ?      FROM dbo.RESULT r

                        WHERE r.StatusId = ?

 

Delete in a table

'Execute SQL Task' can do a delete sql statement with input parameter:

 

DELETE

FROM dbo.RESULT

WHERE StatusId = ?

 

Call stored procedure to do update and get the result back

I like to update a column NumberOf for a specific OrderId and get the result

 

OrderId | NumberOf
1 | 10
2 | 20
3 | 30

 

I make the business logic in a stored procedure with two input parameters and one output parameter; the result of the update is assigned to the output parameter within the UPDATE sql statement itself:

 

ALTER PROCEDURE [dbo].[Order_NumberOf]

      (@OrderId int, @Extra int, @NumberOf int OUTPUT)

AS 

BEGIN 

    BEGIN TRANSACTION 

    UPDATE dbo.OrderNumber

    SET @NumberOf = NumberOf = NumberOf + @Extra

    WHERE OrderId = @OrderId

    COMMIT TRANSACTION 

END

In the 'Control Flow' a first 'Script Task' assigns the two input variables, an 'Execute SQL Task' then executes the stored procedure with three ? parameters mapped at the page 'Parameter Mapping' (Directions Input, Input and Output), and a last 'Script Task' shows the output variable:

 

Dts.Variables["User::OrderId"].Value = 2;

Dts.Variables["User::Extra"].Value = 3;

Dts.TaskResult = (int)ScriptResults.Success;

 

MessageBox.Show(Dts.Variables["User::NumberOf"].Value.ToString());

Dts.TaskResult = (int)ScriptResults.Success;

 

After running the SSIS package it will show 23 in the messagebox and the table looks like:

 

OrderId | NumberOf
1 | 10
2 | 23
3 | 30

 

Parameter Names must be 0, 1, 2, matching the ? parameters in the same order. A named @parameter requires an ADO.NET connection.

 

In case I do not like to do the two package variable assignments in a 'Script Task', I can replace it with two 'Expression Task's that each do one package variable assignment:

 

 

 

Read variable values from a table into SSIS package variables

A table called SSIS_Variable has a row for each variable with a data type and a value:

 

Variable | Datatype | Value
EndDate | DateTime | 31-12-2015
Summertime | Boolean | True
Type | String | Production
Weight | Decimal | 0,15
Year | Int32 | 2015

 

An SSIS package has the same variables, which will be assigned the values from the table by a Script Task:

 

 

'Assign variable' has a C# program that connects to the database and the table, loops over each row and assigns each variable with the right data type:

 

#region Namespaces

using System;

using System.Data;

using Microsoft.SqlServer.Dts.Runtime;

using System.Windows.Forms;

//For connection to SQL Server

using System.Data.SqlClient;

using System.Text.RegularExpressions;

#endregion

 

namespace ST_cad614c72d4f4a1ab407b6cb5a7ec482

{

 [Microsoft.SqlServer.Dts.Tasks.ScriptTask.

   SSISScriptTaskEntryPointAttribute]

 public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

 {

  public void Main()

  {

   try

   {

    //String connectionstring = "Server=<server>;Database=<db>;

    //                           Trusted_Connection=Yes;Connect Timeout=30";

    //Better using the Connection Manager from the package to get connection:

    String connectionstring =

      Dts.Connections["DSAconnectionstring"].ConnectionString.ToString();

    //Cleansing the connectionstring so it can be used for SqlConnection below:

    Regex regProvider = new Regex("Provider=([^;]*);");

    Regex regTranslate = new Regex("Auto Translate=([^;]*);");

    connectionstring = regProvider.Replace(connectionstring, "");

    connectionstring = regTranslate.Replace(connectionstring, "");

    SqlConnection connection = new SqlConnection(connectionstring);

    SqlCommand command = new SqlCommand();

    SqlDataReader reader;

    command.Connection = connection;

    connection.Open();

    command.CommandType = CommandType.Text;

    command.CommandText = "SELECT Variable,Datatype,Value FROM dbo.SSIS_Variable";

    reader = command.ExecuteReader();

    if (reader.HasRows)

    {

     while (reader != null && reader.Read())

     {

      //For debugging:

      //if (reader["Variable"].ToString() == "Year")

      //{

      //MessageBox.Show(reader["Variable"].ToString()+reader["Value"].ToString());

      //}

      //For convert or parse to the right data type for the package variables:

      switch (reader["Datatype"].ToString())

      {

       case "Boolean":

          Dts.Variables[reader["Variable"]].Value = Boolean.Parse(reader["Value"].ToString());

       break;

       case "DateTime":

          Dts.Variables[reader["Variable"]].Value = DateTime.Parse(reader["Value"].ToString());

       break;

       case "Decimal":

          Dts.Variables[reader["Variable"]].Value = Decimal.Parse(reader["Value"].ToString());

       break;

       case "Int32":

          Dts.Variables[reader["Variable"]].Value = Int32.Parse(reader["Value"].ToString());

       break;

       default: //For String

         Dts.Variables[reader["Variable"]].Value = reader["Value"].ToString();

       break;

      }                                       

     }

    }

    reader.Close();

    reader.Dispose();

    reader = null;

    command.Dispose();

    command = null;

    connection.Close();

    connection.Dispose();

    connection = null;

   }

   catch(SqlException ex)

   {

    MessageBox.Show("No connection: " + ex.Message);

   }

   Dts.TaskResult = (int)ScriptResults.Success;

  }

 }

}

 

The 'Script Task' »Show variable« has a C# program:

 

 public void Main()

 {

  MessageBox.Show(Dts.Variables["User::Year"].Value.ToString());

  MessageBox.Show(Dts.Variables["User::Weight"].Value.ToString());

  MessageBox.Show(Dts.Variables["User::EndDate"].Value.ToString());

  MessageBox.Show(Dts.Variables["User::Type"].Value.ToString());

  if ((bool)Dts.Variables["User::Summertime"].Value == true)

     MessageBox.Show("Summertime");

  else

     MessageBox.Show("Wintertime");

  Dts.TaskResult = (int)ScriptResults.Success;

 }

 

Package parameter in child package to be assigned from a parent package

A SSIS package can be parameterized like a stored procedure by using parameters that are placed at the tab next to Control Flow and Data Flow, and parameters can also have a default value, which is useful when executing the package during development.

 

The SSIS package parameter has Required = True, meaning that the calling parent SSIS package needs to provide a parameter value to its child. In case the parent does not always have a parameter value, Required should be set to False and then Value will be the default parameter value.

 

Inside a package I can use a parameter as if it was a variable in an Expression Task or a Script Task, where I write it like this: $Package::FromYearMonth.

 

But a package parameter is a read-only variable, therefore it cannot be assigned in an Expression Task like this: @[$Package::FromYearMonth] = 201512

A parameter can only be used at the right side of the equal sign, like this:
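For example, assigning the parameter value to a package variable in an Expression Task (the variable name is just an illustration):

@[User::FromYearMonth] = @[$Package::FromYearMonth]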

 

 

A SSIS package acts as a parent package when it executes or calls another SSIS package as a child package and passes a value to the parameter through an 'Execute Package Task' like this:

 

 

 

 

5. Control Flow, workflow in design mode and runtime mode

When I open a SSIS package in design mode for development, I like to see the control flow as it will be in runtime mode, therefore I do not like a package to dynamically disable a 'Data Flow Task' (DFT) by an expression fx mark at the 'Data Flow Task' or by 'Script Task' programming.

 

I like to use Precedence constraints (green arrows) with expression fx to show the control flow or workflow of the package in a self-explanatory graphical user interface perspective. The control flow determines the order in which executables (tasks and containers) will run and under what conditions they are executed. In other words, certain executables will run only when a set of defined conditions is met in an expression fx at the precedence constraint green arrows.

 

Setting Precedence Constraints on Tasks and Containers for a linear or branching control flow like this, where I fetch two boolean values from a table into two package variables and let their true/false values control the runtime workflow:

  • Always do Fetch Customers as the first task.
  • After Fetch Customers do the Fetch Products task if the variable is true.
  • Either after Fetch Customers or after Fetch Products do the Fetch Suppliers task.
  • Only do Fetch Orders and details if the variable is true.

 

 

·       A Precedence constraint green arrow has an expression, shown by an fx logo.

·       A solid arrow means AND: all precedence constraints that point to the constrained executable must evaluate to true in order for that executable to run.

·       A dotted arrow means OR: only one precedence constraint that points to the constrained executable must evaluate to true in order for that executable to run.

 

In runtime mode

 

 

Fetch data variable

I have made a table FetchData with columns of bit data type to control the flow; the values will be fetched into two package variables through an 'Execute SQL Task':
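A minimal sketch of what the table and the fetch statement could look like; the column names are assumptions that simply match the package variables used below:

CREATE TABLE dbo.FetchData
(FetchProducts bit NOT NULL,
 FetchOrders   bit NOT NULL)

-- 'Execute SQL Task' with ResultSet 'Single row' mapped to the
-- package variables User::FetchProducts and User::FetchOrders:
SELECT FetchProducts, FetchOrders
FROM dbo.FetchData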

 

 

 

From FetchCustomers to FetchProducts

When the package variable FetchProducts has the value True from the table, the workflow will do FetchProducts.

 

I connect the two tasks with a precedence constraint green arrow, rightclick the arrow and select {Edit}; in the first dropdown I select 'Expression and Constraint' so both the Expression »@[User::FetchProducts] == True« and the Constraint of »FetchCustomers« must be true/success to continue with the FetchProducts task. I use the default Logical AND, which gives a solid arrow:

 

 

From FetchProducts to Fetch Suppliers

I do the same as above, still using Logical AND, and get a solid arrow.

 

From FetchCustomers to Fetch Suppliers

When the package variable FetchProducts has the value False from the table, the workflow will only do FetchSuppliers, therefore the path above is skipped and FetchProducts is not executed.

 

I connect the two tasks with a precedence constraint green arrow, rightclick the arrow and select {Edit}; in the first dropdown I select 'Expression and Constraint' so both the Expression »@[User::FetchProducts] == False« and the Constraint of »FetchCustomers« must be true/success to continue with the Fetch Suppliers task. I now use Logical OR, which gives a dotted arrow:

 

 

The solid arrow from FetchProducts to Fetch Suppliers automatically becomes dotted to show in the self-explanatory graphical user interface that Fetch Suppliers only depends on either FetchProducts or FetchCustomers having been done.

 

To Fetch Orders and details / Do nothing, continue when FetchOrders is True

Task Fetch Orders and details will only be done if the package variable FetchOrders has the value true. To show this in a self-explanatory graphical user interface perspective I am using an empty 'Script Task' that has no variables and no programming, and from the 'Script Task' to Fetch Orders and details I add this Expression: »@[User::FetchOrders] == True«. There is no need for a Constraint because the 'Script Task' will never fail:

 

 

 

Do nothing, continue when package parameter is True

A SSIS package can receive a value from another package through a parameter, and then the package can control the flow depending on the value of the parameter.

 

An empty 'Script Task' »Do nothing, continue when Fetch is True« has no variables and no programming; it is just placed there as a dummy object, so I can make a Precedence constraint green arrow that has an expression shown by an fx logo

fx: @[$Package::Fetch] == True

for programming a self-explanatory graphical user interface of the control flow of the package like this:

 

 

When the fx expression condition is right/true, the job will enter the 'Sequence Container'.

 

 

When the fx expression condition is wrong/false, the job will not enter the 'Sequence Container'.

 

 

Both 'Execute SQL Task's for »Begin package« and »End package« for log registration will be done, and »Fetch Stock« will also be done because it is independent of the Fetch package parameter.

 

 

When a package variable or a package parameter has data type Int32 and a default value of 0, the Expression for continuation with the task can be like this:

 

@[$Package::parameter] != 0 && !ISNULL(@[$Package::parameter])

 

When a package variable or a package parameter has data type String and a default value of 0000 (from an nchar(4) column in a table), the Expression for continuation with the task can be like this:

 

@[$Package::param] != "0000" && !ISNULL(@[$Package::param]) && @[$Package::param] != ""

 

 

 

6. Package Job Log to raise and catch error failures

When I have parent and child packages, I like to wrap them all into one Job package with a simple joblog table that takes care of ensuring the job cannot run two times at the same time and has a few status states like 'Running', 'Finish' and 'Error'. To catch an error in the Job package or in the parent and child packages below it, I will use Event Handlers, which is a tab next to Control Flow, Data Flow and Parameters. The main Job package can look like this, where I use fx expressions to control the process flow:

 

 

A JobLog table has some columns:
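A minimal sketch of its definition, assuming an Identity column for the JobLogId that the insert statement below returns:

CREATE TABLE dbo.JobLog
(JobLogId     int IDENTITY(1,1) NOT NULL CONSTRAINT PK_JobLog PRIMARY KEY,
 Status       nvarchar(20)  NOT NULL,
 BeginTime    datetime      NOT NULL,
 EndTime      datetime      NULL,
 ErrorMessage nvarchar(max) NULL)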

 

JobLogId  Status  BeginTime                EndTime                  ErrorMessage
6         Finish  2015-01-01 04:00:00.000  2015-01-01 04:56:02.123
7         Finish  2015-01-02 04:00:00.000  2015-01-02 04:43:50.497
8         Error   2015-01-03 04:00:00.000  2015-01-03 04:12:31.617  Error while executing package "Suppliers.dtsx".

 

Execute SQL: JobLog Status is not Running

To check that the job is not already running, i.e. that no row has the status 'Running':

SELECT ? = IIF(EXISTS(SELECT 1 FROM dbo.JobLog WHERE Status = 'Running'), 0, 1)

and set package variable Status_Not_Running to the output value True or False.

fx: @[User::Status_Not_Running] == True to continue the job control flow.

fx: @[User::Status_Not_Running] == False to send a failure email because the job is already running.

 

Execute SQL: Job Status set to Running and set BeginTime and get next JobLogId

To insert a new row in JobLog, fetch the next JobLogId from the Identity number, set Status to 'Running' and set the BeginTime datetime stamp:

BEGIN TRANSACTION

DECLARE @InsertedRow TABLE (JobLogId int)

INSERT INTO dbo.JobLog (Status, BeginTime)

OUTPUT inserted.JobLogId INTO @InsertedRow

VALUES ('Running', Getdate())

SELECT ? = JobLogId

FROM @InsertedRow

COMMIT TRANSACTION

and set package variable JobLogId to the output value.

fx: @[User::JobLogId] >= 1 to continue the job control flow.

fx: @[User::JobLogId] == 0 to send a failure email because no new JobLogId was created.

 

Sequence Container: Execute packages in parallel

The Sequence Container executes multiple packages in parallel at the same time.

 

Execute SQL: JobLog Status set to Finish and set EndTime

To update the new row in JobLog, set Status to 'Finish' and set EndTime:

BEGIN TRANSACTION

UPDATE dbo.JobLog

SET Status = 'Finish', EndTime = Getdate()

WHERE JobLogId = ?

COMMIT TRANSACTION

with package variable JobLogId as input value.

 

Event Handlers

I click at the tab Event Handlers, and in the left dropdown I select the name of the package, which represents package-level scope for the event. In the right dropdown I select 'OnError' to catch an error in the process or in the child packages.

To get a canvas or design surface to drag in SSIS components, I click at line link:

 

              Click here to create an 'OnError' event handler

 

The canvas will become dark gray and I can drag in an 'Execute SQL Task' or copy one from 'Control Flow' tab:

 

Execute SQL: JobLog Status set to Error and set EndTime

To update the new row in JobLog, set Status to 'Error', set EndTime and get the system error message from the package:

BEGIN TRANSACTION

UPDATE dbo.JobLog

SET Status = 'Error', EndTime = Getdate(), ErrorMessage = ?

WHERE JobLogId = ?

COMMIT TRANSACTION

with package variable JobLogId as input value and the first ? parameter with index 0 mapped to the system variable System::ErrorDescription as input, shown in the picture before.

 

Event Handlers have many events like OnPreExecute and OnPostExecute that can do things before and after the package process.

 

Of course the Event Handlers can also be placed in each parent and child package to fetch a better System::ErrorDescription message to be registered in JobLog.

 

Besides System::ErrorDescription there are System::SourceName, System::TaskName and System::EventHandlerStartTime (ErrorTime) for other columns in the JobLog table.
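If such extra columns are added to JobLog, the OnError statement above could be extended like this sketch; the column names are assumptions, and the extra ? parameters are mapped to the corresponding system variables in the Parameter Mapping:

UPDATE dbo.JobLog
SET Status = 'Error', EndTime = Getdate(), ErrorMessage = ?,
    SourceName = ?, TaskName = ?, ErrorTime = ?
WHERE JobLogId = ?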

 

A SQL Server Agent Job will, in case of a failure, be red with the message 'The step failed' when the Job package raises an error.

 

Great tip

When using an event handler OnPreExecute or OnPostExecute in SSIS to log execution data into an audit table, the event can be fired multiple times, but we want it to fire only once. Add this to the event handler:

 

  1. Go to the Event Handlers tab and select either OnPreExecute or OnPostExecute.
  2. In the Properties list go to Expressions.
  3. Click at [...].
  4. In the dialogbox under Property select Disable in the dropdown.
  5. In the same dialogbox under Expression click at [...].
  6. Copy-paste this into the expression:

          @[System::PackageName] != @[System::SourceName]

  7. Click OK two times.
  8. In the Properties list go to Expressions and click [+] to open it and see the Disable expression: when PackageName is different from SourceName, the event handler will be disabled and will not execute the canvas.

 

Every time the SSIS package wants to call the event handler, the Disable property expression is evaluated, and when it becomes true the event handler is disabled. SourceName is the executable that raised the event, and when that is the PackageName, which we are interested in, the expression becomes false, meaning Disable becomes false, meaning the event handler becomes enabled and will execute the canvas. Normally I do not add the above tip to the OnError Event Handler.

 

Catch an error  – alternative to OnError

A Sequence Container can be used for many different tasks, and in case you like to catch an error failure inside the Sequence Container, it can be done like this:

 

 

 

The database SSISDB has views to show data about the SSIS package execution:

 

SELECT  E.execution_id,

        E.project_name,

        E.package_name,

        S.start_time,

        S.end_time,

        S.execution_result

FROM [SSISDB].[catalog].[executions] E

JOIN [SSISDB].[catalog].[executables] EX ON E.execution_id = EX.execution_id

JOIN [SSISDB].[catalog].[executable_statistics] S

        ON E.execution_id = S.execution_id

 

SSIS error codes in the daily runs are often caused by a bad connection at the server, which often resolves itself: 0x80004005 or 0xC0202009.

 

 

 

7. Execute T-SQL Statement Task

At Control Flow drag in 'Execute T-SQL Statement Task', rightclick {Edit} and click [New] the first time it is used in a project. Call the connection TSQL<databasename> and type in the SQL Server name (or use . or localhost). Type in a sql statement like an UPDATE, DELETE, TRUNCATE, EXEC stored procedure or a dummy like »print 'hello'«. A new connection is shown at the bottom of the package in Connection Managers, and it is a new connection because this task uses the ADO.NET connection manager type with '.Net Providers\SqlClient Data Provider'. Rightclick the TSQL connection and choose {Convert to Project Connection}. In Solution Explorer open the connection and select a database for the sql statement, else the database name has to be part of the statement: FROM <database>.dbo.<table>. Back in the package rightclick the TSQL connection and choose {Parameterize}; the connection string then becomes a project parameter, and the value in the dialogbox only needs to be:

 

Data Source=<server>;Initial Catalog=<database>;Integrated Security=True;Connect Timeout=30;

 

Remember to click [Save All]; the parameter can be mapped to an Environment variable in the SSISDB catalog. I can now go back and type in the right sql statement.

 

A project will then have two connections to the same database: one for 'Data Flow Task' and one for 'Execute T-SQL Statement Task'.

 

 

8. Kimball SCD type 7 dimension

I will show a way to implement a type 7 dimension in a SSIS package, fetching data from a source table in a source system to be stored in a dimension table with ten metadata columns that support type 7, which is an extension of types 1, 2 and 3 plus handling of deleted values. There is an example in my article Dimensional modeling that is good to read before programming the package that will perform the ETL process from source to dimension.

 

In case you don’t like the Previous reference _pkey column in the dimension table from the article, you can skip the _pkey column in the table and in the SSIS package implementation, because at the end of this tip I will show an alternative sql update without _pkey. Myself, I have good usage of the _pkey column when fetching from a dimension.

 

The two table structures for this example:

 

CREATE TABLE [dbo].[Source_System]

([Id] [int] NOT NULL,                              -- Business key in source system therefore primary key here.

 [Name] [nvarchar](50) NULL,

 [Version] [decimal](5, 2) NULL,

 CONSTRAINT [PK_Source_System] PRIMARY KEY CLUSTERED

([Id] ASC) ) ON [PRIMARY]

GO

CREATE TABLE [dbo].[Dim_System]

([System_key] [int] IDENTITY(1,1) NOT NULL,

 [System_dkey] [int] NOT NULL,

 [System_pkey] [int] NOT NULL,

 [System_bkey] [int] NOT NULL,

 [System] [nvarchar](50) NULL,

 [Version] [decimal](5, 2) NULL,

 [ValidFrom] [datetime] NOT NULL,

 [ValidTo] [datetime] NOT NULL,

 [IsCurrent] [bit] NOT NULL,

 [IsDeleted] [bit] NOT NULL,

 [InsertTime] [datetime] NOT NULL,

 [UpdateTime] [datetime] NOT NULL,

 CONSTRAINT [PK_Dim_System] PRIMARY KEY CLUSTERED

([System_key] ASC) ) ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_Dim_System_bkey] ON [dbo].[Dim_System]

([System_bkey] ASC) ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_Dim_System_dkey] ON [dbo].[Dim_System]

([System_dkey] ASC) ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_Dim_System_pkey] ON [dbo].[Dim_System]

([System_pkey] ASC) ON [PRIMARY]

GO

 

Remark: datetime2(n) has a date range of 0001-01-01 through 9999-12-31 while datetime starts from 1753-01-01. datetime takes 8 bytes in storage size. datetime2 takes 6 bytes for precisions less than 3, 7 bytes for precisions 3 and 4, else 8 bytes. datetime2(3) has roughly the same precision as datetime (datetime is accurate to about 3 milliseconds). smalldatetime has a date range of 1900-01-01 through 2079-06-06 with 00 seconds in the time part and takes 4 bytes in storage size, and with it I could use »the forever date« like 2070-12-31.

I choose datetime and let ValidFrom default to 1900-01-01 and ValidTo default to 9999-12-31.

 

The source system does not provide any datetime column for when data is valid from, nor any status flag for New, Created, Changed, Updated or Deleted data, therefore the SSIS package has to do the data detection of New, Changed or Deleted data in the source system and reflect it in the dimension table.

 

In the SSIS package I’m not using 'OLE DB Command'; instead the sql update of metadata columns in previous rows will be done by sql statements that don’t need variables, therefore I’m placing them in an 'Execute T-SQL Statement Task' to get the best performance.

 

Based on full load from source.

 

I use 'Data Flow Task', 'Execute T-SQL Statement Task', 'OLE DB Source', 'Merge Join', 'Conditional Split', 'Derived Column' and 'OLE DB Destination' in the flow of data from the source system table Source_System to a type 7 dimension table Dim_System in a data mart:

 

 

I will make a SSIS package where data in the Source_System table will drive a Kimball SCD type 7 in the Dim_System table. From the Source_System table the package will do data detection of:

 

  • New data in Source_System that does not exist in Dim_System.
  • Changed data in Source_System compared to data in Dim_System.
  • Deleted data in Source_System that still exists in Dim_System.
  • Unchanged data in Source_System that has not changed.

 

The package will insert rows from the Source_System table into the Dim_System table together with metadata columns. The package will end by updating rows in table Dim_System to set the _dkey column and other metadata columns.

 

Since the metadata columns will use default values, I can either type in the default values for each metadata column in the table design of Dim_System, or I can make them as parameters and variables for the SSIS project and the SSIS package like this:

 

Parameter name        Datatype  Value
defaultdkey           Int32     0
defaultpkey           Int32     0
defaultIsCurrent      Boolean   True    (same as 1 in sql)
defaultIsDeleted      Boolean   True    (same as 1 in sql)
defaultIsDeleted_Not  Boolean   False   (same as 0 in sql)
defaultValidFrom      DateTime  01-01-1900 00:00:00
defaultValidTo        DateTime  31-12-9999 00:00:00
DMAconnstr            String
TSQL_DMAconnstr       String

 

 

Variable name    Datatype  Expression / Value
currentDateTime  DateTime  GETDATE()
RowCount         Int32     0

 

In 'Control Flow' I drag in a 'Data Flow Task' and inside it I drag in two 'Source's for the tables Source_System and Dim_System, with sql statements that make sure the data has the right sort order for a later 'Merge Join' that will join over the business keys; therefore I order by the business keys in both sql statements and only include the columns the package needs, like this:

 

'Source' to get data from the source table:

SELECT Id, Name, Version
FROM Source_System
ORDER BY Id  -- business key

'Source' to get data from the dimension:

SELECT System_bkey, System, Version, System_key, System_dkey
FROM Dim_System
WHERE IsCurrent = 1 AND IsDeleted = 0
ORDER BY System_bkey  -- business key

 

When the sql statement takes data from an archive, there will be a WHERE Active = 1.

 

Since both 'Source' tables use the same column names, I will rename the columns of the Source_System table by editing the 'Source', going to the Columns page and typing a prefix s_ on the output columns like this:

 

 

At both 'Source' rightclick {Show Advanced Editor}, tab 'Input and Output Properties', click at 'OLE DB Source Output' and set property IsSorted to True and open 'OLE DB Source Output' and open 'Output Columns' and select the first column from ORDER BY and set property SortKeyPosition to 1, see previous tip about this.

 

Drag in 'Merge Join' and connect the two 'Source' tables, one of them as 'Merge Join Left Input', but in reality I want a 'Full outer join' because it will give me all rows from both the 'Source' and the Dimension table, and both business keys will be null if there is no match. I include all columns from both tables because:

 

  • For new data I will use columns from Source_System.
  • For changed data I will compare columns from both tables.
  • For deleted data I will use columns from Dim_System.
  • Unchanged data is not relevant because the data is »same same«.

 

I rightclick {Edit} and select join type 'Full outer join' and select all columns:

 

 

Full outer join merges two datasets and returns all rows from both datasets into a new result dataset.

 

'Full outer join' will provide data for four situations called:

 

  • New when the source table has a row that does not exist in the dimension table, therefore all columns from Dim_System are Null.
  • Changed when the source has values that are different compared to the dimension.
  • Deleted when the source does not have the value anymore but the dimension will keep it, therefore all columns from Source_System are Null.
  • Unchanged when the source data is still the same and already exists in the dimension.

 

I will divide or filter the data into its own pipeline for each situation through a 'Conditional Split' with three conditions using the business key and the value columns from the source table (prefix s_) and the dimension table.

 

Pipeline: New
Rule: ISNULL(System_bkey)
Remark: The business key in the dimension does not match a source business key, therefore the source must come with new data.

Pipeline: Deleted
Rule: ISNULL(s_Id)
Remark: The business key in the source does not match a dimension business key, therefore the source must have deleted the data.

Pipeline: Changed
Rule: REPLACENULL(s_Name,"¤") != REPLACENULL(System,"¤") || REPLACENULL((DT_WSTR,30) s_Version,"¤") != REPLACENULL((DT_WSTR,30) Version,"¤")
Remark: The source has changed some of the values, therefore the dimension needs a new row with new ValidFrom/ValidTo. In case of many values a checksum can combine them into one value to save many || = OR comparisons. != means not equal to.

Pipeline: Unchanged
Rule: The default output name, i.e. the complement of the rules above.
Remark: The source has not changed the data, therefore there is no need to do more; I send the pipeline to a Row Count, but it can be dropped before deployment.

 

The order (rank) of the rules is important; otherwise Changed must start with a condition to make sure that both business keys have a value and therefore exist in both tables: !ISNULL(System_bkey) && !ISNULL(s_Id) && (<rule as above>).

 

 

The New pipeline will insert new rows into the Dim_System table, therefore the package has to set default values for the metadata columns using the project parameters and the package variable. I do it in a 'Derived Column' that adds extra columns to the New pipeline with a default value typed in as an Expression referring to the parameters and the variable. ValidFrom is set to its general »begindate« because the data is new in the Dim_System table, and ValidTo is set to its general end date called »forever enddate«. Sometimes the source system provides ValidFrom for when the data was created inside the system:

 

Derived Column name  Derived Column         Expression
System_dkey          Replace 'System_dkey'  @[$Project::defaultdkey]
System_pkey                                 @[$Project::defaultpkey]
ValidFrom                                   @[$Project::defaultValidFrom]
ValidTo                                     @[$Project::defaultValidTo]
IsCurrent                                   @[$Project::defaultIsCurrent]
IsDeleted                                   @[$Project::defaultIsDeleted_Not]
InsertTime                                  @[User::currentDateTime]
UpdateTime                                  @[User::currentDateTime]

 

I don’t like a metadata column to be Null or n/a, therefore UpdateTime gets the same datetime stamp as InsertTime, so both columns have the same value until the data in the row is either changed or deleted in the source system.

 

 

The New pipeline with metadata column default values can now be inserted into the Dim_System table. I drag in a 'Destination', connect the New pipeline to it and select the Dim_System table as fast load. There is a default mapping that matches columns with the same name, because the pipeline contains columns from both tables Source_System and Dim_System after the full outer join:

 

 

But the source columns (left side) have the prefix s_ and they need to be mapped to the right columns in the destination table Dim_System (right side), so I rearrange the mapping:

 

Pipeline column  Dimension column  Remark
s_Id             System_bkey       Primary key from source becomes business key in destination.
s_Name           System            Destination does not use the generic word Name; the column is named System.
s_Version        Version           From source to destination.
System_dkey      System_dkey       Taken from default value.
System_pkey      System_pkey       Taken from default value.
ValidFrom        ValidFrom         Taken from default value.
ValidTo          ValidTo           Taken from default value.
IsCurrent        IsCurrent         Taken from default value.
IsDeleted        IsDeleted         Taken from default value.
InsertTime       InsertTime        Taken from default value.
UpdateTime       UpdateTime        Taken from default value.

 

Metadata columns System_dkey and System_pkey for the new rows will be set to the right value in a sql update statement at the end of the ETL process.

 

After I have deleted the map lines and dragged new map lines, the mapping is rearranged. I can’t use the input destination columns because the data is new in the source and therefore is Null in the New pipeline, so all the metadata columns come from the default values to be inserted into the Dim_System table as new rows:

 

 

Metadata column System_dkey will be set to the value of System_key in a sql update statement at the end of the ETL process, because System_key is an Identity sequential number that first gets its value after the row has been inserted. System_dkey then inherits the number, so in the first row both columns have the same number. When the data later is changed or deleted, it becomes a new row in the Dim_System table with a new number in System_key, and metadata column System_pkey will hold a previous reference back to the System_key number of the previous row before the data was changed or deleted.

 

The Changed pipeline will insert new rows into the Dim_System table, therefore the package has to set default values for the metadata columns through a 'Derived Column', where ValidFrom is set to currentDateTime because that is the time when Dim_System receives the changed data from Source_System. Sometimes the source system provides ValidFrom for when the data was changed/updated inside the system:

 

Derived Column name  Expression
ValidFrom            @[User::currentDateTime]
ValidTo              @[$Project::defaultValidTo]
IsCurrent            @[$Project::defaultIsCurrent]
IsDeleted            @[$Project::defaultIsDeleted_Not]
InsertTime           @[User::currentDateTime]
UpdateTime           @[User::currentDateTime]

 

Metadata columns System_dkey and System_pkey will inherit the values from the existing rows in the Dim_System table.

 

 

The Changed pipeline with metadata column default values can now be inserted into the Dim_System table. I drag in a 'Destination', connect the Changed pipeline to it and select the Dim_System table as fast load. I rearrange the mapping:

 

Pipeline column  Dimension column  Remark
s_Id             System_bkey       Primary key from source becomes business key in destination.
s_Name           System            Destination does not use the generic word Name; the column is named System.
s_Version        Version           Changed/updated Version to destination.
System_key       System_pkey       Previous reference is inherited from old row to new row for changed data.
System_dkey      System_dkey       The durable key is inherited from old row to new row for changed data.
ValidFrom        ValidFrom         Taken from default value.
ValidTo          ValidTo           Taken from default value.
IsCurrent        IsCurrent         Taken from default value.
IsDeleted        IsDeleted         Taken from default value.
InsertTime       InsertTime        Taken from default value.
UpdateTime       UpdateTime        Taken from default value.

 

 

Metadata columns ValidTo, IsCurrent and UpdateTime for the old previous rows will be set to the right value in a sql update statement at the end of the ETL process.

 

The Deleted pipeline will insert new rows into the Dim_System table, therefore the package has to set default values for the metadata columns through a 'Derived Column', where ValidFrom is set to currentDateTime because that is the time when Dim_System found out that the data has been deleted in Source_System. Sometimes the source system provides ValidFrom for when the data was deleted inside the system. The new row will be marked with IsDeleted = 1 = true because the row represents the deleted data in the source system, therefore I use another parameter default value here:

 

Derived Column name  Expression
ValidFrom            @[User::currentDateTime]
ValidTo              @[$Project::defaultValidTo]
IsCurrent            @[$Project::defaultIsCurrent]
IsDeleted            @[$Project::defaultIsDeleted]
InsertTime           @[User::currentDateTime]
UpdateTime           @[User::currentDateTime]

 

Metadata columns System_dkey and System_pkey will inherit the values from the existing rows in the Dim_System table.

 

 

The Deleted pipeline with metadata column default values can now be inserted into the Dim_System table. I drag in a 'Destination', connect the Deleted pipeline to it and select the Dim_System table as fast load. I rearrange the mapping, where I can’t use the prefix s_ columns because the data has been deleted in the source and therefore is Null in the Deleted pipeline:

 

Pipeline column  Dimension column  Remark
System_bkey      System_bkey       Business key is inherited from old row to new row for deleted data.
System           System            System value is inherited from old row to new row for deleted data.
Version          Version           Version value is inherited from old row to new row for deleted data.
System_key       System_pkey       Previous reference is inherited from old row to new row for deleted data.
System_dkey      System_dkey       The durable key is inherited from old row to new row for deleted data.
ValidFrom        ValidFrom         Taken from default value.
ValidTo          ValidTo           Taken from default value.
IsCurrent        IsCurrent         Taken from default value.
IsDeleted        IsDeleted         Taken from default value.
InsertTime       InsertTime        Taken from default value.
UpdateTime       UpdateTime        Taken from default value.

 

 

Metadata columns ValidTo, IsCurrent and UpdateTime for the old previous rows will be set to the right value in a sql update statement at the end of the ETL process.

 

Metadata columns

The SSIS package control flow ends with a TSQL task that has two sql update statements.

 

  • The first update statement is only for new data rows from the source system to the dimension table, to set metadata column System_dkey to the same number value as the primary key column System_key, which is an Identity sequential number that first gets its value after the new rows have been inserted. The default value of System_dkey is 0, therefore it is easy to find the new rows and set System_dkey to the same value as System_key. Sometimes a dimension has an Unknown value with _key, _dkey and _pkey as 0 and _bkey as '¤' to handle source data when a business key does not exist in the dimension, therefore I add a criterion that System_key is greater than 0:

 

UPDATE d

SET System_dkey = d.System_key

FROM dbo.Dim_System d

WHERE d.System_key > 0 AND d.System_dkey = 0

 

  • The second update statement is for the old previous rows in the dimension table, to set their metadata columns. I’m using column System_pkey from the new rows to connect to the old previous rows through their primary key column System_key and perform an update of three metadata columns in the old previous rows:

 

 * Set ValidTo to the same datetime stamp that the new row got in ValidFrom.

 * Set IsCurrent = 0 = false because it is the new row that is the current row.

 * Set UpdateTime to the current datetime when the previous row was updated.

 

The sql statement uses a self-join between the new row with alias n and the previous row in the dimension with alias d, joining from new row column System_pkey to previous row column System_key where both have IsCurrent = 1, to fetch the new row and the last previous row and not older historical rows. The update then sets IsCurrent = 0 for the previous row, therefore there will be only one row with IsCurrent = 1 per data value:

 

UPDATE d

SET ValidTo = n.ValidFrom, IsCurrent = 0, UpdateTime = Getdate()

FROM dbo.Dim_System d

     INNER JOIN dbo.Dim_System n ON n.System_pkey = d.System_key

WHERE d.System_key > 0 AND n.IsCurrent = 1 AND d.IsCurrent = 1

 

In case you don’t like to use the Previous reference _pkey metadata column in a dimension table, the second sql update statement can use System_dkey to perform the self-join, together with an extra condition that the new row System_key must be greater than the previous row System_key to avoid a row joining to itself:

 

UPDATE d

SET ValidTo = n.ValidFrom, IsCurrent = 0, UpdateTime = Getdate()

FROM dbo.Dim_System d

     INNER JOIN dbo.Dim_System n ON n.System_dkey = d.System_dkey AND

                                    n.System_key > d.System_key

WHERE d.System_key > 0 AND n.IsCurrent = 1 AND d.IsCurrent = 1

 

 

 

The SSIS package is ready to be tested.
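A simple way to inspect the result after a test run could be a query like this, ordering by business key and ValidFrom as the example tables in chapter 10 do:

SELECT System_key, System_dkey, System_pkey, System_bkey, System, Version,
       ValidFrom, ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime
FROM dbo.Dim_System
ORDER BY System_bkey, ValidFrom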

 

Approaches for writing to the dimension table

The approach above starts by inserting the rows into the dimension table and thereafter updating the metadata columns. Another approach would be to use a staging table which contains the changed and deleted data and a ValidFrom column that is assigned inside the staging table; thereafter ValidFrom is used to update the previous rows' ValidTo column in the dimension table together with the other metadata columns, and thereafter the rows from the staging table are inserted into the dimension table.

 

Merge in T-SQL

The process can also be solved by T-SQL programming with INSERT INTO and UPDATE or with a MERGE, but it needs two updates afterwards, like this where Target is table Dim_System and Source is table Source_System and Merge compares the business key between source and target.

Remark that (ISNULL(Source.Name,'¤') <> ISNULL(Target.System,'¤')) is not working for Merge when a value changes to null.

Based on full load from source.

 

 

DECLARE @defaultdkey int = 0

DECLARE @defaultpkey int = 0

DECLARE @defaultIsCurrent bit = 1

DECLARE @defaultIsDeleted bit = 1

DECLARE @defaultIsDeleted_Not bit = 0

DECLARE @defaultValidFrom datetime = '1900-01-01 00:00:00'

DECLARE @defaultValidTo datetime   = '9999-12-31 00:00:00'

DECLARE @currentDateTime datetime  = GETDATE() -- or IdaInsertTime.

 

INSERT INTO dbo.Dim_System -- Insert rows for updated and deleted data into Target table

(System_bkey, [System], [Version], System_dkey, System_pkey, ValidFrom, ValidTo, IsCurrent,

 IsDeleted, InsertTime, UpdateTime)

SELECT System_bkey, [Name], [Version], System_dkey, System_key, @currentDateTime,

  @defaultValidTo, @defaultIsCurrent, @defaultIsDeleted_Not, @currentDateTime, @currentDateTime

FROM -- the columns in the select come from the output action columns; System_key is for System_pkey

(

MERGE dbo.Dim_System AS [Target]

USING (SELECT [Id], [Name], [Version] FROM dbo.Source_System) AS [Source]

ON [Target].System_bkey = [Source].Id AND [Target].IsCurrent = 1 AND [Target].IsDeleted = 0

WHEN NOT MATCHED BY TARGET -- Insert rows for new data into Target table

  THEN INSERT (System_bkey, [System], [Version], System_dkey, System_pkey, ValidFrom, ValidTo,

               IsCurrent, IsDeleted, InsertTime, UpdateTime)      

  VALUES ([Source].[Id], [Source].[Name], [Source].[Version], @defaultdkey, @defaultpkey,

          @defaultValidFrom, @defaultValidTo, @defaultIsCurrent, @defaultIsDeleted_Not,

          @currentDateTime, @currentDateTime)

WHEN MATCHED AND ((([Target].[System] <> [Source].[Name]) OR ([Target].[System] IS NOT NULL

    AND [Source].[Name] IS NULL) OR ([Target].[System] IS NULL AND [Source].[Name] IS NOT NULL))

  OR (([Target].[Version] <> [Source].[Version]) OR ([Target].[Version] IS NOT NULL

    AND [Source].[Version] IS NULL) OR ([Target].[Version] IS NULL

    AND [Source].[Version] IS NOT NULL)))

  -- Update rows in Target table because rows is not current anymore because Source change value

  THEN UPDATE SET ValidTo = @currentDateTime, IsCurrent = 0, UpdateTime = @currentDateTime

WHEN NOT MATCHED BY SOURCE AND [Target].IsCurrent = 1 AND [Target].IsDeleted = 0

  -- Update rows in Target table because rows has been deleted in Source table.

  THEN UPDATE SET ValidTo = @currentDateTime, IsCurrent = 0, UpdateTime = @currentDateTime 

  -- The updated rows in Target become output and inserted as new rows in Target via INSERT INTO

OUTPUT $Action AS ActionOutput, inserted.System_key, inserted.System_bkey, inserted.System_dkey,

                                [Source].[Name], [Source].[Version]

) AS MergeOutput WHERE MergeOutput.ActionOutput = 'UPDATE'

 

-- For the new rows in Target set System_dkey to the same value as the given Identity for System_key

UPDATE d

SET System_dkey = d.System_key

FROM dbo.Dim_System d

WHERE d.System_key > 0 AND d.System_dkey = 0

 

-- For the new rows in Target that are deleted, set IsDeleted and fetch values from the previous row

UPDATE d 

SET IsDeleted = @defaultIsDeleted, [System] = p.[System], [Version] = p.[Version]

FROM dbo.Dim_System d

     INNER JOIN dbo.Dim_System p ON p.System_key = d.System_pkey

WHERE d.System_key > 0 AND d.IsCurrent = 1 AND d.IsDeleted = 0 AND d.[System] IS NULL AND

      d.[Version] IS NULL

 

MERGE is mostly for inserting new data and updating existing data as in Kimball type 1. For other Kimball types like 2 or 7 that want to insert a new current row, MERGE has an OUTPUT command to fetch these rows, and an INSERT INTO statement around the MERGE statement then inserts the rows into the Target table as new rows, including a new row when data no longer exists in the source.

 

Insert and Update instead of Merge in T-SQL stored procedure

Based on full load from source.

 

CREATE PROCEDURE [dbo].[DimensionT7_InsertUpdate]

AS

BEGIN

SET NOCOUNT ON

 

DECLARE @defaultValidFrom datetime = '1900-01-01 00:00:00'

DECLARE @defaultValidTo   datetime = '9999-12-31 00:00:00'

DECLARE @currentDateTime  datetime = GETDATE() -- or IdaInsertTime.

 

DECLARE @New_ROW_COUNT     int = 0

DECLARE @Deleted_ROW_COUNT int = 0

DECLARE @Changed_ROW_COUNT int = 0

 

-- I am using a @table_variable instead of a #temporary_table or a real_stage_table inside db or

-- a memory optimized table. I have added primary key index on the columns I will join to later.

DECLARE @NewRows TABLE (System_key int NOT NULL PRIMARY KEY CLUSTERED)

DECLARE @DeletedRows TABLE (System_key int, System_dkey int, System_bkey int,

        System nvarchar(50), Version decimal(5, 2))

DECLARE @ChangedRows TABLE (System_key int, System_dkey int,

        System_bkey int NOT NULL PRIMARY KEY CLUSTERED)

 

-- bkey is not in dimension, therefore row is new in source, that will be marked as current row.

INSERT INTO dbo.Dim_System

  (System_bkey, System, Version, System_dkey, System_pkey,

   ValidFrom, ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime)

OUTPUT inserted.System_key INTO @NewRows

SELECT DISTINCT s.Id, s.Name, s.Version, 0, 0,

   @defaultValidFrom, @defaultValidTo, 1, 0, @currentDateTime, @currentDateTime

FROM dbo.Source_System s

WHERE NOT EXISTS(SELECT 1 FROM dbo.Dim_System d WHERE d.System_bkey = s.Id)

   

SET @New_ROW_COUNT = @New_ROW_COUNT + @@ROWCOUNT

 

/* For new rows set System_dkey to the same value as the given Identity for System_key. */

UPDATE d

SET System_dkey = d.System_key

FROM dbo.Dim_System d

     INNER JOIN @NewRows n ON n.System_key = d.System_key

 

/* bkey is not in source, therefore the row is deleted in source, that will be marked in

   dimension in current row and an extra row is inserted. */

UPDATE d

SET IsCurrent = 0, ValidTo = @currentDateTime, UpdateTime = @currentDateTime

OUTPUT inserted.System_key, inserted.System_dkey, inserted.System_bkey, inserted.System,

       inserted.Version INTO @DeletedRows

FROM dbo.Dim_System d

WHERE d.System_key > 0 AND d.IsCurrent = 1 AND d.IsDeleted = 0 AND

      NOT EXISTS(SELECT 1 FROM dbo.Source_System s WHERE s.Id = d.System_bkey)

 

INSERT INTO dbo.Dim_System

  (System_bkey, System, Version, System_dkey, System_pkey,

   ValidFrom, ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime)

SELECT DISTINCT d.System_bkey, d.System, d.Version, d.System_dkey, d.System_key,

   @currentDateTime, @defaultValidTo, 1, 1, @currentDateTime, @currentDateTime

FROM @DeletedRows d

--FROM dbo.Dim_System d

--WHERE d.IsCurrent = 0 AND d.IsDeleted = 0 AND d.ValidTo = @currentDateTime AND

--      NOT EXISTS(SELECT 1 FROM dbo.Source_System s WHERE s.Id = d.System_bkey)

 

SET @Deleted_ROW_COUNT = @Deleted_ROW_COUNT + @@ROWCOUNT

 

/* bkey is marked deleted in dimension but exists now again in source, reappearing, that will be

   marked in dimension in current/deleted row and an extra row is inserted.

   Keep IsDeleted = 1 to show it was deleted in a period. */

UPDATE d

SET IsCurrent = 0, ValidTo = @currentDateTime, UpdateTime = @currentDateTime

OUTPUT inserted.System_key, inserted.System_dkey, inserted.System_bkey INTO @ChangedRows

FROM dbo.Dim_System d

     INNER JOIN dbo.Source_System s ON s.Id = d.System_bkey

WHERE d.IsCurrent = 1 AND d.IsDeleted = 1

 

/* bkey's values is changed in source, that will be marked in dimension in current row and an

   extra row is inserted. */

UPDATE d

SET IsCurrent = 0, ValidTo = @currentDateTime, UpdateTime = @currentDateTime

OUTPUT inserted.System_key, inserted.System_dkey, inserted.System_bkey INTO @ChangedRows

FROM dbo.Dim_System d

     INNER JOIN dbo.Source_System s ON s.Id = d.System_bkey

WHERE d.IsCurrent = 1 AND d.IsDeleted = 0 AND

   (ISNULL(s.Name,'¤') <> ISNULL(d.System, '¤') OR ISNULL(s.Version,0) <> ISNULL(d.Version, 0))

   -- d.dwh_Comparison != s.dwh_Comparison

 

INSERT INTO dbo.Dim_System

  (System_bkey, System, Version, System_dkey, System_pkey,

   ValidFrom, ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime)

SELECT DISTINCT s.Id, s.Name, s.Version, c.System_dkey, c.System_key,

   @currentDateTime, @defaultValidTo, 1, 0, @currentDateTime, @currentDateTime

FROM dbo.Source_System s

     INNER JOIN @ChangedRows c ON c.System_bkey = s.Id

--WHERE NOT EXISTS(SELECT 1 FROM dbo.Dim_System d WHERE d.System_bkey = s.Id AND d.IsCurrent = 1)

--Because source table contains rows that has not been changed and must be skipped in the ETL.

               

SET @Changed_ROW_COUNT = @Changed_ROW_COUNT + @@ROWCOUNT

 

END

 

 

9. Lookup, Cache Connection Manager Transform, Data Viewer

In case the ETL process needs to do two or more lookups to the same table or sql statement by adding the same Lookup in several Data Flow Tasks, data will be reloaded, which can cost performance. Instead data can be loaded one time and added to a cache, and multiple Lookups can use the same cache.

 

Lookup is a join between two tables in a one-to-many relationship, from a many-table to a one-table, for adding extra columns to the pipeline built on the many-table.
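In sql terms a Lookup with 'Ignore failure' roughly corresponds to a left outer join like this sketch, where the table and column names are only an illustration:

SELECT m.*, o.ExtraColumn
FROM dbo.ManyTable m
     LEFT OUTER JOIN dbo.OneTable o ON o.OneTable_key = m.OneTable_key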

 

  • In a 'Control Flow' drag in a 'Data Flow Task' for doing the Cache for the later Lookup.
  • Make an 'OLE DB Source' for the data from the one-table used for the Lookup.
  • Drag in 'Cache Transform' and connect the 'OLE DB Source' to it.
  • Edit 'Cache Transform', click [New] and in the tab Columns set Index Position to 1 for the primary key that the Lookup will join to. For later maintenance click [Edit] to open the Cache Transformation Editor and go to the tab Columns.
  • Notice that a 'Cache Connection Manager' is shown in 'Connection Managers' at the bottom of the package.
  • Back in 'Control Flow' drag in another 'Data Flow Task' and insert an 'OLE DB Source' for the data from the many-table.
  • Drag in 'Lookup' and connect the pipeline to it.
  • Edit 'Lookup', select 'Cache connection manager' and at the Connection page choose the Cache from before.
  • Set 'Specify how to handle rows with no matching entries' to 'Ignore failure', else Lookup gives a failure when a row from the pipeline does not exist in the cache.
  • At the Columns page join the foreign key from the pipeline with index 1 from the Cache (shown as a magnifying glass) and checkmark the other column to extend the pipeline with the extra column; maybe replace it or change the Output Alias if you like another column name in the pipeline.
  • Normally we have an 'OLE DB Destination' to pump data into from the pipeline, but instead I make a variable called RowCount and drag in 'Row Count', connect the pipeline to it as 'Lookup Match Output' and add the variable.
  • I like to see the data in the pipeline in a 'Data Viewer' window while I run the package, so I rightclick the blue arrow from 'Lookup' to 'Row Count' and {Enable Data Viewer}. Of course this is only for development, so rightclick and disable it again before deployment.

 

 

 

10. Kimball SCD type 7 dimension through DSA layer

It is nice to have a Data Staging Area (DSA) layer in the ETL process between the source data and a dimension like type 7 from Kimball. A very strict ETL flow is first to have all data ready at the DSA layer before starting to update and insert data in dimensions and facts. Instead of comparing columns to see if values have changed, I make a checksum hashbytes column called dwh_Comparison that contains the compared columns, which are the normal columns not including the primary key (business key).

 

10.1 Using Lookup

For a new employee I let ValidFrom start at 1900-01-01, and for the current row I let ValidTo stop at 9999-12-31.

 

Dim_Employee

key  dkey  pkey  bkey  LastName   FirstName  dwh_Comparison  ValidFrom   ValidTo     IsCurrent
-1   -1    0     0     Missing    Missing    0x000000000000  1900-01-01  9999-12-31  1
1    1     0     123   Davolio    Nancy      0x8B186E741C4   1900-01-01  9999-12-31  1
2    2     0     124   Fuller     Andrew     0x43E8CE52E17   1900-01-01  9999-12-31  1
3    3     0     125   Leverling  Janet      0xC72A5851BBC   1900-01-01  9999-12-31  1
4    4     0     126   Peacock    Margaret   0x2EA9AFCDDD    1900-01-01  9999-12-31  1

 

From the source system the ETL fetches employees as a full dump (for delta data see 10.3) with an Updated_Time column that I will use for ValidFrom and ValidTo, because the HR people are used to doing ad hoc queries in the source system. The source gives 5 employees:

 

Source_Employees

EmployeeID  LastName  FirstName  Updated_Time
123         Davolio   Nancy      2014-01-01
124         Fuller    Andrew     2014-01-01
125         Jones     Janet      2014-10-09
126         Fuller    Margaret   2014-10-09
127         Buchanan  Steven     2014-10-09

 

2 employees have changed last name (ID = 125 Janet and 126 Margaret) and 1 new employee (ID = 127 Steven) has started.

 

The ETL process to a Stage_Employee table in a self-explanatory graphical user interface, where I use a Lookup to divide into existing employees (match output) and new employees (no match output), and I use a Conditional Split to only continue with changed employees, where I compare the values in dwh_Comparison. The changed rows will be added two times in the stage and later in the dimension because of ValidFrom and ValidTo, therefore I use a Multicast to copy the pipeline and save the rows with different metadata column values. The data flow from source to stage table:

 

Multicast makes a duplicate of the pipeline of changed rows, one for Update and one for Insert.

 

The first »2 rows« are from Janet and Margaret, which create new key numbers; the next »2 rows« are key 3 and 4 from the existing rows in the dimension; and the »1 rows« is from Steven, who will be a new member in the dimension.

 

Stage_Employee (order by bkey + ValidFrom)

key  dkey  pkey  bkey  LastName  FirstName  dwh_Comparison  ValidFrom   ValidTo     IsCurrent
3    3     0     125   NULL      NULL       NULL            NULL        2014-10-09  0
6    6     3     125   Jones     Janet      0xC099E645E83   2014-10-09  9999-12-31  1
4    4     0     126   NULL      NULL       NULL            NULL        2014-10-09  0
7    7     4     126   Fuller    Margaret   0x3C09C8DEA7D   2014-10-09  9999-12-31  1
5    5     0     127   Buchanan  Steven     0x14226BE0760   1900-01-01  9999-12-31  1

 

The key 3 and 4 rows are for updating the same rows in the dimension, where ValidTo and IsCurrent will be changed, therefore there is no need for names in these rows. The same business keys are repeated in the key 6 and 7 rows for insert into the dimension with the new names. Key 5 is the new employee, ready to be inserted into the dimension.

 

So the SSIS package has made the staging data ready to be launched into the dimension, and that should be done in another SSIS package that can do it in parallel for all dimensions and facts from DSA to DMA (a data mart). 2 rows are updated with a new ValidTo of 2014-10-09 from the source, and 3 new rows are added because there was a new employee too:

 

 

Dim_Employee (order by bkey + ValidFrom)

key  dkey  pkey  bkey  LastName   FirstName  dwh_Comparison  ValidFrom   ValidTo     IsCurrent
-1   -1    0     0     Missing    Missing    0x000000000000  1900-01-01  9999-12-31  1
1    1     0     123   Davolio    Nancy      0x8B186E741C4   1900-01-01  9999-12-31  1
2    2     0     124   Fuller     Andrew     0x43E8CE52E17   1900-01-01  9999-12-31  1
3    3     0     125   Leverling  Janet      0xC72A5851BBC   1900-01-01  2014-10-09  0
6    3     3     125   Jones      Janet      0xC099E645E83   2014-10-09  9999-12-31  1
4    4     0     126   Peacock    Margaret   0x2EA9AFCDDD    1900-01-01  2014-10-09  0
7    4     4     126   Fuller     Margaret   0x3C09C8DEA7D   2014-10-09  9999-12-31  1
5    5     0     127   Buchanan   Steven     0x14226BE0760   1900-01-01  9999-12-31  1

 

Table Stage_Employee, where some columns allow null values because those values are not needed to update ValidTo in the dimension table, and _key is an Identity to create the new key numbers for the dimension:

 

CREATE TABLE [dbo].[Stage_Employee](

  [Employee_key] [int] IDENTITY(1,1) NOT NULL,

  [Employee_dkey] [int] NOT NULL,

  [Employee_pkey] [int] NOT NULL,

  [Employee_bkey] [int] NOT NULL,

  [LastName] [nvarchar](20) NULL,

  [FirstName] [nvarchar](10) NULL,

  [dwh_Comparison] [binary](32) NULL,

  [ValidFrom] [datetime] NULL,

  [ValidTo] [datetime] NOT NULL,

  [IsCurrent] [bit] NOT NULL,

  [InsertTime] [datetime] NOT NULL CONSTRAINT [DF_Stage_Employee_InsertTime]

                                   DEFAULT (getdate()),

  CONSTRAINT [PK_Stage_Employee] PRIMARY KEY CLUSTERED ([Employee_key] ASC)

) ON [PRIMARY]

GO

 

Employee_key is an Identity because it will calculate the new numbers for the new rows, and later the new rows will be inserted into dimension table Dim_Employee.

 

Insert the Missing row into the dimension table:

 

TRUNCATE TABLE dbo.Dim_Employee

-- SET IDENTITY_INSERT dbo.Dim_Employee ON

INSERT INTO dbo.Dim_Employee

(Employee_key, Employee_dkey, Employee_pkey, Employee_bkey, LastName, FirstName,

 dwh_Comparison, ValidFrom, ValidTo, IsCurrent)

VALUES

(-1, -1, 0, 0, 'Missing', 'Missing',

 0x0000000000000000000000000000000000000000000000000000000000000000,

 '1900-01-01', '9999-12-31', 1)

-- SET IDENTITY_INSERT dbo.Dim_Employee OFF

GO

 

When I have a DSA layer and a staging table, there is no need to have an Identity in the dimension table Dim_Employee. The hash column dwh_Comparison uses HASHBYTES('SHA2_256'), which becomes data type binary(32) or char(64), and I just give the Missing row a default value that will never come from the source system.

 

ETL for staging table into DSA layer

 

The »Truncate Stage_Employee« task empties the staging table and copies the max key number value from the dimension table, so new rows inserted into the staging table will get new identity numbers that later can be bulk inserted into the dimension:

 

TRUNCATE TABLE dbo.Stage_Employee

DECLARE @maxKey int

SELECT  @maxKey = ISNULL(MAX(Employee_key),0) + 1

FROM    dbo.Dim_Employee

DBCC CHECKIDENT (Stage_Employee, RESEED, @maxKey)

 

The »Source_Employees« source makes a comparison column for the columns to compare with the dimension; here FirstName and LastName are concatenated:

 

SELECT

  EmployeeID,

  LastName,

  FirstName,

  Updated_Time,

  dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(UPPER(RTRIM(LTRIM(FirstName))),

                                  ';',UPPER(RTRIM(LTRIM(LastName))))) AS binary(32))

FROM dbo.Source_Employees

 

The cast makes sure the default data type binary(8000) is not used, because that would take a lot of memory in the pipeline.

CONCAT takes care of null values in the columns, else isnull(FirstName, '') would be needed.
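A small illustration of why CONCAT matters here; with the default CONCAT_NULL_YIELDS_NULL setting, string concatenation with + returns null when one part is null, while CONCAT treats null as an empty string:

SELECT CONCAT('Nancy', ';', NULL)   -- returns 'Nancy;'
SELECT 'Nancy' + ';' + NULL         -- returns NULL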

 

The Dim_Employee makes a lookup to the dimension to fetch the key columns and the comparison column for the current dimension rows:

 

SELECT

      Employee_key,

      Employee_dkey,

      Employee_pkey,

      Employee_bkey,

      dwh_Comparison

FROM  dbo.Dim_Employee

WHERE IsCurrent = 1

 

 

The Lookup No Match Output means that the business key EmployeeID from the source does not exist in the dimension. A Derived Column extends the pipeline with extra column values that will be inserted into the staging table Stage_Employee as new rows with new identity number values in the _key column:

 

 

The Lookup Match Output means that the business key EmployeeID from the source does exist in the dimension. I then only continue with the rows where the source has changed FirstName or LastName, by comparing the comparison column from source and dimension (included in the Lookup) in the »Find changed rows« Conditional Split with the criterion that they are not equal:

 

 

A changed row, meaning the source has updated FirstName or LastName, will result in two rows in the staging table: one row for updating ValidTo/IsCurrent = 0 and one row for inserting a new row with ValidFrom/IsCurrent = 1, therefore a Multicast doubles the pipeline. A Derived Column extends the pipeline with extra column values that will be inserted into the staging table Stage_Employee for the existing rows, keeping the identity number value of the _key column for doing a sql update later by joining on the _key columns, and ValidTo gets its value from Updated_Time:

 

 

Stage_Employee changed rows get new identity number values for _key, the _dkey gets its value from the dimension through the Lookup above, and the _pkey gets the value of the dimension _key because _pkey is a reference back to the previous row in the dimension; it is made as a column copy in Derived Column, and ValidFrom gets its value from Updated_Time:

 

 

The Stage_Employee set durable key updates the _dkey for the new rows:

 

UPDATE dbo.Stage_Employee WITH(TABLOCK)

SET    Employee_dkey = Employee_key

WHERE  IsCurrent = 1 AND Employee_dkey = 0

 

ETL from staging table in DSA layer into the dimension

 

The Update Dim_Employee to set ValidTo/IsCurrent for the old current row:

 

UPDATE d

SET    ValidTo    = s.ValidTo,

       IsCurrent  = s.IsCurrent,

       UpdateTime = Getdate()

FROM dbo.Dim_Employee d WITH(TABLOCK)

     INNER JOIN dbo.Stage_Employee s WITH(TABLOCK) ON s.Employee_key = d.Employee_key

WHERE s.IsCurrent = 0

 

The Insert Dim_Employee for insert new row with ValidFrom/IsCurrent=1 from Stage_Employee:

 

SELECT

  Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,

  LastName, FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent

FROM dbo.Stage_Employee

WHERE IsCurrent = 1

 

Into table Dim_Employee:

 

 

A SSIS package that updates and inserts data from staging tables to dimensions and fact tables can use BEGIN TRANSACTION together with COMMIT TRANSACTION, or ROLLBACK TRANSACTION on failure, to ensure that either everything is loaded or nothing happens.
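
A hedged sketch of that pattern in an Execute SQL Task, wrapping the update and insert from above in one transaction with TRY/CATCH:

BEGIN TRY
  BEGIN TRANSACTION

  UPDATE d
  SET    ValidTo    = s.ValidTo,
         IsCurrent  = s.IsCurrent,
         UpdateTime = GETDATE()
  FROM dbo.Dim_Employee d
       INNER JOIN dbo.Stage_Employee s ON s.Employee_key = d.Employee_key
  WHERE s.IsCurrent = 0

  INSERT INTO dbo.Dim_Employee
  (Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,
   LastName, FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent)
  SELECT Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,
         LastName, FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent
  FROM dbo.Stage_Employee
  WHERE IsCurrent = 1

  COMMIT TRANSACTION
END TRY
BEGIN CATCH
  IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
  THROW  -- fail the task so the SSIS package can react to the error
END CATCH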

 

10.2 Not using Lookup

I think it is important to make a SSIS package in a self-explanatory graphical way where the Dim_Employee lookup and the two pipelines tell a lot. But with many rows in a dimension the Lookup is slow and uses a lot of memory, so I can drop the Lookup and instead, inside the 'OLE DB Source' Source_Employees, write a SQL statement that does a kind of lookup through a Left Outer Join and is performed inside the database at SQL Server:

 

;WITH source AS

(

 SELECT EmployeeID, LastName, FirstName, Updated_Time,

   dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(UPPER(RTRIM(LTRIM(FirstName))),

                                   ';',UPPER(RTRIM(LTRIM(LastName))))) AS binary(32))

 FROM dbo.Source_Employees WITH(TABLOCK)

)

SELECT s.EmployeeID, s.LastName, s.FirstName, s.Updated_Time, s.dwh_Comparison,

       d.Employee_key, d.Employee_dkey, d.Employee_pkey

FROM source s

   LEFT OUTER JOIN dbo.Dim_Employee d WITH(TABLOCK) ON d.Employee_bkey = s.EmployeeID

WHERE d.dwh_Comparison IS NULL OR

      (d.IsCurrent = 1 AND d.dwh_Comparison <> s.dwh_Comparison)

 

·       Left outer join + d.dwh_Comparison IS NULL fetches new rows from the source with a business key that does not exist in the dimension, so all d. columns have null values.

·       d.IsCurrent = 1 AND d.dwh_Comparison <> s.dwh_Comparison fetches changed rows from the source that do not match the current rows in the dimension.

 

A Conditional Split has an output called New for ISNULL(Employee_key) and the default output is called Changed. The New pipeline corresponds to 'Lookup No Match Output' and the Changed pipeline goes to 'Multicast'. The data flow from source to stage table:

 

Multicast duplicates the pipeline of changed rows for Update and Insert.

 

Here only 3 rows come out of Source_Employees because 1 is new and 2 are changed, whereas the first data flow showed 5 rows from the source. There are always pros and cons between a self-explanatory graphical user interface and performance.

 

10.3 Source data into an Archive with repeating members

The data warehouse fetches source data and adds all of it to the archive, also when the source delivers revisions of rows or rows that seem not to have been changed. I call these repeating members, e.g. Fullor in three rows:

 

Archive_Employee

EmployeeID | LastName | FirstName | Updated_Time        | ARC_INSERT_TIME     | ARC_RECORD_ID
123        | Davolio  | Nancy     | 2016-01-01 10:00:00 | 2016-01-02 02:38:03 | 34521
124        | Fuller   | Andrew    | 2016-01-01 11:00:00 | 2016-01-02 02:38:03 | 34522
124        | Fullar   | Andrew    | 2016-01-01 12:00:00 | 2016-01-02 02:38:03 | 34523
124        | Fullor   | Andrew    | 2016-01-01 13:00:00 | 2016-01-02 02:38:03 | 34524
124        | Fullor   | Andrew    | 2016-01-01 14:00:00 | 2016-01-02 02:38:03 | 34525
124        | Fullor   | Andrew    | 2016-01-01 15:00:00 | 2016-01-02 02:38:03 | 34526
124        | Fullar   | Andrew    | 2016-01-01 16:00:00 | 2016-01-02 02:38:03 | 34527
124        | Fuller   | Andrew    | 2016-01-01 17:00:00 | 2016-01-02 02:38:03 | 34528
127        | Buchanan | Steven    | 2016-01-01 17:00:00 | 2016-01-02 02:38:03 | 34529
127        | Buchanan | Stephen   | 2016-01-01 20:00:00 | 2016-01-02 02:38:03 | 34530

 

In the dimension Andrew’s name is Fuller, but above there is a revision row with an unchanged name, later several rows with a changed name, and the last change is back to the original name Fuller. Buchanan is a new employee and the first row is a misspelling of his first name.

 

»End Of The Day« for each data revision in the Archive is easy by selecting the max of Updated_Time from the source, but in this approach I will respect all the changes in the source with these rules:

 

  • When a member is unchanged it will be skipped, e.g. ID 123 has the same name as the current member in the dimension, and so has the first row with Fuller.
  • When a member is repeated in multiple contiguous rows with identical data except the time, they will be smashed or collapsed into one row, e.g. the Fullor rows with a time span of Updated_Time.

 

The result for a stage table in DSA where ValidFrom and ValidTo are smashed and the null keys will be set by the Identity:

key  | dkey | pkey | bkey | LastName | FirstName | dwh_Comparison | ValidFrom           | ValidTo             | IsCurrent
2    | 2    | 0    | 124  | NULL     | NULL      | NULL           | NULL                | 2016-01-01 12:00:00 | 0
NULL | 2    | 0    | 124  | Fullar   | Andrew    | 0x130A89EB575  | 2016-01-01 12:00:00 | 2016-01-01 13:00:00 | 0
NULL | 2    | 0    | 124  | Fullor   | Andrew    | 0xB2829725711  | 2016-01-01 13:00:00 | 2016-01-01 16:00:00 | 0
NULL | 2    | 0    | 124  | Fullar   | Andrew    | 0x130A89EB575  | 2016-01-01 16:00:00 | 2016-01-01 17:00:00 | 0
NULL | 2    | 0    | 124  | Fuller   | Andrew    | 0x43E8CE52E17  | 2016-01-01 17:00:00 | 9999-12-31 00:00:00 | 1
NULL | 0    | 0    | 127  | Buchanan | Steven    | 0x03A97F4DBE3  | 2016-01-01 17:00:00 | 2016-01-01 20:00:00 | 0
NULL | 0    | 0    | 127  | Buchanan | Stephen   | 0x903346CB324  | 2016-01-01 20:00:00 | 9999-12-31 00:00:00 | 1

 

  • When ValidFrom is null, it means the row will update an existing current row in the dimension by setting ValidTo and IsCurrent with values from the stage table.
  • When ValidFrom is not null and dkey <> 0, it means the row is a changed member to insert into the dimension.
  • When ValidFrom is not null and dkey = 0, it means the row is a new member to insert into the dimension.

 

The SQL that does the magic, where dwh_Sortorder takes care of the last row of Fuller and dwh_Groupnumber takes care of the double change to Fullar, and the smashing of the Fullor rows is done using Min and Max:

 

;WITH source AS

(

 SELECT

   EmployeeID, LastName, FirstName, Updated_Time,

   dwh_Sortorder = ROW_NUMBER() OVER(PARTITION BY EmployeeID ORDER BY Updated_Time),

   dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(UPPER(RTRIM(LTRIM(FirstName))),

                                   ';',UPPER(RTRIM(LTRIM(LastName))))) AS binary(32))

 FROM ARC.dbo.Archive_Employee WITH(TABLOCK)

),

datarow

AS

(

 SELECT

  s.LastName, s.FirstName, s.dwh_Comparison,

  d.Employee_key, d.Employee_dkey, d.Employee_pkey, Employee_bkey = s.EmployeeID,

  ValidFrom = s.Updated_Time,

  ValidTo = LEAD(s.Updated_Time, 1, '9999-12-31') OVER(PARTITION BY s.EmployeeID

                                                       ORDER BY s.Updated_Time),

  dwh_Groupnumber = ( ROW_NUMBER() OVER(ORDER BY s.EmployeeID, s.Updated_Time) ) -

                    ( ROW_NUMBER() OVER(PARTITION BY s.EmployeeID, s.dwh_Comparison

                                        ORDER BY s.Updated_Time) )

 FROM source s

      LEFT OUTER JOIN dbo.Dim_Employee d WITH(TABLOCK) ON d.Employee_bkey = s.EmployeeID

 WHERE d.dwh_Comparison IS NULL OR (d.IsCurrent = 1 AND (d.dwh_Comparison <> s.dwh_Comparison OR

                                (d.dwh_Comparison = s.dwh_Comparison AND s.dwh_Sortorder <> 1)))

)

-- New rows to insert into dimension repeating members smash together timespan ValidFrom-To

SELECT Employee_key = null, Employee_dkey = 0, Employee_pkey = 0, Employee_bkey,

       LastName, FirstName, dwh_Comparison,

       ValidFrom = MIN(ValidFrom), ValidTo = MAX(ValidTo),

       IsCurrent = IIF(MAX(ValidTo) = '9999-12-31', 1, 0), Flag = 'New member for insert'

FROM datarow

WHERE Employee_key IS NULL

GROUP BY Employee_bkey, dwh_Groupnumber, LastName, FirstName, dwh_Comparison

UNION ALL

-- Changed rows to update existing rows in dimension set ValidTo=From IsCurrent=0 ValidFrom=null

SELECT Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,

       LastName = null, FirstName = null, dwh_Comparison = null,

       ValidFrom = null, ValidTo = MIN(ValidFrom),

       IsCurrent = 0, Flag = 'Changed member for update'

FROM datarow

WHERE Employee_key IS NOT NULL

GROUP BY Employee_bkey, Employee_key, Employee_dkey, Employee_pkey

UNION ALL

-- Changed rows to insert into dimension repeating members smash together timespan ValidFrom-To

SELECT Employee_key = null, Employee_dkey, Employee_pkey = 0, Employee_bkey,

       LastName, FirstName, dwh_Comparison,

       ValidFrom = MIN(ValidFrom), ValidTo = MAX(ValidTo),

       IsCurrent = IIF(MAX(ValidTo) = '9999-12-31', 1, 0), Flag = 'Changed member for insert'

FROM datarow

WHERE Employee_key IS NOT NULL

GROUP BY Employee_bkey, dwh_Groupnumber, Employee_dkey, LastName, FirstName, dwh_Comparison

ORDER BY Employee_bkey, ValidFrom
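
To see how dwh_Groupnumber collapses contiguous identical rows, here is a small stand-alone illustration of the double ROW_NUMBER() trick with made-up values (LastName stands in for dwh_Comparison):

;WITH source AS
(
 SELECT *
 FROM (VALUES
   (124, 'Fullar', CAST('2016-01-01 12:00' AS datetime2(0))),
   (124, 'Fullor', '2016-01-01 13:00'),
   (124, 'Fullor', '2016-01-01 14:00'),
   (124, 'Fullor', '2016-01-01 15:00'),
   (124, 'Fullar', '2016-01-01 16:00'),
   (124, 'Fuller', '2016-01-01 17:00')
      ) v (EmployeeID, LastName, Updated_Time)
)
SELECT EmployeeID, LastName, Updated_Time,
       dwh_Groupnumber = ROW_NUMBER() OVER(ORDER BY EmployeeID, Updated_Time)
                       - ROW_NUMBER() OVER(PARTITION BY EmployeeID, LastName
                                           ORDER BY Updated_Time)
FROM source
ORDER BY Updated_Time

-- The three contiguous Fullor rows all get group number 1 and are collapsed into one
-- row by Min/Max, while the two Fullar rows get group numbers 0 and 3 and the last
-- Fuller row gets 5, so they stay as separate rows in the GROUP BY.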

 

Of course the smashing with Min and Max has no relevance if all rows are different, i.e. when there is a Start datetime column which is not used for ValidFrom/To.

 

ETL for staging table into DSA layer

 

From the SQL I build this SSIS package to fetch data from the archive that differs from the dimension and load the data into the DSA stage table.

 

Multicast duplicates the pipeline of changed rows for Update and Insert.

 

The »4 rows« are from Andrew and have created new key numbers, the »1 rows« is key 2 from the existing row in the dimension, and the »2 rows« are from Buchanan, who will be new in the dimension including the misspelling of his first name.

 

Stage_Employee (order by bkey + ValidFrom)

key | dkey | pkey | bkey | LastName | FirstName | dwh_Comparison | ValidFrom           | ValidTo             | IsCurrent
2   | 2    | 0    | 124  | NULL     | NULL      | NULL           | NULL                | 2016-01-01 12:00:00 | 0
7   | 2    | 2    | 124  | Fullar   | Andrew    | 0x130A89EB575  | 2016-01-01 12:00:00 | 2016-01-01 13:00:00 | 0
8   | 2    | 7    | 124  | Fullor   | Andrew    | 0xB2829725711  | 2016-01-01 13:00:00 | 2016-01-01 16:00:00 | 0
9   | 2    | 8    | 124  | Fullar   | Andrew    | 0x130A89EB575  | 2016-01-01 16:00:00 | 2016-01-01 17:00:00 | 0
10  | 2    | 9    | 124  | Fuller   | Andrew    | 0x43E8CE52E17  | 2016-01-01 17:00:00 | 9999-12-31 00:00:00 | 1
5   | 5    | 0    | 127  | Buchanan | Steven    | 0x03A97F4DBE3  | 2016-01-01 17:00:00 | 2016-01-01 20:00:00 | 0
6   | 5    | 5    | 127  | Buchanan | Stephen   | 0x903346CB324  | 2016-01-01 20:00:00 | 9999-12-31 00:00:00 | 1

 

Truncate Stage_Employee

Empties the staging table and copies the max key number value from the dimension table, so rows inserted into the staging table get new identity numbers that can later be bulk inserted into the dimension:

 

TRUNCATE TABLE dbo.Stage_Employee

DECLARE @maxKey int

SELECT  @maxKey = ISNULL(MAX(Employee_key),0) + 1

FROM    dbo.Dim_Employee

DBCC CHECKIDENT (Stage_Employee, RESEED, @maxKey)

 

Match Archive and Dimension

To fetch data from the Archive, the 'OLE DB Source' uses delta detection incremental load through a table Archive_DeltaDataDetection that keeps a datetime2(7) for the latest successful load to EDW or Data mart, e.g.

 

ARC_INSERT_TIME
2016-01-01 05:18:24
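
A hedged sketch of the delta detection table holding that single row (the seed value is just an example; the first run then fetches everything from the archive):

CREATE TABLE dbo.Archive_DeltaDataDetection
(
  ARC_INSERT_TIME datetime2(7) NOT NULL
)
GO
INSERT INTO dbo.Archive_DeltaDataDetection (ARC_INSERT_TIME)
VALUES ('1900-01-01')
GO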

 

;WITH source AS

(

 SELECT

  EmployeeID, LastName, FirstName, Updated_Time,

  dwh_Sortorder = ROW_NUMBER() OVER(PARTITION BY EmployeeID ORDER BY Updated_Time),

  dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(UPPER(RTRIM(LTRIM(FirstName))),

                                  ';',UPPER(RTRIM(LTRIM(LastName))))) AS binary(32))

 FROM ARC.dbo.Archive_Employee WITH(TABLOCK)

 WHERE ARC_INSERT_TIME > (SELECT ARC_INSERT_TIME FROM dbo.Archive_DeltaDataDetection)

)

SELECT

 s.LastName, s.FirstName, s.dwh_Comparison,

 d.Employee_key, d.Employee_dkey, d.Employee_pkey, Employee_bkey = s.EmployeeID,

 ValidFrom = s.Updated_Time,

 ValidTo = LEAD(s.Updated_Time, 1, '9999-12-31') OVER(PARTITION BY s.EmployeeID

                                                      ORDER BY s.Updated_Time),

 dwh_Groupnumber = ( ROW_NUMBER() OVER(ORDER BY s.EmployeeID, s.Updated_Time) ) -

                   ( ROW_NUMBER() OVER(PARTITION BY s.EmployeeID, s.dwh_Comparison

                                       ORDER BY s.Updated_Time) )

FROM source s

     LEFT OUTER JOIN dbo.Dim_Employee d WITH(TABLOCK) ON d.Employee_bkey = s.EmployeeID

WHERE d.dwh_Comparison IS NULL OR (d.IsCurrent = 1 AND (d.dwh_Comparison <> s.dwh_Comparison OR

                               (d.dwh_Comparison = s.dwh_Comparison AND s.dwh_Sortorder <> 1)))

ORDER BY s.EmployeeID, s.dwh_Sortorder

 

When the ETL for the DSA layer and the EDW layer or DMA layer has finished successfully, the single row in Archive_DeltaDataDetection is updated to the current datetime:

 

UPDATE dbo.Archive_DeltaDataDetection

SET ARC_INSERT_TIME = SYSDATETIME()

 

Next time the ETL process begins, it will fetch all the new data from the archive since the last process.

 

New - Changed

A Conditional Split has an output called New for ISNULL(Employee_key) and the default output is called Changed.

 

 

New output (N)

ValidTo == (DT_DATE)"9999-12-31" ? (DT_BOOL)1 : (DT_BOOL)0

 

Changed output (U) – right multicast

Note in Aggregate that it is the minimum of ValidFrom, but it becomes the output ValidTo as shown in the magic SQL statement.

 

Changed output (I) – left multicast

 

Set durable key

UPDATE s

SET Employee_dkey = t.Employee_key

FROM dbo.Stage_Employee s WITH(TABLOCK)

     INNER JOIN

      (SELECT Employee_bkey, Employee_key = MIN(Employee_key)

       FROM dbo.Stage_Employee

       WHERE  Employee_dkey = 0

       GROUP BY Employee_bkey

      ) t ON t.Employee_bkey = s.Employee_bkey

WHERE s.Employee_dkey = 0

 

Set previous key

UPDATE s

SET Employee_pkey = t.Employee_pkey

FROM dbo.Stage_Employee s WITH(TABLOCK)

     INNER JOIN

      (SELECT Employee_key, Employee_pkey = LAG(Employee_key, 1, Employee_pkey)

                            OVER(PARTITION BY Employee_bkey ORDER BY ValidFrom)

       FROM dbo.Stage_Employee

      ) t ON t.Employee_key = s.Employee_key

WHERE s.ValidFrom IS NOT NULL

 

ETL from staging table in DSA layer into the dimension

Almost the same as above, but now I am using a WHERE clause on ValidFrom.

The Update Dim_Employee to set ValidTo/IsCurrent for the old current row:

 

Update Dim_Employee

UPDATE d

SET    ValidTo    = s.ValidTo,

       IsCurrent  = s.IsCurrent,

       UpdateTime = Getdate()

FROM dbo.Dim_Employee d WITH(TABLOCK)

     INNER JOIN dbo.Stage_Employee s WITH(TABLOCK) ON s.Employee_key = d.Employee_key

WHERE s.ValidFrom IS NULL

 

Insert Dim_Employee

For new row with ValidFrom/IsCurrent=1 from Stage_Employee:

SELECT

  Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,

  LastName, FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent

FROM dbo.Stage_Employee

WHERE ValidFrom IS NOT NULL

 

BIML to generate multiple dimensions in multiple SSIS packages

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

 

  <Connections>

    <OleDbConnection Name="DSA_DB" ConnectionString="Data Source=sqlservername;Initial Catalog=DSA;

       Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False; Packet Size=32767;

       Application Name=SSIS;" CreateInProject="true"/>

    <AdoNetConnection Name="DSA_ADO" ConnectionString="Data Source=sqlservername;Database=DSA;

       Integrated Security=SSPI;Connect Timeout=30; Application Name=SSIS;"

       Provider="SQL" CreateInProject="true"/>

  </Connections>

 

  <Packages>

    <Package Name="Stage_Employee" ConstraintMode="Linear">

      <Tasks>

        <ExecuteTSqlStatement Name="Empty Stage table" ConnectionName="DSA_ADO">

          <SqlStatementSource>

            <![CDATA[

            TRUNCATE TABLE dbo.Stage_Employee

            DECLARE @maxKey int

            SELECT @maxKey = ISNULL(MAX(Employee_key),0) + 1

            FROM dbo.Dim_Employee

            DBCC CHECKIDENT (Stage_Employee, RESEED, @maxKey)

            ]]>

          </SqlStatementSource>

        </ExecuteTSqlStatement>

 

        <Dataflow Name="From Source to Stage">

          <Transformations>

            <OleDbSource Name="Match Source and Dimension" ConnectionName="DSA_DB">

              <DirectInput>

                <![CDATA[

                ;WITH s AS

                (

                 SELECT

 

                   -- Business key

 

                   EmployeeID,

 

                   -- Business columns

 

                   LastName, FirstName,

 

                   Updated_Time,

                   dwh_Sortorder = ROW_NUMBER() OVER(PARTITION BY EmployeeID ORDER BY Updated_Time),

                   dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(UPPER(RTRIM(LTRIM(FirstName))),

                                                   ';',UPPER(RTRIM(LTRIM(LastName))))) AS binary(32))

                   -- Business columns above

 

                 FROM ARC.dbo.Archive_Employee WITH(TABLOCK)

                )

                SELECT

 

                  -- Business columns

 

                  s.LastName, s.FirstName,

 

                  s.dwh_Comparison,                                                 -- 5 x Business key

                  d.Employee_key, d.Employee_dkey, d.Employee_pkey, Employee_bkey = s.EmployeeID,

                  ValidFrom = s.Updated_Time,

                  ValidTo = LEAD(s.Updated_Time, 1, '9999-12-31') OVER(PARTITION BY s.EmployeeID

                                                                  ORDER BY s.Updated_Time),

                  dwh_Groupnumber = ( ROW_NUMBER() OVER(ORDER BY s.EmployeeID, s.Updated_Time) ) -

                   ( ROW_NUMBER() OVER(PARTITION BY s.EmployeeID, s.dwh_Comparison ORDER BY s.Updated_Time) )

                FROM s

                     LEFT OUTER JOIN dbo.Dim_Employee d WITH(TABLOCK) ON d.Employee_bkey = s.EmployeeID

                WHERE d.dwh_Comparison IS NULL OR (d.IsCurrent = 1 AND (d.dwh_Comparison <> s.dwh_Comparison

                                        OR (d.dwh_Comparison = s.dwh_Comparison AND s.dwh_Sortorder <> 1)))

                ORDER BY s.EmployeeID, s.dwh_Sortorder

                ]]>

              </DirectInput>

            </OleDbSource>

            <ConditionalSplit Name="New - Changed">

              <InputPath OutputPathName="Match Source and Dimension.Output"/>

              <OutputPaths>

                <OutputPath Name="New">

                  <Expression>ISNULL(Employee_key)</Expression>

                </OutputPath>

                <OutputPath Name="Changed">

                  <Expression>!ISNULL(Employee_key)</Expression>

                </OutputPath>

              </OutputPaths>

            </ConditionalSplit>

           

            <!--New pipeline (N) -->

            <Aggregate Name="Smash ValidFrom ValidTo (N)">

              <InputPath OutputPathName="New - Changed.New"/>

              <OutputPaths>

                <OutputPath Name="Output">

                  <Columns>

                    <Column SourceColumn="Employee_bkey" Operation="GroupBy"/>

                    <Column SourceColumn="dwh_Groupnumber"  Operation="GroupBy"/>

                    <Column SourceColumn="dwh_Comparison" Operation="GroupBy"/>

                    <Column SourceColumn="ValidFrom" Operation="Minimum"/>

                    <Column SourceColumn="ValidTo" Operation="Maximum"/>

 

                    <!-- Business columns -->

 

                    <Column SourceColumn="LastName" Operation="GroupBy"/>

                    <Column SourceColumn="FirstName" Operation="GroupBy"/>

 

                  </Columns>

                </OutputPath>

              </OutputPaths>

            </Aggregate>

            <Sort Name="Businesskey + ValidFrom (N)">

              <InputPath OutputPathName="Smash ValidFrom ValidTo (N).Output"/>

              <Columns>

                <Column Sort="true" SourceColumn="Employee_bkey"/>

                <Column Sort="true" SourceColumn="ValidFrom"/>

              </Columns>

            </Sort>

            <DerivedColumns Name="Set IsCurrent (N)">

              <InputPath OutputPathName="Businesskey + ValidFrom (N).Output"/>

              <Columns>

                <Column Name="Employee_dkey" DataType="Int32">(DT_I4)0</Column>

                <Column Name="Employee_pkey" DataType="Int32">(DT_I4)0</Column>

                <Column Name="IsCurrent" DataType="Boolean">

                              ValidTo == (DT_DATE)"9999-12-31" ? (DT_BOOL)1 : (DT_BOOL)0</Column>

              </Columns>

            </DerivedColumns>

            <OleDbDestination Name="Stage, new rows to insert" ConnectionName="DSA_DB"

               UseFastLoadIfAvailable="true" TableLock="false" CheckConstraints="false"

               KeepIdentity="false" KeepNulls="false">

              <InputPath OutputPathName="Set IsCurrent (N).Output"/>

              <ExternalTableOutput Table="dbo.Stage_Employee"/>

            </OleDbDestination>

 

            <!--Changed pipeline -->

            <Multicast Name="Duplicate for Update and Insert">

              <InputPath OutputPathName="New - Changed.Changed"/>

              <OutputPaths>

                <OutputPath Name="Update"/>

                <OutputPath Name="Insert"/>

              </OutputPaths>

            </Multicast>

           

            <!--Update pipeline (U) -->

            <Aggregate Name="Smash ValidTo (U)">

              <InputPath OutputPathName="Duplicate for Update and Insert.Update"/>

              <OutputPaths>

                <OutputPath Name="Output">

                  <Columns>

                    <Column SourceColumn="Employee_bkey" Operation="GroupBy"/>

                    <Column SourceColumn="Employee_key" Operation="GroupBy"/>

                    <Column SourceColumn="Employee_dkey" Operation="GroupBy"/>

                    <Column SourceColumn="Employee_pkey" Operation="GroupBy"/>

                    <Column SourceColumn="ValidFrom" TargetColumn="ValidTo" Operation="Minimum"/>

                  </Columns>

                </OutputPath>

              </OutputPaths>

            </Aggregate>

            <DerivedColumns Name="Set IsCurrent to 0 (U)">

              <InputPath OutputPathName="Smash ValidTo (U).Output"/>

              <Columns>

                <Column Name="IsCurrent" DataType="Boolean">(DT_BOOL)0</Column>

              </Columns>

            </DerivedColumns>

            <OleDbDestination Name="Stage, changed rows to update" ConnectionName="DSA_DB"

               UseFastLoadIfAvailable="true" TableLock="false" CheckConstraints="false"

               KeepIdentity="true" KeepNulls="false">

              <InputPath OutputPathName="Set IsCurrent to 0 (U).Output"/>

              <ExternalTableOutput Table="dbo.Stage_Employee"/>

            </OleDbDestination>

 

            <!--Insert pipeline (I) -->           

            <Aggregate Name="Smash ValidFrom ValidTo (I)">

              <InputPath OutputPathName="Duplicate for Update and Insert.Insert"/>

              <OutputPaths>

                <OutputPath Name="Output">

                  <Columns>

                    <Column SourceColumn="Employee_bkey" Operation="GroupBy"/>

                    <Column SourceColumn="dwh_Groupnumber" Operation="GroupBy"/>

                    <Column SourceColumn="Employee_dkey" Operation="GroupBy"/>

                    <Column SourceColumn="dwh_Comparison" Operation="GroupBy"/>

                    <Column SourceColumn="ValidFrom" Operation="Minimum"/>

                    <Column SourceColumn="ValidTo" Operation="Maximum"/>

 

                    <!-- Business columns -->

 

                    <Column SourceColumn="LastName" Operation="GroupBy"/>

                    <Column SourceColumn="FirstName" Operation="GroupBy"/>

 

                  </Columns>

                </OutputPath>

              </OutputPaths>

            </Aggregate>

            <Sort Name="Businesskey + ValidFrom (I)">

              <InputPath OutputPathName="Smash ValidFrom ValidTo (I).Output"/>

              <Columns>

                <Column Sort="true" SourceColumn="Employee_bkey"/>

                <Column Sort="true" SourceColumn="ValidFrom"/>

              </Columns>

            </Sort>

            <DerivedColumns Name="Set IsCurrent (I)">

              <InputPath OutputPathName="Businesskey + ValidFrom (I).Output"/>

              <Columns>

                <Column Name="Employee_pkey" DataType="Int32">(DT_I4)0</Column>

                <Column Name="IsCurrent" DataType="Boolean">

                              ValidTo == (DT_DATE)"9999-12-31" ? (DT_BOOL)1 : (DT_BOOL)0</Column>

              </Columns>

            </DerivedColumns>

            <OleDbDestination Name="Stage, changed rows to insert" ConnectionName="DSA_DB"

               UseFastLoadIfAvailable="true" TableLock="false" CheckConstraints="false"

               KeepIdentity="false" KeepNulls="false">

              <InputPath OutputPathName="Set IsCurrent (I).Output"/>

              <ExternalTableOutput Table="dbo.Stage_Employee"/>

            </OleDbDestination>

          </Transformations>

        </Dataflow>

 

        <ExecuteTSqlStatement Name="Set durable key" ConnectionName="DSA_ADO">

          <SqlStatementSource>

            <![CDATA[

            UPDATE s

            SET Employee_dkey = t.Employee_key

            FROM dbo.Stage_Employee s WITH(TABLOCK)

                 INNER JOIN

                   (SELECT Employee_bkey, Employee_key = MIN(Employee_key)

                    FROM dbo.Stage_Employee

                    WHERE Employee_dkey = 0

                    GROUP BY Employee_bkey

                   ) t ON t.Employee_bkey = s.Employee_bkey

            WHERE s.Employee_dkey = 0

            ]]>

          </SqlStatementSource>

        </ExecuteTSqlStatement>

 

        <ExecuteTSqlStatement Name="Set previous key" ConnectionName="DSA_ADO">

          <SqlStatementSource>

            <![CDATA[

            UPDATE s

            SET Employee_pkey = t.Employee_pkey

            FROM dbo.Stage_Employee s WITH(TABLOCK)

                 INNER JOIN

                   (SELECT Employee_key, Employee_pkey = LAG(Employee_key, 1, Employee_pkey)

                                                       OVER(PARTITION BY Employee_bkey ORDER BY ValidFrom)

                    FROM dbo.Stage_Employee

                   ) t ON t.Employee_key = s.Employee_key

            WHERE s.ValidFrom IS NOT NULL                       

            ]]>

          </SqlStatementSource>

        </ExecuteTSqlStatement>

      </Tasks>

    </Package>

 

    <Package Name="Dim_Employee" ConstraintMode="Linear">

      <Tasks>

        <Container Name="Stage_Employee -> Dim_Employee" ConstraintMode="Linear">

          <Tasks>       

            <ExecuteTSqlStatement Name="Update ValidTo and IsCurrent in Dim_Employee"

                                  ConnectionName="DSA_ADO">

              <SqlStatementSource>

                <![CDATA[

                UPDATE d

                SET ValidTo    = s.ValidTo,

                    IsCurrent  = s.IsCurrent,

                    UpdateTime = Getdate()

                FROM dbo.Dim_Employee d WITH(TABLOCK)

                     INNER JOIN dbo.Stage_Employee s WITH(TABLOCK) ON s.Employee_key = d.Employee_key

                WHERE s.ValidFrom IS NULL           

                ]]>

              </SqlStatementSource>

            </ExecuteTSqlStatement>

 

            <Dataflow Name="Insert Dim_Employee from Stage_Employee"

                      DefaultBufferSize="104857600" DefaultBufferMaxRows="10000">

              <Transformations>

                <OleDbSource Name="Stage_Employee" ConnectionName="DSA_DB">

                  <DirectInput>

                    <![CDATA[

                    SELECT

                      Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,

 

                      -- Business columns

 

                      LastName, FirstName,

 

                      dwh_Comparison, ValidFrom, ValidTo, IsCurrent

                    FROM dbo.Stage_Employee

                    WHERE ValidFrom IS NOT NULL               

                    ]]>

                  </DirectInput>

                </OleDbSource>

                <OleDbDestination Name="Dim_Employee" ConnectionName="TEST_DB"

                  UseFastLoadIfAvailable="true" TableLock="true" CheckConstraints="false"

                  KeepIdentity="true" KeepNulls="false"

                  BatchSize="10000" MaximumInsertCommitSize="2000000">

                  <InputPath OutputPathName="Stage_Employee.Output"/>

                  <ExternalTableOutput Table="dbo.Dim_Employee"/>

                </OleDbDestination>

              </Transformations>

            </Dataflow>

          </Tasks>

        </Container>

      </Tasks>

    </Package>

  </Packages>

</Biml>

Download and install BIDSHelperSetup to extend SQL Server Data Tools for Visual Studio with BIML support: https://bidshelper.codeplex.com/.

A tutorial for your first biml file and how to generate a SSIS package is at this link:

https://bidshelper.codeplex.com/wikipage?title=Biml%20Package%20Generator.

 

 

11. Fact data connected to Kimball SCD type 7 dimension

A data mart is first loaded with data into the dimension tables as shown in chapter 5, and after that the data for the fact tables is loaded, where the metadata columns _key and _dkey are added to the fact data to create the connection to the values in a dimension for current and historical purposes.

 

A System dimension type 7 with metadata and descriptive columns:

 

System_key | System_dkey | System_pkey | System_bkey | System       | ValidFrom           | ValidTo
19         | 19          | 0           | 76CB        | Warning      | 1900-01-01 00:00:00 | 2015-05-08 14:12:21
35         | 19          | 19          | 76CB        | Notification | 2015-05-08 14:12:21 | 2015-07-31 08:45:48
87         | 19          | 35          | 76CB        | Caution      | 2015-07-31 08:45:48 | 9999-12-31 00:00:00

 

_key   stands for a primary key as a surrogate incrementing sequential number.

_dkey stands for a durable key and is an integer representation of a business key.

_pkey stands for a previous key and is a reference to the _key value of the previous row.

_bkey stands for a business key and is the primary key in the source data.

 

Column _key is a unique primary key of the dimension table, and the combination of _bkey + ValidFrom or _bkey + ValidTo is unique too.
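
A minimal sketch of such a type 7 dimension table, assuming the columns shown in this chapter; data types and lengths are my assumptions (System_bkey is nvarchar(4) to fit values like '76CB'):

CREATE TABLE dbo.Dim_System
(
  System_key  int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
  System_dkey int NOT NULL,          -- durable key
  System_pkey int NOT NULL,          -- previous key, reference back to a System_key value
  System_bkey nvarchar(4) NOT NULL,  -- business key from the source
  System      nvarchar(50) NOT NULL, -- descriptive column
  ValidFrom   datetime NOT NULL,
  ValidTo     datetime NOT NULL,
  IsCurrent   bit NOT NULL,
  IsDeleted   bit NOT NULL,
  IsInferred  bit NOT NULL,
  InsertTime  datetime NOT NULL,
  UpdateTime  datetime NOT NULL
)
GO
-- The »Missing« row with System_key -1 is inserted once using SET IDENTITY_INSERT ON.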

 

11.1 No transaction date then fact connect to current data

When fact data has no transaction date, the fact data will be connected to the current and not deleted data in a dimension where the business key matches. It can be implemented by a Lookup in a Data Flow Task that fills out a pipeline to be stored in the fact table.

 

  • In OLE DB Source the relevant columns are selected, especially the business key column that is already used in the dimension table _bkey column. In this source data the business key is called SystemId.
  • In Lookup I am using a sql statement in 'Use results of an sql query' to fetch all relevant rows for current and not deleted data, because no new fact data can refer to a deleted dimension value. I select column System_bkey for a later match to column SystemId, and I select columns System_key and System_dkey to be added as extra columns into the pipeline to be inserted into the fact table through an OLE DB Destination.

 

SELECT System_bkey, System_key, System_dkey

FROM dbo.Dim_System

WHERE IsCurrent = 1 AND IsDeleted = 0

 

  • The Lookup matches or joins the business key SystemId and the dimension column System_bkey, and with a successful match the source data rows continue in the pipeline 'Lookup Match Output' and data will be inserted in the Fact_System OLE DB Destination.
  • When the business key SystemId does not match System_bkey, the source data rows continue in the pipeline 'Lookup No Match Output' and data will be inserted into an error table in the Error_System OLE DB Destination. In Lookup at the General page I select 'Redirect rows to no match output' in the dropdown, or in the Lookup properties pane I change property NoMatchBehavior to 'Send rows with no matching entries to the no match output'. This is the same as a sql statement with WHERE NOT EXISTS(SELECT 1 …), see later and the sketch after this list.
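
A hedged T-SQL sketch of the same logic; the source table name Source_Production is borrowed from later in this chapter and the column lists of Fact_System and Error_System are assumptions:

INSERT INTO dbo.Fact_System (System_key, System_dkey, ProductionKWh)
SELECT d.System_key, d.System_dkey, s.ProductionKWh
FROM dbo.Source_Production s
     INNER JOIN dbo.Dim_System d ON d.System_bkey = s.SystemId
WHERE d.IsCurrent = 1 AND d.IsDeleted = 0

INSERT INTO dbo.Error_System (SystemId, ProductionKWh)
SELECT s.SystemId, s.ProductionKWh
FROM dbo.Source_Production s
WHERE NOT EXISTS (SELECT 1 FROM dbo.Dim_System d
                  WHERE d.System_bkey = s.SystemId
                        AND d.IsCurrent = 1 AND d.IsDeleted = 0)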

 

 

 

Example of a source data row:

 

SystemId | ProductionKWh
76CB     | 173

 

It becomes a fact data row that uses the current dimension data reference:

 

System_key | System_dkey | ProductionKWh
87         | 19          | 173

 

 

11.2 With a transaction date then fact connect to historical data

When fact data has a transaction date, the fact data will be connected to the historical data in a dimension where the business key matches and where the transaction date is between ValidFrom and ValidTo. It can be implemented by a Lookup in a Data Flow Task that fills out a pipeline to be stored in the fact table. The transaction date is called OutageDate in the source data and is called SystemDate in the fact data.

 

  • In OLE DB Source the relevant columns are selected, especially the business key column that is already used in the dimension table _bkey column. In this source data the business key is called SystemId.
  • In Lookup I am using a sql statement in 'Use results of an sql query' to fetch all relevant rows for historical and current data except deleted data, because no new fact data can refer to a deleted dimension value. I select column System_bkey for a later match to column SystemId, and I select columns System_key and System_dkey to be added as extra columns to the pipeline to be inserted into the fact table through an OLE DB Destination. But I also have to select column ValidFrom because I need to make a fake match to column OutageDate, to have OutageDate as a parameter in the sql statement that will select the right rows where OutageDate is between ValidFrom and ValidTo.

 

SELECT System_bkey, System_key, System_dkey, ValidFrom

FROM dbo.Dim_System

WHERE IsDeleted = 0

 

  • In Lookup I match or join the business key SystemId and System_bkey. But I have to include a fake match from OutageDate to ValidFrom that will not be used in runtime mode but this fake match makes it possible to use OutageDate as a parameter.
  • In Lookup I use Partial cache and in Advanced page I modify the sql statement to fetch the right historical dimension row for OutageDate like this:

 

WHERE IsDeleted = 0 AND System_bkey = SystemId AND

      ValidFrom <= OutageDate AND ValidTo > OutageDate

 

But the sql statement has to use question mark ? input parameters for all source data to make a dynamic sql statement using the same selected columns as above. Into the Lookup Advanced page 'Modify the sql statement' I type this date range lookup sql statement (range filtering):

 

SELECT System_bkey, System_key, System_dkey, ValidFrom

FROM dbo.Dim_System

WHERE IsDeleted = 0 AND System_bkey = ?

      AND ValidFrom <= ? AND ValidTo > ?

 

The first ? parameter represents SystemId and the second and third ? parameters represent OutageDate. I click [Parameters] and can choose OutageDate in the dropdown because I earlier made a fake match from OutageDate to ValidFrom.

  • When the business key SystemId does not match System_bkey, the source data rows continue in the pipeline 'Lookup No Match Output' and data will be inserted into an error table in the Error_System OLE DB Destination. In Lookup at the General page I select 'Redirect rows to no match output' in the dropdown.

 

Example of two source data rows:

 

OutageDate          | SystemId | ProductionKWh
2015-05-25 10:22:43 | 76CB     | 173
2015-08-02 15:47:10 | 76CB     | 221

 

They become two fact data rows that use the historical dimension data reference:

 

SystemDate          | System_key | System_dkey | ProductionKWh
2015-05-25 10:22:43 | 35         | 19          | 173
2015-08-02 15:47:10 | 87         | 19          | 221

 

 

 

In runtime mode the sql statement and the ? parameters from the source are used to make a Range Lookup (Range Filtering), but it puts the Lookup into a row-by-row operation that calls SQL Server for each row from the source with its data as parameter values:

 

exec sp_executesql N'SELECT System_bkey, System_key, System_dkey, ValidFrom

FROM dbo.Dim_System

WHERE IsDeleted = 0 AND System_bkey = @P1 AND ValidFrom <= @P2 AND ValidTo > @P3',

N'@P1 int,@P2 datetime2(3),@P3 datetime2(3)','76CB','2015-05-25 10:22:43','2015-05-25 10:22:43'

 

exec sp_executesql N'SELECT System_bkey, System_key, System_dkey, ValidFrom

FROM dbo.Dim_System

WHERE IsDeleted = 0 AND System_bkey = @P1 AND ValidFrom <= @P2 AND ValidTo > @P3',

N'@P1 int,@P2 datetime2(3),@P3 datetime2(3)','76CB','2015-08-02 15:47:10','2015-08-02 15:47:10'

 

That is not good for performance with a large number of rows from the source.

 

To avoid the row-by-row operation the Lookup needs a Cache Connection Manager Transform, as shown in an earlier chapter, with a sql statement that does the connection and fetching of data from source and dimension like this:

 

SELECT s.OutageDate, d.System_bkey, d.System_key, d.System_dkey

FROM dbo.Source_Production s

     INNER JOIN dbo.Dim_System d ON d.System_bkey = s.SystemId AND

                d.ValidFrom <= s.OutageDate AND d.ValidTo > s.OutageDate

WHERE d.IsDeleted = 0

 

Of course it will take some seconds to perform this query and save the result in the cache, but indexes on the business key, OutageDate, ValidFrom and ValidTo help a lot.
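
For example, a couple of supporting indexes (the names and included columns are only suggestions):

CREATE NONCLUSTERED INDEX IX_Dim_System_bkey_ValidFrom_ValidTo
    ON dbo.Dim_System (System_bkey, ValidFrom, ValidTo)
    INCLUDE (System_key, System_dkey, IsDeleted)

CREATE NONCLUSTERED INDEX IX_Source_Production_SystemId_OutageDate
    ON dbo.Source_Production (SystemId, OutageDate)
    INCLUDE (ProductionKWh)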

 

This Lookup method does not use a fake match or a sql statement with ? parameters, and from a self-explanatory graphical user interface perspective this lookup match looks nice and adds columns _key and _dkey to the pipeline for the fact table into the OLE DB Destination.

 

 

To avoid Lookup you can use a combination of Merge Join and Conditional Split.

 

I still like the idea of doing a lookup to find the right row of the dimension, but when I have millions of rows in the source, I need to speed up the lookup with a C# script that I will show in section 11.4.

 

11.3 Inferred member when source data doesn't exist in dimension

A dimension has a member value of »Unknown«, »Missing«, »N/A«, »-« etc. to handle source data that is going to be loaded into a fact table:

 

  • Business key is empty with no value, null or an empty string.

Approach: Handling null business key as a missing member.

 

  • Business key value does not exist in the dimension.

Approach: Late arriving dimension or Early arriving fact as an unknown member, meaning a member value is an orphan child because there is no parent member value in the corresponding dimension table.

 

A forthcoming fact row has a member value that will infer a new dimension value, therefore it is called inferred member.

 

Approach to handle business key as an Unknown dimension value

Lookup catches both situations in the 'Lookup No Match Output' pipeline, where I add the two extra columns _key and _dkey and set them to the default value 0 for Unknown through a 'Derived Column'. I use a Union All to put the pipelines together to insert all source data into the fact table. The SSIS package will have this control flow and data flow:

 

 

 

Approach to handle business key as a New dimension value

I am using the example from my article Dimensional modeling which describes a Kimball type 7 dimension called Dim_System, data example:

 

System_key | System_dkey | System_pkey | System_bkey | System       | ValidFrom           | ValidTo             | IsCurrent | IsDeleted | IsInferred
-1         | -1          | 0           | 0           | Missing      | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 1
19         | 19          | 0           | 76CB        | Warning      | 1900-01-01 00:00:00 | 2015-05-08 14:12:21 | 0         | 0         | 0
35         | 19          | 19          | 76CB        | Notification | 2015-05-08 14:12:21 | 2015-07-31 08:45:48 | 0         | 0         | 0
87         | 19          | 35          | 76CB        | Caution      | 2015-07-31 08:45:48 | 9999-12-31 00:00:00 | 1         | 0         | 0
88         | 88          | 0           | 88FF        | Sign         | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 0

 

  • When the Business key value is NULL it will use the »Missing« value in the dimension that has _key value -1.

 

  • When the Business key value does not exist in the dimension, the Business key value will be inserted into a new row in the dimension table as »Unknown« + the business key value, and the metadata column IsInferred is set to 1. The Business key value will be saved in the _bkey column. The new row will provide a new generated surrogate identity number for the _key column. The same number will also be saved in the _dkey column. Both keys will be saved into the data row of the fact table.

 

Example of table Source_Production with business key called SystemId:

 

OutageDate          | SystemId | ProductionKWh
2015-05-25 10:22:43 | 76CB     | 173
2015-08-02 15:47:10 | 76CB     | 221
2015-08-12 22:21:42 | 88FF     | 100
2015-10-02 00:00:00 | 90AA     | 200
2015-10-12 00:00:00 | 90AA     | 300
2015-10-15 00:00:00 | 91KL     | 400
2015-02-07 00:00:00 | NULL     | 500

 

All data rows in table Source_Production will be loaded to table Fact_System, where the business key SystemId will be transformed to the dimension key System_key in table Dim_System. There are three values of SystemId (90AA, 91KL, NULL) in four rows that do not exist in Dim_System; therefore they are inferred members and will be inferred into Dim_System, where NULL will be transformed to System_key -1 for »Missing«, and 90AA and 91KL will be inferred as two new data rows in Dim_System as »Unknown 90AA« and »Unknown 91KL« with new numbers in System_key that will be used when loading data rows into the Fact_System table.

 

After running the SSIS package the table Dim_System will have new inferred rows:

 

System_key | System_dkey | System_pkey | System_bkey | System       | ValidFrom           | ValidTo             | IsCurrent | IsDeleted | IsInferred
-1         | -1          | 0           | 0           | Missing      | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 1
19         | 19          | 0           | 76CB        | Warning      | 1900-01-01 00:00:00 | 2015-05-08 14:12:21 | 0         | 0         | 0
35         | 19          | 19          | 76CB        | Notification | 2015-05-08 14:12:21 | 2015-07-31 08:45:48 | 0         | 0         | 0
87         | 19          | 35          | 76CB        | Caution      | 2015-07-31 08:45:48 | 9999-12-31 00:00:00 | 1         | 0         | 0
88         | 88          | 0           | 88FF        | Sign         | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 0
89         | 89          | 0           | 90AA        | Unknown 90AA | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 1
90         | 90          | 0           | 91KL        | Unknown 91KL | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 1

 

The table Fact_System is loaded with all the data rows from Source_Production and is also using the new inferred numbers of System_key from the table Dim_System:

 

SystemDate          | System_key | System_dkey | ProductionKWh
2015-05-25 10:22:43 | 35         | 19          | 173
2015-08-02 15:47:10 | 87         | 19          | 221
2015-08-12 22:21:42 | 88         | 88          | 100
2015-10-02 00:00:00 | 89         | 89          | 200
2015-10-12 00:00:00 | 89         | 89          | 300
2015-10-15 00:00:00 | 90         | 90          | 400
2015-02-07 00:00:00 | -1         | -1          | 500

 

SSIS package execution mode shows the number of rows in the pipelines, with four rows from Source_Production being inferred members; they will be handled in the Lookup 'Inferred members' with the programming I describe below:

 

 

When a data row in Lookup 'Dim_System' cannot match the business key SystemId from table Source_Production with System_bkey in table Dim_System, the data row will be handled as an inferred member in Lookup 'Inferred members' like this:

 

1.  In the Data Flow Task I drag in a Lookup and edit it for 'Inferred members'.

2.  In page General I select 'Partial cache' and 'Ignore failure'.

3.  In page Connection I connect to the database at OLE DB connection manager and I select 'Use results of an SQL query' because I want to limit the columns from table Dim_System with a sql statement:

 

   SELECT System_bkey, System_key, System_dkey

   FROM dbo.Dim_System

 

I select column System_bkey because it will be joined to the business key SystemId from table Source_Production. I select columns System_key and System_dkey because these columns will be added to the pipeline to be loaded into the Fact_System table.

4.  In page Columns I join business key SystemId with System_bkey and I checkmark System_key and System_dkey.

5.  In page Advanced I select 'Modify the SQL statement' because I will do T-SQL programming, but first I make a parameter ? to represent the business key SystemId from Source_Production by clicking [Parameters].

 

I replace the default sql statement:

 

select * from (SELECT System_bkey, System_key, System_dkey

from dbo.Dim_System) [refTable]

where [refTable].[System_bkey] = ?

 

with real T-SQL programming where the parameter ? represents the business key SystemId, which I check for NULL and replace with -1 for the »Missing« value in the dimension that has _key value -1. When the business key does not exist in the dimension table Dim_System, the variable @System_key will be null and I will insert a new data row into the dimension table Dim_System:

 

DECLARE @SystemId nvarchar(4) = ISNULL(?,-1) -- -1 = Missing.

DECLARE @System_key int

SET NOCOUNT ON

-- To check if @SystemId exists in dimension because with multiple same inferred

-- members I will only insert one new data row in the dimension for the first

-- inferred member. Therefore this T-SQL is a row-by-row operation.

SELECT @System_key = System_key

FROM dbo.Dim_System

WHERE System_bkey = @SystemId

 

IF @System_key IS NULL -- True when business key does not exist in dimension table.

BEGIN

  INSERT INTO dbo.Dim_System

  (System_dkey, System_pkey, System_bkey,

   System,

   ValidFrom, ValidTo,

   IsCurrent, IsDeleted, IsInferred,

   InsertTime, UpdateTime

  )

  VALUES

  (0, 0, @SystemId,                   -- System_dkey, System_pkey, System_bkey,

   CONCAT('Unknown', ' ', @SystemId), -- mix of Unknown with business key to System,

   '19000101', '99991231',            -- ValidFrom, ValidTo,

   1, 0, 1,                           -- IsCurrent, IsDeleted, IsInferred,

   Getdate(), Getdate()               -- InsertTime, UpdateTime

  )

  -- Grab the newly inserted and generated surrogate identity number of System_key.

  SET @System_key = SCOPE_IDENTITY()

 

  -- Set System_dkey to the new surrogate identity number.

  UPDATE dbo.Dim_System

  SET System_dkey = System_key

  WHERE System_key = @System_key

END

-- Give back value to Lookup therefore this select match the select in Connection.

SELECT System_bkey = @SystemId, System_key = @System_key, System_dkey = @System_key

 

 

When the SSIS package has been saved and opened again, the above sql statement has been reformatted with no line breaks; therefore I can copy-paste it to a sql format site like http://www.sql-format.com and get a nicely readable sql statement like the one I typed in when I made the package.

 

 

 

 

 

If you prefer to put the T-SQL programming in a stored procedure, you can call it from 'Modify the SQL statement' with EXECUTE dbo.Dim_System_Inferred ?.
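
A sketch of such a stored procedure wrapping the T-SQL shown above; the procedure name follows the EXECUTE example, the parameter name is my own assumption:

CREATE PROCEDURE dbo.Dim_System_Inferred
  @SystemIdInput nvarchar(4)
AS
BEGIN
  SET NOCOUNT ON
  DECLARE @SystemId nvarchar(4) = ISNULL(@SystemIdInput, -1)  -- -1 = Missing.
  DECLARE @System_key int

  SELECT @System_key = System_key
  FROM dbo.Dim_System
  WHERE System_bkey = @SystemId

  IF @System_key IS NULL  -- business key does not exist in the dimension table.
  BEGIN
    INSERT INTO dbo.Dim_System
    (System_dkey, System_pkey, System_bkey, System, ValidFrom, ValidTo,
     IsCurrent, IsDeleted, IsInferred, InsertTime, UpdateTime)
    VALUES
    (0, 0, @SystemId, CONCAT('Unknown', ' ', @SystemId), '19000101', '99991231',
     1, 0, 1, GETDATE(), GETDATE())

    SET @System_key = SCOPE_IDENTITY()

    UPDATE dbo.Dim_System
    SET System_dkey = System_key
    WHERE System_key = @System_key
  END

  -- Same result set as the inline version, so the Lookup gets its columns back.
  SELECT System_bkey = @SystemId, System_key = @System_key, System_dkey = @System_key
END
GO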

 

Read more about the theory in my article Dimensional modeling.

 

Approach to handle business key as a missing dimension value

T-SQL in a stored procedure gives the best performance and is easy to write without a lot of property settings and mouse clicks:

 

CREATE PROCEDURE [dbo].[Fact_System_Insert]

AS

BEGIN

  SET NOCOUNT ON

 

  TRUNCATE TABLE dbo.Fact_System

 

  INSERT INTO dbo.Fact_System

  (SystemDate, System_key, System_dkey, ProductionKWh)

 

  SELECT s.OutageDate, d.System_key, d.System_dkey, s.ProductionKWh

  FROM dbo.Source_Production s

       INNER JOIN dbo.Dim_System d ON d.System_bkey = s.SystemId AND

                  d.ValidFrom <= s.OutageDate AND d.ValidTo > s.OutageDate

  WHERE d.IsDeleted = 0

 

  UNION ALL

 

  SELECT s.OutageDate, -1 AS System_key, -1 AS System_dkey, s.ProductionKWh

  FROM dbo.Source_Production s

  WHERE NOT EXISTS(SELECT 1 FROM dbo.Dim_System d

                   WHERE d.System_bkey = s.SystemId AND

                   d.ValidFrom <= s.OutageDate AND d.ValidTo > s.OutageDate

                   AND d.IsDeleted = 0)

  ORDER BY OutageDate, System_key

END

 

The stored procedure will be executed from a SSIS package by an 'Execute T-SQL Statement Task'; therefore there is no data reading through an OLE DB Source into a SSIS pipeline and no later writing of pipeline data back to the database through an OLE DB Destination, because T-SQL does all the data extraction, transformation and loading inside the database using the query optimizer.

 

With T-SQL I can do a Left outer join together with a theta join for <= or > as another algorithm for inserting all source data rows into the fact table, with Missing as -1 handled by the IsNull function:

 

CREATE PROCEDURE [dbo].[Fact_System_Insert]

AS

BEGIN

  SET NOCOUNT ON

 

  TRUNCATE TABLE dbo.Fact_System

 

  INSERT INTO dbo.Fact_System

  (SystemDate, System_key, System_dkey, ProductionKWh)

  SELECT s.OutageDate, ISNULL(d.System_key,-1), ISNULL(d.System_dkey,-1),

         s.ProductionKWh

  FROM dbo.Source_Production s

       LEFT OUTER JOIN dbo.Dim_System d ON d.System_bkey = s.SystemId AND

             d.ValidFrom <= s.OutageDate AND d.ValidTo > s.OutageDate AND

             d.IsDeleted = 0

  ORDER BY s.OutageDate, d.System_key

END

 

SSIS package Merge join can’t do theta join.

 

11.4 Date Range Lookup with script in C# for very fast performance

A SSIS package will query the source rows as »Full load« or as »Incremental load from an archive by delta data in stream«, and the rows will be inserted into the pipeline where a lookup fetches the right dimension value according to the business key, e.g. SystemId, and the transaction datetime, e.g. OutageDate, with this criteria that is called a Date Range Lookup because of the <= and > (also called Range Filtering):

 

System_bkey = SystemId AND

ValidFrom <= OutageDate AND ValidTo > OutageDate

 

It will be done as a row-by-row operation and the pipeline stream of rows ends in the fact destination table (target table). The SSIS Lookup component is too slow as mentioned in 11.2, therefore I will here make a Lookup as a Script Component that is added between Source and Destination with these properties:

 

 

Select relevant columns from source in the pipeline to do the lookup in the table Dim_System. SystemId is the business key in the source and OutageDate is going to be used like this:

 

System_bkey = SystemId AND

ValidFrom <= OutageDate AND ValidTo > OutageDate

 

Therefore I select columns SystemId and OutageDate from the source pipeline as input columns to the script component:

 

 

From the table Dim_System I want the two columns System_key and System_dkey (the type 2 key and the durable key for type 1) to the fact table as type 7, therefore I add both of them as output columns to the script component, so they will be added to the pipeline and saved into the destination fact table.

 

In the dialog box click the page »Inputs and Outputs«, in the tree click »Output Columns«, click the button [Add Column], type in the Name System_key and select the right data type (int = DT_I4). I repeat it all again for the other column System_dkey, and I get this:

 

 

Since the script is going to read and cache all rows from the table Dim_System, I add the ADO.NET connection to the component. A Script Component cannot use an OLEDB connection. Connection TSQL_DSAconnectionstring is an ADO.NET connection I used for Truncate Table Fact_System in the Control Flow at an Execute T-SQL Statement Task, therefore I reuse the ADO.NET connection:

 

 

At the page Script, I click the button [Edit Script] and type in this C# program; after the typing I click Build and Build SC_ in the menubar and check that the status message in the bottom left corner says Build succeeded. I leave the coding window through the menubar by clicking File and Exit.

In the C# program code I am using a data structure called Dictionary, which is a very fast load-and-search in-memory cache, to store the values of the dimension that I read from the database with a sql statement.

 

using System;

using System.Data;

using System.Data.SqlClient;

using System.Collections.Generic; // for Dictionary as array/container data structure.

using System.Windows.Forms;       // for MessageBox.Show.

using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

using Microsoft.SqlServer.Dts.Runtime.Wrapper;

 

[Microsoft.SqlServer.Dts.Pipeline.

  SSISScriptComponentEntryPointAttribute]

public class ScriptMain : UserComponent

{

    private Dictionary<Data, Keys> dictionary; // the cache to contain the sql output rows.

 

    public override void PreExecute()

    {

        base.PreExecute();

 

        // sql to fetch data from a Kimball type 7 dimension with key and durable key _dkey.

        // and data rows marked deleted is not included because not relevant for fact rows in

        // the pipeline inside the SSIS from a Source to a Destination (target) fact table.

 

        string sql = @"SELECT System_key, System_dkey, System_bkey, ValidFrom, ValidTo

                       FROM dbo.Dim_System

                       WHERE IsDeleted = 0";

 

        IDTSConnectionManager100 manager;

        SqlConnection connection;

        SqlCommand command;

        SqlDataReader reader;

        // ADO.NET connection in SSIS package, the last word Connection is from

        // the Name column of the Script Component Connection Managers.

        manager = this.Connections.Connection;

        connection = (SqlConnection)manager.AcquireConnection(null);

        command = new SqlCommand(sql, connection);

        reader = command.ExecuteReader();

 

        dictionary = new Dictionary<Data, Keys>(); // create cache in memory for adding rows.

 

        int bkey; // when System_bkey is varchar, we convert=parse to int to fit ProcessInputRow.

        // for each row the script creates two data objects and add them to the cache in memory.

        while (reader.Read())

        {

            Data d = new Data();  // business key and validfrom and validto for range lookup.

            Keys k = new Keys();  // the keys to be returned back into the pipeline.

 

            //d.Bkey = int.Parse(reader["System_bkey"].ToString()); // when _bkey is int.

            if (int.TryParse(reader["System_bkey"].ToString(), out bkey))

            {       // _bkey = "172-9" try becomes false and the row is not added to dictionary.

              d.Bkey = bkey; // the output value from TryParse is assigned to the d object.

              d.ValidFrom = DateTime.Parse(reader["ValidFrom"].ToString());

              d.ValidTo = DateTime.Parse(reader["ValidTo"].ToString());

              // only date d.ValidTo = DateTime.Parse(reader["ValidTo"].ToString()).Date;

              // to remove hour,minute,seconds and still use declaration: DateTime ValidTo

              // bigint with null d.Type = (reader["SystemType"] == DBNull.Value ? (long?)null :

              //                            Convert.ToInt64(reader["SystemType"]));

              //MessageBox.Show(reader["System_bkey"].ToString());

              k.Key = int.Parse(reader["System_key"].ToString());

              k.Dkey = int.Parse(reader["System_dkey"].ToString());

              // k.Num = (reader["SystemNum"] == DBNull.Value ? (int?)null :

              //                            Convert.ToInt32(reader["SystemType"]));

              dictionary.Add(d, k);

            }

        }

        //MessageBox.Show(dictionary.Count.ToString()); // number of rows in table.

        reader.Close();

        reader.Dispose();

        reader = null;

        command.Dispose();

        command = null;

        connection.Close();

        connection.Dispose();

        connection = null;

        manager = null;

    }

 

    public override void PostExecute()

    {

        base.PostExecute();

    }

 

    public override void Input0_ProcessInputRow(Input0Buffer Row)

    {

        // for each row in pipeline from source the script instantiate an object.       

        if (!Row.SystemId_IsNull) // if business key is not null then do.

        {                         // && Row.CodeText.Trim().Length >= 1 no empty string "".

            Data d = new Data();     // to hold the value of (bkey, transactiondatetime).

            d.Bkey = Row.SystemId;   // the value to be found for the business key and

            d.Date = Row.OutageDate; // datetime datatype as a TransactionDatetime.

            // only date d.Date = DateTime.Parse(Row.OutageDate.ToString()).Date;

            // bigint with null d.Type = (Row.SystemType_IsNull ? (long?)null :

            //                            Convert.ToInt64(Row.SystemType));

            Keys value;  // to hold the value of (key, dkey) that will be found by TryGetValue.

            if (dictionary.TryGetValue(d, out value)) // lookup business key in dictionary.

            {                                         // if found then use the keys values.

                Row.Systemkey = value.Key;            // the script does not like _ in column names.

                Row.Systemdkey = value.Dkey;

 

                //if (value.Num.HasValue) // An extra output column that allows a null value.

                //    Row.Num = (int)value.Num; // take care of an integer value with a cast.

                //else

                //    Row.Num_IsNull = true; // the way to assign null value to output column.

                // if (!Row.ColumnA_IsNull)  // assign ColumnB when ColumnA is allow null.

                //     Row.ColumnB = Row.ColumnA;

                // else

                //     Row.ColumnB_IsNull = true;

            }

            else // if business key is not found in dictionary then make an inferred member.

            {

                int key = -2, dkey = -2;   // -2 in a dimension represent Unknown data value.

                InferredMember(Row.SystemId, out key, out dkey);  // output new id for keys.

                Row.Systemkey = key;

                Row.Systemdkey = dkey;

                // or treat inferred member in fact row as a not available unknown key (-2).

                //Row.Systemkey = -2;

                //Row.Systemdkey = -2;

                //Row.Num_IsNull = true;

            }

        }

        else // if business key is null or is an empty string then use the missing key (-1).

        {

            Row.Systemkey = -1;

            Row.Systemdkey = -1;

            //Row.Num_IsNull = true;

        }

    }

 

    private void InferredMember(int Bkey, out int Key, out int Dkey)

    {

        try

        {

            IDTSConnectionManager100 manager;

            SqlConnection connection;

            SqlCommand command;

            // ADO.NET connection in SSIS package, the last word Connection is from

            // the name of the Script Component Connection Managers.

            manager = this.Connections.Connection;

            connection = (SqlConnection)manager.AcquireConnection(null);

            command = new SqlCommand();

            command.Connection = connection;

            command.CommandType = CommandType.StoredProcedure;

            //command.CommandTimeout = 10; Seconds to wait for execute. Default is 30 seconds.

            command.CommandText = "dbo.Dim_System_MakeInferred";

            command.Parameters.Add("Bkey", SqlDbType.Int).Value = Bkey;

            command.Parameters.Add("Key", SqlDbType.Int).Direction = ParameterDirection.Output;

            command.Parameters.Add("Dkey", SqlDbType.Int).Direction = ParameterDirection.Output;

            command.Parameters.Add("ValidFrom", SqlDbType.DateTime).Direction =

                                                                      ParameterDirection.Output;

            command.Parameters.Add("ValidTo", SqlDbType.DateTime).Direction =

                                                                      ParameterDirection.Output;

            command.ExecuteNonQuery();

 

            Key = (int)command.Parameters["Key"].Value;

            Dkey = (int)command.Parameters["Dkey"].Value;

            DateTime ValidFrom = (DateTime)command.Parameters["ValidFrom"].Value;

            DateTime ValidTo = (DateTime)command.Parameters["ValidTo"].Value;

 

            command.Parameters.RemoveAt("Bkey");

            command.Parameters.RemoveAt("Key");

            command.Parameters.RemoveAt("Dkey");

            command.Parameters.RemoveAt("ValidFrom");

            command.Parameters.RemoveAt("ValidTo");

            command.Dispose();

            command = null;

            connection.Close();

            connection.Dispose();

            connection = null;

            manager = null;

 

            Data d = new Data();

            Keys v = new Keys();

            d.Bkey = Bkey;

            d.ValidFrom = ValidFrom;

            d.ValidTo = ValidTo;

            v.Key = Key;

            v.Dkey = Dkey;

            dictionary.Add(d, v);

        }

        catch (Exception ex)

        {

            Key = Dkey = -2;

            throw new Exception("Something wrong: " + ex.Message);  // To be seen in SSISDB.

        }

    }

}

 

public struct Keys  // Keys to be found as the output value of the Dictionary data structure.

{

    public int Key { get; set; }

    public int Dkey { get; set; }

    // public int? Num { get; set; } // When an extra output column allows a null value.

}

 

public class Data : IEquatable<Data>

{

    public int Bkey { get; set; } // columns from the lookup table to help to find the row.

    public DateTime ValidFrom { get; set; }

    public DateTime ValidTo { get; set; }

    public DateTime Date { get; set; } // input column to be between ValidFrom and ValidTo.

    // bigint with null value: public long? Type { get; set; } // or Nullable<Int64>

 

    // No need for the constructor

    // public Data()

    // {}

 

    public override int GetHashCode()

    {

        unchecked // only hash the business key, not ValidFrom/ValidTo, to look up and find a match

        {         // with the row in pipeline and Equals return condition must also be true.

            return this.Bkey.GetHashCode();

            //return (this.BkeyCRM.ToString() + ";" + this.BkeyERP.ToString() + ";" +

            //        this.BkeyHR.ToString()).GetHashCode();

            //int hash = 13;

            //hash = (hash * 7) + this.BkeyCRM.GetHashCode();

            //hash = (hash * 7) + this.BkeyERP.GetHashCode();

            //hash = (hash * 7) + this.BkeyHR.GetHashCode();

            //return hash;

        }

    }

 

    // The Equals method handles a collision when the same Bkey exists in multiple rows because

    // the values of the business key have been changed in the source; then we choose the row

    // where Date is between ValidFrom and ValidTo. The return condition must be true for all

    // values to make TryGetValue true.

    public bool Equals(Data d)

    {

        return this.Bkey == d.Bkey && this.ValidFrom <= d.Date && this.ValidTo > d.Date;

    }

}

 

-----------------

// float? amount = 10.25f; is the same as Nullable<float> amount = 10.25f;

// Database null in a variable needs a datatype with ? or Nullable<Int32>, an example:

int? num = null;

if (num.HasValue)

{

    num = Convert.ToInt32(num);

}

else

{

    num = 0;

}

string text = null; // a reference type like string can hold null without a ? suffix.

if (!string.IsNullOrEmpty(text)) // tests for both null and the empty string "".

 

The stored procedure to make an inferred member in a dimension table.

 

CREATE PROCEDURE [dbo].[Dim_System_MakeInferred]

   @Bkey int,

   @Key int OUTPUT,

   @Dkey int OUTPUT,

   @ValidFrom datetime OUTPUT,

   @ValidTo datetime OUTPUT

AS

BEGIN

  DECLARE @InsertedRow TABLE (Keyx int)

  DECLARE @currentDateTime  datetime = GETDATE() -- or IdaInsertTime.

  SET @ValidFrom = '1900-01-01 00:00:00'

  SET @ValidTo   = '9999-12-31 00:00:00' 

  SET NOCOUNT ON

  BEGIN TRY

    BEGIN TRANSACTION

    -- Concurrency control: when two or more ETL SSIS packages running in parallel want to make

    -- the same inferred member at the same time, the insert is serialized inside the transaction

    -- by the SQL Server locking mechanism, so only one of them inserts the row and the

    -- others will read it afterwards. Transaction isolation.

    INSERT INTO dbo.Dim_System -- the with (serializable) hint will be removed in a future version.

    (System_dkey, System_pkey, System_bkey,

     System,

     ValidFrom, ValidTo,

     IsCurrent, IsDeleted, IsInferred,

     InsertTime, UpdateTime

    )

    -- Grab the newly inserted and generated surrogate identity number of System_key.

    OUTPUT inserted.System_key INTO @InsertedRow

    SELECT

     0, 0, @Bkey,                       -- System_dkey, System_pkey, System_bkey,

     CONCAT('Unknown', ' ', @Bkey),     -- mix of Unknown with business key to System,

     @ValidFrom, @ValidTo,              -- ValidFrom, ValidTo,

     1, 0, 1,                           -- IsCurrent, IsDeleted, IsInferred,

     @currentDateTime, @currentDateTime -- InsertTime, UpdateTime

    WHERE NOT EXISTS

    (

     SELECT TOP(1) 1

     FROM dbo.Dim_System

     WHERE System_bkey = @Bkey AND IsCurrent = 1

    )

    IF @@ROWCOUNT = 1

    BEGIN

      SELECT @Key = Keyx FROM @InsertedRow

        -- WAITFOR DELAY '00:00:15' -- for testing

        -- Set Dkey to the new surrogate identity number.

      UPDATE dbo.Dim_System

      SET @Dkey = System_dkey = System_key

      WHERE System_key = @Key

    END

    ELSE  -- When the business key has been inserted by another process we fetch it here.

    BEGIN

      SELECT @Key = System_key, @Dkey = System_dkey, @ValidFrom = ValidFrom, @ValidTo = ValidTo

      FROM dbo.Dim_System

      WHERE System_bkey = @Bkey AND IsCurrent = 1

    END

    IF @@TRANCOUNT > 0 COMMIT TRANSACTION -- The transaction will release the lock.

  END TRY 

  BEGIN CATCH

    -- Determine if an error occurred.  

    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION -- The transaction will be cancelled.

    -- Return the error information. 

    DECLARE @ErrorMessage nvarchar(4000),  @ErrorSeverity int, @ErrorState int

    SELECT @ErrorMessage = ERROR_MESSAGE(),@ErrorSeverity = ERROR_SEVERITY(),

           @ErrorState = ERROR_STATE()

    RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState) WITH NOWAIT

    ;THROW  -- does almost the same as RAISERROR, so both commands are not needed; I prefer THROW.

  END CATCH

END

GO

 

11.5 OLE DB Destination handles primary key violation and other errors

When an OLE DB Destination meets bad data it will fail and not insert any rows, not even the good ones. Bad data means a primary key violation, a foreign key violation or another constraint violation error. A primary key violation happens when a row from the OLE DB Source table already exists in the OLE DB Destination table, or when the source table has two rows with the same primary key value, perhaps because the rows are duplicates. Here is a trick to have the good rows inserted by using two OLE DB Destinations:

 

  1. When all rows in the batch to the destination are fine, e.g. 'Maximum insert commit size' is 100000, all rows are inserted successfully.
  2. When one or more rows in the batch are bad, we add an extra OLE DB Destination and set 'Maximum insert commit size' to 1; then each good row will be inserted and each bad row will be rejected, because the destination only tries to commit a single row at a time.

 

We let the first OLE DB Destination redirect bad rows through the red pipeline to a second OLE DB Destination that inserts rows one by one, so all good rows will be inserted successfully and the bad rows will again be redirected to an error table together with an error description that we extract in a Script Component.

 

 

An OLE DB Source table comes with 10 rows, but two rows have the same primary key ALFKI, and the last row has a null value in a column that does not allow null in the OLE DB Destination table:

 

CustomerID  CompanyName                ContactName
ALFKI       Alfreds Futterkiste        Maria Anders
ANATR       Ana Trujillo Emparedados   Ana Trujillo
ANTON       Antonio Moreno Taquería    Antonio Moreno
AROUT       Around the Horn            Thomas Hardy
BERGS       Berglunds snabbköp         Christina Berglund
BLAUS       Blauer See Delikatessen    Hanna Moos
BLONP       Blondesddsl père et fils   Frédérique Citeaux
BOLID       Bólido Comidas preparadas  Martín Sommer
ALFKI       Bon app'                   Laurence Lebihan
BOTTM       Bottom-Dollar Markets      NULL

 

All 10 rows are in the same batch, therefore the first OLE DB Destination will fail and not insert any rows, and all 10 rows are redirected to the second OLE DB Destination that inserts rows one by one, so the 8 good rows will be inserted successfully and the 2 bad rows will again be redirected and inserted into an error table:

 

CustomerID  CompanyName            ContactName       ErrorDescription                                 ErrorColumnName
ALFKI       Bon app'               Laurence Lebihan  No status is available.
BOTTM       Bottom-Dollar Markets  NULL              The data value violates integrity constraints.

 

ErrorDescription can also be e.g. 'The data was truncated.' when the destination column size is too small.

 

Both OLE DB Destinations are using the access mode 'Table or View - fast load'. Instead of using commit size 1, you can use the access mode 'Table or View' in the second destination, because it saves each row individually and therefore performs a row-by-row insert.
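
Below is a minimal sketch of an error table that matches the two extra columns used in this example; the table name and the column sizes are assumptions and should be adjusted to your own source columns:

CREATE TABLE dbo.Customers_Error
(
  CustomerID       nchar(5)       NULL,  -- all columns allow null since bad rows may be incomplete.
  CompanyName      nvarchar(40)   NULL,
  ContactName      nvarchar(30)   NULL,
  ErrorDescription nvarchar(1000) NULL,
  ErrorColumnName  nvarchar(128)  NULL,
  InsertTime       datetime2(3)   NOT NULL DEFAULT (SYSDATETIME())  -- when the bad row was logged.
)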

 

The Script Component C# code to catch the error message and add it to the two output columns that are in the error table:

 

 

SQL Server 2016+:

 

public override void Input0_ProcessInputRow(Input0Buffer Row)

{

 try

 {

  //Row.ErrorRowId = Row.SupplierID_IsNull ? "" : Row.SupplierID.ToString();

  Row.ErrorDescription = this.ComponentMetaData.GetErrorDescription(Row.ErrorCode);

  // = Row.ErrorCode.ToString();

  IDTSComponentMetaData130 componentMetaData = this.ComponentMetaData as IDTSComponentMetaData130;

  if (componentMetaData != null && Row.ErrorColumn != 0)

  {

   Row.ErrorColumnName = componentMetaData.GetIdentificationStringByID(Row.ErrorColumn).Replace("OLE DB Destination Retry.Inputs[OLE DB Destination Input].Columns[", "").Replace("]", "");

  }

  else

  {

   if (Row.ErrorCode == -1071607683 || Row.ErrorCode == -1071607685)

   {

    Row.ErrorDescription = "Check the row for a violation of a primary key or a foreign key constraint.";

    Row.ErrorColumnName = "Unable to fetch column name. ErrorCode: " + Row.ErrorCode.ToString();

   }

   else

   {

    Row.ErrorColumnName = "ErrorCode: " + Row.ErrorCode.ToString();

   }

  }

 }

 catch (Exception)

 {

  Row.ErrorColumnName = "Unable to fetch column name";

 }

}

 

Old version:

public override void Input0_ProcessInputRow(Input0Buffer Row)

{

 try

 {

  //Row.ErrorRowId = Row.SupplierID_IsNull ? "" : Row.SupplierID.ToString();

  if (Row.ErrorCode == -1071607685)

  {

   Row.ErrorDescription = "Check the row for a violation of a primary key or a foreign key constraint.";

   Row.ErrorColumnName = "Unable to fetch column name. ErrorCode: " + Row.ErrorCode.ToString();

  }

  else

  {

   Row.ErrorDescription = this.ComponentMetaData.GetErrorDescription(Row.ErrorCode);

                     // = Row.ErrorCode.ToString();

   IDTSComponentMetaData130 componentMetaData = this.ComponentMetaData as IDTSComponentMetaData130;

   if (componentMetaData != null && Row.ErrorColumn != 0)

   {

    Row.ErrorColumnName = componentMetaData.GetIdentificationStringByID(Row.ErrorColumn).Replace("OLE DB Destination Retry.Inputs[OLE DB Destination Input].Columns[", "").Replace("]", "");

   }

   else

   {

    Row.ErrorColumnName = "ErrorCode: " + Row.ErrorCode.ToString();

   }

  }

 }

 catch (Exception)

 {

  Row.ErrorColumnName = "Unable to fetch column name";

 }

}

 

// For sql server 2014

// Row.ErrorDescription = this.ComponentMetaData.GetErrorDescription(Row.ErrorCode) +

//     " for column " + Row.ErrorColumn.ToString();

// IDTSComponentMetaData100 componentMetaData = this.ComponentMetaData as IDTSComponentMetaData100;

// Row.ErrorColumnName = componentMetaData.Description;

 

Remember, when you drag the red pipeline from an OLE DB Destination, you must select 'Redirect row' in the Error column dropdown in the dialogbox, and then the yellow warning triangle disappears:

 

 

With this »double error destination« approach you can implement a robust ETL process that does not fail when a source system provides bad data.

 

 

12. Dynamic sql based on each row in a table

This approach can be used for asking for delta data from a specific date timestamp, but I will make it with a loop over rows where each row creates a dynamic sql to fetch data from a source system table and store the data in a table. The source database is Northwind with a Customers table from which I will fetch data into a data warehouse table called Source_Customer, with help from another table Source_Customer_ID which contains the CustomerID values I want to fetch one by one from Northwind.
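
To make the walkthrough concrete, here is a minimal sketch of the two work tables; the column list and the data types are assumptions based on the Northwind Customers table and can be adjusted:

CREATE TABLE dbo.Source_Customer_ID
(
  CustomerID nchar(5) NOT NULL PRIMARY KEY  -- the business keys to fetch one by one.
)
GO
CREATE TABLE dbo.Source_Customer
(
  CustomerID   nchar(5)     NOT NULL,
  CompanyName  nvarchar(40) NULL,
  ContactName  nvarchar(30) NULL,
  ContactTitle nvarchar(30) NULL,
  Address      nvarchar(60) NULL,
  City         nvarchar(15) NULL,
  Region       nvarchar(15) NULL,
  PostalCode   nvarchar(10) NULL,
  Country      nvarchar(15) NULL,
  Phone        nvarchar(24) NULL,
  Fax          nvarchar(24) NULL
)
GO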

 

I use Execute T-SQL Statement Task, Data Flow Task, OLE DB Source, Recordset Destination, Foreach Loop Container and OLE DB Destination.

 

'Recordset Destination' creates and populates an in-memory ADO recordset that is available outside of the data flow and I will store the CustomerID values from the table Source_Customer_ID that will be loaded into the data warehouse.

 

 

I make three package variables to hold a recordset of table Source_Customer_ID and the column CustomerID that will be used for the 'Foreach Loop Container'.

 

Variable name  Datatype  Value
Recordset      Object
CustomerID     String
RowCount       Int32     0

 

In 'Control Flow' I drag in an 'Execute T-SQL Statement Task' to start by emptying the Source_Customer table with a sql statement:

 

TRUNCATE TABLE dbo.Source_Customer

 

I drag in a 'Data Flow Task' and an 'OLE DB Source' that I connect to table Source_Customer_ID.

 

I drag in a 'Recordset Destination' and pull the pipeline to it and edit it to select the variable User::Recordset for the property VariableName, and at the tab Input Columns I checkmark the column CustomerID. Now the recordset variable will contain the records from table Source_Customer_ID and I can loop through it.

 

 

Back in 'Control Flow' I drag in a 'Foreach Loop Container' and pull the green 'Precedence constraint' arrow to it and edit it at the Collection page, where at the property Enumerator I select {Foreach ADO Enumerator} and at the property ADO object source variable I select {User::Recordset}. At the page Variable Mappings I select the variable CustomerID that will change value for each record that is looped over.

 

 

I like to test that the package is working, but I am not so much into breakpoints, therefore I drag a 'Script Task' into the 'Foreach Loop Container', edit it, click the button [Edit Script] and in the editor window I type in a C# statement above the line that is already there: Dts.TaskResult = (int)ScriptResults.Success;

I type in a messagebox that will be shown when the package is running; I cast the variable to a string, although I already know the variable is a string:

 

MessageBox.Show(Dts.Variables["User::CustomerID"].Value.ToString());

 

I exit the C# editor window and at the property ReadOnlyVariables I checkmark User::CustomerID. It is very important to remember this, otherwise the script will not work.

 

I start the package to see a messagebox for each value of the variable CustomerID that has been taken from table Source_Customer_ID into the recordset variable while the 'Foreach Loop Container' is traversing the recordset variable.

 

Everything is working well, so I rightclick 'Script Task' and {Disable} it, so no more messageboxes and easy to {Enable} it again another time.

 

The dynamic sql statement will use the variable CustomerID and make a select-from-where sql statement per loop over the Recordset variable with different values in the CustomerID variable. I drag a 'Data Flow Task' into the 'Foreach Loop Container' and an 'OLE DB Source' that I give this SQL Command:

 

SELECT

  CustomerID,

  CompanyName,

  ContactName,

  ContactTitle,

  Address,

  City,

  Region,

  PostalCode,

  Country,

  Phone,

  Fax

FROM dbo.Customers

WHERE CustomerID = ?

 

The ? question mark is an unknown input parameter value that makes the sql statement dynamic, because I connect the parameter to the CustomerID variable by clicking the button [Parameters] and selecting User::CustomerID. At runtime the variable will change value from the 'Foreach Loop Container', and the 'OLE DB Source' will make a new query request to the Customers table in the Northwind database and fetch a record that I like to store in Source_Customer.

 

 

To know when the select sql statement has found a record that matches the value of the CustomerID variable, I drag in a 'Row Count' and pull the pipeline to it and edit it to select the package variable User::RowCount. I will use the RowCount variable in 'Control Flow' later. First I will tell the 'Data Flow' to store the found record in the destination table Source_Customer.

 

I drag in an 'OLE DB Destination' and pull the pipeline to it and select the table Source_Customer and check the mapping, which is fine by default.

 

When a record has been found in Northwind and successfully saved in Source_Customer, I like to delete the CustomerID record in the Source_Customer_ID table.

 

Back in 'Control Flow' I drag an 'Execute SQL Task' into the 'Foreach Loop Container' and pull the green 'Precedence constraint' arrow to it.

 

I edit the green 'Precedence constraint' arrow, because I can extend the constraint with a logical boolean expression, so I change 'Evaluation operation' from "Constraint" to "Expression and Constraint". In the expression I am using the RowCount variable with a C#-like notation for an equality comparison:

 

@[User::RowCount] == 1

 

The expression must evaluate to true before the 'Control Flow' continues to the 'Execute SQL Task'. If a CustomerID in table Source_Customer_ID was not found in the Northwind Customers table in the 'Data Flow Task', the RowCount variable will be assigned the value 0, the expression above will be false, and the 'Control Flow' will not continue to the 'Execute SQL Task'. This is a smart way to do a kind of 'Conditional Split' in a 'Control Flow', because a red Failure 'Precedence constraint' arrow will be followed when the expression is false. An fx logo is shown at the green 'Precedence constraint' arrow telling that an expression is included in the criteria for success and continuation.

 

I edit the 'Execute SQL Task' to add a dynamic delete sql statement that will delete the record from the Source_Customer_ID table for the value in the CustomerID variable. In the SQLStatement property I type in:

 

DELETE

FROM dbo.Source_Customer_ID

WHERE CustomerID = ?

 

At Parameter Mapping page I select:

 

Variable Name    User::CustomerID
Direction        Input
Data Type        NVARCHAR
Parameter Name   0
Parameter Size   5  (for the five characters in the CustomerID column datatype)

 

'Parameter Name' shows "NewParameterName" but it must be overwritten with 0 (zero) for the first ? parameter, 1 for the second ? parameter and so on. A named 'Parameter Name' is only used when calling a parameterized stored procedure, where 'Parameter Name' is the same name as the stored procedure's parameter name.

For integer, decimal, date, datetime and so on 'Parameter Size' will be -1.

 

 

In case you want to be more sure of the parameter name in a dynamic sql statement and give it a stored procedure look-a-like, this is useful, where the ? parameter value is assigned to a t-sql variable:

 

DECLARE @CustomerID AS nvarchar(5) = ?

IF @CustomerID IS NOT NULL

BEGIN

   DELETE

   FROM dbo.Source_Customer_ID

   WHERE CustomerID = @CustomerID

END

 

When there is several delete statements in different tables you can reuse the t-sql variable.

 

Stored procedure

I prefer using stored procedures, so a SSIS package does not contain sql statements and does not have a dependency on the database tables and columns. A stored procedure becomes an interface, so the SSIS package does not need to know anything else about the database.

 

CREATE PROCEDURE dbo.Source_Customer_ID_Delete

          @CustomerID nchar(5)

AS

BEGIN

          SET NOCOUNT ON

          DELETE

          FROM dbo.Source_Customer_ID

          WHERE CustomerID = @CustomerID

END

GO

 

SQLStatement     dbo.Source_Customer_ID_Delete ?
Variable Name    User::CustomerID
Direction        Input
Data Type        NVARCHAR
Parameter Name   @CustomerID  (the stored procedure parameter name, or 0)
Parameter Size   5  (for the five characters in the CustomerID column datatype)

 

It is very important that Parameter Name has the @ in the name; alternatively I can use 0, as shown before, for the first ? parameter.

 

 

A stored procedure can also be used in an 'OLE DB Source' with a call for data like this:

 

EXECUTE dbo.CustomerBySelection ?

WITH RESULT SETS(

(

  CustomerID int,

  CompanyName nvarchar(50),

  ContactName nvarchar(50),

  ContactTitle nvarchar(35),

  Address nvarchar(100),

  City nvarchar(50),

  Region nvarchar(20),

  PostalCode nvarchar(10),

  Country nvarchar(50),

  Phone nvarchar(50),

  Fax nvarchar(50)

))

 

to explicitly specify the data types of the columns for the pipeline, and under the button [Parameters] in the column Parameters I type in the parameter name from the stored procedure including @, like this: @CustomerID
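
The stored procedure itself is not shown in this article; a minimal sketch of how dbo.CustomerBySelection could look, assuming it simply filters the Customers table (the parameter data type and the column list are assumptions and must match the WITH RESULT SETS declaration above):

CREATE PROCEDURE dbo.CustomerBySelection
   @CustomerID nchar(5)
AS
BEGIN
  SET NOCOUNT ON
  SELECT CustomerID, CompanyName, ContactName, ContactTitle, Address,
         City, Region, PostalCode, Country, Phone, Fax
  FROM dbo.Customers
  WHERE CustomerID = @CustomerID
END
GO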

 

 

Preview does not work together with a ? parameter, therefore replace ? with a va­lue like this: dbo.CustomerBySelection 'DALB'

Now Preview will show data for that CustomerID; remember to replace the value back to ? so the parameter for the stored procedure uses the SSIS package variable User::CustomerID.

 

When a stored procedure has two or more parameters, remember to place a comma between the ? marks, like this: dbo.CustomerBySalesPeriod ?, ?

More readable way: dbo.CustomerBySalesPeriod @StartDate = ?, @EndDate = ?

And there must be two Parameter Mappings, @StartDate and @EndDate, each connected to its own SSIS package variable, like this:

 

 

Execute SQL Task with ADO.NET connection to Result Set to Foreach Loop

Instead of using a 'Data Flow Task' to do the Select statement and flow data to a 'Recordset Destination', you can use an 'Execute SQL Task' in 'Control Flow' to do the Select statement: set the property 'ResultSet' to 'Full result set' and at the page 'Result Set' add a package variable with datatype Object, like Recordset as used above. If the Select statement has a Where part and you like to use named @parameters instead of the values 0, 1, then let the 'Execute SQL Task' use ConnectionType ADO.NET, and the Where part can have @parameters like we saw above when we called a stored procedure with a parameter. I am using package variables with default values to show how ADO.NET is implemented, and the position of the @parameters does not matter. It is the same ADO.NET connection that the 'Execute T-SQL Statement Task' is using, but that task cannot have parameters.

 

 

 

ADO.NET benefits for a sql statement with a search criteria in a variable

A package variable @[User::ContactTitle] in the dialogbox column Variable Name will at run time be transferred to a t-sql variable in the column Parameter Name, where I have named the variable @ContactTitle (it does not need to have the same name as the package variable). It automatically gets the right data type, therefore no declare is needed, and inside the SQL Statement I can use @ContactTitle like in normal t-sql code in a stored procedure or in an ad hoc query in SQL Server Management Studio, as sketched below.
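
A minimal sketch of such an SQLStatement, assuming a Northwind-style Customers table and the package variable mapped to the t-sql variable @ContactTitle:

SELECT CustomerID, CompanyName, ContactName, ContactTitle
FROM dbo.Customers
WHERE ContactTitle = @ContactTitle  -- the value comes from @[User::ContactTitle] at run time.
ORDER BY CustomerID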

 

 

 

13. Transforming data by C#

In a Data Flow Task there is a Script Component that can update columns in a SSIS pipeline with C#. In this example I have a source table with an nvarchar(50) column called Code that has a mix of chars like 12AK-34-x or 12.34.56, and I like to change letters to uppercase, then remove everything that is not a digit from the Code, and set null or an empty string with length 0 to the text "Nothing". The updated pipeline will be inserted into a staging table.

 

 

When editing a Script Component it is important to checkmark the column(s) that are going to be used in the C# program, either for reading only or also for writing back to update the column. I open the editor by clicking the button [Edit Script] and there are already three methods:

 

  • PreExecute() to initialize attributes and objects and open connections.
  • PostExecute() to release objects, close connections and set package variables.
  • Input0_ProcessInputRow(Input0Buffer Row)

Input0 represents the pipeline into the Script Component, and Row contains the checkmarked columns that are referred to like Row.Columnname. It is here the transformation is programmed, like this:

 

public override void Input0_ProcessInputRow(Input0Buffer Row)

{

    String pattern = "[^0-9]";

    System.Text.RegularExpressions.Regex regex =

      new System.Text.RegularExpressions.Regex(pattern,

          System.Text.RegularExpressions.RegexOptions.Compiled);

 

    if (!Row.Code_IsNull && Row.Code != String.Empty)

    {

        Row.Code = Row.Code.ToUpper();

        Row.Code = regex.Replace(Row.Code, String.Empty);

    }

    else

    {

        Row.Code = "Nothing";

    }

}

 

Row has a _IsNull property per column, and with ! it becomes true when the column is not null.

An empty string with length 0, done in sql as UPDATE T SET C = '', can also be detected.

The method will be called once per row in the pipeline, so the two variables would be declared, assigned and released (or disposed) in each call. Therefore it is better to place the variables as attributes of the class and assign them in PreExecute, because that happens only one time no matter how many rows there are in the pipeline. I have included a "using" directive (same as VB.NET Imports) to enable a type name like "Regex" to be referenced without namespace qualification, which makes the program more readable:

 

using System.Text.RegularExpressions;

[Microsoft.SqlServer.Dts.Pipeline.

  SSISScriptComponentEntryPointAttribute]

public class ScriptMain : UserComponent

{

    String pattern;

    Regex regex;

 

    public override void PreExecute()

    {

        base.PreExecute();

        pattern = "[^0-9]";

        regex = new Regex(pattern, RegexOptions.Compiled);

    }

 

    public override void PostExecute()

    {

        base.PostExecute();

    }

 

    public override void Input0_ProcessInputRow(Input0Buffer Row)

    {

        if (!Row.Code_IsNull && Row.Code != String.Empty)

        {

            Row.Code = Row.Code.ToUpper();

            Row.Code = regex.Replace(Row.Code, String.Empty);

        }

        else

        {

            Row.Code = "Nothing";

        }

    }

}

 

A Script Component can split an input pipeline into multiple output pipelines, so instead of updating to "Nothing" I will redirect these rows to another pipeline and to another OLE DB Destination. I edit the Script Component and at the Inputs and Outputs page I click [Add Output] and rename the two outputs to Valid and Error, set both their SynchronousInputID properties to the same Script Component.Inputs[Input 0], set both their ExclusionGroup properties to 1 and change the C# programming:

 

 

It is Row that directs each row to one of the two outputs Valid and Error with:

Row.DirectRowToValid();   and Row.DirectRowToError();

where the . dot popup IntelliSense shows the names of the two outputs I made.

 

 

14. Transforming data by OLE DB Command row-by-row

I like to extend the pipeline with a new column, but the Derived Column can't access a database or a stored procedure, therefore I will use the row-by-row operation called OLE DB Command. For each row in the pipeline I will call a stored procedure with an input and an output parameter, and it is the output parameter that provides the value for the new column in the pipeline and for the destination table in the SSIS package.

 

The trick is to begin with adding a derived column with a null or empty expression that fits the data type to the pipeline from the source, like a new Education column:

 

The new empty column Education will get its value from a stored procedure's output parameter, where the stored procedure does some lookup (or calculation) in the database to make an education string for each employee, one by one or row by row.

 

The OLE DB Command component gets a connection, and a sql command calls or executes a stored procedure with an input parameter and an output parameter: EXEC dbo.Stage_Employees_Fetch_Education ?, ? OUTPUT

The two ? parameters will be connected to columns of the pipeline as destination columns that show the names of the parameters in the stored procedure (or, in an ad hoc sql command, as param_0, param_1, etc.):

 

 

Hereafter the new pipeline column Education is mapped to the destination table column also called Education.
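
The stored procedure is not shown in this article; a minimal sketch of how dbo.Stage_Employees_Fetch_Education could look, where the parameter names, the data types and the lookup table dbo.EmployeeEducation are assumptions:

CREATE PROCEDURE dbo.Stage_Employees_Fetch_Education
   @EmployeeID int,
   @Education  nvarchar(100) OUTPUT
AS
BEGIN
  SET NOCOUNT ON
  -- Look up (or calculate) an education string for the employee.
  SELECT @Education = e.Education
  FROM dbo.EmployeeEducation AS e
  WHERE e.EmployeeID = @EmployeeID
  IF @Education IS NULL
     SET @Education = N'Unknown'
END
GO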

 

A stored procedure could also output a new surrogate key number from an Identity column in a table, using e.g. SCOPE_IDENTITY(), as I showed before inside the Lookup component.

 

OLE DB Command sql command can also use a sql statement to extend pipeline:

Select ? = Name

From  T

Where Id = ?

 

Normally OLE DB Command is used for Update statement to a table, see next.

 

 

15. Don’t let OLE DB Command do row-by-row update

OLE DB Command does not perform well with a large number of rows in a pipeline, therefore it is better to insert the updated rows into a staging table and then do a T-SQL bulk update back to the original table. A row-by-row operation is also called a singleton update and is the opposite of a batch bulk update.

 

Maybe you are used to doing all data transformation through t-sql Updates in a stored procedure, where you have an Update per transformation, which means that the same row in a table will be updated multiple times. That gives a lot of I/O and read/write against the database table. With a SSIS package and a Data Flow Task and its pipeline, the multiple transformations will be done in memory and therefore save I/O to the database. But the problem is that when the updated values in the pipeline are going to be stored back in the database, the SSIS package can only do it with a row-by-row operation that includes a t-sql Update with ? parameters for each row, which will kill the performance.

 

This example will show different transformations in a pipeline and use a staging table to do the updates back to the original table. The SSIS package will have I/O to the database when fetching data into the pipeline, when inserting updated values into the staging table and when doing the update of the original table.

 

Control Flow where the staging table is truncated at the beginning and the Update of the original table is done at the end. In between there are two Data Flow Tasks, where the first fetches data into a cache used for a Lookup in the second Data Flow Task. I have an Orders table with these columns to be calculated and updated:

 

  • Quantity is the sum of Quantity in the Order Details table for each OrderId. I do it with an Aggregate and keep the result in the cache in the first Data Flow Task to be used for a Lookup in the second Data Flow Task.
  • Freight100 is the original Freight multiplied by 100 in a Derived Column.
  • Orderdate will be changed to the 1st of the month in the same Derived Column.
  • CustomerCountry is the order's customer's country, found with a Lookup.

 

 

The first Data Flow Task fetches the Order Details data, does a sum of Quantity per OrderId and stores the resultset or recordset in a cache to be used later in a Lookup:

 

 

The second Data Flow Task fetches the Orders data and does the transformations on the pipeline by adding extra columns, where the new Orderdate is calculated like this:

(DT_DATE)((DT_WSTR,4)(YEAR(OrderDate)) + "-" + (DT_WSTR,2)(MONTH(OrderDate)) + "-01")

 

 

And when the data has been inserted into the Staging_Orders table, the bulk t-sql Update is performed, which is very fast compared to a row-by-row update.
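
The update statement itself is not listed in this chapter; a minimal sketch of how the bulk update from the staging table back to the original table could look, assuming Staging_Orders carries OrderID plus the calculated columns:

UPDATE o
SET o.Quantity        = s.Quantity,
    o.Freight100      = s.Freight100,
    o.OrderDate       = s.OrderDate,
    o.CustomerCountry = s.CustomerCountry
FROM dbo.Orders AS o
INNER JOIN dbo.Staging_Orders AS s
        ON s.OrderID = o.OrderID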

 

 

16. SSIS package gets connection string from a table

I like to place the connection string to a source system in a table instead of in an environment variable inside the SSISDB catalog at the three servers for development, test and production. Here is an example of a solution called Source that contains a project per source system, and each package will fetch the connection string from a table in a database called ETL_SYSTEM that is placed at the same server as the SSISDB catalog (therefore the package can use a localhost connection) together with the other data warehouse databases. I like to have an Input Data Area (IDA) database per source system. I am using Northwind as one of my source systems and it gives me an IDA_Northwind database to store the fetched source data.

 

In the ETL_SYSTEM database I have a table with a connection string for each source system, e.g. sql server, oracle, db2, a csv file; many more can be placed in the table. Of course the connection string syntax must fit the SSIS Connection Manager, so I first make the connection and then rightclick {View Code} to fetch the syntax. In each environment the connection string can be the same or different, because development fetches data from the source system in development, while test and production fetch data from the source system in production.

 

The Source table has some connection strings:

 

Source_Id  Source_Name  Source_Connetionstring
1          Northwind    Data Source=SQLDev4;Initial Catalog=Northwind;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;Application Name=SSIS;
2          HR           Data Source=OracleDev;User Id=Usr;Password=Pwd;Integrated Security=no;
3          Economic     \\ECO\Data2DWH\data.csv
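
A minimal sketch of the Source table; the data types and sizes are assumptions:

CREATE TABLE dbo.Source
(
  Source_Id              int            NOT NULL PRIMARY KEY,
  Source_Name            nvarchar(50)   NOT NULL,
  Source_Connetionstring nvarchar(1000) NOT NULL
)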

 

I am making a new SSIS solution called 'Source' with three projects, one for each source system.

 

I continue with the 'Northwind' project where I add a project parameter with value 1 for the row with Source_Id 1 for the Northwind connection. I set the parameter property Required to False, so the value 1 will automatically be the default value for the parameter when executed from the SSISDB catalog. I add two localhost connections: one for the database ETL_SYSTEM using ADO.NET to make a sql statement with a variable more readable, and one for the IDA_Northwind database using OLE DB to store the source data as the first step into the data warehouse:

 

 

I will make a package for each table I like to fetch from the source system, e.g. the Customers table. I make a package variable called 'Source_Connetionstring' that I will use to store the connection string from the Source table, and I use the variable in an Expression for the package connection to the source system. The variable 'Source_Connetionstring' will have a value with the connection string to the development server for the source system and is used for making the package in design mode:

 

 

In case I like to open the package in the production server of the data warehouse, I will need to change the value of the variable to point to the production server of the source system so I can debug the package in production e.g. to find a changed data type for a column.

 

In Control Flow I make an Execute SQL Task 'Get Source_Connetionstring' to fetch the connection string from the Source table for the Source_Id from the project parameter and store it in the variable. Since the Source table is placed in the ETL_SYSTEM database and the connection is using ADO.NET, I select that ConnectionType and I am using an @variable in the sql query statement:

 

 

SQL statement to fetch the connection string from the Source table is:

 

SELECT @Source_Connetionstring = Source_Connetionstring

FROM dbo.Source

WHERE Source_Id = @Source_Id

 

In Control Flow I make an Execute SQL Task 'Truncate table' to empty the Customers table in the IDA_Northwind database:

 

 

In Control Flow I make a Data Flow Task 'Fetch data' to do a full dump of all rows from the source to the empty table in IDA_Northwind database.

 

Inside the package I make a new package connection in the Connection Managers tab to the source system server and database Northwind as OLE DB, because it is a SQL Server. I am calling the connection 'Source'. I click at 'Source', look at the properties, click at Expressions and click at the […] button. In the dialogbox I click in the column Property, open the dropdown and select 'ConnectionString', and in the other column Expression I click at the […] button and in the dialogbox I drag and drop the variable User::Source_Connetionstring into the Expression box so it becomes: @[User::Source_Connetionstring]. This means that when the package is running or is executed, the connection manager will use the current value of the variable. Therefore the package will run in production without any changes, because in production it will use the connection string from the Source table in the ETL_SYSTEM database at the production server.

 

 

Notice the fx at the Source connection in Connection Managers; fx tells us that it is using an Expression to give it a value in design mode and in running mode.

 

I can go inside the Data Flow Task 'Fetch data' and make the OLE DB Source using the Source connection, and make the OLE DB Destination using the IDA_Northwind project connection, click the New button and rename [OLE DB Destination] to [Customers] to have the table created in the IDA database:

 

 

I like to have an extra column in the destination table called 'InsertTime' with data type datetime2(3) and default value (sysdatetime()) so I can see when the rows were last inserted. After adding it I need to rightclick at the OLE DB Destination and select Mappings to update the SSIS package metadata.

 

Now the package is ready for a test run. If I have user access to the production server I can change the connection string part Data Source=SQLDev4 to Data Source=SQLProd1 and fetch production data to the development server IDA database.

 

For the next source table I will reuse the first package and make a copy-paste of it and change the Truncate table, the OLE DB Source and the OLE DB Destination to the new table.

 

I will make a batch-run _Execute package that will be calling the other packages:

 

 

The _Execute.dtsx package will be part of the ETL process.

 

When there are many tables from the same source system, I can move the 'Get Source_Connectionstring' from each table-package to the _Execute package and send the connection string to each table-package as a parameter, e.g. for the Categories table where the package parameter is used in the source connection expression. The parameter 'Source_Connectionstring' will have a value with the connection string to the development server for the source system and is used for making the package in design mode:

 

 

The _Execute.dtsx package has a variable Source_Connectionstring with a value for design mode, and in running mode the variable will be assigned the connection string from the Source table in the ETL_SYSTEM database, as we saw it made before. When calling a table-package, e.g. Categories, the variable will be passed as a parameter to the other package:

 

 

 

I like to do the SSIS deployment by command line with a bat file for each environment: development, test and production. Here I am making it for dev:

 

  1. Build the SSIS 'Northwind' project.
  2. From the folder \Source\Northwind\bin\Development I copy the file Northwind.ispac to a SSIS deployment folder I have made at my computer.
  3. In the deployment folder I have made a file DEV_Northwind.bat with this line for deploying to the SQLDEV server into the SSISDB catalog folder Source:

 

             "%ProgramFiles%\Microsoft SQL Server\120\DTS\Binn\ISDeploymentWizard.exe" /S /ST:File

             /SP:Northwind.ispac /DS:SQLDEV /DP:/SSISDB/Source/Northwind

 

          The notation is:

          "%ProgramFiles%\Microsoft SQL Server\120\DTS\Binn\ISDeploymentWizard.exe" /S /ST:File

          /SP:<SSIS project name>.ispac /DS:<SQL Server name>

            /DP:/SSISDB/<SSISDB folder name>/<SSIS project name>

 

  4. I run the bat file and the deployment is done to the dev sql server.
  5. I can execute the _Execute package from the SSISDB catalog in the Source folder and later let it be executed from a SQL Agent job at the dev sql server.

 

Later it is easy to have other .bat files for the test server and for the production server for deployment from my computer, or to hand over the Northwind.ispac file to a person who will do it in test and later in production.

Monitor Running Packages.

 

 

17. SSIS package handles files before a file is loading

A folder will contain some Excel xls (97-2003) files that one by one have to be loaded as source into a table. I make a SSIS package to scan the files in the folder and copy each of them to an Input folder with a specific filename, and another SSIS package to connect to the file and load it; after the load I move the file to an Archive folder with a datetime prefix on the filename to make all files unique.

 

In the first package I define some parameters so the folders can be set from outside:

 

 

Define variables with expressions, because a variable will be assigned a value from an expression when the package uses the variable:

 

 

Variable vaArchivePathFile gets datetime as prefix in filename with this expression:

  @[$Package::paFolderArchive] + "\\" +   

                      (DT_WSTR,4) DATEPART("Year", Getdate()) +

  RIGHT( "0" + (DT_WSTR,2) DATEPART("Month", Getdate()) , 2 ) +

  RIGHT( "0" + (DT_WSTR,2) DATEPART("Day", Getdate()) , 2 ) +

  RIGHT( "0" + (DT_WSTR,2) DATEPART("Hour", Getdate()) , 2 ) +

  RIGHT( "0" + (DT_WSTR,2) DATEPART("Minute", Getdate()) , 2 ) +

  RIGHT( "0" + (DT_WSTR,2) DATEPART("Second", Getdate()) , 2 ) +

  @[User::vaSourceFile]

 

Control flow will copy and move a file and call the other package to load the file. It is wrapped in a 'Foreach Loop Container' to repeat the sequence for each file:

 

 

Foreach component settings that set the folder from a parameter and assign a variable with a filename (without path) from the folder:

 

 

Copy component settings to copy the file to folder Input and a fixed filename, because the Source will load from that specific filename:

 

 

Calling an other package settings that will load a file from Input folder into a table:

 

 

Move component settings, but using remove because it works:

 

 

 

 

18. SSIS package with a .NET C# program

In the SSIS Toolbox there is a tool called Script Task where I can make a .NET C# program to handle things, like copying a source file to the right folder, backing up the file and giving the file the right name so it later can be imported by the IS package into a table in the DSA database. I find it better to make a C# program instead of variables and parameters in IS packages. This C# program has the Main procedure as the program start and an extra function that reads the first line of the file and returns it as a string for further use in the main program:

 

using System;

using System.IO; // gives File and Directory

//using System.Data;

//using System.Math;

//using System.Collections;

using Microsoft.SqlServer.Dts.Runtime;

using System.Windows.Forms;

 

namespace ST_d1bc9f3673db40b0b35603ccb0a0e61b.csproj

{

 [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]

 public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.

                                                              VSTARTScriptObjectModelBase

 {

  #region VSTA generated code

  enum ScriptResults

  {

   Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,

   Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure

  };

  #endregion

 

 public void Main()

 {

  // The folder source system save the file into. Filename is different per delivery.

  string srcFolderPath = "S:\\SOURCE";   

  // The folder that the IS package will import the file from with same file name.

  string dsaFolderPath = "S:\\DSA_SOURCE";

  // The fixed file name to be imported by the IS package.

  string dsaFileName = "SOURCE.TXT";      

  string dsaBackupFileName = "SOURCEyyyymm.TXT"; // yyyymm replaced with the date of running.

  string srcBackupFolderPath = srcFolderPath + "\\Backup";

  string dsaBackupFolderPath = dsaFolderPath + "\\Backup";

  string srcFileName;

  string srcPathFileName;

  string srcBackupPathFileName;

  string dsaPathFileName;

  string dsaBackupPathFileName;

  long length = 0;

  string line = null;

  string yyyymm = null;

  if (Directory.Exists(srcFolderPath) &&

      Directory.Exists(srcBackupFolderPath) &&

      Directory.Exists(dsaFolderPath) &&

      Directory.Exists(dsaBackupFolderPath))

  {

   try

   {

    if (Directory.GetFiles(srcFolderPath).Length == 0)

    {

     //MessageBox.Show("No file in source folder from source system");

     Dts.TaskResult = (int)ScriptResults.Failure;

     return;

    }

    if (Directory.GetFiles(srcFolderPath).Length >= 2)

    {

     //MessageBox.Show("Two or more files in source folder, not good");

     Dts.TaskResult = (int)ScriptResults.Failure;

     return;

    }

    if (Directory.GetFiles(srcFolderPath).Length == 1)

    {

     srcFileName = Path.GetFileName(Directory.GetFiles(srcFolderPath)[0]);

     // Directory.GetFiles return an array of path and file names.

     // Copy source file to backup folder.

     srcPathFileName = srcFolderPath + "\\" + srcFileName;

     srcBackupPathFileName = srcBackupFolderPath + "\\" + srcFileName;

     if (!File.Exists(srcPathFileName))

     {

      //MessageBox.Show("Source file not found");

      Dts.TaskResult = (int)ScriptResults.Failure;

      return;

     }

     // If source file size is 0 byte, something is wrong and better stop.

     length = new FileInfo(srcPathFileName).Length;

     if (length == 0)

     {

      //MessageBox.Show("Source file size is 0 byte");

      Dts.TaskResult = (int)ScriptResults.Failure;

      return;

     }

     // Read first line of the file to receive year and month.

     line = ReadFileFirstLine(srcPathFileName);

     if (line == null || line.Length <= 13)

     {

      //MessageBox.Show("No line or too short line");

      Dts.TaskResult = (int)ScriptResults.Failure;

      return;

     }

     yyyymm = line.Substring(6, 6); // position 0 is the first char in string of C#.

     //MessageBox.Show(yyyymm);

     if (File.Exists(srcBackupPathFileName))

         File.Delete(srcBackupPathFileName);

     File.Copy(srcPathFileName, srcBackupPathFileName);

     // Move source file to DSA folder and give a fixed file name.

     dsaPathFileName = dsaFolderPath + "\\" + dsaFileName;

     if (File.Exists(dsaPathFileName))

         File.Delete(dsaPathFileName);

     File.Move(srcPathFileName, dsaPathFileName); // move file, the source folder will be empty.

     // Copy the file name + yyyymm to DSA Backup folder.

     if (!File.Exists(dsaPathFileName))

     {

      //MessageBox.Show("DSA file was not found");

      Dts.TaskResult = (int)ScriptResults.Failure;

      return;

     }

     dsaBackupFileName = dsaBackupFileName.Replace("yyyymm", yyyymm);

     dsaBackupPathFileName = dsaBackupFolderPath + "\\" + dsaBackupFileName;

     //MessageBox.Show(dsaBackupPathFileName);

     if (File.Exists(dsaBackupPathFileName))

         File.Delete(dsaBackupPathFileName);

     File.Copy(dsaPathFileName, dsaBackupPathFileName);

     Dts.TaskResult = (int)ScriptResults.Success;

     return;

    }

    else

    {

     //MessageBox.Show("Something had goes wrong 2");

     Dts.TaskResult = (int)ScriptResults.Failure;

     return;

    }

   }

   catch (Exception)

   {

    //MessageBox.Show("Something had goes wrong 1");

    Dts.TaskResult = (int)ScriptResults.Failure;

    return;

   }

  }

  else

  {

   //MessageBox.Show("One of the four folders was not found");

   Dts.TaskResult = (int)ScriptResults.Failure;

   return;

  }

 }

 

 private string ReadFileFirstLine(string dsaPathFileName)

 {

  FileStream fs = null;

  StreamReader sr = null;

  string line = null;

  long length = 0;

 

  if (File.Exists(dsaPathFileName))

  {

   try

   {

    length = new FileInfo(dsaPathFileName).Length;

    if (length != 0)

    {

     fs = File.OpenRead(dsaPathFileName); //, FileMode.Open, FileAccess.Read, FileShare.None);

     sr = new StreamReader(fs);

     line = sr.ReadLine();

    }

   }

   catch (Exception)

   {

    line = null;

   }

   finally

   {

    if (sr != null)

       sr.Close();

    if (fs != null)

       fs.Close();

   }

  }

  return line;

 }

 

 }

}

 

/* Another example not in use here:

  Directory.GetFiles(sFilePath, "*.xls")

  Dim di As DirectoryInfo = My.Computer.FileSystem.GetDirectoryInfo(sFilePath)

  For Each fi As FileInfo In di.GetFiles("*.xls")

      lstFiles.Items.Add(fi.Name)

  Next fi

 Dts.Variables("FileCnt").Value = 0

 */

 

A C# program can also call a parameterized stored procedure that has output values or returns a recordset to be handled or calculated inside the program, and then save the data back to a table in the database.

 

19. Upgrade to a Visual Studio 2022 solution

Assume you have a SQL Server 2016 database and a Visual Studio 2017 32-bit solution using the .NET Framework 4.5 with three projects:

 

·         database project _DB with .sql files per object in the database

·         integration services project _IS with .dtsx files for SSIS packages

·         analysis services tabular dax project _AS with a Model.bim file

 

Example of the _DB properties:

 

 

Example of the _IS properties:

 

 

If you want to upgrade to a SQL Server 2022 database and to a Visual Studio 2022 solution, then be aware of:

·       The Connection Managers OLEDB provider SQL Server Native Client has been removed from SQL Server 2022 (160) (SQLNCLI or SQLNCLI11 or SQLOLEDB), therefore you must use the Microsoft OLE DB Driver for SQL Server (MSOLEDBSQL).

·       In the Visual Studio 2022 installation you must include the .NET Framework 4.8.x.

·       In the Visual Studio 2022 menu you click Extensions to download the extensions for SSIS and SSAS.

·       In the Visual Studio 2022 menu you click Tools and Options to set SSAS 2022.

·       Visual Studio 2017 was 32-bit and had this path:

C:\Program Files (x86)\Microsoft Visual Studio\2017\SQL\Common7\IDE\

Visual Studio 2022 is 64-bit only and the free Community has this path:

C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\

 

 

 

 

Before you open your solution in Visual Studio 2022 to connect to SQL Server 2022, please fix a few things in Notepad:

·       In the database project file _DB.sqlproj you replace the path for the old Visual Studio 32-bit and the old SQL version 2016, indicated by Compatibility level number 130:

 

<ArtifactReference Include="C:\Program Files (x86)\

Microsoft Visual Studio\2017\SQL\Common7\IDE\

Extensions\Microsoft\SQLDB\Extensions\SqlServer\

130\SqlSchemas\msdb.dacpac">

<HintPath>$(DacPacRootPath)\Extensions\

Microsoft\SQLDB\Extensions\SqlServer\

130\SqlSchemas\msdb.dacpac</HintPath>

<ArtifactReference Include="C:\Program Files (x86)\

Microsoft Visual Studio\2017\SQL\Common7\IDE\

Extensions\Microsoft\SQLDB\Extensions\SqlServer\

130\SqlSchemas\master.dacpac">

<HintPath>$(DacPacRootPath)\Extensions\

Microsoft\SQLDB\Extensions\SqlServer\

130\SqlSchemas\master.dacpac</HintPath>

 

to the new path for Visual Studio 2022 64-bit and for SQL version 2022, indicated by Compatibility level number 160. The new path with its folders and files must exist at the computer:

 

<ArtifactReference Include="C:\Program Files\

Microsoft Visual Studio\2022\Community\Common7\IDE\

Extensions\Microsoft\SQLDB\Extensions\SqlServer\

160\SqlSchemas\msdb.dacpac">

<HintPath>$(DacPacRootPath)\Extensions\

Microsoft\SQLDB\Extensions\SqlServer\

160\SqlSchemas\msdb.dacpac</HintPath>

<ArtifactReference Include="C:\Program Files\

Microsoft Visual Studio\2022\Community\Common7\IDE\

Extensions\Microsoft\SQLDB\Extensions\SqlServer\

160\SqlSchemas\master.dacpac">

<HintPath>$(DacPacRootPath)\Extensions\

Microsoft\SQLDB\Extensions\SqlServer\

160\SqlSchemas\master.dacpac</HintPath>

 

·       In the integration services project file _IS.dtproj you replace the old providers SQLOLEDB and SQLNCLI11 with the new provider MSOLEDBSQL, and at the tag TargetServerVersion you replace the value with SQL version 2022, like this:

<TargetServerVersion>SQLServer2022</TargetServerVersion>.

 

Now open your solution in Visual Studio 2022 and you accept .NET Framework 4.8:

 

 

Go to tab Solution Explorer and see your projects.

 

At database project rightclick and click Properties to change to SQL Server 2022 (Compatibility level number 160):

 

 

(Same as if you with Notepad in file _DB.sqlproj had changed to:

<DSP>Microsoft.Data.Tools.Schema.Sql.

  Sql160DatabaseSchemaProvider</DSP>).

 

And to change to .NET Framework 4.8.1:

 

 

At the analysis project Model.bim, in Data source, change the provider from SQL Server Native Client (SQLNCLI) to Microsoft OLE DB Driver for SQL Server (MSOLEDBSQL):

 

 

To change the Model.bim Compatibility level, rightclick at Model.bim and click Open, click at the tab Solution Explorer, rightclick again at Model.bim and click Properties, and in the dropdown Compatibility level click SQL Server 2022. In the dialogbox confirm by clicking Yes that the change is irreversible, meaning it cannot be changed back; therefore the Compatibility level dropdown is gone afterwards:

 

 

Open each Connection Managers OLEDB connection to change the provider from SQL Server Native Client to Microsoft OLE DB Driver for SQL Server:

 

 

At integration services project rightclick and click Properties to see SQLServer2022:

 

 

Now you can build your Visual Studio 2022 solution without errors and warnings; remember to click the Save All icon to save the content of your project files.

 

In case you had started by opening your solution in Visual Studio 2022 and in the dropdown for TargetServerVersion had changed to SQL Server 2022, then Visual Studio would have done a loop through all your SSIS packages. Afterwards you would have got problems like:

·       Error message: conmgr Unable to create the type with the name 'OLEDB'.

·       Error message: sys is unreachable at master or at msdb.

·       Open a Connection Managers OLEDB connection to get error message: Microsoft Visual Studio is unable to load this document. Unable to create the type with the name 'OLEDB'.

·       New Connection and Visual Studio 2022 will crash!

·       Opening a SSIS package that is using an OLEDB connection to a source or destination shows a corrupt icon at all tasks:

 

 

SSIS.ReplacementTask is not registered, because the loop had corrupted the package due to an anomaly in the normal upgrade process to a higher version of Visual Studio.

 

20. Build server – Agent pool in Azure DevOps

Git is a source code version control system that keeps track of all previous versions, but Git does not include build files like a database dacpac and a ssis ispac, therefore we need a build server to build the source code into these files before a deploy. The build server is often your development server where all the programs and tools are already installed. Azure DevOps calls it an Agent Pool, and when it is placed on an on-prem server it is called a self-hosted agent, read more.

 

 

In case you are using a classic Azure DevOps build pipeline and release pipelines to call a PowerShell script you have made, remember to extend PowerShell with SQL Server from the PowerShell Gallery: SQLServerTools and ReportingServicesTools.

Other tip here. When the build server is in the cloud you need a personal access token, tip here.

 

21. About me

I have worked with SQL Server since 1997, when it was called 6.5, and the first production was running on 7.0 in 1998.

 

I have used Data Transformation Services (DTS) in SQL Server 2000, Business Intelligence Development Studio (BIDS) in SQL Server 2005–2008 R2, SQL Server Data Tools (SSDT) in SQL Server 2012 and SQL Server Data Tools for Business Intelligence (SSDT BI) in SQL Server 2014, 2016, 2017, 2019, 2022.

 

Exporting SSIS packages from Integration Services Catalog SSISDB:

1.  Start SQL Server Management Studio.

2.  Locate the deployed project in the SSIS Catalog.

3.  Right click the project and select Export and save the file as ispac.

4.  Rename the file extension from .ispac to .zip.

5.  Extract the .zip file to a folder.

6.  In Visual Studio click New, Project, Integration Services Import Project Wizard, where Project deployment asks for an ispac file and Integration Services catalog asks for a SSISDB path, or create a new SSIS project and add the packages from the folder.

 

Please visit my homepage for the english part:

 

JoakimDalby.dk

 

Sometimes I see a SQL statement in a SSIS package that is hard to read because it does not have a nice format layout; therefore I copy-paste the statement into a site that makes it more readable: sql-format or sqlformat with different dialects.