In part 3 of my current series, applying Kimball principles to QlikView projects, I will be examining SCD type 2 and how we can manage these in QlikView. In my previous posts we have examined how Kimballs Dimensionall Modelling Toolkit is a useful tool during the Modelling process and how we use this tool to define our Business keys, Surrogate keys, Dimension tables and Fact tables. During the definition of our Dimension tables we even consider how values in these tables may change and to which degree we want to manage this change. My last post covered the manangement of SCD type 1 changes, where we simply overwrite the existing value and flag the field as updated with an UpdateAuditKey. In this post I will be tackling SCD type 2 Dimensions and how we mange these in QlikView.
SCD Type 2
Type 2 Dimensions differ from type 1 Dimensions through the fact that the history of change is preserved. A type 2 change does not overwrite the existing value but creates a new row for the new value. In addition, the type 2 change sets the original row to expired. Consider a typical Organisation comprised of Business Units, Departments and Functions. The Organisation keeps track of organizational transactions through Cost Centres which are recorded in the Financial applications of the Business. It is fairly common that organisational structures change over time, units move between Departments and Functions expire or become redefined. These changes are interesting for an Organisation to record as they are essential to the holistic view of the Organization. If we chose to disregard these changes then we would not be able to present a correct view of the Business over time.
Modelling SCD2 dimensions
In my first post I provided an introduction the Kimball toolkit, and how it is a useful aid when modelling Facts and Dimensions. The toolkit provides us with the necessary Metadata from which we can effectivatively build our data model. We define the SCD Type for each column in the Dimension and clearly name the columns that will be included in the SCD change logic.
SCD 2 and QlikView
For our demo we will be using an SQL 2012 database as our source. Our database is a repository of fictive Players, Matches and Performance. We want to see how Players develop over time and how their Positon affects their Performance. On the basis of this we have decided that the attributes Default Position and Age will be treated as an SCD2 change.
The original databse table, without changes is as follows. The Player of interest here has ID 1. He currently plays on the left and is 9 years old.
We have found that player 1 performs better on the right and has also recently had a birthday. We need now to make the appropriate changes to the database table.
So, how do we get QlikView to detect these changes? Firstly we need to implement some kind of comparative logic where we can detect the change. Firstly, let us consider what we are trying to achieve. We want to compare attribute values from 2 different tables. We want to compare the values for Age and Default position with the values for the same attributes in the dimension table. If these values are different then we know that one, or both of these attributes has been changed at the source.
In a EDW solution SCD Logic is usually managed with checksums. CHECKSUM is a native function to SQL Server and computes a hash value, called the checksum, over its list of arguments. QlikView does not posess a Checksum function but does include a Hash function. By using the Hash function in QlikView we can easily achieve the same result.
In QlikView, the following hash functions are available:
- Hash128(): a 128 bit hash function that returns a 22 character string.
- Hash160(): a 160 bit hash function that returns 27 character string.
- Hash256(): a 256 bit hash function that returns a 43 character string.
The number of bits determines the output range of the function. A 128 bit hash can store 2^128 (or, 340.282.366.920.938.000.000.000.000.000.000.000.000) different combinations. 160 and 256 bit can store even more combinations (2^160 and 2^256 respectively).
The process for finding and reusing comparable values is relatively straightfoward. We start with an extract from the Data source. At the end of the Extract, we store the Business key and the calculated SCD2 change value (Ext_SCD2) in a csv file on disk.
When the Dimension table is loaded we perfom the same step as during the Extract. We Load the table data, calculating a new SCD2 value (Dest_SCD2) in the load statement.
We then store the Business key and the SCD2 value in another csv file on disk.
What we have achieved so far is that we have 2 comparable values. We have a Hash value for Default position and Age from the extracted data and a hash value for the same attributes from the Dimension table. We will now use these values when we read the data next time.
Finding SCD2 changes
The value for the last read hash values are now stored on disk, ready to be reused next time the data is loaded. So, how do we we reuse these values and compare them to what is currently being read? The solution for this is straightforward and involves a mapping table and a few resident loads.
The next time the data is read our first step is to read the data from source. This gives us a new SCD2 file for the extracted data. In order to find the changes, we load the data from this new file into a mapping table. Directly after that we read the Data from the SCD file generated for the Dimension table the last time it was loaded. The next step is to map the extracted SCD2 value (SCD2_compare) into a new table, thereby giving us something to compare.
We can see these changes clearly in the csv files that have been generated. This is the hash value for the extracted data with the changes.
And this is the hash value for the same player id the last time the Dimension was loaded.
Now that we have the hash values stored in a table it is very easy for us to find out how many rows have actually changed and the Business key for these changed rows.
We use the variable declared at the end of the last statement as a condition to the sub call which manages the ETL of our script.
The conditional If statement governs the direction of our dataflow by evaluating the variables passed in. The SCD2Rows sub will be called If the vL.SCD2ChangedRows > 0 and the vL.Counter variable is equal to the vL.MaxRecVersionNo variable. The 2 latter variables are used to determine the amount of rows in the existing dimension table and the rowcount in the extracted data. We expect these values to be equal when calling the SCD2 sub as we only want to manage changes, not insert new records.
Managing Row ID for Changed rows
The management of SCD2 changed rows requires that a new row be written to the dimension table in order to preserve history for this dimension and its related transactions. We also require that our Surrogate key be unique. A very simple way of achieving this is to count the rows in the existing Dimension table data and assign this value to a Variable.
We will use this Variable when loading the changed Data into the Dimension table.
Loading the changed data
The first step is to Resident load the Business keys where changes have been detected.
We then join this Data explicitly with the existing table and make the necessary changes to the ETL attributes.
What we have now are the Original rows where changes have been detected. We have set these rows to expired with help of the RowIsCurrent and RowEndDate attributes. We now repeat the procedure, this time finding the updates for those rows that have been found. We load the Business key for the changed rows and explicitly join it to the latest extract. We ensure that the Surrogate key is kept unique with help of the vL.MaxRecVersionNo variable.
What we have now is the changed rows and a unique Surrogate key. The final step in the process is to concatanate the updated rows with the remaining rows in the existing Dimension table. The expired rows are also concatanated onto the Dimension table, thus giving us our history.
Once the script has executed the changes can be monitored in Qviewer. Note the changes that have been made to Player ID 1. The original row, with Playeridx 5 has been updated to inactual (0) and the new row, with Playeridx 25 is now flagged as the current row. The UpdateAuditKey for the inactual row has been recorded and the hash values in the Dest_SCD2 column clearly indicate that changes have been recorded.
That concludes this post on Qlikview and SCD2 changes. In my next post I will be looking at Fact tables and how these differ from Dimension tables.
If you are new to this series, then parts 1 and 2 are available on the following links: