How to implement Slowly Changing Dimensions (SCD) in Talend (part 1/2)
One of the responsibilities of a data engineer is bringing data together from different sources into a data warehouse. In a data warehouse you gather facts and dimensions. The latter contain relatively static data about entities as geographical locations, customers, or products.This dimension data can change slowly and irregularly over time. So-called Slowly Changing Dimensions (SCD) should be handled to make sure the referential integrity within the data warehouse is correct. This blog consists of two parts: in the first we will talk about the types of Slowly Changing Dimensions with a focus on SCD type 2; In the second blog we will cover how to implement SCD type 2 in Talend.
What types of SCD are there?
Different types of SCD have been defined. In this section we’ll give a brief definition of the most common ones.
- Type 1: When a change occurs, the old record is overwritten by the new one. No historical records are kept.
- Type 2: Changes are stored as a new record, the old record is deactivated. The full history of each record is kept.
- Type 3: Stores the previous and current value of the specified attributes of a record. Only the previous version is kept as history.
- Type 4: Stores all historical changes in a separate “history” table. The full history of each record is kept.
Of these types, type 2 SCD is the most common. So in this blog, we will go into more detail on this type and how it works.
How does SCD type 2 work?
Changes between the source and target system can fall into three categories: inserts (new records), updates (changed records) and deletes (removed records). In SCD type 2 the following fields are added to track which records are active, and which are historical:
- scd_start: start date of the records activity
- scd_end: end date of the records activity
- scd_version: version of the record. Each time the record is updated, the version increases by one.
- scd_active: flag to indicate whether the record is active (current) or inactive (historical)
Let’s see what changes would look in a table about ‘cat adoptions’ using SCD type 2.
The first time the data is loaded into the target system, every record is an insert. As shown below, each record is set to active with a start date, a version of one, and no end date.
Let’s say some changes go through, here’s how they are reflected in SCD type 2:
- Zuzu was adopted, her record was removed from the source system. In the target table, the record is set to inactive (orange).
- The age of Richard Parker was updated from 5 to 6 years old. The old record (age = 5) is set to inactive (orange) and a new record is inserted with the updated age (blue).
- A record for Kit Cat was added to the source system. This record is inserted as a new active record in the target table (blue).
To summarise:
- Inserts: a new record is inserted as an active record
- Updates: the old record is set to inactive, the updated record is inserted as a new active record
- Deletes: the record is set to inactive
So, how do you build this in Talend? Read it in our next blog!