Fact table linking to natual key rather than row key in a slowly changing dimension table?



  • I’m designing a data warehouse and have a Slowly Changing Dimension Type 2 table something like this:

    UserKey UserID Group EffectiveDate ExpiryDate IsCurrent
    1 1001 GR1 2021-01-01 9999-12-31 Y
    2 1002 GR1 2021-07-31 2022-02-28 N
    3 1002 GR2 2022-28-28 9999-12-31 Y

    And I have a fact table, which whilst it does have dates, doesn't really relate to the user dimension at any particular point in time, say a subscriptions table like this:

    SubsKey SubsID SubsType UserKey / UserId ? StartDate EndDate
    55501 SBP501 Premium ??? / 2 2021-08-01 2022-08-01
    55502 SBB123 Bonus ??? / 2 2022-08-01 2022-09-01

    The subscriptions could be set up for dates in the future for example, or be changed retrospectively. Or alternatively we might have a fact table which doesn't have any date aspect at all.

    In such cases is it reasonable for the fact table to reference the natural key UserID instead of the type 2 dimension row key UserKey? As it's not obvious which UserKey ought to be used.

    I suppose one might use whichever row key is current at the time the fact row is generated, but that would seem to be implying that the fact relates to a particular snapshot in time for the user, when in reality the fact doesn't really.



  • I am curious to see if there are any other answers here but I will explain how I have approached this issue in the past. There are two solutions I believe you can implement in this case:

    1. As you said, use the latest userkey available. This is can be surprisingly difficult to maintain though. If your user dimension adds a new record for the same user you will need to update all related fact table records. To avoid this potentially large update, I would put the userId on the fact table and then expose the fact table through a view and as part of the view definition swap out the userid for the user key. This is the best approach if none of the other user attributes that are being tracked through the SCD are relevant to the subscriptions.
    2. I would consider this the more robust approach. You can falsely expand your fact table with date range joins. The result would be something like the following:

    enter image description here

    1. (cont) This accurately depicts that userID 1002 had a premium subscription that spanned their membership in group 1 and in group 2. Although it is arguably more accurate, it is not always better than the first option. It depends on the rest of your datawarehouse and how significantly a change in granularity to a fact table may impact it.

    Since your fact table is subscriptions, my guess is most of the statistics being run on this table will be related to counting the subscriptions. Therefore, a change in granularity can have a big impact as all your COUNT()'s will need to be COUNT(DISTINCT)'s. That being said, I usually prefer the second option as it seems more accurate to me and you can always roll it up in a query to be at the same grain as option 1. You cannot easily build option 2 from option 1 but you can easily build option 1 from option 2.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2