Should I be using a non-clustered primary key on multiple columns?



  • I have a very large table that will just continue to get larger. It contains hourly interval meter usage data for around 22,000 meters.

    So, everyday 22,000 * 24 hours = 528,000 records are created. The current plan is to load new interval data into the table on a monthly basis.

    One problem I'm having is that there's really no unique identifier. So, I'm not sure how to best setup the Primary Key.

    Here's an example of data for 24 hours for two meters:

    ServiceLocation MeterNumber IntervalDay IntervalHour Demand
    111111 22222 2013-01-21 1 0.01
    111111 22222 2013-01-21 2 0.01
    111111 22222 2013-01-21 3 0.01
    111111 22222 2013-01-21 4 0.01
    111111 22222 2013-01-21 5 0.01
    111111 22222 2013-01-21 6 0.01
    111111 22222 2013-01-21 7 0.02
    111111 22222 2013-01-21 8 0.02
    111111 22222 2013-01-21 9 0.03
    111111 22222 2013-01-21 10 0.03
    111111 22222 2013-01-21 11 0.03
    111111 22222 2013-01-21 12 0.04
    111111 22222 2013-01-21 13 6.55
    111111 22222 2013-01-21 14 6.39
    111111 22222 2013-01-21 15 7.70
    111111 22222 2013-01-21 16 8.52
    111111 22222 2013-01-21 17 8.85
    111111 22222 2013-01-21 18 6.88
    111111 22222 2013-01-21 19 5.90
    111111 22222 2013-01-21 20 5.90
    111111 22222 2013-01-21 21 5.90
    111111 22222 2013-01-21 22 6.06
    111111 22222 2013-01-21 23 5.40
    111111 22222 2013-01-21 24 5.73
    555555 33333 2013-01-21 1 0.01
    555555 33333 2013-01-21 2 0.01
    555555 33333 2013-01-21 3 0.01
    555555 33333 2013-01-21 4 0.01
    555555 33333 2013-01-21 5 0.01
    555555 33333 2013-01-21 6 0.01
    555555 33333 2013-01-21 7 0.02
    555555 33333 2013-01-21 8 0.02
    555555 33333 2013-01-21 9 0.03
    555555 33333 2013-01-21 10 0.03
    555555 33333 2013-01-21 11 0.03
    555555 33333 2013-01-21 12 0.04
    555555 33333 2013-01-21 13 6.55
    555555 33333 2013-01-21 14 6.39
    555555 33333 2013-01-21 15 7.70
    555555 33333 2013-01-21 16 8.52
    555555 33333 2013-01-21 17 8.85
    555555 33333 2013-01-21 18 6.88
    555555 33333 2013-01-21 19 5.90
    555555 33333 2013-01-21 20 5.90
    555555 33333 2013-01-21 21 5.90
    555555 33333 2013-01-21 22 6.06
    555555 33333 2013-01-21 23 5.40
    555555 33333 2013-01-21 24 5.73

    And an example query that we would execute is:

    SELECT
        IntervalDay,
        SUM(Demand)
    FROM
        LoadData
    WHERE
        MeterNumber = '33333' AND
        IntervalDay >= '2013-01-21' AND
        IntervalDay < '2013-01-22'
    GROUP BY
        IntervalDay
    

    And a lot of times we'd aggregate using the IntervalHour field too, or maybe not specify a MeterNumber in the WHERE clause to get all meter usage.

    I'm having trouble figuring out what type of primary key I should have for this large table?

    It currently has a non-clustered Primary key on the ServiceLocation, MeterNumber, IntervalDay, and IntervalHour fields. Does this make sense? There is no clustered index at this time.

    The ServiceLocation is like the ID for the physical property, so it will never change for a location. However, the MeterNumber can change (ex, meter fails and needs replaced), but there will only ever be one MeterNumber at a ServiceLocation.



  • The primary key should be based on business logic - how do you uniquely identify a row in your table? Probably ServiceLocation, MeterNumber, IntervalDay, IntervalHour, that's most of your columns so I would probably make it the clustered index too - otherwise you'd be storing most of the table twice.

    Indexes you create to support queries are based on those queries and the data they need to use. If you want to make as few indexes as possible then look at what selectivity you can get from common filters - if all of your queries use a small range filter for IntervalDay then having that as an index on its own is probably fine. If you need to do bigger ranges and you need other filters to be used to get the data set down to a manageable size then you'll want to add those columns to the index too. Remember the https://ctandrewsayer.wordpress.com/2017/03/24/the-golden-rule-of-indexing/ : columns that will be used with range predicates should appear after columns that will be used with equality filters (so IntervalDay goes last).




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2