Backfilling data using generate time series



  • I have a table like this:

    Date Group name value
    2022-01-01 A 1.0
    2022-01-15 A 0.5
    2022-01-31 A 0.2

    but this contains only 3 days. I need a table with full-calendar dates, a row for every day of the year where the value is the last available.
    For example, all records from 1 Jan 2022 to 14 Jan have value 1.0 (scoped to group_name 'A').

    I have tried using LAST_VALUE() but it's not working.

    WITH dates AS (
    
    SELECT 
        date::date
    FROM generate_series ( '2022-01-01'::timestamp, '2022-01-31'::timestamp, '1 day'::interval) date
    

    ), incomplete_table AS (
    SELECT * FROM (VALUES
    ('2022-01-01'::date, 'a', 1),
    ('2022-01-15'::date, 'a', 0.5),
    ('2022-01-31'::date, 'a', 0.2),
    ('2022-01-02'::date, 'b', 0.1),
    ('2022-01-10'::date, 'b', 0.15),
    ('2022-01-20'::date, 'b', 0.15)
    ) AS t (date,group_name, value)
    )
    SELECT
    dates.date,
    group_name,
    value,
    LAST_VALUE(value) OVER (ORDER BY dates.date DESC) as last_value_window
    FROM dates
    LEFT JOIN incomplete_table ON incomplete_table.date = dates.date
    ORDER BY dates.date DESC;



  • Assuming you want one row per day and group name:

    WITH incomplete_table(date, group_name, value) AS (
       VALUES 
         ('2022-01-01'::date, 'a', 1)
       , ('2022-01-15'::date, 'a', 0.5)
       , ('2022-01-31'::date, 'a', 0.2)
       , ('2022-01-02'::date, 'b', 0.1)
       , ('2022-01-10'::date, 'b', 0.15)
       , ('2022-01-20'::date, 'b', 0.15)
       )
    SELECT d.date, g.group_name, i.value
    FROM  (
       SELECT date::date
       FROM   generate_series (timestamp '2022-01-01'
                             , timestamp '2022-01-31'
                             , interval '1 day') date
       ) d
    CROSS  JOIN (SELECT DISTINCT group_name FROM incomplete_table) g  -- ①
    LEFT   JOIN LATERAL (
       SELECT i.group_name, i.value
       FROM   incomplete_table i
       WHERE  i.group_name = g.group_name
       AND    i.date = timestamp '2022-01-01'  -- ? ②
       ORDER  BY i.date DESC
       LIMIT  1
       ) i ON true
    ORDER  BY g.group_name, d.date DESC;
    

    db<>fiddle https://dbfiddle.uk/?rdbms=postgres_14&fiddle=6a683b65cd480b1c9c934626ed098b77

    ① If available, use a table "groups" providing distinct group names. Faster. Else, if the table is big, consider an emulated index-skip scan. See:

    • https://stackoverflow.com/a/66894500/939860

    ② The search for the latest value is not limited to the given time range unless you spell that out in the LATERAL subquery. About LATERAL:

    • https://stackoverflow.com/a/28557803/939860

    You still get value IS NULL where no earlier value is found.

    If incomplete_table is big, an index on (group_name, date) will help performance (a lot). Possibly even a "covering" index, adding column value. See:

    • https://dba.stackexchange.com/questions/81456/slow-index-scans-in-large-table/81554#81554

    Very similar case with more explanation:

    • https://dba.stackexchange.com/questions/266009/return-previous-running-total-when-value-is-null-in-a-time-series/266013#266013



Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2