How to compute the average score for each user, by time interval?



  • I have a table User that only has an id and a name. A user has many Games, which is a table that contains an id, a score, inserted_at and a user_id referencing the User table. I would like that my query returns the following result:

    label user_id data
    Jacques 1 {null, 1, 4, 3, null, 6, null, 8, 9, null, 2, 5}
    Marine 2 {null, null, null, 4, 6, 2, 8, 6, 10, 10, 3, 8}

    where label is the name of the user and data is the average score of each games, per user, per unit of time within a time interval (here it is per month for the last 12 months) from this query:

    SELECT DATE_TRUNC('month', generate_series(LOCALTIMESTAMP - INTERVAL '11 month', LOCALTIMESTAMP, '1 month')) as time_unit)
    

    which return this result:

    "2021-07-01 00:00:00"
    "2021-08-01 00:00:00"
    "2021-09-01 00:00:00"
    "2021-10-01 00:00:00"
    "2021-11-01 00:00:00"
    "2021-12-01 00:00:00"
    "2022-01-01 00:00:00"
    "2022-02-01 00:00:00"
    "2022-03-01 00:00:00"
    "2022-04-01 00:00:00"
    "2022-05-01 00:00:00"
    "2022-06-01 00:00:00"
    

    The unit of time can be changed, in the above query, it is every months from the last 12 months, but it could be every months from the last 6 months, or every day for the last 2 weeks.

    The NULL are important, because this data is then to be fed to https://www.chartjs.org/ , so I need to know when a user hasn't played a game and scored anything.

    So far I have two attempts:

    Attempt 1:

    (SELECT time_unit, avg, u.name as user_name, u.id as user_id
    FROM users u
    CROSS JOIN (
        SELECT time_unit, AVG(ROUND(g.score, 1))
        FROM (SELECT DATE_TRUNC('month', generate_series(LOCALTIMESTAMP - INTERVAL '3 year', LOCALTIMESTAMP, '1 month')) as time_unit) as time_unit_query
        LEFT JOIN games g
        ON DATE_TRUNC('month', g.inserted_at) = time_unit
        LEFT JOIN users u
        ON u.id = g.user_id
        GROUP BY time_unit, u.name, u.id
    ) as q1
    ORDER By 1);
    

    Which returns the roughly this result:

      time_unit | score | user_name | user_id
      ...
      "2021-05-01 00:00:00",  null,                 "Marine",     1
      "2021-05-01 00:00:00",  null,                 "Jacques",  2
      "2021-05-01 00:00:00",  null,                 "Jean",     3
      "2021-06-01 00:00:00",    6.0000000000000000, "Marine",     1
      "2021-06-01 00:00:00",    6.0000000000000000, "Jacques",  2
      "2021-06-01 00:00:00",    6.0000000000000000, "Jean",     3
      "2021-06-01 00:00:00",    7.8333333333333333, "Marine",     1
      "2021-06-01 00:00:00",    7.8333333333333333, "Jacques",  2
      "2021-06-01 00:00:00",    7.8333333333333333, "Jean",     3
      ...
    

    As you can see, the results are wrong because all users have the same score, which is not the case in the database.

    Attempt 2:

    SELECT u.name AS label,
        u.id,
        ARRAY(
            SELECT AVG(ROUND(g.score))
            FROM (SELECT DATE_TRUNC('month', generate_series(LOCALTIMESTAMP - INTERVAL '1 year', LOCALTIMESTAMP, '1 month')) as time_unit) as time_unit_query
            LEFT JOIN games g
            ON DATE_TRUNC('month', g.inserted_at) = time_unit 
            GROUP BY time_unit, user_id
        )
    FROM games g
    JOIN users u
    ON u.id = g.user_id
    GROUP BY u.id
    

    returns this:

    user_name | user_id | data
    "Jean"  3   {6,5.222222222222222,2.625,7.833333333333333,4.6,5.111111111111111,NULL,6,6.5,6.888888888888889,6.545454545454546,4.125,5.666666666666667,6.666666666666667,4,7,5,5.75,4.428571428571429,5.545454545454546,5.111111111111111,5.25,4.375,6.181818181818182,5.083333333333333,6,5.666666666666667,5.857142857142857,5.8,6.625,6.875,5.125}
    "Jacques"   2   {6,5.222222222222222,2.625,7.833333333333333,4.6,5.111111111111111,NULL,6,6.5,6.888888888888889,6.545454545454546,4.125,5.666666666666667,6.666666666666667,4,7,5,5.75,4.428571428571429,5.545454545454546,5.111111111111111,5.25,4.375,6.181818181818182,5.083333333333333,6,5.666666666666667,5.857142857142857,5.8,6.625,6.875,5.125}
    "Marine"    1   {6,5.222222222222222,2.625,7.833333333333333,4.6,5.111111111111111,NULL,6,6.5,6.888888888888889,6.545454545454546,4.125,5.666666666666667,6.666666666666667,4,7,5,5.75,4.428571428571429,5.545454545454546,5.111111111111111,5.25,4.375,6.181818181818182,5.083333333333333,6,5.666666666666667,5.857142857142857,5.8,6.625,6.875,5.125}
    

    As you can see here too, all users have the same score ...

    Attempt 3:

    WITH span AS (
      SELECT
        1 AS user_id,
        NULL AS avg,
        DATE(series) AS date
      FROM GENERATE_SERIES(DATE_TRUNC('month', NOW() - INTERVAL '11' MONTH), NOW(), '1 month') AS series
      UNION
      SELECT
        games.user_id, ROUND(AVG(games.score), 1) AS avg, DATE_TRUNC('month', games.inserted_at) AS date
      FROM games
      WHERE games.user_id = 1
      AND games.inserted_at > NOW()-INTERVAL '11' MONTH
      GROUP BY games.user_id, date
      ORDER BY date)
    SELECT
      users.name AS name,
      ARRAY_AGG(s.avg) AS avg
    FROM
      (SELECT user_id, ROUND(AVG(avg), 1) AS avg, date FROM span GROUP BY user_id, date ORDER BY date) AS s JOIN users ON s.user_id = users.id
    GROUP BY
      name;
    

    returns this:

    user | data
    "Marine"    {6.8,6.6,4.4,5.3,7.0,5.7,5.8,6.0,5.0,4.0,6.5,NULL}
    

    but only works for one user.

    How can I get the correct score for each user and how can I aggregate the result into an array so that I get the data in the proper format?



  • You can check a DB fiddle of the solution here (without a function just raw SQL): https://www.db-fiddle.com/f/fR2JtQKVcQzrnuZ9SfpcNS/0

    For the solution, I am providing some dummy data to show some real output as I don't have your data:

    create table if not exists users (
        user_id serial primary key,
        name text
    );
    

    create table if not exists games (
    game_id serial primary key,
    user_id int references users(user_id),
    score numeric,
    inserted_at timestamptz
    );

    insert into users (name)
    values ('5w3n'),('CCoX'),('FHWbb8f7'),('NhsX1XcP'),('OPP6A3j3'),('t4N'),('scdHS'),('Adve'),('yqy'),
    ('fALF'),('J77T'),('TAXQ5B3V'),('GQGtBXV'),('U2yx6o'),('4in74FhL'),('ZsoZs'),('9dn6'),
    ('ewhi3u6q'),('q18');

    insert into games(user_id, score, inserted_at)
    values (5,589.53,'2022-01-01'),(9, 48.70,'2022-01-11'),(6, 27.05,'2022-02-01'),
    (1,952.81,'2022-02-23'),(11,23.80,'2022-02-15'),(5,732.60,'2022-03-02'),
    (7,460.05,'2022-03-09'),(4,402.98,'2022-03-29'),(7,375.06,'2022-03-30'),
    (3,325.03,'2022-04-12'),(2,363.32,'2022-04-15'),(12,34.60,'2022-04-19'),
    (6,370.70,'2022-04-22'),(6,462.08,'2022-05-01'),(2,575.39,'2022-05-02'),
    (13,14.56,'2022-05-03');

    The function:

    create or replace function get_user_averages(
        p_start_date timestamptz,
        p_end_date timestamptz default now(),
        p_interval interval default '1 month',
        p_truncate text default 'month'
    )
      returns table(
        user_id int,
        name text,
        data numeric[]
      )
      language plpgsql
    as
    $$
    begin
        return query
        select x.user_id, x.name, array_agg(x.average order by x.gen)
        from (
            select
                gen,
                u.user_id,
                u.name,
                round(avg(g.score) filter(where gen = date_trunc(p_truncate, g.inserted_at)) ,2) as average
            from generate_series(p_start_date, p_end_date, p_interval) gen
                cross join games g
                join users u on u.user_id = g.user_id
            group by gen, u.user_id, u.name
        ) x
        group by x.user_id, x.name
        order by x.user_id;
    end;
    $$;
    

    This function allows you to input your own custom start and end date as well as choosing your own interval (it can be 1 month, 1 day etc.). The output of the array itself (data column) is intentionally ordered (so the array itself is ordered) by the generated dates in ascending order. It might not be a useful feature, however, considering you don't have the dates available in the final output you won't be able to know which average belongs to which generated date.

    For this reason I have sorted the array so that your application code may be able to at least do something with the dates if it needs to. I would recommend trying to use a jsonb output instead of array where you would have a {month1:average1, month2:average2 ...} kind of output which may be more useful.

    You have to match the p_interval and p_truncate parameters to get the correct output. As one is for date generation, the other specifies the truncation of the timestamptz of the real table's inserted_at (so if you're sending interval '1 month', you should send month in p_truncate parameter)

    Calling the function:

    select * from get_user_averages('2022-01-01');
    

    The result:

    +-------+--------+-----------------------------------+
    |user_id|name    |data                               |
    +-------+--------+-----------------------------------+
    |1      |5w3n    |{null,952.81,null,null,null,null}  |
    |2      |CCoX    |{null,null,null,363.32,575.39,null}|
    |3      |FHWbb8f7|{null,null,null,325.03,null,null}  |
    |4      |NhsX1XcP|{null,null,402.98,null,null,null}  |
    |5      |OPP6A3j3|{589.53,null,732.6,null,null,null} |
    |6      |t4N     |{null,27.05,null,370.7,462.08,null}|
    |7      |scdHS   |{null,null,417.56,null,null,null}  |
    |9      |yqy     |{48.7,null,null,null,null,null}    |
    |11     |J77T    |{null,23.8,null,null,null,null}    |
    |12     |TAXQ5B3V|{null,null,null,34.6,null,null}    |
    +-------+--------+-----------------------------------+
    




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2