How to account for accumulating values which "start over" from zero for the same day?
I have set up a table like this:
CREATE TABLE IF NOT EXISTS "system"."network utilization" ( id bigserial, "collected at" timestamp with time zone NOT NULL DEFAULT now(), "number of byte received since startup" bigint NOT NULL, "number of byte sent since startup" bigint NOT NULL, PRIMARY KEY (id) )
Every minute while the computer is powered on, I automatically run a command to get how many bytes have been sent and received over the network, and INSERT these values as a new record into the table. These values from the command are reset every time the computer is started. (That's Microsoft who decided to do it that way. No idea if it's different on Linux.)
In time, this table will take up a lot of space. I thus want to save only one record for each day except for the current day, representing the total number of bytes incoming and outgoing for that day. The first, naive approach would be to simply keep the last record for the day. Or the one with the biggest numbers for the two columns representing bytes. But what if, during one day, I reboot the computer, maybe several times even, thus resetting the values entered to 0 and counting up again? If so, both picking the last record for the day, or the one with the biggest numbers, do not accurately reflect the total network usage for that day.
How should I handle this? I can't do
sum()on all records for a given day, obviously. I'd need something much more intelligent, which can detect if a later record is smaller than a previous one for the same day, and account for that. Is there anything built in for this? This turned out to be much more difficult than I first thought.
The task is not solvable in general using scheduled task. For example, you have 3 adjacent rows like ('01:00', 123456789), ('01:10', 123456), ('01:11', 234567). You cannot guarantee that the second reboot does not occur between 2nd and 3rd rows.
You must use permanently executed task (service) which saves both current counters and some UID unique enough (for example, task starting timestamp). You may say that such UID may be stored into the registry with another task... maybe. But you must guarantee that it is saved before the most first scheduled execution. You must have static base timestamp. When you have uptime only you cannot define what is the base - so you must store created_at timestamp too.
If you can receive both current system uptime and current network statistic then try to save them. This allows you to define the groups of rows in which you must select one row with max. traffic amounts. But such scheme needs to perform the base timestamp calculation for each row during the statistic retrieving - so it is more reasonable to calculate and to store not system uptime but system starting timestamp.
Moreover, I'd recommend to compare calculated timestamp and prev. row timestamp, and, if they're close enough then use the value from prev. row.
Check this solution:
WITH dates AS ( SELECT DISTINCT "collected at"::DATE "date" FROM "network utilization" ), raw AS ( SELECT t1."date", t2."system starting timestamp" sst, MIN("number of byte sent since startup") min_snt, MAX("number of byte sent since startup") max_snt, MIN("number of byte received since startup") min_rcv, MAX("number of byte received since startup") max_rcv FROM dates t1 JOIN "network utilization" t2 ON t1."date" = t2."collected at"::DATE GROUP BY 1,2 ), per_session AS ( SELECT "date", sst, max_snt - COALESCE(LAG(max_snt) OVER (PARTITION BY sst ORDER BY "date"), 0) sent, max_rcv - COALESCE(LAG(max_rcv) OVER (PARTITION BY sst ORDER BY "date"), 0) received FROM raw ) SELECT "date", SUM(sent) sent, SUM(received) received FROM per_session GROUP BY "date" ORDER BY 1;
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=0453fe758a160ccbc617cce65830f40a with some comments/explanations.