What's the most efficient way to store many small user messages in Postgres?



  • I'm expecting to store many messages from an online chat service: up to 100 messages per second at peak, and about 20 per second on average. I want to be able to retrieve messages by user and by timestamp/time range, so I use the user ID as the primary key and a BRIN index on the timestamp.

    Most messages are very small, maybe

    I can't help but think that these messages are extremely compress-able and it's a huge waste to store them all in plaintext, especially when a lot of the messages are the exact same. Does Postgres have a way to automatically compress these plaintext messages?



  • Postgres employs a storage technique called TOAST ( https://www.postgresql.org/docs/current/storage-toast.html ) that can compress individual column values, but it is only suitable for large values: each value is compressed on its own, without reference to values in other rows. It will be useless in your scenario of many short, repetitive values.
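
    The ineffectiveness of per-value compression on short messages is easy to demonstrate. A minimal sketch using Python's zlib (TOAST actually uses pglz or LZ4, but the effect is the same for any per-value scheme):

    ```python
    import zlib

    # A typical short chat message, ~20 bytes.
    msg = b"hey, are you there?"

    # Compressed alone, the header and checksum overhead outweigh
    # any savings: the "compressed" value comes out larger.
    alone = zlib.compress(msg)
    print(len(msg), len(alone))

    # Compressed together, 1000 repeats of the same message shrink
    # dramatically -- the redundancy lives *across* rows, which is
    # exactly what per-value compression cannot exploit.
    batch = zlib.compress(msg * 1000)
    print(len(msg) * 1000, len(batch))
    ```

    This is why the dictionary-based schemes mentioned below can win where TOAST cannot: the dictionary is shared across many rows.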

    Some other databases (e.g. Db2 and Oracle) offer compression with table- and page-level dictionaries, which would address your requirement. However, you'll need to ask yourself what you want to achieve by compression. You say "it's a huge waste to store them all in plaintext", but is it really?

    Taking your estimates at face value, you will generate about 630 million rows per year (at 20 per second on average). Each row will have at least these columns: a user ID (bigint, 8 bytes), a timestamp (also 8 bytes), and the message (at most 20 bytes). This gives an estimated table size of ~21 GiB per year. Suppose you were able to compress message values at a 1:2 ratio, including overhead. This would bring the table down to ~15 GiB per year. That is a solid ~30% saving, but in absolute terms it's just 6 GiB, which is entirely immaterial. Once you take into account possible other columns in the table, as well as the two indexes (which would not be compressed), the effect of compression becomes even less impressive.
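
    The arithmetic above is easy to check. A back-of-the-envelope sketch; the 36-byte row and the 1:2 message ratio are the assumptions from the previous paragraph, and Postgres's per-row header and page overhead are ignored:

    ```python
    SECONDS_PER_YEAR = 86_400 * 365
    rows_per_year = 20 * SECONDS_PER_YEAR   # ~630 million rows

    GiB = 2**30
    row_plain = 8 + 8 + 20    # user ID + timestamp + message
    row_packed = 8 + 8 + 10   # message compressed at 1:2

    plain_gib = rows_per_year * row_plain / GiB
    packed_gib = rows_per_year * row_packed / GiB

    print(f"rows/year:  {rows_per_year:,}")
    print(f"plain:      {plain_gib:.1f} GiB")
    print(f"compressed: {packed_gib:.1f} GiB")
    print(f"saved:      {plain_gib - packed_gib:.1f} GiB "
          f"({1 - row_packed / row_plain:.0%})")
    ```

    Note that only the message column shrinks; the fixed-width key and timestamp (and both indexes) stay full size, which is why halving the messages saves well under half the table.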

    As mentioned in the comments, compression has non-trivial CPU costs on both insertion and retrieval (typically in the range of 5-10% extra CPU utilisation). True, it reduces the I/O cost, but is it worth it at these volumes?

