Bulk Updates on Normalised data present across several PostgreSQL tables



  • We have several websites from which we download CSV files - say 100 such websites. We don't have control over their schedules, so we download files from them every few hours. Each such file can be assumed to hold about 20 GB of data, with one row averaging around 4 KB - so roughly 5 million rows per file.

    The first time this activity was performed, we read, transformed and normalised the data from all these files across 10 tables in a PostgreSQL database. Now, every time we get a new file from each of these websites, we want to update our database. This involves adding any new rows, updating existing rows whose columns changed, and marking rows in the tables with a flag if the corresponding row is suddenly absent from the source(s).

    Is the optimal way to do this to use some diffing tool (maybe UNIX diff) to find out what got added, what got removed and what got changed, and then create bulk INSERT/UPDATE queries only for those and run them against the database? Is there a recommended file format, a tool, or some sort of out-of-the-box solution for this if, say, the files are downloaded to AWS S3?

    OR are there faster ways, like creating temporary normalised tables from the new versions of the files and then doing some sort of compare-and-update into the main customer-facing tables?

    OR should we just create a temporary, non-normalised table, do the diffing there against the new version of the data, and then transform and update the main normalised tables?

    I'm new to databases and have also read that we can write server-side functions which the database can execute to do a few things for us - i.e., bake the transform + normalise logic into such server-side functions and create a trigger to execute them, etc.

    What is the recommendation?



  • It would probably be best to do as much as possible outside the database, as databases add a lot of overhead. But Unix diff might not be suitable, as it wants all the data in RAM, at least if there is more than a handful of differences. I find sort and uniq to be a more reliable way than diff to deal with very large text data files. This does depend on every line being a record, i.e. no embedded newlines.

    For example, to get all new lines not identical to an old line:

    sort new old old | uniq -u
    

    Since old is included twice, every old line will be present at least twice and so will not survive the -u. And every new line identical to an old line will be present three times as a group, so it also won't survive the -u. That leaves only the lines that exist solely in new.

    If you do this on the entire data file, the results mix together new lines and changed lines. If instead you isolated just the primary key columns, it would give you all the new primary keys. But if you use `INSERT ... ON CONFLICT ... DO UPDATE`, you don't need to distinguish those two situations: the database will resolve them for you, without you having to bother with all the provably identical data.
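
    As a rough sketch of that upsert path, you could load the surviving new/changed lines into a temporary staging table and let PostgreSQL sort out insert versus update. The customers table, its columns and the file name below are purely illustrative:

    BEGIN;

    -- Staging table matching the CSV layout; names are illustrative.
    CREATE TEMP TABLE staging_customers (
        id    bigint PRIMARY KEY,
        name  text,
        email text
    ) ON COMMIT DROP;

    -- Client-side load of the sort/uniq output, e.g. from psql:
    -- \copy staging_customers FROM 'new_or_changed.csv' WITH (FORMAT csv)

    -- New rows are inserted, changed rows overwrite the existing ones;
    -- provably identical rows never reach this step because the
    -- sort/uniq pass already filtered them out.
    INSERT INTO customers (id, name, email)
    SELECT id, name, email
    FROM   staging_customers
    ON CONFLICT (id) DO UPDATE
    SET    name  = EXCLUDED.name,
           email = EXCLUDED.email;

    COMMIT;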

    To get the data that needs to be marked as deleted:

    extract_pk new new old | sort | uniq -u
    

    This would give you the list of primary keys from the old data that are no longer in the new data, which would then have to be flagged as deleted. (Here extract_pk stands for whatever command pulls just the primary key columns out of each file.)
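
    A rough sketch of that flagging step, assuming those primary keys are loaded into a temporary table and the main table carries an is_deleted flag (the table, column and file names are again illustrative):

    BEGIN;

    -- Primary keys that disappeared from the source, as produced above.
    CREATE TEMP TABLE absent_pks (id bigint PRIMARY KEY) ON COMMIT DROP;

    -- Client-side load, e.g. from psql:
    -- \copy absent_pks FROM 'absent_pks.txt'

    -- Flag, rather than delete, the rows that vanished from the source.
    UPDATE customers c
    SET    is_deleted = true
    WHERE  c.is_deleted = false
      AND  EXISTS (SELECT 1 FROM absent_pks a WHERE a.id = c.id);

    COMMIT;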



