Advice for applying locks in my data processing pipeline



  • I currently have a Python program that inserts rows into a Postgres table that essentially works as a list of data I need to process. The processing creates on-disk files and triggers other behavior, so I only want it to run once for each row.

    I then have another script that takes the rows from that table and processes them. So, for example, there might be 100 rows, each taking 10-20 minutes to complete and producing a few output files.

    I currently run into the problem that I can only run one instance of this script at a time, for fear that two running in parallel might end up processing the same data twice.

    If I create a boolean field that I flip within the application when a row is 'busy', I fear being left with a stale lock if a process is killed abruptly and doesn't end gracefully. If I use the locks built into Postgres, they seem to disappear when the connection/session ends. But if I'm on an unstable connection, I'm not sure what the behavior would be or how I can get the behavior I want. Given that these are 10-20 minute processes, I foresee the connection being lost within that time frame and the lock being lost with it. Thanks for any advice on where to go. I'm using the Python library psycopg2 to connect to the Postgres database.



  • Instead of a boolean attribute that shows whether a row is being processed, add a timestamp with time zone attribute that you set to current_timestamp when you start processing the row. When looking for rows to process, add a WHERE condition like

    WHERE locked_at IS NULL OR locked_at < current_timestamp - INTERVAL '30 minutes'
    

    Then you have a lock that automatically expires after 30 minutes. If processing could take longer than that, use a different interval.
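
    One way to make the claim itself atomic, so two parallel workers can't grab the same row at the same instant, is to combine that WHERE condition with UPDATE ... RETURNING and FOR UPDATE SKIP LOCKED. Below is a minimal sketch with psycopg2, assuming a hypothetical jobs table with id and locked_at columns; the table name, column names, and connection string are placeholders, not your actual schema.

    import psycopg2

    # Claim one unprocessed (or stale-locked) row in a single atomic statement.
    # FOR UPDATE SKIP LOCKED makes concurrent workers skip a row another worker
    # is claiming at that exact moment; the committed locked_at timestamp is the
    # longer-lived lock that survives the connection dropping.
    CLAIM_SQL = """
    UPDATE jobs
    SET locked_at = current_timestamp
    WHERE id = (
        SELECT id
        FROM jobs
        WHERE locked_at IS NULL
           OR locked_at < current_timestamp - INTERVAL '30 minutes'
        ORDER BY id
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id;
    """

    def claim_next_row(conn):
        """Return the id of a claimed row, or None if no row is available."""
        with conn.cursor() as cur:
            cur.execute(CLAIM_SQL)
            row = cur.fetchone()
        conn.commit()  # commit releases the row lock; locked_at is what keeps the claim
        return row[0] if row else None

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
        job_id = claim_next_row(conn)
        if job_id is not None:
            pass  # run the 10-20 minute processing for job_id, then mark it done

    Because the claim is committed immediately, losing the connection halfway through a 10-20 minute job doesn't release anything; the row simply becomes eligible again once locked_at is older than the interval.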



