Insert-if-updated in a data lake: S3 or MongoDB?



  • Scenario:

    • Each day, a fairly large batch of .csv files (~1.5 GB) is automatically gathered via an API call and compressed into one .zip file
    • The .zip is decompressed and each .csv is saved in S3
    • Each day, the .zip contains the same set of .csv files
    • The job runs outside working hours, so high computation speed is not a concern
    • The file structure in S3 is
    raw/{data_source}/{date}/{filename}/{filename}.csv
    
    • Approximately 99% of the rows in each .csv are unchanged from one day to the next

    My understanding is that a data lake should, in principle, store all raw data as part of an ELT job. With that approach, however, a lot of duplicate data would be stored, increasing the cost of the data lake.

    One could therefore check which rows in each .csv have changed and store only the updated and newly added rows in the data lake. However, a bug in that comparison logic could mean silently missing an important change to a particular variable.
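    A minimal sketch of what such a delta check could look like, assuming pandas is available and a full-row comparison is acceptable (the file names below are made up; in practice both versions would be read from the raw/{data_source}/{date}/{filename}/ prefix in S3):

    ```python
    import pandas as pd

    # Hypothetical local copies of yesterday's and today's version of one .csv;
    # in practice both would come from the raw/{data_source}/{date}/... prefix.
    yesterday = pd.read_csv("orders_2023-01-01.csv")
    today = pd.read_csv("orders_2023-01-02.csv")

    # Rows in today's file with no exact full-row match in yesterday's file
    # are either newly added or updated.
    merged = today.merge(
        yesterday.drop_duplicates(),
        how="left",
        on=list(today.columns),
        indicator=True,
    )
    delta = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

    # Only this (much smaller) delta would be written back to the data lake.
    delta.to_csv("orders_2023-01-02_delta.csv", index=False)
    ```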

    Another option would be to use a non-relational database such as MongoDB as the "data lake" (more a store of all the time-series data than a complete backup of the raw files). Each .csv would be its own collection.
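    If that route were taken, the write path would essentially be an upsert per row. A minimal sketch with pymongo, assuming a stable "id" column in each .csv (the connection string, database, and collection names are made up for illustration):

    ```python
    import csv
    from pymongo import MongoClient, UpdateOne

    # Hypothetical connection; one collection per .csv file.
    client = MongoClient("mongodb://localhost:27017")
    collection = client["datalake"]["orders"]

    # Upsert each row keyed on an assumed stable "id" column: new rows are
    # inserted, changed rows are updated, unchanged rows are rewritten as-is.
    with open("orders_2023-01-02.csv", newline="") as f:
        ops = [
            UpdateOne({"id": row["id"]}, {"$set": row}, upsert=True)
            for row in csv.DictReader(f)
        ]

    if ops:
        collection.bulk_write(ops)
    ```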

    I see three options here:

    1. Use S3 and dump all data there. Benefit: it's easy and requires little time to maintain. Drawback: lots of duplicates, an unnecessary amount of data, higher cost.
    2. Use S3 but store only the changed rows. Benefit: fewer duplicates, lower cost. Drawback: it partly defeats the purpose of a data lake and requires more maintenance.
    3. Use a non-relational database as the "data lake". Benefit: checking which values have changed might be easier. Drawback: it requires more time to maintain.

    Which option would suit this scenario best?



  • This is somewhat of a shopping-list question, so unfortunately you may not find many answers because it's off-topic. That said, you can accomplish what you want with almost any database solution; it's mostly a matter of how comfortable you are with the one you pick.

    You mentioned: "Columns in the csv is rather consistent" - if that's the case, a regular relational database management system (RDBMS) would work quite well for your problem too, and it would be my personal choice.

    Most modern RDBMSs can load CSVs, and some can even query them directly (e.g. OPENROWSET() in SQL Server). They are good at preventing and gracefully handling duplicates, and at updating only the data that has changed, with transactional guarantees, so you don't have to worry as much about bugs. In my opinion they would be the easiest choice, but that's partly because I've already worked with them for a similar use case.
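    As a rough illustration of that load-and-upsert pattern, here is a sketch using Python's built-in sqlite3 purely as a stand-in for whichever RDBMS you pick (the table, columns, and file name are invented; in SQL Server you'd reach for MERGE, in PostgreSQL for INSERT ... ON CONFLICT):

    ```python
    import csv
    import sqlite3

    # sqlite3 is only a stand-in here; the same idea applies to any RDBMS.
    conn = sqlite3.connect("datalake.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            id     TEXT PRIMARY KEY,
            amount REAL,
            status TEXT
        )
    """)

    with open("orders_2023-01-02.csv", newline="") as f:
        rows = [(r["id"], r["amount"], r["status"]) for r in csv.DictReader(f)]

    # One transaction for the whole load: either every row lands or none do.
    # Existing rows are updated in place; new rows are inserted.
    with conn:
        conn.executemany(
            """
            INSERT INTO orders (id, amount, status)
            VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                amount = excluded.amount,
                status = excluded.status
            """,
            rows,
        )
    conn.close()
    ```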



