How can I efficiently scale a data lake?
I am running into problems with AWS Athena data lakes. They were initially very fast, but as we've scaled, performance has degraded and queries have become expensive and inefficient. What can we do to solve this without reducing data size?
You're definitely not alone; this is a well-known problem. So much so that there's an article about it: https://blog.eduonix.com/bigdata-and-hadoop/improve-dataops-with-dynamic-indexing-in-data-lake/ . I've personally had issues with partitioning, queries failing outright, and overall poor performance. That article mentions Varada, which I think could be a good solution. Here's how it works:
Varada breaks a large dataset down into what they call "nano blocks" of 64k rows each. Their technology inspects each nano block individually and automatically chooses an appropriate index type for it, so the original dataset ends up adaptively indexed block by block instead of with one global index.
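To make the idea concrete, here's a minimal sketch of per-block adaptive indexing. This is not Varada's actual implementation; the function names, heuristics, and thresholds are my own invention purely to illustrate the concept of choosing a different index per 64k-row block based on that block's data.

```python
NANO_BLOCK_SIZE = 65536  # 64k rows per nano block, as described above

def choose_index(values):
    """Pick an index type for one block of column values (toy heuristic)."""
    distinct = len(set(values))
    if distinct <= 2:
        return "bitmap"      # very low cardinality: bitmap index is compact
    if distinct < len(values) * 0.01:
        return "dictionary"  # low cardinality: dictionary-encode the values
    if values == sorted(values):
        return "range"       # already sorted: a min/max range index suffices
    return "btree"           # high cardinality, unsorted: fall back to a tree

def index_nano_blocks(column, block_size=NANO_BLOCK_SIZE):
    """Split a column into nano blocks and choose an index for each one."""
    return [
        choose_index(column[i:i + block_size])
        for i in range(0, len(column), block_size)
    ]

# Example: a boolean-like column gets bitmap indexes per block,
# while a sorted ID column gets a cheap range index instead.
flags = index_nano_blocks([0, 1] * 100, block_size=100)
ids = index_nano_blocks(list(range(50)), block_size=50)
```

The point is that each block gets the index best suited to its own distribution, which is why this approach can keep queries fast as the lake grows without you hand-tuning a single global partitioning scheme.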