Testing programs that handle huge datasets
I'm writing a script that parses many thousands of lines from a server log and collates the data. I'm wondering how to go about creating unit tests for such a program, to make sure it gathers the data correctly.
Should I just create a very small dataset, collate it manually, and make sure that the unit test handles this small amount of data correctly? Is there some other way of determining that the calculations I did are correct? Ideally I would like to write the test based on a file that approximates what my script would actually be processing, rather than a much smaller sample.
If it were me, I would test functional correctness and scale independently. Fault isolation is easier that way, and your code/test/debug cycle can go more quickly too.
The thinking here is that functional issues are typically independent of scale; for example, whether you handle really long URLs or a missing IP address has nothing to do with how big the file is. On the flip side, the kind of errors you encounter at scale (e.g. running out of memory or slowing to a crawl) are only likely to happen with a really big file.
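To make the split concrete, here is a minimal sketch in Python. The `parse_line` and `collate` functions are hypothetical stand-ins for your script's actual logic; the point is that the functional tests use tiny, hand-checked fixtures, while a separate helper generates an arbitrarily large synthetic file for scale testing.

```python
import io
import unittest

# Hypothetical parser: substitute your script's real parsing logic.
def parse_line(line):
    """Parse one log line of the assumed form 'IP STATUS URL' into a dict."""
    ip, status, url = line.split(None, 2)
    return {"ip": ip, "status": int(status), "url": url.strip()}

def collate(lines):
    """Count requests per status code across an iterable of log lines."""
    counts = {}
    for line in lines:
        status = parse_line(line)["status"]
        counts[status] = counts.get(status, 0) + 1
    return counts

class FunctionalTests(unittest.TestCase):
    """Correctness on tiny, hand-collated fixtures -- independent of scale."""

    def test_collates_small_sample(self):
        sample = io.StringIO(
            "10.0.0.1 200 /index.html\n"
            "10.0.0.2 404 /missing\n"
            "10.0.0.1 200 /about\n"
        )
        self.assertEqual(collate(sample), {200: 2, 404: 1})

    def test_handles_long_url(self):
        # Edge cases like very long URLs belong here, not in the big file.
        line = "10.0.0.1 200 /" + "a" * 5000
        self.assertEqual(parse_line(line)["status"], 200)

def make_big_log(path, n_lines):
    """Generate a synthetic log for a separate scale/performance test."""
    with open(path, "w") as f:
        for i in range(n_lines):
            f.write(f"10.0.{i % 256}.{i % 100} 200 /page/{i}\n")
```

A scale test would then call `make_big_log` with a few million lines and check memory use and runtime, without asserting anything about the collated values that the small fixtures already cover.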