Waiting for asynchronous processes in system tests
Note: I am aware that this topic is broad and perhaps subjective; however, I do not know a better place to ask. If you close it, please tell me where I could ask instead.
Assuming we are running a fully automated system test, we often have to wait for asynchronous tasks, such as installation, computation, or a response from the web. If tests run in the cloud, execution time is also influenced by the cloud's current workload, among other factors.
What do we want ideally?

- Success: we move on immediately to the next step of testing (not the topic of this post).
- Failure: we want the test to wait until an obvious error occurs or until a maximum timeout is reached, because deadlocks are possible. In some cases (e.g. network not available) we would like to retry the task a couple of times.

Implementing the maximum timeout and the retry step is relatively straightforward; a sketch follows below. It is the obvious-error case I am struggling with, and that is therefore the topic here.
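For illustration, a minimal sketch of those straightforward parts (overall timeout plus retries), assuming hypothetical `start_task()` and `poll_status()` callables that stand in for whatever the system under test actually provides:

```python
import time

MAX_WAIT_SECONDS = 600   # overall timeout for one attempt
POLL_INTERVAL = 5        # seconds between status checks
MAX_RETRIES = 3          # e.g. for transient network failures

def wait_for_task(start_task, poll_status):
    """start_task() kicks off the async work; poll_status() returns
    'running', 'done' or 'failed'. Both are hypothetical placeholders."""
    for attempt in range(1, MAX_RETRIES + 1):
        start_task()
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while time.monotonic() < deadline:
            status = poll_status()
            if status == "done":
                return True
            if status == "failed":
                break            # obvious error -> retry from scratch
            time.sleep(POLL_INTERVAL)
        # deadline hit or task failed: fall through to the next attempt
    return False
```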
Why are a maximum time and a number of retries not enough? Consider the following scenario: an automated system test verifies that a computation produces a certain result. The run time could increase with a new update, or vary with the size of the test data or the workload of the cloud. If we simply set a maximum timeout, the test fails once the timeout is reached, even though the computation is still ongoing. If the computation fails completely, the test might run for hours because the timeout has not been reached yet. It is important to keep execution time low and feedback immediate.
What is best practice here? Is it a good idea to check something like "does it look like the process is still running?" in addition to the regular "did it finish successfully?" check? Is analyzing logs or events the way to go? Any expertise on this topic would be welcome and helpful.
Actually, that's a great question, although in most test systems I have worked with, this stage came last, as a reaction to problems, rather than as part of a solid design.
"does it look like the process is still running?"
Sure, why not? You can extend it to actual fully fledged monitoring, for example using monit on Linux. Then you can monitor not just the process doing the work but also the test infrastructure itself: is the network share connected? Is the DB up? Do I have enough memory and CPU available?
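As a rough sketch of what such checks could look like inside the test harness itself (this uses the psutil library; the thresholds, host name, and port are assumptions for illustration):

```python
import socket
import psutil

def process_alive(pid):
    """Is the process under test still running and not a zombie?"""
    try:
        proc = psutil.Process(pid)
        return proc.is_running() and proc.status() != psutil.STATUS_ZOMBIE
    except psutil.NoSuchProcess:
        return False

def infrastructure_healthy(db_host="db.example.com", db_port=5432):
    """Cheap infrastructure checks: memory/CPU headroom, DB reachable.
    Thresholds and the DB endpoint are illustrative assumptions."""
    checks = {
        "memory": psutil.virtual_memory().available > 500 * 1024 * 1024,
        "cpu": psutil.cpu_percent(interval=1) < 95,
    }
    try:
        socket.create_connection((db_host, db_port), timeout=3).close()
        checks["db"] = True
    except OSError:
        checks["db"] = False
    return checks
```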
Is analyzing logs or events the way to go?
I had a secret plan to utilize this for post-processing of results, but analyzing the system as it runs is the way to go, if you can afford the complexity and the processing load.
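A minimal sketch of live log analysis while waiting, so the test fails fast on a known-fatal message instead of sitting out the full timeout (the log path and error patterns are assumptions):

```python
import re
import time

# Illustrative patterns; tune to what your system actually logs.
ERROR_PATTERNS = re.compile(r"(FATAL|OutOfMemory|Traceback|deadlock)", re.I)

def watch_log(path, deadline, poll=1.0):
    """Tail the log until the deadline, returning early if an
    obviously fatal line appears."""
    with open(path, "r") as log:
        log.seek(0, 2)                  # start at the end of the file
        while time.monotonic() < deadline:
            line = log.readline()
            if not line:
                time.sleep(poll)        # nothing new yet
                continue
            if ERROR_PATTERNS.search(line):
                return f"obvious error in log: {line.strip()}"
    return None                         # no fatal pattern before the deadline
```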
In the extreme, you can use AI/ML to identify problems based not on hard-coded rules but on past experience and results, although I doubt this is feasible for most projects.
run-time could increase with a new update, vary due to test data size or workload of the cloud
That is actually a bug in your tests: you should be able to assess how much time is needed and adjust the timeout accordingly. If you collect enough data about your tests (CPU load, memory consumed or available, file size, processing needed, etc.), you can build a model that helps you predict the necessary timeout and even adjust it as you go.
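A minimal sketch of the simplest such model, deriving the timeout from past run durations rather than hard-coding it (where the history comes from, the safety factor, and the floor value are all assumptions):

```python
import statistics

def predicted_timeout(past_durations, factor=3.0, floor=60.0):
    """Timeout = mean + factor * stddev of recent successful runs,
    with a floor for when there is not enough history yet."""
    if len(past_durations) < 2:
        return floor
    mean = statistics.mean(past_durations)
    stdev = statistics.stdev(past_durations)
    return max(floor, mean + factor * stdev)

# e.g. recent runs took 110-140 s -> a timeout of minutes, not hours
print(predicted_timeout([110, 125, 118, 140, 122]))
```

A richer model could add the inputs mentioned above (CPU load, data size, etc.) as features, but even this naive version keeps the timeout tracking reality instead of drifting out of date with every update.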