How to guarantee good testing of a complex distributed system before release?
carriann
I have a problem deciding how to guarantee good testing of a complex distributed system before release. We have an automated testing infrastructure in place. The problem is that running all the tests takes around 5 days (including stress, longhaul etc.).

Let us assume that we start with build V1 of the system and we start testing on that. After 5 days, we look at the results and notice that there are some failures, some due to dev code changes, some due to breakage in the automation. In the meantime, devs continue work on their tasks; after 1 day we fix all these issues and get a new build out (V1.1, which contains more dev changes plus the changes for the current testing), and we want to start testing again.

Now, the question is: what do we test? Do we spend 5 days running the full bunch of tests on V1.1, or do we run only specific tests for the affected code and pray/hope the other components do not break? Can you please suggest some methodology around testing such systems?

One idea that I had was: when testing starts, create a branch in our repository that "snapshots" the dev branch, and run the tests on that snapshot. If changes are required to either the dev code or the automation, those changes are made in the dev branch and cross-integrated into the snapshot branch. This guarantees that only the changes needed for the test pass are in the snapshot branch, but it still doesn't solve the second problem: what to retest on the next build.

Thanks, /cd
There are a few extra factors here that can impact the way you handle this problem:

- Do you get results for each test as it completes, or do you have to wait until all tests complete?
- Do you have multiple machines on which to run the tests (and is it possible to do this), or are you tied to a single system running your tests in sequence?
- Can you break your tests into smaller groupings?

With a complex distributed system it can be difficult to run tests in parallel, but if you can arrange to do this, it will make managing the automation cycle much less challenging.

Assuming that you can set up your tests to run in parallel:

- I'd start by separating the functional regression tests from the stress and load tests. These are the tests you run daily. If they take more than a day to complete, consider breaking the functional regression tests into multiple smaller suites that can be run in parallel on numerous machines (there's a dispatch sketch after this list). You want these scheduled to kick off at night so they can (ideally) be completed by the time you arrive at work in the morning, letting you analyze any failures and report regressions quickly. This gives you a faster turnaround for purely functional tests.
- For your long-run tests, such as the stress tests and load tests, I'd look at generating short versions that can be run in parallel and scheduled the same way as the functional tests. These won't give you the same level of value as your long-run tests, but they should be sufficient to detect any serious problems. Your long-run tests can then be scheduled to run, say, once a week and will pick up any other problems. You won't have quite the same level of granularity this way, but you're always going to be compromising in this situation.

If you can't run tests in parallel (you don't have the equipment, the system isn't going to allow it, you don't have the licenses for your automation environment... it's not ideal, but it happens), I'd suggest you work this way:

- If your tests aren't already reporting as they complete, set them up to do this. If you're running them as a single long suite, split it into multiple suites set to run in order. This way you can be analyzing failures before the run completes.
- Order your tests to put smoke tests and broad functional regression first. This will give you the earliest possible notification of functional regression issues.

Some things you can do to gather more data faster, regardless of whether you can split your automation into parallel runs or not:

- Each test should report as soon as it completes. This doesn't have to be a detailed report - a simple "X items passed, Y items failed" email is enough to alert your team to potential problems.
- After each test, export all your logs to a shared network location where you can analyze them while the automation is running the next test. That way, if there's a failure, you can start analysis without impacting the current run. (This assumes that each of your tests is a granular sequence that starts with the AUT closed and closes the AUT at the end of the test.) Both of these are sketched below, after the dispatch example.
- If your tests use a common setup sequence, consider turning it into a database restore / data flush-and-reset operation that pulls the data set from a shared network location (also sketched below). I've been in the situation where every test run spent an hour or more configuring data for the actual testing. The team built a once-a-week data-configuration automation run and modified the other runs to simply pull the data from the configuration run, reducing setup for the functional runs to about five minutes.
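To make the parallel-suite idea concrete, here's a minimal dispatch sketch in Python. Everything in it is illustrative rather than prescriptive: it assumes the worker hosts are reachable over ssh, that each has the harness installed, and that a suite can be started with a `run_suite <name>` command - the host names, suite names, and that command are all hypothetical.

```python
#!/usr/bin/env python3
"""Fan suite chunks out to several machines and wait for the results.

Assumptions (mine, not the original poster's): workers are reachable
over ssh, and `run_suite <name>` starts a suite on a worker.
"""
import subprocess
from itertools import cycle

WORKERS = ["testhost1", "testhost2", "testhost3"]     # hypothetical hosts
SUITES = ["smoke", "ui_regression", "api_regression",
          "replication", "failover", "upgrade"]       # hypothetical suites

def dispatch(suites, workers):
    """Round-robin the suites across the workers, then wait for all of them."""
    jobs = []
    for suite, host in zip(suites, cycle(workers)):
        cmd = ["ssh", host, "run_suite", suite]
        jobs.append((suite, host, subprocess.Popen(cmd)))
    # Collect exit codes; a non-zero code marks a failed suite.
    return [suite for suite, _, proc in jobs if proc.wait() != 0]

if __name__ == "__main__":
    failed = dispatch(SUITES, WORKERS)
    print("failed suites:", ", ".join(failed) if failed else "none")
```

Kick this off from a scheduler (cron, Windows Task Scheduler, whatever your shop uses) late at night, and the results are waiting for you in the morning.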
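For the per-test report and the log export, something as small as this is enough. The SMTP relay name, the mail addresses, and the `/mnt/testlogs` mount point are assumptions you'd replace with your own:

```python
#!/usr/bin/env python3
"""Per-suite completion report plus log export to a shared location.

Assumptions: an internal SMTP relay called `mailhost` and a writable
share mounted at /mnt/testlogs. All names are illustrative.
"""
import shutil
import smtplib
from email.message import EmailMessage
from pathlib import Path

SHARE = Path("/mnt/testlogs")   # shared network location (assumed mount)

def report_and_export(suite, passed, failed, log_dir):
    # Bare-bones "X passed, Y failed" mail - enough to flag a problem early.
    msg = EmailMessage()
    msg["Subject"] = f"[automation] {suite}: {passed} passed, {failed} failed"
    msg["From"] = "automation@example.com"
    msg["To"] = "team@example.com"
    msg.set_content(f"Suite {suite} finished: {passed} passed, {failed} failed.")
    with smtplib.SMTP("mailhost") as smtp:
        smtp.send_message(msg)

    # Copy the logs off the test machine so analysis never blocks the next run.
    shutil.copytree(log_dir, SHARE / suite, dirs_exist_ok=True)
```

Call it in the teardown of each test (or suite), and analysis can start from the share while the next test is already running.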
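And a sketch of the setup-as-restore idea. I'm assuming a PostgreSQL-backed system here purely for illustration - the same approach works with any database that has a fast dump/restore path - and the dump location and database name are made up:

```python
#!/usr/bin/env python3
"""Replace a slow data-configuration phase with a database restore.

Assumption (mine, not the poster's): the system under test keeps its
state in PostgreSQL, and the weekly configuration run has already
written a prepared data set to the share.
"""
import subprocess
from pathlib import Path

DUMP = Path("/mnt/testdata/weekly_baseline.dump")  # from the weekly run

def reset_test_data(dbname="aut_test"):
    # Drop and recreate so every suite starts from the same known state.
    subprocess.run(["dropdb", "--if-exists", dbname], check=True)
    subprocess.run(["createdb", dbname], check=True)
    # Restoring a prepared dump takes minutes, not the hour-plus of
    # configuring the data through the application itself.
    subprocess.run(["pg_restore", "--dbname", dbname, str(DUMP)], check=True)
```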
Consider going through your tests and replacing static waits with component-dependent waits wherever possible (there's a minimal sketch at the end of this answer). It's a lot easier to put in a static wait for some number of seconds than it is to code a wait routine that checks for a required component and returns an error if it fails to instantiate and become active within a specified time, but it's also a whole lot slower: if your static wait is 3 seconds and the component exists and is active in 10 milliseconds, the automation will still wait the full 3 seconds. A polling wait that checks for the component every 100 milliseconds will only wait 100 milliseconds. If your automation has been around for a while, chances are good that there are a lot of static waits scattered through the oldest code (I've never seen an automation code base that didn't evolve and improve over time).

No matter what strategy you choose, you're going to be compromising between completeness and quick turnaround: your goal is really to get the best possible mix of both.
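Here's the polling wait mentioned above as a minimal Python sketch; `component_ready` stands in for whatever existence/activity check your automation framework provides:

```python
import time

def wait_for(component_ready, timeout=3.0, poll_interval=0.1):
    """Poll for a component instead of sleeping a fixed number of seconds.

    `component_ready` is any zero-argument callable that returns True once
    the component exists and is active (hypothetical hook into your
    automation layer). Returns as soon as the check passes; raises if the
    timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if component_ready():
            return
        time.sleep(poll_interval)
    raise TimeoutError(
        f"component did not become active within {timeout} seconds")
```

If the component is up in 10 milliseconds, this returns on the first poll instead of burning the full timeout; if it never comes up, you get a clear error rather than a cascade of downstream failures.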