I'm doing performance testing through a nightly ant run: a custom package exercises the server, grabs metrics through a diagnostic call, and outputs them periodically.
Do you mean metrics from your program under test, from the program testing it, or from the system?
We have the test metrics from the tester and the logs from the software under test, but we haven't yet worked out how to properly monitor the machines and tie that data to specific test runs. The tests start from around 8 at night on a separate 10-machine subnet.

We are looking into setting up Munin in a master/node arrangement: one machine aggregates data from the other 9 test machines, somehow ties it to a specific test run, and produces report graphs alongside the throughput data from Grinder/JMeter. This is the bit I can't find any information about. I've seen plenty of documentation and tech talks saying you must monitor the system as well as the software, showing throughput against memory usage, CPU usage, file I/O, and so on.

We do a lot of this manually: after we see a spike in the system usage stats during the night, we go digging through logs, comparing timestamps to work out what was running and what went wrong. It's painful and it's slow.
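The timestamp-matching part of that manual digging is mechanical enough to script. A minimal sketch, assuming each run's start/end times are known and the system metrics can be exported as `(unix_timestamp, value)` rows (e.g. dumped from Munin's RRD files with `rrdtool fetch`):

```python
# Hypothetical sketch: slice exported system metrics by each test run's
# time window, so runs and machine stats line up without manual digging.
# The run names, windows, and sample data below are invented for illustration.

def slice_metrics(samples, run_start, run_end):
    """Return only the samples that fall inside one run's time window."""
    return [(ts, val) for ts, val in samples if run_start <= ts <= run_end]

def tag_runs(runs, samples):
    """Map each named run to its slice of system samples.

    runs:    dict of run_name -> (start_ts, end_ts)
    samples: list of (unix_timestamp, value) tuples
    """
    return {name: slice_metrics(samples, s, e) for name, (s, e) in runs.items()}

# Example: two runs overnight, one metric sampled every 5 time units.
cpu = [(t, 0.2 + 0.1 * (t % 3)) for t in range(0, 100, 5)]
runs = {"run-A": (0, 40), "run-B": (45, 95)}
by_run = tag_runs(runs, cpu)
```

Once each run owns its slice of samples, graphing throughput against memory or CPU per run is just a plotting step rather than a log-archaeology session.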
It would be great to read how other people have solved this problem.
Most metrics come from a diagnostic call on the server being tested (memory use, etc.), with throughput data (requests per second, etc.) coming from the program testing it. Only limited system data is collected, primarily the JVM's memory use. Each test starts fresh on two EC2 instances: one for the performance test package, which replays abstracted logs, and one for the server being tested. Each instance saves the data for its test run, which I grab before shutting the instance down. On the assumption that each instance is clean at the start of a test, I haven't thought much about how the performance of the system itself affects the tests, but perhaps I should revisit that.
Most of our stuff is custom code for a number of reasons, the biggest external library being HttpClient. I'm not sure how other people have solved this particular problem; I just got to the point where clean EC2 instances were generating pretty comparable data, and then set about putting out fires in the code (: