Currently I'm running the tests in Arena, one game at a time (on my quad-core processor...).
The games are timed at 10sec+0.1sec. At that rate 1000 games take about 9 hours, which means the run just barely fails to finish before I go to work in the morning.
I wanted to use cutechess-cli, which is both faster and lets me run four games at a time (one for each core of my processor), but I'm having severe problems with the engines timing out. I wonder if it has to do with Java. It's open source, so maybe I can do something about it. We'll see.
Anyway, since I have two time windows where I can run tests, 8 hours at night and 8 hours during the day (while at work), I should probably figure out a setup that fits those windows.
Perhaps have a self-play match first, then run a gauntlet if it turns out a new version is better.
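As a rough sketch of how many games fit in one of those 8-hour windows, the arithmetic can be written out like this. It assumes the measured rate of about 9 hours per 1000 games holds, and that running several games at once scales perfectly, which it probably won't in practice. The function name `games_in_window` is made up for this example.

```python
def games_in_window(window_hours, concurrency=1,
                    measured_games=1000, measured_hours=9):
    """Estimate how many whole games fit in a time window.

    Assumes the measured rate (1000 games in about 9 hours, one at a
    time) stays constant and that concurrent games scale perfectly.
    Both are simplifying assumptions, not measurements.
    """
    # Integer arithmetic keeps the result a whole number of games.
    return (window_hours * measured_games * concurrency) // measured_hours


if __name__ == "__main__":
    print("One game at a time:", games_in_window(8))
    print("Four games at a time:", games_in_window(8, concurrency=4))
```

So a single 8-hour window falls a bit short of 1000 games one at a time, while four games at once would leave plenty of room for both a self-play match and a gauntlet.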
Today's test looked like this:
Rank Name                 Elo    +    - games score draws
   1 Gaviota-win64-0.84   334   34   30   606   93%    6%
   2 Mediocre 1.0 -2      -60   21   20   600   43%   21%
   3 Mediocre 1.0 -1     -112   20   20   613   35%   22%
   4 Mediocre v0.34      -162   21   21   607   28%   19%
Mediocre 1.0 -2 is the one with futility pruning; 1.0 -1 also has lazy eval included. Not so successful, it seems.
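For a quick sanity check of a score percentage, the standard logistic Elo formula turns a score fraction into a rating difference. This is only a sketch: `elo_diff` is a name made up here, and the ratings in the table were presumably fitted over all the games in the pool at once, so they won't match this simple head-to-head number.

```python
import math


def elo_diff(score):
    """Convert a score fraction (0 < score < 1) into an Elo difference
    using the standard logistic formula. Draws count as half a point."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    # Score fractions taken from the table above.
    for name, score in [("Mediocre 1.0 -2", 0.43),
                        ("Mediocre 1.0 -1", 0.35),
                        ("Mediocre v0.34", 0.28)]:
        print(name, round(elo_diff(score)))
```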
And Gaviota is beating the living crap out of all versions. Won't allow that for too long. :)
I'll get back to you on the new testing setup. Time to get that stupid cutechess-cli to work somehow.