Nov 18, 2011

[Info] Importance of thorough testing

Lately I've been struggling with one of those "super versions" that seems to beat everything I throw at it.

When I got done with my search improvements, I did some really extensive testing against Mediocre v0.34 and concluded that the new version had almost exactly a 60% win rate against it.
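For a sense of scale, a 60% score works out to roughly 70 Elo under the usual logistic rating model. The sketch below shows that arithmetic, along with a crude error estimate for a given number of games. The class and method names are made up for illustration and are not part of Mediocre, and the error estimate assumes every game is a win or a loss, so with draws it overstates the uncertainty somewhat.

```java
/** Illustrative helpers only; not taken from Mediocre's code base. */
final class MatchStats {
    private MatchStats() {}

    /** Elo difference implied by a score fraction in (0, 1), using the logistic Elo model. */
    public static double eloDifference(double score) {
        if (score <= 0.0 || score >= 1.0) {
            throw new IllegalArgumentException("score must be strictly between 0 and 1");
        }
        return -400.0 * Math.log10(1.0 / score - 1.0);
    }

    /**
     * Rough standard error of the score fraction after the given number of games.
     * Assumption: each game is a plain win or loss (binomial model), which gives
     * an upper bound on the error when draws occur.
     */
    public static double scoreStandardError(double score, int games) {
        if (games <= 0) {
            throw new IllegalArgumentException("games must be positive");
        }
        return Math.sqrt(score * (1.0 - score) / games);
    }
}
```

With these, `MatchStats.eloDifference(0.60)` comes out at about 70.4, and `MatchStats.scoreStandardError(0.60, 1000)` at about 0.015, i.e. roughly one and a half percentage points after a thousand games.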

So I tagged that version and called it Mediocre v1.0 beta.

Then I committed three things to the svn trunk: the renaming of row to file, tapered eval, and the change from 32-bit to 64-bit keys in the transposition table (along with a sanity check of all tt moves).

I thought I'd tried all of these extensively, scoring more or less equal to v1.0 beta, which I deemed ok since the changes were more or less needed for readability, stability, and future work.

During the past weeks any change I made, no matter how tiny it seemed, got slaughtered by 1.0 beta. All my evaluation tweaking seemed to give results, but it still lost against 1.0 beta.

Now, the newer (uncommitted) versions had some utility changes that I really wanted to have committed (things like the mirror evaluation test). So I took those changes and added them to the 1.0 beta tag one by one, testing quite extensively between every change.
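For readers who haven't seen one, a mirror evaluation test flips a position top to bottom, swaps the colours, and checks that the evaluation comes back with the opposite sign. Any difference points at an asymmetry bug in the eval. Below is a minimal sketch of the idea. It uses a toy 64-square board and a toy evaluation invented for this example; the piece codes, the class name and the pawn bonus are all assumptions and are not Mediocre's actual code.

```java
/** Toy mirror-evaluation check; board layout and evaluation are illustrative assumptions. */
final class MirrorEvalSketch {
    // Hypothetical piece codes: positive = white, negative = black, 0 = empty square.
    public static final int PAWN = 1, KNIGHT = 2, BISHOP = 3, ROOK = 4, QUEEN = 5, KING = 6;
    private static final int[] VALUE = {0, 100, 320, 330, 500, 900, 0};

    private MirrorEvalSketch() {}

    /** Toy evaluation from white's point of view: material plus a bonus for advanced pawns. */
    public static int evaluate(int[] board) {
        int score = 0;
        for (int sq = 0; sq < 64; sq++) {
            int piece = board[sq];
            if (piece == 0) continue;
            int type = Math.abs(piece);
            int rank = sq >> 3;                      // 0 = white's back rank
            int bonus = 0;
            if (type == PAWN) {
                // White pawns advance up the board, black pawns down.
                int advancement = piece > 0 ? rank : 7 - rank;
                bonus = 5 * advancement;
            }
            int sign = piece > 0 ? 1 : -1;
            score += sign * (VALUE[type] + bonus);
        }
        return score;
    }

    /** Flips the board top to bottom (same file, mirrored rank) and swaps the colours. */
    public static int[] mirror(int[] board) {
        int[] mirrored = new int[64];
        for (int sq = 0; sq < 64; sq++) {
            mirrored[sq ^ 56] = -board[sq];
        }
        return mirrored;
    }

    /** True if the evaluation of the mirrored position is the exact negation of the original. */
    public static boolean isSymmetric(int[] board) {
        return evaluate(mirror(board)) == -evaluate(board);
    }
}
```

Running `isSymmetric` over a large set of test positions is cheap, which makes it a handy check after every evaluation change.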

After I'd moved over all the utility, I thought I might just as well try the three things I'd committed after 1.0 beta. This is how that testing went:

  1. Row to file change: This should just have been a readability change (the usage of "row" had lingered around since the very first version of Mediocre, while the correct terminology is of course "file"). But it turned out that while doing this I'd also changed the rank, file and distance methods to static (rather than instance methods). This seems to be a very good move since they're called a lot, and suddenly 1.0 beta was playing better, quite a bit better.

  2. 32 to 64 bit keys and hash move validation: I thought if anything, this would be the culprit since messing around with the transposition tables is very likely to introduce bugs. Now when re-adding it, it seems to give a tiny but noticeable strength increase.

  3. Tapered eval: Horrible, horrible reduction in strength. I have no idea how I missed this, but it seems to completely ruin the evaluation. This is the actual culprit, and I'll be much more careful when trying to put it back.
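As background for the third item: tapered eval keeps separate middlegame and endgame scores and blends them according to how much material is left. Here is a minimal sketch of the usual interpolation. The phase weights (knight 1, bishop 1, rook 2, queen 4, for a maximum of 24) follow a common convention and are an assumption here, as are all the names; this is not Mediocre's implementation.

```java
/** Sketch of tapered-eval interpolation; weights and names are illustrative assumptions. */
final class TaperedEvalSketch {
    // Assumed phase weights per piece (a common convention, not necessarily Mediocre's).
    public static final int KNIGHT_PHASE = 1, BISHOP_PHASE = 1, ROOK_PHASE = 2, QUEEN_PHASE = 4;
    // Phase of the starting position: 4 knights, 4 bishops, 4 rooks, 2 queens = 24.
    public static final int MAX_PHASE =
            4 * KNIGHT_PHASE + 4 * BISHOP_PHASE + 4 * ROOK_PHASE + 2 * QUEEN_PHASE;

    private TaperedEvalSketch() {}

    /** Game phase from total piece counts of both sides: MAX_PHASE = opening, 0 = bare endgame. */
    public static int phase(int knights, int bishops, int rooks, int queens) {
        int p = knights * KNIGHT_PHASE + bishops * BISHOP_PHASE
              + rooks * ROOK_PHASE + queens * QUEEN_PHASE;
        // Promotions can push the raw sum above the starting value, so clamp it.
        return Math.min(p, MAX_PHASE);
    }

    /** Blends the middlegame and endgame scores linearly by phase. */
    public static int taper(int middlegameScore, int endgameScore, int phase) {
        return (middlegameScore * phase + endgameScore * (MAX_PHASE - phase)) / MAX_PHASE;
    }
}
```

A cheap sanity check for this kind of code is to test the endpoints: at full phase `taper` should return exactly the middlegame score, and at phase 0 exactly the endgame score.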


So, the moral of the story: never assume you did enough testing if you see signs that you didn't.

1 comment:

Anonymous said...

Even more fun when you only have a binary left of your "super version"...