Up until now I have been testing additions to Mediocre by trying them on random positions and basically looking at the node count and time to decide whether the addition was an improvement or not.
Once I started to feel satisfied with a new version I ran as many games as I had patience for, somewhere between 100 and 1000, against the last version and a couple of other engines.
I never really ran into any problems using this crude way of testing since pretty much every addition and change resulted in a very noticeable improvement.
However, recently I have been having some serious trouble determining whether new changes actually improve the engine or not, so it is time to refine my methods a bit.
New way of testing

Quick test: Run the YATS test sets selected.pgn and tony.pgn at 5 seconds per position, and analyze the results with Pro Deo 1.4. This takes 3.5 minutes and gives a general idea of the effect of the change(s).
Mediocre v0.31 got the following results:
Testset : selected.pgn
Level : 5 seconds
Engine : Mediocre 0.31
Positions : 26
Found : 1 (3.8%)
Maximum points : 260
Scored points : 145 (55.8%)
Maximum time : 2:10
Used time : 2:05
Testset : tony.pgn
Level : 5 seconds
Engine : Mediocre 0.31
Positions : 16
Found : 2 (12.5%)
Maximum points : 160
Scored points : 69 (43.1%)
Maximum time : 1:20
Used time : 1:17
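As a side note, the percentages in these reports follow directly from the raw counts: each position is worth a maximum of 10 points, so the scored points are compared against 10 times the number of positions. A minimal, self-contained sketch of that arithmetic, using the selected.pgn numbers above (the class name is my own, not part of Mediocre or Pro Deo):

public class YatsSummary {
    public static void main(String[] args) {
        int positions = 26;      // selected.pgn has 26 positions
        int found = 1;           // positions where the top-scoring move was played
        int scoredPoints = 145;  // sum of the 0-10 points over all positions
        int maxPoints = positions * 10;

        // Reproduces the "Found" and "Scored points" lines of the report above
        System.out.printf("Found         : %d (%.1f%%)%n", found, 100.0 * found / positions);
        System.out.printf("Scored points : %d of %d (%.1f%%)%n",
                scoredPoints, maxPoints, 100.0 * scoredPoints / maxPoints);
    }
}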
Full test: Run the btm.pgn (Beat the Masters) YATS set at 1 minute per position. This takes 2 hours 46 minutes and should give a very good idea of the strength of the engine.
Mediocre v0.31 got this result:
Testset : btm.pgn
Level : 60 seconds
Engine : Mediocre 0.31
Positions : 166
Found : 32 (19.3%)
Maximum points : 1660
Scored points : 993 (59.8%)
Maximum time : 2:46:00
Used time : 2:20:43
If this results in an expected improvement to the engine, run a 100-game match at 3 minutes per game between the previous version and the new version, using the sherwin50.pgn opening set (by Michael Sherwin). 100 games means two games per opening, so each version gets to play both sides of every opening.
By using a fixed set of openings a lot of the randomness is taken out, and it should be easy to determine whether the new engine is an improvement and by how much.
100 3-minute games take somewhere between 5 and 10 hours, probably closer to 5.
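To make the pairing explicit, here is a minimal sketch of how a fixed opening set turns into a 100-game match with both engines playing each side of every opening. It is my own illustration, not an actual tool, and it assumes sherwin50.pgn contains 50 openings:

import java.util.ArrayList;
import java.util.List;

public class MatchSchedule {
    public static void main(String[] args) {
        int openings = 50; // assumed number of openings in sherwin50.pgn
        List<String> games = new ArrayList<>();
        for (int i = 1; i <= openings; i++) {
            // every opening is played twice, with the colours reversed
            games.add("Opening " + i + ": new version (White) vs old version (Black)");
            games.add("Opening " + i + ": old version (White) vs new version (Black)");
        }
        System.out.println(games.size() + " games in total"); // prints 100
    }
}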
Why YATS

YATS is a testing system created by Ed Schröder that, instead of giving 0 points for a wrong answer and 1 point for a right answer, rewards 0-10 points depending on which move the engine chooses: 10 for the 'correct' move and less for alternative moves that might still be good.
For a mediocre engine, like Mediocre, this is a perfect way of analyzing its strength. It might not find the 'best' move very often, but it usually finds a decent alternative move in the same position.
This way even a single analyzed position can supply some interesting information: an old version might find an alternative move that gives 2 points while the newer version finds a slightly better alternative that gives 5 points. Traditional test sets would reward 0 points to both versions.
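To illustrate the idea with made-up numbers (the moves and point values below are my own and not taken from an actual YATS set): each test position carries a small table of moves worth up to 10 points, and an engine is simply credited with whatever its chosen move is worth.

import java.util.Map;

public class GradedScore {
    public static void main(String[] args) {
        // Hypothetical position: the 'correct' move is worth 10 points,
        // reasonable alternatives are worth less, anything else scores 0
        Map<String, Integer> pointsForMove = Map.of(
                "Nf5", 10,   // the intended solution
                "Rd1", 5,    // a decent alternative
                "h3", 2);    // playable but passive

        System.out.println("Old version plays h3  -> " + score("h3", pointsForMove) + " points");
        System.out.println("New version plays Rd1 -> " + score("Rd1", pointsForMove) + " points");
        // a pass/fail test set would have given both versions 0 for missing Nf5
    }

    static int score(String move, Map<String, Integer> pointsForMove) {
        return pointsForMove.getOrDefault(move, 0);
    }
}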
Further reading on the YATS site.
And stick to it

Having set up this testing structure I do not intend to change it for quite some time, except perhaps for adding a fixed-depth search to the YATS sets as well.
By keeping it unchanged like this it will be much easier to determine the individual strength of the different versions over time, without having to run thousands of games just to even out the inherent randomness of openings, and in fact of chess in general.