Nov 29, 2011

[Info] Jim Ablett's compile of Mediocre v0.4

Jim has compiled Mediocre v0.4 and I added it to my sourceforge page.

I haven't had time to test it myself, but in my experience Jim's compiles are considerably stronger than the plain Java version, so I'd recommend using it.

Jim's page

Mediocre v0.4 JA compile

Nov 27, 2011

[New Version] v0.4 - Ponder, revamped search, UCI only

Changes:
  • Any hash move used is now verified, which fixes a very rare occurrence of Mediocre crashing
  • The transposition table is now using the full 64 bit zobrist keys
  • The search was completely rewritten, likely fixing some bugs along the way. This should help playing strength quite a bit
  • Ponder implemented
  • Removed the dependency on a settings file; things like hash size are now set through the UCI protocol
  • Removed the semi-working xboard protocol entirely. Sorry.

Note: This version is notably stronger than version 0.34, mainly due to bugfixes in the search.

As mentioned, Mediocre is a UCI-only engine from here on. This also means the old settings file is gone; use the UCI option commands mentioned in the readme file.
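
For example, from a UCI console (or via your GUI's engine settings) the hash size can be set with the standard setoption command. The option names below follow the usual UCI convention, so check the readme for the exact names Mediocre exposes:

uci
setoption name Hash value 64
setoption name Ponder value true
isready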

Download here

Nov 25, 2011

[Info] Testing results

So some testing to confirm I didn't do anything silly.

M1-1 is a version with 64 bit zobrist keys in the transposition table, removal of the notion of "row", and some evaluation fixes, but without the tapered eval (see previous posts for more info).

Against the Mediocre v1.0 beta it turned out like this:


    Program      Elo    +    -   Games    Score    Av.Op.  Draws
  1 M1-1     :  2401    6    6   11029    50.4 %   2399    24.5 %
  2 M1B      :  2399    6    6   11029    49.6 %   2401    24.5 %


So pretty much equal, which is good enough. The worst case here would be that the beta version is slightly stronger, but by at most a few Elo points.

And against some other engines, just to confirm:


    Program      Elo    +    -   Games    Score    Av.Op.  Draws
  1 counter  :  2593   15   15    2048    76.4 %   2389    23.4 %
  2 M1-1     :  2392    8    8    6154    48.2 %   2405    14.4 %
  3 adam     :  2337   14   15    2048    42.5 %   2389     9.3 %
  4 bikjump  :  2294   15   15    2048    36.6 %   2389    10.6 %

    Program      Elo    +    -   Games    Score    Av.Op.  Draws
  1 counter  :  2580   14   14    2048    75.1 %   2388    25.4 %
  2 M1B      :  2390    8    8    5854    47.5 %   2407    15.8 %
  3 adam     :  2343   14   14    2048    43.7 %   2388     9.3 %
  4 bikjump  :  2290   16   16    1748    36.4 %   2388    12.1 %

The newer version seems to be holding up.

I'll release a new version with this during the weekend, probably on Sunday.

Then I have a steady foundation to start tackling the evaluation again.

Nov 23, 2011

[Info] So wrong again, but at least closer

So yeah, my imagined strength increase mentioned in the last post was non-existent of course.

But the tapered eval still seems to be the culprit behind my recent failures.

I've tried to home in on exactly which version after Mediocre v1.0 Beta did the best, testing all kinds of combinations with and without 64 bit hash keys, tapered eval and the removal of the notion of "row".

The results are... inconclusive.

However, a version with everything except the tapered eval seems to be playing at least on par with the beta version. So I think I'll just go with that one, do a new release (to get a firm base to build from), and then start with my evaluation tampering.

I'll post some testing results in a day or two. (not going to leave any doubt this time)

Nov 18, 2011

[Info] Importance of thorough testing

Lately I've been struggling with one of those "super versions" that seems to beat everything I throw at it.

When I got done with my search improvements I did some really extensive testing against Mediocre v0.34 and concluded that the new version had pretty much exactly a 60% win rate against it.

So I tagged that version and called it Mediocre v1.0 beta.

Then I committed three things to the trunk of svn: renaming of row to file, tapered eval and the change from 32 bit to 64 bit keys in the transposition table (along with a sanity check of all tt moves).

I thought I'd tested all of these extensively, each scoring more or less equal to v1.0 beta, which I deemed ok since the changes were more or less needed for readability, stability, and future work.

During the past weeks any change I made, no matter how tiny it seemed, got slaughtered by 1.0 beta. All my evaluation tweaking seemed to give results, but against 1.0 beta it still lost.

Now, the newer (uncommitted) versions had some utility changes that I really wanted to have committed (things like the mirror evaluation test). So I took those changes and added them to the 1.0 beta tag one by one, testing quite extensively between every change.

After I'd moved over all the utility changes, I thought I might just as well try the three things I'd committed after 1.0 beta. This is how that testing went:

  1. Row to file change: This should have been just a readability change (the usage of "row" had lingered since the very first version of Mediocre, while the correct terminology is of course "file"). But it turned out that while doing this I'd also changed the rank, file and distance methods from instance methods to static ones. This seems to be a very good move since they're called a lot, and suddenly 1.0 beta was playing better, quite a bit better.

  2. 32 to 64 bit keys and hash move validation: I thought that if anything this would be the culprit, since messing around with the transposition tables is very likely to introduce bugs. But re-adding it, it seems to give a tiny but noticeable strength increase.

  3. Tapered eval: A horrible, horrible reduction in strength. I have no idea how I missed this, but it seems to completely ruin the evaluation. This is the actual culprit, and I'll be much more careful when trying to put it back (a generic sketch of the tapered eval idea follows below).
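
For reference (since it turned out to be the troublemaker), tapered eval just blends a middlegame score and an endgame score by game phase. A generic sketch of the usual formulation, not necessarily how it was committed:

// Standard tapered eval: interpolate between a middlegame and an endgame score
// by the remaining material ("phase": maxPhase at the start, 0 with bare kings).
static int taperedEval(int mgScore, int egScore, int phase, int maxPhase) {
    return (mgScore * phase + egScore * (maxPhase - phase)) / maxPhase;
}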


So, the moral of the story: never assume you did enough testing if you see signs that you didn't.

Nov 14, 2011

[Tournament] GECCO - Final results


 1 Spike      wwbwbw  xrtnbd  111==1  5
 2 Nightmare  wbwbbw  ctgsrb  1=1=1=  4.5
 3 Tornado    bwwbbw  bnsdgm  1=0111  4.5
 4 Rookie     -bwbwb  msdbnc  101=01  3.5
 5 Baron      wbbwwb  tgmrsn  0=1===  3
 6 Goldbar    wwbbwb  dbnctx  ==0101  3
 7 Deuterium  bwbwwb  gxrtcs  =10010  2.5
 8 Mediocre   -bw-bb  rcbxxt  010010  2
 9 Spartacus  bwbwbw  nmxgdr  001000  1
10 micro-Max  bbw-ww  sdcmmg  000100  1

Not what I'd hoped for, but with two forfeits I guess that's what I deserve. At least Mediocre won the two games it should have, and it played very well against The Baron, while pretty horribly against Tornado.

Next time Mediocre will be in the top half. :)

[Tournament] GECCO - Game 6

A bit unlucky with the pairing, getting Tornado here. Mediocre had the bishop pair and felt quite comfortable, but underestimated the insanely strong white knight that ultimately led to an unstoppable pair of passed pawns. Not much to say about this loss; Tornado was just better.

[Tournament] GECCO - Game 5

A second chance against micro-Max. The game started out a bit crazy and then turned into an endgame where Mediocre had the upper hand from the start.

[Tournament] GECCO - Game 4

Forfeit against micro-Max... yeah, I overslept (and was a bit hungover after a late Saturday night...). I was connected to the server, but for some reason Mediocre couldn't start the game. No idea why.

Nov 12, 2011

[Tournament] GECCO - Standings day 1

    Name         Rating  Score  Perfrm  Upset   Results
    -----------  ------  -----  ------  ------  ---------------
  1 +Spike       [1872]  3.0    [2168]  [  10]  +10w +03w +04b
  2 +Nightmare   [1833]  2.5    [2060]  [  24]  +08w =04b +07w
  3 +Rookie      [1747]  2.0    [1874]  [   0]  +09w -01b +06w
  4 -Tornado     [1882]  1.5    [1793]  [   0]  +05b =02w -01w
  5 +Baron       [   0]  1.5    [1793]  [2587]  -04w =07b +09b
  6 -Deuterium   [   0]  1.5    [1748]  [2587]  =07b +10w -03b
  7 -Goldbar     [1824]  1.0    [1594]  [   0]  =06w =05w -02b
  8 -Spartacus   [   0]  1.0    [1594]  [1675]  -02b -09w +10b
  9 +Mediocre    [   0]  1.0    [1565]  [1675]  -03b +08b -05w
 10 -microMax    [   0]  0.0    [1340]  [   0]  -01b -06b -08w

Mediocre's walkover was against Rookie, which I really thought I had a chance against. Too bad.

I guess micro-Max should be possible to beat, and then we'll see who the other opponents are. Looking at the board it would be Deuterium and Goldbar. With some luck perhaps a 4.0 score isn't impossible.

We'll see tomorrow.

[Tournament] GECCO - Game 3

Game 3 underway against The Baron. Have no high hopes for this one. :)

-

A solid loss as expected, but Mediocre played quite well, I'd have to say. It ended up with some overextended pawns and the kings on the wrong side (The Baron had a pawn majority on the queenside, making the pawn ending a really simple win).



Game 4 starts tomorrow at 8:30 CET.

[Tournament] GECCO - Game 2

Spartacus played weirdly in the endgame but still almost held the draw due to opposite-colored bishops.

I only have a 20% adjustment towards a draw for opposite-colored bishops. It might be slightly too little, but better too little than too much, I guess.
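
In code the adjustment is basically just a percentage scaling of the normal score; something like the sketch below, where the names and the endgame detection are illustrative rather than the actual Mediocre code:

// Illustrative only: pull the evaluation 20% towards the draw score (0)
// when the position is an opposite-colored bishop ending.
static final int OCB_DRAW_PERCENT = 20;

static int applyOppositeBishopAdjustment(int eval, boolean oppositeColoredBishops) {
    if (oppositeColoredBishops) {
        return eval * (100 - OCB_DRAW_PERCENT) / 100; // keep 80% of the score
    }
    return eval;
}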


[Tournament] Mediocre in GECCO

Mediocre is participating in a long time control tournament today and tomorrow.

http://marcelk.net/chess/GECCO/2011/GECCC.html

Unfortunately I had connection issues during the first game and had to forfeit it. The second game, against Spartacus, is underway now and everything seems to be going fine; 14 moves in and Mediocre says it's up +1.75. :)

I'm using a few weeks old version of Mediocre, with the changes to search but none of the recent evaluation dabbling.

Nov 7, 2011

[Info] Yay me

Up to my 10th failed attempt at tuning my passed pawn eval.

On the last attempt I wasted 20,000 games.

I have tables looking like this:

Rank: 1 2 3 4 5 6 7 8
Value: {0,10,20,30,60,120,150,0}

That is increasingly higher evaluation the closer the passer is to promotion.

This table can then be stretched in all kinds of directions during the tuning (increasing/decreasing all values, or increasing the differences between them) using two "knobs", so the table only needs two values to tune instead of six.
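
The generation could look something like the sketch below; the knob names and the exact stretching formula are just an illustration of the idea, not the actual tuning code:

// Illustrative two-knob passed pawn table: "scalePercent" stretches the
// differences between the ranks, "offset" shifts the whole curve up or down.
static int[] passedPawnTable(int scalePercent, int offset) {
    int[] base = {0, 10, 20, 30, 60, 120, 150, 0};
    int[] table = new int[8];
    for (int rank = 1; rank <= 6; rank++) { // 1st and 8th rank stay 0
        table[rank] = base[rank] * scalePercent / 100 + offset;
    }
    return table;
}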

Now, I had reversed the values when preparing for the tuning... so instead of giving 150 centipawns for being one square from queening I gave it 10.

The tuning tried to compensate and the best it came up with was:

Rank: 1 2 3 4 5 6 7 8
Value: {0,-83,-60,-14,-9,17,25,0}

Quite a good effort, but I find it hard to believe an 8 cp difference between the 6th and 7th ranks is optimal.

Fixed the problem and running the tuning for the 11th time. :)

Nov 6, 2011

[Info] Fun little problem

Still tuning. I've moved on to the passed pawn eval, which is another of those problem areas (I've always thought Mediocre neglected passed pawns, but any attempt to manually tune it has resulted in heavily overvaluing them).

While running my tests I ran into an interesting little problem. Since I have aspiration windows it's quite common that re-searches occur (when the result falls outside the window, i.e. after a sudden drop/rise in evaluation between iterations).

Now, if you're not careful it's possible for the window to bounce back and forth, i.e. failing low, then high, never getting past the iteration since it keeps re-searching.

I have all those safeguards in place, but with the extreme evaluation numbers the tuning can come up with, the score got so high it surpassed the "infinity" score. Now, this will obviously fix itself after a few iterations of the tuning (giving a passed pawn the value of 20 queens is probably not going to help), but my aspiration windows went berserk since I check them like this:

if (eval <= alpha) {
    // Fail low: the score dropped below the window, re-search with a lower alpha
    ...
} else if (eval >= beta) {
    // Fail high: the score rose above the window, re-search with a higher beta
    ...
}

With alpha and beta set to minus and plus "infinity" we have the maximum window, which should never cause a re-search (a mate in 1 score is lower than infinity, obviously). But as I said, with these extreme evaluation parameters it did.

Easy fix, just a bit silly and quite hard to find.
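
One way to guard against it is to clamp every score so it stays strictly inside the infinity bounds, so not even absurd tuning values can sneak past them. A small sketch with made-up constants (the actual fix may differ):

// Illustrative guard: keep any evaluation strictly inside (-INFINITY, +INFINITY)
// so the full-width aspiration window can never fail low or high.
static final int INFINITY = 2000000;

static int clampEval(int eval) {
    if (eval >= INFINITY) return INFINITY - 1;
    if (eval <= -INFINITY) return -(INFINITY - 1);
    return eval;
}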

Nov 5, 2011

[Info] CLOP windows executable

Rémi Coulom was nice enough to create a Windows executable of CLOP (I ran it through Qt Creator before, which was a bit silly).

If you haven't tried his software before I urge you to do so, it's quite awesome:

http://remi.coulom.free.fr/CLOP/

Some changes in this version as well:

2011-11-05: 0.0.9
  • Stronger regularization (avoid overfitting in high dimensions)
  • "Merge Replications" option in gui -> faster, better display
  • Performance optimization of display and loading of large data files
  • Removed "-ansi" option for Windows compilation
  • Shrinking parameter ranges does not lose integer data any more
  • Removed confusing columns: max-1, max-2, ...
  • More explanations in the doc: biased win rate + GammaParameter

Link to the release post here.

[Info] The results are in

So I've run the mobility tuning overnight, with more than 10,000 games (probably too little for extreme accuracy, but good enough for me).

All four of the parameters (mentioned in my last post) zeroed in on their values very well. For example, the plot for the first parameter looks like this:


The mean value was -370, which can clearly be seen in the plot.

So, on to all of the results and some comparisons (sanity checks, I guess). These were the resulting values (their meaning is explained in my last post):

  • MOBILITY_SAFE_MULTI = -370

  • MOBILITY_UNSAFE_MULTI = 783

  • MOBILITY_ONE_TRAPPED_MULTI = 767

  • MOBILITY_ZERO_TRAPPED_MULTI = 1343


I'll do four different examples.

  1. Average piece - a piece with 4 safe squares and 1 unsafe

  2. Good piece - a piece with 12 safe squares and 3 unsafe

  3. Bad piece - a piece on the fourth rank with 1 safe square and 1 unsafe

  4. Trapped piece - a piece on the fourth rank with 0 safe squares and 1 unsafe


So comparisons (using the examples above):

  1. 4*2+5 = 13 (before tuning)
    -370*4/100+783*5/100 = 25 (after tuning)

  2. 12*2+15 = 39
    -370*12/100+783*15/100 = 73

  3. 1*2+1-4*5/2 = -7
    -370*1/100+783*2/100-767*4/100 = -18

  4. 0*2+1-4*5 = -19
    -370*0/100+783*1/100-1343*4/100 = -46


So very reasonable numbers, just a bit higher in all directions just as I suspected (always nice when you have a guess and testing confirms it).

The first interesting thing is the negative safe square multiplier. I'm sure it's not trying to penalize safe squares, but rather to put all of the bonus on the total square count (which includes the safe squares as well), meaning the distinction between safe and unsafe squares is really not that important.

The second thing is my apparently good guesstimate of giving a piece with one square half the penalty of a piece with zero squares, which the tuning seems to confirm by almost doubling the factor (767 to 1343).

-

Time to run a test with all these new values. Will be very interesting.

Nov 4, 2011

[Info] The mobility tuning

I think the results of the mobility tuning are going to be quite interesting (and hopefully useful), so I'm going to scribble down the specifics of the test.

I'm going to test the following parameters:

MOBILITY_SAFE_MULTI
Every safe square that a piece can reach on the board, times this constant, divided by 100.

So basically:

MOBILITY_SAFE_MULTI = 500
Safe squares = 4
Equals 500*4/100 = 20 centipawns bonus

MOBILITY_UNSAFE_MULTI
Every safe square, plus every unsafe square (i.e. squares that are protected by lesser-valued pieces). Same equation as above.

MOBILITY_ONE_TRAPPED_MULTI
If the piece has only one safe square it's penalized by the rank it's on plus 1 (weighted as above).

MOBILITY_ZERO_TRAPPED_MULTI
If the piece has no safe squares it's penalized by the rank it's on plus 1 (weighted as above).
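
In rough pseudocode the four parameters combine something like this for a single piece (assuming integer division by 100 and a 0-indexed rank):

// Rough combination of the four mobility parameters for a single piece.
// safeSquares: reachable squares not controlled by a lesser-valued piece,
// totalSquares: safe plus unsafe squares, rank: 0-7 from the piece's own side.
static int mobilityBonus(int safeSquares, int totalSquares, int rank,
                         int safeMulti, int unsafeMulti,
                         int oneTrappedMulti, int zeroTrappedMulti) {
    int bonus = safeMulti * safeSquares / 100 + unsafeMulti * totalSquares / 100;
    if (safeSquares == 0) {
        bonus -= zeroTrappedMulti * (rank + 1) / 100;
    } else if (safeSquares == 1) {
        bonus -= oneTrappedMulti * (rank + 1) / 100;
    }
    return bonus;
}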


After 1100 games I got the following numbers:

MOBILITY_SAFE_MULTI = -141
MOBILITY_UNSAFE_MULTI = 865
MOBILITY_ONE_TRAPPED_MULTI = -174
MOBILITY_ZERO_TRAPPED_MULTI = -133

Not too happy about those negative numbers, but this would result in the following scoring for, say, a bishop with 4 safe and 1 unsafe square (quite a reasonable assumption):

-141*4/100 + 865*5/100 ≈ 38

So basically it's favoring the unsafe square multiplier (which counts all squares) and trying to reduce the safe square evaluation. (I wonder if this indicates that the total number of squares is what matters, rather than safe squares being preferable to unsafe ones.)

I'll leave it overnight and see what it comes up with.

[Plan] Tuning

I've started some tuning with CLOP, which is an excellent piece of software. Of course I'd prefer software completely focused on chess engine tuning, i.e. choose parameters, choose opponents, and receive optimal parameters and expected gain. But this is very much good enough.

First I tuned the futility margins, which resulted in a quite expected (but hard to guess) increase in the values for the shallow nodes. From:

120, 120, 310, 310, 400

to

210, 230, 260, 260, 460

So basically a higher margin before skipping nodes close to the leaves (remember, futility pruning checks how far behind you are in the search, and if the margin can't get you back on the plus side it simply stops searching that node), a slightly lower margin for nodes a bit further from the leaves, and then slightly higher again for nodes 5 plies from the leaves.
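
For reference, the mechanism itself is tiny; something like the sketch below, where the margins are the tuned values above and the surrounding conditions are only an illustration:

// Illustrative futility check near the leaves: if the static eval plus a
// depth-indexed margin still cannot reach alpha, the node is skipped.
static final int[] FUTILITY_MARGIN = {0, 210, 230, 260, 260, 460}; // index = depth

static boolean isFutile(int depth, int staticEval, int alpha, boolean inCheck, boolean pvNode) {
    if (depth < 1 || depth > 5 || inCheck || pvNode) {
        return false; // only prune quiet, non-PV nodes close to the leaves
    }
    return staticEval + FUTILITY_MARGIN[depth] <= alpha;
}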

Pretty much what I expected (I borrowed the previous values from Crafty and generally thought they were a bit optimistic).

The next thing I tuned was the king positioning table in the endgame, which looked like this:

-20 -15 -10 -10 -10 -10 -15 -20
-15 -5 0 0 0 0 -5 -15
-10 0 5 5 5 5 0 -10
-10 0 5 10 10 5 0 -10
-10 0 5 10 10 5 0 -10
-10 0 5 5 5 5 0 -10
-15 -5 0 0 0 0 -5 -15
-20 -15 -10 -10 -10 -10 -15 -20

Again these values were more or less randomly chosen (in this case I think I simply went with gut feeling to pick the numbers), and I always suspected they were too low, that is, not giving enough credit for having the king in the center in the endgame.

Tuning gave this:

-187 -157 -128 -128 -128 -128 -157 -187
-157 -99 -70 -70 -70 -70 -99 -157
-128 -70 -41 -41 -41 -41 -70 -128
-128 -70 -41 -12 -12 -41 -70 -128
-128 -70 -41 -12 -12 -41 -70 -128
-128 -70 -41 -41 -41 -41 -70 -128
-157 -99 -70 -70 -70 -70 -99 -157
-187 -157 -128 -128 -128 -128 -157 -187

So a whole bunch lower, which is a bit surprising... but with bigger differences between the center and the edges (close to a two pawn difference). I did a quick test of this, and over 200 games it seems clearly better (about a 60% win rate over the old values).

I have a feeling my evaluation is so badly tuned that I'll be seeing a lot of these quite extreme numbers, and I might have to pass through all the variables a few times until I get them all right.

But I love this tuning business (which I've done very little of in the past). Simply pass in a few parameters, wait a couple of hours, and out comes an improved engine. No effort whatsoever. Silly really. :)

Next up is mobility, and I have no idea how valid the current values are. Currently I do something like count the available squares for a piece, and give twice the number in centipawns along with half the number of unsafe squares (protected by lesser-valued pieces).

This gives a really arbitrary number, and I have no idea how good it is. It will be really interesting to see what CLOP comes up with.