In response to John's question in a comment on the last post, here's a breakdown of where Mediocre spends its time, sorted by time spent in each method (based on a 30 second snapshot).
Keep in mind that this does not include subsequent method calls; it's just the time spent inside the actual method. Including subsequent calls, the evaluation would look like this:
Basically, more than half the time is spent in evaluation, and a third of that is spent generating attacks for the different pieces.
Also high on the list is the Board.isAttacked method, due to the "is in check" calls in every node of the search, i.e. isAttacked is mostly used to see if the king is in check.
I do think the things related to making and generating moves are in a good place, around 15% of the total time. Maybe slightly high but not too bad.
If I used bitboards instead, with the same (or better) performance in generating and making/unmaking moves, I would gain an obscene speed upgrade for "free", as the attack checks cost a fraction compared to my current setup.
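To illustrate why attack checks get so cheap with bitboards, here is a minimal Java sketch (names and layout are my own assumptions, not Mediocre's code, and only the knight case is shown; sliders need magic or rotated bitboards, but the principle is the same): the check becomes a precomputed table lookup plus a bitwise AND instead of walking the board.

public final class BitboardAttacks {
    // Precomputed attack masks: KNIGHT_ATTACKS[sq] has a bit set for every square a
    // knight on 'sq' attacks. Squares are numbered 0-63, a1 = 0, h8 = 63.
    static final long[] KNIGHT_ATTACKS = new long[64];

    static {
        int[][] jumps = {{1, 2}, {2, 1}, {2, -1}, {1, -2}, {-1, -2}, {-2, -1}, {-2, 1}, {-1, 2}};
        for (int sq = 0; sq < 64; sq++) {
            int file = sq % 8, rank = sq / 8;
            for (int[] j : jumps) {
                int f = file + j[0], r = rank + j[1];
                if (f >= 0 && f < 8 && r >= 0 && r < 8) {
                    KNIGHT_ATTACKS[sq] |= 1L << (r * 8 + f);
                }
            }
        }
    }

    // True if 'square' is attacked by any knight in the 'knights' bitboard.
    static boolean attackedByKnight(int square, long knights) {
        return (KNIGHT_ATTACKS[square] & knights) != 0;
    }
}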
Oct 26, 2011
Oct 25, 2011
[Plan] Now I remember
Now I remember why I took a break last time. Annoying piece of ... changes that never improve a thing.
Everything that isn't a completely obvious bugfix (and even some of them) seems to hurt the performance.
So I'm trying to convince myself that I should rewrite the board representation and start using bitboards.
One of the ways (to convince myself) was to shut down everything that had something to do with attacks in the evaluation (which is extremely expensive without bitboards). This means most of the king safety, hanging pieces and mobility, and big parts of pawn evaluation.
Basically a huge part of the evaluation just removed. And the result versus my latest (non-handicapped) version was:
Score of M1B vs M1-1: 529 - 316 - 179 [0.60] 1024
Where M1B was the non-handicapped.
So basically +70 elo points (and probably less if you account for the self-play inflation), for 50% of the evaluation.
Stupid.
I'm sure this isn't completely true, the speed gain is probably worth more in self-play than against other engines (that are already faster).
However, I'll spend some time playing around with bitboards and see where I end up.
Oct 23, 2011
[Info] Slight improvement
It seems the bug fixes I mentioned in my previous post have given a slight but measurable improvement:
Score of M1-1 vs M1B: 420 - 366 - 238 [0.53] 1024
(M1-1 being the newest version with the fixes)
Fortunate, since I hadn't committed changes to the SVN repository for a while and it was starting to be hard to see exactly what had changed.
This run includes the tapered eval change. I did a 1024-game test run with only that change and it turned out exactly 50/50, that is, no change in strength whatsoever. I had expected a slight improvement from that, but I guess my eval isn't really tuned for it. But no change is just fine, since I now have it in there and can start building from it.
I've also added a whole bunch of testing features and rewritten my "console mode" quite a bit. For example it's now possible to define a file with FEN positions and do an "evalmirror" on all of them to spot bugs. Might not be so fun for the average user, but extremely useful for me.
Before I release 1.0 I'll make it possible to run testsets in the epd format, directly from within Mediocre.
-
I've tried desperately to improve on the passed pawn evaluation. As it stands it's not very good and I suspect there are a lot of improvements that can be done.
But all my attempts so far have been in vain. I've tried Ed Schröder's way of evaluating them, and also Stockfish's and Crafty's, and some combinations between them coupled with my own ideas.
But everything I do seems to severely hurt the playing strength. And by severely I mean things like 100 rating points worse or so.
I'll keep trying though, I'm sure I'll hit the magic implementation at some point.
[Info] Soo many bugs
I stumbled across a nice test for the evaluation method.
Take a position and evaluate it, then mirror the position (switching all pieces from left to right) and evaluate again, then reflect the position (from front to back, and invert the colors, and switch side to move) and finally reflect and mirror it.
This gives four different positions that should give exactly the same evaluation, assuming a symmetric evaluation.
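As a rough illustration, the test could look something like the sketch below. Board, Evaluation and the transformation helpers are hypothetical names (not Mediocre's actual API); the point is simply that all four evaluations must agree.

// Symmetry test sketch (hypothetical Board/Evaluation API, not Mediocre's actual code).
static void evalMirrorTest(String fen) {
    Board original  = Board.fromFen(fen);
    Board mirrored  = original.mirrorLeftRight();       // a-file <-> h-file
    Board reflected = original.reflectAndSwapColors();  // rank 1 <-> rank 8, colors and side to move swapped
    Board both      = reflected.mirrorLeftRight();

    int e1 = Evaluation.evaluate(original);
    int e2 = Evaluation.evaluate(mirrored);
    int e3 = Evaluation.evaluate(reflected);
    int e4 = Evaluation.evaluate(both);

    // A symmetric evaluation must give the same score for all four positions.
    if (e1 != e2 || e1 != e3 || e1 != e4) {
        System.out.println("Asymmetric eval for " + fen + ": " + e1 + " / " + e2 + " / " + e3 + " / " + e4);
    }
}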
So far I've found around 10 bugs, and there are probably more to come. Some of them were just symmetry flaws (like giving the king a better evaluation on C1 than on F1, which makes sense on some level but really is not that good).
But some of them ranged from minor to pretty severe. For example I noticed that my weak pawn evaluation was checking whether the pawn could move backwards(!) to be supported. And for some infernal reason I checked the absolute value of the piece when checking for blocked pawns, which in some positions made white not care about the pawn being blocked (unless it was doubled), while black did.
Another one was evaluating fianchetto for black by checking if the pawn in front of the king was on the third rank instead of the fifth.
I'm not sure how much these bugs actually matter in terms of playing strength (sure rewarding fianchetto when pawn is on third rank might be wrong, but a bishop in front of the king in those positions might not be so bad after all).
I'll write something together to explain this excellent test a bit more thoroughly.
Oct 19, 2011
[Guide] Tapered Eval
As I just mentioned here, game phase interpolation (or tapered eval) is a way of avoiding sharp differences between evaluations in different phases of the game.
The following explanation is based on how it's done in Fruit, and there are similar implementations in Crafty and Stockfish (and probably many many other engines as well).
The scaling looks like this:
eval = ((opening * (256 - phase)) + (endgame * phase)) / 256
Where opening is the evaluation of the position with middle game in mind (e.g. keep kings protected behind their pawn covers) and endgame is the evaluation with endgame in mind (e.g. activate the kings). Both these evaluations are done in parallel when evaluating a position.
The phase is evaluated like this (code specifics left out):
PawnPhase = 0
KnightPhase = 1
BishopPhase = 1
RookPhase = 2
QueenPhase = 4
TotalPhase = PawnPhase*16 + KnightPhase*4 +
BishopPhase*4 + RookPhase*4 + QueenPhase*2
phase = TotalPhase
phase -= wp * PawnPhase // Number of white pawns
phase -= wn * KnightPhase // White knights
...
phase -= br * RookPhase
phase -= bq * QueenPhase
phase = (phase * 256 + (TotalPhase / 2)) / TotalPhase
Example
White and black each has NNBR and the evaluation for opening is +100 and endgame is +300
According to the above numbers we then get:
phase = (14 * 256 + (24 / 2)) / 24 = 149
// Where 14 is 24-1-1-1-2-1-1-1-2 (TotalPhase minus the phases of all pieces on the board)
eval = ((100 * (256 - 149)) + (300 * 149)) / 256 = 216 tapered eval
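For reference, a compact Java version of the above could look like the sketch below (a self-contained function with assumed piece-count arguments, not Fruit's or Mediocre's actual code):

static final int PawnPhase = 0, KnightPhase = 1, BishopPhase = 1, RookPhase = 2, QueenPhase = 4;
static final int TotalPhase = PawnPhase * 16 + KnightPhase * 4 + BishopPhase * 4 + RookPhase * 4 + QueenPhase * 2; // = 24

// opening/endgame are the two parallel evaluations; wp..bq are piece counts per side.
static int taperedEval(int opening, int endgame,
                       int wp, int wn, int wb, int wr, int wq,
                       int bp, int bn, int bb, int br, int bq) {
    int phase = TotalPhase
            - (wp + bp) * PawnPhase
            - (wn + bn) * KnightPhase
            - (wb + bb) * BishopPhase
            - (wr + br) * RookPhase
            - (wq + bq) * QueenPhase;
    phase = (phase * 256 + (TotalPhase / 2)) / TotalPhase;
    return ((opening * (256 - phase)) + (endgame * phase)) / 256;
}

Plugging in the NNBR example above (two knights, a bishop and a rook per side, opening +100, endgame +300) returns 216, matching the numbers above.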
Chess programming wiki - Tapered Eval
-
Quite ingenious. Funny how the decision on game phase in the current version of Mediocre is extremely close to this (without interpolating on it of course). I wonder if I stole that from Fruit way back when I implemented it. :)
[Info] Chessprogramming wiki and me
Since we now have this excellent resource, the Chessprogramming wiki, I've decided to put my effort into guides there, i.e. anything I write from now on in terms of trying to explain a subject will be put both there and here.
This will probably mean I'll have to try to cut down on my babbling and keep to proven concepts with good source information.
Hopefully Gerd (Isenberg) can keep up with my spamming.. :)
[Plan] Moving on to evaluation
I'm starting to be quite happy with the results of my rewritten search, a bit bigger test gave this:
Program Elo + - Games Score Av.Op. Draws
1 M1-1 : 2435 15 15 1508 59.9 % 2365 25.7 %
2 M34 : 2365 15 15 1508 40.1 % 2435 25.7 %
This is comfortably within the bounds for my new engine being superior. The rating difference is probably inflated quite a bit due to this being a self-test (as Thomas Petzke pointed out), but nonetheless it's an improvement for sure.
So time to move on to the evaluation revamp.
Having looked through it for the first time in a couple of years I remember how extremely complicated it is. :) It has a lot of Ed Schröder's recommendations in it (not exactly simplifying things).
I honestly don't think a complete rewrite is plausible. The best course of action is probably to prepare it for some auto-tuning and then start playing massive amounts of games.
However, there are a few things that could be looked over before that.
- Game phase interpolation (or tapered eval) - Seems every other engine has this now and I can clearly see the benefits. Basically don't switch sharply between the different phases of the game, but rather do it smoothly with separate scores for the different phases.
Simply put, let's say we evaluate as a middle game and get an evaluation of +200 while an evaluation as endgame gives +400 (maybe the king is close to the middle which would give a penalty in middle game but a bonus in endgame). Now if we're clearly in the middle game (queens and rooks and minor pieces all over), we would just take the +200. And if we're clearly in the endgame (maybe one or two pieces) we take the +400. But if we're in the middle of transitioning to the endgame (perhaps we have a few minor pieces but also queens) we shouldn't take one or the other straight off. But rather interpolate between the two. If we decide we're exactly in the middle between middle game and endgame we would take a +300 score.
I want to get this done first before poking any more at the evaluation, since I'll probably need to split up a whole bunch of scores.
I'll get back to this (maybe a new guide is in order).
- Endgame knowledge - Currently a KBKP endgame is evaluated heavily in favour of the side with the bishop (somewhere around +200 centipawns), when it clearly should be close to 0. KRKB gives the side with the rook a clear win. And there's plenty more of that. Since I have no plans of adding tablebases (and even if I do I can't assume they'll always be present), I will need to put in some basic endgame knowledge.
- Passed pawns - Something is not right here. Need to take a good look at how I evaluate them (a simple comparison to Crafty's evaluation gave far too low scores for Mediocre).
- Tropism - I'm not sure how much is gained (or lost) from this, I just put it in because I could (it's a fairly simple piece of code). Maybe take a look at it and see how others do it.
- Manual tuning - Find a good way to look at the individual evaluation parameters (even more so than the current one I have) and see if I can spot any obvious flaws. For example I'm quite sure the king's endgame piece square table has too low values (not rewarded enough for moving out the king in the endgame).
- Auto-tuning - I've had a look at CLOP which seems really nice, and also Joona Kiiski explained how he tunes Stockfish (forum post), which was identified as SPSA but with self-play.
However I decide to do it, I need to prepare Mediocre's evaluation for it. Which means access to more parameters from outside the evaluation.
So first, tapered eval.
Oct 18, 2011
[Test] Horrible horrible bug (and fever)
Home sick from work, I forced my feverish self to go through a few of the horrendous test tournaments I've run since Sunday (the doomsday of my new version, where all I had gained was suddenly nullified).
Firstly, I noticed both v0.34 and my new version are quite bad at evaluating pawns. Sure I run the tournaments at very fast time controls (10sec+0.1) and I can't expect them to calculate all the way to queen all the time. But I do think that's where I have the main evaluation weakness. I'll see what I can do about that.
But there were games where my new version's evaluation suddenly dropped from winning to severely losing. On some occasions it was just a bad position that would inevitably lead to doom, but then I found tens of games where it just dropped the queen for nothing, returning a mate score.
Hopeful and confused I started digging. The "mates" were all singular response moves, like checking with the queen next to the king with the only move available being capturing the queen.
The big problem was I couldn't replicate any of the positions. Inserting them and searching would always give a reasonable move.
Suspecting my new check evasion algorithm I put down traces where the quiescent search returned mate and ran hundreds of test positions, but no luck. So I added traces to every conceivable place where mate could be returned and then finally I found a position where the main alpha-beta loop would return a mate score even though there existed a legal move that avoided it.
So after some stepping through the code I found the culprit: at some point, for some idiotic reason, I had removed the scoring of the hash move. Since the score isn't really used (hash moves are always searched first) I must have deemed it unnecessary.
But with my new way of storing moves (by ply) it's imperative that every move is given a score, even if it isn't used. The reason being that further searches can stumble upon old scores and act on them. Like setting the hash move score to -10000 to avoid re-searching it, and then not searching the next hash move either since the -10000 score is lingering around.
Which is exactly what happened.
I seem to remember mentioning this not so long ago, and still I managed to fall into the trap again. :)
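To make the trap concrete, here's a rough sketch (illustrative names, not the actual Mediocre code) of by-ply move storage where every generated move gets a fresh score, so nothing from an earlier visit to the same ply can linger:

// Sketch of by-ply move arrays (illustrative names only).
static final int MAX_PLY = 64, MAX_MOVES = 256;
static final int[][] moves  = new int[MAX_PLY][MAX_MOVES];
static final int[][] scores = new int[MAX_PLY][MAX_MOVES];

static void scoreMoves(int ply, int count, int hashMove) {
    for (int i = 0; i < count; i++) {
        // Score every move, even ones whose score is never otherwise used; a stale
        // -10000 "already searched" marker left here is exactly the bug described above.
        scores[ply][i] = (moves[ply][i] == hashMove) ? 10000 : ordinaryScore(moves[ply][i]);
    }
}

static int ordinaryScore(int move) { return 0; } // placeholder for MVV/LVA, history, etc.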
After having fixed the bug I ran a quick 128 game test to make sure it solved the problem. And being burnt from Sunday's fiasco I ran another one as well. :) The result:
Program Elo + - Games Score Av.Op. Draws
1 M1-1 : 2441 37 37 256 61.7 % 2359 28.1 %
2 M34 : 2359 37 37 256 38.3 % 2441 28.1 %
Back in business again! Now I just have to get rid of this stupid cold. :)
Oct 17, 2011
[Test] Rough day
After Saturday's test gauntlet with v0.34 I ran another set with the newest new version (futility pruning being the latest big addition).
And I was completely dumbfounded by the result..
Rank Name Elo + - games score draws
1 Counter 1.2 265 34 33 297 70% 22%
2 Knightx 1.92 140 34 33 297 53% 9%
3 iCE 0.2 124 33 33 297 51% 10%
4 Mediocre v0.34 119 19 19 1107 65% 7%
5 Mediocre 1.0 -2 112 15 15 1857 64% 9%
6 Horizon 4.4 64 34 34 296 44% 7%
7 TJchess 1.01 -8 35 36 295 35% 5%
8 Roce 0.0390 -47 34 36 295 30% 13%
9 Lime 66 -58 36 37 297 30% 5%
10 Adam 3.3 -171 40 43 297 19% 5%
11 Wing 2.0a -252 45 51 296 13% 3%
12 Bikjump 2.01 -290 48 55 297 10% 4%
Mediocre v0.34 back ahead of 1.0?? Having played literally thousands of games the last week I haven't seen any sign of this..
The new version was beating the crap out of 0.34, at 60+% win ratio.
So I ran a few more test sets and v0.34 was neck and neck with all of the newer versions.
Chess programming can be really hard on your self esteem. :)
I'm re-running a huge (by my standards) gauntlet again, including both v0.34 and v1.0, and we'll see what happens.
I almost suspect I'm using some weird compile or something, but this seems to be the new truth. :(
Oct 15, 2011
[Info] Lazy eval failure
I simply can't get it to work. I tried including pawn eval and king safety in it, which would make it much more accurate (in exchange for plenty of speed, obviously), but no go again.
Perhaps the more accurate lazy eval makes it have so little speedup that the few times it actually misses something are enough to make it fail in total.
Some people suggest that futility pruning already does a big part of what lazy eval does, which of course makes sense. It's also a check on the inability to reach alpha even if you were given a piece, but based on a full eval, and it cuts big chunks out of the search tree.
So maybe they're stepping on each other's toes, and if I had to choose between the two, futility pruning has a lot more potential.
I'm leaving it for now, will revisit when I start my evaluation revamp though.
[Test] Gauntlet engine test done
Turned out like this for v0.34:
Engine Score
01: Mediocre v0.34 667,0/1000
02: Counter 1.2 70,0/100
03: Knightx 1.92 67,5/100
04: iCE 0.2 53,0/100
05: Horizon 4.4 42,5/100
06: TJchess 1.01 38,5/100
07: Roce 0.0390 30,5/100
08: Wing 2.0a 13,0/100
09: Adam 3.3 11,5/100
10: Bikjump 2.01 4,0/100
11: Lime 66 2,5/100
At the start I was debating whether I should have 8 or 10 engines, and since Lime and Bikjump are a bit too far behind I think I'll just drop those and go with 8. That would enable me to have 800 games played over night, with 100 against each engine.
Elo-wise it looks like this:
Rank Name Elo + - games score draws
1 Counter 1.2 330 56 53 111 72% 22%
2 Knightx 1.92 305 60 57 111 66% 5%
3 iCE 0.2 208 55 54 111 55% 9%
4 Mediocre v0.34 168 20 20 1118 65% 7%
5 Horizon 4.4 104 55 56 110 42% 6%
6 TJchess 1.01 94 55 56 110 41% 9%
7 Roce 0.0390 20 56 59 110 32% 9%
8 Wing 2.0a -178 69 82 111 14% 4%
9 Adam 3.3 -179 69 82 111 13% 5%
10 Lime 66 -274 82 105 111 8% 3%
11 Bikjump 2.01 -288 85 111 111 8% 0%
Mediocre is ranked slightly too high in the field. But I'll stick with this (minus Lime and Bikjump), and add higher ranked engines later on.
[Plan] Deadline for next release
I received a mail from Olivier Deville this morning, asking for registration to OpenWar 9th Edition, which starts November 15th.
So now I have my deadline for when Mediocre 1.0 will be released (and it will probably be a couple of days before that).
[Blog] Layout fun
Played around a bit with the layout editor for the blog, it's evolved quite a bit in the last few years. :)
Feels kinda fresh, perhaps a bit too plain, but well, different at least.
Enjoy.
Oct 14, 2011
[Info] Gauntlet test set
After some back and forth testing (couldn't get about half of the engines I tried to work) I've decided on a new set of engines for gauntlet testing. After a very small sample test (100 games gauntlet for Mediocre v0.34) it came out like this:
Where "Rel Elo" is result of the test and "CCRL Elo" how they're rated on that list.
Turned out like I wanted, with Mediocre somewhere in the middle, with some far superior and some far inferior engines.
I'll run a more extensive test later to get a good estimated baseline ELO for Mediocre v0.34, so I have something to compare the new version with.
Rank Name Rel Elo CCRL Elo
1 Counter 1.2 319 2556
2 iCE 0.2 185 2440
3 TJchess 1.01 138 2403
4 Lime 66 65 2250
5 Knightx 1.92 61 2405
6 Mediocre v0.34 34 -
7 Bikjump 2.01 1 2188
8 Roce 0.0390 -2 1951
9 Horizon 4.4 -38 2452
10 Adam 3.3 -115 2326
11 Wing 2.0a -215 2200
Where "Rel Elo" is result of the test and "CCRL Elo" how they're rated on that list.
Turned out like I wanted, with Mediocre somewhere in the middle, with some far superior and some far inferior engines.
I'll run a more extensive test later to get a good estimated baseline ELO for Mediocre v0.34, so I have something to compare the new version with.
[Info] Time for more testing
Well, not exactly more testing but need to work out a better procedure.
Currently I'm running the tests in Arena. One game at a time (on my quad processor...).
The games are timed at 10sec+0.1sec. With that, 1000 games take about 9 hours, which means the run just barely misses finishing before I go to work in the morning.
I wanted to use cutechess-cli, which is both faster and enables me to run four games at a time (one for each core of my processor), but I'm having severe problems with the engines timing out, I wonder if it has to do with Java. It's open source so maybe I can do something about it, we'll see.
Anyway, since I have two time windows where I can run tests, 8 hours at night, and 8 hours during the day (while at work), I should probably figure something out to fit that.
Perhaps have a self-play match first, then run a gauntlet if it turns out a new version is better.
Today's test looked like this:
Rank Name Elo + - games score draws
1 Gaviota-win64-0.84 334 34 30 606 93% 6%
2 Mediocre 1.0 -2 -60 21 20 600 43% 21%
3 Mediocre 1.0 -1 -112 20 20 613 35% 22%
4 Mediocre v0.34 -162 21 21 607 28% 19%
Mediocre -2 is the one with futility pruning, -1 also has lazy eval included. Not so successful it seems.
And Gaviota is beating the living crap out of all versions. Won't allow that for too long. :)
I'll get back on the new testing setup, time to get that stupid cutechess-cli to work somehow.
Oct 13, 2011
[Info] Not so futile attempts
Title refers to an earlier post where I tried to add futility pruning in the past and never really got it to work.
This time though:
Rank Name Elo + - games score draws
1 Mediocre 1.0 -2 31 29 28 100 60% 31%
2 Mediocre 1.0 -1 -31 28 29 100 41% 31%
So in the worst case 2 Elo better, in the best case 60 (the -2 refers to futility pruning added to that version). But somewhere around 30 Elo is probably accurate.
Combining this with the test last night would give this:
Rank Name Elo + - games score draws
1 Mediocre 1.0 -2 53 39 38 100 60% 31%
2 Mediocre 1.0 -1 -11 12 12 1100 53% 24%
3 Mediocre v0.34 -42 13 13 1000 46% 24%
Probably flawed to bits with no statistical significance. But I wonder if I can't say that Mediocre has gained somewhere upwards of almost 100 elo.
Probably not, but considerably better it is.
On to internal iterative deepening.
[Info] Decent
Not as good as I had hoped for (it never is), but still a decent and almost statistically certain improvement:
1: Mediocre 1.0 544,5/1000
2: Mediocre v0.34 455,5/1000
Seems I finally managed to improve on Mediocre v0.34! :)
A few more things left to implement before I'll consider myself done with the search revamp for a while.
After that I'll have a look at the evaluation. I'm quite happy with what's in it, but I'm certain the evaluation parameters are nowhere near optimal (in fact I have a hard time imagining an engine having less optimal parameters, as many are borrowed and manually adjusted from other engines, some are leftovers from early versions of Mediocre, and some were just arbitrarily added).
Anyway, I can now promise that the next version of Mediocre will have a gain in playing strength. Yay! :)
[Info] Another breakthrough, maybe
I spent the evening dabbling with root move ordering. I had a general idea of what I wanted to do. Something like:
1. Break out the first ply from the alphaBeta-method so it's possible to add some unique logic to it (examining a few open source engines it seems fairly common that this is done inside the main search method, which I don't quite understand since it adds both complexity to the code and a bunch of extra if-statements).
This also allows us to make sure the best move is always obtainable (as I mentioned some posts ago I suspect there were some losses where the best move was overwritten in the hash).
And finally, output the currently examined move to UCI, which is kinda nice.
2. Add an extensive move ordering algorithm to the root moves. Doesn't matter how expensive, since it will be done at most like 30-50 times, compared to the millions of ordinary nodes.
Having read quite extensively about it, without actually finding any concrete evidence of the "best" approach, I settled for trying a couple of alternatives:
* Do a quiescence search on all moves and order by the values it returned, but bringing the PV move up front if not there already.
* Count the nodes each move caused and order by that.
* A combination of the above (quiescence score with some scaling at lower plies, and node counts at deeper depths)
I was hovering around the same result (a bit over two minutes on my test set) for all my attempts, a second over or a second under. It seemed my search was immune to root ordering, or rather the ordering obtained by just calling the alphaBeta-method directly was good enough and my attempts at improving it were futile.
Finally I decided to test something else. Instead of using quiescence search I used a one ply ordinary alpha-beta search, combined with the node counts. Obviously this is not something new, I seem to remember some years ago that this was the way to go, but many have moved on to better things (which didn't work out for me).
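A rough sketch of that idea (hypothetical names and API, not the actual implementation): score the root moves with a shallow search for the first iteration, then re-order them by the node counts their subtrees consumed, always keeping the PV move in front.

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public final class RootOrdering {
    static class RootMove {
        int move;
        long nodes;        // nodes spent below this move in the previous iteration
        int shallowScore;  // score from a one-ply alpha-beta search
    }

    static void orderRootMoves(List<RootMove> rootMoves, int pvMove, boolean firstIteration) {
        if (firstIteration) {
            rootMoves.sort(Comparator.comparingInt((RootMove m) -> m.shallowScore).reversed());
        } else {
            rootMoves.sort(Comparator.comparingLong((RootMove m) -> m.nodes).reversed());
        }
        // The PV move from the previous iteration is always searched first.
        for (int i = 0; i < rootMoves.size(); i++) {
            if (rootMoves.get(i).move == pvMove) {
                Collections.swap(rootMoves, 0, i);
                break;
            }
        }
    }
}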
So I tried it on the test set again. 30 seconds?! Down from above 2 minutes. My first reaction was to look for bugs, but I couldn't find any.
Have a test going again, we'll see where it lands. Looks promising though.
Oct 12, 2011
[Info] A small breakthrough
First I added the history heuristic (and mate checks in quiescence search which shouldn't amount to much).
This seemed to help a tad:
1: Mediocre v0.34 502,5/858
2: Mediocre 1.0 355,5/858
A slight improvement from the last test. I think the history heuristic helps since I don't order my moves for things like moving forward/towards the center.
I used an idea from GreKo where you keep track of the history of all moves made (from/to as usual), and also of the moves that cause a beta cutoff (in GreKo this uses moves that raised the value of alpha, but beta cutoffs worked slightly better for me). Then divide the beta cutoff history value by the all-move value.
This should make the history data a bit more sane (essentially penalizing it for making the move without causing a cutoff).
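A small sketch of the idea (illustrative names, not GreKo's or Mediocre's actual code): one counter per from/to pair for every time the move is searched, another for the times it causes a beta cutoff, and the ordering value is their ratio.

public final class HistoryTable {
    private final long[][] tried   = new long[64][64]; // times the move was searched
    private final long[][] cutoffs = new long[64][64]; // times it caused a beta cutoff

    void onMoveSearched(int from, int to) { tried[from][to]++; }
    void onBetaCutoff(int from, int to)   { cutoffs[from][to]++; }

    // Scaled cutoff rate used for ordering; moves searched a lot without ever
    // causing a cutoff are effectively penalized.
    int historyScore(int from, int to) {
        long t = tried[from][to];
        return t == 0 ? 0 : (int) (cutoffs[from][to] * 1000 / t);
    }
}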
-
Now, the above was all nice; v0.34 went from a 62.25% win rate to 58.57%.
But then I decided to try something I've been meaning to for some time: cutting all moves with a negative SEE score in the quiescence search. The point being that they are extremely unlikely to prove useful for anything, at least nothing the quiescence search can detect.
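In the capture loop of the quiescence search this boils down to something like the snippet below, where see() stands for an assumed static exchange evaluator and the make/search/unmake part is left out:

for (int i = 0; i < captureCount; i++) {
    int move = captures[i];
    // Skip captures that lose material according to SEE; the quiescence
    // search is extremely unlikely to get anything useful out of them.
    if (see(move) < 0) continue;
    // ... makeMove(move); score = -quiescence(-beta, -alpha); unmakeMove(move) ...
}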
And this happened:
1: Mediocre 1.0 352,0/700
2: Mediocre v0.34 348,0/700
Close to dead even score, and most likely enough games to confirm that neither engine would beat the other on a regular basis.
So basically I'm back to the strength of v0.34. And I still haven't even done basic root ordering, which I'm expecting to show significant gains.
Fun fun.
Oct 11, 2011
[Info] Another test, with killer moves
Killer moves implemented with a reasonable gain:
1: Mediocre v0.34 622,5/1000
2: Mediocre 1.0 377,5/1000
Still some ways to go, but certainly getting closer.
I have some suspicion the new version is forfeiting some games due to not returning a best move. I have a safety measure for this in v0.34, but since the root search isn't in place in v1.0 yet, it's missing there.
There is also something fishy with mate scores in the new version, will have to look into that.
I'm hoping root move ordering will bring v1.0 up to par with v0.34. An even score at that stage would certainly be thrilling, as there are still extensions, futility pruning and internal iterative deepening left to do.
Oct 10, 2011
[Info] A little test
Getting closer to v0.34:
1: Mediocre v0.34 711,0/1000
2: Mediocre 1.0 289,0/1000
What's left to implement is internal iterative deepening, killer moves, extensions and a more elaborate root move ordering. I should try history moves for ordering as well, though from what I've read they seem to have little relevance anymore. But trying doesn't hurt.
Oct 9, 2011
[Info] Starting on basic move ordering
I've spent the last day trying to identify a bug that made Mediocre go haywire (dropping pieces, missing mates in one etc.) when the hash move was used for ordering, which is of course really strange since it shouldn't add or remove any moves, just search them in a different order.
Of course it turned out that one of the few new things I've added was the culprit.
Instead of keeping two integer arrays, one for the moves and one for the ordering values, I've mashed them together into a Move-object. Pretty much only for clarity (it shouldn't cause any performance degradation).
Since I didn't use any move ordering at all, the ordering part of the Move-object was left untouched. That is, until I started to use hash moves, which I score as 10,000 (search first in any circumstance) and then -10,000 to mark them as already used.
As no other moves received a score, that -10,000 was lingering in the move array and eventually caused no moves to be searched. :)
-
Well, that's over and done with and using the hash move for ordering gave the following result in my standard mini-test:
1: Mediocre 1.0+ 14,0/20
2: Mediocre 1.0 6,0/20
Barely any statistical relevance really, but I'm happy.
Steady going, on to more move ordering.
Oct 8, 2011
[Info] Check evasions
I'm taking baby steps and try to get everything right from the beginning.
Having considered Jon Dart's suggestion and examined a few open source engines (GreKo http://greko.110mb.com/ is a favorite) I decided to add the notion of not standing pat in the quiescence search if in check.
That required continuing the search for one more ply when in check, and thus a need for check evasions.
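Roughly, the quiescence search then looks something like the sketch below (Board, Evaluation and the move generators are assumed names, not the actual code): no stand-pat while in check, evasions instead of captures, and a mate score if no evasion exists.

static final int MATE = 100000; // assumed mate value

static int quiescence(Board board, int alpha, int beta) {
    boolean inCheck = board.isInCheck();

    // Only allow standing pat when not in check.
    if (!inCheck) {
        int standPat = Evaluation.evaluate(board);
        if (standPat >= beta) return beta;
        if (standPat > alpha) alpha = standPat;
    }

    // In check: search every evasion; otherwise only captures.
    int[] moves = inCheck ? board.generateCheckEvasions() : board.generateCaptures();
    if (inCheck && moves.length == 0) return -MATE; // no evasion exists: checkmated

    for (int move : moves) {
        board.makeMove(move);
        int score = -quiescence(board, -beta, -alpha);
        board.unmakeMove(move);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }
    return alpha;
}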
My new check evasion code takes about a fourth of the time compared to the old one, which seems nice. And there might still be a few more optimizations to do.
The last bug I squashed concerning it was evading a check by capturing on the last rank and promoting to a non-queen.
How I've missed these ultra-specific bugs... :)
Oct 7, 2011
[Info] A new approach
After having spent about 20 hours the last week trying to tweak Mediocre into becoming at least a tiny bit better I suddenly had an epiphany and gave up.
It's been far too long since I last worked with this and there are so many things going on that I have no idea if they help or hurt.
So I scrapped the entire search function and started over...
Currently Mediocre 1.0 beta searches about 4-5 ply on average and gets beaten to scraps by v0.34. :) So far I've implemented quiescence search and transposition tables, and pretty much nothing else, not even rudimentary move ordering.
In doing so I've noticed that something is not right with how I handle transposition scores. For example, I've been returning the alpha/beta scores from the table rather than the actual alpha/beta in the node. This seems very wrong.
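For reference, a generic sketch of the usual scheme (illustrative, not Mediocre's actual code): store the score together with a bound type, and only take a cutoff when that bound is usable against the current node's own alpha/beta.

// Generic TT probe sketch with bound types (illustrative, not Mediocre's actual code).
static final int EXACT = 0, LOWER_BOUND = 1, UPPER_BOUND = 2;

static class TTEntry { long key; int depth, score, flag, move; }

// Returns a usable score for this node, or null to fall through to a normal search.
// Assumes a power-of-two table size.
static Integer probe(TTEntry[] table, long key, int depth, int alpha, int beta) {
    TTEntry entry = table[(int) (key & (table.length - 1))];
    if (entry == null || entry.key != key || entry.depth < depth) return null;

    if (entry.flag == EXACT) return entry.score;
    if (entry.flag == LOWER_BOUND && entry.score >= beta) return beta;   // fails high here
    if (entry.flag == UPPER_BOUND && entry.score <= alpha) return alpha; // fails low here
    return null;
}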
There's also something fishy going on with the hash move, but I'm not sure if this is true for v0.34 (there might be something fixing it along the way).
-
Anyway, I'm having fun, and the improvements come in 100 rating chunks again. My favorite kind of improvement. :)
Let's see if it ends up surpassing v0.34 eventually. I certainly hope so.
Edit:
First non-loss for the new engine! :)
Engine Score
1: Mediocre v0.34 19,5/20
2: Mediocre 1.0 beta 0,5/20
Oct 6, 2011
Sad day
Sometimes when you innovate, you make mistakes. It is best to admit them quickly, and get on with improving your other innovations. - Steve Jobs
Oct 5, 2011
[Info] Some info please and a TODO
I've implemented Jon Dart's suggested changes and I think they bring some improvement to playing strength. However to be honest, I'm not sure. My small tests so far point in all kinds of directions.
I need to set up a new testing environment and have taken a look at cutechess-cli which seems really nice. But I'm having problems with Mediocre crashing (or rather stalling) while running tournaments. Especially the new version. It also leaves really expensive Mediocre processes running, completely clogging up the computer.
I have no idea why.
This brings me to an important point that's been missing in Mediocre since the beginning. A decent logging framework. This should shed some light on those pesky stalls.
-
So a TODO for the next couple of days:
General
Set up a testing environment using cutechess-cli, some creative scripting (to be able to run gauntlet matches) and a number of well-chosen engines. Most likely using the YATS testing again (even though I can't get ProDeo to work anymore).
Get some decent logging going. This should help immensely in tracking down bugs and getting a feel for where improvements might be needed.
Engine improvement
Apart from Jon Dart's improvements:
Pondering - It's time to get it done, and shouldn't take too long.
Endgame knowledge and evaluation - Take a look at what I started with two years ago. I might be able to shake out some improvements.
Oct 3, 2011
Reboot
What happened to the repositories?
It has been a long long time since I took a look at the source code at https://sourceforge.net/projects/mediocrechess/. I have a firm memory of using CVS, but there's no trace of CVS ever being used.
Well, I'm planning for a new version of Mediocre. Thanks to Jon Dart (http://www.arasanchess.org/) for coming up with some suggested improvements that sparked my interest.
So in light of that I'll set up everything again. Mediocre v0.34 has been uploaded to the SVN on Sourceforge and I'll be using it "correctly" this time.
Trunk will be used for ongoing development, branches for specific ongoing changes, and tags for released versions.
Look forward to a new version in the near future.