
Tuning the eval

PostPosted: 02 Jan 2009, 13:49
by daniel anulliero
Hi all!
This is my first topic in this forum (I haven't posted in a long time :-) ).
I'm the author of Jars and Yoda, and I think my eval is ... well, wrong :-)
I'm trying to rewrite it from scratch (before adding hash tables), adding eval parameters one by one and adjusting their weights.

Then I test each change:
- by self-play against "old" versions (round-robin, 10 games against each other)
- by gauntlets against 6 engines (10 games against each)

Do you think this is a good approach?

best regards

Re: Tuning the eval

PostPosted: 02 Jan 2009, 14:41
by Sven Schüle
Hi Daniel,

it is very likely that 60 games are not enough to see whether your changes really made your engine play stronger. Evaluate your 60 games with BayesElo to get relative ratings and have a look at the resulting +/- error bars. If you repeat the same test from scratch several times you might get substantially different results each time; that's what the error bars indicate. Then play 600 games, and you will see the error bars decrease, although probably not yet to an acceptable level.
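As a rough illustration of why the error bars shrink so slowly: here is a back-of-the-envelope sketch using the standard logistic Elo model and a normal approximation (not BayesElo itself; the 55% score is just an example value):

```python
import math

def elo_diff(score):
    """Convert a score fraction (0..1) to an Elo difference (logistic model)."""
    return -400 * math.log10(1 / score - 1)

def elo_error_bar(score, n_games, z=1.96):
    """Approximate 95% error bar (in Elo) for a score measured over n_games.
    Normal approximation; ignores the variance reduction from draws, so it
    slightly overestimates the error."""
    sigma_score = math.sqrt(score * (1 - score) / n_games)
    # Propagate the score uncertainty through the slope of the Elo curve
    slope = 400 / (math.log(10) * score * (1 - score))
    return z * sigma_score * slope

# Error bars only shrink with the square root of the number of games:
for n in (60, 600, 6000):
    print(n, round(elo_error_bar(0.55, n), 1))
```

With 60 games the 95% error bar is close to +/-90 Elo, so a small eval improvement is invisible; 10x more games only cuts the bar by a factor of about 3.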

I can't tell you exact numbers, but some engine authors, like Bob Hyatt, tend to play thousands of games to reliably measure small Elo differences between two engine versions.

Bob also prefers not to use opening books but instead uses a huge number of different (balanced) starting positions which he extracted from high-quality games.

He also does not repeat playing the same position too often within one test run, since it has been stated that doing so would have a negative impact on the stability of test results (dependent measurements).

Finally, to get such a huge number of games finished within reasonable time you need
a) to play at an ultra-fast time control (in the range of a few seconds per game),
b) Bob's cluster :-)
Since you don't have b), you will have to live with slightly higher error bars than those Bob is getting now.

There have been huge threads in CCC (IIRC) about this topic within the past 12 months.


Re: Tuning the eval

PostPosted: 04 Jan 2009, 01:01
by daniel anulliero
Hi Sven,
thanks for your answer.
Well, I'll add many more engines to the tests :)
I want to use 150 engines, 4 games each. Is that OK?

This will take a lot of time :)


Re: Tuning the eval

PostPosted: 04 Jan 2009, 01:34
by Sven Schüle
daniel anulliero wrote: I want to use 150 engines, 4 games each. Is that OK?

This will take a lot of time :)

I can't say whether this setup will be successful, but I think problems may arise: either you always use the same two (or four) starting positions against each opponent engine (then your engine will be tuned for too few types of positions), or you use different starting positions for different opponents (this might work, I don't know, but it looks dubious to me), or you play with opening books (which many people dislike because of the instability of results it can introduce).

Perhaps it would be slightly better to take only 10 engines (with strength not too far from that of your own engine) and play 60 games against each, using 30 different starting positions and playing each position twice (switching colors). With roughly 12-15 seconds per game (e.g. 40 moves in 4 seconds for each player, or anything similar) you'll end up at 120-150 minutes total for one complete test run of 600 games. That may give an acceptable result for you, although not with the precision that Bob is able to obtain.
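A quick back-of-the-envelope check of that schedule (the numbers come from the post above; the 13.5 s average is just my midpoint of the 12-15 s range):

```python
# Sven's suggested schedule: 10 opponents, 30 positions, each played with both colors.
opponents = 10
positions = 30
colors = 2
games = opponents * positions * colors   # total games in one test run
avg_secs_per_game = 13.5                 # assumed midpoint of the 12-15 s estimate
minutes = games * avg_secs_per_game / 60
print(games, minutes)                    # -> 600 135.0
```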


Re: Tuning the eval

PostPosted: 04 Jan 2009, 10:57
by H.G.Muller
The new WinBoard release includes the Nunn and Silver positions, (10 and 40, respectively), so obtaining suitable start positions should be no problem! 8-)

Re: Tuning the eval

PostPosted: 05 Jan 2009, 01:33
by daniel anulliero
Finally I'm starting with 59 engines (including Surprise and Micromax :) ),
10 games each = 590 games.
We'll see!
Thanks for your interest.

Re: Tuning the eval

PostPosted: 11 Feb 2009, 23:21
by Richard Allbert

I'm late to this, but you might be better off taking, say, 10 opponents and playing a gauntlet against all of them, as Black and White, using the Silver or Noomen suite (or something similar). That way you play 600 games under much more "repeatable" conditions.

Just as the total number of games is important, I assume the number of games against each individual opponent is also important...


Re: Tuning the eval

PostPosted: 12 Feb 2009, 21:34
by H.G.Muller
You've got to have both, actually. If too large a fraction of the games is against the same engine, you will be training your engine to defeat that particular engine, even if that makes it perform worse in general (i.e. against most other opponents). But if too large a fraction of your games is played from the same position, you will be training your engine to play the best moves in that position, which might again come at the expense of doing well in other positions, without you noticing.

So 10 opponents with 60 positions each is just as bad as 60 opponents with 10 positions each. The best compromise would be 25 opponents, 24 positions each.
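This compromise amounts to balancing the two counts around the square root of the game budget (25 x 24 = 600). A small sketch of that rule of thumb (the function name is mine, not from the thread):

```python
import math

def balanced_split(total_games):
    """For a fixed game budget, balance the number of opponents against the
    number of games played per opponent, so that neither 'same opponent' nor
    'same position' dominates the sample. Rough rule of thumb: both counts
    should be near sqrt(total_games)."""
    opponents = round(math.sqrt(total_games))
    games_per_opponent = total_games // opponents
    return opponents, games_per_opponent

print(balanced_split(600))  # -> (24, 25): essentially the 25 x 24 split above
```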