Engine strength and statistics


Posted by: Rudolf Posch at 30 December 2003 10:25:01

I have just read an older posting by Igor Gorelikov,
http://f11.parsimony.net/forum16635/messages/59053.htm ,
about a version qualifier for IL-4 with the following results:

                   1  2  3  4  5  6   S    P     B
----------------------------------------------------
1 TRACE 1.25      XX 11 == 10 10 10   6   1-3  30.00
2 TRACE 1.1.2     00 XX =1 11 =1 ==   6   1-3  27.50
3 BigLion 2.17    == =0 XX =1 11 ==   6   1-3  27.25
4 RDChess 3.15    01 00 =0 XX =1 =1   4=   4   20.25
5 BigLion 2.23f   01 =0 00 =0 XX 11   4    5   18.25
6 RDChess 3.22    01 == == =0 00 XX   3=   6   20.25
----------------------------------------------------

RDChess V3.15 scored better than V3.22 after playing 10 games each.
(And seemingly the newer BigLion V2.23f is worse than the older V2.17!?)
I personally believe RDChess V3.22 is stronger than V3.15. I think 20 games are too few for comparing the strength of 2 engines if the engines do not differ much in strength.
I always run extensive tests before releasing a newer RDChess version, in matches against the previous RDChess version or against "benchmark engines" (I have been using an older GNU Chess version, V5.02+, for this purpose).
While watching a match, e.g. V3.21 against V3.22, the older version V3.21 leads 12-1-2 and I think "uugh, V3.22 is much worse!". But at the end of my (standard 60-game) benchmark tourney the final score is e.g. 25-28-7.
So one has to be careful about making evaluations on the basis of a few games!
Rudolf
Rudolf Posch
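
As an illustration of how wide the uncertainty on a 60-game match is, here is a minimal Python sketch. It takes the example result 25-28-7 from the post above as wins, losses and draws for one side, and it assumes the usual logistic Elo model and a normal approximation for the interval. The function names are made up for this example and are not taken from any tool mentioned in the thread.

```python
import math

def elo_from_score(p):
    """Elo difference implied by a score fraction p (0 < p < 1), logistic model."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def match_summary(wins, losses, draws, z=1.96):
    """Return (score fraction, low, high) using a normal-approximation interval."""
    n = wins + losses + draws
    p = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score p.
    var = (wins * (1 - p) ** 2 + losses * p ** 2 + draws * (0.5 - p) ** 2) / n
    half_width = z * math.sqrt(var / n)
    return p, p - half_width, p + half_width

p, low, high = match_summary(25, 28, 7)
print(f"score {p:.3f}, approx. 95% interval {low:.3f} .. {high:.3f}")
print(f"Elo {elo_from_score(p):+.0f} "
      f"({elo_from_score(low):+.0f} .. {elo_from_score(high):+.0f})")
```

Under these assumptions the interval spans both sides of 50%, so a result like this cannot say which version is stronger.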
 

Re: Engine strength and statistics

Posted by: Uri Blass at 30 December 2003 10:41:28
In reply to: Engine strength and statistics, posted by Rudolf Posch at 30 December 2003 10:25:01

If the difference in score is only 3 points, you cannot be sure which version is better, and the number of games is not important.
I think the first thing you need to do is remove bugs. The main question for you should be why RDChess makes stupid blunders in Leo's tournament.
Uri
Uri Blass
 

Re: Engine strength and statistics

Posted by: Dr.WAEL DEEB at 30 December 2003 10:41:49
In reply to: Engine strength and statistics, posted by Rudolf Posch at 30 December 2003 10:25:01

Hi Rudolf,
Yes, I agree with you that a large number of games must be played in a tournament! In my BasicLeague series of tournaments every engine plays 32 games at a slow time control: 2 hours per game! OTOH, sometimes older versions are better than the latest one!
BTW, RDChess 3.21 is participating in my BasicLeague_007:
Ghost 0.13
Ghost 0.13_PrBk*
Matheus 2.0:1-0
Merlin 1.1
Pepito 1.59i nonprofile
RDChess 3.21
Sinapse 1.0
Thinker 4.2h
WAEL
* = using my Power_Book!
BasicLeague_007
Hardware: AMD Athlon(tm) processor 1400 MHz, 512 MB RAM,
hash tables = 128 MB for each engine!
Software: Windows XP Professional, GUI: Arena 1.0!
Nalimov EGTB: all 4-piece and most 5-piece tables!
Conditions:
Time control: 2 hours per game!
Each engine plays 4 games against every other engine!
Ponder off!
Learning enabled!
No adapters are used!
No frozen or out-of-time games! Just clear-cut results!
* 8 engines + 1 human participate!
** Every engine starts with 1800 Elo, which is calculated manually using the FIDE calculation method!
Regards,
Dr.WAEL DEEB
P.S. I'll post results when enough games have been played!
Dr.WAEL DEEB
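
For readers who want to follow a manual rating calculation like the one described above, here is a small Python sketch of the common Elo update rule. It is only an approximation of what the post calls the FIDE method: it uses the logistic expected-score formula rather than FIDE's lookup table, and the K-factor of 20 is an assumption chosen for illustration, since the post does not say which value is used.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B (logistic Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update_rating(rating_a, rating_b, score_a, k=20.0):
    """New rating of A after one game; score_a is 1, 0.5 or 0. K is assumed."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))

# Every participant starts at 1800, as in the tournament conditions above.
rating = 1800.0
for opponent, result in [(1800.0, 1.0), (1800.0, 0.5), (1800.0, 0.0)]:
    rating = update_rating(rating, opponent, result)
    print(f"after result {result}: {rating:.1f}")
```

With equal starting ratings the expected score is 0.5, so a win against an 1800-rated opponent moves the rating by half the K-factor.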
 

Re: Engine strength and statistics

Posted by: Igor Gorelikov at 30 December 2003 12:05:53
In reply to: Engine strength and statistics, posted by Rudolf Posch at 30 December 2003 10:25:01

Hi Rudolf,
I try to use the longest time control for qualifiers (30'+3") to bring it closer to the one used in the New Long loop (80'+3"). This is because engines may perform differently at different time controls.
Qualifiers with many games would last for months and make no sense for me.
BTW, which time control do you use for your tests?
In the rating list of New IL (after 20-25 games) RDChess 3.15 stays higher than 3.21 and 3.19.
In my opinion, ten games are enough to find out that a new version is not
much stronger than the previous.
Note also that there will be further cycles and further chances for new versions.
Best regards,
Igor
Igor Gorelikov
 

Re: Engine strength and statistics

Posted by: U.Tuerke at 30 December 2003 12:09:34
In reply to: Re: Engine strength and statistics, posted by Igor Gorelikov at 30 December 2003 12:05:53

I disagree with this: 10 games do not suffice at all.
I have too often observed a match starting with 7:3 ending up with 13:17.
Statistical fluctuations are really annoying.
Uli
U.Tuerke
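
The 7:3 start mentioned above is easy to put a number on. The following Python sketch computes the exact binomial probability that one of two equally strong engines leads at least 7:3 after 10 games. It ignores draws, which is a simplifying assumption, and the function name is only illustrative.

```python
from math import comb

def tail_at_least(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): at least k wins in n decisive games."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

one_side = tail_at_least(10, 7)   # a given engine scores 7 or more out of 10
either_side = 2 * one_side        # either engine does so (the events are disjoint)
print(f"given engine leads 7:3 or better: {one_side:.1%}")
print(f"either engine leads 7:3 or better: {either_side:.1%}")
```

Under these assumptions roughly one 10-game match in three between equal engines shows a lead of 7:3 or more for one side, so such a start says little about the final result.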
 

Re: Engine strength and statistics

Posted by: Uri Blass at 30 December 2003 12:31:28
In reply to: Re: Engine strength and statistics, posted by U.Tuerke at 30 December 2003 12:09:34

The question is what you mean by the word "much".
I think that Igor does not consider 50 Elo to be much.
Uri
Uri Blass
 

Re: Engine strength and statistics

Posted by: U.Tuerke at 30 December 2003 12:40:47
In reply to: Re: Engine strength and statistics, posted by Uri Blass at 30 December 2003 12:31:28

You are right. The evaluation depends on the difference in Elo. If version n+1 is stronger than version n by 300 Elo points, even 10 games may give an indication.
Note that SSDF (usually) does not publish its results before the minimum of 200 games is reached. They have very good reasons for this.
Usually we (at least I) are hunting for improvements of 10-20 Elo when switching from version n to n+1. 10 games are really hopeless in this case.
For instance, there have been times when testers reported to me that version n+1 played far weaker than version n, although I knew that I had only made a tiny and very plausible bug fix. -:)
I just ignore reports like this.
Uli
U.Tuerke
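
To give a rough feel for why 10 games are hopeless for a 10-20 Elo improvement, here is a Python sketch that estimates how many games are needed before the expected lead reaches about two standard errors. It assumes the logistic Elo model, a draw ratio of 30% (a made-up value for illustration) and a per-game variance evaluated near a 50% score, so the numbers are order-of-magnitude estimates only.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger engine for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_ratio=0.3, z=1.96):
    """Games until the expected edge equals z standard errors (rough estimate)."""
    if elo_diff <= 0:
        raise ValueError("elo_diff must be positive")
    edge = expected_score(elo_diff) - 0.5
    # Per-game standard deviation near a 50% score with the assumed draw ratio.
    sigma = math.sqrt((1.0 - draw_ratio) / 4.0)
    return math.ceil((z * sigma / edge) ** 2)

for diff in (10, 20, 50, 300):
    print(f"{diff:>3} Elo: about {games_needed(diff)} games")
```

Under these assumptions a 20 Elo difference needs several hundred games, while a 300 Elo difference shows up within a handful, which is consistent with the point made in the post above.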
 

Re: Engine strength and statistics

Posted by: Igor Gorelikov at 30 December 2003 12:48:27
In reply to: Re: Engine strength and statistics, posted by U.Tuerke at 30 December 2003 12:09:34

Please read my sentence once more:
"In my opinion, ten games are enough to find out that a new version is not
MUCH STRONGER than the previous."
I may also add that human life is too short... ;-)
After the third cycle of New IL I have 168 participants, and this number will grow with each further cycle. Yes, 100 games to qualify a new version would be a good approximation, but... it would take years and years to run.
I try to find a happy medium between the duration of my life and the number of games.
Best regards,
Igor
Igor Gorelikov
 

Re: Engine strength and statistics

Posted by: U.Tuerke at 30 December 2003 14:38:42
In reply to: Re: Engine strength and statistics, posted by Igor Gorelikov at 30 December 2003 12:48:27

As Uri has already said, it really depends on what is MUCH.
A 50 Elo gain with a new version seems a very good success to me, so I'd say that this is MUCH. -:)
Where are the developers who improve the playing level by 200 or more Elo with a new version?
It's an illusion to expect this.
That's of course all too true. I do not expect testers to produce huge collections of test games; I would just like to remind everyone not to forget statistics. I myself use a minimum of 20-25 games at a "reasonable" time control in a test match; that's admittedly still too few. But I don't own a dozen machines (as Ed does, for instance) to run several matches simultaneously.
Finally, electricity isn't too cheap either. -:)
Have a good new year,
Uli
U.Tuerke
 

Re: Engine strength and statistics

Posted by: Uri Blass at 30 December 2003 14:58:42
In reply to: Re: Engine strength and statistics, posted by U.Tuerke at 30 December 2003 14:38:42

I improved by something close to 200 Elo when I released 00_799 after 00_7a.
There are more examples of big improvements; I remember Danchess, which improved significantly from 1.01 to 1.02 thanks to making the program 8 times faster.
The way to do it is not to release new versions very often.
There are already enough free programs, and I do not need to release a new version only because it is 10-20 Elo better.
Uri
Uri Blass
 

Re: Engine strength and statistics

Posted by: Slobodan R. Stojanovic at 30 December 2003 15:02:13
In reply to: Re: Engine strength and statistics, posted by Igor Gorelikov at 30 December 2003 12:05:53

Hi Igor,
Ten games are surely not enough; I know that from experience. But 20 games could be considered enough for this.
Regards. SL.
Slobodan R. Stojanovic
 

Re: Engine strength and statistics

Posted by: U.Tuerke at 30 December 2003 15:10:33
In reply to: Re: Engine strength and statistics, posted by Uri Blass at 30 December 2003 14:58:42

You really don't know beforehand.
Well, I must say that my releases are not (only) meant to satisfy the "consumers' needs" but are also a bit egoistic: I want to get some feedback. In particular, I like to test my changes in the regular tourneys being run in the "wb scene". The easiest way to do that is a release.
A bug fix is another good reason for making a release that comes to mind.
If I had waited for a 200 Elo improvement, there would have been only a single Comet release, I'm afraid. -:)
Happy new year,
Uli
U.Tuerke
 

OK!

Posted by: Igor Gorelikov at 30 December 2003 16:16:07
In reply to: Engine strength and statistics, posted by Rudolf Posch at 30 December 2003 10:25:01

That's OK.
Just out of curiosity, and at the request of U.Tuerke and Slobodan R. Stojanovic, I'm ready to run my qualifiers a second time to allow all new versions to reach 20 games. It will take 7 to 10 more days (for 7 qualifiers).
Then we can compare the results of 2-round and 4-round events.
It also means that the previous promotions/failures will be reconsidered according to the new results.
The first qualifier starts this evening.
The participants of the first qualifier:
=========================================

New IL-4, Ver-qual 1 (now results of 2 rounds)
P3 1GHz 256MB, 2003.12.15 - 2003.12.16
                   1  2  3  4  5  6   S    P     B
----------------------------------------------------
1 TRACE 1.25      XX 11 == 10 10 10   6   1-3  30.00
2 TRACE 1.1.2     00 XX =1 11 =1 ==   6   1-3  27.50
3 BigLion 2.17    == =0 XX =1 11 ==   6   1-3  27.25
4 RDChess 3.15    01 00 =0 XX =1 =1   4=   4   20.25
5 BigLion 2.23f   01 =0 00 =0 XX 11   4    5   18.25
6 RDChess 3.22    01 == == =0 00 XX   3=   6   20.25
----------------------------------------------------
Games 30/30, +12 -7 =11

Best regards and Happy New Year!
Igor
Igor Gorelikov
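
The S and B columns of the crosstable above can be recomputed from the individual results. The Python sketch below does this under two assumptions that the thread does not state: that "=" in a cell is a draw (so "4=" in the S column means 4.5), and that B is the Sonneborn-Berger tie-break, i.e. the sum of each opponent's final score weighted by the points taken from that opponent. The function and variable names are hypothetical.

```python
# Crosstable rows as posted: one two-game cell per opponent, "XX" for self.
ROWS = [
    ("TRACE 1.25",    "XX 11 == 10 10 10"),
    ("TRACE 1.1.2",   "00 XX =1 11 =1 =="),
    ("BigLion 2.17",  "== =0 XX =1 11 =="),
    ("RDChess 3.15",  "01 00 =0 XX =1 =1"),
    ("BigLion 2.23f", "01 =0 00 =0 XX 11"),
    ("RDChess 3.22",  "01 == == =0 00 XX"),
]
VALUE = {"1": 1.0, "0": 0.0, "=": 0.5}

def points(cell):
    """Points scored in one cell, e.g. '=1' -> 1.5."""
    return sum(VALUE[ch] for ch in cell)

def standings(rows):
    """Return (name, score, Sonneborn-Berger) for every row of the crosstable."""
    cells = [results.split() for _, results in rows]
    size = len(rows)
    scores = [sum(points(cells[i][j]) for j in range(size) if j != i)
              for i in range(size)]
    berger = [sum(points(cells[i][j]) * scores[j] for j in range(size) if j != i)
              for i in range(size)]
    return [(rows[i][0], scores[i], berger[i]) for i in range(size)]

for name, score, sb in standings(ROWS):
    print(f"{name:<14} {score:>4.1f} {sb:>6.2f}")
```

Under these assumptions the recomputed values agree with the S and B columns as posted.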
 

Re: OK!

Posted by: U.Tuerke at 30 December 2003 16:33:20
In reply to: OK!, posted by Igor Gorelikov at 30 December 2003 16:16:07

Thanks, Igor.
I'd bet that you will get a different ranking this time.
I mean, this in no way devalues your tourneys. They are very interesting and they certainly give some clues.
However, one should be aware of the uncertainties.
Anyway, we'll see.
Uli
U.Tuerke
 

Re: OK!

Posted by: Rudolf Posch at 30 December 2003 18:11:18
In reply to: OK!, posted by Igor Gorelikov at 30 December 2003 16:16:07

Hi Igor,
thanks for your reply. My posting was not meant to criticize the way you run your tourneys. One always has to compromise on the time spent testing engines at a time when the number of WinBoard engines is increasing steadily.
But I would really be surprised if your second qualifier showed the same result as the first! I eagerly look forward to the result!
Thanks for your great work to run your infinite loops and
a Happy New Year !
Rudolf
Rudolf Posch
 

Re: Engine strength and statistics

Postby Rudolf Posch » 30 Dec 2003, 18:36

Geschrieben von: / Posted by: Rudolf Posch at 30 December 2003 18:36:49:
Als Antwort auf: / In reply to: Re: Engine strength and statistics geschrieben von: / posted by: Igor Gorelikov at 30 December 2003 12:05:53:
Hi Rudolf,
I try to use the longest time control for qualifiers (30'+3") to bring it
closer to the one used in the New Long loop (80'+3"), because
engines may perform differently at different time controls.
Many-game qualifiers would last for months and make no sense for me.
BTW, which time control do you use for your tests?
In the rating list of New IL (after 20-25 games) RDChess 3.15 stays higher than 3.21 and 3.19.
In my opinion, ten games are enough to find out that a new version is not
much stronger than the previous one.
Note also that there will be next cycles and further chances for new
versions.
Best regards,
Igor
Hi Igor,
I have to admit that for my benchmarks before releasing newer versions of RDChess I use shorter time controls (3'+0", 10'+0", 10' for 40 moves, seldom longer ones), for the same reasons you mentioned (I do not want to spend my whole life testing). I assume that RDChess's "strength distribution over different time controls" does not change much between versions.
This does not mean that RDChess's strength does not depend on the time control.
RDChess certainly plays stronger against other engines at shorter time controls. But I normally assume that this behaviour does not change with a newer version. I assume, sloppily, that an improvement to RDChess increases its strength (nearly) equally at all time controls.
Feedback from your tournaments, especially your IL long, is therefore very important for me.
Rudolf
Rudolf Posch
 

Re: OK!

Postby Igor Gorelikov » 30 Dec 2003, 18:45

Geschrieben von: / Posted by: Igor Gorelikov at 30 December 2003 18:45:45:
Als Antwort auf: / In reply to: Re: OK! geschrieben von: / posted by: Rudolf Posch at 30 December 2003 18:11:18:
That's OK.
Just out of curiosity, and at the request of U.Tuerke and Slobodan R. Stojanovic, I'm ready to run my qualifiers a second time so that all new versions reach 20 games. It will take 7 to 10 more days (for 7 qualifiers).
Then we can compare the results of the 2-round and 4-round events.
Hi Igor,
thanks for your reply. My posting was not meant to criticize the way you run your tourneys. One always has to compromise on the time spent testing engines, now that the number of winboard engines is increasing steadily.
But I would really be surprised if your second qualifier showed the same result as the first! I look forward eagerly to the result!
Thanks for your great work running your infinite loops, and
a Happy New Year!
Rudolf

The qualifier has started, and RDChess 3.22 won both games vs BigLion 2.17 (rounds 3 and 4), while the first and second rounds were draws!
But the final results will be known only on 5-6 January ;-( after the holidays ;-).
My site will also be updated on 5 January.
Have a good rest!
Igor
Igor Gorelikov
 

Re: Engine strength and statistics

Postby Rudolf Posch » 30 Dec 2003, 18:58

Geschrieben von: / Posted by: Rudolf Posch at 30 December 2003 18:58:20:
Als Antwort auf: / In reply to: Re: Engine strength and statistics geschrieben von: / posted by: U.Tuerke at 30 December 2003 12:40:47:
Note that SSDF does (usually) not publish their results before the minimum of 200 games is reached. They have very good reasons to do this.
Usually, we are hunting (at least me) for improvements of 10-20 ELO by switching from version n to n+1. 10 games are really hopeless in this case.
For instance, there have been times when testers reported to me that version n+1 plays far weaker than version n, although I knew that I had only made a tiny and very plausible bug fix. -:)
I just ignore reports like this.
Uli
Uri
I agree with this!
I play benchmark tourneys between version n and n+1, and the first ends 30-20-10 and the second 25-30-5 (35:25 points versus 27.5:32.5).
So what does this mean (a third tourney might end xx-yy-zz)?
It depends, for instance, on the opening book moves, which are chosen randomly. But when I switch off the opening book, too many games are identical.
I am not a mathematician, but I am sure there is a probability formula that lets one say version n+1 is x % better than version n with an uncertainty of +/- y % after playing z games.
With too few games y may be greater than x, so you can't say which version is better.
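
A rough sketch of such a formula, assuming the games are independent of each other and using a normal approximation to the match score. The function names are illustrative only and do not come from any chess tool; results are read as wins-losses-draws, which matches the point totals quoted above.

```python
import math

def match_stats(wins, losses, draws, z=1.96):
    """Return (score, margin) for a match seen from one engine's side.

    score is the fraction of points scored (a draw counts 0.5) and
    margin is the half-width of an approximate 95% interval (z=1.96).
    Assumes independent games and a normal approximation.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    return score, z * math.sqrt(var / n)

def elo_diff(score):
    """Elo difference implied by a score fraction (logistic Elo model).

    Only defined for 0 < score < 1.
    """
    return -400.0 * math.log10(1.0 / score - 1.0)

# The two 60-game tourneys from the post, written wins-losses-draws.
for wins, losses, draws in [(30, 20, 10), (25, 30, 5)]:
    score, margin = match_stats(wins, losses, draws)
    print(f"{wins}-{losses}-{draws}: {100 * score:.1f}% +/- {100 * margin:.1f}%, "
          f"about {elo_diff(score):+.0f} Elo")
```

Under these assumptions the margin for each 60-game tourney is roughly 11 to 12 percentage points, which is larger than the distance of either score from 50%. So neither result alone shows which version is better, and the two results do not contradict each other.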
Rudolf
Rudolf Posch
 

Re: Engine strength and statistics

Postby Peter Fendrich » 30 Dec 2003, 20:50

Geschrieben von: / Posted by: Peter Fendrich at 30 December 2003 20:50:47:
Als Antwort auf: / In reply to: Re: Engine strength and statistics geschrieben von: / posted by: U.Tuerke at 30 December 2003 12:40:47:
I just have read an already older posting of Igor Gorelikov
http://f11.parsimony.net/forum16635/messages/59053.htm ,
a version qualifier for IL-4 with the following results:

                 1  2  3  4  5  6     S    P    B
----------------------------------------------------
1 TRACE 1.25     XX 11 == 10 10 10   6    1-3  30.00
2 TRACE 1.1.2    00 XX =1 11 =1 ==   6    1-3  27.50
3 BigLion 2.17   == =0 XX =1 11 ==   6    1-3  27.25
4 RDChess 3.15   01 00 =0 XX =1 =1   4=   4    20.25
5 BigLion 2.23f  01 =0 00 =0 XX 11   4    5    18.25
6 RDChess 3.22   01 == == =0 00 XX   3=   6    20.25
----------------------------------------------------

RDChess V3.15 scored better than V3.22 after 10 games each.
(And seemingly the older BigLion V2.17 is better than V2.23f !?).
I personally believe RDChess V3.22 is stronger than V3.15. I think 20 games are too few for comparing the strength of 2 engines if they do not differ much in strength.
Before releasing a newer RDChess version I always run extensive tests, in matches against the previous RDChess version or against "benchmark engines" (I have been using an older GNU chess version, V5.02+, for this purpose).
While watching a match, e.g. V3.21 against V3.22, the older version V3.21 leads 12-1-2 and I think "uugh, V3.22 is much worse!". But at the end of my (standard 60-game) benchmark tourney the final score is e.g. 25-28-7.
So one has to be careful about making evaluations on the basis of a few games!
Rudolf
Hi Rudolf,
I try to use the longest time control for qualifiers (30'+3") to bring it
closer to the one used in the New Long loop (80'+3"), because
engines may perform differently at different time controls.
Many-game qualifiers would last for months and make no sense for me.
BTW, which time control do you use for your tests?
In the rating list of New IL (after 20-25 games) RDChess 3.15 stays higher than 3.21 and 3.19.
In my opinion, ten games are enough to find out that a new version is not
much stronger than the previous one.
I disagree with this: 10 games do not suffice at all.
I have too often observed a match starting with 7:3 ending up with 13:17.
Statistical fluctuations are really annoying.
Uli
The question is what you mean by the word "much".
I think that Igor does not consider 50 Elo as much.
You are right. The evaluation depends on the difference in Elo. If version n+1 is 300 Elo points stronger than version n, even 10 games may give an indication.
Note that SSDF does (usually) not publish their results before the minimum of 200 games is reached. They have very good reasons to do this.
Usually, we are hunting (at least me) for improvements of 10-20 ELO by switching from version n to n+1. 10 games are really hopeless in this case.
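
To put rough numbers on this, here is a minimal sketch of how many games are needed before a given Elo gap stands out from the noise. It assumes the logistic Elo model, independent games, and a normal approximation; the function names and the `sigma` value are assumptions for illustration, not figures from SSDF or any tester.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger engine under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, z=1.96, sigma=0.5):
    """Games needed before a positive Elo gap exceeds z standard errors.

    sigma is the per-game standard deviation of the result (1, 0.5 or 0).
    0.5 is the largest value it can take, so this is a pessimistic,
    assumed figure; many draws would lower it.
    """
    edge = expected_score(elo_diff) - 0.5
    return math.ceil((z * sigma / edge) ** 2)

for diff in (10, 20, 50, 300):
    print(f"{diff:>3} Elo: about {games_needed(diff)} games")
```

With these assumptions a 300 Elo gap can show up within about ten games, while a gap of 10-20 Elo needs on the order of a thousand games or more.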
For instance, there have been times when testers reported to me that version n+1 plays far weaker than version n, although I knew that I had only made a tiny and very plausible bug fix. -:)
I just ignore reports like this.
This can sometimes be true because of some compiler decision.
In rare cases I have seen as much as a 10% difference in n/s after just
changing some part of the code that shouldn't affect performance at all.
Compilers today live a life of their own... :-)
/Peter
Peter Fendrich
 

Re: Engine strength and statistics

Postby U.Tuerke » 02 Jan 2004, 16:28

Geschrieben von: / Posted by: U.Tuerke at 02 January 2004 16:28:03:
Als Antwort auf: / In reply to: Re: Engine strength and statistics geschrieben von: / posted by: Rudolf Posch at 30 December 2003 18:36:49:
Hi Rudolf,
I try to use the longest time control for qualifiers (30'+3") to bring it
closer to the one used in the New Long loop (80'+3"), because
engines may perform differently at different time controls.
Many-game qualifiers would last for months and make no sense for me.
BTW, which time control do you use for your tests?
In the rating list of New IL (after 20-25 games) RDChess 3.15 stays higher than 3.21 and 3.19.
In my opinion, ten games are enough to find out that a new version is not
much stronger than the previous one.
Note also that there will be next cycles and further chances for new
versions.
Best regards,
Igor
Hi Igor,
I have to admit that for my benchmarks before releasing newer versions of RDChess I use shorter time controls (3'+0", 10'+0", 10' for 40 moves, seldom longer ones), for the same reasons you mentioned (I do not want to spend my whole life testing). I assume that RDChess's "strength distribution over different time controls" does not change much between versions.
This does not mean that RDChess's strength does not depend on the time control.
RDChess certainly plays stronger against other engines at shorter time controls. But I normally assume that this behaviour does not change with a newer version. I assume, sloppily, that an improvement to RDChess increases its strength (nearly) equally at all time controls.
Feedback from your tournaments, especially your IL long, is therefore very important for me.
Rudolf

This depends, of course, on the particular change.
For instance, if you decide to drop any preprocessing at the root and do everything at the leaves instead, this is very likely to pay off more at longer time controls.
Uli
U.Tuerke
 

