If the difference in score is only 3 points, you cannot be sure which version is better, no matter the number of games.

I have just read an already older posting by Igor Gorelikov,
http://f11.parsimony.net/forum16635/messages/59053.htm ,
a version qualifier for IL-4 with the following results:
1 2 3 4 5 6 S P B
----------------------------------------------------
1 TRACE 1.25 XX 11 == 10 10 10 6 1-3 30.00
2 TRACE 1.1.2 00 XX =1 11 =1 == 6 1-3 27.50
3 BigLion 2.17 == =0 XX =1 11 == 6 1-3 27.25
4 RDChess 3.15 01 00 =0 XX =1 =1 4= 4 20.25
5 BigLion 2.23f 01 =0 00 =0 XX 11 4 5 18.25
6 RDChess 3.22 01 == == =0 00 XX 3= 6 20.25
----------------------------------------------------
RDChess V3.15 scored better than V3.22, after only 10 games each.
(And seemingly the newer BigLion V2.23f is worse than the older V2.17!?)
I personally believe RDChess V3.22 is stronger than V3.15. I think 20 games are too few to compare the strength of two engines that do not differ much in strength.
I always run extensive tests before releasing a new RDChess version: matches against the previous RDChess version or against "benchmark engines" (I have been using an older GNU Chess version, V5.02+, for this purpose).
While watching such a match, e.g. V3.21 against V3.22, the older version V3.21 may lead 12-1-2 and I think "ugh, V3.22 is much worse!". But at the end of my standard 60-game benchmark tourney the final score is e.g. 25-28-7.
So one has to be careful about drawing conclusions from a few games!
Rudolf
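
[Editorial aside: the 25-28-7 example above can be checked with a standard binomial estimate. A small Python sketch; the game counts are from the post, and the 1.96 factor is the usual 95% normal approximation, so this is a rough bound rather than an exact analysis.]

```python
import math

# Rudolf's 60-game benchmark example: 25 wins, 28 losses, 7 draws.
wins, losses, draws = 25, 28, 7
n = wins + losses + draws
scores = [1.0] * wins + [0.0] * losses + [0.5] * draws

p = sum(scores) / n                        # score fraction (28.5 / 60)
var = sum((s - p) ** 2 for s in scores) / n
se = math.sqrt(var / n)                    # standard error of p
lo, hi = p - 1.96 * se, p + 1.96 * se      # ~95% confidence interval

def elo(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

print(f"score {p:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
print(f"Elo   {elo(p):+.0f}, 95% CI [{elo(lo):+.0f}, {elo(hi):+.0f}]")
```

The 3-point deficit corresponds to about -17 Elo, but the 95% interval still spans roughly -103 to +66 Elo: even 60 games cannot separate versions that are a few dozen Elo apart.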

Hi Rudolf,
I try to use the longest time control for qualifiers (30'+3") to bring it
closer to the one used in the New Long loop (80'+3"), because engines may
perform differently at different time controls.
Qualifiers with many games would last for months and make no sense to me.
BTW, which time control do you use for your tests?
In the rating list of New IL (after 20-25 games) RDChess 3.15 stands higher than 3.21 and 3.19.
In my opinion, ten games are enough to find out that a new version is not
much stronger than the previous one.
Note also that there will be further cycles, and further chances for new
versions.
Best regards,
Igor

I disagree with this: 10 games do not suffice at all.
I have too often observed a match starting at 7:3 and ending up 13:17.
Statistical fluctuations are really annoying.
Uli

The question is what you mean by the word "much".
I think that Igor does not consider 50 Elo as much.
Uri

You are right. The evaluation depends on the difference in Elo. If version n+1 is stronger than version n by 300 Elo points, even 10 games may give an indication.
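
[Editorial aside: the 300-Elo case can be put in numbers. Under the standard logistic Elo model the expected score at an Elo advantage D is E = 1/(1 + 10^(-D/400)); the sketch below assumes that model and shows what a 10-game match "should" look like at various gaps.]

```python
# Expected score under the standard logistic Elo model:
# E = 1 / (1 + 10^(-D/400)) for an Elo advantage D.
def expected_score(elo_diff):
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

for d in (10, 20, 50, 300):
    e = expected_score(d)
    print(f"{d:4d} Elo -> expected score {e:.3f}"
          f" ({e * 10:.1f} points out of 10 games)")
```

A 300-Elo gap predicts roughly an 8.5-1.5 result, which even 10 games can plausibly reveal; a 20-50 Elo gap predicts only about 5.3-5.7 points out of 10, indistinguishable from an even match.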

As Uri already said, it really depends on what MUCH means.
Please read my sentence once more:
"In my opinion, ten games are enough to find out that a new version is not
MUCH STRONGER than the previous."
I may also add that human life is too short...
After the third cycle of New IL I have 168 participants, and this number will grow with each further cycle. Yes, 100 games for a new version to be qualified would be a good approximation, but... it would take years and years to run.
I try to find a happy medium between the duration of my life and the number of games.
Best regards,
Igor

A 50-Elo gain with a new version seems a very good success to me, so I'd say that this is MUCH. :-)
Where are the developers improving playing strength by 200 or more Elo with a new version?
It's an illusion to expect this.

I improved by something close to 200 Elo when I released 00_799 after 00_7a.

You really don't know that beforehand.
There are more examples of big improvements: I remember DanChess, which improved significantly from 1.01 to 1.02 thanks to making the program 8 times faster.
The way to do it is not to release new versions very often.
There are already enough free programs, and I do not need to release a new version only because it is 10-20 Elo better.
Uri

That's OK.
Just out of curiosity, and by request of U.Tuerke and Slobodan R. Stojanovic, I am ready to run my qualifiers a second time so that all new versions gain 20 games. It takes 7 to 10 days more (for 7 qualifiers).
Then we can compare the results of the 2-round and 4-round events.
It also means that the previous promotions/failures will be reconsidered according to the new results.
The first qualifier starts this evening.
The participants of the first qualifier:
=========================================
New IL-4, Ver-qual 1 (now results of 2 rounds)
P3 1GHz 256MB, 2003.12.15 - 2003.12.16
1 2 3 4 5 6 S P B
----------------------------------------------------
1 TRACE 1.25 XX 11 == 10 10 10 6 1-3 30.00
2 TRACE 1.1.2 00 XX =1 11 =1 == 6 1-3 27.50
3 BigLion 2.17 == =0 XX =1 11 == 6 1-3 27.25
4 RDChess 3.15 01 00 =0 XX =1 =1 4= 4 20.25
5 BigLion 2.23f 01 =0 00 =0 XX 11 4 5 18.25
6 RDChess 3.22 01 == == =0 00 XX 3= 6 20.25
----------------------------------------------------
Games 30/30, +12 -7 =11
Best regards and Happy New Year!
Igor

Hi Igor,
thanks for your reply. My posting was not meant to criticize the way you run your tourneys. One always has to make a compromise about how much time to spend testing the engines, in times when the number of winboard engines increases steadily.
But I would really be surprised if your second qualifier showed the same result as the first! I am eagerly looking forward to the result!
Thanks for your great work running your infinite loops, and
a Happy New Year !
Rudolf

Note that the SSDF does not (usually) publish results before the minimum of 200 games is reached. They have very good reasons for this.
Usually we are hunting (at least I am) for improvements of 10-20 Elo when switching from version n to n+1. 10 games are really hopeless in this case.
For instance, there have been times when testers reported to me that version n+1 played far weaker than version n, although I knew I had only made a tiny and very plausible bug fix. :-)
I just ignore reports like this.
Uli

I agree with this!
Uri
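
[Editorial aside: "10-20 Elo is hopeless in 10 games" can be quantified. The sketch below uses the logistic Elo model and a normal approximation to estimate how many games are needed before a ~95% interval on the score excludes an even match; the per-game standard deviation sigma = 0.47 is an assumed typical value at moderate draw rates, so this is a back-of-the-envelope bound, not an exact power calculation.]

```python
import math

# Logistic Elo model: expected score at a given Elo advantage.
def expected_score(elo_diff):
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# Rough sample size so that a two-sided ~95% interval on the
# score fraction excludes 50% when the true edge is elo_diff.
# sigma is an ASSUMED typical per-game standard deviation.
def games_needed(elo_diff, sigma=0.47, z=1.96):
    margin = expected_score(elo_diff) - 0.5
    return math.ceil((z * sigma / margin) ** 2)

for d in (300, 100, 50, 20):
    print(f"{d:4d} Elo -> about {games_needed(d):5d} games")
```

With these assumptions a 300-Elo gap resolves in under 10 games, 100 Elo needs a few dozen, 50 Elo well over a hundred, and 20 Elo on the order of a thousand, which fits both Igor's use of 10 games to screen out large regressions and Uli's pessimism about 10-20 Elo.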

This can sometimes be true because of some compiler decision.

This depends of course on the particular change.

Hi Igor,
I have to admit that for the benchmarks I run when releasing new RDChess versions I use shorter time controls (3'+0", 10'+0", 10' for 40 moves, seldom longer ones). For the same reasons you mentioned (I do not want to spend my whole life testing), I assume that RDChess's "strength distribution over different time controls" does not change very much between versions.
This does not mean that RDChess's strength does not depend on the time control.
RDChess certainly plays stronger against other engines at shorter time controls. But I normally assume that this behaviour does not change with a newer version; put sloppily, I assume that an improvement of RDChess raises its strength (nearly) equally at all time controls.
Feedback from your tournaments - especially your IL Long - is therefore very important to me.
Rudolf