How to measure playing strength with limited resources?

Archive of the old Parsimony forum. Some messages couldn't be restored.

Posted by: Robert Allgeuer at 05 February 2004 13:16:32
In reply to: "Re: YABRL: Aristarch 4.37 scores slightly less than Aristarch 4.21" posted by Kurt Utzinger at 05 February 2004 07:25:49
> I am much more interested in knowing the difference in playing strength at 40'/40, 90m+30s or 120'/40 than at 5m+2s, and would like to see such comparisons.
> Kurt

Me too!
The underlying question is: how can we obtain good estimates of the playing strength of engines with _limited_ computing resources? It is agreed that the ultimate way to test would be to play a large number of games at longer time controls, but doing this properly in an acceptable timeframe would require five to ten identical computers, which not all of us can afford.
With these constraints in mind we have three choices:
1) Selective matches between engines (List vs. Ruffian, then Aristarch vs. Shredder etc.) at long time controls. This is certainly interesting, but gives only selective information, not an overall picture.
2) A rating list with relatively few games at long time controls, which results in a list with wide error margins.
3) A rating list at shorter time controls, but with a higher number of games, which results in a list with narrow error margins, although of course under Blitz conditions.
All approaches are OK and interesting; I went for number 3. I did number 2 a while ago (rating list not posted), but I realised that still having error margins of 60 to 70 Elo after half a year is not what I want and does not help me much in measuring the real playing strength.
I do believe that there is a correlation between Blitz performance and performance at longer time controls. In fact, my rating list looks quite similar to other lists at longer time controls (with some notable exceptions), which supports this assumption. It is also no coincidence that I test with an increment of 2 seconds, because this eliminates some of the typical Blitz factors: the big influence of time management, the high number of wins on time and the instant moves at the end of a long game.
There are of course the exceptions of the Blitz experts (Delfi, Pepito, Yace, Amyan, Knightdreamer etc.), which go down in rating lists with longer time controls, but such tendencies are either known already or can be spotted with a few selective additional matches at longer time controls, leading overall to a pretty good picture of where each engine fits.
Take El Chinito 3.25, which I currently have running: at the moment probably not too many of us know where El Chinito 3.25 really fits; after the Blitz results we will have some idea. A few additional matches at longer time controls against some of the more balanced engines, such as Ruffian or Aristarch, and a comparison of El Chinito's performance with its performance against the same opponents in Blitz would finally give a pretty good picture of El Chinito's characteristics.
Robert
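
The trade-off between options 2 and 3 above comes down to how slowly error margins shrink with the number of games. Here is a rough sketch in Python, assuming a normal approximation, opponents of roughly equal strength (score near 50%) and a hypothetical draw rate of 30%. The function name and the figures are illustrative and not taken from any rating tool:

```python
import math

# Elo points per unit of score fraction near a 50% score
# (slope of the logistic Elo curve at p = 0.5).
ELO_PER_SCORE = 400 / (math.log(10) * 0.25)


def games_for_margin(target_elo, draw_rate=0.30, z=1.96):
    """Approximate games needed for a +/- target_elo margin (95% for z=1.96).

    Assumes evenly matched opponents, so wins and losses are equally likely.
    The 30% default draw rate is a hypothetical value, not measured data.
    """
    # Standard deviation of one game's score (1, 0.5 or 0) when the mean is 0.5.
    per_game_sd = math.sqrt((1 - draw_rate) / 4)
    return math.ceil((z * ELO_PER_SCORE * per_game_sd / target_elo) ** 2)


for margin in (65, 30, 20):
    print(f"+/-{margin} Elo: about {games_for_margin(margin)} games per engine")
```

Because the margin falls with the square root of the number of games, halving it takes about four times as many games, which is why a list at long time controls converges so slowly on a single machine.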

Re: How to measure playing strength with limited resources?

Posted by: Kurt Utzinger at 05 February 2004 13:22:14
In reply to: "How to measure playing strength with limited resources?" posted by Robert Allgeuer at 05 February 2004 13:16:32


Hi Robert
Many thanks for your detailed answer. I am in the happy situation that chess friend Rolf Bühler has four identical PCs running, so it is possible to play a great number of games even at longer time controls; see under "Tournament" at http://www.utzingerkurt.com
Kind regards
Kurt

Re: How to measure playing strength with limited resources?

Posted by: Uri Blass at 05 February 2004 13:57:58
In reply to: "How to measure playing strength with limited resources?" posted by Robert Allgeuer at 05 February 2004 13:16:32
> There are of course the exceptions of the Blitz experts (Delfi, Pepito, Yace, Amyan, Knightdreamer etc.)
I do not think that Yace is a Blitz expert.
Yace is a premier division engine at long time controls.
I do not think that the other engines you describe as Blitz experts are weak at long time controls.
All of them, with the exception of Knightdreamer, are at least in the first division of Leo.
Uri

Re: How to measure playing strength with limited resources?

Posted by: Robert Allgeuer at 05 February 2004 14:06:00
In reply to: "Re: How to measure playing strength with limited resources?" posted by Kurt Utzinger at 05 February 2004 13:22:14
> Hi Robert
> Many thanks for your detailed answer. I am in the happy situation that chess friend Rolf Bühler has four identical PCs running, so it is possible to play a great number of games even at longer time controls; see under "Tournament" at http://www.utzingerkurt.com
> Kind regards
> Kurt
... and it is good that we test different things, so we get a better overall picture.
Robert

Re: How to measure playing strength with limited resources?

Posted by: Robert Allgeuer at 05 February 2004 14:12:36
In reply to: "Re: How to measure playing strength with limited resources?" posted by Uri Blass at 05 February 2004 13:57:58
> I do not think that Yace is a Blitz expert.
> Yace is a premier division engine at long time controls.
> I do not think that the other engines you describe as Blitz experts are weak at long time controls.
> All of them, with the exception of Knightdreamer, are at least in the first division of Leo.
> Uri

All these engines are strong, so there is no contradiction in their playing in the premier division, the first division and so on. When I say Blitz experts, it just means that they perform _comparatively_ better at shorter time controls. Delfi, for example, is among the top five or six free engines in Blitz, but not at longer time controls.
Yace, likewise, is almost exactly as strong as Little Goliath in Blitz, but not at longer time controls, and so on.
Heinz van Kempen sent me his Elo comparison chart for Blitz and longer time controls, and it was interesting to see that he came to the same conclusion.
Robert

Re: How to measure playing strength with limited resources?

Posted by: Igor Gorelikov at 05 February 2004 15:37:12
In reply to: "How to measure playing strength with limited resources?" posted by Robert Allgeuer at 05 February 2004 13:16:32
> I do believe that there is a correlation between Blitz performance and performance at longer time controls
In general you are right, but there are so many exceptions...
Here are just two examples of a huge difference at different time controls (on the same hardware):

Time control  Change in places  Place  Program                 Elo   +   -   Games  Score   Av.Op.  Draws
rapid         +1                5      Green Light Chess 3.00  2681  80  64  72     53.5 %  2657    29.2 %
blitz         -15               28     Green Light Chess 3.00  2430  66  83  72     49.3 %  2435    23.6 %
rapid         +2                8      Aristarch 4.4           2659  68  64  81     59.9 %  2589    30.9 %
blitz         +2                49     Aristarch 4.4           2382  77  73  72     56.2 %  2339    20.8 %

blitz means 5'+3"
rapid means 15'+3"
Best regards,
Igor Gorelikov
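
The Elo figures in the table are consistent with the standard logistic Elo relation between score and rating difference. Here is a short Python check, assuming each rating is simply the average opponent rating plus the difference implied by the score. The tool that produced the table is not named in the post and may compute it differently, and `elo_advantage` is an illustrative helper:

```python
import math


def elo_advantage(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return -400 * math.log10(1 / score - 1)


# (label, score fraction, average opponent rating, listed Elo) from the table.
rows = [
    ("Green Light Chess 3.00 rapid", 0.535, 2657, 2681),
    ("Green Light Chess 3.00 blitz", 0.493, 2435, 2430),
    ("Aristarch 4.4 rapid", 0.599, 2589, 2659),
    ("Aristarch 4.4 blitz", 0.562, 2339, 2382),
]

for label, score, avg_opp, listed in rows:
    estimate = avg_opp + elo_advantage(score)
    print(f"{label}: estimated {estimate:.0f}, listed {listed}")
```

The rapid and blitz rows appear to come from separate pools (the average opponent ratings differ by more than 200 points), so the rows are probably better compared by place than by absolute Elo.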

Re: How to measure playing strength with limited resources?

Posted by: Uri Blass at 05 February 2004 17:03:36
In reply to: "Re: How to measure playing strength with limited resources?" posted by Igor Gorelikov at 05 February 2004 15:37:12
> > I do believe that there is a correlation between Blitz performance and performance at longer time controls
> In general you are right, but there are so many exceptions...
> Here are just two examples of a huge difference at different time controls (on the same hardware):
>
> Time control  Change in places  Place  Program                 Elo   +   -   Games  Score   Av.Op.  Draws
> rapid         +1                5      Green Light Chess 3.00  2681  80  64  72     53.5 %  2657    29.2 %
> blitz         -15               28     Green Light Chess 3.00  2430  66  83  72     49.3 %  2435    23.6 %
> rapid         +2                8      Aristarch 4.4           2659  68  64  81     59.9 %  2589    30.9 %
> blitz         +2                49     Aristarch 4.4           2382  77  73  72     56.2 %  2339    20.8 %
>
> blitz means 5'+3"
> rapid means 15'+3"
> Best regards,
> Igor Gorelikov
I believe that the difference is due to statistical error and that you simply do not have enough games.
I do not believe that there can be a huge difference between 5+3 and 15+3 unless the program has a big bug in handling one of the time controls.
Uri

Re: How to measure playing strength with limited resources?

Posted by: Igor Gorelikov at 05 February 2004 17:36:15
In reply to: "Re: How to measure playing strength with limited resources?" posted by Uri Blass at 05 February 2004 17:03:36
> I believe that the difference is due to statistical error and that you simply do not have enough games.
> I do not believe that there can be a huge difference between 5+3 and 15+3 unless the program has a big bug in handling one of the time controls.
> Uri
I believe that the difference is not because of statistical error.
I believe that 72 games for the given pool is enough to draw some conclusions.
I believe, you believe... We have different "believes" ;-)
Best regards and thoughts,
Igor
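
One way to put numbers on this disagreement is to estimate the error margin of a single 72-game result. The sketch below uses a normal approximation to the per-game score variance, with the score and draw figures of the Green Light Chess rapid row. `elo_interval` is an illustrative helper, and its output will not necessarily match the + and - columns of the table, which may have been computed differently:

```python
import math


def elo_diff(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return -400 * math.log10(1 / score - 1)


def elo_interval(score, draw_rate, games, z=1.96):
    """Approximate confidence interval in Elo, relative to the opposition."""
    wins = score - draw_rate / 2
    # Variance of one game's result: E[X^2] - E[X]^2 with X in {1, 0.5, 0}.
    variance = wins + draw_rate / 4 - score ** 2
    half_width = z * math.sqrt(variance / games)
    return elo_diff(score - half_width), elo_diff(score + half_width)


low, high = elo_interval(score=0.535, draw_rate=0.292, games=72)
print(f"95% interval: {low:+.0f} to {high:+.0f} Elo relative to the opposition")
```

Under these assumptions the interval spans more than 100 Elo, so differences between two 72-game samples need to be large before they can be separated from noise.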