Winboard Forum

Posted: **10 Dec 2004, 19:51**

Hi,

I am looking for a description of the algorithm used by EloStat. In particular, I wonder how the uncertainty margins are computed. The docs of EloStat mention articles in CSS magazine. I asked on the CSS forum, but got no reply. Can anybody here help me ?

Thanks,

R?mi

Posted: **10 Dec 2004, 21:40**

I'm not exacty sure how he is doing it but I think he was one of the persons who asked for the SSDF way many years ago. This is a copy from a description of the SSDF way I gave in discussion forum (I have forget which one):

Code: Select all: 1) Use number of wins, loss, and draws W = number of wins, L = number of lost, D = number of draws n = number of games (W + L + D) m = mean value 2) Apply the following formulas to compute s ( SQRT: square root of. ) x = W*(1-m)*(1-m) + D*(0.5-m)*(0.5-m) + L*(0-m)*(0-m) s = SQRT( x/(n-1) ) 3) Compute error margin A (use 1.96 for 95% confidence) A = 1.96 * s / SQRT(n) 4) State with 95% confidence: The 'real' result should be somewhere in between m-A to m+A 5) Lookup the ELO figures with the win% from m-A and m+A to get the lower and higher values in the error margin.

/Peter

Posted: **11 Dec 2004, 12:12**

Thanks Peter. I'll try to implement it and see if I get the same results.

R?mi

Posted: **12 Dec 2004, 23:31**

Elostat calculates error margins wrong, at least in some cases. Some Monte Carlo simulations of mine gave the same error margins, as those in Peter's formula (which he showed to me in a previous CCC discussion).

I forgot the details. Perhaps, there was still one issue with Peter's formula, in case you know, that players perform significantly differently with white and with black, and you also know the number of actual games played with white and black. Peter's formula just assumes a neutral color, and won't use that additional knowledge.

I also just saw your question in the CSS forum. Did you receive any relevant article?

Regards,
Dieter

Posted: **12 Dec 2004, 23:47**

Somewhere down the thread http://chessprogramming.org/cccsearch/c ... ead=340554 you can see an example with strange error margins given by Elostat. It was actually the interesting property pointed out by you ("draws don't count"), that inspired some experimenting of mine.

Regards,
Dieter

Posted: **13 Dec 2004, 03:09**

Dieter B?r?ner wrote:I also just saw your question in the CSS forum. Did you receive any relevant article?

Someone will scan the paper and send it to me. I will give information here when I receive it.

R?mi

Posted: **16 Dec 2004, 14:43**

Ok, I think I know it now. In a tournament, calculate the performance p_i of each player. Give every player an initial rating (the same for everybody). Calculate

E_i = E_average_of_opponents + 400 * log10(p_i/1-p_i);

Now you have the first estimate. Here the average Elo of the opponents is the same for all. But now, start to iterate, with the newly calculated E_i calculate a new average rating of the opponents, and new E_i until the E_is have converged. For the error estimation, assume this was a single match against one opponent (with average strength) and use a formula like Peter has shown above, but calculate a positive and a negative error seperately.

s+/- = \sqrt{1/N+/- \sum_{i=1}^{N+/-}{x_i-x_av}^2}

\sigma+/- = s+/- / \sqrt{N-1}

x_i is the result of the ith game, 1 for win, 0.5 for draw, 0 for loss. N+ and N- are the number of the games, that have better/worse result than the average.

My comments: This is the first time, I see such a seperate s+ and s-. Is it a "normal" thing to do it?

Using average ratings of opponents seems an obvious simplification. For this reason, the error calculation is also somewhat simplified.

For those, who have time and interest: http://www.stat.psu.edu/~dhunter/papers/bt.pdf

Implement this model (perhaps preferably the one with ties and homefileld=white advantage), and compare wit Elostat. Perhaps statisticans here already have a library for it.

I remember, that Uri has pointed out some artefacts of Elostat calculated ratings, but I don't remember them anymore. What was it?

Regards,
Dieter

Posted: **16 Dec 2004, 17:02**

Dieter B?r?ner wrote:Ok, I think I know it now. In a tournament, calculate the performance p_i of each player. Give every player an initial rating (the same for everybody). Calculate

E_i = E_average_of_opponents + 400 * log10(p_i/1-p_i);

What if p_i = 1 or 0 ?

Dieter B?r?ner wrote:s+/- = \sqrt{1/N+/- \sum_{i=1}^{N+/-}{x_i-x_av}^2}

\sigma+/- = s+/- / \sqrt{N-1}

What if N=1 ?

Dieter B?r?ner wrote:For those, who have time and interest: http://www.stat.psu.edu/~dhunter/papers/bt.pdf

Implement this model (perhaps preferably the one with ties and homefileld=white advantage), and compare wit Elostat. Perhaps statisticans here already have a library for it.

I am implementing all kind of stuff right now, so I may try this.

One of my main ideas is to use Bayesian inference to estimate likelihood distributions over ratings. I will post my results here soon.

R?mi

Posted: **16 Dec 2004, 20:39**

>>E_i = E_average_of_opponents + 400 * log10(p_i/1-p_i);

>What if p_i = 1 or 0 ?

From readme.rtf of Elostat:

Following this theory [the CSS articles], the Elo rating corresponding to a relative performance of 100 % or 0 % is indefinite. Due to mathematical reasons (e.g. to guarantee the feasibility of the iteration procedure) ELOStat assigns to those programs a finite Elo value which is exactly 600 points smaller (0 % perf.) or greater (100 % perf.) than the Av.Op. Elo. Or in other words: ELOStat does not support Elo differences greater than  600 points (therefore the 95% error margins can be at most  1200 points). For nearly all practical purposes, this restriction does not play an important role.

>>s+/- = \sqrt{1/N+/- \sum_{i=1}^{N+/-}{x_i-x_av}^2}

>>\sigma+/- = s+/- / \sqrt{N-1}

>What if N=1 ?

I just tried Elostat, and it gives 0.0 error margin for a "match" with +0 -0 =1.

One thing, I forgot to mention (it was not in the article, but is in readme):

The data file ?cluster.dat? also contains the value of the so called iteration offset (itoffset). This is a special characteristic of the iteration algorithm, which normally is not of great interest for practical purposes. Nevertheless it is described shortly: In order to guarantee the conver-gence of the iteration procedure, after each iteration the average value of all Elo performances (weighted by the individual number of games played by each program) must be equal to the Elo start value. Due to the nonlinearity between the relative performance (in %) and the corre-sponding Elo value (more exactly: Elo difference), a constant offset appears that must be added after each iteration in order to guarantee the constancy of the Elo mean value.
This offset leads to the situation that the Av.Op. Elo given in the rating list is different from the value obtained by taking the average Elo values from the first column. This difference exactly corresponds to the iteration offset explained above. Or in other words: The Elo per-formance of each program in the rating list is itoffset times higher than can be expected from its relative performance and its Av.Op. value. This systematic offset is valid for all programs in similar manner. Therefore, the Elo differences between the programs, i.e. the relative playing strengths are not affected by the iteration offset. In most cases the value of itoffset is smaller than 1 Elo point, but in some cases it can amount to 50 points and more.

Dieter

Posted: **16 Dec 2004, 23:17**

Thanks Dieter,

Here are some preliminary result with Bayesian Elo:
http://remi.coulom.free.fr/Bayesian-Elo/

What do you think ?

R?mi

Posted: **17 Dec 2004, 01:38**

R?mi Coulom wrote:Thanks Dieter,

Here are some preliminary result with Bayesian Elo:
http://remi.coulom.free.fr/Bayesian-Elo/

What do you think ?

R?mi

Yes I think that a Bayesian approach is more sound. Using mixed Bayesian/likelihood approaches is the more modern way to go compared to the old frequency Elo! I'm not sure however that it in practice will make life better. By not using the same approach for computer chess as human chess it will be even harder to compare the different pools. It is already hard...
After a quick look: Why shouldn't we assume that the strength is constant for chess programs? I think that this is one of the things that we can assume and utilize when estimating computer programs strenght.
Maybe you meant estimated strength?
Have you looked at how sensitive the result is on alpha=0.5? As I understand it, the impact of alpha varying between 0.4 and 0.6 can be ignored but I may have missed something.
The maximum-likelihood mentioned or a similar estimation seems to be doable. Is this maximum-likelihood standard these days?

/Peter

Posted: **17 Dec 2004, 09:29**

Peter Fendrich wrote:After a quick look: Why shouldn't we assume that the strength is constant for chess programs?

Yes, we should assume that chess programs have a constant strength. When I mentioned time-varying strength, I was thinking about applying the method to humans. It is not necessary for computers.

Peter Fendrich wrote:Have you looked at how sensitive the result is on alpha=0.5? As I understand it, the impact of alpha varying between 0.4 and 0.6 can be ignored but I may have missed something.

No, but that should be easy to do. I do not expect a huge influence of alpha, especially when the number of games grows.

Peter Fendrich wrote:The maximum-likelihood mentioned or a similar estimation seems to be doable. Is this maximum-likelihood standard these days?

After thinking a little more about it, I do not believe that the maximum likelihood is the best thing to compute. As the figure shows, there may be more than one maximum. Also, likelihood distributions are often assymetric, so the expected rating is very different from the maximum-likelihood rating.

Here is how I would do it. It is closer to the ELOStat iterative approach:
- start with all elos to 500
- iterate a procedure that replaces the elo of each player by the expected Elo obtained by Bayesian inference, assuming the Elos of opponents are their true Elos.
- Finally, compute confidence interval from the likelihood of every player, assuming their opponents have true Elos.

I will try this and make the program available.

R?mi

Posted: **17 Dec 2004, 16:14**

R?mi Coulom wrote:After thinking a little more about it, I do not believe that the maximum likelihood is the best thing to compute. As the figure shows, there may be more than one maximum. Also, likelihood distributions are often assymetric, so the expected rating is very different from the maximum-likelihood rating.

Ok, I became a little bit suspicious... :-)

Here is how I would do it. It is closer to the ELOStat iterative approach:
- start with all elos to 500
- iterate a procedure that replaces the elo of each player by the expected Elo obtained by Bayesian inference, assuming the Elos of opponents are their true Elos.
- Finally, compute confidence interval from the likelihood of every player, assuming their opponents have true Elos.

I will try this and make the program available.

R?mi

Great!
/Peter

Posted: **19 Dec 2004, 19:40**

R?mi Coulom wrote:I will try this and make the program available.

Here it is:
http://remi.coulom.free.fr/Bayesian-Elo/
I have implemented the maximum-likelihood algorithm, after all.

This program still computes the maximum likelihood in a very inefficient manner. The paper that Dieter indicated contains much better algorithms. I will implement them later. Unfortunately, the author does not give the formula in the case with ties _and_ home-field advantage. I guess I will have to understand the theory to figure it out myself. Or I will email David Hunter.

Dieter, if you understand the paper well, and can provide the formula, this would be very welcome.

Bye,

R?mi

Posted: **21 Dec 2004, 07:40**

Hello, I've been running Engines Matches in my computer for a while and always wanted to calculate the programs' rating, but I didn't like ELOStat because if you ran a tournament of 4 programs ("A", "B", "C" and "D") and another with other 4 programs ("E", "F", "G", and "H") you may get a PGN with 32 games, ELOStat is ok, but once you enter the PGN a game of program "I" beating program "C" and losing to program "F", the ratings are messed up, you might need to use separated PGNs and the ratings are lost; the solution of course is: Instead of "All the programs start with the same rating", "Each program can be assigned a rating that may be different than other program's rating before the new rating calculation starts" seems like a fancy feature, but I'm affraid it's beyond humanity

.

I downloaded Bayeselo, but "No command line parameter found". To solve this I may either: Run it from the MSDOS Prompt and enter the commands in real time; make a bayeselo.bat file containing the commands so I just double click it and the work is done; or create a shortcut and add the commands in the Destiny box.

I still have to figure out what is my best option and how to do it, but being able to see an online "How to use it" page would be nice

Best regards.

Posted: **21 Dec 2004, 09:25**

Ulysses Omycron wrote:Hello, I've been running Engines Matches in my computer for a while and always wanted to calculate the programs' rating, but I didn't like ELOStat because if you ran a tournament of 4 programs ("A", "B", "C" and "D") and another with other 4 programs ("E", "F", "G", and "H") you may get a PGN with 32 games, ELOStat is ok, but once you enter the PGN a game of program "I" beating program "C" and losing to program "F", the ratings are messed up, you might need to use separated PGNs and the ratings are lost; the solution of course is: Instead of "All the programs start with the same rating", "Each program can be assigned a rating that may be different than other program's rating before the new rating calculation starts" seems like a fancy feature, but I'm affraid it's beyond humanity ;).

That feature would be very easy to add. I will do it.

Also, I wonder what you mean by "the ratings are messed up". The ratings of E, F, G, and H will go up a lot, and the ratings of A, B, C and D will go down a lot. Is this what you'd like to avoid ? I believe Leo Dijksman has the same problem with his divisions, and this is the reason why he does not use ELOStat. Assigning values to some players should fix the problem, but I will try to think about better solutions.

I downloaded Bayeselo, but "No command line parameter found". To solve this I may either: Run it from the MSDOS Prompt and enter the commands in real time; make a bayeselo.bat file containing the commands so I just double click it and the work is done; or create a shortcut and add the commands in the Destiny box.

I still have to figure out what is my best option and how to do it, but being able to see an online "How to use it" page would be nice :)

Best regards.

Right now, my program is in a very experimental state. So, I did not try to make it particularly user-friendly. I plan to make a GUI once the basic algorithm is efficient and robust. But I still have to work on the math. Once I have a good-working fast algorithm, I will make the program easier to use, and advertise it more widely.

Thanks for your interest,

R?mi

Posted: **21 Dec 2004, 09:28**

I will follow the development with interest too!

Best regards and merry Xmas in advance,
Guenther

Posted: **21 Dec 2004, 11:04**

R?mi Coulom wrote:Dieter, if you understand the paper well, and can provide the formula, this would be very welcome.

Sorry, no I don't understand enough (yet?). I had the impression, that I saw some mathematica or matlab code supporting Hunter's paper. Perhaps you can find it.

Cheers,
Dieter

Posted: **21 Dec 2004, 11:15**

Dieter B?r?ner wrote:Sorry, no I don't understand enough (yet?). I had the impression, that I saw some mathematica or matlab code supporting Hunter's paper. Perhaps you can find it.

Cheers,
Dieter

In fact, the paper is not so hard to understand. I have understood the principle of the formulas he gives, so I should be able to figure out the formula to combine home-field advantage and ties by myself. I will probably do that right after the holidays.

bye,

R?mi

Posted: **23 Dec 2004, 12:36**

R?mi Coulom wrote:That feature would be very easy to add. I will do it.

You are beyond humanity

R?mi Coulom wrote:Also, I wonder what you mean by "the ratings are messed up". The ratings of E, F, G, and H will go up a lot, and the ratings of A, B, C and D will go down a lot. Is this what you'd like to avoid ?

Yes, I want to have all the games on one PGN, but if a new engine arrives and I make a new tourney with some Engines of Tourneys 3, 5 and 7, and some new engines, just imagine what will happen with the ratings after some loses, draws and wins "Rock-Paper-Scizor" style. At low levels some programs will fall the 1200 rating and others will jump the 3200 rating, without mention that the programs in the middle will have weird ratings (I even got some that have a higher rating and lower Perf% after the same amount of games). At higher levels you may get engines rated at 4000 (I haven't seen negative ratings) or so... (Please tell me how this effect is called as it doesn't happen like this if the ratings are calculated 1 game at a time).

R?mi Coulom wrote:I believe Leo Dijksman has the same problem with his divisions, and this is the reason why he does not use ELOStat. Assigning values to some players should fix the problem, but I will try to think about better solutions.

I really hope your program to become the new standard for rating calculations, you might try changing it's name to ELOStat and using it under arena (If it creates the same files, there will be no problem). [edit]Thsi part sound confusing, I meant to do it just to test it under a GUI.[/edit]

I really don't know anything about rating formulas (Just the +400 -400 provisional stuff), but in this page you may find information about something "Better than ELO": http://www.chessbase.com/newsdetail.asp?newsid=562

This page has kewl features that may be added to your GUI: http://www.uschess.org/ratings/ratingcalc9.html (Calculating one rating after each game solves all problems [I think], but I don't know if it'd be too slow to compute).

I have never liked that in most tourneys the programs start with a fixed rating (From 2000 to 2700), I think the programs should deserve their rating, something like "If the new program faces another new program and they draw, both get a 1500 Rating, if it loses 1300, if it wins 1700; then average the ratings of the enemy +400 (If win) and -400 (If lose) or with the enemy rating (if draw) [This must be done after the first match or if the first match is against a no new program that already has a rating]; Make no calculations If the rating difference is more than 400 (They should be separate leagues and no game should be played)"; This works for at least 25 games (In where normal calculation takes effect and the rating is no longer provisional) and the Program has deserved it's rating (Instead of just start with 2300 like everybody else). You may need to add a "Game-Count" that can be entered like the rating (So if it's higher than 25, rating calculation is made normally).

I had more ideas that just dissappeared from my mind, but I may remember them later (If you are interested

).

Best wishes.

Winboard Forum

ELOStat algorithm ?

ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Bayes. Elo program

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?

Re: ELOStat algorithm ?