R?mi Coulom wrote:If I understand Huang et al's paper correctly, when a team is made of players 1, 2, 3, and the gamma's of these players (in the sense of gammas in Hunter's paper) are gamma_1, gamma_2, gamma3, then the strength of the team is assumed to be gamma_1 + gamma_2 + gamma_3. This does no[t] seem good. gamma_1 * gamma_2 * gamma_3 should be the right solution (multiplying gammas means adding Elo points). Using the product of gammas is not a problem, since the model can still be minorized-maximized.

Well, this is a model detail and is different for them than you. ELO scores are on the exponential scale, whereas Huang et al.'s are not. They don't exponentiate the difference between scores. It's simply \frac{\lambda_i}{\

lambda_i + \lambda_j} where the \lambdas are the Bradley-Terry strengths of individuals (or groups) i and j. Since they don't exponentiate their scores, they add instead of multiply. Multiplying exponentiated scores is the same as adding the exponents, etc.

R?mi Coulom wrote:I am not sure what you mean by "online".

I mean it in, at least for me, the neural network training sense where you have online vs. batch training. In online training, you update the parameters after every instance (match in the game sense) according to the gradient of the error (or likelihood) with respect to to the parameters. In batch training, the gradient is instead the sum of all the instantaneous gradients taken for each pattern.

Online training is what allows you to approximate time-varying player strengths. In neural network training, it allows you to more efficiently calculate the gradient leading to faster training (see Wilson, D. Randall and Martinez, Tony R. The general inefficiency of batch training for gradient descent learning. Neural Networks, volume 16 (10), pages 1429--1451, Elsevier Science Ltd. Oxford, UK, UK, 2003. available here:

http://axon.cs.byu.edu/detailed_publica ... nn03.batch).

Minorization-Maximization is so efficient, that it could update ratings extremely rapidly after an additional match result is incorporated into the data.

Yes, I can see it is a very efficient algorithm. Kind of a poor-man's Gibb's sampler that gives you just the posterior mode. So what you're saying is you would add the most recent match to the current data and then re-run the entire MM algorithm again. There are 2 problems I have here. 1) As discussed by both of us below, it doesn't take into account time-varying player strengths, and 2) it requires storing information about past matches, something I can't do in my production environment. I can do it for research, but can't use it in the real life situtation I have. At least, I don't think I can. You're welcome to prove me wrong.

I'm not sure what exactly I would need to store for an MM approach in my situation. I have two teams competing and they are almost NEVER composed of the same players. The players mix between the teams, etc. I need to construct the team strengths from the player strengths, and then in turn infer the player strengths from team play. The "number of times Team A beat Team B" is always 0 or 1, never > 1. Hmm. I'll have to think about this. If it's obvious to you, I'm listening.

I do not use an uniform prior in Bayeselo. I don't use a Gaussian either. A Gaussian prior is not very convenient, because it makes it more complicated to find the maximum likelihood. Minorization-maximization does not work with a Gaussian prior. So I use prior draws instead. They fit perfectly into the model. The prior they provide looks like a bump, but it is not a Gaussian (it has longer tails). It is very important to use a prior.

I don't see how using a Gaussian prior will hurt MM, at least not from Hunter's paper. Maybe I missed it. Seems like it just slightly changes the log-likelihood function. At least, that's what it did for my current approach.

It seems to me you're essentially using a beta distribution here.

"Academically-oriented" people do not necessary prefer inefficient methods all the time. Isn't Minorization-Maximization academic ? Minorization-Maximization is great, really.

Ah, no, I know for sure that efficient approximations of a posterior mode are definitely academic research. I just meant their paper doesn't discuss any because, at least according to Shane, JASA cares more about the model and the full posterior than about approximations to it. A paper discussing approximations would go somewhere else (Applied Statistics?). Unless it's like Hunter's paper which unifies the theory and / or a paper that presents a novel approximation approach. Plus, your typical stats Bayesian always wants the full posterior, although I'm guessing your average CS guy assumes this isn't practical.

Although, I did run MCMC on several subsets of my own data involving around 500 players and 500 matches each. I got some nicely formed posteriors, which you don't get when just approximating the mode, etc. Well, unless you assume some distributionapproximation like using the posterior mode as as the mean and the negative inverse of the Hessian as the variance of a multivariate normal, etc. I really like MCMC for that, although I know I could never run it real time. MCMC took several hours per set.

I don't understand why you seem to consider MM so different from stochastic gradient descent. MM is a form of gradient descent. It is an updating approach. The only difference I see between MM and gradient descent is that MM is tremendously more efficient than traditional generic forms of gradient descent, as Hunter shows in his paper.

Hmm, by updating I mean after each match you apply a single update to each parameter (player strengths, means, etc.), instead of rerunning the entire algorithm on all of the data.

I mean an algorithm that doesn't need to track any history but the last match. For my application, I can't store that information.

Maybe what you mean is that bayeselo does not deal with players of time-varying strength. Bayeselo was indeed designed for the evaluation of computer-programs, so time-varying strength was not an issue that I considered.

But the question of MM vs stochastic gradient descent has nothing to do with that. It is possible to design statistical models that deal with players of time-varying strength, and that can be solved with the MM algorithm. In fact, I am working on one right now.

Yes, if you add time to your original model such that a more recent game carries a higher weight. The nice thing about methods like Elo's original approach is that you get that for free. Every time you update a player's rating, you are essentially weighting their old rating by some amount and their current performance by another, and combining them. This allows you to naturally track increasing (or decreasing) performance.

I am sure a similar approach can be based on Minorization-Maximization, with the Luce model to deal with multiplayer games. As I said previously, I am currently working on something along these lines. I'll give more information as soon as I have results to show.

Again, with expectation propagation you have a very fast, memory-less, and naturally time-varying algorithm. The nice thing with TrueSkill is the variance is estimated explicitly instead of calculated afterwards.

Thanks for your response. I've been banging my head against this for a bit now. If I can get MM to work for me, I'll be happy. For now I'm going to try Spall's method.

I'd really like to get variance out of it as well. I guess you can evaluate the Hessian at the mode you get out of MM to do this.

Thanks again.