There is too much uncertainty when using small samples for statistical purposes. E.g.: if I flip a coin 4 times and it lands heads on 3 of them, can I then conclude that heads is 3 times more likely to appear than tails? How many times would I have to flip the coin to 'prove' that the heads/tails odds are equal?
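To put a number on the coin example, here is a sketch of an exact two-sided binomial test in plain Python (the helper name is mine, not from any library). It computes the probability that a fair coin produces a result at least as lopsided as the one observed:

```python
from math import comb

def two_sided_binom_p(k, n, p=0.5):
    """Exact two-sided binomial test: probability of an outcome at
    least as unlikely as k heads in n flips of a coin with bias p."""
    p_obs = comb(n, k) * p**k * (1 - p)**(n - k)
    # sum the probabilities of every outcome no more likely than the observed one
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1)
               if comb(n, i) * p**i * (1 - p)**(n - i) <= p_obs + 1e-12)

print(two_sided_binom_p(3, 4))   # 3 heads in 4 flips -> 0.625
```

A p-value of 0.625 means 3-of-4 heads is entirely consistent with a fair coin, which is exactly the "too much uncertainty" problem: tiny samples cannot distinguish a fair coin from a biased one.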

You can read this page from the Chess Programming Wiki:

http://chessprogramming.wikispaces.com/Match+Statistics

Elo, Elo difference, and Likelihood of Superiority [LOS] are the most often-used comparison measurements.
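LOS is easy to compute from a match result. A common approximation (the one given on the Chess Programming Wiki page linked above) ignores draws, since they carry no information about which engine is stronger; the function name here is my own:

```python
from math import erf, sqrt

def los(wins, losses):
    """Likelihood of superiority: probability that engine A is stronger
    than engine B given the match result (draws are ignored in this
    normal approximation)."""
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

# 60 wins vs 40 losses (any number of draws): roughly 97.7% likely superior
print(round(los(60, 40), 3))
```

Note that LOS only tells you *whether* one engine is stronger, not *by how much*; for the size of the gap you still need the Elo difference and its error bars.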

The only way to avoid playing many games yourself is to look at what other people have already measured:

CCRL ratings

http://www.computerchess.org.uk/ccrl/404/

Here's a little chart for you (I've forgotten where I got it from, and I don't know how accurate it is):

```
                Confidence
Score     90%    95%    99%
 55%      170    281    550
 60%       46     71    141
 65%       21     30     64
 70%       14     18     35
 75%        9     13     22
 80%        7     11     17
 85%        7      8     14
 90%        4      5     11
 95%        4      5      7
100%        4      5      7
```
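Since I don't know exactly how that chart was built, here is a sketch of how such numbers can be derived from the normal approximation: find the smallest number of games for which the measured score is distinguishable from 50% at the given one-sided confidence (draws ignored; the function name is mine). The results land in the same ballpark as the chart but don't match it exactly, so the chart presumably used slightly different assumptions:

```python
from math import ceil
from statistics import NormalDist

def games_needed(score, confidence):
    """Games needed before a given score is distinguishable from 50%
    at the given one-sided confidence (normal approximation, no draws)."""
    z = NormalDist().inv_cdf(confidence)      # e.g. ~1.645 for 95%
    sigma2 = score * (1.0 - score)            # per-game variance of the result
    return ceil(z * z * sigma2 / (score - 0.5) ** 2)

for s in (0.55, 0.60, 0.65):
    print(s, [games_needed(s, c) for c in (0.90, 0.95, 0.99)])
# 0.55 -> [163, 268, 536], in the same ballpark as the chart's 170/281/550
```

The key structural point survives any choice of assumptions: the required number of games grows like 1/(score - 0.5)^2, which is why a 55% score needs hundreds of games while an 80% score needs a handful.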

'Confidence' is a statistical term. A 90% confidence level means that a conclusion drawn at that level will be right about 90% of the time and wrong about 10% of the time. Notice that, if one engine is much stronger than the other, very few games are needed.

The table indicates that if one engine scores 55%, then for 90% confidence you will need to play 170 games; if you want 99% confidence, you will need to play 550 games.

I'm not sure what class you are taking or what kind of statistics background you have. Mathematically, the shortest possible test for you is to choose the strongest chess engine and compare it to the weakest engine. As soon as your measured Elo difference between the engines exceeds the 99% confidence error bars, you're done.
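For completeness, here is a sketch of how the measured Elo difference and its error bars can be computed from a match score, using the standard logistic Elo formula and a normal approximation of the score's standard error (draws ignored; both function names are mine):

```python
from math import log10, sqrt
from statistics import NormalDist

def elo_diff(score):
    """Elo difference implied by a match score in (0, 1)."""
    return -400.0 * log10(1.0 / score - 1.0)

def elo_interval(score, n, confidence=0.99):
    """Two-sided confidence interval for the Elo difference, from the
    normal-approximation standard error of the score over n games."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    margin = z * sqrt(score * (1.0 - score) / n)
    return elo_diff(score - margin), elo_diff(score + margin)

# A 70% score over 1000 games: point estimate with 99% error bars
print(round(elo_diff(0.70), 1), [round(b, 1) for b in elo_interval(0.70, 1000)])
```

The stopping rule in the last paragraph then reads: keep playing until the lower bound of this interval is above zero (or the upper bound is below zero), at which point the superiority of one engine is established at the 99% level.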