A Bayesian rating system using W-Stein's identity

1 A Bayesian rating system using W-Stein's identity
Ruby Chiu-Hsing Weng, Department of Statistics, National Chengchi University
Joint work with C.-J. Lin
Ruby Chiu-Hsing Weng (National Chengchi Univ.) 1 / 33

2 Outline
- The rating problem (current section)
- Review online rating systems
- Bayesian approximation methods
- Our approach
- Experiments on game data

3 Rating/Ranking
- paired comparison: chess, tennis
- multiple comparison: car racing, horse racing
- multiple teams/players comparison: doubles tennis, bridge, sports, online games
  - e.g. 3 teams: (A1,A2,A3), (B1,B2), (C1,C2,C3,C4)
Questions: Who will win the next game? Top 10? A global ranking?

4 Online rating
Online rating: an online method for rating.
An online method learns cases sequentially. Procedure:
- predict the next outcome
- soon the outcome is available
- refine the prediction model
- discard cases after learning; suited to large-scale data
Online rating: rank individuals / estimate skills (after each game).

5 Outline: Review online rating systems

6 Online rating systems
- Elo (Elo (1960, 1986)): the first chess rating system with probabilistic underpinning
  - US Chess Federation (USCF), World Chess Federation (FIDE), world football league, etc.
- Glicko (Glickman (1992)): Bayesian rating system
  - the Free Internet Chess Server, etc.
- TrueSkill™ (Herbrich, Graepel and Minka (Microsoft Research, 2006)): Bayesian rating system
  - Microsoft Xbox Live (multiple teams/players)

7 Elo vs Glicko
Elo (a physics professor):
- player's skill characterized by strength $\theta$
- $\theta_i \leftarrow \theta_i + K\,(s_i - P(i \text{ wins}))$
Glicko (a statistics professor):
- $\theta \sim \text{Normal}(\mu, \sigma^2)$
- player's skill characterized by $(\mu, \sigma^2)$
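
The Elo rule on this slide, $\theta_i \leftarrow \theta_i + K(s_i - P(i \text{ wins}))$, can be sketched in a few lines. This is only an illustration: the slide does not say how $P(i \text{ wins})$ is computed or which $K$ is used, so the base-10 logistic curve with scale 400 and the value K = 32 below are assumed, commonly quoted conventions, and the function names are hypothetical.

```python
def elo_expected(theta_i, theta_j, scale=400.0):
    """P(i wins) under an ASSUMED base-10 logistic curve (not from the slides)."""
    return 1.0 / (1.0 + 10.0 ** ((theta_j - theta_i) / scale))


def elo_update(theta_i, theta_j, s_i, k=32.0):
    """Apply theta <- theta + K * (s - P(win)) to both players.

    s_i is the observed score of player i: 1 for a win, 0.5 for a draw,
    0 for a loss. K = 32 is an assumed value.
    """
    p_i = elo_expected(theta_i, theta_j)
    new_i = theta_i + k * (s_i - p_i)
    new_j = theta_j + k * ((1.0 - s_i) - (1.0 - p_i))
    return new_i, new_j


# Two equally rated players: the winner gains K/2 and the loser drops K/2.
new_a, new_b = elo_update(1500.0, 1500.0, s_i=1.0)
print(new_a, new_b)
```

Note that only a single number per player is updated; there is no variance term, which is the contrast with Glicko drawn on the slide.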

8 A Bayesian framework
- player's skill: $\theta \sim (\mu, \sigma^2)$
- before the game, $i$'s skill is $N(\mu_i, \sigma_i^2)$
- after the game, $i$'s skill is $N(\mu_i^{new}, (\sigma_i^2)^{new})$, where $\mu_i^{new}$ and $(\sigma_i^2)^{new}$ are the posterior mean and variance of $\theta_i$

9 Outline: Bayesian approximation methods

10 The game model
- prior of strength $\theta_i$: $N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, k$
- $D$: game outcome
- likelihood: $L(\theta) = P(D \mid \theta)$
- the posterior density of $\theta = (\theta_1, \ldots, \theta_k)$: $p(\theta \mid D) \propto p(\theta)\, P(D \mid \theta)$
- goal: derive $E(\theta \mid D)$ and $Var(\theta \mid D)$ quickly and accurately

11 Models for ranked data
$R$: observed rank of $k$ objects of a game.
Thurstone-Mosteller (TM) model (1927):
- $X_i$: unobserved actual performance of object $i$
- $X_i \sim N(\theta_i, \sigma^2)$, where $\theta_i$ is the strength of $i$
- Example, $k = 2$: $R = (1, 2)$ if and only if $X_1 > X_2$, so
  $$L(\theta) = P(X_1 > X_2) = \Phi\left(\frac{\theta_1 - \theta_2}{\sqrt{2\sigma^2}}\right)$$
Bradley-Terry (BT) model: $X_1 - X_2 \sim$ logistic, and
  $$L(\theta) = P(X_1 > X_2) = \frac{\theta_1}{\theta_1 + \theta_2}, \quad \theta_i > 0$$
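
A small sketch of the two paired-comparison likelihoods may help compare them. It uses only the formulas on this slide; the function names are hypothetical, and mapping a real-valued skill $s$ to a positive BT strength via $\theta = e^{s}$ is an assumption made here so that both models can be fed the same skill gap.

```python
import math


def norm_cdf(x):
    """Standard normal CDF, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def tm_win_prob(theta1, theta2, sigma=1.0):
    """Thurstone-Mosteller: P(X1 > X2) = Phi((theta1 - theta2) / sqrt(2 sigma^2))."""
    return norm_cdf((theta1 - theta2) / math.sqrt(2.0 * sigma ** 2))


def bt_win_prob(theta1, theta2):
    """Bradley-Terry: P(X1 > X2) = theta1 / (theta1 + theta2), with theta_i > 0."""
    return theta1 / (theta1 + theta2)


# Same skill gap of 1 under both models (BT strengths are exp(skill), an assumption).
p_tm = tm_win_prob(1.0, 0.0)
p_bt = bt_win_prob(math.exp(1.0), math.exp(0.0))
print(p_tm, p_bt)
```

With the exponential mapping, the BT probability is a logistic function of the skill difference, which is the sense in which $X_1 - X_2$ follows a logistic distribution.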

12 Likelihood for a multiple-team game
Suppose $R = (1, 2, 3, 4)$.
Plackett-Luce (PL) model (a generalization of BT):
  $$L(\theta) = \left(\frac{\theta_1}{\theta_1 + \theta_2 + \theta_3 + \theta_4}\right) \left(\frac{\theta_2}{\theta_2 + \theta_3 + \theta_4}\right) \left(\frac{\theta_3}{\theta_3 + \theta_4}\right)$$
Alternative: first decompose
  $$L(\theta) = P(X_1 > X_2)\, P(X_1 > X_3)\, P(X_1 > X_4)\, P(X_2 > X_3)\, P(X_2 > X_4)\, P(X_3 > X_4),$$
then use TM or BT for each paired comparison. It involves several terms!
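
The two likelihoods on this slide can be written out directly. The sketch below assumes the strengths are passed in the observed rank order (winner first) and uses BT for each pair in the decomposition; the function names and the example strengths are hypothetical.

```python
from itertools import combinations


def plackett_luce(theta_ranked):
    """PL likelihood of the observed ranking; theta_ranked[0] is the winner.

    Each factor is the strength of the object placed at rank r divided by
    the total strength of the objects not yet placed.
    """
    lik = 1.0
    for r in range(len(theta_ranked) - 1):
        lik *= theta_ranked[r] / sum(theta_ranked[r:])
    return lik


def pairwise_bt(theta_ranked):
    """Product of BT terms over all pairs (a ranked above b)."""
    lik = 1.0
    for a, b in combinations(theta_ranked, 2):
        lik *= a / (a + b)
    return lik


theta = [4.0, 3.0, 2.0, 1.0]      # hypothetical strengths, R = (1, 2, 3, 4)
pl = plackett_luce(theta)          # (4/10) * (3/6) * (2/3)
pw = pairwise_bt(theta)            # six pairwise terms
print(pl, pw)
```

For $k$ teams the pairwise decomposition has $k(k-1)/2$ factors against $k - 1$ for PL, which is the "several terms" remark on the slide. For $k = 2$ both reduce to the BT probability.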

13 Bayesian approximations
Bayesian inference often involves intractable integrations. Popular approximation techniques:
- Markov Chain Monte Carlo (MCMC)
  - draws samples approximately from $p(\theta \mid D)$
  - slow for large-scale data
  - we need only $E(\theta \mid D)$ and $Var(\theta \mid D)$
- Laplace method
- variational Bayes: bound the integral by Jensen's inequality

14 Glicko and TrueSkill™
$p(\theta \mid D) \propto p(\theta)\, P(D \mid \theta)$
Glicko: rating of chess players
- two players
- update skills after a rating period (e.g. a tournament)
- better to have 5-10 games per player
- based on the CLT and some ad hoc methods
TrueSkill: rating of players for Microsoft Xbox
- multiple teams/players
- update skills after a single game
- based on EP (expectation propagation)

15 More on TrueSkill™
For multiple-team games, $L(\theta) = P(D \mid \theta) = P_1 P_2 \cdots P_m$.
Expectation propagation:
1. employ Assumed Density Filtering (ADF), which learns $P_1, P_2, \ldots$ sequentially and is sensitive to their order
2. recursively refine it
Update rule:
- analytic form available for a two-team game ($k = 2$)
- numerical integration for $k > 2$ (slower)

16 TrueSkill Ranking System (screenshot of the Microsoft Research project page)
The TrueSkill ranking system is a skill based ranking system for Xbox Live developed at Microsoft Research. The purpose of a ranking system is to both identify and track the skills of gamers in a game (mode) in order to be able to match them into competitive matches. The TrueSkill ranking system only uses the final standings of all teams in a game in order to update the skill estimates (ranks) of all gamers playing in this game. Ranking systems have been proposed for many sports but possibly the most prominent ranking system in use today is ELO.
Ranking Players: So, what is so special about the TrueSkill ranking system? In short, the biggest difference to other ranking systems is that in the TrueSkill ranking system skill is characterised by two numbers:
- The average skill of the gamer (μ in the picture).
- The degree of uncertainty in the gamer's skill (σ in the picture).
The ranking system maintains a belief in every gamer's skill using these two numbers. If the uncertainty is still high, the ranking system ...

17 Outline: Our approach

18 Stein's lemma vs Stein's identity
Stein's lemma (Stein 1961, 1981): the expectations of normal distributions
- $W$ is standard normal iff $E(f'(W)) = E(W f(W))$
- applications: James-Stein estimator, empirical Bayes
Stein's identity (Woodroofe, 1989): the expectations of distributions which are nearly normal, $d\gamma(z) = f(z)\,\phi(z)\,dz$
- coined as W-Stein's identity (Weng and Lin, 2011)
- essentially exchanges the order of integration

19 The basic equations
$Z$: a $k$-dimensional random vector with density $p(z) \propto \phi_k(z)\, f(z)$.
Let $E$ be the expectation w.r.t. $Z$. By W-Stein's identity,
  $$E(Z) = E\left[\frac{\nabla f(Z)}{f(Z)}\right], \qquad E(Z_i Z_j) = \delta_{ij} + E\left[\frac{\nabla^2 f(Z)}{f(Z)}\right]_{ij}$$
- $\nabla$: gradient w.r.t. $z$
- $[\cdot]_{ij}$: the $ij$ component of a matrix
- $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise
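
The two identities can be checked numerically in one dimension. The sketch below is only a sanity check under stated assumptions: $f(z) = \Phi(az + b)$ is a hypothetical choice of the factor multiplying the normal density (a TM-style win probability), the constants `A` and `B` are arbitrary, and the expectations are computed with a plain trapezoidal rule on $[-8, 8]$.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


A, B = 0.8, 0.3  # hypothetical constants in f(z) = Phi(A z + B)


def f(z):
    return Phi(A * z + B)


def df(z):
    """First derivative of f."""
    return A * phi(A * z + B)


def d2f(z):
    """Second derivative of f, using phi'(u) = -u phi(u)."""
    u = A * z + B
    return -A * A * u * phi(u)


def integrate(g, lo=-8.0, hi=8.0, n=4000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, n):
        total += g(lo + i * h)
    return total * h


NORM = integrate(lambda z: phi(z) * f(z))


def expect(g):
    """E[g(Z)] when Z has density proportional to phi(z) f(z)."""
    return integrate(lambda z: g(z) * phi(z) * f(z)) / NORM


lhs_mean = expect(lambda z: z)                       # E(Z)
rhs_mean = expect(lambda z: df(z) / f(z))            # E[f'(Z) / f(Z)]
lhs_second = expect(lambda z: z * z)                 # E(Z^2)
rhs_second = 1.0 + expect(lambda z: d2f(z) / f(z))   # 1 + E[f''(Z) / f(Z)]
print(lhs_mean, rhs_mean)
print(lhs_second, rhs_second)
```

Both pairs should agree up to quadrature error, which reflects the integration-by-parts argument behind the identity.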

20 The game model
- prior of $\theta_i$: $N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, k$
- let $Z_i = (\theta_i - \mu_i)/\sigma_i$; the posterior of $Z = (Z_1, \ldots, Z_k)$ is $p(z \mid D) \propto \phi_k(z)\, f(z)$, where $f(z) = P(D \mid z)$
- if $f(z) = f_1(z) \cdots f_m(z)$, then
  $$E(Z \mid D) = E\left[\frac{\nabla f(Z)}{f(Z)}\right] = E\left[\frac{\nabla f_1(Z)}{f_1(Z)}\right] + E\left[\frac{\nabla f_2(Z)}{f_2(Z)}\right] + \cdots + E\left[\frac{\nabla f_m(Z)}{f_m(Z)}\right]$$

21 Team skill
  $$\mu_i^{new} = \mu_i + \sigma_i\, E[Z_i \mid D] = \mu_i + \sigma_i\, E\left[\frac{\partial f(Z)/\partial Z_i}{f(Z)}\right]$$
  $$(\sigma_i^{new})^2 = \sigma_i^2\, Var[Z_i \mid D] = \sigma_i^2 \left(E[Z_i^2] - E[Z_i]^2\right) = \sigma_i^2 \left(1 + E\left[\frac{\nabla^2 f(Z)}{f(Z)}\right]_{ii} - E\left[\frac{\partial f(Z)/\partial Z_i}{f(Z)}\right]^2\right)$$
Approximate these expectations at $z_i = 0$, i.e. $\theta_i = \mu_i$, i.e. as if the distribution of $\theta$ is concentrated on $\mu$.

22 How accurate is the approximation?
Assess accuracy for $k = 2$ and the Thurstone-Mosteller model $X_i \sim N(\theta_i, \beta_i^2)$.
Joint posterior pdf of $(\theta_1, \theta_2)$:
  $$\propto \phi\left(\frac{\theta_1 - \mu_1}{\sigma_1}\right) \phi\left(\frac{\theta_2 - \mu_2}{\sigma_2}\right) \Phi\left(\frac{\theta_1 - \theta_2}{\sqrt{\beta_1^2 + \beta_2^2}}\right)$$
The exact $E(\theta_i \mid D)$ can be derived. This suggests correcting our approximation by replacing $\beta_i^2$ with $\beta_i^2 + \sigma_i^2$.
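
A brute-force check of this slide can be sketched as follows. The posterior mean of $\theta_1$ given that player 1 wins is computed on a grid from the joint posterior above, and compared with the approximation $\mu_1 + \sigma_1 (\partial f/\partial z_1)/f$ evaluated at $z = 0$, with and without the replacement of $\beta_i^2$ by $\beta_i^2 + \sigma_i^2$. All numeric values are hypothetical, the grid size is an arbitrary choice, and the closed form inside `approx_mean_theta1` is derived here from $f(z) = \Phi((\theta_1 - \theta_2)/\sqrt{\beta_1^2 + \beta_2^2})$ with $\theta_i = \mu_i + \sigma_i z_i$ rather than quoted from the slides.

```python
import math


def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


# Hypothetical priors (mu, sigma) and performance noise beta for two players.
MU1, S1, B1 = 25.0, 8.0, 4.0
MU2, S2, B2 = 30.0, 6.0, 4.0


def posterior_mean_theta1(n=300, width=8.0):
    """E(theta_1 | player 1 wins) by a midpoint rule on an n x n grid."""
    b = math.sqrt(B1 ** 2 + B2 ** 2)
    h1 = 2.0 * width * S1 / n
    h2 = 2.0 * width * S2 / n
    num = den = 0.0
    for i in range(n):
        t1 = MU1 - width * S1 + (i + 0.5) * h1
        w1 = phi((t1 - MU1) / S1)
        for j in range(n):
            t2 = MU2 - width * S2 + (j + 0.5) * h2
            w = w1 * phi((t2 - MU2) / S2) * Phi((t1 - t2) / b)
            num += t1 * w
            den += w
    return num / den


def approx_mean_theta1(corrected):
    """mu_1 + sigma_1 * (df/dz_1) / f at z = 0.

    With corrected=True, beta_i^2 is replaced by beta_i^2 + sigma_i^2.
    """
    var = B1 ** 2 + B2 ** 2
    if corrected:
        var += S1 ** 2 + S2 ** 2
    c = math.sqrt(var)
    t = (MU1 - MU2) / c
    return MU1 + (S1 ** 2 / c) * phi(t) / Phi(t)


exact = posterior_mean_theta1()
plain = approx_mean_theta1(corrected=False)
fixed = approx_mean_theta1(corrected=True)
print(exact, plain, fixed)
```

In this two-player setting the corrected value should sit much closer to the grid estimate than the uncorrected one, which is consistent with the suggestion on the slide.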

23 Individual skill
After updating the $k$ team skills, we can apply our method to update each individual player in each team. Here
- the $i$th team has $n_i$ players
- the $j$th player in the $i$th team has strength $\theta_{ij}$
- the prior of $\theta_{ij}$ is $N(\mu_{ij}, \sigma_{ij}^2)$
- assume $\theta_i = \sum_j \theta_{ij}$
- so the prior of $\theta_i$ is $N\left(\sum_j \mu_{ij}, \sum_j \sigma_{ij}^2\right)$

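24 Reparametrize $f(z)$
Let $Z_{ij} \equiv (\theta_{ij} - \mu_{ij})/\sigma_{ij}$. So,
  $$Z_i \equiv \frac{\theta_i - \mu_i}{\sigma_i} = \frac{\sum_j \sigma_{ij} Z_{ij}}{\sigma_i}$$
  $$z = [z_1, \ldots, z_k]^T, \qquad \bar{z} = [z_{11}, \ldots, z_{1 n_1}, \ldots, z_{k1}, \ldots, z_{k n_k}]^T$$
The probability of the game outcome $f(z)$ is:
  $$f(z) = f\left(\sum_{j=1}^{n_1} \frac{\sigma_{1j}}{\sigma_1} z_{1j}, \ldots, \sum_{j=1}^{n_k} \frac{\sigma_{kj}}{\sigma_k} z_{kj}\right) = \bar{f}(\bar{z})$$
Update rule:
  $$\mu_{ij} \leftarrow \mu_{ij} + \sigma_{ij}\, E\left[\frac{\partial \bar{f}(\bar{z})/\partial z_{ij}}{\bar{f}}\right]$$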

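25 Our update rules
Team skill update:
  $$\mu_i \leftarrow \mu_i + \Omega_i, \qquad \sigma_i^2 \leftarrow \sigma_i^2 \max(1 - \Delta_i, \kappa)$$
where $\kappa$ is a small positive value (0.0001).
Individual skill update:
  $$\mu_{ij} \leftarrow \mu_{ij} + \frac{\sigma_{ij}^2}{\sigma_i^2}\, \Omega_i, \qquad \sigma_{ij}^2 \leftarrow \sigma_{ij}^2 \max\left(1 - \frac{\sigma_{ij}^2}{\sigma_i^2}\, \Delta_i, \kappa\right)$$
(the adjustment to $\mu_{ij}$ is proportional to $\sigma_{ij}^2$)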
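
The update rules on this slide translate into a short sketch once the team-level quantities are given. Here `omega_i` stands for the mean shift $\Omega_i$ and `delta_i` for the variance-reduction factor subtracted from 1 inside the `max`; how these are computed for a particular model (BT, PL, TM) is not shown on the slide, so they are treated as inputs. The function names and the example numbers are hypothetical.

```python
KAPPA = 1e-4  # small positive floor kappa, as stated on the slide


def team_update(mu_i, sigma2_i, omega_i, delta_i, kappa=KAPPA):
    """Team skill update: shift the mean by omega_i, shrink the variance."""
    return mu_i + omega_i, sigma2_i * max(1.0 - delta_i, kappa)


def individual_update(players, omega_i, delta_i, kappa=KAPPA):
    """Split a team's update among its players in proportion to their variances.

    players is a list of (mu_ij, sigma2_ij); the team variance sigma2_i is
    taken as the sum of the sigma2_ij (team strength = sum of player strengths).
    """
    sigma2_i = sum(s2 for _, s2 in players)
    updated = []
    for mu_ij, s2_ij in players:
        share = s2_ij / sigma2_i
        new_mu = mu_ij + share * omega_i
        new_s2 = s2_ij * max(1.0 - share * delta_i, kappa)
        updated.append((new_mu, new_s2))
    return updated


team = [(25.0, 4.0), (30.0, 12.0)]   # hypothetical (mu_ij, sigma2_ij) pairs
new_team = individual_update(team, omega_i=2.0, delta_i=0.2)
new_mu, new_var = team_update(55.0, 16.0, omega_i=2.0, delta_i=0.2)
print(new_team, new_mu, new_var)
```

The player with the larger prior variance absorbs the larger share of the shift, and the individual mean shifts add up to the team shift. The `max(..., kappa)` floor keeps every variance strictly positive.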

26 Outline: Experiments on game data

27 Game data
Data: generated by Bungie Studios during the beta testing of the Xbox title Halo 2.

Game type      # games   # players   Description
Free for All   60,022    5,943       up to 8 players in a game
Small Teams    27,539    4,992       up to 12 players in 2 teams
Head to Head   6,227     1,672       2 players in a game
Large Teams    1,199     2,576       up to 16 players in 2 teams

Table: Data summary

28 Comparison with TrueSkill™

               BT        PL        TM        TrueSkill
Free for All   *30.59%   31.74%    44.65%    30.82%
Small Teams    33.97%    *33.89%   36.46%    35.23%
Head to Head   32.53%    32.53%    *32.44%   *32.44%
Large Teams    *37.30%   37.67%    39.37%    38.15%

Table: Prediction error. We apply our method to the BT, PL and TM models. The prediction uses $\mu - 3\sigma$.
*: best among the four methods. Entries shown in red on the slide beat TrueSkill's accuracy.

30 TM vs BT
BT seems better than TM.
- Thurstone-Mosteller: $X_1 - X_2$ follows a normal distribution.
- Bradley-Terry: $X_1 - X_2$ follows a logistic distribution.
Most currently used Elo variants use the logistic rather than the normal, because it is argued that weaker players have significantly greater winning chances than the normal model predicts.

31 Normal vs logistic
[Figure: winning probability of player 1. Solid line: normal; dashed line: logistic. x-axis: $\theta_1 - \theta_2$; y-axis: Prob(player 1 wins).]

32 Normal vs logistic
[Figure: winning probability of player 1 against $\theta_1 - \theta_2$. Solid line: normal; dashed line: logistic.]
With the logistic model, the weaker player has a bigger winning chance.

33 Computation time
- implementation: C and F#
- TrueSkill is written in F#
- Free for All, using F#: TrueSkill 13 seconds, ours 1.2 seconds
- competitive accuracy and shorter running time
online_ranking

34 Comparison with Glicko

               Glicko            TrueSkill   ours
Model          BT                TM          BT, PL, TM
Game type      2 teams           multiple    multiple
Technique      CLT               EP          W-Stein
Update after   a rating period   one game    one game

* The single-game version of Glicko has prediction error 33.88% on Head to Head.
* Results by ours and TrueSkill: 32.53% (BT), 32.53% (PL), 32.44% (TM), 32.44% (TrueSkill).
* Possible reason: Glicko is designed for a rating period.
* Glicko considers the increase in variance with the elapse of time.

35 References
A. E. Elo. The Rating of Chessplayers, Past and Present. Arco Pub., New York, 2nd edition, 1986.
M. Glickman. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics, 48(3).
R. Herbrich, T. Minka, and T. Graepel. TrueSkill™: A Bayesian skill rating system. Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA.
R. C. Weng and C.-J. Lin. A Bayesian approximation method for online ranking. Journal of Machine Learning Research, 12, 2011.
