CS-E4800 Artificial Intelligence
Jussi Rintanen
Department of Computer Science, Aalto University
March 9, 2017
Difficulties in Rational Collective Behavior

Individual utility can be in conflict with collective utility. Examples:
- greenhouse gases
- over-population
- deforestation
- arms races, military build-up

There is no general solution for resolving this conflict.
Issue: how to align the utilities of the agents and of the collective.
Laws and agreements can constrain individual actions (hard to enforce when the utilities at stake are high).
Tragedy of the Commons

Using a jointly-owned resource, with the cost shared evenly (rows and columns are the two players' usage levels; entries are the row player's and column player's payoffs):

        0      2      4      6
  0    0,0   -1,1   -2,2   -3,3
  2   1,-1    0,0   -1,1   -2,2
  4   2,-2   1,-1    0,0   -1,1
  6   3,-3   2,-2   1,-1    0,0

Using your own resource:

        0     2     4     6
  0    0,0   0,0   0,0   0,0
  2    0,0   0,0   0,0   0,0
  4    0,0   0,0   0,0   0,0
  6    0,0   0,0   0,0   0,0

Best action: spend the joint resource as much as you can. (Or, under diminishing marginal utility, at least: spend it more than you would if you had to pay for it in full.)

(Flatmates agree to fill up the fridge every day, divide the cost evenly, and let everybody eat as much as they like. Good idea?)
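The pattern can be checked mechanically. A minimal sketch, assuming the joint-resource payoffs above (spending s against an opponent spending t yields s minus half of the shared cost s + t):

    # Best responses in the joint-resource game above.
    levels = [0, 2, 4, 6]

    def row_payoff(s, t):
        # own benefit s minus half of the shared cost (s + t)
        return s - (s + t) / 2

    for t in levels:
        best = max(levels, key=lambda s: row_payoff(s, t))
        print(f"opponent spends {t}: best response is to spend {best}")

    # Spending the maximum (6) is a best response to every opponent
    # choice, i.e. a dominant strategy -- hence the tragedy.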
Games with State

Strategies in normal form games are single-shot, but real-world games typically involve multiple stages. Formalizations:
- games in extensive form (game theory)
- multi-agent Markov decision processes: MDPs with actions replaced by normal form games, and payoffs obtained from the values of the successor states
- game-tree search for zero-sum games (later today)

These can be abstractly viewed as normal form games, but the reduction to normal form is of exponential size, and time-dependent aspects cannot be investigated in normal form.
Challenges in multi-agent systems

1. players' utilities are opposite (coordination impossible; mixed strategies)
2. conflicting individual and collective utilities (coordination difficult; suboptimal collective outcomes)
3. making a decision collectively (measuring utility)
Preference aggregation

Decision between alternatives A, B and C. Agents express their preferences as
- option 1: some ranking/ordering of A, B, C
- option 2: numeric values for A, B, C

The preferences need to be aggregated to obtain a joint ordering/valuation of A, B, C. This is difficult!
- Agents' utilities are generally not publicly known.
- The optimal strategy is often to lie about one's utilities/preferences.
- This leads to suboptimal outcomes.
Aggregation of rankings

Given a set of candidates (outcomes, alternatives) and a set of agents, the objective is to produce an aggregate ordering of all candidates, or a winning candidate.
Aggregation of rankings

A scoring rule assigns a numeric score to each candidate based on its position in each individual ordering. The aggregate ordering is formed by summing the scores from each individual. Possible rules (for rankings of 4 candidates):
- plurality: x > y > z > u is mapped to 1, 0, 0, 0; only the 1st preference counts
- veto: x > y > z > u is mapped to 1, 1, 1, 0 (or 0, 0, 0, 1); only the last preference counts
- Borda count: x > y > z > u is mapped to 3, 2, 1, 0
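Scoring rules are easy to implement. A minimal sketch, assuming rankings are given as lists ordered from most to least preferred (the example ballots are illustrative):

    def scoring_rule(rankings, weights):
        # weights[p] = score awarded for being at position p,
        # e.g. Borda for 3 candidates: [2, 1, 0]
        scores = {}
        for ranking in rankings:
            for position, candidate in enumerate(ranking):
                scores[candidate] = scores.get(candidate, 0) + weights[position]
        return scores

    rankings = [["A", "B", "C"], ["B", "C", "A"], ["A", "C", "B"]]
    print(scoring_rule(rankings, [2, 1, 0]))  # Borda: A=4, B=3, C=2
    print(scoring_rule(rankings, [1, 0, 0]))  # plurality: A=2, B=1, C=0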
Aggregation of rankings

Scoring rules can be combined with runoff procedures (a sketch of the second one follows below):

2-candidate runoff with the plurality rule:
1. eliminate all but the top two candidates, based on their scores
2. recalculate the scores; the winner is the one scoring higher

Single transferable vote with the plurality rule:
1. eliminate the candidate with the lowest plurality score
2. continue the eliminations until only one candidate is left
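A minimal sketch of single transferable vote with plurality scores, using the same list-of-rankings representation as above (ties in the elimination step are broken arbitrarily here):

    def stv_winner(rankings):
        # Each ballot always counts for its highest-ranked surviving candidate.
        remaining = {c for ranking in rankings for c in ranking}
        while len(remaining) > 1:
            scores = {c: 0 for c in remaining}
            for ranking in rankings:
                top = next(c for c in ranking if c in remaining)
                scores[top] += 1
            remaining.remove(min(remaining, key=lambda c: scores[c]))
        return remaining.pop()

    ballots = [["A", "B", "C"], ["A", "C", "B"], ["B", "C", "A"],
               ["C", "B", "A"], ["C", "B", "A"]]
    print(stv_winner(ballots))  # B is eliminated first; C then beats A 3-2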
Aggregation of preferences/ranks

Ordering by pairwise plurality: order x > y if a plurality of the agents prefers x to y. This can lead to cycles, e.g. with candidates A, B, C:
agent 1: A > B > C
agent 2: B > C > A
agent 3: C > A > B
Here two of the three agents prefer A to B, two prefer B to C, and two prefer C to A: a cycle.
Aggregation of preferences/ranks

Ordering by pairwise plurality: how are these cycles possible? The candidates can have different property vectors, e.g. (1,1,0), (1,0,1) and (0,1,1). Even if all agents value all properties positively, uneven weights lead to cycles. Example: agents with the weight vectors (3,2,1), (1,3,2) and (2,1,3).
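A small check, using the property vectors and weights above (with the candidates labeled A, B, C for illustration), that these weighted valuations produce a pairwise-plurality cycle:

    properties = {"A": (1, 1, 0), "B": (1, 0, 1), "C": (0, 1, 1)}
    weights = [(3, 2, 1), (1, 3, 2), (2, 1, 3)]  # one weight vector per agent

    def value(candidate, w):
        return sum(wi * pi for wi, pi in zip(w, properties[candidate]))

    for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
        prefer_x = sum(1 for w in weights if value(x, w) > value(y, w))
        print(f"{x} beats {y}: {prefer_x} of {len(weights)} agents")
    # Each pair is won 2-1, so A > B > C > A: a cycle.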
Strategic voting

Expressing preferences incorrectly can be beneficial. Assume plurality voting.
- The agent's actual preferences are A > B > C.
- The agent knows that the other agents' preferences are B > C > A, B > C > A, C > A > B, C > A > B.
- B and C will be tied if the agent votes A > B > C.
- B wins if the agent votes B > A > C (a better result for the agent!).

Other scoring rules (and voting systems in general) are manipulable similarly.
Vickrey-Clarke-Groves mechanism

With the Clarke pivot rule. Choice between alternatives in a set X:
1. The agents report their value functions v_i(x), x ∈ X.
2. The best outcome x_opt = arg max_{x ∈ X} Σ_{i=1}^n v_i(x) is chosen.
3. Agent i is paid Σ_{j≠i} v_j(x_opt) − max_{x ∈ X} Σ_{j≠i} v_j(x), i.e. the others' value of x_opt minus the others' value of the best alternative without i.

The agent's payment + utility is maximized by truthful reporting!
(Can be viewed as a generalization of second-price sealed-bid auctions.)
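A minimal sketch of this mechanism, assuming reported values are given as one dictionary per agent (the example values are illustrative):

    def vcg(values):
        # values[i][x] = value reported by agent i for alternative x
        alternatives = list(values[0])

        def total(x, exclude=None):
            return sum(v[x] for i, v in enumerate(values) if i != exclude)

        x_opt = max(alternatives, key=total)
        payments = []
        for i in range(len(values)):
            best_without_i = max(total(x, exclude=i) for x in alternatives)
            # others' value of x_opt minus their value of the best
            # alternative chosen without agent i (non-positive)
            payments.append(total(x_opt, exclude=i) - best_without_i)
        return x_opt, payments

    values = [{"A": 10, "B": 0}, {"A": 0, "B": 6}, {"A": 0, "B": 3}]
    print(vcg(values))  # ('A', [-9, 0, 0]): only the pivotal agent pays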
Game tree search

Two-person multi-stage zero-sum games: the player wins and the opponent loses, or vice versa (or it's a draw). Board games: checkers, chess, backgammon, Go. Other applications? (Military operations?)
Issues: very large search trees; focusing the search is difficult.
Basic game tree search by Minimax

Depth-first search of a bounded-depth AND-OR tree (a sketch follows below):
- Leaf nodes are evaluated with a heuristic value function. Chess: values of the pieces, relative positions (mobility, safety of the king, ...).
- The values of non-leaf nodes are the min or max of their children:
  AND nodes (opponent) by minimization, OR nodes (player) by maximization.

(Special case: the whole game tree is covered, with winning leaves valued 1, losing leaves -1, and draws 0.)
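A minimal depth-bounded minimax sketch; the game interface (successors, evaluate, is_terminal) is an assumption, not from the slides:

    def minimax(state, depth, maximizing):
        # Leaves are scored by the heuristic evaluation; interior nodes
        # by the max (player) or min (opponent) of their children.
        if depth == 0 or is_terminal(state):
            return evaluate(state)
        values = [minimax(s, depth - 1, not maximizing)
                  for s in successors(state)]
        return max(values) if maximizing else min(values)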
Minimax Tree Search
(Figure: an example minimax tree, with heuristic values at the leaves and min/max values propagated to the inner nodes.)
Alpha-Beta Pruning

Idea behind alpha-beta pruning:
- min(x, max(y, z)) = x if x ≤ y (α cuts)
- max(x, min(y, z)) = x if x ≥ y (β cuts)
In both cases, z is irrelevant and need not be evaluated.
Alpha-Beta pruning example
(Figures: step-by-step search of a depth-2 tree with MAX at the root and leaf values 3, 12, 8, 2, 14, 5, 2. After the leaf with value 2 is seen under the second MIN node, its remaining leaves are pruned (marked X), and the root value is 3.)
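A minimal alpha-beta sketch over the same assumed game interface as the minimax code above:

    def alphabeta(state, depth, alpha, beta, maximizing):
        # alpha/beta bound the value range still relevant to the ancestors;
        # stop exploring a node's children once the window closes.
        if depth == 0 or is_terminal(state):
            return evaluate(state)
        if maximizing:
            value = float("-inf")
            for s in successors(state):
                value = max(value, alphabeta(s, depth - 1, alpha, beta, False))
                alpha = max(alpha, value)
                if alpha >= beta:
                    break  # beta cut: a MIN ancestor will never allow this
            return value
        else:
            value = float("inf")
            for s in successors(state):
                value = min(value, alphabeta(s, depth - 1, alpha, beta, True))
                beta = min(beta, value)
                if alpha >= beta:
                    break  # alpha cut
            return value

    # Initial call: alphabeta(root, depth, float("-inf"), float("inf"), True)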
Heuristics to support Alpha-Beta Pruning

Alpha-beta prunes more if the best actions are tried first. Promising actions can be determined through iterative deepening: order the actions/children by their scores from the previous iterative-deepening round.
Issue with depth bounds: the horizon effect
(Figure: a chess position in which the black bishop is trapped, but its capture can be delayed to search depth d + 1, beyond the horizon of a depth-d search.)
Transposition tables

Depth-first search is used in games like chess because of their astronomical state spaces: algorithms that require storing all visited states are not feasible. We need to utilize memory for pruning without exhausting it.

DFS can reach a state in multiple ways ⇒ multiple copies of the same subtree are searched.

Transposition tables: cache the states encountered during DFS, and retrieve the values of already-encountered states rather than repeating the search. When the table is full, delete low-importance states.
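A minimal transposition-table sketch layered on the minimax code above, assuming states are hashable (real chess programs use Zobrist hashing and bounded tables with replacement policies):

    table = {}  # (state, depth, maximizing) -> value

    def minimax_tt(state, depth, maximizing):
        key = (state, depth, maximizing)
        if key in table:
            return table[key]  # state already reached via another move order
        if depth == 0 or is_terminal(state):
            value = evaluate(state)
        else:
            values = [minimax_tt(s, depth - 1, not maximizing)
                      for s in successors(state)]
            value = max(values) if maximizing else min(values)
        table[key] = value
        return value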
Endgame databases

In games with a limited number of simple (late) states/configurations, compute their values by exhaustive game-tree search and store them for later use. This is another form of caching, constructed once, before game playing.
Endgame databases

All 7-piece chess endgames were solved in 2012. The 7-piece database is 140 TB; the 6-piece database is 1.2 TB.
(Figure: a position in which Black is to checkmate in 545 moves.)
Checkers is solved

Checkers (≈ 5 × 10^20 states) was shown to be a draw (Schaeffer et al., 2007). The solution consists of:
- an AND-OR tree from the initial state (≈ 10^7 nodes)
- leaf nodes evaluated from an endgame database covering all positions with at most 10 pieces: 3.9 × 10^13 states, computed between 2001 and 2005
Monte Carlo methods

DFS does not work well for some types of games:
- too many states
- heuristics don't guide the search well
- information gained during the search is not utilized

Monte Carlo methods:
- sample full game plays randomly
- focus the search according to the promising game plays
- work even without heuristics, e.g. for Go

Similar methods are also used for very large MDPs and POMDPs (e.g. in robotics).
Go (or Baduk or Weiqi)

A two-player, fully observable, deterministic, zero-sum board game. It has long been a big challenge for computers.
Rules of Go

Go is played on a 19 × 19 square grid of points, by players called Black and White. Each point on the grid may be colored black, white or empty.

A point P, not colored C, is said to reach C if there is a path of (vertically or horizontally) adjacent points of P's color from P to a point of color C. Clearing a color means emptying all points of that color that don't reach empty.

Starting with an empty grid, the players alternate turns, starting with Black. A turn is either a pass, or a move that doesn't repeat an earlier grid coloring. A move consists of coloring an empty point one's own color, then clearing the opponent's color, and then clearing one's own color. The game ends after two consecutive passes.

A player's score is the number of points of her color, plus the number of empty points that reach only her color. White gets 6.5 points extra. The player with the higher score at the end of the game is the winner.
Example game of 9 × 9 Go
(Figures of an example game omitted.)
Why is Go difficult for computers?

- Go is visual and thus easy for people. (One could not show 10 chess moves in a single image.)
- The branching factor is far larger than in chess.
- Evaluating board configurations is difficult.
- The horizon effect is strong (it is easy to delay a capture).
Paradigm shift in 2006

Computer Go was progressing slowly (weak amateur level). In 2006, Monte Carlo methods surpassed traditional tree search. By 2015:
- all competitive programs used Monte Carlo methods
- 19 × 19 play was at strong amateur level
- 9 × 9 play was at professional level
- 5 × 6 had been solved, and solving 6 × 6 was feasible

In 2016, with board evaluation by neural networks, AlphaGo beat human champions.
Monte Carlo Search

Try out every possible action with several randomized plays (see the sketch below):
- choose actions randomly
- stop only after the game ends
- score each game play according to who wins

The best action is the one with the most wins. Notice: there is no search tree here, only an evaluation of the current action alternatives.
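A minimal flat Monte Carlo sketch; the helpers legal_actions, result and random_playout (returning 1 for a win and 0 otherwise) are assumptions, not from the slides:

    def monte_carlo_action(state, trials_per_action=100):
        # Evaluate each available action by random playouts; no search tree.
        def win_rate(action):
            next_state = result(state, action)
            wins = sum(random_playout(next_state)
                       for _ in range(trials_per_action))
            return wins / trials_per_action
        return max(legal_actions(state), key=win_rate)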
Monte Carlo Tree Search (MCTS)

An extension of simulation/sampling-only Monte Carlo search: generate a search tree, with the leaves evaluated by randomized simulation.
Example (Single Agent)
(Figure: a search tree showing the number of wins/trials for each node, initially 0/0.)
Monte Carlo Tree Search
(Figures: seven snapshots of the tree growing over successive simulations. After each simulated win or loss, the wins/trials counters are updated along the path from the sampled node to the root, ending at 4/6 at the root after six trials.)
Monte Carlo Tree Search

Which tree node should be chosen for the next expansion or trial? We have incomplete information: the results of the previous trials. Should we
- choose a node with few trials and high rewards (low confidence), or
- choose a node with many trials and lower rewards (high confidence)?

(This is the exploration-exploitation trade-off, as in reinforcement learning.) Approach: multi-armed bandits.
Multi-Armed Bandits

Consider three one-armed bandits (slot machines) with different win distributions, and with the following wins so far:
1. 0, 1, 0, 0, 1
2. 5
3. 2, 2, 1

Which arm would you pull next?
Multi-Armed Bandits

- μ_i = the (initially unknown) expected pay-off of arm i
- T_i(t) = how many times arm i was played during steps 1..t
- μ* = max_{i=1..K} μ_i is the optimum pay-off

An optimal way of choosing the arms minimizes the regret (how much below the optimum?) after n steps:

    n·μ* − Σ_{i=1}^{K} μ_i · E[T_i(n)]
Multi-Armed Bandits

x̄_i = the average reward from arm i in the first n steps.

UCB1 formula (Auer et al. 2002): first, every arm is played once. The optimal arm to play after n steps: choose i to maximize

    x̄_i + sqrt(2 ln n / T_i(n))
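A minimal UCB1 sketch; the Bernoulli arms used for illustration are an assumption standing in for real slot machines:

    import math, random

    def ucb1_choose(sum_reward, count, t):
        # count[i] = T_i(t); sum_reward[i] / count[i] = average reward of arm i.
        # Unplayed arms get priority (infinite upper confidence bound).
        def ucb(i):
            if count[i] == 0:
                return float("inf")
            return (sum_reward[i] / count[i]
                    + math.sqrt(2 * math.log(t) / count[i]))
        return max(range(len(count)), key=ucb)

    arms = [0.4, 0.7, 0.5]  # win probabilities, unknown to the algorithm
    sum_reward, count = [0.0] * len(arms), [0] * len(arms)
    for t in range(1, 1001):
        i = ucb1_choose(sum_reward, count, t)
        reward = 1.0 if random.random() < arms[i] else 0.0
        sum_reward[i] += reward
        count[i] += 1
    print(count)  # arm 1 (the highest mean) should dominate the pulls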
UCT algorithm

Create a root of the tree with the initial state
while within the computational budget do
    leaf ← Selection(root)
    terminal ← Simulation(leaf)
    Backpropagation(leaf, Utility(terminal))
end
return arg max_{child ∈ Children(root)} N(child)
UCT algorithm

function Selection(node)
    while NonTerminal(State(node)) do
        action ← arg max_{action ∈ Actions(node)} UCB1(node, action)
        if Child(node, action) exists then
            node ← Child(node, action)
        else
            return Expand(node, action)
        end
    end
    return node

function UCB1(node, action)
    child ← Child(node, action)
    if child exists then
        return SumUtil(child) / N(child) + sqrt(2 ln N(node) / N(child))
    else
        return ∞
UCT algorithm

function Expand(node, action)
    child ← create a new child of node
    N(child) ← 0
    SumUtil(child) ← 0
    return child

function Backpropagation(node, utility)
    while node exists do
        N(node) ← N(node) + 1
        SumUtil(node) ← SumUtil(node) + utility
        node ← Parent(node)
    end
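A compact runnable rendering of this pseudocode in Python; the game interface (actions, result, is_terminal, utility) is an assumption, not from the slides:

    import math, random

    class Node:
        def __init__(self, state, parent=None):
            self.state, self.parent = state, parent
            self.children = {}             # action -> Node
            self.n, self.sum_util = 0, 0.0

    def ucb1(node, action):
        child = node.children.get(action)
        if child is None or child.n == 0:
            return float("inf")            # untried actions are selected first
        return (child.sum_util / child.n
                + math.sqrt(2 * math.log(node.n) / child.n))

    def selection(node):
        while not is_terminal(node.state):
            action = max(actions(node.state), key=lambda a: ucb1(node, a))
            if action in node.children:
                node = node.children[action]
            else:
                return expand(node, action)
        return node

    def expand(node, action):
        child = Node(result(node.state, action), parent=node)
        node.children[action] = child
        return child

    def simulation(state):
        # random playout until a terminal state is reached
        while not is_terminal(state):
            state = result(state, random.choice(actions(state)))
        return state

    def backpropagation(node, utility):
        while node is not None:
            node.n += 1
            node.sum_util += utility
            node = node.parent

    def uct(initial_state, budget=10000):
        root = Node(initial_state)
        for _ in range(budget):
            leaf = selection(root)
            terminal = simulation(leaf.state)
            backpropagation(leaf, utility(terminal))
        # the recommended action leads to the most-visited child
        return max(root.children, key=lambda a: root.children[a].n)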
Properties of the UCT algorithm

- The best action is chosen exponentially more often than the others.
- The algorithm grows an asymmetric tree.
- The utility estimates converge to the true values.
- Applicable to one or more agents, and to deterministic or stochastic systems.