We describe a generalized Q-learning-type algorithm for reinforcement learning in competitive multi-agent games. We observe that in a competitive setting with adaptive agents, an agent's actions will likely result in changes in the opponents' policies. In addition to accounting for the estimated policies of the opponents, our algorithm adjusts these future opponent policies by incorporating estimates of how the opponents change their policies in reaction to one's own actions. We present results showing that agents that learn with this algorithm can achieve high reward in competitive multi-agent games where myopic self-interested behavior conflicts with the long-term individual interests of the players. We show that this approach scales to multi-agent games of various sizes, in particular to social-dilemma-type problems: from the small iterated Prisoner's Dilemma to larger settings akin to Hardin's Tragedy of the Commons. ...
Pieter Jan 't Hoen, Sander M. Bohte, Han La Poutré