In any single-agent system, exploration is a critical component of learning: it ensures that all possible actions receive some degree of attention, allowing the agent to converge to a good policy. The same concept has been adopted by multiagent learning systems. However, multiagent learning involves a fundamentally different dynamic: each agent operates in a non-stationary environment, as a direct result of the evolving policies of the other agents in the system. As such, exploratory actions taken by one agent bias the policies of the other agents, forcing them to learn policies that perform well in the presence of agent exploration. CLEAN rewards address this issue by privatizing exploration: each agent takes its best action, but internally computes rewards for counterfactual exploratory actions. However, CLEAN rewards require each agent to know the mathematical form of the system evaluation function, which is typically unavailable to agents. In this paper, we present an algorithm to approximate CLEAN rewards, eliminating the need for agents to know the form of the system evaluation function.
Mitchell K. Colby, Sepideh Kharaghani, Chris Holme
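
To make the notion of privatized exploration concrete, the following is a minimal, self-contained sketch of the counterfactual-reward idea described in the abstract, in a toy stateless setting. The evaluation function G, the value-update rule, and all names and constants (clean_step, EPSILON, etc.) are illustrative assumptions, not the paper's implementation; the point is that the counterfactual step explicitly re-evaluates G, which is precisely the requirement the proposed approximation is meant to remove.

```python
# Illustrative sketch only (assumed setup, not the paper's code): a stateless
# multiagent bandit in which agents use CLEAN-style privatized exploration.
import random

N_AGENTS = 5
N_ACTIONS = 4
ALPHA = 0.1      # value-table learning rate (assumed)
EPSILON = 0.2    # probability of evaluating a counterfactual action (assumed)


def G(joint_action):
    """Assumed global evaluation function: rewards coverage of distinct actions."""
    return len(set(joint_action)) / N_ACTIONS


def clean_step(values):
    # 1) Every agent executes its greedy action: no exploratory noise is
    #    injected into the joint action the system actually experiences.
    greedy = [max(range(N_ACTIONS), key=lambda a: v[a]) for v in values]
    g_actual = G(greedy)

    for i, v in enumerate(values):
        # 2) Privately pick a counterfactual action to evaluate.
        c = random.randrange(N_ACTIONS) if random.random() < EPSILON else greedy[i]

        # 3) CLEAN-style counterfactual reward: re-evaluate G with agent i's
        #    action swapped for the counterfactual one, holding the other
        #    agents' greedy actions fixed. This step requires knowing the
        #    form of G (or an approximation of it).
        counterfactual = list(greedy)
        counterfactual[i] = c
        reward = G(counterfactual) - g_actual

        # 4) Update only the privately evaluated action's value (assumed rule).
        v[c] += ALPHA * (reward - v[c])
    return g_actual


if __name__ == "__main__":
    values = [[0.0] * N_ACTIONS for _ in range(N_AGENTS)]
    for _ in range(2000):
        score = clean_step(values)
    print("final joint evaluation:", score)
```

Note that the executed joint action contains no exploratory noise; exploration happens only inside each agent's private reward computation, which is why the approach depends on access to the system evaluation function G or, as proposed here, an approximation of it.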