The paper explores a very simple agent design method called Q-decomposition, wherein a complex agent is built from simpler subagents. Each subagent has its own reward function and runs its own reinforcement learning process. It supplies to a central arbitrator the Q-values (according to its own reward function) for each possible action. The arbitrator selects an action maximizing the sum of Q-values from all the subagents. This approach has advantages over designs in which subagents recommend actions. It also has the property that if each subagent runs the Sarsa reinforcement learning algorithm to learn its local Q-function, then a globally optimal policy is achieved. (On the other hand, local Q-learning leads to globally suboptimal behavior.) In some cases, this form of agent decomposition allows the local Q-functions to be expressed by much-reduced state and action spaces. These results are illustrated in two domains that require effective coordination of behaviors.
Stuart J. Russell, Andrew Zimdars
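
The sketch below illustrates the structure described in the abstract: each subagent runs Sarsa on its own reward channel, and an arbitrator picks the action maximizing the summed Q-values. The toy environment (`ToyEnv`), its two reward channels, and all hyperparameters are illustrative assumptions, not from the paper; only the Sarsa-plus-additive-arbitrator structure follows the abstract.

```python
# Minimal sketch of Q-decomposition with Sarsa subagents.
# ToyEnv, its reward channels, and all hyperparameters are assumptions for illustration.

import random
from collections import defaultdict


class ToyEnv:
    """Tiny chain MDP with two reward channels, one per subagent (assumed example)."""
    N_STATES = 5
    ACTIONS = (-1, +1)  # move left / right along the chain

    def reset(self):
        self.s = 2
        return self.s

    def step(self, a):
        self.s = max(0, min(self.N_STATES - 1, self.s + a))
        # Subagent 0 is rewarded at the left end, subagent 1 at the right end.
        rewards = (1.0 if self.s == 0 else 0.0,
                   2.0 if self.s == self.N_STATES - 1 else 0.0)
        return self.s, rewards


class SarsaSubagent:
    """Learns Q_i for its own reward channel, bootstrapping on the arbitrator's actions."""

    def __init__(self, actions, alpha=0.1, gamma=0.95):
        self.Q = defaultdict(float)
        self.actions, self.alpha, self.gamma = actions, alpha, gamma

    def q_values(self, s):
        # Report this subagent's Q-values for every action to the arbitrator.
        return {a: self.Q[(s, a)] for a in self.actions}

    def update(self, s, a, r, s_next, a_next):
        # On-policy (Sarsa) target: use the action the arbitrator actually took next,
        # not this subagent's locally greedy action (which would be local Q-learning).
        target = r + self.gamma * self.Q[(s_next, a_next)]
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])


def arbitrate(subagents, s, actions, epsilon=0.1):
    """Select the action maximizing the sum of subagent Q-values (epsilon-greedy)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: sum(sa.q_values(s)[a] for sa in subagents))


def run(episodes=200, steps_per_episode=30):
    env = ToyEnv()
    subagents = [SarsaSubagent(env.ACTIONS), SarsaSubagent(env.ACTIONS)]
    for _ in range(episodes):
        s = env.reset()
        a = arbitrate(subagents, s, env.ACTIONS)
        for _ in range(steps_per_episode):
            s_next, rewards = env.step(a)
            a_next = arbitrate(subagents, s_next, env.ACTIONS)
            for sa, r in zip(subagents, rewards):
                sa.update(s, a, r, s_next, a_next)
            s, a = s_next, a_next
    return subagents


if __name__ == "__main__":
    run()
```

The key design point mirrored here is the on-policy update: each subagent evaluates the arbitrator's joint policy rather than its own greedy policy, which is why Sarsa-based subagents can converge to a globally optimal policy while local Q-learning need not.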