A new training algorithm is presented for delayed reinforcement learning problems that does not assume the existence of a critic model and employs the polytope optimization algorit...
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms...