— This paper addresses learning-based adaptive resource allocation for wireless MIMO channels with Markovian fading. The problem is posed as a constrained Markov decision process (CMDP) with the goal of minimizing the average transmission cost (such as the transmission power) subject to a constraint on the average holding cost (such as the transmitter delay). The standard Q-learning algorithm is employed to adaptively find the optimal policy when the channel/traffic statistics are unknown; we discuss its convergence properties and show that it can compute the optimal policy relatively quickly even for rather large state spaces. In order to further improve the convergence rate of standard Q-learning, we establish several structural results on the optimal policies. We show that the optimal transmission policy is monotonic in the buffer occupancy. This permits us to exploit the supermodularity of the Q-factors and form a structured Q-learning algorithm that increases the convergence rate with respect to the...
Dejan V. Djonin, Vikram Krishnamurthy
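The idea described in the abstract can be illustrated with a minimal sketch: tabular Q-learning on a toy buffer/transmission MDP, followed by a projection of the greedy policy onto the set of policies monotone in buffer occupancy. The arrival model, power-cost vector, Lagrange multiplier `lam`, and all numerical values below are hypothetical placeholders, not the paper's actual model; a faithful structured Q-learning algorithm would enforce the supermodularity constraint inside the update rather than by post-hoc projection.

```python
import numpy as np

rng = np.random.default_rng(0)
B, A = 10, 3                              # buffer levels 0..B, actions 0..A (packets sent)
lam = 0.5                                 # assumed Lagrange multiplier: power vs. holding cost
power = np.array([0.0, 1.0, 2.5, 5.0])    # assumed convex per-action power cost

def step(b, a):
    """One transition: transmit min(a, b) packets, then a Bernoulli arrival."""
    a = min(a, b)
    arrival = int(rng.integers(0, 2))     # 0 or 1 new packet (toy traffic model)
    b_next = min(b - a + arrival, B)
    cost = power[a] + lam * b             # transmission power + buffer holding cost
    return b_next, cost

# Standard epsilon-greedy Q-learning over the (buffer, action) table.
Q = np.zeros((B + 1, A + 1))
alpha, gamma, eps = 0.1, 0.95, 0.1
b = 0
for _ in range(50_000):
    a = int(rng.integers(0, A + 1)) if rng.random() < eps else int(Q[b].argmin())
    b_next, c = step(b, a)
    Q[b, a] += alpha * (c + gamma * Q[b_next].min() - Q[b, a])
    b = b_next

greedy = Q.argmin(axis=1)                 # unconstrained greedy policy
policy = np.maximum.accumulate(greedy)    # project onto monotone (nondecreasing) policies
```

The final `np.maximum.accumulate` step is where the structural result is used: since the optimal policy is known to be monotone in buffer occupancy, restricting the search to monotone policies shrinks the policy space and, in the structured algorithm the abstract refers to, speeds up convergence.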