This paper presents a scalable, self-optimizing architecture for Quality-of-Service (QoS) provisioning in the Differentiated Services (DiffServ) framework. The proposed architecture includes adaptive components that model the network as a Semi-Markov Decision Process (SMDP). Specifically, an ingress node adaptively performs connection admission and flow classification, while each core router jointly performs bandwidth allocation and buffer management for the network classes. The main objective is to maximize long-term average network revenue while minimizing long-term average QoS violations. We use a model-free Reinforcement Learning (RL) technique to find the optimal policy for each DiffServ component. Simulation results show that the proposed solution not only performs well in terms of long-term average reward, but also adapts, self-optimizes, and self-heals in response to network changes.
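
To make the idea of model-free learning on an SMDP concrete, the following is a minimal sketch of a single Q-learning update in which the time spent in a state (the sojourn time) enters the update. It is an illustration only: the abstract does not name the RL algorithm used, and its objective is long-term average reward, whereas this sketch uses the simpler continuously discounted variant. The function name `smdp_q_update`, the state and action labels, and the parameters `alpha` (learning rate) and `beta` (discount rate) are all assumptions introduced here.

```python
import math
from collections import defaultdict


def smdp_q_update(q, state, action, reward_rate, sojourn,
                  next_state, actions, alpha=0.1, beta=0.01):
    """Apply one discounted SMDP Q-learning update and return the new Q-value.

    Hypothetical helper, not the paper's algorithm. The reward is assumed to
    accrue at a constant rate `reward_rate` for `sojourn` time units.
    """
    # Continuous-time discount factor over the sojourn in `state`.
    discount = math.exp(-beta * sojourn)
    # Discounted reward accumulated during the sojourn:
    # integral of reward_rate * exp(-beta * t) for t in [0, sojourn].
    accrued = reward_rate * (1.0 - discount) / beta
    # Greedy value of the next decision epoch.
    best_next = max(q[(next_state, a)] for a in actions)
    target = accrued + discount * best_next
    # Move the current estimate toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Illustrative admission decision at an ingress node (labels are made up).
ACTIONS = ("admit", "reject")
q_table = defaultdict(float)
new_value = smdp_q_update(q_table, state=0, action="admit",
                          reward_rate=1.0, sojourn=2.0, next_state=1,
                          actions=ACTIONS, alpha=0.5, beta=0.5)
```

Unlike the discrete-time update, the discount applied to the next state's value depends on how long the system stayed in the current state, which is what distinguishes the semi-Markov setting, where decision epochs such as connection arrivals and departures are irregularly spaced.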