Three-stage non-blocking switching fabrics are the next step in scaling current crossbar switches to many hundreds or a few thousand ports. Congestion (output contention) management is the central open problem: without it, performance suffers heavily under real-world traffic patterns. Centralized schedulers for bufferless crossbars manage output contention but do not scale to high valencies or to multi-stage fabrics. Distributed scheduling, as in buffered crossbars, is scalable but has never been scaled beyond crossbars. We combine ideas from centralized and distributed schedulers, from request-grant protocols, and from credit-based flow control to propose a novel, practical architecture for scheduling in non-blocking buffered switching fabrics. The new architecture relies on multiple independent single-resource schedulers operating in a pipeline. It: (i) does not need internal speedup; (ii) directly operates on variable-size packets or multi-packet segments; (i...