This paper describes the architecture and implementation of the AlphaServer GS320, a cache-coherent non-uniform memory access multiprocessor developed at Compaq. The AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing with 32 to 64 processors. Each node in the design consists of four Alpha 21264 processors, up to 32 GB of coherent memory, and an aggressive I/O subsystem. The current implementation supports up to 8 such nodes for a total of 32 processors. While snoopy-based designs have been stretched to medium-scale multiprocessors by some vendors, providing sufficient snoop bandwidth remains a major challenge, especially in systems with aggressive processors. At the same time, directory protocols targeted at larger-scale designs lead to a number of inherent inefficiencies relative to snoopy designs. A key goal of the AlphaServer GS320 architecture has been to achieve the best of both worlds, partly by exploiting the bounded scale of the target syst...