The emergence of global-scale online services has galvanized scale-out software, characterized by splitting vast datasets and massive computation across many independent servers. Datacenters housing thousands of servers are designed to support scale-out workloads, with per-server throughput dictating overall datacenter capacity and cost. However, today’s processors do not use die area efficiently, limiting per-server throughput. We find that existing processors over-provision cache capacity, leading to designs with sub-optimal performance density (performance per unit area). Furthermore, as these designs are scaled up with technology, the growing number of cores further reduces performance density due to increased on-chip latencies. We use a suite of real-world scale-out workloads to investigate performance density and formulate a methodology for designing optimally efficient processors for scale-out workloads. Our proposed architecture is based on the notion o...