Abstract— In this paper, we describe the design and implementation of the primary memory system of the TRIPS processor. To match the processor's aggressive execution bandwidth and to support high levels of memory parallelism, the primary memory system is completely partitioned into four banks and supports up to 256 in-flight memory instructions, aggressive reordering of in-flight loads and stores, up to four loads and stores every cycle, and up to 64 outstanding cache misses to sixteen different cache lines. The design was implemented in IBM 130nm ASIC technology and occupies 21% of the processor area. We describe the microarchitecture of the memory system in detail, present the design of two of its most complex and interesting components – the LSQ and the MHU – and discuss the rationale behind some of the design decisions. Our design experience suggests that the complexity of the partitioned memory system is comparable to that of less aggressive centralized implementations.
Simha Sethumadhavan, Robert G. McDonald, Rajagopal