This paper examines how the performance of a shared-memory multiprocessor can be improved by including hardware support for block transfers. A system similar to the Hector multiprocessor developed at the University of Toronto is used as a base architecture. It is shown that such hardware support can improve the performance of initialization code by as much as 50%, but that the amount of improvement depends on the memory access behavior of the program and the way in which the operating system issues block transfer requests.
Steven J. E. Wilton, Zvonko G. Vranesic