Existing low-latency protocols make unrealistically strong assumptions about reliability. This allows them to achieve impressive performance, but also prevents this performance being exploited by applications, which must then deal with reliability issues in the application code. We present results from a new protocol that provides error recovery, and whose performance is close to that of existing low-latency protocols. We achieve a CPU overhead of 1.5 μs for packet download and 3.6 μs for upload. Our results show that (a) executing a protocol in the kernel is not incompatible with high performance, and (b) complete control over the protocol stack enables (1) simple forms of flow control to be adopted, (2) proper bracketing of the unreliable portions of the interconnect, thus minimising buffers held up for possible recovery, and (3) the sharing of buffer pools. The result is a protocol which performs well in the context of parallel computation and the loose coupling of processes in the workstations of a c...
Stephen R. Donaldson, Jonathan M. D. Hill, David B