Abstract—As programmers are asked to manage increasingly complicated parallel machines, they are likely to become more dependent on tools such as multi-threaded data race detectors, memory bounds checkers, dynamic dataflow trackers, and performance profilers to understand and maintain their software. As these tools grow in importance, it is worth exploring special-purpose accelerators for these tasks, especially since commodity multi-cores can provide only limited speedups. Rather than performing all the instrumentation and analysis on the main processor, we explore the idea of using the increasingly high-throughput board-level interconnect available on many systems to offload analysis to a parallel off-chip accelerator. Such an approach raises many non-trivial technical issues that may not appear in simulation, and to flush them out we have developed a prototype system that maps a DMA-based analysis engine, sitting on a ...