Recently, a dedicated hardware accelerator was proposed that works in conjunction with the caches of modern microprocessors to speed up the commonly used memcpy operation. The main assumption of that proposal was that the data to be copied resides in the cache, which is not always the case. In this paper, we present a dedicated load/store unit, and its implementation, that cooperates with the previously proposed memcpy hardware accelerator and the cache to ensure that the data becomes available in the cache. Experimental results using synthetic benchmarks show that the load/store unit, in conjunction with the memcpy hardware accelerator, reduces memcpy latencies by 85% (when the data is not present in the cache) compared to a highly optimized, hand-coded assembly software implementation.