In this paper, we present a hardware solution for performing memory copies that are not cache-line aligned, allowing the commonly used memcpy function to handle word-aligned copies. The main purpose is to reduce the latency of memory copies aligned on word boundaries. The proposed solution exploits the presence of a cache and assumes that the to-be-copied words are already in the cache. We extend an earlier proposed solution that exploited the cache-line alignment of the memcpy function when ‘moving’ large amounts of data. We present the concept and implementation details of the proposed hardware module, as well as the system used to evaluate both our hardware and an optimized software implementation of the memcpy function. Experimental results show that the proposed hardware solution is at least 66% faster than an optimized hand-coded software solution.