We propose a technique for reducing the energy spent in the memory-processor interface of an embedded system during the execution of firmware code. The method is based on the idea of compressing the most commonly executed instructions so as to reduce the energy dissipated during memory access. Instruction decompression is performed on-the-fly by a hardware block located between processor and memory: No changes to the processor architecture are required. Hence, our technique is well suited for systems employing IP cores whose internal architecture cannot be modified. We describe a number of decompression schemes and architectures that effectively trade off hardware complexity and static code size increase for memory energy and bandwidth reduction, as proved by the experimental data we have collected by executing several test programs on different design templates.