Embedded systems commonly execute a single program for their entire lifetime. Designing embedded system architectures with configurable components that can be tuned to that one program, based on a pre-analysis of the program, can yield significant power and performance benefits. We illustrate such benefits by designing a loop cache specifically with tuning in mind. Our results show a 70% reduction in instruction memory accesses for MIPS and 8051 processors.