Level one cache normally resides on a processor’s critical path, which determines the clock frequency. Directmapped caches exhibit fast access time but poor hit rates compared with same sized set-associative caches due to nonuniform accesses to the cache sets, which generate more conflict misses in some sets while other sets are underutilized. We propose a technique to reduce the miss rate of direct mapped caches through balancing the accesses to cache sets. We increase the decoder length and thus reduce the accesses to heavily used sets without dynamically detecting the cache set usage information. We introduce a replacement policy to direct-mapped cache design and increase the access to the underutilized cache sets with the help of programmable decoders. On average, the proposed balanced cache, or BCache, achieves 64.5% and 37.8% miss rate reductions on all 26 SPEC2K benchmarks for the instruction and data caches, respectively. This translates into an average IPC improvement of 5....