Recently, several loop buffer designs have been proposed to reduce instruction fetch energy by exploiting the size and location advantages of the loop buffer. Nevertheless, design complexity dictates that most loop buffer designs store only innermost loops without forward branches, or only the instructions within an innermost loop that precede a forward branch. While program modeling shows that typical programs can best be represented with a simple loop model, many of them contain forward branches in their innermost loops. For example, MiBench spends 71% of its execution time in innermost loops, and 27% of these innermost loops contain one or more forward branches. Hence, existing designs limit the achievable reduction in instruction fetch energy. We propose a simple and effective way to cope with this complexity: since using a branch target buffer (BTB) is the norm in most designs, if we add an extra bit to the BTB, indicating whether the loop buffer stores the fall-through or the target trace after a forward branch within the innermost loop, then much of the complexi...