This paper describes an improved version of the Tenca-Koç unified scalable radix-2 Montgomery multiplier that halves both the latency for small- and moderate-precision operands and the queue memory requirement. Like the Tenca-Koç multiplier, this design is reconfigurable to accept any input precision in either GF(p) or GF(2^n), up to the size of the on-chip memory. An FPGA implementation can perform 1024-bit modular exponentiation in 16 ms using 5598 4-input lookup tables, making it the fastest unified scalable design yet reported.
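For readers unfamiliar with the operation being accelerated, the following is a minimal software sketch of textbook bit-serial radix-2 Montgomery multiplication, written so that one loop serves both GF(p) and GF(2^n). It is not the word-serial scalable hardware architecture described in this paper; it only illustrates what "unified" means at the arithmetic level, namely that the two fields differ only in whether additions propagate carries. The function names (`mont_mul`, `clmul`, `polymod`) and the `field` parameter are hypothetical and chosen for illustration.

```python
def mont_mul(x, y, m, n, field="gfp"):
    """Return x * y * 2^(-n) mod m using n radix-2 Montgomery steps.

    field="gfp": x, y, m are integers, m is odd, and x, y < m < 2^n.
    field="gf2": x, y, m are bit-packed polynomials over GF(2); m has
                 degree n and a nonzero constant term, and 2 stands for
                 the polynomial t.
    """
    carry = (field == "gfp")       # GF(p) adds with carries, GF(2^n) uses XOR
    z = 0
    for i in range(n):
        if (x >> i) & 1:           # add y when bit i of x is set
            z = z + y if carry else z ^ y
        if z & 1:                  # make z even by adding the modulus
            z = z + m if carry else z ^ m
        z >>= 1                    # exact division by 2 (or by t)
    if carry and z >= m:           # GF(p) only: one final subtraction
        z -= m
    return z


def clmul(a, b):
    """Carry-less (polynomial) product of a and b over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, f):
    """Remainder of polynomial a modulo polynomial f over GF(2)."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a


# GF(p) check: the result times 2^n must be congruent to x*y mod m.
zp = mont_mul(100, 57, 239, 8, "gfp")
assert (zp << 8) % 239 == (100 * 57) % 239

# GF(2^8) check with the polynomial t^8 + t^4 + t^3 + t + 1 (0x11B).
zb = mont_mul(0x57, 0x83, 0x11B, 8, "gf2")
assert polymod(zb << 8, 0x11B) == polymod(clmul(0x57, 0x83), 0x11B)
```

Both checks confirm the defining property of a Montgomery product: multiplying the result by 2^n (or t^n) and reducing recovers the ordinary modular product. A scalable hardware design processes the same recurrence one word of the operands at a time rather than on full-width values as this sketch does.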