This paper introduces an efficient Montgomery multiplier for modular exponentiation, an operation fundamental to numerous public-key cryptosystems. The design addresses four aspects: performance, power, reliability, and scalability. To increase performance, the architecture is based on a radix-4 carry-save adder (CSA). To lower power consumption, we devise several techniques that reduce spurious transitions and the expected switching activity (ESA) of high fan-out signals. To achieve scalability, we implement a 4-fold nested loop for the entire data processing flow, which is compatible with multiple-precision digit-serial arithmetic as well as with data transfer to and from an external memory. Lastly, to ensure that the arithmetic operations run correctly without data overflow, we determine the optimal bit widths of all vectors appearing in the operation through mathematical analysis and logic simulation. In the evaluation of hardware...
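For orientation, the arithmetic that such a multiplier computes can be sketched in software. The following is a minimal integer-level model of digit-serial radix-4 Montgomery multiplication and of a modular exponentiation built on it. It is an illustrative assumption, not the proposed architecture: it does not model the carry-save representation, the power-reduction techniques, or the 4-fold nested loop, and the names `montgomery_multiply_radix4`, `montgomery_pow`, and `num_digits` are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow(n, -1, 4)`.

```python
def montgomery_multiply_radix4(a, b, n, num_digits):
    """Return a * b * R^(-1) mod n with R = 4**num_digits.

    Assumes n is odd, n >= 3, n < 4**num_digits, and 0 <= a, b < n.
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    if n >= 4 ** num_digits:
        raise ValueError("4**num_digits must exceed the modulus")
    if not (0 <= a < n and 0 <= b < n):
        raise ValueError("operands must lie in [0, n)")

    # Constant -n^(-1) mod 4, used to pick the quotient digit.
    n_prime = (-pow(n, -1, 4)) % 4

    s = 0
    for i in range(num_digits):
        a_i = (a >> (2 * i)) & 3          # i-th radix-4 digit of a
        s += a_i * b                      # accumulate partial product
        q = ((s & 3) * n_prime) & 3       # makes s + q*n divisible by 4
        s = (s + q * n) >> 2              # exact division by the radix
    # The accumulator stays below 2n, so one subtraction suffices.
    if s >= n:
        s -= n
    return s


def montgomery_pow(base, exponent, n, num_digits):
    """Return base**exponent mod n using left-to-right square-and-multiply."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    r2 = pow(4, 2 * num_digits, n)        # R^2 mod n, precomputed once
    # Map operands into the Montgomery domain: x -> x * R mod n.
    x_bar = montgomery_multiply_radix4(base % n, r2, n, num_digits)
    acc = montgomery_multiply_radix4(1, r2, n, num_digits)
    for bit in bin(exponent)[2:]:
        acc = montgomery_multiply_radix4(acc, acc, n, num_digits)
        if bit == "1":
            acc = montgomery_multiply_radix4(acc, x_bar, n, num_digits)
    # Multiply by 1 to leave the Montgomery domain.
    return montgomery_multiply_radix4(acc, 1, n, num_digits)
```

In this model the accumulator `s` never reaches `2 * n`, which is a software analogue of the overflow question the bit-width analysis addresses; the vector widths required in a carry-save hardware datapath are a separate matter and are not derived here.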