The computational fundament of most public-key cryptosystems is the modular multiplication. Improving the efficiency of the modular multiplication is directly associated with the efficiency of the whole cryptosystem. This paper presents an implementation and comparison of three recently proposed, highly efficient architectures for modular multiplication on FPGAs: interleaved modular multiplication and two variants of the Montgomery modular multiplication. This (first) hardware implementation of these designs shows their relative performance regarding area and speed. One of the main findings is that the interleaved multiplication has the least area time product of all investigated architectures. As a typical cryptographic application, we show that a 1024-bit RSA exponentiation can be performed in less than 6.1ms at a clock rate of 69MHz on a Xilinx Virtex FPGA.