We detail and analyse the critical techniques that may be combined in the design of fast hardware for RSA cryptography: Chinese remainders, star chains, Hensel's odd division (a.k.a. Montgomery modular reduction), carry-save representation, quotient pipelining and asynchronous carry completion adders. A PAM1 implementation of RSA which combines all of the techniques presented here is fully operational at PRL: it delivers an RSA secret decryption rate of over 600 Kb/s for 512-bit keys, and 165 Kb/s for 1 Kb keys. This is an order of magnitude faster than any previously reported running implementation. While our implementation makes full use of the PAM's reconfigurability, we can nevertheless derive from our implementation (multiple PAM designs) a (single) gate-array specification whose size is estimated at under 100K gates, with a speed of over 1 Mb/s for 512-bit RSA keys. Each speed-up in the hardware performance of RSA involves a matching gain in software performance, which we also analyse. In add...