The efficiency of public-key encryption systems such as RSA and ECC can be improved by adopting a faster modular multiplication scheme. In this paper, modified Montgomery multiplication algorithms and their circuit architectures are presented. The first modified Montgomery multiplier uses a 4:2 compressor and carry-save adders (CSAs) to perform large-word-length additions. The total delay for a single modular multiplication using this approach is 7 XOR + 1 AND gate delays, compared to 8 XOR + 1 AND gate delays for the fastest recently proposed algorithm. The second modified Montgomery multiplier uses a novel hardware unit that outputs the carry-save representation of four input operands in 3 XOR delays. The total delay for a single modular multiplication using this unit is 5 XOR + 1 AND gate delays, compared to 6 XOR + 1 AND gate delays for the recently proposed algorithm. Optimal transistor-level implementations of the proposed approaches are also presented. The proposed transistor implementations are highly optimi...