In this paper, we investigate the efficient software implementations of the Montgomery modular multiplication algorithm on a multi-core system. A HW/SW co-design technique is used to find the efficient system architecture and the instruction scheduling method. We first implement the Montgomery modular multiplication on a multi-core system with general purpose cores. We then speed up it by adopting the Multiply-Accumulate (MAC) operation in each core. As a result, the performance