Next-generation high-end Network Processors (NPs) must cope with both increasingly diverse applications and ever-growing traffic loads. One major challenge is designing an extraordinarily scalable architecture. In this paper, we argue that this objective can only be met by introducing a highly parallel structure, namely the Paralleled Processing-engine Cluster (PPC). We support this claim by examining the trade-offs among performance, programmability and flexibility. However, PPC inherently suffers from several critical issues: load balancing, intra-flow packet ordering and memory contention. After investigating several existing approaches, we present a novel solution for each issue that balances performance against cost. Through intensive analysis and comprehensive simulations, we show that Shortest Queue First scheduling with Class-based prediction (SQF-C) performs nearly optimally, while the hardware-based per-flow ordering mechanis...