Given a hardware/software partitioned specification and an allocation (number and type) of processors, we present an algorithm that (1) maps each software behavior (or task) to a processor, (2) pipelines the system specification, and (3) schedules the behaviors in each pipe stage among the selected hardware components and processors, so as to satisfy a throughput constraint at minimal hardware cost. Thus, to achieve high performance, critical tasks are implemented as pipelined hardware architectures and the system itself is divided into concurrently executing stages. To offset the cost of this increased concurrency, non-critical sections are implemented on processors or as cheaper hardware blocks. Our experiments demonstrate the feasibility of our approach and the necessity of system pipelining in high-performance design.
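
To make the idea of system pipelining under a throughput constraint concrete, the following is a minimal sketch, not the algorithm presented in this work. It assumes a simplified setting that the abstract does not specify: the behaviors form a linear chain, each has a single known execution delay, and the throughput constraint is expressed as a maximum pipe-stage delay `period`. The function name `pipeline_stages`, the task names, and the delays are all hypothetical.

```python
from typing import List, Tuple


def pipeline_stages(tasks: List[Tuple[str, int]], period: int) -> List[List[str]]:
    """Greedily pack a linear chain of tasks into consecutive pipe stages.

    Assumption (illustrative only): the total delay of each stage must not
    exceed `period`, so the pipeline can accept a new input every `period`
    time units.
    """
    stages: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, delay in tasks:
        if delay > period:
            # A single task slower than the period cannot fit in any stage;
            # it would need a faster (e.g. hardware) implementation.
            raise ValueError(f"task {name!r} alone exceeds the stage period")
        if current and used + delay > period:
            # Close the current stage and start a new one.
            stages.append(current)
            current, used = [], 0
        current.append(name)
        used += delay
    if current:
        stages.append(current)
    return stages


# Hypothetical chain of behaviors with execution delays in arbitrary time units.
chain = [("A", 4), ("B", 3), ("C", 5), ("D", 2), ("E", 6)]
stages = pipeline_stages(chain, period=8)
print(stages)
```

In this hypothetical chain the delays sum to 20 time units, so an unpipelined system could accept a new input only every 20 units, whereas the staged version accepts one every 8. The sketch ignores processor mapping, per-stage scheduling, and hardware cost, all of which the approach described above addresses together.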