Traditional software profiling is not a reliable basis for optimizing embedded software in an MPSoC design. When multiple processors run concurrently and their programs interact, profiling each processor in isolation cannot capture the execution information needed to guide software optimization. A new method for modeling the parallel execution of interacting programs is needed. In this paper, we consider the software optimization problem for throughput-constrained MPSoC designs. We define the "longest delay path" as a sequence of steps leading to a throughput constraint violation and propose an algorithm that builds this path dynamically during simulation. In a case study using an industrial-strength MPEG-2 decoder design, with custom instructions as the software optimization technique, we show that frequently executed statement information from the longest delay path lets us optimize the software in MPSoC designs efficiently.

Categories and Subject Descriptors D.2.2 [Design Tools and Tec...