High performance microprocessors are designed with generalpurpose applications in mind. When it comes to embedded applications, these architectures typically perform controlintensive tasks in a System-on-Chip (SoC) design. But they are significantly inefficient for data-intensive tasks such as video encoding/decoding. Although configurable processors fill this gap by complementing the existing functional units with instruction extensions, their performance lags behind the needs of real-time embedded tasks. In this paper, we evaluate the performance potential of a dataflow processor for H.264 video decoding. We first profile the H.264 application to capture the amount of data traffic among modules. We use this information to guide the placement of H.264 modules in the WaveScalar dataflow architecture. A simulated annealing based placement algorithm produces the final placement aiming to optimize the communication costs between the modules in the dataflow architecture. In addition to ou...