The key to increasing performance in modern processors without a commensurate increase in power consumption lies in exploiting both parallelism and core specialization. Core specialization has been employed in the embedded space and is likely to play an important role in future heterogeneous multi-core architectures as well. In this paper, we use the face recognition application domain as a case study to showcase an architectural design methodology that generates a specialized core with high performance and very low power consumption. Specifically, we create 'ASIC-like' execution flows to sustain the high memory parallelism generated within the core. The price of this benefit is a significant increase in compilation complexity. The crux of the problem is the need to co-schedule data access, data movement, and computation under constraints that often conflict. A modular compiler approach that employs integer linear programming (ILP) based 'interconnect-aware...