We present a cross-layer customization methodology for latency and bandwidth efficient inter-core communication in embedded multiprocessors. The methodology integrates compiler, operating system, and hardware support to achieve a bandwidth efficient, snoopfree, and coherence cache miss-free shared memory communication between synchronized producer and consumers cores. A compilerdriven code transformation is introduced that utilizes a simple ISA support in the form of a special write-through store instruction. It ensures that producer writes are propagated to the consumers with a single bus transaction per cache block when the producer performs the last write to that cache line before exiting its synchronization region. Information regarding the shared buffers involved in the communications is captured by the OS and provided to the cores with the purpose of filtering bus traffic and performing remote updates when necessary. The end result of the proposed methodology is a single bus tra...