With current FPGAs, designers can now instantiate several embedded processors, memory units, and a wide variety of IP blocks to build a single-chip, high-performance multiprocessor embedded system. Furthermore, multi-FPGA systems can be built to provide massive parallelism, given an efficient programming model. In this paper, we present a lightweight subset implementation of the standard message-passing interface, MPI, that is suitable for embedded processors. It does not require an operating system and has a small memory footprint. With our MPI implementation (TMD-MPI), we provide a programming model capable of using multiple FPGAs that hides hardware complexities from the programmer, facilitates the development of parallel code, and promotes code portability. To enable intra-FPGA and inter-FPGA communications, a simple Network-on-Chip is also developed using a low-overhead network packet protocol. Together, TMD-MPI and the network provide a homogeneous view of a cluster of embedded proc...