This paper explores the challenges associated with distributed application management in large-scale computing environments. In particular, we investigate several techniques for extending Plush, an existing distributed application management framework, to provide improved scalability and fault tolerance without sacrificing performance. One of the main limitations of Plush is the structure of the underlying communication fabric. We explain how we incorporated the use of an overlay tree provided by Mace, a toolkit that simplifies the implementation of overlay networks, in place of the existing communication subsystem in Plush to improve robustness and scalability.
Nikolay Topilski, Jeannie R. Albrecht, Amin Vahdat