A Communication Framework for Fault-Tolerant Parallel Execution



PC grids represent massive computation capacity at a low cost, but are challenging to employ for parallel computing because of variable and unpredictable performance and availability. A communicating parallel program must employ checkpoint-restart and/or process redundancy to make continuous forward progress in such an unreliable environment. A communication model based on one-sided Put/Get calls, pioneered by the Linda system, is a good match as processes can execute their communication operations independently and asynchronously. However, Linda and its many variants are not designed for communicating processes that are replicated or independently restarted from checkpoints. The key problem is that a single logical operation that impacts the global program state may be executed by different instances of the same process at different times leading to semantic inconsistency. This paper presents the design, execution model, implementation, and validation of a communication layer for ro...
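The inconsistency described in the abstract can be illustrated with a small sketch. Everything below is hypothetical: `TupleSpace`, `DedupSpace`, `worker`, and the `op_id` parameter are names invented for illustration, and the deduplication shown is one common idempotence technique, not necessarily the design the paper presents (the abstract is truncated before it describes that design). The sketch assumes a toy in-memory shared space with one-sided put/get calls and shows how a replica or restarted instance repeating the same logical Put changes the global state a second time unless repeats are detected.

```python
from collections import defaultdict


class TupleSpace:
    """Toy in-memory shared space with one-sided put/get calls (illustrative only)."""

    def __init__(self):
        self._items = defaultdict(list)

    def put(self, key, value):
        # One-sided write: no receiver has to participate.
        self._items[key].append(value)

    def get(self, key):
        # Destructive read: remove and return the oldest value, or None if empty.
        values = self._items[key]
        return values.pop(0) if values else None

    def count(self, key):
        return len(self._items[key])


class DedupSpace(TupleSpace):
    """Hypothetical variant: every logical put carries an operation id,
    and a repeated id is ignored instead of being applied again."""

    def __init__(self):
        super().__init__()
        self._seen = set()

    def put(self, key, value, op_id):
        if op_id in self._seen:
            return False  # same logical operation already applied
        self._seen.add(op_id)
        super().put(key, value)
        return True


def worker(space, rank, tagged):
    """One process instance: compute a result and publish it once."""
    result = rank * 10
    if tagged:
        # Assumed id scheme: (process rank, per-process operation counter).
        space.put("result", result, op_id=(rank, 0))
    else:
        space.put("result", result)


# Two instances of the same logical process (a replica, or a restart
# from a checkpoint) both execute the same logical Put.
plain = TupleSpace()
worker(plain, rank=1, tagged=False)
worker(plain, rank=1, tagged=False)
plain_count = plain.count("result")

dedup = DedupSpace()
worker(dedup, rank=1, tagged=True)
worker(dedup, rank=1, tagged=True)
dedup_count = dedup.count("result")
```

In the plain space the result is stored twice, so a consumer sees a value that logically was produced only once; in the tagged variant the second Put is dropped. Repeated Get calls raise a similar problem in the other direction, which this sketch does not cover.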

Nagarajan Kanna, Jaspal Subhlok, Edgar Gabriel, Eshwar Rohit, David Anderson


Continuous Forward Progress | LCPC 2009 | Massive Computation Capacity | Parallel Computing | System Software


Post Info

Added	26 Jul 2010
Updated	26 Jul 2010
Type	Conference
Year	2009
Where	LCPC
Authors	Nagarajan Kanna, Jaspal Subhlok, Edgar Gabriel, Eshwar Rohit, David Anderson
