Sciweavers

» Performance under Failures of DAG-based Parallel Computing
IPPS
2002
IEEE
14 years 9 days ago
Failure Behavior Analysis for Reliable Distributed Embedded Systems
Failure behavior analysis is a very important phase in developing large distributed embedded systems with weak safety requirements which do graceful degradation in case of failure...
Mario Trapp, Bernd Schürmann, Torsten Tettero...
CONCUR
2005
Springer
14 years 28 days ago
A Theory of System Behaviour in the Presence of Node and Link Failures
We develop a behavioural theory of distributed programs in the ...
Adrian Francalanza, Matthew Hennessy
HPDC
2011
IEEE
12 years 11 months ago
Algorithm-based recovery for iterative methods without checkpointing
In today’s high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied...
Zizhong Chen
IPPS
2007
IEEE
14 years 1 month ago
An Adaptive Semantic Filter for Blue Gene/L Failure Log Analysis
Frequent failure occurrences are becoming a serious concern to the community of high-end computing, especially when the applications and the underlying systems rapidly grow in ...
Yinglung Liang, Yanyong Zhang, Hui Xiong, Ramendra...
CCGRID
2006
IEEE
14 years 1 month ago
A Failure-Aware Scheduling Strategy in Large-Scale Cluster System
As the scale is expanding, node failure becomes a commonplace feature of large-scale cluster systems. As an important part of cluster operating system software, job scheduling tak...
Linping Wu, Dan Meng, Jianfeng Zhan, Wang Lei, Bib...