Sciweavers

482 search results - page 25 / 97
» A large-scale study of failures in high-performance computin...
HPDC
2010
IEEE
15 years 4 months ago
ROARS: a scalable repository for data intensive scientific computing
As scientific research becomes more data intensive, there is an increasing need for scalable, reliable, and high performance storage systems. Such data repositories must provide b...
Hoang Bui, Peter Bui, Patrick J. Flynn, Douglas Th...
CCGRID
2009
IEEE
15 years 10 months ago
Performance under Failures of DAG-based Parallel Computing
As the scale and complexity of parallel systems continue to grow, failures become more and more an inevitable fact for solving large-scale applications. In this research, we pr...
Hui Jin, Xian-He Sun, Ziming Zheng, Zhiling Lan, B...
ICPP
2009
IEEE
15 years 10 months ago
Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems
Clusters and applications continue to grow in size while their mean time between failure (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for lar...
Xiangyong Ouyang, Karthik Gopalakrishnan, Dhabales...
PDP
2002
IEEE
15 years 8 months ago
On the Impossibility of Implementing Perpetual Failure Detectors in Partially Synchronous Systems
In this paper we study the implementability of different classes of failure detectors in several models of partial synchrony. We show that no failure detector with perpetual accur...
Mikel Larrea, Antonio Fernández, Sergio Ar...
CCGRID
2009
IEEE
15 years 7 months ago
Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing
In large-scale clusters and computational grids, component failures become norms instead of exceptions. Failure occurrence as well as its impact on system performance and operatio...
Song Fu