Sciweavers

148 search results - page 16 / 30
» Recovery From Software Failures Caused by Mandelbugs
ICDCS
2000
IEEE
15 years 8 months ago
Coherence-based Coordinated Checkpointing for Software Distributed Shared Memory Systems
Fault-tolerant techniques that can cope with system failures in software distributed shared memory (SDSM) are essential for creating productive and highly available parallel compu...
Angkul Kongmunvattana, Santipong Tanchatchawal, Ni...
ANSS
2007
IEEE
15 years 7 months ago
An Accurate and Efficient Time-Division Parallelization of Cycle Accurate Architectural Simulators
This paper proposes a parallel cycle-accurate microarchitectural simulator which efficiently executes its workload by splitting the simulation process along time-axis into many in...
Masahiro Yano, Toru Takasaki, Takashi Nakada, Hiro...
CBSE
2011
Springer
14 years 3 months ago
Rectifying orphan components using group-failover in distributed real-time and embedded systems
Orphan requests are a significant problem for multi-tier distributed systems since they adversely impact system correctness by violating the exactly-once semantics of application...
Sumant Tambe, Aniruddha S. Gokhale
LCPC
2007
Springer
15 years 9 months ago
Compiler-Enhanced Incremental Checkpointing
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety o...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
MOBISYS
2007
ACM
16 years 3 months ago
NodeMD: diagnosing node-level faults in remote wireless sensor systems
Software failures in wireless sensor systems are notoriously difficult to debug. Resource constraints in wireless deployments substantially restrict visibility into the root cause...
Veljko Krunic, Eric Trumpler, Richard Han