Sciweavers

» Systems Failures
SIGSOFT
2010
ACM
15 years 15 days ago
Embracing ambiguity
Software helps people fulfill their goals, but development tools lack understanding of those goals. But if development tools did understand how software artifacts relate to higher...
Kenneth C. Arnold, Henry Lieberman
ICSE
2011
IEEE-ACM
14 years 6 months ago
ReAssert: a tool for repairing broken unit tests
Successful software systems continuously change their requirements and thus code. When this happens, some existing tests get broken because they no longer reflect the intended be...
Brett Daniel, Danny Dig, Tihomir Gvero, Vilas Jaga...
SIGMETRICS
2009
ACM
15 years 9 months ago
DRAM errors in the wild: a large-scale field study
Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and...
Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich ...
TSMC
2008
15 years 2 months ago
Distributed Diagnosis Under Bounded-Delay Communication of Immediately Forwarded Local Observations
In this paper, we study distributed failure diagnosis under k-bounded communication delay, where each local site transmits its observations to other sites immediately after each o...
Wenbin Qiu, Ratnesh Kumar
ICDCS
2012
IEEE
13 years 5 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...