DRAM errors in the wild: a large-scale field study

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held...

Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber

DRAM | DRAM Error | Hardware | Memory Errors | SIGMETRICS 2009 |

Added	28 May 2010
Updated	28 May 2010
Type	Conference
Year	2009
Where	SIGMETRICS
Authors	Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber
