Operator mistakes are a significant source of unavailability in modern Internet services. In this paper, we first characterize these mistakes by performing an extensive set of exp...
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expen...
George Candea, Shinichi Kawamoto, Yuichi Fujiki, G...
FFPF is a network monitoring framework designed for three things: speed (handling high link rates), scalability (ability to handle multiple applications) and flexibility. Multiple...
Herbert Bos, Willem de Bruijn, Mihai-Lucian Criste...
Tools to understand complex system behaviour are essential for many performance analysis and debugging tasks, yet there are many open research problems in their development. Magpi...
Paul Barham, Austin Donnelly, Rebecca Isaacs, Rich...
Abstractions as the Foundation for Storage Infrastructure. John MacCormick, Nick Murphy, Marc Najork, Chandramohan A. Thekkath, and Lidong Zhou (Microsoft Research Silicon Valley). Wr...
John MacCormick, Nick Murphy, Marc Najork, Chandra...