Sciweavers

FAST
2016

Uncovering Bugs in Distributed Storage Systems during Testing (Not in Production!)

8 years 8 months ago
Uncovering Bugs in Distributed Storage Systems during Testing (Not in Production!)
Testing distributed systems is challenging due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. Formal techniques, such as TLA+, can only verify high-level specifications of systems at the level of logic-based models, and fall short of checking the actual executable code. In this paper, we present a new methodology for testing distributed systems. Our approach applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors. Our methodology has been applied to three distributed storage systems in the Microsoft Azure cloud computing platform. In the process, numerous bugs were identified, reproduced, confirmed and fixed. These bugs required a subtle combination of concurrency and failures, making th...
Pantazis Deligiannis, Matt McCutchen, Paul Thomson
Added 03 Apr 2016
Updated 03 Apr 2016
Type Journal
Year 2016
Where FAST
Authors Pantazis Deligiannis, Matt McCutchen, Paul Thomson, Shuo Chen, Alastair F. Donaldson, John Erickson, Cheng Huang, Akash Lal, Rashmi Mudduluru, Shaz Qadeer, Wolfram Schulte
Comments (0)