This paper describes and evaluates two algorithms for performing on-line failure recovery (data reconstruction) in redundant disk arrays. It presents an implementation of disk-oriented reconstruction, a data recovery algorithm that allows the reconstruction process to absorb essentially all the disk bandwidth not consumed by the user processes, and then compares this algorithm to a previously proposed parallel stripe-oriented approach. The disk-oriented approach yields better overall failure-recovery performance. The paper evaluates performance via detailed simulation on two different disk array architectures: the RAID level 5 organization and the declustered parity organization. The benefits of the disk-oriented algorithm can be achieved using controller or host buffer memory no larger than the size of three disk tracks per disk in the array. This paper also investigates the tradeoffs involved in selecting the size of the disk accesses used by the failure recovery process.
Mark Holland, Garth A. Gibson, Daniel P. Siewiorek