We have developed new methods for log-based recovery of middleware servers that involve thread pooling, private in-memory state for clients, shared in-memory state, and message interactions among middleware servers. Because crashes are observed to be rare, shared state is relatively small, and shared-state reads and writes are infrequent, we are able to reduce the overhead of message logging and shared-state logging while maintaining recovery independence. Checkpointing has very little impact on ongoing activities while still reducing recovery time. Our recovery mechanism enables client private states to be recovered in parallel after a crash. We have implemented a prototype recovery infrastructure on a commercial middleware server platform; it demonstrates that the system complexity is manageable and shows promising performance results.
Categories & Subject Descriptors: D.4.5 Reliability, D.2.4 Software/Program Verification
General Terms: Reliability, Performance
Rui Wang, Betty Salzberg, David B. Lomet