The cost and complexity of administering large systems have come to dominate their total cost of ownership. Stateless and soft-state components, such as Web servers or network routers, are relatively easy to manage: capacity can be scaled incrementally by adding more nodes, load is easily rebalanced after failover, and reactive or proactive ("rolling") reboots can be used to handle transient failures. We show that it is possible to achieve the same ease of management for the state-storage subsystem by subdividing persistent state according to the specific guarantees needed by each type. While other systems [22, 20] have addressed persistent-until-deleted state, we describe SSM, an implemented store for a previously unaddressed class of state
Benjamin C. Ling, Emre Kiciman, Armando Fox