Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation