Divide-and-conquer is a well-suited programming paradigm for parallel Grid applications. Our Satin system efficiently schedules the finegrained tasks of a divide-and-conquer application across multiple clusters in a grid. To accomodate long-running applications, we present a fault-tolerance mechanism for Satin that has negligible overhead during normal execution, while minimizing the amount of redundant work done after a crash of one or more nodes. We study the impact of our fault-tolerance mechanism on application efficiency, both on the Dutch DAS-2 system and using the European testbed of the EC-funded project GridLab.