Enabling Applications for Grid Computing with Globus

3.6. Checkpoint and restart capability

A job within a grid application may be designed to launch, perform its tasks, and report its success or failure back to the user or grid portal. If it fails, the same job can be launched again, provided that it did not change any persistent data before reaching its error state. This process can then be repeated until the job completes successfully. However, it may make more sense to have the grid server handle failures, which allows a more sophisticated path to job completion.
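The simple relaunch-until-success approach can be sketched as follows. This is a minimal illustration, not a Globus API: the function name `run_until_success`, the `max_attempts` limit, and the convention that a job is a callable returning `True` on success are all assumptions made for the example. It is only safe when a failed run leaves no persistent data behind.

```python
from typing import Callable


def run_until_success(job: Callable[[], bool], max_attempts: int = 5) -> int:
    """Relaunch `job` until it reports success.

    `job` is a hypothetical callable that returns True on success and
    False on failure. Returns the number of attempts that were needed.
    Assumes a failed run has not modified any persistent data.
    """
    for attempt in range(1, max_attempts + 1):
        if job():
            return attempt
    # Give up instead of looping forever on a job that never succeeds.
    raise RuntimeError(f"job failed after {max_attempts} attempts")
```

Capping the number of attempts is a design choice added here so that a permanently failing job does not relaunch indefinitely.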

By building checkpoint and restart capabilities into the job and making its state available to other services within the grid, the job can be restarted from the point at which it failed, even on a different node.
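A job with checkpoint and restart capability might look like the sketch below. It is an illustration under stated assumptions rather than anything the Globus Toolkit prescribes: the state is stored as a JSON file, the names `save_checkpoint`, `load_checkpoint`, and `run_job` are hypothetical, and the `fail_at` parameter exists only to simulate a failure. For a restart on a different node, the checkpoint path is assumed to be on storage that every candidate node can reach.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place,
    so a crash during the write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_item": 0, "total": 0}


def run_job(items, path, fail_at=None):
    """Sum `items`, checkpointing after each one.

    `fail_at` (hypothetical, for demonstration only) raises an error
    just before the item at that index is processed.
    """
    state = load_checkpoint(path)
    for index in range(state["next_item"], len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError(f"simulated failure at item {index}")
        state["total"] += items[index]
        state["next_item"] = index + 1
        save_checkpoint(path, state)
    return state["total"]
```

If a first call such as `run_job(items, path, fail_at=2)` raises partway through, a later call to `run_job(items, path)` with the same checkpoint path resumes at the first unprocessed item instead of starting over.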

Enabling Applications for Grid Computing with Globus by Bart Jacob, Luis Ferreira, Norbert Bieberstein, Candice Gilzean, Jean-Yves Girard, Roman Strachowski, Seong Yu
