On-the-fly recovery of job input data in supercomputers

Chao Wang*, Zhe Zhang, Sudharshan S. Vazhkudai, Xiaosong Ma, Frank Mueller

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Storage system failure is a serious concern as we approach Petascale computing. Even at today's sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking, while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre's two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center serviceability and user job turnaround time.

Original language: English
Title of host publication: Proceedings - 37th International Conference on Parallel Processing, ICPP 2008
Pages: 620-627
Number of pages: 8
Publication status: Published - 2008
Externally published: Yes
Event: 37th International Conference on Parallel Processing, ICPP 2008 - Portland, OR, United States
Duration: 9 Sept 2008 to 12 Sept 2008

Publication series

Name: Proceedings of the International Conference on Parallel Processing
ISSN (Print): 0190-3918

Conference

Conference: 37th International Conference on Parallel Processing, ICPP 2008
Country/Territory: United States
City: Portland, OR
Period: 9/09/08 to 12/09/08
