TY - GEN
T1 - On-the-fly recovery of job input data in supercomputers
AU - Wang, Chao
AU - Zhang, Zhe
AU - Vazhkudai, Sudharshan S.
AU - Ma, Xiaosong
AU - Mueller, Frank
PY - 2008
Y1 - 2008
N2 - Storage system failure is a serious concern as we approach Petascale computing. Even at today's sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre's two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center serviceability and user job turnaround time.
AB - Storage system failure is a serious concern as we approach Petascale computing. Even at today's sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre's two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center serviceability and user job turnaround time.
UR - https://www.scopus.com/pages/publications/55849114447
U2 - 10.1109/ICPP.2008.28
DO - 10.1109/ICPP.2008.28
M3 - Conference contribution
AN - SCOPUS:55849114447
SN - 9780769533742
T3 - Proceedings of the International Conference on Parallel Processing
SP - 620
EP - 627
BT - Proceedings - 37th International Conference on Parallel Processing, ICPP 2008
T2 - 37th International Conference on Parallel Processing, ICPP 2008
Y2 - 9 September 2008 through 12 September 2008
ER -