TY - GEN
T1 - Optimizing center performance through coordinated data staging, scheduling and recovery
AU - Zhang, Zhe
AU - Wang, Chao
AU - Vazhkudai, Sudharshan S.
AU - Ma, Xiaosong
AU - Pike, Gregory G.
AU - Cobb, John W.
AU - Mueller, Frank
PY - 2007
Y1 - 2007
N2 - Procurement and the optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. Storage systems are known to be the primary fault source leading to data unavailability and job resubmissions. This results in reduced center performance, partially due to the lack of coordination between I/O activities and job scheduling. In this work, we propose the coordination of job scheduling with data staging/offloading and on-demand staged data reconstruction to address the availability of job input data and to improve centerwide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center's standpoint, these techniques optimize resource usage and increase its data/service availability. From a user's standpoint, they reduce the job turnaround time and optimize the allocated time usage. (c) 2007 ACM.
KW - Coordinated scheduling
KW - Data scheduling
KW - Data staging
KW - HPC center performance optimization
KW - Transient data recovery
UR - https://www.scopus.com/pages/publications/56749179540
U2 - 10.1145/1362622.1362696
DO - 10.1145/1362622.1362696
M3 - Conference contribution
AN - SCOPUS:56749179540
SN - 9781595937643
T3 - Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07
BT - Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07
T2 - 2007 ACM/IEEE Conference on Supercomputing, SC'07
Y2 - 10 November 2007 through 16 November 2007
ER -