TY - JOUR
T1 - End-to-end I/O Monitoring on Leading Supercomputers
AU - Yang, Bin
AU - Xue, Wei
AU - Zhang, Tianyu
AU - Liu, Shichao
AU - Ma, Xiaosong
AU - Wang, Xiyang
AU - Liu, Weiguo
N1 - Publisher Copyright:
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2023/1/11
Y1 - 2023/1/11
N2 - This paper offers a solution to overcome the complexities of production system I/O performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. With Beacon's deployment on TaihuLight for more than three years, we demonstrate Beacon's effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has already successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with others being currently addressed. Encouraged by Beacon's success in I/O monitoring, we extend it to monitor interconnection networks, which is another contention point on supercomputers. In addition, we demonstrate Beacon's generality by extending it to other supercomputers. Both Beacon codes and part of collected monitoring data are released.
AB - This paper offers a solution to overcome the complexities of production system I/O performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. With Beacon's deployment on TaihuLight for more than three years, we demonstrate Beacon's effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has already successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with others being currently addressed. Encouraged by Beacon's success in I/O monitoring, we extend it to monitor interconnection networks, which is another contention point on supercomputers. In addition, we demonstrate Beacon's generality by extending it to other supercomputers. Both Beacon codes and part of collected monitoring data are released.
KW - Anomaly detection
KW - Bottleneck optimization
KW - I/O diagnosis
KW - I/O monitoring
UR - https://www.scopus.com/pages/publications/85149435496
U2 - 10.1145/3568425
DO - 10.1145/3568425
M3 - Article
AN - SCOPUS:85149435496
SN - 1553-3077
VL - 19
JO - ACM Transactions on Storage
JF - ACM Transactions on Storage
IS - 1
M1 - 3
ER -