End-to-end I/O Monitoring on Leading Supercomputers

  • Bin Yang
  • Wei Xue*
  • Tianyu Zhang
  • Shichao Liu
  • Xiaosong Ma
  • Xiyang Wang
  • Weiguo Liu

* Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

18 Citations (Scopus)

Abstract

This paper offers a solution to overcome the complexities of production system I/O performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. With Beacon's deployment on TaihuLight for more than three years, we demonstrate Beacon's effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has already successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with others being currently addressed. Encouraged by Beacon's success in I/O monitoring, we extend it to monitor interconnection networks, which is another contention point on supercomputers. In addition, we demonstrate Beacon's generality by extending it to other supercomputers. Both Beacon codes and part of collected monitoring data are released.

Original language: English
Article number: 3
Journal: ACM Transactions on Storage
Volume: 19
Issue number: 1
Publication status: Published - 11 Jan 2023

Keywords

  • Anomaly detection
  • Bottleneck optimization
  • I/O diagnosis
  • I/O monitoring
