HPCPerf Stats

Purpose

HPCPerf Stats is an infrastructure for low-overhead collection of system-wide performance data that integrates information from a variety of sources. It provides a web-based interface for exploring jobs and system-level reports built from this data, as well as automated analysis and flagging of jobs that need human attention.

Histograms generated from WRF runs on Stampede. Subplots show run times, job size in cores, average cycles per instruction (CPI), and average floating point computation rate.

Overview

The HPCPerf Stats monitor runs periodically during the execution of each job and collects a large number of system statistics and hardware performance counter values. The data cover CPU usage, socket-level memory usage, swapping and paging statistics, system load and process statistics, system and block device counters, interprocess communication, filesystem usage (NFS, Lustre, Panasas), interconnect fabric traffic, and CPU and uncore counters (e.g., counters from the memory controller, the cache and NUMA coherence agents, and the power control unit).
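Periodic readings of cumulative counters are commonly turned into rates by differencing consecutive samples. The following is a minimal sketch of that idea, not the HPCPerf Stats implementation: the `Sample` layout, the counter names `cycles` and `instructions`, and the 48-bit counter width are all assumptions made for illustration.

```python
from dataclasses import dataclass

COUNTER_BITS = 48  # assumed hardware counter width (hypothetical)


@dataclass
class Sample:
    """One periodic reading of cumulative counters (hypothetical layout)."""
    timestamp: float  # seconds
    counters: dict    # counter name -> cumulative integer value


def counter_delta(prev: int, curr: int, bits: int = COUNTER_BITS) -> int:
    """Difference of two cumulative readings, tolerating one wraparound."""
    return (curr - prev) % (1 << bits)


def rates(prev: Sample, curr: Sample) -> dict:
    """Per-second rate of every counter present in both samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        name: counter_delta(prev.counters[name], value) / elapsed
        for name, value in curr.counters.items()
        if name in prev.counters
    }


def cycles_per_instruction(prev: Sample, curr: Sample) -> float:
    """Average CPI over the interval between two samples."""
    cycles = counter_delta(prev.counters["cycles"], curr.counters["cycles"])
    instructions = counter_delta(
        prev.counters["instructions"], curr.counters["instructions"]
    )
    if instructions == 0:
        raise ValueError("no instructions retired in this interval")
    return cycles / instructions
```

Taking the difference modulo the counter width keeps the delta correct when a fixed-width hardware counter wraps once between samples, which is one reason to keep the sampling period short relative to the counter's rollover time.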

Nightly analyses are available to flag underperforming and misconfigured jobs for later attention by HPC consultants. Jobs are flagged when they leave nodes idle, use the wrong network, experience a drastic drop in performance, or show evidence of low efficiency.
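The four flagging conditions above can be pictured as simple rules over per-job summaries. The sketch below is illustrative only and does not reproduce the rules HPCPerf Stats actually applies: the `JobSummary` fields, the flag names, and every threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobSummary:
    """Per-job aggregates (hypothetical fields, for illustration only)."""
    job_id: str
    node_cpu_usage: list     # mean CPU utilization per node, 0.0 to 1.0
    ethernet_bytes: float    # traffic observed on the Ethernet network
    fabric_bytes: float      # traffic observed on the high-speed fabric
    flops_by_interval: list  # floating point rate per sampling interval
    cpi: float               # average cycles per instruction


# Assumed thresholds; a real deployment would tune these per system.
IDLE_CPU = 0.05       # a node below this utilization counts as idle
DROP_FRACTION = 0.25  # late rate below this share of early rate is a drop
HIGH_CPI = 3.0        # CPI above this is treated as low efficiency


def flag_job(job: JobSummary) -> list:
    """Return the names of all rules the job trips, in a fixed order."""
    flags = []
    if any(usage < IDLE_CPU for usage in job.node_cpu_usage):
        flags.append("idle_nodes")
    if job.ethernet_bytes > job.fabric_bytes:
        flags.append("wrong_network")
    samples = job.flops_by_interval
    if len(samples) >= 2:
        half = len(samples) // 2
        early = sum(samples[:half]) / half
        late = sum(samples[half:]) / (len(samples) - half)
        if early > 0 and late < DROP_FRACTION * early:
            flags.append("performance_drop")
    if job.cpi > HIGH_CPI:
        flags.append("low_efficiency")
    return flags
```

A nightly pass would call `flag_job` on each job completed that day and store any non-empty result for consultants to review.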

The associated web interface lets users browse all jobs run on a cluster, identify flagged jobs, and plot basic job characteristics.

Impact

HPCPerf Stats is deployed on all TACC systems, SDSC Comet and Gordon, and LSU SuperMIC.

Publications

R. T. Evans, J. C. Browne and W. L. Barth, "Understanding Application and System Performance Through System-Wide Monitoring," 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Chicago, IL, USA, 2016, pp. 1702-1710, doi: 10.1109/IPDPSW.2016.145.

T. Evans, W. L. Barth, J. C. Browne, R. L. DeLeon, T. R. Furlani, S. M. Gallo, M. D. Jones and A. K. Patra, "Comprehensive Resource Use Monitoring for HPC Systems with TACC Stats," 2014 First International Workshop on HPC User Support Tools (HUST), 21 Nov. 2014, pp. 13-21, doi: 10.1109/HUST.2014.7.

Contributors

Amit Ruhela
Manager, HPC Tools

Stephen Lien Harrell
HPC Engineering Scientist, HPC Performance & Architectures

Sangamithra Goutham
Research Engineering Scientist Associate, User Services

Chris Ramos
Engineering Scientist Associate, HPC Performance & Architectures