Memory Error Impact on Deep Learning Training


This project quantifies the impact of silent data corruption on deep learning training.


Supercomputers have shown an unparalleled capacity to accelerate deep learning (DL) training. In the coming era of exascale computing, a high error rate is expected to be problematic for most HPC applications, but the impact on emerging DL applications remains unclear given their stochastic nature. In this project, we focus on understanding the training phase of such applications in the presence of silent data corruption. We design and perform a quantification study with three representative applications, manually injecting silent data corruption errors (SDCs) across the design space and comparing training results with an error-free baseline. The results show that only 0.61–1.76% of SDCs cause training failures, and, given the SDC rate of modern hardware, the actual chance of such an error is one in thousands to millions. Over 75% of the SDCs that cause catastrophic errors produce an anomalous training loss in the next iteration, which can be easily detected. With our method and results, supercomputer designers can make a rational selection between error correction code (ECC) enabled hardware and ECC-free hardware, with or without error-aware DL frameworks, based on their acceptable training failure rate.
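The injection methodology can be illustrated with a minimal sketch (assuming NumPy; `flip_random_bit` is a hypothetical helper for illustration, not the project's actual injector): flipping a single random bit in a float32 parameter array emulates one SDC event in memory.

```python
import numpy as np

def flip_random_bit(weights: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a copy of a float32 weight array with one random bit flipped,
    emulating a single silent data corruption (SDC) event in memory."""
    corrupted = np.ascontiguousarray(weights, dtype=np.float32).copy()
    bits = corrupted.reshape(-1).view(np.uint32)  # reinterpret the raw IEEE-754 bits
    idx = int(rng.integers(bits.size))            # which scalar to corrupt
    bit = int(rng.integers(32))                   # which of its 32 bits to flip
    bits[idx] ^= np.uint32(1 << bit)
    return corrupted
```

In a study like the one described above, such a flip would be applied to model state at a chosen training iteration, and the next iteration's training loss compared against the error-free baseline to decide whether the corruption is detectable.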


  • This work reveals insights into the expected training failure rate in the presence of silent data corruption in memory, helping DL researchers who work on ECC-free hardware understand the risk.
  • This work can help computing centers and hardware designers make appropriate acquisition or design decisions based on quantitative evidence and their acceptable training failure rate.


Zhao Zhang
Research Associate

Lei Huang
Research Associate

Ruizhu Huang
Research Associate

Weijia Xu
Manager, Scalable Computational Intelligence

Funding Source

Base funding