This project enables scalable and efficient I/O for distributed deep learning training on compute clusters using the existing hardware/software stack.
Emerging Deep Learning (DL) applications introduce heavy I/O workloads on compute clusters. Their inherently long-lasting, repeated, high-volume, and highly concurrent random file accesses can easily saturate the metadata and data services and negatively impact other users. In this project, we design a transient runtime file system that optimizes DL I/O on existing hardware/software stacks. Guided by a comprehensive I/O profiling study of real-world DL applications, we implemented FanStore. FanStore distributes datasets across the local storage of compute nodes and maintains a global namespace. Using function interception, distributed metadata management, and generic data compression, FanStore provides a POSIX-compliant interface with native hardware throughput in an efficient and scalable manner. Users can take advantage of the optimized I/O without making intrusive code changes. Our experiments with benchmarks and real applications show that FanStore scales DL training to 512 compute nodes with over 90% scaling efficiency.
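To illustrate the function-interception technique that lets unmodified applications see FanStore's global namespace through ordinary POSIX calls, below is a minimal sketch of LD_PRELOAD-based interposition in C. It is not FanStore's actual implementation; the fanstore_resolve helper and the /fanstore/ path prefix are hypothetical placeholders standing in for the real metadata lookup.

```c
/*
 * Minimal sketch of LD_PRELOAD-based function interception, the general
 * technique used to present a POSIX interface without application changes.
 * fanstore_resolve() and the "/fanstore/" prefix are hypothetical.
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical: map a path in the global namespace to node-local storage;
 * return NULL for files that should go to the regular file system. */
static const char *fanstore_resolve(const char *path)
{
    return strncmp(path, "/fanstore/", 10) == 0 ? path : NULL;
}

/* Interposed open(): redirect namespace paths, forward everything else
 * to the real libc implementation found via dlsym(RTLD_NEXT, ...). */
int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...) = NULL;
    if (!real_open)
        real_open = dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {          /* mode argument only exists with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }

    const char *local = fanstore_resolve(path);
    if (local)                      /* serve from node-local storage */
        return real_open(local, flags, mode);
    return real_open(path, flags, mode);  /* fall through to normal I/O */
}
```

Built as a shared library (e.g., gcc -shared -fPIC intercept.c -o libintercept.so -ldl) and activated with LD_PRELOAD=./libintercept.so, such a shim intercepts POSIX calls from an unmodified training job, which is why no intrusive code changes are required.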
Zhao Zhang Research Associate
Lei Huang Research Associate
John Cazes Deputy Director of High Performance Computing
Niall Gaffney Director of Data Intensive Computing
Zhang, Zhao, Lei Huang, Uri Manor, Linjing Fang, Gabriele Merlo, Craig Michoski, John Cazes, and Niall Gaffney. "FanStore: Enabling Efficient and Scalable I/O for Distributed Deep Learning." arXiv preprint arXiv:1809.10799 (2018).
Base funding