Abstract
Large-scale scientific applications can create enormous complexity for neural networks, both in the size of individual data samples and in the size of the overall data set. We will present LBANN’s unique capabilities that leverage scalable distributed-memory training algorithms as well as large-scale platforms such as the Sierra supercomputer at LLNL for better strong scaling. Specifically, we will describe how our distributed convolution algorithms, coupled with GPU-centric communication techniques, realize both improved compute performance and more capable models by exploiting fine-grained parallelism in large-scale convolutional neural networks. We demonstrate this capability by scaling up the size of the 3D data cube used to train a neural network that predicts cosmological constants. Additionally, we will showcase the challenges of I/O at scale and detail some of the techniques that we use to overcome them.
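To give a flavor of the fine-grained parallelism discussed above, the sketch below illustrates one common form of distributed convolution: decomposing a single large 3D sample into slabs across MPI ranks and exchanging halo planes before applying a local stencil. This is a minimal, hypothetical illustration (using mpi4py, numpy, a 1D slab decomposition, periodic neighbors, and a simple 3-point stencil as assumptions), not LBANN's actual implementation or API.

```python
# Hypothetical sketch of spatial-domain decomposition for a 3D sample:
# each MPI rank owns a slab of the cube and exchanges halo planes with
# its neighbors so a local stencil/convolution can be applied without
# gathering the full volume on any one rank.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

# Illustrative cube size, split along the z-axis (assumed divisible).
full_z, ny, nx = 128, 128, 128
local_z = full_z // nranks
halo = 1                                  # radius of the stencil below
slab = np.random.rand(local_z + 2 * halo, ny, nx).astype(np.float32)

# Exchange boundary planes with neighboring ranks (periodic wrap-around
# is assumed here purely to keep the example short).
up, down = (rank - 1) % nranks, (rank + 1) % nranks
send_lo = np.ascontiguousarray(slab[halo:2 * halo])      # first interior plane
send_hi = np.ascontiguousarray(slab[-2 * halo:-halo])    # last interior plane
recv_lo = np.empty_like(send_lo)
recv_hi = np.empty_like(send_hi)
comm.Sendrecv(send_lo, dest=up,   recvbuf=recv_hi, source=down)
comm.Sendrecv(send_hi, dest=down, recvbuf=recv_lo, source=up)
slab[:halo], slab[-halo:] = recv_lo, recv_hi

# A 3-point stencil along z stands in for a full 3D convolution; each
# rank computes only its own portion of the output, in parallel.
out = (slab[:-2 * halo] + slab[halo:-halo] + slab[2 * halo:]) / 3.0
```

In a real training framework the same idea applies per convolutional layer, with the halo width set by the kernel size and the exchanges overlapped with computation; GPU-centric communication keeps those halo transfers on the device rather than staging them through host memory.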
Brian Van Essen is the informatics group leader and a computer scientist at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory (LLNL). He is pursuing research in large-scale deep learning for scientific domains and training deep neural networks using high-performance computing systems. He is the project leader for the Livermore Big Artificial Neural Network (LBANN) open-source deep learning toolkit, and the LLNL lead for the ECP ExaLearn and CANDLE projects. Additionally, he co-leads an effort to map scientific machine learning applications to neural network accelerator co-processors as well as neuromorphic architectures. He joined LLNL in 2010 after earning his Ph.D. and M.S. in computer science and engineering at the University of Washington. He also has an M.S. and B.S. in electrical and computer engineering from Carnegie Mellon University.