PaperSwipe

Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

Published 8 years agoVersion 1arXiv:1710.04162

Authors

Adam Stooke, Pieter Abbeel

Categories

cs.DCcs.AIcs.LG

Abstract

We present Synkhronos, an extension to Theano for multi-GPU computations leveraging data parallelism. Our framework provides automated execution and synchronization across devices, allowing users to continue to write serial programs without risk of race conditions. The NVIDIA Collective Communication Library is used for high-bandwidth inter-GPU communication. Further enhancements to the Theano function interface include input slicing (with aggregation) and input indexing, which perform common data-parallel computation patterns efficiently. One example use case is synchronous SGD, which has recently been shown to scale well for a growing set of deep learning problems. When training ResNet-50, we achieve a near-linear speedup of 7.5x on an NVIDIA DGX-1 using 8 GPUs, relative to Theano-only code running a single GPU in isolation. Yet Synkhronos remains general to any data-parallel computation programmable in Theano. By implementing parallelism at the level of individual Theano functions, our framework uniquely addresses a niche between manual multi-device programming and prescribed multi-GPU training routines.

Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

8 years ago
v1
2 authors

Categories

cs.DCcs.AIcs.LG

Abstract

We present Synkhronos, an extension to Theano for multi-GPU computations leveraging data parallelism. Our framework provides automated execution and synchronization across devices, allowing users to continue to write serial programs without risk of race conditions. The NVIDIA Collective Communication Library is used for high-bandwidth inter-GPU communication. Further enhancements to the Theano function interface include input slicing (with aggregation) and input indexing, which perform common data-parallel computation patterns efficiently. One example use case is synchronous SGD, which has recently been shown to scale well for a growing set of deep learning problems. When training ResNet-50, we achieve a near-linear speedup of 7.5x on an NVIDIA DGX-1 using 8 GPUs, relative to Theano-only code running a single GPU in isolation. Yet Synkhronos remains general to any data-parallel computation programmable in Theano. By implementing parallelism at the level of individual Theano functions, our framework uniquely addresses a niche between manual multi-device programming and prescribed multi-GPU training routines.

Authors

Adam Stooke, Pieter Abbeel

arXiv ID: 1710.04162
Published Oct 11, 2017

Click to preview the PDF directly in your browser