Distributed computing

DL models have to be trained on a large amount of data to improve their performance. However, training a deep network with millions of parameters may take days, or even weeks. In Large Scale Distributed Deep Networks, Dean et al. proposed two paradigms, namely model parallelism and data parallelism, which allow us to train and serve a network model on multiple physical machines. In the following section, we introduce these paradigms with a focus on distributed TensorFlow capabilities.

Model parallelism

Model parallelism gives every processor the same data but applies a different model to it. If the network model is too big to fit into one machine's memory, different parts of the model can be assigned to different machines. ...

Get Deep Learning with TensorFlow - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.