Chapter 8. Queues, Threads, and Reading Data

In this chapter we introduce the use of queues and threads in TensorFlow, with the main motivation of streamlining the process of reading input data. We show how to write and read TFRecords, the efficient TensorFlow file format. We then demonstrate queues, threads, and related functionalities, and connect all the dots in a full working example of a multithreaded input pipeline for image data that includes pre-processing, batching, and training.

The Input Pipeline

When dealing with small datasets that can be stored in memory, such as MNIST images, it is reasonable to simply load all data into memory, then use feeding to push data into a TensorFlow graph. For larger datasets, however, this can become unwieldy. A natural paradigm for handling such cases is to keep the data on disk and load chunks of it as needed (such as mini-batches for training), such that the only limit is the size of your hard drive.
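To make this concrete, here is a minimal sketch of feeding an in-memory dataset into a graph with feed_dict, assuming the TensorFlow 1.x graph-and-session API; the random data and the toy linear model are purely illustrative stand-ins:

```python
import numpy as np
import tensorflow as tf  # assumes the TensorFlow 1.x graph/session API

# Illustrative in-memory dataset: 1,000 "images" of 784 features each.
data = np.random.rand(1000, 784).astype(np.float32)
labels = np.random.randint(0, 10, size=1000)

x = tf.placeholder(tf.float32, shape=[None, 784])
y = tf.placeholder(tf.int64, shape=[None])

# A toy linear model, just to have something to feed data into.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

batch_size = 100
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(50):
        # Pick a random mini-batch from the in-memory arrays and feed it in.
        idx = np.random.choice(len(data), batch_size)
        sess.run(train_op, feed_dict={x: data[idx], y: labels[idx]})
```

This works well as long as the whole dataset fits comfortably in memory; once it does not, the feeding step itself becomes the bottleneck we want to avoid.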

In addition, a typical data pipeline in practice often includes steps such as reading input files in different formats, changing the shape or structure of the input, normalizing or applying other forms of pre-processing, and shuffling the input, all before training has even started.
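As a rough illustration of what such steps look like in code, here is a minimal sketch of reading, reshaping, and normalizing a single image, assuming TensorFlow 1.x ops; the file name, target size, and normalization scheme are placeholders rather than anything prescribed by the book:

```python
import tensorflow as tf  # assumes the TensorFlow 1.x API

# Illustrative pre-processing of one raw JPEG record; the file name,
# image size, and scaling are placeholder choices.
raw_bytes = tf.read_file("example_image.jpg")
image = tf.image.decode_jpeg(raw_bytes, channels=3)   # read an input in a given format
image = tf.image.resize_images(image, [28, 28])       # change shape/structure
image = tf.cast(image, tf.float32) / 255.0            # normalize to [0, 1]

# Shuffling is typically applied when mini-batches are formed, e.g. with
# tf.train.shuffle_batch, which we return to later in this chapter.
```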

Much of this process can be trivially decoupled and broken into modular components. Pre-processing, for example, does not involve training, so inputs could naively be preprocessed all at once and then fed to training. Since our training works ...
