Because we're using a built-in dataset, Keras takes care of a great deal of the mundane work we'd otherwise need to do around tokenizing, stemming, removing stop words, and converting our word tokens into numeric tokens. keras.datasets.imdb will give us a list of lists, each containing a variable-length sequence of integers representing the words in a review. We will define our data using the following code:
from keras.datasets import imdb


def load_data(vocab_size):
    # Bundle the train/test splits and the vocabulary size into one dict
    data = dict()
    data["vocab_size"] = vocab_size
    (data["X_train"], data["y_train"]), (data["X_test"], data["y_test"]) = \
        imdb.load_data(num_words=vocab_size)
    return data
We can load our data by calling load_data and choosing a maximum size for our vocabulary. For this example, I'll use 20,000 words as the maximum vocabulary size.
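As a quick sanity check, a sketch like the following (assuming the load_data function defined above) loads the data and inspects the shape of what comes back; the printed lengths are illustrative of the IMDB dataset's 25,000-review training split and binary labels:

data = load_data(20000)

# Reviews are variable-length lists of integer word indices,
# so each review can have a different length.
print(len(data["X_train"]))       # number of training reviews (25,000)
print(len(data["X_train"][0]))    # length of the first review, in tokens
print(data["X_train"][0][:10])    # first ten integer tokens of that review
print(data["y_train"][0])         # label: 0 for negative, 1 for positive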