The Bag-of-Words model works in two phases. In the training phase, one collects local image descriptors for training images (img0 and img1, in our case) and clusters them into vocabulary. In the second phase, local descriptors found in the input image are compared with all vocabulary words, alongside a list of how often each word appeared (for example, was selected as the closest one) within the image, for example, the frequencies vector, which forms the global image descriptor.
The following is the expected output: