An early approach used a conditional random field (CRF) to construct a sentence from the objects and attributes detected in the image. The steps involved in this process are shown as follows:
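To make the idea concrete, here is a minimal toy sketch in Python. It is not the original pipeline. The variable names, candidate labels, scores, and sentence template are all hypothetical, chosen only to illustrate how a CRF-style model might pick a jointly consistent set of labels before a sentence is filled in. Inference is done by brute force, which is only feasible for a tiny label space like this one.

```python
from itertools import product

# Hypothetical unary potentials: how strongly each (assumed) detector
# supports each candidate label. Higher is better.
UNARY = {
    "obj1": {"dog": 2.0, "cat": 1.5},
    "attr1": {"brown": 1.0, "furry": 0.8},
    "obj2": {"sofa": 1.8, "grass": 1.7},
    "prep": {"on": 1.0, "under": 0.2},
}

# Hypothetical pairwise potentials: how well two labels go together.
# Pairs not listed contribute 0.
PAIRWISE = {
    ("obj1", "attr1"): {("cat", "furry"): 1.0, ("dog", "brown"): 0.5},
    ("obj1", "obj2"): {("dog", "grass"): 1.2, ("cat", "sofa"): 0.6},
}

VARS = list(UNARY)


def score(labels):
    """Sum of unary and pairwise potentials for one full labelling."""
    total = sum(UNARY[v][labels[v]] for v in VARS)
    for (a, b), table in PAIRWISE.items():
        total += table.get((labels[a], labels[b]), 0.0)
    return total


def map_labels():
    """Brute-force search for the highest-scoring joint labelling."""
    candidates = (
        dict(zip(VARS, combo))
        for combo in product(*(UNARY[v] for v in VARS))
    )
    return max(candidates, key=score)


def to_sentence(labels):
    """Fill a fixed template with the chosen labels."""
    return "There is a {attr1} {obj1} {prep} the {obj2}.".format(**labels)


labels = map_labels()
sentence = to_sentence(labels)
print(sentence)
```

Because the final step is a fixed template, every output has the same shape regardless of the image, which is one way such a system can end up producing rigid sentences.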
A CRF has limited ability to produce coherent sentences, and the quality of the generated sentences is poor, as shown in the following screenshot:
The sentences shown here are too structured despite getting ...