Xu et al. proposed a method for image captioning using an attention mechanism. For details, refer to the paper: https://arxiv.org/pdf/1502.03044.pdf.
Attention-based captioning has become popular recently because it generally produces more accurate captions than models without attention.
This method trains the attention model along the caption sequence: as each word is generated, the model learns which regions of the image to weight most heavily, thereby producing better results.
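To make the idea concrete, here is a minimal sketch of a single soft-attention step in plain Python. It is an illustration under stated assumptions, not the implementation from the paper: the function names, the toy inputs, and the dot-product scoring are all hypothetical stand-ins (the paper learns its scoring function with a small network).

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(region_features, hidden_state):
    """Return (weights, context) for one decoding step.

    region_features: list of feature vectors, one per image region.
    hidden_state: the decoder's previous hidden state (same length).
    Assumption: scoring is a plain dot product, a simplification of
    the learned scoring network described in the paper.
    """
    # Score each region by how well it matches the current hidden state.
    scores = [
        sum(f * h for f, h in zip(region, hidden_state))
        for region in region_features
    ]
    weights = softmax(scores)

    # The context vector is the attention-weighted average of the regions.
    dim = len(region_features[0])
    context = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical toy inputs: three image regions with 2-d features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]  # this state favours the first feature dimension
weights, context = soft_attention(regions, hidden)
```

In a full captioning model, a step like this would run once per generated word, and the resulting context vector would be passed to the LSTM together with the previous word.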
Here is a diagram of an LSTM with attention generating captions: