A deep learning model that generates image captions from ResNet50 image features and tokenized text sequences. It achieves a BLEU-4 score of 0.331 on the Flickr8k dataset. The dual-input architecture takes an image feature vector and a partial caption, and predicts the caption word by word.
Flickr8k - 8,000 images with 5 captions each (40,000 total captions)
Download the Flickr8k dataset from [Kaggle](https://www.kaggle.com/datasets/adityajn105/flickr8k) and upload it to Google Drive.
Expected folder structure:

Data Split:
- Train: 735 images
- Test: 137 images
- Dev: 113 images
- Total filtered images: 1,000
- Clean captions (lowercase, remove punctuation/digits)
- Add `startseq` and `endseq` tokens
- Filter words with frequency < 5
- Pad sequences to max length of 33 tokens
- Extract ResNet50 features → 2048-dim vectors per image
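
The text-side steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the helper names (`clean_caption`, `build_vocab`, `encode_and_pad`) are hypothetical, and reserving index 0 for padding, padding on the right, and dropping out-of-vocabulary words are all assumptions.

```python
import string
from collections import Counter

MAX_LEN = 33   # max caption length in tokens (from the list above)
MIN_FREQ = 5   # words seen fewer than 5 times are filtered out
PAD_ID = 0     # assumption: index 0 is reserved for padding

# Translation table that deletes punctuation and digits.
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def clean_caption(caption: str) -> str:
    """Lowercase, remove punctuation/digits, and add boundary tokens."""
    words = caption.lower().translate(_STRIP).split()
    return " ".join(["startseq", *words, "endseq"])


def build_vocab(captions, min_freq=MIN_FREQ):
    """Map each word with frequency >= min_freq to an integer id (1-based)."""
    counts = Counter(word for cap in captions for word in cap.split())
    kept = sorted(word for word, n in counts.items() if n >= min_freq)
    return {word: idx for idx, word in enumerate(kept, start=1)}


def encode_and_pad(caption, vocab, max_len=MAX_LEN):
    """Convert a cleaned caption to ids, drop unknown words, pad on the right."""
    ids = [vocab[w] for w in caption.split() if w in vocab][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))
```

For example, `clean_caption("A dog, running: 2 times!")` returns `"startseq a dog running times endseq"`. Frameworks differ on whether padding goes before or after the tokens, so the padding side should match whatever the training code expects.
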
- Image Encoder: ResNet50 (pretrained on ImageNet) → 2048-dim feature vector
- Text Processor: Tokenization + padding (max length: 33 tokens)
- Vocabulary Size: 3,266 words
- Total Model Parameters: 2,791,106 (10.65 MB)
- Model Type: Dual-input neural network (image + text branches)
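
The individual layers are not listed here, so the following is only a sanity check under stated assumptions. It shows one layer configuration whose parameter count works out to the reported total: a 256-unit hidden size throughout, an embedding over the 3,266-word vocabulary, a single-bias LSTM (the Keras convention), and 32-bit floats. The layer names and the hidden size are assumptions, not taken from the source.

```python
VOCAB_SIZE = 3266    # from the list above
FEATURE_DIM = 2048   # ResNet50 feature vector size, from the list above
HIDDEN = 256         # assumption: hidden size used in every layer


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def lstm_params(n_in: int, n_hidden: int) -> int:
    """Four gates, each with input weights, recurrent weights, and one bias."""
    return 4 * (n_hidden * (n_in + n_hidden) + n_hidden)


layers = {
    "text_embedding": VOCAB_SIZE * HIDDEN,            # text branch
    "text_lstm": lstm_params(HIDDEN, HIDDEN),         # text branch
    "image_dense": dense_params(FEATURE_DIM, HIDDEN),  # image branch
    "merge_dense": dense_params(HIDDEN, HIDDEN),      # after combining branches
    "output_dense": dense_params(HIDDEN, VOCAB_SIZE),  # next-word scores
}

total = sum(layers.values())
size_mb = total * 4 / 1024**2  # 4 bytes per float32 parameter

print(total, round(size_mb, 2))  # 2791106 10.65
```

Frameworks that keep two bias vectors per LSTM gate (PyTorch, for instance) would report a slightly higher count for the same layer sizes.
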
| Epoch | Loss |
|---|---|
| 1 | 5.1046 |
| 2 | 3.7326 |
| 3 | 3.2901 |
| 4 | 2.9483 |
| 5 | 2.6832 |
| 6 | 2.4816 |
| 7 | 2.2829 |
| 8 | 2.1052 |
| 9 | 1.9720 |
| 10 | 1.8263 |
| Metric | Score |
|---|---|
| BLEU-1 | 0.7293 |
| BLEU-2 | 0.5714 |
| BLEU-3 | 0.4451 |
| BLEU-4 | 0.3310 |
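
Scores like these can be computed with NLTK's `corpus_bleu`. The sketch below assumes cumulative BLEU with uniform n-gram weights and whitespace-tokenized captions with the `startseq`/`endseq` tokens already removed; the `bleu_scores` helper is hypothetical, and the exact settings behind the table above are not stated.

```python
from nltk.translate.bleu_score import corpus_bleu

# Uniform weights over 1..n-grams give cumulative BLEU-n.
WEIGHTS = {
    "BLEU-1": (1.0,),
    "BLEU-2": (0.5, 0.5),
    "BLEU-3": (1 / 3, 1 / 3, 1 / 3),
    "BLEU-4": (0.25, 0.25, 0.25, 0.25),
}


def bleu_scores(references, hypotheses):
    """Corpus-level BLEU-1..4.

    references: one entry per image, each a list of tokenized reference captions
    hypotheses: one tokenized generated caption per image
    """
    return {
        name: corpus_bleu(references, hypotheses, weights=w)
        for name, w in WEIGHTS.items()
    }


# One image with two reference captions and one generated caption.
refs = [[
    "a brown dog is running through water".split(),
    "a dog runs through the water".split(),
]]
hyps = ["a brown dog is running through water".split()]
print(bleu_scores(refs, hyps))
```

Each image on Flickr8k has five reference captions, so each entry in `references` would normally hold five token lists.
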
| # | Image ID | Generated Caption |
|---|---|---|
| 1 | 3544793763_b38546a5e8.jpg | "a boxer is smiling in front of a boxer" |
| 2 | 509123893_07b8ea82a9.jpg | "two girls enjoy corn on a purple purple and purple..." |
| 3 | 1082379191_ec1e53f996.jpg | "a man sits on a dock on a dock" |
| 4 | 3223224391_be50bf4f43.jpg | "a dog is running through the deep water" |
| 5 | 3216926094_bc975e84b9.jpg | "a brown dog runs through the grass with a toy in its mouth" |
- ✅ "a brown dog is running through water"
- ✅ "a boy is jumping over a trampoline"
- ✅ "a hockey player is guarding the goal"
- ✅ "a woman is walking on a pebble path"
- ✅ "a young boy is sitting on a bed with a map in his shoulder"
- ⚠️ "a group of dogs pulling a group of dogs pulling a group of dogs" (repetition issue)
```python
# Install required libraries
!pip install torch torchvision pillow numpy matplotlib tqdm nltk

# For NLTK (BLEU score)
import nltk
nltk.download('punkt')
```

- Word/phrase repetition in some captions (e.g., "a group of dogs", "on a dock on a dock")
- Greedy decoding without beam search
- No attention mechanism implemented
- Add attention mechanism
- Implement beam search to reduce repetition
- Upgrade to Transformer-based architecture
- Train on larger dataset (Flickr30k/MSCOCO)
- Add repetition penalty during inference
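
As a rough illustration of two of these ideas, the sketch below combines beam search with a simple count-based repetition penalty. It is not the project's decoder: `next_log_probs` is a hypothetical callback standing in for the trained model (it takes the tokens so far and returns a log-probability per candidate next word), and the penalty form is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (cumulative score, tokens so far)


def beam_search(
    next_log_probs: Callable[[List[str]], Dict[str, float]],
    beam_width: int = 3,
    max_len: int = 33,
    repetition_penalty: float = 0.0,
) -> List[str]:
    """Return the best-scoring caption as a list of words.

    beam_width=1 with repetition_penalty=0.0 reduces to greedy decoding.
    """
    beams: List[Hypothesis] = [(0.0, ["startseq"])]
    finished: List[Hypothesis] = []

    for _ in range(max_len - 1):
        candidates: List[Hypothesis] = []
        for score, tokens in beams:
            for word, logp in next_log_probs(tokens).items():
                # Subtract a fixed amount for each earlier use of the word.
                penalty = repetition_penalty * tokens.count(word)
                candidates.append((score + logp - penalty, tokens + [word]))
        if not candidates:
            break

        # Keep the top candidates; completed captions leave the beam.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for hyp in candidates[:beam_width]:
            (finished if hyp[1][-1] == "endseq" else beams).append(hyp)
        if not beams:
            break

    best = max(finished or beams, key=lambda c: c[0])
    return [t for t in best[1] if t not in ("startseq", "endseq")]
```

Summed log-probabilities favor shorter captions, so length normalization is a common addition when comparing finished hypotheses. The penalty weight would need tuning, since function words such as "a" legitimately repeat within a caption.
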