
Tayyabah-Rehman/Image-Captioning-using-ResNet50-Flickr8k


Image-Captioning-using-ResNet50-Flickr8k

A deep learning model that generates image captions from ResNet50 image features combined with a text-sequence branch. It achieves a BLEU-4 score of 0.331 on the Flickr8k dataset. The dual-input architecture takes an image and the caption generated so far, and predicts the caption word by word.

📁 Dataset

Flickr8k - 8,000 images with 5 captions each (40,000 total captions)

📋 Dataset Setup

Download the Flickr8k dataset from [Kaggle](https://www.kaggle.com/datasets/adityajn105/flickr8k) and upload it to Google Drive.

Expected folder structure:

Data Split:

  • Train: 735 images
  • Test: 137 images
  • Dev: 113 images
  • Total filtered images: 1,000

🛠️ Preprocessing Steps

  1. Clean captions (lowercase, remove punctuation/digits)
  2. Add startseq and endseq tokens
  3. Filter out words that occur fewer than 5 times
  4. Pad sequences to max length of 33 tokens
  5. Extract ResNet50 features → 2048-dim vectors per image
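Steps 1 to 3 above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's actual code: the function names (`clean_caption`, `wrap_caption`, `build_vocab`) are hypothetical, and the regular expression assumes English captions, so it also drops any non-ASCII letters.

```python
import re
from collections import Counter


def clean_caption(text):
    """Lowercase, drop punctuation and digits, and collapse whitespace."""
    text = text.lower()
    # Assumption: keep only a-z and whitespace (removes punctuation and digits).
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(text.split())


def wrap_caption(text):
    """Clean a caption and add the startseq / endseq boundary tokens."""
    return f"startseq {clean_caption(text)} endseq"


def build_vocab(captions, min_count=5):
    """Keep only words that occur at least min_count times across all captions."""
    counts = Counter(word for caption in captions for word in caption.split())
    return {word for word, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    raw = ["A dog runs, fast!", "2 dogs run through the water."]
    wrapped = [wrap_caption(c) for c in raw]
    print(wrapped)
    # min_count=1 only because this toy corpus is tiny; the README uses 5.
    print(sorted(build_vocab(wrapped, min_count=1)))
```

With `min_count=5`, rare words are left out of the vocabulary; how the original code treats them at encoding time (dropping them or mapping them to an unknown token) is not stated in this README.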

🏗️ Model Architecture

  • Image Encoder: ResNet50 (pretrained on ImageNet) → 2048-dim feature vector
  • Text Processor: Tokenization + padding (max length: 33 tokens)
  • Vocabulary Size: 3,266 words
  • Total Model Parameters: 2,791,106 (10.65 MB)
  • Model Type: Dual-input neural network (image + text branches)
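A dual-input model that predicts captions word by word is usually trained on (image feature, partial caption, next word) triples. The sketch below shows one way such triples could be built. It is an assumption about the data pipeline, which this README does not show: `make_pairs`, `pad_left`, the use of id 0 for padding, and left-side padding are all illustrative choices.

```python
def encode(caption, word_to_id):
    """Map words to integer ids, skipping words outside the vocabulary."""
    return [word_to_id[w] for w in caption.split() if w in word_to_id]


def pad_left(seq, max_len, pad_id=0):
    """Pad (or truncate) a sequence on the left to exactly max_len ids."""
    seq = seq[-max_len:]
    return [pad_id] * (max_len - len(seq)) + seq


def make_pairs(image_feature, caption, word_to_id, max_len=33):
    """Turn one caption into (image feature, padded prefix, next word id) triples."""
    ids = encode(caption, word_to_id)
    pairs = []
    for i in range(1, len(ids)):
        prefix = pad_left(ids[:i], max_len)
        pairs.append((image_feature, prefix, ids[i]))
    return pairs


if __name__ == "__main__":
    word_to_id = {"startseq": 1, "a": 2, "dog": 3, "endseq": 4}
    feature = [0.0] * 2048  # stand-in for a ResNet50 feature vector
    for _, prefix, target in make_pairs(feature, "startseq a dog endseq", word_to_id, max_len=5):
        print(prefix, "->", target)
```

Each caption of n tokens yields n - 1 training examples, and every example pairs the same 2048-dim image vector with a different caption prefix.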

📈 Model Training Log (10 Epochs)

| Epoch | Loss |
|-------|--------|
| 1 | 5.1046 |
| 2 | 3.7326 |
| 3 | 3.2901 |
| 4 | 2.9483 |
| 5 | 2.6832 |
| 6 | 2.4816 |
| 7 | 2.2829 |
| 8 | 2.1052 |
| 9 | 1.9720 |
| 10 | 1.8263 |

📊 BLEU Scores (on Test Set)

| Metric | Score |
|--------|--------|
| BLEU-1 | 0.7293 |
| BLEU-2 | 0.5714 |
| BLEU-3 | 0.4451 |
| BLEU-4 | 0.3310 |
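For readers unfamiliar with the metric, here is a simplified, standard-library-only sketch of corpus-level BLEU. It assumes BLEU-n means cumulative n-gram precision with uniform weights and no smoothing. The scores above were presumably computed with NLTK (it is listed under Prerequisites, and it provides `nltk.translate.bleu_score.corpus_bleu`), so this sketch may not match them exactly. The name `simple_corpus_bleu` is hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(references, hypotheses, max_n=4):
    """references: per image, a list of reference token lists.
    hypotheses: per image, one generated token list."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for refs, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis (shorter on ties).
        ref_len += min((len(r) for r in refs), key=lambda n: (abs(n - len(hyp)), n))
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)  # elementwise max over references
            # Clipped counts: an n-gram is credited at most as often as it
            # appears in any single reference.
            matches[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


if __name__ == "__main__":
    refs = [[["a", "dog", "runs", "through", "the", "grass"]]]
    hyps = [["a", "dog", "runs", "through", "the", "water"]]
    for n in range(1, 5):
        print(f"BLEU-{n}: {simple_corpus_bleu(refs, hyps, max_n=n):.4f}")
```

Because higher-order n-grams are harder to match, the score falls from BLEU-1 to BLEU-4, which is the pattern seen in the table above.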

🖼️ Sample Test Image Predictions

| # | Image ID | Generated Caption |
|---|----------|-------------------|
| 1 | 3544793763_b38546a5e8.jpg | "a boxer is smiling in front of a boxer" |
| 2 | 509123893_07b8ea82a9.jpg | "two girls enjoy corn on a purple purple and purple..." |
| 3 | 1082379191_ec1e53f996.jpg | "a man sits on a dock on a dock" |
| 4 | 3223224391_be50bf4f43.jpg | "a dog is running through the deep water" |
| 5 | 3216926094_bc975e84b9.jpg | "a brown dog runs through the grass with a toy in its mouth" |

🖼️ Sample Generated Captions

  • ✅ "a brown dog is running through water"
  • ✅ "a boy is jumping over a trampoline"
  • ✅ "a hockey player is guarding the goal"
  • ✅ "a woman is walking on a pebble path"
  • ✅ "a young boy is sitting on a bed with a map in his shoulder"
  • ⚠️ "a group of dogs pulling a group of dogs pulling a group of dogs" (repetition issue)

🚀 Getting Started

Prerequisites

# Install required libraries
!pip install torch torchvision pillow numpy matplotlib tqdm nltk

# For NLTK (BLEU score)
import nltk
nltk.download('punkt')

⚠️ Known Limitations

  • Word/phrase repetition in some captions (e.g., "a group of dogs pulling a group of dogs", "on a dock on a dock")
  • Greedy decoding without beam search
  • No attention mechanism implemented
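To illustrate why greedy decoding can loop, here is a minimal sketch with a toy stand-in for the trained model. Everything in it is hypothetical: `greedy_decode` is not the repository's function, and `toy_next_word_scores` is a hand-written table that looks only at the previous word, chosen to reproduce the kind of repetition listed above.

```python
def greedy_decode(next_word_scores, image_feature, max_len=33):
    """Pick the highest-scoring next word at every step until endseq or max_len."""
    words = ["startseq"]
    while len(words) < max_len:
        scores = next_word_scores(image_feature, words)
        best = max(scores, key=scores.get)
        if best == "endseq":
            break
        words.append(best)
    return " ".join(words[1:])


# Toy stand-in for a trained captioner (assumption, for illustration only).
TOY_TABLE = {
    "startseq": {"a": 1.0},
    "a": {"dock": 1.0},
    "dock": {"on": 0.6, "endseq": 0.4},
    "on": {"a": 1.0},
}


def toy_next_word_scores(image_feature, words):
    """Return next-word scores based only on the last generated word."""
    return TOY_TABLE[words[-1]]


if __name__ == "__main__":
    print(greedy_decode(toy_next_word_scores, None, max_len=8))
    # prints: a dock on a dock on a
```

Since "on" always narrowly beats "endseq" after "dock", the locally best choice keeps the caption cycling until the length limit is reached.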

🔮 Future Improvements

  • Add attention mechanism
  • Implement beam search to reduce repetition
  • Upgrade to Transformer-based architecture
  • Train on larger dataset (Flickr30k/MSCOCO)
  • Add repetition penalty during inference
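As a rough idea of what the beam search item could look like, the sketch below keeps the `beam_width` most probable partial captions at each step instead of only one. It is an illustrative sketch under stated assumptions, not a planned implementation: `beam_search` and the toy probability table are hypothetical, and there is no length normalization, so short captions are favored.

```python
import math


def beam_search(next_word_probs, image_feature, beam_width=3, max_len=33):
    """Return the finished caption with the highest total log-probability."""
    beams = [(0.0, ["startseq"])]  # (log-probability, words so far)
    finished = []
    for _ in range(max_len - 1):
        candidates = []
        for logp, words in beams:
            for word, p in next_word_probs(image_feature, words).items():
                if p <= 0:
                    continue
                cand = (logp + math.log(p), words + [word])
                # Captions that emit endseq are set aside as complete.
                (finished if word == "endseq" else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    best = max(finished or beams, key=lambda c: c[0])
    return " ".join(w for w in best[1] if w not in ("startseq", "endseq"))


# Toy stand-in for a trained captioner (assumption, for illustration only).
TOY_TABLE = {
    "startseq": {"a": 1.0},
    "a": {"dock": 1.0},
    "dock": {"on": 0.6, "endseq": 0.4},
    "on": {"a": 1.0},
}


def toy_next_word_probs(image_feature, words):
    """Return next-word probabilities based only on the last generated word."""
    return TOY_TABLE[words[-1]]


if __name__ == "__main__":
    print(beam_search(toy_next_word_probs, None, beam_width=2, max_len=8))
    # prints: a dock
```

On this toy table, always taking the single most likely next word would cycle through "a dock on a dock ...", whereas comparing whole-sequence probabilities selects the caption that ends after the first "dock". Whether beam search reduces repetition for the actual model would need to be measured.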
