Image-Captioning-using-ResNet50-Flickr8k

A deep learning model that generates image descriptions by combining ResNet50 image features with a text-sequence branch. It achieves a BLEU-4 score of 0.331 on the Flickr8k test set. The dual-input architecture takes an image and the caption generated so far, and predicts the caption word by word.
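Word-by-word prediction is commonly trained by expanding each caption into (prefix, next word) pairs, so that the model sees the image plus the words so far and learns to predict the following word. The sketch below illustrates that expansion only. The function name `prefix_pairs` is hypothetical, and this README does not confirm that the notebook builds its training data this way.

```python
def prefix_pairs(tokens):
    """Expand one tokenized caption into (prefix, next_word) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


caption = ["startseq", "a", "dog", "endseq"]
for prefix, target in prefix_pairs(caption):
    print(prefix, "->", target)
```

Each pair would be fed to the model together with the image's feature vector, so a caption of N tokens yields N - 1 training samples.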

📁 Dataset

Flickr8k - 8,000 images with 5 captions each (40,000 total captions)

📋 Dataset Setup

Download the Flickr8k dataset from [Kaggle](https://www.kaggle.com/datasets/adityajn105/flickr8k) and upload it to Google Drive.

Expected folder structure:

Data Split:

Train: 735 images
Test: 137 images
Dev: 113 images
Total filtered images: 1,000

🛠️ Preprocessing Steps

Clean captions (lowercase, remove punctuation/digits)
Add startseq and endseq tokens
Filter words with frequency < 5
Pad sequences to max length of 33 tokens
Extract ResNet50 features → 2048-dim vectors per image
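The text-side steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the function names are hypothetical, "frequency < 5" is read as "drop words seen fewer than 5 times", and padding on the right with index 0 is an assumption (the README does not say which side is padded).

```python
import string
from collections import Counter

MAX_LEN = 33   # maximum caption length, from the README
MIN_FREQ = 5   # frequency threshold, from the README


def clean_caption(text):
    """Lowercase, strip punctuation and digits, add start/end tokens."""
    table = str.maketrans("", "", string.punctuation + string.digits)
    words = text.lower().translate(table).split()
    return ["startseq"] + words + ["endseq"]


def build_vocab(captions, min_freq=MIN_FREQ):
    """Map every word seen at least min_freq times to an index >= 1."""
    counts = Counter(word for caption in captions for word in caption)
    kept = sorted(word for word, count in counts.items() if count >= min_freq)
    return {word: index for index, word in enumerate(kept, start=1)}


def encode(caption, vocab, max_len=MAX_LEN):
    """Drop out-of-vocabulary words, then right-pad with 0 (assumed)."""
    ids = [vocab[word] for word in caption if word in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))


captions = [clean_caption("A dog runs!")] * 5 + [clean_caption("A cat.")]
vocab = build_vocab(captions)
print(encode(clean_caption("A cat runs"), vocab)[:6])
```

Here "cat" appears only once, so it falls below the threshold and is dropped before padding.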

🏗️ Model Architecture

Image Encoder: ResNet50 (pretrained on ImageNet) → 2048-dim feature vector
Text Processor: Tokenization + padding (max length: 33 tokens)
Vocabulary Size: 3,266 words
Total Model Parameters: 2,791,106 (10.65 MB)
Model Type: Dual-input neural network (image + text branches)
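The README does not list the layer sizes behind the parameter total. As a sanity check, the arithmetic below shows that one common merge-style layout reproduces the reported figures. The layout itself is an assumption: a 256-unit dense layer on the image features, a 256-dim embedding, a 256-unit LSTM counted with one bias vector per gate, a 256-unit dense layer after merging, and an output layer over the vocabulary. The hidden size of 256 and the float32 weight size are likewise assumed.

```python
VOCAB_SIZE = 3266    # from the README
FEATURE_DIM = 2048   # from the README
HIDDEN = 256         # assumption: not stated in the README


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights and one bias."""
    return 4 * (units * (n_in + units) + units)


layers = {
    "image_dense": dense_params(FEATURE_DIM, HIDDEN),
    "embedding": VOCAB_SIZE * HIDDEN,
    "lstm": lstm_params(HIDDEN, HIDDEN),
    "merge_dense": dense_params(HIDDEN, HIDDEN),
    "output_dense": dense_params(HIDDEN, VOCAB_SIZE),
}
total = sum(layers.values())
size_mb = total * 4 / 1024 ** 2   # 4 bytes per float32 weight

print(total, round(size_mb, 2))   # 2791106 10.65
```

The match with 2,791,106 parameters and 10.65 MB is consistent with this layout but does not prove it is the one used in the notebook.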

📈 Model Training Log (10 Epochs)

Epoch	Loss
1	5.1046
2	3.7326
3	3.2901
4	2.9483
5	2.6832
6	2.4816
7	2.2829
8	2.1052
9	1.9720
10	1.8263

📊 BLEU Scores (on Test Set)

Metric	Score
BLEU-1	0.7293
BLEU-2	0.5714
BLEU-3	0.4451
BLEU-4	0.3310
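For readers unfamiliar with the metric, BLEU-n is the geometric mean of clipped n-gram precisions up to order n, multiplied by a brevity penalty for short candidates. The README does not say exactly how its scores were computed (the prerequisites suggest NLTK), so the following is a dependency-free sketch of sentence-level BLEU with uniform weights and no smoothing. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(references, candidate, n):
    """Return (clipped matching n-grams, total candidate n-grams)."""
    cand = ngram_counts(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())


def bleu(references, candidate, max_n=4):
    """Cumulative BLEU-max_n for one candidate against its references."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        clipped, total = clipped_matches(references, candidate, n)
        if clipped == 0:          # no smoothing: any zero precision gives 0
            return 0.0
        log_precision += math.log(clipped / total) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (shorter on ties).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision)


reference = "a dog is running through the water".split()
candidate = "a dog runs through the water".split()
for n in range(1, 5):
    print(f"BLEU-{n}: {bleu([reference], candidate, max_n=n):.4f}")
```

Because higher orders require longer exact matches, scores fall from BLEU-1 to BLEU-4, which is the same pattern seen in the table above.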

🖼️ Sample Test Image Predictions

#	Image ID	Generated Caption
1	3544793763_b38546a5e8.jpg	"a boxer is smiling in front of a boxer"
2	509123893_07b8ea82a9.jpg	"two girls enjoy corn on a purple purple and purple..."
3	1082379191_ec1e53f996.jpg	"a man sits on a dock on a dock"
4	3223224391_be50bf4f43.jpg	"a dog is running through the deep water"
5	3216926094_bc975e84b9.jpg	"a brown dog runs through the grass with a toy in its mouth"

🖼️ Sample Generated Captions

✅ "a brown dog is running through water"
✅ "a boy is jumping over a trampoline"
✅ "a hockey player is guarding the goal"
✅ "a woman is walking on a pebble path"
✅ "a young boy is sitting on a bed with a map in his shoulder"
⚠️ "a group of dogs pulling a group of dogs pulling a group of dogs" (repetition issue)

🚀 Getting Started

Prerequisites

# Install required libraries
!pip install torch torchvision pillow numpy matplotlib tqdm nltk

# For NLTK (BLEU score)
import nltk
nltk.download('punkt')

⚠️ Known Limitations

Word/phrase repetition in some captions (e.g., "a group of dogs pulling a group of dogs", "on a dock on a dock")
Greedy decoding without beam search
No attention mechanism implemented
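The repetition issue can be measured rather than spotted by eye. The sketch below flags captions that contain the same n-gram more than once, using sample captions from this README as input. The helper `repeated_ngrams` is hypothetical and is not part of the notebook.

```python
def repeated_ngrams(caption, n=3):
    """Return the distinct n-grams that occur more than once in a caption."""
    tokens = caption.split()
    seen = set()
    repeated = []
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen and gram not in repeated:
            repeated.append(gram)
        seen.add(gram)
    return repeated


samples = [
    "a man sits on a dock on a dock",
    "a group of dogs pulling a group of dogs pulling a group of dogs",
    "a dog is running through the deep water",
]
for caption in samples:
    status = "repeats" if repeated_ngrams(caption) else "ok"
    print(f"{status:8s}{caption}")
```

Running a check like this over all test-set predictions would give a repetition rate to track alongside the BLEU scores.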

🔮 Future Improvements

Add attention mechanism
Implement beam search to reduce repetition
Upgrade to Transformer-based architecture
Train on larger dataset (Flickr30k/MSCOCO)
Add repetition penalty during inference
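As an illustration of the beam search item, here is a minimal sketch that keeps the `beam_width` best partial captions at each step instead of only the single best one. It is not tied to the notebook's model: `next_log_probs` stands in for any hypothetical function that maps a partial caption to a dictionary of next-word log-probabilities, and `toy_model` is a made-up lookup table used only to show the behaviour. Length normalization is left out for brevity.

```python
import math


def beam_search(next_log_probs, start, end, beam_width=3, max_len=33):
    """Return the highest-scoring token sequence found by beam search."""
    beams = [(0.0, [start])]   # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_len - 1):
        candidates = []
        for score, seq in beams:
            for token, log_prob in next_log_probs(seq).items():
                candidates.append((score + log_prob, seq + [token]))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            # Completed captions leave the beam; the rest keep growing.
            (finished if seq[-1] == end else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)     # captions cut off at max_len
    return max(finished, key=lambda item: item[0])[1]


# Toy next-word table (made up): anything not listed ends the caption.
TABLE = {
    ("startseq",): {"a": math.log(0.6), "two": math.log(0.4)},
    ("startseq", "a"): {"dog": math.log(0.55), "man": math.log(0.45)},
    ("startseq", "two"): {"dogs": math.log(0.9), "men": math.log(0.1)},
}


def toy_model(seq):
    return TABLE.get(tuple(seq), {"endseq": 0.0})


print(beam_search(toy_model, "startseq", "endseq", beam_width=1))
print(beam_search(toy_model, "startseq", "endseq", beam_width=2))
```

With `beam_width=1` the search reduces to greedy decoding and commits to "a" because it is the most likely first word. With `beam_width=2` it also keeps "two" alive and finds the caption with the higher overall probability (0.36 versus 0.33 in this toy table).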
