Customer Segmentation on Online Retail (K-Means, Agglomerative, DBSCAN/HDBSCAN) + Cluster-Based Recommendations

Abstract

This project presents a comprehensive machine-learning framework for enhancing online retail analytics through data-driven customer segmentation and targeted product recommendation. Leveraging the publicly available 2010–2011 transactional dataset from a UK-based retailer, the study implements rigorous data cleaning, feature engineering, and unsupervised learning techniques.

After imputing anomalies and transforming raw transactions into a customer-centric matrix of normalised RFM metrics and behavioural attributes, a multi-model clustering strategy is applied. K-Means supplies an interpretable baseline; Hierarchical Agglomerative Clustering offers a nuanced multilevel perspective on customer behavior; and DBSCAN/HDBSCAN detect low-density outliers to preserve segment purity.

Cluster evaluation is guided holistically by Silhouette Score, Calinski-Harabasz Index, and Davies-Bouldin Index. Optimal model selection, based on the peak silhouette value of 0.53, yields three behaviourally distinct clusters. These segments are profiled to reveal purchasing propensities and lifetime-value potential, and are subsequently integrated into a cluster-based recommendation engine that promotes top-selling yet unpurchased products within each cluster.

The proposed framework demonstrates a replicable, multi-model insight generation framework for online retailers to refine marketing campaigns, elevate personalisation, and stimulate revenue growth through RFM analysis as well as data-driven customer insights.

Keywords

RFM Analysis • Unsupervised Learning • K-Means Clustering • Hierarchical Agglomerative Clustering • DBSCAN • HDBSCAN • Silhouette Score • Customer Segmentation • Recommendation System • Cluster-Based Filtering • Online Retail Analytics

TL;DR.

We cluster ~4k retail customers using a PCA-compressed feature set and compare three methods.
Final take: K-Means (k=3) gives the clearest three actionable segments; Agglomerative corroborates and zooms in on the K-Means clusters structure; HDBSCAN is best kept for anomaly detection.

Beyond clustering, we built a cluster-driven recommendation logic:

Top-3 not-yet-bought product suggestions tailored for each new customer based on its assigned cluster's profile.

Notebooks

K-Means — 01_kmeans.ipynb
Agglomerative — 02_agglomerative.ipynb
DBSCAN/HDBSCAN — 03_dbscan_hdbscan.ipynb

Data & Features

4,067 customers × 16 engineered features -> z-scored + outlier freed, then PCA to 6 components, ~81% variance retained. The PCA matrix feeds all three clustering methods.

Results (on the 4,067×6 PCA matrix)

Model	#Clusters (noise)	Silhouette	Calinski–Harabasz	Davies–Bouldin	Take
K-Means (k=3)	3 (0%)	0.236	1257.17	1.37	Tight, distinct three-segment split.
Agglomerative (complete, k=3)	3 (0%)	0.367	507.64	1.27	Highest cohesion; supports k=3 decision.
HDBSCAN (min_cluster_size=50)	2 (+ 7.8% noise)	0.260	852.42	1.15	Great for outlier flagging; under-segments for marketing.

PCA removes redundancy, speeds clustering, keeps structure; 6 PCs captured ~81% while remaining interpretable via loadings.

Reproduce

pip install -r requirements.txt
jupyter lab
# open notebooks/ and run top to bottom

Final Deliverables:

Clean, runnable notebooks for three clustering approaches.
Cluster-based recommendation engine logic
Visualizations and metrics exported to results/.
Report.pdf (academic-style writeup with methods + findings; also the final academic submission towards the fulfillment of our diploma).

Authors

Khushi Singh (ML Engineer, yours truly)
Shalini Mitra (Business Intelligence Analyst, Project Architect)

This project was completed as part of our PG Diploma in Big Data Analytics at CDAC-Noida (India).

💡 Why this project?
We chose customer segmentation because it merges our backgrounds: Shalini brought business intuition and logic, while I had experience tailoring conservation giving programs for diverse client profiles.
Understanding clusters + customer behaviour and nuances wasn’t just academic, it’s exactly the kind of insight that drives smarter marketing, personalized outreach, meaningful client relationships (with the bonus of business-relevant talking points in an interview!)

AI DISCLOSURE: Some documentation, formatting, and report structuring support was assisted by generative AI tools (Perplexity.ai). All analysis, coding, and interpretation are our own.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Customer Segmentation on Online Retail (K-Means, Agglomerative, DBSCAN/HDBSCAN) + Cluster-Based Recommendations

Abstract

Keywords

TL;DR.

Notebooks

Data & Features

Results (on the 4,067×6 PCA matrix)

Reproduce

Final Deliverables:

Authors

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Customer Segmentation on Online Retail (K-Means, Agglomerative, DBSCAN/HDBSCAN) + Cluster-Based Recommendations

Abstract

Keywords

TL;DR.

Notebooks

Data & Features

Results (on the 4,067×6 PCA matrix)

Reproduce

Final Deliverables:

Authors

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages