Skip to content

plasticweather/customer-segmentation-retail

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 

Repository files navigation

Customer Segmentation on Online Retail (K-Means, Agglomerative, DBSCAN/HDBSCAN) + Cluster-Based Recommendations

Abstract

This project presents a comprehensive machine-learning framework for enhancing online retail analytics through data-driven customer segmentation and targeted product recommendation. Leveraging the publicly available 2010–2011 transactional dataset from a UK-based retailer, the study implements rigorous data cleaning, feature engineering, and unsupervised learning techniques.

After imputing anomalies and transforming raw transactions into a customer-centric matrix of normalised RFM metrics and behavioural attributes, a multi-model clustering strategy is applied. K-Means supplies an interpretable baseline; Hierarchical Agglomerative Clustering offers a nuanced multilevel perspective on customer behavior; and DBSCAN/HDBSCAN detect low-density outliers to preserve segment purity.

Cluster evaluation is guided holistically by Silhouette Score, Calinski-Harabasz Index, and Davies-Bouldin Index. Optimal model selection, based on the peak silhouette value of 0.53, yields three behaviourally distinct clusters. These segments are profiled to reveal purchasing propensities and lifetime-value potential, and are subsequently integrated into a cluster-based recommendation engine that promotes top-selling yet unpurchased products within each cluster.

The proposed framework demonstrates a replicable, multi-model insight generation framework for online retailers to refine marketing campaigns, elevate personalisation, and stimulate revenue growth through RFM analysis as well as data-driven customer insights.

Keywords

RFM Analysis • Unsupervised Learning • K-Means Clustering • Hierarchical Agglomerative Clustering • DBSCAN • HDBSCAN • Silhouette Score • Customer Segmentation • Recommendation System • Cluster-Based Filtering • Online Retail Analytics


TL;DR.

We cluster ~4k retail customers using a PCA-compressed feature set and compare three methods.
Final take: K-Means (k=3) gives the clearest three actionable segments; Agglomerative corroborates and zooms in on the K-Means clusters structure; HDBSCAN is best kept for anomaly detection.

Beyond clustering, we built a cluster-driven recommendation logic:

  • Top-3 not-yet-bought product suggestions tailored for each new customer based on its assigned cluster's profile.

Notebooks

Data & Features

  • 4,067 customers × 16 engineered features -> z-scored + outlier freed, then PCA to 6 components, ~81% variance retained. The PCA matrix feeds all three clustering methods.

Results (on the 4,067×6 PCA matrix)

Model #Clusters (noise) Silhouette Calinski–Harabasz Davies–Bouldin Take
K-Means (k=3) 3 (0%) 0.236 1257.17 1.37 Tight, distinct three-segment split.
Agglomerative (complete, k=3) 3 (0%) 0.367 507.64 1.27 Highest cohesion; supports k=3 decision.
HDBSCAN (min_cluster_size=50) 2 (+ 7.8% noise) 0.260 852.42 1.15 Great for outlier flagging; under-segments for marketing.

PCA removes redundancy, speeds clustering, keeps structure; 6 PCs captured ~81% while remaining interpretable via loadings.

Reproduce

pip install -r requirements.txt
jupyter lab
# open notebooks/ and run top to bottom

Final Deliverables:

  • Clean, runnable notebooks for three clustering approaches.
  • Cluster-based recommendation engine logic
  • Visualizations and metrics exported to results/.
  • Report.pdf (academic-style writeup with methods + findings; also the final academic submission towards the fulfillment of our diploma).

Authors

  • Khushi Singh (ML Engineer, yours truly)

  • Shalini Mitra (Business Intelligence Analyst, Project Architect)

This project was completed as part of our PG Diploma in Big Data Analytics at CDAC-Noida (India).

💡 Why this project?
We chose customer segmentation because it merges our backgrounds: Shalini brought business intuition and logic, while I had experience tailoring conservation giving programs for diverse client profiles.
Understanding clusters + customer behaviour and nuances wasn’t just academic, it’s exactly the kind of insight that drives smarter marketing, personalized outreach, meaningful client relationships (with the bonus of business-relevant talking points in an interview!)

AI DISCLOSURE: Some documentation, formatting, and report structuring support was assisted by generative AI tools (Perplexity.ai). All analysis, coding, and interpretation are our own.

About

Customer segmentation of retail data using PCA + K-Means, Agglomerative, and DBSCAN/HDBSCAN, with cluster profiling via radar charts and histograms, and cluster-based top-3 product recommendations.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors