Customer Segmentation on Online Retail (K-Means, Agglomerative, DBSCAN/HDBSCAN) + Cluster-Based Recommendations
This project presents a comprehensive machine-learning framework for enhancing online retail analytics through data-driven customer segmentation and targeted product recommendation. Leveraging the publicly available 2010–2011 transactional dataset from a UK-based retailer, the study implements rigorous data cleaning, feature engineering, and unsupervised learning techniques.
After imputing anomalies and transforming raw transactions into a customer-centric matrix of normalised RFM metrics and behavioural attributes, a multi-model clustering strategy is applied. K-Means supplies an interpretable baseline; Hierarchical Agglomerative Clustering offers a nuanced multilevel perspective on customer behaviour; and DBSCAN/HDBSCAN detect low-density outliers to preserve segment purity.
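As a rough illustration of the RFM step, here is a minimal sketch in pandas. It assumes the column names of the public Online Retail dataset (`CustomerID`, `InvoiceNo`, `InvoiceDate`, `Quantity`, `UnitPrice`); the helper name `build_rfm`, the snapshot date, and the toy rows are hypothetical and are not taken from the notebooks.

```python
import pandas as pd


def build_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Aggregate line-item transactions into one RFM row per customer."""
    tx = tx.assign(Revenue=tx["Quantity"] * tx["UnitPrice"])
    return tx.groupby("CustomerID").agg(
        # Recency: days between the snapshot date and the last purchase
        Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
        # Frequency: number of distinct invoices
        Frequency=("InvoiceNo", "nunique"),
        # Monetary: total spend
        Monetary=("Revenue", "sum"),
    )


# Toy transactions (hypothetical values, for illustration only)
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-01", "2011-12-01", "2011-10-01", "2011-10-01", "2011-10-15"]
    ),
    "Quantity": [2, 1, 5, 1, 3],
    "UnitPrice": [3.0, 10.0, 1.0, 4.0, 2.0],
})
rfm = build_rfm(tx, snapshot=pd.Timestamp("2011-12-10"))
print(rfm)
```

The resulting frame has one row per customer and can be extended with further behavioural columns before scaling.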
Cluster evaluation is guided holistically by Silhouette Score, Calinski-Harabasz Index, and Davies-Bouldin Index. Optimal model selection, based on the peak silhouette value of 0.53, yields three behaviourally distinct clusters. These segments are profiled to reveal purchasing propensities and lifetime-value potential, and are subsequently integrated into a cluster-based recommendation engine that promotes top-selling yet unpurchased products within each cluster.
The proposed framework gives online retailers a replicable, multi-model approach to refining marketing campaigns, elevating personalisation, and stimulating revenue growth through RFM analysis and data-driven customer insights.
RFM Analysis • Unsupervised Learning • K-Means Clustering • Hierarchical Agglomerative Clustering • DBSCAN • HDBSCAN • Silhouette Score • Customer Segmentation • Recommendation System • Cluster-Based Filtering • Online Retail Analytics
We cluster ~4k retail customers using a PCA-compressed feature set and compare three methods.
Final take: K-Means (k=3) gives the clearest three actionable segments; Agglomerative corroborates the K-Means cluster structure and zooms in on it; HDBSCAN is best kept for anomaly detection.
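To show what "kept for anomaly detection" can look like in practice, here is a small sketch that uses scikit-learn's `DBSCAN` as a stand-in for HDBSCAN (the two differ in parameters and behaviour; `eps` and `min_samples` below are arbitrary illustrative values). Density-based methods label low-density points as `-1`, which can be read as an outlier flag. The data is synthetic.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-in: one dense group of "typical" customers ...
dense = rng.normal(loc=0.0, scale=0.3, size=(200, 2))
# ... plus five isolated points far away from everything else
isolated = np.array(
    [[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0], [-10.0, -10.0], [0.0, 15.0]]
)
X = np.vstack([dense, isolated])

# Points without enough neighbours within eps receive the label -1 (noise)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

is_outlier = labels == -1
print(f"flagged as noise: {is_outlier.sum()} of {len(X)} points")
```

Flagged rows can then be excluded from the segmentation or reviewed separately.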
Beyond clustering, we built a cluster-driven recommendation logic:
- Top-3 not-yet-bought product suggestions tailored to each new customer, based on the profile of their assigned cluster.
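One way to sketch this logic: rank products by total quantity sold within the customer's cluster, drop anything the customer has already bought, and keep the top three. The function name `recommend`, the column names, and the toy data below are assumptions for illustration, not the implementation in the notebooks.

```python
import pandas as pd


def recommend(purchases: pd.DataFrame, clusters: pd.Series,
              customer_id, top_n: int = 3) -> list:
    """Top-selling products in the customer's cluster that they have not bought."""
    cluster = clusters[customer_id]
    peers = clusters[clusters == cluster].index
    in_cluster = purchases[purchases["CustomerID"].isin(peers)]
    # Rank products by total quantity sold inside the cluster
    ranked = (
        in_cluster.groupby("StockCode")["Quantity"].sum()
        .sort_values(ascending=False)
    )
    already_bought = set(
        purchases.loc[purchases["CustomerID"] == customer_id, "StockCode"]
    )
    return [code for code in ranked.index if code not in already_bought][:top_n]


# Toy purchase history and cluster assignments (hypothetical)
purchases = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3, 3, 4],
    "StockCode": ["A", "B", "A", "C", "D", "C", "E", "Z"],
    "Quantity": [5, 2, 3, 4, 1, 6, 3, 50],
})
clusters = pd.Series({1: 0, 2: 0, 3: 0, 4: 1})

print(recommend(purchases, clusters, customer_id=1))
```

Product `Z` sells the most overall but is never suggested to customer 1, because it is only bought in a different cluster.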
- K-Means: `01_kmeans.ipynb`
- Agglomerative: `02_agglomerative.ipynb`
- DBSCAN/HDBSCAN: `03_dbscan_hdbscan.ipynb`
- 4,067 customers × 16 engineered features -> z-scored with outliers removed, then PCA to 6 components (~81% variance retained). The PCA matrix feeds all three clustering methods.
| Model | #Clusters (noise) | Silhouette | Calinski–Harabasz | Davies–Bouldin | Take |
|---|---|---|---|---|---|
| K-Means (k=3) | 3 (0%) | 0.236 | 1257.17 | 1.37 | Tight, distinct three-segment split. |
| Agglomerative (complete, k=3) | 3 (0%) | 0.367 | 507.64 | 1.27 | Highest cohesion; supports k=3 decision. |
| HDBSCAN (min_cluster_size=50) | 2 (+ 7.8% noise) | 0.260 | 852.42 | 1.15 | Great for outlier flagging; under-segments for marketing. |
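For reference, the three metrics in the table can be computed with scikit-learn as in the sketch below. Higher is better for Silhouette and Calinski-Harabasz; lower is better for Davies-Bouldin. The helper `score` and the synthetic `make_blobs` data are illustrative assumptions, so the numbers it prints are unrelated to the table. Points labelled `-1` (noise from a density-based model) are excluded before scoring, which is one possible convention rather than necessarily the one used for the table.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def score(X: np.ndarray, labels: np.ndarray) -> dict:
    """Internal validation metrics, ignoring points labelled -1 (noise)."""
    keep = labels != -1
    Xk, lk = X[keep], labels[keep]
    out = {
        "n_clusters": int(len(np.unique(lk))),
        "noise_frac": float(1 - keep.mean()),
    }
    if out["n_clusters"] < 2:
        # The metrics are undefined for fewer than two clusters
        out.update(silhouette=None, calinski_harabasz=None, davies_bouldin=None)
        return out
    out.update(
        silhouette=float(silhouette_score(Xk, lk)),
        calinski_harabasz=float(calinski_harabasz_score(Xk, lk)),
        davies_bouldin=float(davies_bouldin_score(Xk, lk)),
    )
    return out


# Synthetic stand-in for the 6-component PCA matrix
X, _ = make_blobs(n_samples=600, centers=3, n_features=6, random_state=0)

models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3, linkage="complete"),
}
results = {name: score(X, model.fit_predict(X)) for name, model in models.items()}
for name, metrics in results.items():
    print(name, metrics)
```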
PCA removes redundancy among correlated features, speeds up clustering, and keeps the main structure; 6 PCs captured ~81% of the variance while remaining interpretable via their loadings.
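A minimal sketch of the scale-then-project step with scikit-learn, assuming a plain `StandardScaler` followed by `PCA(n_components=6)`. The input is a synthetic 16-feature matrix, so the retained variance it prints will differ from the ~81% reported above, and the outlier filtering is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 16 correlated features driven by 4 latent factors
latent = rng.normal(size=(500, 4))
X = latent @ rng.normal(size=(4, 16)) + 0.3 * rng.normal(size=(500, 16))

# z-score every feature, then project onto the first 6 principal components
pipe = make_pipeline(StandardScaler(), PCA(n_components=6, random_state=0))
X_pca = pipe.fit_transform(X)

pca = pipe.named_steps["pca"]
retained = float(pca.explained_variance_ratio_.sum())
# Rows are principal components, columns are the original (scaled) features
loadings = pca.components_

print(f"variance retained by 6 PCs: {retained:.1%}")
print("loadings shape:", loadings.shape)
```

Inspecting each row of `loadings` for its largest absolute weights is a common way to give a component a business-readable name.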
pip install -r requirements.txt
jupyter lab
# open notebooks/ and run top to bottom
- Clean, runnable notebooks for three clustering approaches.
- Cluster-based recommendation engine logic
- Visualizations and metrics exported to `results/`.
- Report.pdf (academic-style writeup with methods + findings; also the final academic submission towards the fulfillment of our diploma).
- Khushi Singh (ML Engineer, yours truly)
- Shalini Mitra (Business Intelligence Analyst, Project Architect)
This project was completed as part of our PG Diploma in Big Data Analytics at CDAC-Noida (India).
💡 Why this project?
We chose customer segmentation because it merges our backgrounds: Shalini brought business intuition and logic, while I had experience tailoring conservation giving programs for diverse client profiles.
Understanding clusters and the nuances of customer behaviour wasn’t just academic; it’s exactly the kind of insight that drives smarter marketing, personalized outreach, and meaningful client relationships (with the bonus of business-relevant talking points in an interview!).
AI DISCLOSURE: Some documentation, formatting, and report structuring support was assisted by generative AI tools (Perplexity.ai). All analysis, coding, and interpretation are our own.