Global Coffee & Health: Stress Prediction & Behavioral Clustering

Course: Machine Learning and Data Mining (MLDM)
Academic Year: 2024/2025
Authors: Davide Leone & Filippo Camossi

Project Overview

This project analyzes the Global Coffee Health Dataset (10,000 samples) using a dual-track machine learning architecture to achieve two main objectives:

Supervised Modeling (Classification): Build and optimize predictive models to classify an individual's stress level (Low, Medium, High) based on health metrics, demographics, and coffee consumption.
Unsupervised Analysis (Clustering): Identify natural behavioral profiles (archetypes) within the population based purely on their habits and characteristics.

The complete end-to-end data science pipeline, from Exploratory Data Analysis (EDA) and feature engineering to model evaluation and cluster interpretation, is documented in the main Jupyter Notebook (MLDM_Project.ipynb).

Executive Summary & Key Insights

Our analysis uncovered four critical insights regarding the relationship between coffee, health, and stress:

High Predictive Accuracy: Stress levels can be predicted reliably. Our best-performing model, a Bagging ensemble, achieved 90.35% accuracy on the held-out test set.
Prediction Drivers (Health > Coffee): Feature importance analysis for predicting stress revealed that pre-existing health metrics are the dominant factors:
- Health_Issues: 66% - 91% importance.
- Health_Risk_Score: ~16% importance.
- Caffeine/Coffee features: < 5% importance.
Clustering Drivers (Coffee Defines Groups): Unsupervised analysis (K-Means, k=3) revealed 3 distinct population profiles. The feature importance for creating these groups showed the opposite picture:
- Caffeine_per_Serving: 53.2% importance.
- Caffeine_mg & Coffee_Intake: ~28% combined importance.
The Hidden Relationship: Caffeine consumption is not a good direct predictor of stress. However, it defines behavioral profiles (Cluster 0: High Consumers, Cluster 1: Moderate, Cluster 2: Non-Consumers) which inherently have significantly different stress rates. Health status predicts stress, but the caffeine consumption profile acts as a proxy for the risk group one belongs to.
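
The proxy relationship described above can be checked by cross-tabulating cluster membership against stress level. The following is a minimal sketch on synthetic data, not the notebook's actual code: the three consumption groups and their stress rates are invented for illustration, and the column name Stress_Level is an assumption (Coffee_Intake and Caffeine_mg are the feature names used in this README).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300  # rows per synthetic group

# Synthetic stand-in for the real data: heavy, moderate, and non-consumers.
coffee_intake = np.concatenate([
    rng.normal(5.0, 0.3, n),
    rng.normal(2.5, 0.3, n),
    np.zeros(n),
])
caffeine_mg = np.clip(coffee_intake * 95.0 + rng.normal(0.0, 5.0, 3 * n), 0.0, None)

# Invented stress rates per group, chosen only so that the groups differ.
levels = ["Low", "Medium", "High"]
stress = np.concatenate([
    rng.choice(levels, size=n, p=[0.5, 0.3, 0.2]),
    rng.choice(levels, size=n, p=[0.7, 0.2, 0.1]),
    rng.choice(levels, size=n, p=[0.8, 0.1, 0.1]),
])

df = pd.DataFrame({
    "Coffee_Intake": coffee_intake,
    "Caffeine_mg": caffeine_mg,
    "Stress_Level": stress,  # assumed column name
})

# Cluster on consumption features only; the stress label is never shown to K-Means.
features = StandardScaler().fit_transform(df[["Coffee_Intake", "Caffeine_mg"]])
df["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Share of each stress level within each cluster (each row sums to 1).
stress_by_cluster = pd.crosstab(df["Cluster"], df["Stress_Level"], normalize="index")
print(stress_by_cluster.round(2))
```

If the rows of the resulting table differ noticeably, cluster membership carries information about stress even though the clustering never used the label. On the real dataset, a Chi-Square test on the unnormalized table would be one way to check whether the difference is significant.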

Methodology & Pipeline

The project follows a rigorous, structured Machine Learning pipeline:

Exploratory Data Analysis (EDA): Target distribution analysis (handling class imbalance), correlation mapping (Heatmaps), and statistical testing (ANOVA F-test, Chi-Square) for initial feature selection.
Targeted Feature Engineering: Creation of high-impact derived variables, such as Caffeine_per_Serving (handling zero-division cases) and Health_Risk_Score (a composite score of 4 risk factors).
Robust Preprocessing & Splitting:
- Categorical encoding (LabelEncoding, One-Hot).
- Feature scaling (StandardScaler).
- Stratified 60-20-20 Split (Train, Validation, Test) to ensure robust evaluation across imbalanced classes.
Supervised Modeling: Comparison of 8+ algorithms including Decision Trees, Random Forests, AdaBoost, Gradient Boosting, XGBoost, Bagging, SVM (Linear & RBF), and Multi-Layer Perceptrons (MLP). Evaluated via Accuracy, F1-Score, and Confusion Matrices.
Advanced Hyperparameter Tuning: Systematic tuning using Grid Search (for SVM) and Bayesian Optimization via Optuna (for Tree-based models and Neural Networks).
Unsupervised Clustering:
- Segmented the population using K-Means.
- Optimal k=3 selected via the Elbow Method (WCSS) and Silhouette Analysis.
- Cluster visualization using 2D PCA and interpretative profiling via scaled deviation heatmaps.
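
Three of the steps above (the zero-safe Caffeine_per_Serving ratio, the stratified 60-20-20 split, and the choice of k) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the notebook's code: the helper names, the Stress_Level column name, and the toy data are hypothetical, and setting zero-intake rows to 0 is one plausible way to handle the division, which may differ from what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split


def add_caffeine_per_serving(df):
    """Caffeine_mg / Coffee_Intake, with zero-intake rows set to 0 (assumed policy)."""
    out = df.copy()
    servings = out["Coffee_Intake"].replace(0, np.nan)  # avoid dividing by zero
    out["Caffeine_per_Serving"] = (out["Caffeine_mg"] / servings).fillna(0.0)
    return out


def stratified_60_20_20(X, y, seed=42):
    """Two chained stratified splits: 60% train, then the remaining 40% halved."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


def scan_k(X, k_values, seed=42):
    """Return {k: (WCSS, silhouette score)} for each candidate k."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        scores[k] = (km.inertia_, silhouette_score(X, km.labels_))
    return scores


# Toy frame with an imbalanced target (70/20/10), standing in for the real CSV.
toy = pd.DataFrame({
    "Coffee_Intake": [0.0, 2.0, 4.0, 0.0, 1.0] * 20,
    "Caffeine_mg": [0.0, 190.0, 380.0, 0.0, 95.0] * 20,
    "Stress_Level": ["Low"] * 70 + ["Medium"] * 20 + ["High"] * 10,
})
toy = add_caffeine_per_serving(toy)
train, val, test = stratified_60_20_20(
    toy.drop(columns="Stress_Level"), toy["Stress_Level"])

# Well-separated synthetic blobs, used only to exercise the k scan.
blobs, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=1.0, random_state=0)
scores = scan_k(blobs, range(2, 7))
best_k = max(scores, key=lambda k: scores[k][1])  # highest silhouette score
```

Stratifying both splits keeps the minority stress classes represented in the validation and test sets. WCSS always decreases as k grows, so the elbow is read off the curve by eye, while the silhouette score gives a single value per k that can be maximized directly.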

Tech Stack

Language: Python
Environment: Jupyter Notebook / JupyterLab
Data Manipulation & Math: pandas, numpy, scipy
Machine Learning: scikit-learn, xgboost, optuna (Bayesian Optimization)
Visualization: matplotlib, seaborn
Model Serialization: joblib

🚀 Getting Started

Follow these steps to run the analysis locally.

Prerequisites

Python 3.8+
Ensure the dataset CSV file is placed in the root directory of the project.

Installation & Execution

Clone the repository:

git clone https://github.com/SickCiQuattro/MLDM-Coffee-Health.git
cd MLDM-Coffee-Health

Create and activate a virtual environment:

# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate

Install the required dependencies:
```
pip install -r requirements.txt
```
Launch the Jupyter Notebook:
```
jupyter notebook
```
Open MLDM_Project.ipynb and run the cells sequentially to reproduce the analysis.
