Global Coffee & Health: Stress Prediction & Behavioral Clustering

Python Jupyter Scikit-Learn XGBoost Pandas

Course: Machine Learning and Data Mining (MLDM)
Academic Year: 2024/2025
Authors: Davide Leone & Filippo Camossi

Project Overview

This project analyzes the Global Coffee Health Dataset (10,000 samples) using a dual-track machine learning architecture to achieve two main objectives:

  1. Supervised Modeling (Classification): Build and optimize predictive models to classify an individual's stress level (Low, Medium, High) based on health metrics, demographics, and coffee consumption.
  2. Unsupervised Analysis (Clustering): Identify natural behavioral profiles (archetypes) within the population based purely on their habits and characteristics.

The complete end-to-end Data Science pipeline—from Exploratory Data Analysis (EDA) and feature engineering to model evaluation and cluster interpretation—is documented in the main Jupyter Notebook.

Executive Summary & Key Insights

Our analysis uncovered four critical insights regarding the relationship between coffee, health, and stress:

  • High Predictive Accuracy: Stress levels can be predicted with high confidence. Our best-performing model, a Bagging Ensemble, achieved 90.35% accuracy on the test set.
  • Prediction Drivers (Health > Coffee): Feature importance analysis for predicting stress revealed that pre-existing health metrics are the dominant factors:
    • Health_Issues: 66% - 91% importance.
    • Health_Risk_Score: ~16% importance.
    • Caffeine/Coffee features: < 5% importance.
  • Clustering Drivers (Coffee Defines Groups): Unsupervised analysis (K-Means, k=3) revealed 3 distinct population profiles. The feature importance for creating these groups showed the opposite picture:
    • Caffeine_per_Serving: 53.2% importance.
    • Caffeine_mg & Coffee_Intake: ~28% combined importance.
  • The Hidden Relationship: Caffeine consumption is not a good direct predictor of stress. However, it defines behavioral profiles (Cluster 0: High Consumers, Cluster 1: Moderate, Cluster 2: Non-Consumers) which inherently have significantly different stress rates. Health status predicts stress, but the caffeine consumption profile acts as a proxy for the risk group one belongs to.
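
To illustrate the last point, here is a minimal sketch of how cluster membership can be compared against stress rates. It is not the notebook's code: the data is synthetic, the per-group stress probabilities are made up for illustration, and only the column names (Caffeine_mg, Stress_Level) are borrowed from this README.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
levels = ["Low", "Medium", "High"]

# Synthetic population (hypothetical): three caffeine-intake groups, each with
# its own made-up stress-level probabilities.
frames = []
for mean_mg, probs in [(5, [0.7, 0.2, 0.1]), (200, [0.5, 0.3, 0.2]), (450, [0.3, 0.3, 0.4])]:
    frames.append(pd.DataFrame({
        "Caffeine_mg": rng.normal(mean_mg, 15, 100).clip(min=0),
        "Stress_Level": rng.choice(levels, size=100, p=probs),
    }))
df = pd.concat(frames, ignore_index=True)

# Cluster on the scaled consumption feature only; stress is never shown to K-Means.
X = StandardScaler().fit_transform(df[["Caffeine_mg"]])
df["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Row-normalised table: share of each stress level within each cluster.
stress_by_cluster = pd.crosstab(df["Cluster"], df["Stress_Level"], normalize="index")
print(stress_by_cluster.round(2))
```

Each row of the resulting table sums to 1, so differences between rows show how stress rates vary across consumption profiles even though stress was not used to form the clusters. Note that K-Means cluster numbers are arbitrary and will not necessarily match the Cluster 0/1/2 labels above.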

Methodology & Pipeline

The project follows a rigorous, structured Machine Learning pipeline:

  1. Exploratory Data Analysis (EDA): Target distribution analysis (handling class imbalance), correlation mapping (Heatmaps), and statistical testing (ANOVA F-test, Chi-Square) for initial feature selection.
  2. Targeted Feature Engineering: Creation of high-impact derived variables, such as Caffeine_per_Serving (handling zero-division cases) and Health_Risk_Score (a composite score of 4 risk factors).
  3. Robust Preprocessing & Splitting:
    • Categorical encoding (LabelEncoding, One-Hot).
    • Feature scaling (StandardScaler).
    • Stratified 60-20-20 Split (Train, Validation, Test) to ensure robust evaluation across imbalanced classes.
  4. Supervised Modeling: Comparison of 8+ algorithms including Decision Trees, Random Forests, AdaBoost, Gradient Boosting, XGBoost, Bagging, SVM (Linear & RBF), and Multi-Layer Perceptrons (MLP). Evaluated via Accuracy, F1-Score, and Confusion Matrices.
  5. Advanced Hyperparameter Tuning: Systematic tuning using Grid Search (for SVM) and Bayesian Optimization via Optuna (for Tree-based models and Neural Networks).
  6. Unsupervised Clustering:
    • Segmented the population using K-Means.
    • Optimal k=3 selected via the Elbow Method (WCSS) and Silhouette Analysis.
    • Cluster visualization using 2D PCA and interpretative profiling via scaled deviation heatmaps.
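
Steps 2 and 3 can be sketched as follows. This is an illustrative example under stated assumptions, not the notebook's implementation: the toy values are invented, the choice to map zero-intake rows to 0 is an assumption about how the zero-division case is handled, and the Stress_Level column name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: column names follow this README where available, values are made up.
df = pd.DataFrame({
    "Coffee_Intake": [0.0, 1.0, 2.0, 4.0] * 25,       # cups per day
    "Caffeine_mg":   [0.0, 95.0, 190.0, 380.0] * 25,  # total daily caffeine
    "Stress_Level":  ["Low"] * 70 + ["Medium"] * 20 + ["High"] * 10,
})

# Caffeine per serving. Zero intake would divide by zero, so those rows are
# masked to NaN before the division and then set to 0 (an assumed convention).
intake = df["Coffee_Intake"].replace(0, np.nan)
df["Caffeine_per_Serving"] = (df["Caffeine_mg"] / intake).fillna(0.0)

X = df.drop(columns="Stress_Level")
y = df["Stress_Level"]

# Stratified 60-20-20 split in two stages: hold out 20% for test, then take
# 25% of the remaining 80% (= 20% of the full data) for validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Passing stratify at both stages keeps the Low/Medium/High proportions the same in all three subsets, which matters when the minority class is small.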

Tech Stack

  • Language: Python
  • Environment: Jupyter Notebook / JupyterLab
  • Data Manipulation & Math: pandas, numpy, scipy
  • Machine Learning: scikit-learn, xgboost, optuna (Bayesian Optimization)
  • Visualization: matplotlib, seaborn
  • Model Serialization: joblib

🚀 Getting Started

Follow these steps to run the analysis locally.

Prerequisites

  • Python 3.8+
  • Ensure the dataset CSV file is placed in the root directory of the project.

Installation & Execution

  1. Clone the repository:

    git clone https://github.com/SickCiQuattro/MLDM-Coffee-Health.git
    cd MLDM-Coffee-Health
  2. Create and activate a virtual environment:

    # On Windows
    python -m venv venv
    venv\Scripts\activate
    # On macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
  3. Install the required dependencies:

    pip install -r requirements.txt
  4. Launch the Jupyter Notebook:

    jupyter notebook
  5. Open MLDM_Project.ipynb and run the cells sequentially to reproduce the analysis.

About

Data science project analyzing the relationship between coffee habits, health risks, and stress levels through predictive modeling and clustering.
