Course: Machine Learning and Data Mining (MLDM)
Academic Year: 2024/2025
Authors: Davide Leone & Filippo Camossi
This project analyzes the Global Coffee Health Dataset (10,000 samples) using a dual-track machine learning architecture to achieve two main objectives:
- Supervised Modeling (Classification): Build and optimize predictive models to classify an individual's stress level (Low, Medium, High) based on health metrics, demographics, and coffee consumption.
- Unsupervised Analysis (Clustering): Identify natural behavioral profiles (archetypes) within the population based purely on their habits and characteristics.
The complete end-to-end Data Science pipeline—from Exploratory Data Analysis (EDA) and feature engineering to model evaluation and cluster interpretation—is documented in the main Jupyter Notebook.
Our analysis uncovered four critical insights regarding the relationship between coffee, health, and stress:
- High Predictive Accuracy: Stress levels can be predicted with high confidence. Our best-performing model, a Bagging Ensemble, achieved 90.35% accuracy on the test set.
- Prediction Drivers (Health > Coffee): Feature importance analysis for predicting stress revealed that pre-existing health metrics are the dominant factors:
  - `Health_Issues`: 66% - 91% importance.
  - `Health_Risk_Score`: ~16% importance.
  - Caffeine/Coffee features: < 5% importance.
- Clustering Drivers (Coffee Defines Groups): Unsupervised analysis (K-Means, k=3) revealed 3 distinct population profiles. The feature importance for creating these groups showed the opposite picture:
  - `Caffeine_per_Serving`: 53.2% importance.
  - `Caffeine_mg` & `Coffee_Intake`: ~28% combined importance.
- The Hidden Relationship: Caffeine consumption is not a good direct predictor of stress. However, it defines behavioral profiles (Cluster 0: High Consumers, Cluster 1: Moderate, Cluster 2: Non-Consumers) which inherently have significantly different stress rates. Health status predicts stress, but the caffeine consumption profile acts as a proxy for the risk group one belongs to.
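One way to check the proxy relationship described in the last finding is to cross-tabulate cluster labels against stress levels. The sketch below is only an illustration: the rows are hand-made toy values (not project data or results), and the column names `Cluster` and `Stress_Level` are assumptions about how the notebook stores these fields.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT results from the project.
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "Stress_Level": ["High", "High", "Medium", "Low",
                     "Medium", "Low", "Low", "High",
                     "Low", "Low", "Low", "Medium"],
})


def stress_rates_by_cluster(df, cluster_col="Cluster", target_col="Stress_Level"):
    """Return the share of each stress level within each cluster (rows sum to 1)."""
    return pd.crosstab(df[cluster_col], df[target_col], normalize="index")


rates = stress_rates_by_cluster(toy)
print(rates.round(2))
```

If the per-cluster rows differ clearly, cluster membership carries information about stress even when the individual caffeine features rank low in supervised feature importance.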
The project follows a rigorous, structured Machine Learning pipeline:
- Exploratory Data Analysis (EDA): Target distribution analysis (handling class imbalance), correlation mapping (Heatmaps), and statistical testing (ANOVA F-test, Chi-Square) for initial feature selection.
- Targeted Feature Engineering: Creation of high-impact derived variables, such as `Caffeine_per_Serving` (handling zero-division cases) and `Health_Risk_Score` (a composite score of 4 risk factors).
- Robust Preprocessing & Splitting:
  - Categorical encoding (LabelEncoding, One-Hot).
  - Feature scaling (StandardScaler).
  - Stratified 60-20-20 Split (Train, Validation, Test) to ensure robust evaluation across imbalanced classes.
- Supervised Modeling: Comparison of 8+ algorithms including Decision Trees, Random Forests, AdaBoost, Gradient Boosting, XGBoost, Bagging, SVM (Linear & RBF), and Multi-Layer Perceptrons (MLP). Evaluated via Accuracy, F1-Score, and Confusion Matrices.
- Advanced Hyperparameter Tuning: Systematic tuning using Grid Search (for SVM) and Bayesian Optimization via Optuna (for Tree-based models and Neural Networks).
- Unsupervised Clustering:
  - Segmented the population using K-Means.
  - Optimal `k=3` selected via the Elbow Method (WCSS) and Silhouette Analysis.
  - Cluster visualization using 2D PCA and interpretative profiling via scaled deviation heatmaps.
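The feature engineering and stratified 60-20-20 split steps listed above could be sketched roughly as follows. This is a minimal sketch on synthetic stand-in data, not the notebook's code: the definition of `Caffeine_per_Serving` as `Caffeine_mg / Coffee_Intake` (set to 0 when intake is 0), the target column name `Stress_Level`, and the toy value of 95 mg per cup are all assumptions made for illustration. `Health_Risk_Score` is omitted because its four risk factors are not listed here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed column names, imbalanced target).
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "Coffee_Intake": rng.integers(0, 6, size=n).astype(float),
    "Stress_Level": rng.choice(["Low", "Medium", "High"], size=n, p=[0.7, 0.2, 0.1]),
})
df["Caffeine_mg"] = df["Coffee_Intake"] * 95.0  # toy relation, assumption only

# Assumed definition: caffeine per cup, with 0 for people who drink no coffee.
# Replacing 0 with NaN avoids a division by zero; fillna restores a defined value.
intake = df["Coffee_Intake"].replace(0, np.nan)
df["Caffeine_per_Serving"] = (df["Caffeine_mg"] / intake).fillna(0.0)

X = df[["Coffee_Intake", "Caffeine_mg", "Caffeine_per_Serving"]]
y = df["Stress_Level"]

# Step 1: hold out 20% as the test set, preserving class proportions.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Step 2: 25% of the remaining 80% gives a 20% validation set (60-20-20 overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42
)

# Fit the scaler on the training split only, then apply it to all splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
```

Fitting `StandardScaler` on the training split alone keeps validation and test statistics out of the preprocessing step, which is the usual way to avoid leakage.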
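The choice of `k` via the Elbow Method and Silhouette Analysis can be illustrated with a short sketch. It runs on synthetic blobs generated with scikit-learn rather than on the coffee dataset, so the cluster centers, the candidate range of `k`, and the random seeds are assumptions chosen only to make the example self-contained.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=600,
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=1.0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

wcss = {}         # within-cluster sum of squares, used for the elbow plot
silhouettes = {}  # mean silhouette coefficient per candidate k

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# Pick the k with the highest mean silhouette; the elbow in WCSS is read visually.
best_k = max(silhouettes, key=silhouettes.get)

for k in sorted(wcss):
    print(f"k={k}  WCSS={wcss[k]:.1f}  silhouette={silhouettes[k]:.3f}")
print("best k by silhouette:", best_k)
```

WCSS keeps decreasing as `k` grows, so it is usually inspected for the point where the drop flattens, while the silhouette score gives a single value per `k` that can be compared directly.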
- Language: Python
- Environment: Jupyter Notebook / JupyterLab
- Data Manipulation & Math: `pandas`, `numpy`, `scipy`
- Machine Learning: `scikit-learn`, `xgboost`, `optuna` (Bayesian Optimization)
- Visualization: `matplotlib`, `seaborn`
- Model Serialization: `joblib`
Follow these steps to run the analysis locally.
- Python 3.8+
- Ensure the dataset CSV file is placed in the root directory of the project.
- Clone the repository and move into the directory it creates:
    git clone https://github.com/SickCiQuattro/MLDM-Coffee-Health.git
    cd MLDM-Coffee-Health
- Create and activate a virtual environment:
    # On Windows
    python -m venv venv
    venv\Scripts\activate
    # On macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
- Install the required dependencies:
    pip install -r requirements.txt
- Launch the Jupyter Notebook:
    jupyter notebook
- Open `MLDM_Project.ipynb` and run the cells sequentially to reproduce the analysis.