
Predictions from machine learning models may not properly account for uncertainty #71

Description

@tammandres

Hi all,

I was just reflecting on miceforest's imputation strategy, and I wonder if it may underestimate uncertainty in missing values.

In classical multiple imputation methods that use a Gaussian linear model, a prediction for the missing value is generated by (1) drawing a value beta_dot from the posterior distribution of the regression coefficients and a value sigma_dot from the posterior distribution of the observation noise, and then (2) drawing a prediction from N(X*beta_dot, sigma_dot^2), where sigma_dot is the noise standard deviation. The predictions generated this way account for noise in the data and also for uncertainty in the model (the uncertainty in the estimated regression coefficients and in the estimated noise), assuming of course that such a linear model fits well enough. The idea is not to generate the 'best' prediction for a missing value, but to draw predictions from a distribution that reflects the uncertainty in the missing values. These predictions can subsequently be used for predictive mean matching. This is based on https://stefvanbuuren.name/fimd/how-to-generate-multiple-imputations.html#sec:meth3
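To make the two-step draw concrete, here is a minimal sketch of it in Python with NumPy. It follows the general shape of the Bayesian linear regression imputation described at the link above, but the function name `draw_bayesian_linear_imputations`, the `ridge` parameter and the toy data are my own assumptions for illustration, not anything taken from miceforest or the mice package.

```python
import numpy as np


def draw_bayesian_linear_imputations(X_obs, y_obs, X_mis, rng, ridge=1e-5):
    """Draw one set of imputations under a Gaussian linear model.

    Hypothetical helper for illustration only (not part of miceforest).
    """
    n_obs, n_coef = X_obs.shape

    # Least-squares fit with a small ridge term for numerical stability.
    S = X_obs.T @ X_obs
    V = np.linalg.inv(S + ridge * np.diag(np.diag(S)))
    beta_hat = V @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat

    # Step 1a: draw the noise standard deviation (residual sum of squares
    # divided by a chi-square variate with n_obs - n_coef degrees of freedom).
    sigma_dot = np.sqrt(resid @ resid / rng.chisquare(n_obs - n_coef))

    # Step 1b: draw the coefficients around beta_hat with covariance sigma_dot^2 * V.
    L = np.linalg.cholesky(V)
    beta_dot = beta_hat + sigma_dot * (L @ rng.standard_normal(n_coef))

    # Step 2: draw predictions from N(X_mis @ beta_dot, sigma_dot^2).
    return X_mis @ beta_dot + sigma_dot * rng.standard_normal(X_mis.shape[0])


# Toy data (made up): intercept plus two predictors, five rows to impute.
rng = np.random.default_rng(0)
n_obs, n_mis = 200, 5
X_obs = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 2))])
y_obs = X_obs @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n_obs)
X_mis = np.column_stack([np.ones(n_mis), rng.normal(size=(n_mis, 2))])

y_draw_1 = draw_bayesian_linear_imputations(X_obs, y_obs, X_mis, rng)
y_draw_2 = draw_bayesian_linear_imputations(X_obs, y_obs, X_mis, rng)
print(y_draw_1)
print(y_draw_2)
```

Calling the function twice gives two different sets of imputed values for the same rows, which is the between-imputation variability I am referring to.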

When one uses a random forest to make predictions, for example, the output is a single value, which is the average of the outputs of B trees, each fitted to a bootstrap sample and each considering random subsets of the predictors. If the trees are fitted to new bootstrap samples at each iteration, the predictions they make should account for 'uncertainty in model parameters', as different samples would lead to different trees. However, as the random forest ultimately returns a single value, it does not provide an estimate of how uncertain the missing value may be. A naive solution could be to draw a random sample of K trees from all fitted trees, and use the average value of these trees as the prediction. But I do not know if this would account for uncertainty in a 'proper' way: when K=1 there is probably too much uncertainty, because individual trees can strongly overfit, and when K=B there is too little uncertainty. There appears to be some work on this in the literature (e.g. https://arxiv.org/abs/1404.6473), but from a very quick look I didn't spot anything that is straightforward to implement.
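The naive K-tree idea can be sketched as follows, assuming scikit-learn's `RandomForestRegressor` (miceforest itself uses a different backend, so this is only an illustration). The helper `predict_with_tree_subsample` and the toy data are hypothetical. The sketch only shows how the spread of predictions shrinks as K grows towards B; it does not claim that any particular K gives 'proper' uncertainty.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def predict_with_tree_subsample(forest, X, k, rng):
    """Average the predictions of k trees drawn without replacement.

    Hypothetical helper for illustration only (not part of miceforest).
    """
    tree_idx = rng.choice(len(forest.estimators_), size=k, replace=False)
    per_tree = np.stack([forest.estimators_[i].predict(X) for i in tree_idx])
    return per_tree.mean(axis=0)


# Toy data (made up) and a forest with B = 100 trees.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
X_new = rng.normal(size=(5, 3))

# Spread of repeated predictions for the same rows, for several values of K.
spread = {}
for k in (1, 10, 100):
    draws = np.stack(
        [predict_with_tree_subsample(forest, X_new, k, rng) for _ in range(200)]
    )
    spread[k] = draws.std(axis=0).mean()
    print(f"K={k:3d}  mean std of predictions across draws: {spread[k]:.4f}")
```

With K=B every draw uses all trees, so the prediction is just the usual forest average and the spread collapses to zero, whereas K=1 gives the full tree-to-tree variability.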

So overall, when the imputation model is a machine learning model that returns a single 'best' prediction, rather than an interval estimate for where the predicted value may lie, the imputed values that are generated may vary too little, even if predictive mean matching is used subsequently. I do not know at the moment how large the consequences of this may be for downstream analyses. I wonder if you have considered this potential issue?

Thanks,
Andres
