Predictions from machine learning models may not properly account for uncertainty

Hi all,

I was just reflecting on miceforest's imputation strategy, and I wonder if it may underestimate uncertainty in missing values.

In classical multiple imputation methods that use a Gaussian linear model, a prediction for the missing value is generated by (1) drawing a value beta_dot from the posterior distribution of the regression coefficients and a value sigma_dot from the posterior distribution of the observation noise, and then (2) drawing a prediction from N(X*beta_dot, sigma_dot). So the generated predictions account for noise in the data and also for uncertainty in the model (the uncertainty in the estimated regression coefficients and in the estimated noise), assuming of course that such a linear model fits well enough. The idea is not to generate the 'best' prediction for a missing value, but to draw predictions from a distribution that reflects the uncertainty in the missing values. These predictions can subsequently be used for predictive mean matching. This is based on https://stefvanbuuren.name/fimd/how-to-generate-multiple-imputations.html#sec:meth3
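To make the two-step draw concrete, here is a minimal NumPy sketch of how I understand it. The function name `draw_normal_imputations` and the small `ridge` term are my own choices for illustration, not anything taken from miceforest or mice, and I treat sigma_dot as a standard deviation.

```python
import numpy as np

def draw_normal_imputations(X_obs, y_obs, X_mis, rng, ridge=1e-5):
    """One set of imputations under a Gaussian linear model (illustrative sketch).

    X_obs, y_obs: rows where the target is observed; X_mis: rows where it is missing.
    """
    n_obs, n_coef = X_obs.shape
    S = X_obs.T @ X_obs
    # Small ridge term (assumed value) to keep the inverse numerically stable.
    V = np.linalg.inv(S + ridge * np.diag(np.diag(S)))
    beta_hat = V @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat
    # Step 1a: draw the noise standard deviation (uncertainty in estimated noise).
    sigma_dot = np.sqrt(resid @ resid / rng.chisquare(n_obs - n_coef))
    # Step 1b: draw the coefficients (uncertainty in estimated coefficients).
    L = np.linalg.cholesky(V)
    beta_dot = beta_hat + sigma_dot * (L @ rng.standard_normal(n_coef))
    # Step 2: draw predictions, adding observation noise on top.
    return X_mis @ beta_dot + sigma_dot * rng.standard_normal(X_mis.shape[0])
```

Calling this repeatedly with the same data gives a different set of imputed values each time, which is the between-imputation variability I am referring to.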

When one uses a random forest to make predictions, for example, the output is a single value, which is the average of the outputs of B trees, each fitted to a bootstrap sample and each using a different subset of predictors. If the trees are fitted to new bootstrap samples at each iteration, the predictions they make should account for 'uncertainty in model parameters', as different samples would lead to different trees. However, as the random forest ultimately returns a single value, it does not provide an estimate of how uncertain the missing value may be. A naive solution could be to draw a random sample of K trees from all fitted trees, and use the average value of these trees as the prediction. But I do not know if this would account for uncertainty in a 'proper' way: when K=1 there is probably too much uncertainty because individual trees can strongly overfit, and when K=B there is too little uncertainty. It looks like there is some work on this in the literature (e.g. https://arxiv.org/abs/1404.6473), but from a very quick look I didn't spot anything that is straightforward to implement.
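A rough sketch of the naive K-tree idea, just to show how the spread of the draws depends on K. The per-tree predictions here are synthetic stand-ins and the function names are hypothetical. With a scikit-learn forest, I assume they could be collected by calling `predict` on each tree in `estimators_`, but I have not checked how this maps onto miceforest's internals.

```python
import numpy as np

def sample_tree_average(tree_preds, k, rng):
    """Average the predictions of k trees chosen at random without replacement.

    tree_preds: array of shape (B, n), one row of predictions per tree.
    """
    chosen = rng.choice(tree_preds.shape[0], size=k, replace=False)
    return tree_preds[chosen].mean(axis=0)

def draw_spread(tree_preds, k, rng, n_draws=200):
    """Mean standard deviation across repeated K-tree draws."""
    draws = np.array([sample_tree_average(tree_preds, k, rng) for _ in range(n_draws)])
    return draws.std(axis=0).mean()

rng = np.random.default_rng(0)
# Synthetic stand-in: B=100 trees predicting 5 missing values.
tree_preds = 3.0 + rng.normal(scale=1.0, size=(100, 5))
for k in (1, 10, 100):
    print(k, draw_spread(tree_preds, k, rng))
```

The spread shrinks as K grows and collapses to essentially zero at K=B, since every draw is then the same forest average. That is exactly why I am unsure which K, if any, would be 'proper'.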

So overall, when the imputation model is a machine learning model that returns a single 'best' prediction, rather than an interval estimate for where the predicted value may lie, the imputed values that are generated may vary too little, even if predictive mean matching is used subsequently. I do not know at the moment how big a consequence this may have for downstream analyses. I wonder if you have considered this potential issue?

Thanks,
Andres

Predictions from machine learning models may not properly account for uncertainty #71
