feat: take pipeline's estimator into account to decide tabular vectorizer options in TabularPipeline#2152
Conversation
|
Can you add the link to this issue that will be closed by this PR please? |
correcting doc Co-authored-by: Marie Sacksick <79304610+MarieSacksick@users.noreply.github.com>
Co-authored-by: Marie Sacksick <79304610+MarieSacksick@users.noreply.github.com>
Co-authored-by: Marie Sacksick <79304610+MarieSacksick@users.noreply.github.com>
MarieSacksick
left a comment
There was a problem hiding this comment.
Good for me, thanks :)!
and congratulations!!
waiting for someone else review.
jeromedockes
left a comment
There was a problem hiding this comment.
thank you very much @khaoulariad ! just a few comments :)
| step; | ||
| - a scikit-learn estimator: the provided estimator is used as the final step. | ||
| - a scikit-learn pipeline : the whole pipeline is kept and usual pre-processing by the TableReport | ||
| is added on top, depending on the estimator in the last step of the pipeline. |
There was a problem hiding this comment.
| is added on top, depending on the estimator in the last step of the pipeline. | |
| is added before, depending on the estimator in the last step of the pipeline. |
| """ # noqa: E501 | ||
| vectorizer = TableVectorizer(n_jobs=n_jobs) | ||
| cat_feat_kwargs = {"categorical_features": "from_dtype"} | ||
| if isinstance(estimator, Pipeline): |
There was a problem hiding this comment.
I think this should come after the checks below -- it is unlikely someone will put the string "regressor" or a class at the end of a pipeline before passing it to tabular_pipeline
also please avoid local variable names that differ by a single character like estimator and estimator_. here we can have something like
if isinstance(estimator, Pipeline):
*user_transformers, estimator = estimator.steps
else:
user_transformers = ()
...
make_pipeline(TableVectorizer(), *user_transformers, estimator)| if not isinstance(estimator_, _TREE_ENSEMBLE_CLASSES): | ||
| steps.append(SquashingScaler(max_absolute_value=5)) | ||
| steps.append(estimator) | ||
| if isinstance(estimator, Pipeline): |
There was a problem hiding this comment.
with the suggestion above we can do all the handling of pipelines in one place
|
|
||
| @pytest.mark.parametrize( | ||
| "learner_kind", ["regressor", "regression", "classifier", "classification"] | ||
| "learner_kind", |
There was a problem hiding this comment.
not a big deal at all but in general please try to avoid changes that are unrelated to your Pull Request to make it easier to review and avoid cluttering the git history
| assert isinstance(p.named_steps["tablevectorizer"].low_cardinality, ToCategorical) | ||
|
|
||
|
|
||
| def test_skpipeline_learner(): |
There was a problem hiding this comment.
| def test_skpipeline_learner(): | |
| def test_estimator_is_a_pipeline(): |
| original_learner = LogisticRegression() | ||
| sk_pipeline = Pipeline([("pca", PCA()), ("clf", original_learner)]) | ||
| tab_pipeline = tabular_pipeline(sk_pipeline) | ||
| assert len([element for _, element in tab_pipeline.steps]) == 5 |
There was a problem hiding this comment.
| assert len([element for _, element in tab_pipeline.steps]) == 5 | |
| assert len(tab_pipeline.steps) == 5 |
| sk_pipeline = Pipeline([("pca", PCA()), ("clf", original_learner)]) | ||
| tab_pipeline = tabular_pipeline(sk_pipeline) | ||
| assert len([element for _, element in tab_pipeline.steps]) == 5 | ||
| tv, imputer, scaler, pca, learner = (element for _, element in tab_pipeline.steps) |
There was a problem hiding this comment.
maybe
| tv, imputer, scaler, pca, learner = (element for _, element in tab_pipeline.steps) | |
| tv, imputer, scaler, pca, learner = tab_pipeline.named_steps.values() |
?
There was a problem hiding this comment.
also you can group the steps you don't use like *_, pca, learner = ...
Bug Fix Pull Request
Description
Addresses #1967
Checklist
How Has This Been Tested?
AI Disclosure