feat: take pipeline's estimator into account to decide tabular vectorizer options in TabularPipeline by khaoulariad · Pull Request #2152 · skrub-data/skrub

khaoulariad · 2026-06-10T09:32:47Z

Bug Fix Pull Request

Description

Addresses #1967

Checklist

I have read the contributing guidelines
I have added tests that verify the bug fix
I have added an entry to CHANGES.rst describing the fix
My code follows the code style of this project
I have checked my code and corrected any misspellings

How Has This Been Tested?

AI Disclosure

This PR contains AI-generated code
- I have tested the code generated in my PR
- I have read and understood every line that has been generated by the AI agent
- I can explain what the AI-generated code does

MarieSacksick · 2026-06-10T12:44:01Z

Can you please add a link to the issue that this PR will close?

MarieSacksick

Looks good to me, thanks :)!
And congratulations!!

Waiting for someone else's review.

jeromedockes

thank you very much @khaoulariad ! just a few comments :)

jeromedockes · 2026-06-11T07:26:25Z

          step;
        - a scikit-learn estimator: the provided estimator is used as the final step.
+        - a scikit-learn pipeline : the whole pipeline is kept and usual pre-processing by the TableReport
+          is added on top, depending on the estimator in the last step of the pipeline.


Suggested change

is added on top, depending on the estimator in the last step of the pipeline.

is added before, depending on the estimator in the last step of the pipeline.

jeromedockes · 2026-06-11T07:31:57Z

    """  # noqa: E501
    vectorizer = TableVectorizer(n_jobs=n_jobs)
    cat_feat_kwargs = {"categorical_features": "from_dtype"}
+    if isinstance(estimator, Pipeline):


I think this should come after the checks below -- it is unlikely someone will put the string "regressor" or a class at the end of a pipeline before passing it to tabular_pipeline

also please avoid local variable names that differ by a single character like estimator and estimator_. here we can have something like

    if isinstance(estimator, Pipeline):
        *user_transformers, estimator = estimator.steps
    else:
        user_transformers = ()
    ...
    make_pipeline(TableVectorizer(), *user_transformers, estimator)
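A minimal sketch of the unpacking suggested in this comment, using only scikit-learn. One detail to watch: `Pipeline.steps` is a list of `(name, estimator)` pairs, so the names are dropped before the steps are handed to `make_pipeline`. The helper name `split_final_estimator` is hypothetical, and `StandardScaler` is only a stand-in for the preprocessing that `tabular_pipeline` would actually add.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler


def split_final_estimator(estimator):
    """Return (user_transformers, final_estimator).

    Hypothetical helper: works for a Pipeline or for a bare estimator.
    """
    if isinstance(estimator, Pipeline):
        # Pipeline.steps holds (name, estimator) pairs; keep only the estimators.
        *user_transformers, final_estimator = (step for _, step in estimator.steps)
        return user_transformers, final_estimator
    return [], estimator


user_pipeline = Pipeline([("pca", PCA()), ("clf", LogisticRegression())])
user_transformers, final_estimator = split_final_estimator(user_pipeline)

# StandardScaler is an assumption standing in for the real preprocessing steps.
combined = make_pipeline(StandardScaler(), *user_transformers, final_estimator)
```

Note that `make_pipeline` regenerates step names from the class names, so the user's `clf` step becomes `logisticregression` here. Keeping the original names would require building the `Pipeline` from `(name, estimator)` pairs instead.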

jeromedockes · 2026-06-11T07:35:08Z

+    if not isinstance(estimator_, _TREE_ENSEMBLE_CLASSES):
        steps.append(SquashingScaler(max_absolute_value=5))
-    steps.append(estimator)
+    if isinstance(estimator, Pipeline):


with the suggestion above we can do all the handling of pipelines in one place

jeromedockes · 2026-06-11T07:36:39Z


 @pytest.mark.parametrize(
-    "learner_kind", ["regressor", "regression", "classifier", "classification"]
+    "learner_kind",


not a big deal at all but in general please try to avoid changes that are unrelated to your Pull Request to make it easier to review and avoid cluttering the git history

jeromedockes · 2026-06-11T07:37:12Z

    assert isinstance(p.named_steps["tablevectorizer"].low_cardinality, ToCategorical)
+
+
+def test_skpipeline_learner():


Suggested change

def test_skpipeline_learner():

def test_estimator_is_a_pipeline():

jeromedockes · 2026-06-11T07:37:55Z

+    original_learner = LogisticRegression()
+    sk_pipeline = Pipeline([("pca", PCA()), ("clf", original_learner)])
+    tab_pipeline = tabular_pipeline(sk_pipeline)
+    assert len([element for _, element in tab_pipeline.steps]) == 5


Suggested change

assert len([element for _, element in tab_pipeline.steps]) == 5

assert len(tab_pipeline.steps) == 5

jeromedockes · 2026-06-11T07:51:39Z

+    sk_pipeline = Pipeline([("pca", PCA()), ("clf", original_learner)])
+    tab_pipeline = tabular_pipeline(sk_pipeline)
+    assert len([element for _, element in tab_pipeline.steps]) == 5
+    tv, imputer, scaler, pca, learner = (element for _, element in tab_pipeline.steps)


maybe

Suggested change

tv, imputer, scaler, pca, learner = (element for _, element in tab_pipeline.steps)

tv, imputer, scaler, pca, learner = tab_pipeline.named_steps.values()

?

also you can group the steps you don't use like *_, pca, learner = ...

add skpipeline to tabular pipeline

de8bd3e

GaelVaroquaux reviewed Jun 10, 2026

Comment thread skrub/tests/test_tabular_pipeline.py Outdated

GaelVaroquaux requested changes Jun 10, 2026

Comment thread skrub/_tabular_pipeline.py Outdated

MarieSacksick added the "CFM sprint June 2026" label (for PRs opened during the CFM sprint in June 2026) Jun 10, 2026

Khaoula Riad added 2 commits June 10, 2026 14:14

feat : add usual parameter if skpipeline is passed

2961a8a

add pull request number to changes.rst

e5406db

MarieSacksick requested changes Jun 10, 2026

Comment thread CHANGES.rst Outdated

Comment thread skrub/_tabular_pipeline.py Outdated

Comment thread skrub/tests/test_tabular_pipeline.py Outdated

Comment thread skrub/tests/test_tabular_pipeline.py Outdated

Comment thread skrub/tests/test_tabular_pipeline.py Outdated

correct doc

37fc386

Khaoula Riad added 2 commits June 10, 2026 14:59

proper PR number and comments taken into account

26ad5e6

correct accessing piepline

30a9019

MarieSacksick reviewed Jun 10, 2026

Comment thread skrub/_tabular_pipeline.py Outdated

Update skrub/_tabular_pipeline.py

0a78b9f

correcting doc Co-authored-by: Marie Sacksick <79304610+MarieSacksick@users.noreply.github.com>

MarieSacksick reviewed Jun 10, 2026

Comment thread CHANGES.rst Outdated

Update CHANGES.rst

3f24e2b

Co-authored-by: Marie Sacksick <79304610+MarieSacksick@users.noreply.github.com>

MarieSacksick reviewed Jun 10, 2026

Comment thread skrub/tests/test_tabular_pipeline.py Outdated

MarieSacksick changed the title ~~add skpipeline to tabular pipeline~~ feat: take estimator in pipeline when given to TabularPipeline into account to decide tabular vectorizer options Jun 10, 2026

MarieSacksick changed the title ~~feat: take estimator in pipeline when given to TabularPipeline into account to decide tabular vectorizer options~~ feat: take pipeline's estimator into account to decide tabular vectorizer options in TabularPipeline Jun 10, 2026

Update skrub/tests/test_tabular_pipeline.py

5dc4fa5

Co-authored-by: Marie Sacksick <79304610+MarieSacksick@users.noreply.github.com>

MarieSacksick requested a review from GaelVaroquaux June 10, 2026 19:55

MarieSacksick approved these changes Jun 10, 2026

jeromedockes reviewed Jun 11, 2026

4 participants
