Skip to content

feat: take pipeline's estimator into account to decide tabular vectorizer options in TabularPipeline#2152

Open
khaoulariad wants to merge 9 commits into
skrub-data:mainfrom
khaoulariad:feature_add_skpipeline_to_tabular_pipeline
Open

feat: take pipeline's estimator into account to decide tabular vectorizer options in TabularPipeline#2152
khaoulariad wants to merge 9 commits into
skrub-data:mainfrom
khaoulariad:feature_add_skpipeline_to_tabular_pipeline

Conversation

@khaoulariad

@khaoulariad khaoulariad commented Jun 10, 2026

Copy link
Copy Markdown

Bug Fix Pull Request

Description

Addresses #1967

Checklist

  • I have read the contributing guidelines
  • I have added tests that verify the bug fix
  • I have added an entry to CHANGES.rst describing the fix
  • My code follows the code style of this project
  • I have checked my code and corrected any misspellings

How Has This Been Tested?

AI Disclosure

  • This PR contains AI-generated code
    • I have tested the code generated in my PR
    • I have read and understood every line that has been generated by the AI agent
    • I can explain what the AI-generated code does

Comment thread skrub/tests/test_tabular_pipeline.py Outdated
Comment thread skrub/_tabular_pipeline.py Outdated
@MarieSacksick MarieSacksick added the CFM sprint June 2026 For PRs opened during the CFM sprint in June 2026 label Jun 10, 2026
Comment thread CHANGES.rst Outdated
Comment thread skrub/_tabular_pipeline.py Outdated
Comment thread skrub/tests/test_tabular_pipeline.py Outdated
Comment thread skrub/tests/test_tabular_pipeline.py Outdated
Comment thread skrub/tests/test_tabular_pipeline.py Outdated
@MarieSacksick

Copy link
Copy Markdown
Contributor

Can you add the link to this issue that will be closed by this PR please?

Comment thread skrub/_tabular_pipeline.py Outdated
correcting doc

Co-authored-by: Marie Sacksick <79304610+MarieSacksick@users.noreply.github.com>
Comment thread CHANGES.rst Outdated
Co-authored-by: Marie Sacksick <79304610+MarieSacksick@users.noreply.github.com>
Comment thread skrub/tests/test_tabular_pipeline.py Outdated
@MarieSacksick MarieSacksick changed the title add skpipeline to tabular pipeline feat: take estimator in pipeline when given to TabularPipeline into account to decide tabular vectorizer options Jun 10, 2026
@MarieSacksick MarieSacksick changed the title feat: take estimator in pipeline when given to TabularPipeline into account to decide tabular vectorizer options feat: take pipeline's estimator into account to decide tabular vectorizer options in TabularPipeline Jun 10, 2026
Co-authored-by: Marie Sacksick <79304610+MarieSacksick@users.noreply.github.com>

@MarieSacksick MarieSacksick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for me, thanks :)!
and congratulations!!

waiting for someone else review.

@jeromedockes jeromedockes left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you very much @khaoulariad ! just a few comments :)

step;
- a scikit-learn estimator: the provided estimator is used as the final step.
- a scikit-learn pipeline : the whole pipeline is kept and usual pre-processing by the TableReport
is added on top, depending on the estimator in the last step of the pipeline.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is added on top, depending on the estimator in the last step of the pipeline.
is added before, depending on the estimator in the last step of the pipeline.

""" # noqa: E501
vectorizer = TableVectorizer(n_jobs=n_jobs)
cat_feat_kwargs = {"categorical_features": "from_dtype"}
if isinstance(estimator, Pipeline):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should come after the checks below -- it is unlikely someone will put the string "regressor" or a class at the end of a pipeline before passing it to tabular_pipeline

also please avoid local variable names that differ by a single character like estimator and estimator_. here we can have something like

if isinstance(estimator, Pipeline):
    *user_transformers, estimator = estimator.steps
else:
    user_transformers = ()
...
make_pipeline(TableVectorizer(), *user_transformers, estimator)

if not isinstance(estimator_, _TREE_ENSEMBLE_CLASSES):
steps.append(SquashingScaler(max_absolute_value=5))
steps.append(estimator)
if isinstance(estimator, Pipeline):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the suggestion above we can do all the handling of pipelines in one place


@pytest.mark.parametrize(
"learner_kind", ["regressor", "regression", "classifier", "classification"]
"learner_kind",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big deal at all but in general please try to avoid changes that are unrelated to your Pull Request to make it easier to review and avoid cluttering the git history

assert isinstance(p.named_steps["tablevectorizer"].low_cardinality, ToCategorical)


def test_skpipeline_learner():

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_skpipeline_learner():
def test_estimator_is_a_pipeline():

original_learner = LogisticRegression()
sk_pipeline = Pipeline([("pca", PCA()), ("clf", original_learner)])
tab_pipeline = tabular_pipeline(sk_pipeline)
assert len([element for _, element in tab_pipeline.steps]) == 5

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert len([element for _, element in tab_pipeline.steps]) == 5
assert len(tab_pipeline.steps) == 5

sk_pipeline = Pipeline([("pca", PCA()), ("clf", original_learner)])
tab_pipeline = tabular_pipeline(sk_pipeline)
assert len([element for _, element in tab_pipeline.steps]) == 5
tv, imputer, scaler, pca, learner = (element for _, element in tab_pipeline.steps)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe

Suggested change
tv, imputer, scaler, pca, learner = (element for _, element in tab_pipeline.steps)
tv, imputer, scaler, pca, learner = tab_pipeline.named_steps.values()

?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also you can group the steps you don't use like *_, pca, learner = ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CFM sprint June 2026 For PRs opened during the CFM sprint in June 2026

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants