Adding the KFoldQuantileSplitter by florisvdf · Pull Request #474 · ProteinGym/proteingym-base

florisvdf · 2026-06-09T09:57:45Z

Changes

Adding K-fold cross validation support for quantile splitting.

Newly introduced convention for Subsets structure

Since quantile splitting creates test folds that contain samples that never appear in the training data, the structure of the resulting Subsets object is slightly different from the one created by the KFoldSplitter. The KFoldQuantileSplitter creates a dictionary with a key train_folds storing the training set for each fold, and a key test_folds storing the test folds.

Existing proteingym-benchmark models do not yet comply with this structure.
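To make the convention concrete, here is a minimal sketch of how a consumer might walk such a structure. It is not the library's implementation: plain lists of made-up variant IDs stand in for the real slice objects, `subsets_like` and `iter_folds` are hypothetical names, and pairing train_folds[i] with test_folds[i] is an assumption about how the two lists line up.

```python
# Illustrative only: plain lists of variant IDs stand in for the real slice
# objects, and `subsets_like` is a hypothetical stand-in for a Subsets object.
subsets_like = {
    "train_folds": [["v1", "v2"], ["v1", "v3"], ["v2", "v3"]],
    "test_folds": [["v7"], ["v8"], ["v9"]],
}


def iter_folds(subsets):
    """Yield (fold_index, train_ids, test_ids) from the two parallel lists.

    Assumes train_folds[i] belongs with test_folds[i].
    """
    train_folds = subsets["train_folds"]
    test_folds = subsets["test_folds"]
    if len(train_folds) != len(test_folds):
        raise ValueError("train_folds and test_folds must have the same length")
    for index, (train_ids, test_ids) in enumerate(zip(train_folds, test_folds)):
        yield index, train_ids, test_ids


# Every ID that appears in any training set.
all_train_ids = {vid for fold in subsets_like["train_folds"] for vid in fold}

for index, train_ids, test_ids in iter_folds(subsets_like):
    # Under this convention, test samples never show up in any training set.
    assert all_train_ids.isdisjoint(test_ids)
    print(f"fold {index}: {len(train_ids)} train, {len(test_ids)} test")
```

A model written against the KFoldSplitter layout would need an adapter along these lines before it could consume the new structure.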

Checklist

I broke the PR down so that it contains a reasonable amount of changes for an effective review
I performed a self-review of my code. Amongst other things, I have commented my code in hard-to-understand areas.
I made corresponding changes to the documentation
I added tests that prove my fix is effective or that my feature works
I accounted for dependent changes to be merged and published in downstream modules

codecov · 2026-06-09T10:00:40Z

Codecov Report

❌ Patch coverage is 98.13084% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/proteingym/base/splits.py	98.13%	1 Missing and 1 partial ⚠️


hredestig

Let's think about the API a bit. Also, could you please add some content to the splits notebook to demonstrate how kfold-quantile is used?

florisvdf · 2026-06-15T12:11:40Z

A couple of big changes. First, I addressed the issue of conforming to the split creation API and added a demo to the notebook showing how the quantile splitters are used. Second, I realized that there was a significant problem with split assignment in both quantile splitters: they reasoned over sequences at the individual assay level rather than at the aggregated dataset level. This caused a number of inconsistent and breaking behaviors downstream. I describe my fix below:

Instead of reasoning at the level of per-assay records, both the QuantileSplitter and the KFoldQuantileSplitter now reason about variants at the aggregated level (the data that results from .to_df(target_names=[target])). This fixes two important issues with the splitting behavior:

Reasoning over variants at the individual record level in the dataset yields quantile thresholds and values of top_k that may no longer be valid once the data is aggregated for a given target. This is now fixed by reasoning at the aggregated dataset (result of .to_df) level.
With split assignment at the individual record level, one may encounter the scenario where a sequence is assigned to the upper quantile when its property value exceeds the threshold in assay_i, but also assigned to the lower quantile when it is under the threshold in assay_j, creating conflicting AssaySlice objects. Split assignment now also happens after sequences are first aggregated.
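The second issue can be reproduced with a few toy records. The sketch below is illustrative and does not use the proteingym-base API: the records, the fixed threshold, the function names, and the use of the mean as the aggregation are all assumptions made for the example (the default aggregation of .to_df is not stated in this thread).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sequence, assay, property value) records.
records = [
    ("AAA", "assay_i", 0.9),
    ("AAA", "assay_j", 0.2),
    ("CCC", "assay_i", 0.1),
    ("CCC", "assay_j", 0.3),
]
threshold = 0.5  # assumed fixed threshold, for illustration only


def assign_per_record(records, threshold):
    """Assign each record separately; a sequence can land on both sides."""
    sides = defaultdict(set)
    for sequence, _assay, value in records:
        sides[sequence].add("upper" if value > threshold else "lower")
    return dict(sides)


def assign_after_aggregation(records, threshold, aggregate=mean):
    """Aggregate values per sequence first, then assign exactly one side."""
    values = defaultdict(list)
    for sequence, _assay, value in records:
        values[sequence].append(value)
    return {
        sequence: "upper" if aggregate(vals) > threshold else "lower"
        for sequence, vals in values.items()
    }


per_record = assign_per_record(records, threshold)
aggregated = assign_after_aggregation(records, threshold)

# "AAA" is assigned to both quantiles per record, but only one after aggregation.
print(per_record["AAA"], aggregated["AAA"])
```

Assigning after aggregation guarantees one side per sequence, which is what prevents the conflicting AssaySlice objects described above.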

The splitters now also raise a ValueError when one attempts to use them to split datasets with a target that combines assays with varying assay variables, since property value thresholds cannot be compared across different assay conditions.
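A guard of this kind could look roughly like the following sketch. The function name, the dictionary representation of assay variables, and the error message are hypothetical; the actual check in splits.py may be structured differently.

```python
def check_assay_variables_match(assay_variables):
    """Raise ValueError if the assays do not all share the same variables.

    `assay_variables` maps an assay name to a dict of its assay variables.
    This representation is an assumption made for the example.
    """
    # Turn each dict into a sorted tuple of items so it can be put in a set.
    distinct = {tuple(sorted(variables.items())) for variables in assay_variables.values()}
    if len(distinct) > 1:
        raise ValueError(
            "Target combines assays with varying assay variables; "
            "property value thresholds are not comparable across them."
        )


# Same conditions in both assays: passes silently.
check_assay_variables_match(
    {"assay_i": {"temperature": 37, "pH": 7.0}, "assay_j": {"temperature": 37, "pH": 7.0}}
)

# Different temperatures: rejected before any threshold is computed.
try:
    check_assay_variables_match(
        {"assay_i": {"temperature": 37}, "assay_j": {"temperature": 50}}
    )
except ValueError as error:
    print(f"rejected: {error}")
```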

One concern worth raising is that the aggregation method used in downstream training jobs must match the default aggregation of to_df, otherwise top_k is no longer valid. I am not sure how to validate this, though it seems like a very rare edge case.
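A toy example shows why the aggregation matters. The measurements and the `top_k` helper below are hypothetical, and mean versus max are just two stand-in aggregations; the point is only that ranking the same data under a different aggregation can select different variants.

```python
from statistics import mean

# Hypothetical repeated measurements per sequence.
measurements = {"AAA": [0.9, 0.1], "CCC": [0.6, 0.6]}


def top_k(measurements, k, aggregate):
    """Return the k sequences with the highest aggregated value."""
    scores = {sequence: aggregate(values) for sequence, values in measurements.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


top_by_mean = top_k(measurements, 1, mean)  # ranks by the average measurement
top_by_max = top_k(measurements, 1, max)    # ranks by the best measurement

# The two aggregations disagree on which sequence is the top one.
print(top_by_mean, top_by_max)
```

If a split is built with one aggregation and a training job evaluates with another, the variants counted as "top" at split time may not be the ones the job sees as top.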

karel-w

We now allow for empty splits, which will fail downstream when we cannot extract data from the split. I would suggest failing early and raising an error when we obtain an empty split instead.

This also removes some of the edge cases in testing.
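The fail-early suggestion can be sketched as a small guard run right after split assignment. The helper name and the dictionary-of-lists representation are assumptions for illustration, not the code that was merged.

```python
def require_non_empty(splits):
    """Raise ValueError as soon as any split has no members.

    `splits` maps a split name to a list of members (hypothetical layout).
    """
    empty = [name for name, members in splits.items() if len(members) == 0]
    if empty:
        raise ValueError(f"Empty split(s) produced: {', '.join(empty)}")
    return splits


# A valid result passes through unchanged.
checked = require_non_empty({"train": ["v1", "v2"], "test": ["v3"]})

# An empty test split is reported at creation time instead of downstream.
try:
    require_non_empty({"train": ["v1", "v2"], "test": []})
except ValueError as error:
    print(error)
```

With a guard like this in place, tests no longer need to cover how downstream code behaves when handed an empty split.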

florisvdf · 2026-06-17T09:35:04Z

Thanks a lot for the extensive review. I addressed most of the comments, most importantly no longer allowing for empty slices to be created when the passed target is not present in the assays.

karel-w

Let's raise an issue for the validation of QuantileFolds to discuss this point, but in the meantime we can move on.

Line 325 is now redundant, since we would fail earlier when the slice is empty? It would maybe be a nice test to make sure _count_high_property_variants returns correct counts.

florisvdf · 2026-06-19T13:07:16Z

Thanks, made the final changes. Will raise an issue for Quantile validation splits.

florisvdf requested review from ethane4, hredestig and karel-w June 9, 2026 09:57

hredestig reviewed Jun 11, 2026

Comment thread src/proteingym/base/splits.py Outdated

Floris vanderFlier added 7 commits June 13, 2026 10:32

implemented KFoldQuantileSplitter

561af43

added tests

b5dcfab

linting

750d521

small change to QuantileSplitter passing mask as list instead of np a…

23715fd

…rray

linting

602896b

added test checking quantile value is a fraction

f8cf3fc

quantile splitters now reason at aggregated level, added tests

8e48004

florisvdf force-pushed the KFoldQuantileSplitter branch from 5b0799d to 8e48004 Compare June 15, 2026 10:24

upated notebook demoing quantile splitters

53d7edb

karel-w requested changes Jun 16, 2026

Floris vanderFlier added 2 commits June 16, 2026 17:18

changes requested by karel

ba6e1cf

quantile splitters now raise value errors when target not present in …

8fdd147

…assays

extra test for numeric check

0f9c8a9

florisvdf requested a review from karel-w June 19, 2026 09:50

karel-w approved these changes Jun 19, 2026

Comment thread src/proteingym/base/splits.py

quantile now within open interval and not testing for potential empty…

68c29ad

… test slices

florisvdf requested a review from karel-w June 19, 2026 13:07

karel-w approved these changes Jun 19, 2026

florisvdf merged commit 95d8e2d into main Jun 19, 2026
5 checks passed

florisvdf deleted the KFoldQuantileSplitter branch June 19, 2026 14:29

3 participants