Skip to content

Adding the KFoldQuantileSplitter#474

Merged
florisvdf merged 12 commits into
mainfrom
KFoldQuantileSplitter
Jun 19, 2026
Merged

Adding the KFoldQuantileSplitter#474
florisvdf merged 12 commits into
mainfrom
KFoldQuantileSplitter

Conversation

@florisvdf

@florisvdf florisvdf commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Changes

Adding K-fold cross validation support for quantile splitting.

Newly introduced convention for Subsets structure

Since quantile splitting creates test folds that contain samples that never appear in the training data, the structure of the resulting Subsets object is slightly different from the Subsets object created by the KFoldSplitter. The former creates a dictionary with a key train_folds storing all the training sets for each fold, and a key test_folds storing all the test folds.

Existing proteingym-benchmark models do not yet comply with this structure.

Checklist

  • I broke the PR down so that it contains a reasonable amount of changes for an effective review
  • I performed a self-review of my code. Amongst other things, I have commented my code in hard-to-understand areas.
  • I made corresponding changes to the documentation
  • I added tests that prove my fix is effective or that my feature works
  • I accounted for dependent changes to be merged and published in downstream modules

@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 98.13084% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/proteingym/base/splits.py 98.13% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@hredestig hredestig left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's think of the api a bit - and also, can you please add some content to the splits notebook as well to demonstrate how kfold-quantile is used

Comment thread src/proteingym/base/splits.py Outdated
@florisvdf florisvdf force-pushed the KFoldQuantileSplitter branch from 5b0799d to 8e48004 Compare June 15, 2026 10:24
@florisvdf

Copy link
Copy Markdown
Contributor Author

Couple of big changes. First of all I addressed the issue of conforming with split creation api and added a demo to the notebook demonstrating the usage of the quantile splitters. Next, I realized that there was a significant problem with split assignment in both quantile splitters, namely that they reasoned over sequences at the individual assay level, rather than at the aggregated dataset level. This caused a number of downstream inconsistent and breaking behaviors. I described my fix below:

Instead of reasoning on the level of per-assay records, both the QuantileSplitter and the KFoldQuantileSplitter reason about variants at the aggregated level (data that results from .to_df(target_names=[target]). This fixes two important issues with the splitting behavior:

  1. Reasoning over variants at the individual record level in the dataset yields quantile thresholds and values of top_k that may no longer be valid once the data is aggregated for a given target. This is now fixed by reasoning at the aggregated dataset (result of .to_df) level.
  2. With split assignment at the individual record level, one may encounter the scenario where a sequence is assigned to the upper quantile when its property value exceeds the threshold in assay_i, but also assigned to the lower quantile when it is under the threshold in assay_j, creating conflicting AssaySlice objects. Split assignment now also happens after sequences are first aggregated.

The splitters now also raise a ValueError when one attempts to use them to split datasets with a target that combines assays with varying assay variables, since property value thresholds can not be compared across different assay conditions.

One concern worth raising is that the aggregation method used in downstream training jobs must match the default aggregation of to_df , otherwise top_k is no longer valid. Not sure how to validate this, though seems like an very rare edge case.

@karel-w karel-w left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now allow for empty splits, which will fail downstream when we cannot extract data from the split. I would suggest to fail early and raise errors when we obtain an empty split instead.

This also removes some of the edge cases in testing.

Comment thread docs/notebooks/05_Creating_Dataset_Splits.ipynb Outdated
Comment thread docs/notebooks/05_Creating_Dataset_Splits.ipynb Outdated
Comment thread docs/notebooks/NEIME_2019.splits.pgdata Outdated
Comment thread src/proteingym/base/splits.py
Comment thread src/proteingym/base/splits.py
Comment thread tests/test_splits.py Outdated
Comment thread tests/test_splits.py Outdated
Comment thread tests/test_splits.py Outdated
Comment thread tests/test_splits.py
Comment thread tests/test_splits.py
@florisvdf

Copy link
Copy Markdown
Contributor Author

Thanks a lot for the extensive review. I addressed most of the comments, most importantly no longer allowing for empty slices to be created when the passed target is not present in the assays.

@florisvdf florisvdf requested a review from karel-w June 19, 2026 09:50

@karel-w karel-w left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets raise an issue for the validation of QuantileFolds to discuss this point, but in the meanwhile we can move on.

Line 325 is now redundant since we would fail earlier when the slice is empty? Would maybe be a nice test to make sure _count_high_property_variants return correct counts.

Comment thread src/proteingym/base/splits.py
@florisvdf

Copy link
Copy Markdown
Contributor Author

Thanks, made the final changes. Will raise issue for Quantile validation splits

@florisvdf florisvdf requested a review from karel-w June 19, 2026 13:07
@florisvdf florisvdf merged commit 95d8e2d into main Jun 19, 2026
5 checks passed
@florisvdf florisvdf deleted the KFoldQuantileSplitter branch June 19, 2026 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants