Adding the KFoldQuantileSplitter#474
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
hredestig
left a comment
There was a problem hiding this comment.
Let's think of the api a bit - and also, can you please add some content to the splits notebook as well to demonstrate how kfold-quantile is used
5b0799d to
8e48004
Compare
|
Couple of big changes. First of all I addressed the issue of conforming with split creation api and added a demo to the notebook demonstrating the usage of the quantile splitters. Next, I realized that there was a significant problem with split assignment in both quantile splitters, namely that they reasoned over sequences at the individual assay level, rather than at the aggregated dataset level. This caused a number of downstream inconsistent and breaking behaviors. I described my fix below: Instead of reasoning on the level of per-assay records, both the
The splitters now also raise a One concern worth raising is that the aggregation method used in downstream training jobs must match the default aggregation of |
karel-w
left a comment
There was a problem hiding this comment.
We now allow for empty splits, which will fail downstream when we cannot extract data from the split. I would suggest to fail early and raise errors when we obtain an empty split instead.
This also removes some of the edge cases in testing.
|
Thanks a lot for the extensive review. I addressed most of the comments, most importantly no longer allowing for empty slices to be created when the passed target is not present in the assays. |
karel-w
left a comment
There was a problem hiding this comment.
Lets raise an issue for the validation of QuantileFolds to discuss this point, but in the meanwhile we can move on.
Line 325 is now redundant since we would fail earlier when the slice is empty? Would maybe be a nice test to make sure _count_high_property_variants return correct counts.
|
Thanks, made the final changes. Will raise issue for Quantile validation splits |
Changes
Adding K-fold cross validation support for quantile splitting.
Newly introduced convention for Subsets structure
Since quantile splitting creates test folds that contain samples that never appear in the training data, the structure of the resulting
Subsetsobject is slightly different from theSubsetsobject created by theKFoldSplitter. The former creates a dictionary with a keytrain_foldsstoring all the training sets for each fold, and a keytest_foldsstoring all the test folds.Existing
proteingym-benchmarkmodels do not yet comply with this structure.Checklist