What you would like to be added?
Add a TrainerStatus metrics collector to enable real-time visibility into HPO trial convergence by watching TrainJob.status.trainerStatus from Kubeflow Trainer's progress tracking feature (KEP-2779).
Context: This is part of a coordinated effort with the Kubeflow Trainer team. The controller-side implementation is available in kubeflow/trainer#3227, which adds a status server that receives progress updates from training pods and exposes them via the TrainJob CR status. The HuggingFace integration is tracked in huggingface/transformers#44487. This Katib collector would aggregate progress from all trials, enabling:
- Real-time progress percentage, ETA, and loss for all trials simultaneously
- Convergence graphs showing loss curves across trials
- Enhanced Hyperband early stopping based on real-time convergence (not just final metrics)
- Zero code changes for users running HuggingFace Trainer on Kubeflow
kubectl get experiment hpo-llm-finetune -o jsonpath='{.status.trialsProgress}' | jq
# [
# {"trialName": "trial-1", "progressPercentage": 45, "currentLoss": "0.23", "status": "Running"},
# {"trialName": "trial-2", "progressPercentage": 60, "currentLoss": "0.15", "status": "Running"},
# {"trialName": "trial-3", "progressPercentage": 35, "currentLoss": "0.41", "status": "EarlyStopped", "earlyStopReason": "loss > 1.5x best"}
# ]
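For reference, the per-trial source the collector would watch is the trainerStatus block that KEP-2779 adds to each TrainJob's status. The sketch below is illustrative only; the field names are assumed to mirror the aggregated output above and may differ from the final Trainer API:

# Illustrative sketch: field names are assumed and may differ from the final KEP-2779 API
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: trial-1
status:
  trainerStatus:
    progressPercentage: 45         # fraction of total training steps completed
    currentLoss: "0.23"            # most recent training loss reported by the pod
    estimatedRemainingTime: "32m"  # ETA surfaced by the progress tracking feature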
Why is this needed?
Problem: When running HPO experiments with Katib + TrainJob, practitioners have no real-time visibility into trial convergence. Their only options are to:
- Wait for trials to complete before seeing any metrics
- Manually exec into pods to check logs
- Set up external tracking (MLflow/W&B) per trial, then correlate manually
- Run Hyperband without convergence-aware early stopping (wastes compute)
Why this matters:
- HPO experiments can run for hours/days with many parallel trials
- Without real-time visibility, users can't identify winning hyperparameters early
- Current early stopping in Katib relies on completed trial metrics, not convergence trends
- GPU hours are wasted on trials that are clearly diverging but haven't hit a checkpoint yet
User experience improvement:
Before (no visibility):
kubectl get experiment my-hpo
# NAME STATUS TRIALS SUCCEEDED RUNNING
# my-hpo Running 5 0 5
# (Which trials are converging? No idea until they complete.)
After (with TrainerStatus collector):
kubectl get experiment my-hpo -o jsonpath='{.status.trialsProgress}'
# trial-1: 45% done, loss=0.23 (converging)
# trial-2: 60% done, loss=0.15 (best so far)
# trial-3: 35% done, loss=0.41 → STOPPED (diverging, loss > 1.5x best)
Convergence-aware Hyperband:
Current Hyperband can only make early stopping decisions at fixed resource checkpoints. With real-time trainerStatus, Hyperband could make convergence-aware decisions continuously, configured for example with settings like:
algorithmSettings:
  - name: "earlyStoppingEnabled"
    value: "true"
  - name: "earlyStoppingThreshold"
    value: "1.5"   # Stop if loss > 1.5x best completed trial
  - name: "earlyStoppingMinProgress"
    value: "20"    # Only evaluate after 20% training progress
This saves significant compute by terminating non-converging trials early with clear visibility into why they were stopped.
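For context, a minimal sketch of where these settings would sit in an Experiment spec; spec.algorithm and algorithmName: hyperband are existing Katib API, while the three earlyStopping* settings are the additions proposed here:

spec:
  algorithm:
    algorithmName: hyperband
    algorithmSettings:
      - name: "earlyStoppingEnabled"     # proposed setting
        value: "true"
      - name: "earlyStoppingThreshold"   # proposed setting
        value: "1.5"
      - name: "earlyStoppingMinProgress" # proposed setting
        value: "20"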
Zero friction for users:
No code changes are required. When TrainJob is used as the trial template with the HuggingFace Trainer:
metricsCollectorSpec:
  collector:
    kind: TrainerStatus # watches TrainJob.status.trainerStatus
trialTemplate:
  trialSpec:
    apiVersion: trainer.kubeflow.org/v1alpha1
    kind: TrainJob
    spec:
      trainer:
        image: huggingface/transformers:latest
        # Progress automatically reported via KubeflowCallback
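Putting it together, a hypothetical end-to-end Experiment might look like the sketch below. The objective metric name, the learning_rate parameter, and the args wiring are placeholders for illustration; the TrainerStatus collector kind is the proposal in this issue:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hpo-llm-finetune
spec:
  objective:
    type: minimize
    objectiveMetricName: loss          # assumed to map to the loss reported in trainerStatus
  algorithm:
    algorithmName: hyperband
  metricsCollectorSpec:
    collector:
      kind: TrainerStatus              # proposed collector, watches TrainJob.status.trainerStatus
  parameters:
    - name: learning_rate              # placeholder hyperparameter for illustration
      parameterType: double
      feasibleSpace:
        min: "0.00001"
        max: "0.001"
  trialTemplate:
    trialParameters:
      - name: learningRate
        description: Learning rate for the trainer
        reference: learning_rate
    trialSpec:
      apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      spec:
        trainer:
          image: huggingface/transformers:latest
          args:
            - "--learning_rate=${trialParameters.learningRate}"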
Synergy with existing collectors:
This complements (does not replace) the existing collectors (StdOut, File, PrometheusMetric, etc.); TrainerStatus would simply be an additional collector kind for TrainJob-based trials.
Love this feature?
Give it a 👍 We prioritize the features with most 👍