
TrainerStatus Collector: Real-time HPO trial convergence tracking via TrainJob trainerStatus #2637

@abhijeet-dhumal

What you would like to be added?

Add a TrainerStatus metrics collector to enable real-time visibility into HPO trial convergence by watching TrainJob.status.trainerStatus from Kubeflow Trainer's progress tracking feature (KEP-2779).
Context: This is part of a coordinated effort with the Kubeflow Trainer team. The controller-side implementation is in kubeflow/trainer#3227, which adds a status server that receives progress updates from training pods and exposes them through the TrainJob CR status. The HuggingFace integration is tracked in huggingface/transformers#44487. This Katib collector would aggregate progress from all trials, enabling:

  • Real-time progress percentage, ETA, and loss for all trials simultaneously
  • Convergence graphs showing loss curves across trials
  • Enhanced Hyperband early stopping based on real-time convergence (not just final metrics)
  • Zero code changes for users running HuggingFace Trainer on Kubeflow

Proposed experiment status output (illustrative):

kubectl get experiment hpo-llm-finetune -o jsonpath='{.status.trialsProgress}' | jq
# [
#   {"trialName": "trial-1", "progressPercentage": 45, "currentLoss": "0.23", "status": "Running"},
#   {"trialName": "trial-2", "progressPercentage": 60, "currentLoss": "0.15", "status": "Running"},
#   {"trialName": "trial-3", "progressPercentage": 35, "currentLoss": "0.41", "status": "EarlyStopped", "earlyStopReason": "loss > 1.5x best"}
# ]
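
As a rough illustration of how a client could consume this field, the sketch below turns the proposed trialsProgress JSON into one summary line per trial. It assumes the field names shown in the example above (trialName, progressPercentage, currentLoss, status); the function name summarize_trials is hypothetical and not part of Katib.

```python
import json


def summarize_trials(progress_json: str) -> list[str]:
    """Render one human-readable line per trial from the proposed JSON."""
    trials = json.loads(progress_json)
    # Among trials that are still running, the lowest loss is "best so far".
    running = [t for t in trials if t["status"] == "Running"]
    best = min(running, key=lambda t: float(t["currentLoss"]), default=None)

    lines = []
    for t in trials:
        if t["status"] != "Running":
            note = f" [{t['status']}]"
        elif t is best:
            note = " (best so far)"
        else:
            note = ""
        lines.append(
            f"{t['trialName']}: {t['progressPercentage']}% done, "
            f"loss={t['currentLoss']}{note}"
        )
    return lines


example = """[
  {"trialName": "trial-1", "progressPercentage": 45, "currentLoss": "0.23", "status": "Running"},
  {"trialName": "trial-2", "progressPercentage": 60, "currentLoss": "0.15", "status": "Running"},
  {"trialName": "trial-3", "progressPercentage": 35, "currentLoss": "0.41", "status": "EarlyStopped"}
]"""

for line in summarize_trials(example):
    print(line)
```

The loss is kept as a string in the JSON, as in the example output, and is only converted to a float for comparison.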

Why is this needed?

Problem: When running HPO experiments with Katib + TrainJob, practitioners have no real-time visibility into trial convergence. Their only options are to:

  • Wait for trials to complete before seeing any metrics
  • Manually exec into pods to check logs
  • Set up external tracking (MLflow/W&B) per trial, then correlate manually
  • Run Hyperband without convergence-aware early stopping (wastes compute)

Why this matters:

  • HPO experiments can run for hours/days with many parallel trials
  • Without real-time visibility, users can't identify winning hyperparameters early
  • Current early stopping in Katib relies on completed trial metrics, not convergence trends
  • GPU hours are wasted on trials that are clearly diverging but haven't hit a checkpoint yet

User experience improvement:
Before (no visibility):

kubectl get experiment my-hpo
# NAME     STATUS    TRIALS   SUCCEEDED   RUNNING
# my-hpo   Running   5        0           5
# (Which trials are converging? No idea until they complete.)

After (with TrainerStatus collector):

kubectl get experiment my-hpo -o jsonpath='{.status.trialsProgress}'
# trial-1: 45% done, loss=0.23 (converging)
# trial-2: 60% done, loss=0.15 (best so far)
# trial-3: 35% done, loss=0.41 → STOPPED (diverging, loss > 1.5x best)

Convergence-aware Hyperband:
Hyperband currently can only make early stopping decisions at fixed resource checkpoints. With real-time trainerStatus, it could evaluate running trials continuously, using proposed settings such as:

algorithmSettings:
  - name: "earlyStoppingEnabled"
    value: "true"
  - name: "earlyStoppingThreshold"
    value: "1.5"  # Stop if loss > 1.5x best completed trial
  - name: "earlyStoppingMinProgress"
    value: "20"   # Only evaluate after 20% training progress

This saves significant compute by terminating non-converging trials early with clear visibility into why they were stopped.
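
The decision rule implied by these settings can be sketched as follows. This is a minimal illustration, not Katib's implementation: the names EarlyStopSettings and should_stop are hypothetical, and the rule assumes a positive loss where lower is better, since a multiplicative threshold is not meaningful otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EarlyStopSettings:
    threshold: float = 1.5   # mirrors earlyStoppingThreshold
    min_progress: int = 20   # mirrors earlyStoppingMinProgress (percent)


def should_stop(
    loss: float,
    progress: int,
    best_loss: Optional[float],
    settings: EarlyStopSettings,
) -> Optional[str]:
    """Return a stop reason, or None if the trial should keep running."""
    # No reference loss yet, or the trial is too young to judge.
    if best_loss is None or progress < settings.min_progress:
        return None
    if loss > settings.threshold * best_loss:
        return f"loss > {settings.threshold}x best"
    return None


settings = EarlyStopSettings()
print(should_stop(0.41, 35, 0.15, settings))  # 0.41 > 1.5 * 0.15, so a reason is returned
print(should_stop(0.41, 10, 0.15, settings))  # below 20% progress, so None
```

Returning the reason as a string keeps the "why it was stopped" information available for a field like earlyStopReason in the proposed status output.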

Zero friction for users:
No code changes are required when using TrainJob as the trial template with HuggingFace Trainer:

metricsCollectorSpec:
  collector:
    kind: TrainerStatus  # watches TrainJob.status.trainerStatus

trialTemplate:
  trialSpec:
    apiVersion: trainer.kubeflow.org/v1alpha1
    kind: TrainJob
    spec:
      trainer:
        image: huggingface/transformers:latest
        # Progress automatically reported via KubeflowCallback
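
To make the collector's job concrete, here is a small sketch of mapping one TrainJob object (as a plain dict, for example parsed from kubectl get trainjob -o json) to a per-trial progress entry. The field names inside status.trainerStatus (progressPercentage, currentLoss) are assumptions borrowed from the example output in this issue; the actual schema is defined by KEP-2779 and may differ. The function name trial_progress is hypothetical.

```python
from typing import Optional


def trial_progress(trainjob: dict) -> Optional[dict]:
    """Extract a progress entry from a TrainJob dict, if it reports any."""
    # status or trainerStatus may be absent before the first progress update.
    trainer_status = (trainjob.get("status") or {}).get("trainerStatus")
    if not trainer_status:
        return None
    return {
        "trialName": trainjob["metadata"]["name"],
        # Assumed field names; adjust to the real trainerStatus schema.
        "progressPercentage": trainer_status.get("progressPercentage"),
        "currentLoss": trainer_status.get("currentLoss"),
    }


job = {
    "metadata": {"name": "trial-1"},
    "status": {"trainerStatus": {"progressPercentage": 45, "currentLoss": "0.23"}},
}
print(trial_progress(job))
print(trial_progress({"metadata": {"name": "trial-2"}}))  # no status yet
```

A real collector would watch TrainJob objects through the Kubernetes API rather than receive dicts directly; this only shows the mapping step.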

Synergy with existing collectors:
TrainerStatus complements, rather than replaces, the existing collectors (StdOut, File, PrometheusMetric).

Love this feature?

Give it a 👍 We prioritize the features with most 👍
