TrainerStatus Collector: Real-time HPO trial convergence tracking via TrainJob trainerStatus

### What you would like to be added?

Add a TrainerStatus metrics collector that watches `TrainJob.status.trainerStatus`, which is populated by Kubeflow Trainer's progress tracking feature (KEP-2779), to give real-time visibility into HPO trial convergence.

**Context:** This is part of a coordinated effort with the Kubeflow Trainer team. The controller-side implementation is available in kubeflow/trainer#3227, which adds a status server that receives progress updates from training pods and exposes them via the TrainJob CR status. The HuggingFace integration is tracked in huggingface/transformers#44487.

This Katib collector would aggregate progress from all trials, enabling:
- Real-time progress percentage, ETA, and loss for all trials simultaneously
- Convergence graphs showing loss curves across trials
- Enhanced Hyperband early stopping based on real-time convergence (not just final metrics)
- Zero code changes for users running HuggingFace Trainer on Kubeflow
```
kubectl get experiment hpo-llm-finetune -o jsonpath='{.status.trialsProgress}' | jq
# [
#   {"trialName": "trial-1", "progressPercentage": 45, "currentLoss": "0.23", "status": "Running"},
#   {"trialName": "trial-2", "progressPercentage": 60, "currentLoss": "0.15", "status": "Running"},
#   {"trialName": "trial-3", "progressPercentage": 35, "currentLoss": "0.41", "status": "EarlyStopped", "earlyStopReason": "loss > 1.5x best"}
# ]
```
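As a rough illustration of the aggregation step, the sketch below builds `trialsProgress`-style entries from TrainJob objects represented as plain dictionaries. The layout of `trainerStatus` (a `progressPercentage` field and a `metrics` list of name/value pairs), the `Pending` status for trials that have not reported yet, and the function name `build_trials_progress` are assumptions made for illustration; they are not the API defined by KEP-2779 or by Katib.

```python
from typing import Any, Dict, List


def build_trials_progress(trainjobs: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn {trial name: TrainJob dict} into a list of progress entries.

    Hypothetical sketch: the trainerStatus field names used here are assumed.
    """
    progress = []
    for trial_name, trainjob in sorted(trainjobs.items()):
        trainer_status = (trainjob.get("status") or {}).get("trainerStatus")
        if not trainer_status:
            # The training pod has not reported any progress yet.
            progress.append({"trialName": trial_name, "status": "Pending"})
            continue

        # Flatten the assumed [{"name": ..., "value": ...}] list into a dict.
        metrics = {m["name"]: m["value"] for m in trainer_status.get("metrics", [])}
        entry = {
            "trialName": trial_name,
            "progressPercentage": trainer_status.get("progressPercentage"),
            "status": "Running",
        }
        if "loss" in metrics:
            entry["currentLoss"] = metrics["loss"]
        progress.append(entry)
    return progress
```

A real collector would read these objects from the Kubernetes API and write the result to the Experiment status; the dictionary input here only keeps the sketch self-contained.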


### Why is this needed?

**Problem:** When running HPO experiments with Katib + TrainJob, practitioners have no real-time visibility into trial convergence. They must either:
- Wait for trials to complete before seeing any metrics
- Manually exec into pods to check logs
- Set up external tracking (MLflow/W&B) per trial, then correlate manually
- Run Hyperband without convergence-aware early stopping (wastes compute)

**Why this matters:**
- HPO experiments can run for hours/days with many parallel trials
- Without real-time visibility, users can't identify winning hyperparameters early
- Current early stopping in Katib relies on completed trial metrics, not convergence trends
- GPU hours are wasted on trials that are clearly diverging but haven't hit a checkpoint yet

**User experience improvement:**
Before (no visibility):
```
kubectl get experiment my-hpo
# NAME     STATUS    TRIALS   SUCCEEDED   RUNNING
# my-hpo   Running   5        0           5
# (Which trials are converging? No idea until they complete.)
```

After (with TrainerStatus collector):
```
kubectl get experiment my-hpo -o jsonpath='{.status.trialsProgress}'
# trial-1: 45% done, loss=0.23 (converging)
# trial-2: 60% done, loss=0.15 (best so far)
# trial-3: 35% done, loss=0.41 → STOPPED (diverging, loss > 1.5x best)
```

**Convergence-aware Hyperband:**
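Current Hyperband can make early stopping decisions only at fixed resource checkpoints. With real-time trainerStatus, Hyperband could also stop trials based on live convergence, configured through proposed algorithm settings such as: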
```
algorithmSettings:
  - name: "earlyStoppingEnabled"
    value: "true"
  - name: "earlyStoppingThreshold"
    value: "1.5"  # Stop if loss > 1.5x best completed trial
  - name: "earlyStoppingMinProgress"
    value: "20"   # Only evaluate after 20% training progress
```
This saves significant compute by terminating non-converging trials early with clear visibility into why they were stopped.
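The stopping rule implied by these settings can be sketched as a small predicate. This is a minimal illustration, assuming a positive loss that is being minimized; the function name `should_stop_early` and its parameters are hypothetical, and `best_loss` stands for the loss of the best completed trial, which the caller is assumed to supply.

```python
from typing import Optional


def should_stop_early(
    current_loss: float,
    progress_percentage: float,
    best_loss: Optional[float],
    threshold: float = 1.5,      # mirrors earlyStoppingThreshold
    min_progress: float = 20.0,  # mirrors earlyStoppingMinProgress
) -> bool:
    """Return True if a running trial should be stopped (hypothetical rule)."""
    if best_loss is None:
        # No completed trial yet, so there is nothing to compare against.
        return False
    if progress_percentage < min_progress:
        # Too early in training to judge convergence.
        return False
    # Stop when the current loss exceeds the best loss by the given factor.
    return current_loss > threshold * best_loss
```

For instance, with a best completed loss of 0.2, a trial at 35% progress and loss 0.41 would be stopped, while a trial at 45% progress and loss 0.23 would keep running.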

**Zero friction for users:**
No code changes are required when TrainJob is used as the trial template with HuggingFace Trainer:
```
metricsCollectorSpec:
  collector:
    kind: TrainerStatus  # watches TrainJob.status.trainerStatus

trialTemplate:
  trialSpec:
    apiVersion: trainer.kubeflow.org/v1alpha1
    kind: TrainJob
    spec:
      trainer:
        image: huggingface/transformers:latest
        # Progress automatically reported via KubeflowCallback
```

**Synergy with existing collectors:**
This complements (rather than replaces) the existing collectors, such as StdOut, File, and PrometheusMetric.


### Love this feature?

Give it a 👍 We prioritize the features with most 👍

TrainerStatus Collector: Real-time HPO trial convergence tracking via TrainJob trainerStatus #2637
