Commit e11f713
fix(kubeflow): tail only first and last node logs (#538)
* fix(kubeflow): stream only rank 0 + last rank, write all ranks to disk
KubeflowExecutor.fetch_logs followed every replica and forwarded all ranks to
the caller, so at scale the aggregate output overran CI/runner job-log size
limits (a 16-node x 8-GPU run exceeded GitLab's 128MB cap). Now it still tails
every rank (kubectl logs -l <jobset> --prefix --max-log-requests num_nodes) and
writes the complete multi-rank output to <job_dir>/log-allranks_0.out, but
forwards only global rank 0 (node 0, [default0]) and the last global rank
(node num_nodes-1, [default nproc_per_node-1]) to stdout. Downstream log
validation that globs log*.out still sees every rank via the on-disk file.
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): resolve last pod via completion-index label + full-history streaming
fetch_logs identified the last global rank's pod by parsing the pod name and
tailed only the last `--tail <lines>` window, so on (re)attach the last rank's
mid-run canonical "iteration | lm loss | ..." line (print_rank_last) was
dropped — on K8s the job log showed only rank 0's "Step Time" line.
Resolve the first/last pod from the authoritative
batch.kubernetes.io/job-completion-index label (== torchrun PET_NODE_RANK),
mapped from the --prefix pod name and refreshed on every (re)connect (gang
restarts spawn new pod names), and stream each pod's full history (--tail=-1)
so no mid-run line is missed. All ranks are still written to
log-allranks_0.out; only global rank 0 and the true last global rank are
forwarded to stdout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): forward by global rank (node_rank*nproc+local), not pod heuristic
Kubeflow Trainer sets torchrun PET_NODE_RANK statically from the JobSet
batch.kubernetes.io/job-completion-index, so global_rank = completion_index *
nproc_per_node + local_rank. Compute that explicitly and forward only global
rank 0 and world_size-1 to stdout (all ranks still go to log-allranks_0.out).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): make TrainJob launch idempotent on 409 conflict
When a TrainJob with the target name already exists, launch() raised and aborted. On CI the name is derived from the experiment id (commit SHA), so a 409 is a stale leftover from a prior attempt the launcher declared FAILED after a slow pod start. That blocked setup_experiment's 'attempt N of M' retry — every retry re-collided. Now launch() deletes the stale job (cancel(wait=True)) and recreates, so the retry can actually recover.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): reload kube client across cert rotation for long runs
The kubernetes SDK bakes the client cert into its SSLContext at client
construction and never re-reads it. When credentials come from a rotating
source (Teleport tbot refreshing the cert on disk), a KubeflowExecutor
created once at launch keeps presenting the original cert until it expires
mid-run, so status polls fail with SSLV3_ALERT_CERTIFICATE_EXPIRED once the
run outlives the cert TTL (~60 min). Short jobs finish in time; multi-hour
jobs go blind.
Rebuild the API clients from the on-disk kubeconfig past a refresh interval
(below the cert TTL) via lazy properties, and reactively reload+retry once
in status() on a non-API connection error. fetch_logs already shells out to
kubectl, which re-reads creds per call, so it was unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): scope code_dir per job to avoid concurrent clobber
code_dir was scoped only per user (<pvc>/<username>/code), but package()
rsyncs each job's job_dir into it. Two concurrent jobs from the same user
(e.g. parallel CI test cases) therefore overwrite each other's launcher code
mid-run. Scope it per job (<username>/<experiment_id>/<job_name>/code),
matching how dgxcloud/lepton mirror job_dir into a per-job PVC subdir and how
slurm keys packaging by experiment_id:job_name.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): unique TrainJob name + forward all ranks (deduped)
- TrainJob name is now <basename>-<uuid6> (RFC-1123, <=33 chars) via the new
train_job_basename field, decoupled from the experiment name. The uuid makes
every launch unique, so concurrent/retried jobs never collide on the API
server (the descriptive experiment name is intentionally non-unique).
- fetch_logs now forwards every rank to stdout, de-duplicated: torchrun runs
the same entrypoint on all ranks so startup/config/NCCL lines are identical;
we strip the per-rank [pod/...]/[defaultN] markers and forward each distinct
message once. This stops dropping the per-step loss line and wandb URL, which
Megatron emits from a single layout-dependent rank (neither rank 0 nor last).
The full per-rank stream still goes to log-allranks_0.out untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): stream logs once, not per replica
torchx calls scheduler.log_iter(app_id, role_name, k=...) once per replica
(k = 0..num_nodes-1). The Kubeflow log_iter ignored k and re-ran
fetch_logs — which tails the entire jobset via the jobset-name selector — for
every replica, producing N independent tail streams (each with its own dedup
state) and N-fold-duplicating every console line (prefixed <role>/<k>). At 16
nodes that's 16x the log volume, which also overruns the CI job-log limit on
long runs. Stream only for k == 0; that single tail already covers all ranks
(and writes log-allranks_0.out once).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): forward rank 0 + last rank to stdout (not all-ranks dedup)
Revert the all-ranks sliding-window dedup back to forwarding only global rank 0
(setup/config) and the last global rank (print_rank_last per-step loss), like a
SLURM job log. The last rank is resolved at stream time from each pod's
batch.kubernetes.io/job-completion-index label (== torchrun --node-rank
$PET_NODE_RANK), so global_rank = completion_index * nproc_per_node +
local_rank is deterministic without any topology enforcement. The full per-rank
stream is still captured in log-allranks_0.out. Combined with the per-replica
log_iter guard, this stops the N-fold duplication and yields a clean two-rank
console.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): forward rank-0 + the actual loss-rank slot to stdout
The c10d rendezvous assigns torch ranks by join order, not by JobSet
completion-index, so torch's world_size-1 (print_rank_last's loss line) does
NOT land on the highest completion-index. Verified on a live 16-node job:
the loss prints on completion-index 9 (= num_nodes//2 + 1), local rank
nproc-1 — not index 15. Forward exactly (index 0, local 0) and
(index num_nodes//2 + 1, local nproc-1) so the console shows rank 0 setup +
the per-step loss/throughput. Full per-rank capture remains in
log-allranks_0.out. A deterministic completion-index->rank mapping
(topology/static rank ordering) would let us compute this rather than match
the observed slot.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): robust log streaming across pod/container restarts
fetch_logs ran a single 'kubectl logs -l -f --max-log-requests <num_nodes>'.
That follow only attaches to pods present at start, never re-attaches to a
container that restarts, and --max-log-requests == pod count has no headroom —
so a gang/NCCL-init restart that transiently doubled the matching-pod count
errored the whole command ('maximum allowed concurrency') and silently dropped
pods. Observed: a 16-node job streamed only node-0-0; the loss rank (node-0-9)
never appeared even though it was emitting per-step loss.
- --max-log-requests = max(num_nodes*2, 8): headroom for restart-transient pods.
- Periodically re-attach (threading.Timer terminates the follow every 120s) so
pods that (re)started after the initial attach are picked up.
- Resume reconnects with --since-time (via --timestamps), tracking the max
RFC3339 stamp, so re-attaching never replays already-emitted history; only the
first attach uses --tail=-1. The kubectl timestamp is stripped from each line.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): resolve rank-0/last pods from worker GROUP_RANK, not a heuristic
The console forwarded rank 0 + the loss rank using completion-index with an
empirical 'num_nodes//2+1' slot for world_size-1. That's fragile: the c10d
rendezvous assigns torch ranks by join order, not JobSet completion-index, so
the loss rank lands on an unpredictable pod (observed: completion-index 9 was
actually GROUP_RANK 15 = RANK 63 = world_size-1).
Read the ground truth instead: torchrun exports GROUP_RANK into every worker's
/proc/<pid>/environ, so 'kubectl exec <pod> -- ' reading it tells us exactly
which pod holds GROUP_RANK 0 (RANK 0, local 0) and GROUP_RANK num_nodes-1
(RANK world_size-1, local nproc-1). Resolve the pod->GROUP_RANK map once the
workers exist, cache it, and re-resolve when the rank-0/last pod is no longer
covered (gang restart reshuffles ranks). Until workers come up (empty map),
fall back to the completion-index-0 pod so early setup still streams.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): emit forwarded log lines in timestamp order
'kubectl logs -l ... -f' multiplexes every pod into one stream in ARRIVAL
order, not timestamp order. Because the console forwards two pods (rank 0 and
the last rank), their lines could interleave wrong — e.g. two rank-0 'Step
Time' lines bunching before the last rank's 'iteration N' line, or a step time
landing under the next iteration.
Add a small reorder buffer on the forwarded (yielded) subset only: each line
already carries the kubelet --timestamps value (parsed to epoch via the new
_ts_epoch), so hold lines until they are older than reorder_hold_s (2s) and
emit sorted by timestamp. The window comfortably absorbs cross-node clock skew
+ flush jitter while keeping the console near-live. The buffer is drained in
order after each proc ends (re-attach) — outside finally, since yielding during
generator close is unsafe. The full all-ranks debug file is untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* feat(kubeflow): support pod-template annotations/labels (podTemplateOverrides metadata)
The executor's existing 'annotations' land on the TrainJob object. GKE multi-network
attach (networking.gke.io/interfaces, for GPUDirect-RDMA/gIB) is read off the trainer
POD, not the TrainJob — add pod_annotations (and pod_labels) that flow into
podTemplateOverrides[].metadata, which the Kubeflow Trainer v2 CRD supports.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): resolve rank-0 and last rank before forwarding logs
On first attach the GROUP_RANK pod map is empty until the torchrun workers
finish rendezvous, so _forward_to_stdout fell back to rank-0-only and the
last rank's early per-step loss/throughput lines (replayed via --tail=-1)
were written to log-allranks but never forwarded to stdout — the CI log
silently dropped the beginning of the run until a re-attach ~120s later,
by which point --since-time skips the replayed history.
Poll on the first attach until both rank 0 and the last rank resolve before
forwarding, capped at 600s (then fall back). The wait is gated on a
non-empty pod list, so it is a no-op when pods can't be listed (no kubectl
/ unit tests) and engages only for real runs.
Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix(kubeflow): wait for rank-0/last to resolve, never fall back to completion-index
The first-attach barrier capped the wait at 600s and then forwarded with the
completion-index heuristic, which streams the wrong rank. A job can legitimately
sit Pending (starved for nodes) far longer than 600s, so it would time out and
mis-forward. Drop the timeout/fallback: keep polling while the job is alive and
stop only when it reaches a terminal state. --tail=-1 on first attach replays
history, so waiting loses nothing.
Signed-off-by: oliver könig <okoenig@nvidia.com>
* style(kubeflow): ruff-format kubeflow.py
Signed-off-by: oliver könig <okoenig@nvidia.com>
* test(kubeflow): update stale tests for uuid names, idempotent 409, rank-0/last log forwarding
The base kubeflow rewrite changed behavior the tests still asserted the old way:
TrainJob names are now <base>-<uuid6>; a 409 cancels the stale job and recreates
(idempotent) rather than raising; and fetch_logs writes every rank to
<job_dir>/log-allranks_0.out while forwarding only rank-0 + the last rank to
stdout. Set job_dir, patch status/time.sleep to avoid the retry-loop hang, and
assert the all-ranks file + uuid-suffixed names.
Signed-off-by: oliver könig <okoenig@nvidia.com>
* test(kubeflow): cover GROUP_RANK resolution, log forwarding, client reload
Raises codecov/patch on the diff from 58% to ~98% by exercising the
previously-untested branches: GROUP_RANK resolution via worker environ
(incl. the first-attach resolve barrier), rank-0/last-rank forwarding +
reorder buffer, the completion-index fallback, pod-template labels/
annotations, stale kube-client reload, and the status() connection-error
retry path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
* feat(kubeflow): add copy_to_workspace/copy_from_workspace for arbitrary PVC paths
package()/pull_results() already bridge launcher↔PVC via a throw-away data-mover
pod, but only for the per-job code_dir. Downloading results (or persisting any
auxiliary cross-run state) from another path on the volume had no public API.
Add copy_to_workspace(local, remote) and copy_from_workspace(remote, local) that
run the same data-mover against an arbitrary path under workdir_pvc_path, and
refactor package()/pull_results() to delegate to them (behavior unchanged). Tests
cover the happy path, the no-PVC no-op, and pod teardown on copy error.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>1 parent dfb82c5 commit e11f713
3 files changed
Lines changed: 724 additions & 98 deletions
File tree
- nemo_run
- core/execution
- run/torchx_backend/schedulers
- test/core/execution
0 commit comments