Skip to content

Commit e11f713

Browse files
ko3n1gclaude
andauthored
fix(kubeflow): tail only first and last node logs (#538)
* fix(kubeflow): stream only rank 0 + last rank, write all ranks to disk KubeflowExecutor.fetch_logs followed every replica and forwarded all ranks to the caller, so at scale the aggregate output overran CI/runner job-log size limits (a 16-node x 8-GPU run exceeded GitLab's 128MB cap). Now it still tails every rank (kubectl logs -l <jobset> --prefix --max-log-requests num_nodes) and writes the complete multi-rank output to <job_dir>/log-allranks_0.out, but forwards only global rank 0 (node 0, [default0]) and the last global rank (node num_nodes-1, [default nproc_per_node-1]) to stdout. Downstream log validation that globs log*.out still sees every rank via the on-disk file. Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): resolve last pod via completion-index label + full-history streaming fetch_logs identified the last global rank's pod by parsing the pod name and tailed only the last `--tail <lines>` window, so on (re)attach the last rank's mid-run canonical "iteration | lm loss | ..." line (print_rank_last) was dropped — on K8s the job log showed only rank 0's "Step Time" line. Resolve the first/last pod from the authoritative batch.kubernetes.io/job-completion-index label (== torchrun PET_NODE_RANK), mapped from the --prefix pod name and refreshed on every (re)connect (gang restarts spawn new pod names), and stream each pod's full history (--tail=-1) so no mid-run line is missed. All ranks are still written to log-allranks_0.out; only global rank 0 and the true last global rank are forwarded to stdout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): forward by global rank (node_rank*nproc+local), not pod heuristic Kubeflow Trainer sets torchrun PET_NODE_RANK statically from the JobSet batch.kubernetes.io/job-completion-index, so global_rank = completion_index * nproc_per_node + local_rank. Compute that explicitly and forward only global rank 0 and world_size-1 to stdout (all ranks still go to log-allranks_0.out). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): make TrainJob launch idempotent on 409 conflict When a TrainJob with the target name already exists, launch() raised and aborted. On CI the name is derived from the experiment id (commit SHA), so a 409 is a stale leftover from a prior attempt the launcher declared FAILED after a slow pod start. That blocked setup_experiment's 'attempt N of M' retry — every retry re-collided. Now launch() deletes the stale job (cancel(wait=True)) and recreates, so the retry can actually recover. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): reload kube client across cert rotation for long runs The kubernetes SDK bakes the client cert into its SSLContext at client construction and never re-reads it. When credentials come from a rotating source (Teleport tbot refreshing the cert on disk), a KubeflowExecutor created once at launch keeps presenting the original cert until it expires mid-run, so status polls fail with SSLV3_ALERT_CERTIFICATE_EXPIRED once the run outlives the cert TTL (~60 min). Short jobs finish in time; multi-hour jobs go blind. Rebuild the API clients from the on-disk kubeconfig past a refresh interval (below the cert TTL) via lazy properties, and reactively reload+retry once in status() on a non-API connection error. fetch_logs already shells out to kubectl, which re-reads creds per call, so it was unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): scope code_dir per job to avoid concurrent clobber code_dir was scoped only per user (<pvc>/<username>/code), but package() rsyncs each job's job_dir into it. Two concurrent jobs from the same user (e.g. parallel CI test cases) therefore overwrite each other's launcher code mid-run. Scope it per job (<username>/<experiment_id>/<job_name>/code), matching how dgxcloud/lepton mirror job_dir into a per-job PVC subdir and how slurm keys packaging by experiment_id:job_name. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): unique TrainJob name + forward all ranks (deduped) - TrainJob name is now <basename>-<uuid6> (RFC-1123, <=33 chars) via the new train_job_basename field, decoupled from the experiment name. The uuid makes every launch unique, so concurrent/retried jobs never collide on the API server (the descriptive experiment name is intentionally non-unique). - fetch_logs now forwards every rank to stdout, de-duplicated: torchrun runs the same entrypoint on all ranks so startup/config/NCCL lines are identical; we strip the per-rank [pod/...]/[defaultN] markers and forward each distinct message once. This stops dropping the per-step loss line and wandb URL, which Megatron emits from a single layout-dependent rank (neither rank 0 nor last). The full per-rank stream still goes to log-allranks_0.out untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): stream logs once, not per replica torchx calls scheduler.log_iter(app_id, role_name, k=...) once per replica (k = 0..num_nodes-1). The Kubeflow log_iter ignored k and re-ran fetch_logs — which tails the entire jobset via the jobset-name selector — for every replica, producing N independent tail streams (each with its own dedup state) and N-fold-duplicating every console line (prefixed <role>/<k>). At 16 nodes that's 16x the log volume, which also overruns the CI job-log limit on long runs. Stream only for k == 0; that single tail already covers all ranks (and writes log-allranks_0.out once). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): forward rank 0 + last rank to stdout (not all-ranks dedup) Revert the all-ranks sliding-window dedup back to forwarding only global rank 0 (setup/config) and the last global rank (print_rank_last per-step loss), like a SLURM job log. The last rank is resolved at stream time from each pod's batch.kubernetes.io/job-completion-index label (== torchrun --node-rank $PET_NODE_RANK), so global_rank = completion_index * nproc_per_node + local_rank is deterministic without any topology enforcement. The full per-rank stream is still captured in log-allranks_0.out. Combined with the per-replica log_iter guard, this stops the N-fold duplication and yields a clean two-rank console. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): forward rank-0 + the actual loss-rank slot to stdout The c10d rendezvous assigns torch ranks by join order, not by JobSet completion-index, so torch's world_size-1 (print_rank_last's loss line) does NOT land on the highest completion-index. Verified on a live 16-node job: the loss prints on completion-index 9 (= num_nodes//2 + 1), local rank nproc-1 — not index 15. Forward exactly (index 0, local 0) and (index num_nodes//2 + 1, local nproc-1) so the console shows rank 0 setup + the per-step loss/throughput. Full per-rank capture remains in log-allranks_0.out. A deterministic completion-index->rank mapping (topology/static rank ordering) would let us compute this rather than match the observed slot. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): robust log streaming across pod/container restarts fetch_logs ran a single 'kubectl logs -l -f --max-log-requests <num_nodes>'. That follow only attaches to pods present at start, never re-attaches to a container that restarts, and --max-log-requests == pod count has no headroom — so a gang/NCCL-init restart that transiently doubled the matching-pod count errored the whole command ('maximum allowed concurrency') and silently dropped pods. Observed: a 16-node job streamed only node-0-0; the loss rank (node-0-9) never appeared even though it was emitting per-step loss. - --max-log-requests = max(num_nodes*2, 8): headroom for restart-transient pods. - Periodically re-attach (threading.Timer terminates the follow every 120s) so pods that (re)started after the initial attach are picked up. - Resume reconnects with --since-time (via --timestamps), tracking the max RFC3339 stamp, so re-attaching never replays already-emitted history; only the first attach uses --tail=-1. The kubectl timestamp is stripped from each line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): resolve rank-0/last pods from worker GROUP_RANK, not a heuristic The console forwarded rank 0 + the loss rank using completion-index with an empirical 'num_nodes//2+1' slot for world_size-1. That's fragile: the c10d rendezvous assigns torch ranks by join order, not JobSet completion-index, so the loss rank lands on an unpredictable pod (observed: completion-index 9 was actually GROUP_RANK 15 = RANK 63 = world_size-1). Read the ground truth instead: torchrun exports GROUP_RANK into every worker's /proc/<pid>/environ, so 'kubectl exec <pod> -- ' reading it tells us exactly which pod holds GROUP_RANK 0 (RANK 0, local 0) and GROUP_RANK num_nodes-1 (RANK world_size-1, local nproc-1). Resolve the pod->GROUP_RANK map once the workers exist, cache it, and re-resolve when the rank-0/last pod is no longer covered (gang restart reshuffles ranks). Until workers come up (empty map), fall back to the completion-index-0 pod so early setup still streams. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): emit forwarded log lines in timestamp order 'kubectl logs -l ... -f' multiplexes every pod into one stream in ARRIVAL order, not timestamp order. Because the console forwards two pods (rank 0 and the last rank), their lines could interleave wrong — e.g. two rank-0 'Step Time' lines bunching before the last rank's 'iteration N' line, or a step time landing under the next iteration. Add a small reorder buffer on the forwarded (yielded) subset only: each line already carries the kubelet --timestamps value (parsed to epoch via the new _ts_epoch), so hold lines until they are older than reorder_hold_s (2s) and emit sorted by timestamp. The window comfortably absorbs cross-node clock skew + flush jitter while keeping the console near-live. The buffer is drained in order after each proc ends (re-attach) — outside finally, since yielding during generator close is unsafe. The full all-ranks debug file is untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * feat(kubeflow): support pod-template annotations/labels (podTemplateOverrides metadata) The executor's existing 'annotations' land on the TrainJob object. GKE multi-network attach (networking.gke.io/interfaces, for GPUDirect-RDMA/gIB) is read off the trainer POD, not the TrainJob — add pod_annotations (and pod_labels) that flow into podTemplateOverrides[].metadata, which the Kubeflow Trainer v2 CRD supports. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): resolve rank-0 and last rank before forwarding logs On first attach the GROUP_RANK pod map is empty until the torchrun workers finish rendezvous, so _forward_to_stdout fell back to rank-0-only and the last rank's early per-step loss/throughput lines (replayed via --tail=-1) were written to log-allranks but never forwarded to stdout — the CI log silently dropped the beginning of the run until a re-attach ~120s later, by which point --since-time skips the replayed history. Poll on the first attach until both rank 0 and the last rank resolve before forwarding, capped at 600s (then fall back). The wait is gated on a non-empty pod list, so it is a no-op when pods can't be listed (no kubectl / unit tests) and engages only for real runs. Signed-off-by: oliver könig <okoenig@nvidia.com> * fix(kubeflow): wait for rank-0/last to resolve, never fall back to completion-index The first-attach barrier capped the wait at 600s and then forwarded with the completion-index heuristic, which streams the wrong rank. A job can legitimately sit Pending (starved for nodes) far longer than 600s, so it would time out and mis-forward. Drop the timeout/fallback: keep polling while the job is alive and stop only when it reaches a terminal state. --tail=-1 on first attach replays history, so waiting loses nothing. Signed-off-by: oliver könig <okoenig@nvidia.com> * style(kubeflow): ruff-format kubeflow.py Signed-off-by: oliver könig <okoenig@nvidia.com> * test(kubeflow): update stale tests for uuid names, idempotent 409, rank-0/last log forwarding The base kubeflow rewrite changed behavior the tests still asserted the old way: TrainJob names are now <base>-<uuid6>; a 409 cancels the stale job and recreates (idempotent) rather than raising; and fetch_logs writes every rank to <job_dir>/log-allranks_0.out while forwarding only rank-0 + the last rank to stdout. Set job_dir, patch status/time.sleep to avoid the retry-loop hang, and assert the all-ranks file + uuid-suffixed names. Signed-off-by: oliver könig <okoenig@nvidia.com> * test(kubeflow): cover GROUP_RANK resolution, log forwarding, client reload Raises codecov/patch on the diff from 58% to ~98% by exercising the previously-untested branches: GROUP_RANK resolution via worker environ (incl. the first-attach resolve barrier), rank-0/last-rank forwarding + reorder buffer, the completion-index fallback, pod-template labels/ annotations, stale kube-client reload, and the status() connection-error retry path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * feat(kubeflow): add copy_to_workspace/copy_from_workspace for arbitrary PVC paths package()/pull_results() already bridge launcher↔PVC via a throw-away data-mover pod, but only for the per-job code_dir. Downloading results (or persisting any auxiliary cross-run state) from another path on the volume had no public API. Add copy_to_workspace(local, remote) and copy_from_workspace(remote, local) that run the same data-mover against an arbitrary path under workdir_pvc_path, and refactor package()/pull_results() to delegate to them (behavior unchanged). Tests cover the happy path, the no-PVC no-op, and pod teardown on copy error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> --------- Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent dfb82c5 commit e11f713

3 files changed

Lines changed: 724 additions & 98 deletions

File tree

0 commit comments

Comments
 (0)