Proactive Capacity-Aware ARC Autoscaling #499

@jeanschmidt

Description

Problem Statement

ARC's kubernetes and kubernetes-novolume container modes have a fundamental architectural flaw: the workflow pod is created by the runner pod. At the time Kubernetes schedules the runner pod, neither the scheduler nor Karpenter has any awareness of the resources required for the workflow pod.

This creates two failure modes:

  1. Delayed execution: Runner pod starts, picks up a job, creates the workflow pod — but no node has capacity for it. Karpenter must provision a new node reactively, adding minutes of delay while the job is "running" but doing nothing.
  2. Hard failure (InsufficientCapacity): Runner pod starts and claims a job from GitHub, but the workflow pod can never be scheduled — e.g., AWS has no capacity for the required instance type. The job is claimed but cannot run. It doesn't return to the GitHub queue; it's stuck until the 24-hour timeout.

The second failure mode is the critical one. A job that stays queued on GitHub can be picked up by another cluster. A job that's claimed but unrunnable is dead weight.

Root Cause

ARC's autoscaling is count-based and capacity-unaware:

  1. The listener receives TotalAssignedJobs from GitHub's Actions Service
  2. It patches EphemeralRunnerSet.spec.replicas to match
  3. The EphemeralRunnerSet controller creates EphemeralRunner CRs
  4. The EphemeralRunner controller registers with GitHub and creates runner pods
  5. Runner pods pick up jobs and create workflow pods

At no point does any component check whether the cluster can actually fit the runner + workflow pod pair. The Kubernetes scheduler and Karpenter only react once pods exist — there is no forward-looking capacity assessment.

Solution: Capacity-Aware Listener with Proactive Provisioning

Key Insight: X-ScaleSetMaxCapacity

The GitHub Actions Service protocol already has a capacity signaling mechanism. On every long-poll GetMessage() request, the listener sends an X-ScaleSetMaxCapacity header:

// github.com/actions/scaleset/session_client.go
req.Header.Set("X-ScaleSetMaxCapacity", strconv.Itoa(maxCapacity))

GitHub only assigns jobs to a scale set up to its reported capacity. Today, maxRunners is static — set once from Helm values. The fix is to make it dynamic: a capacity monitor adjusts maxRunners in real time based on actual cluster capacity.

The listener exposes a thread-safe setter for this:

// github.com/actions/scaleset/listener/listener.go
func (l *Listener) SetMaxRunners(max uint32) {
    l.maxRunners.Store(max)  // atomic, safe to call from any goroutine
}

The next GetMessage() poll automatically uses the updated value. No protocol changes, no controller changes, no CRD changes.

Strategy: Optimistic with Placeholder Pods (Strategy B)

Each cluster proactively provisions capacity ahead of demand using low-priority placeholder pods. It reports its full provisionable capacity to GitHub, greedily claiming as many jobs as it can fit. If another cluster claims the jobs first, the spare capacity sits idle until Karpenter consolidates it.

Multi-Cluster Behavior

Cluster A capacity monitor:
  "I have 20 available runner+workflow slots"
  → SetMaxRunners(20)
  → GitHub assigns up to 20 jobs to Cluster A

Cluster B capacity monitor:
  "I have 15 available runner+workflow slots"
  → SetMaxRunners(15)
  → GitHub assigns up to 15 jobs to Cluster B

Cluster A hits InsufficientCapacity (EC2 capacity exhausted):
  "I now have 0 provisionable slots, 18 running"
  → SetMaxRunners(18)
  → New jobs flow to Cluster B (or any other cluster with capacity)

Each cluster is greedy — it advertises its full capacity. GitHub distributes jobs across clusters based on their reported maximums. When a cluster can't provision more nodes, it lowers its maximum, and jobs naturally flow to clusters that still have capacity.

Architecture

Components

1. Forked ghalistener Binary (the only fork required)

The ghalistener binary is the listener process that runs as a pod per AutoscalingRunnerSet. It lives at cmd/ghalistener/main.go in the ARC repo — roughly 150 lines. Today it wires up:

  • A scaleset.Client (GitHub Actions Service REST client)
  • A listener.Listener (long-poll event loop, from github.com/actions/scaleset/listener)
  • A scaler.Scaler (patches EphemeralRunnerSet replicas via K8s API)

The fork adds one new component: a CapacityMonitor goroutine that runs alongside the listener in the same errgroup. It queries cluster state and dynamically calls listener.SetMaxRunners().

Nothing else is forked. The ARC controllers (AutoscalingRunnerSet, EphemeralRunnerSet, EphemeralRunner) run stock. The CRDs are unchanged. The Helm charts need only a container image override for the listener.

2. Capacity Monitor

A goroutine inside the forked ghalistener binary. Responsibilities:

  • Watch Karpenter NodePools: query limits (CPU, memory, GPU budgets) and current usage to decide when to create new placeholder pods
  • Watch EphemeralRunner CRs: track current runner count and their states (pending, running)
  • Watch placeholder pod status: detect when placeholder pairs (runner + workflow) both reach Running (confirmed reservation) or remain Pending past timeout (capacity unavailable)
  • Calculate available slots: available = ready_pair_count (only pairs where both placeholders are Running are counted; the scheduler is the sole arbiter of resource availability)
  • Update maxRunners: call listener.SetMaxRunners(current_runners + available) whenever capacity changes
  • Manage placeholder pair lifecycle: create placeholder pairs to claim potential headroom, delete timed-out pairs (if either placeholder stays Pending), and replenish pairs as runners preempt them

3. Placeholder Pods (Split Runner + Workflow)

Each capacity slot is reserved by two placeholder pods — one for the runner and one for the workflow. This split solves a critical race condition: without it, after the runner pod preempts a single combined placeholder, the freed workflow resources are unprotected until the workflow pod is created (seconds to tens of seconds later — GitHub registration, job pickup, hooks init). Any other pod in the cluster could claim those resources, leaving the workflow pod Pending and the job stuck.

With split placeholders, the workflow placeholder (priority 10) survives runner pod creation (priority 0) and continues to protect the workflow resources until the actual workflow pod (priority 20) preempts it.

Per slot, the capacity monitor creates:

Pod                    Resource requests   PriorityClass          Priority   Preempted by
Placeholder-Runner     runner.requests     placeholder-runner     -10        Runner pod (0)
Placeholder-Workflow   workflow.requests   placeholder-workflow    10        Workflow pod (20)

Placeholder pods are landing-place agnostic — they do NOT require same-node affinity. Each placeholder is scheduled independently by the Kubernetes scheduler. The purpose is to reserve cluster-level capacity (total CPU, memory, GPU across the cluster), not to guarantee specific node co-location. Karpenter provisions nodes based on pending pods regardless of where placeholders land.

Shared properties (both placeholder types):

  • Node placement: Matches the same nodeSelector and tolerations as the runner pods, ensuring placeholders trigger provisioning of the correct instance types.
  • Labels: Clearly labeled with the scale set name, slot ID, role (placeholder-runner or placeholder-workflow), and a TTL annotation for cleanup.
  • Lightweight image: public.ecr.aws/docker/library/alpine:3.21 with command: ["sleep", "900"] — same Alpine used by other OSDC DaemonSets (already cached on nodes). The 15-minute sleep acts as a safety timeout: if nothing preempts or deletes the placeholder, it self-terminates to prevent resource leaks.
  • terminationGracePeriodSeconds: 0: Ensures preemption frees resources immediately — the default 30s grace period would delay scheduling.
  • preemptionPolicy: Never: Placeholders never preempt other pods.
  • Owner reference: Owned by the listener pod, so they're cleaned up automatically if the listener is deleted.
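Combining the properties above, a Placeholder-Workflow manifest could look like the following sketch (the pod name, label keys, and owner-reference placeholders are illustrative; the exact keys are not specified here):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: placeholder-workflow-slot-0     # illustrative naming
  labels:
    scale-set: l-x86iavx512-8-16        # illustrative label keys
    slot-id: "0"
    role: placeholder-workflow
  ownerReferences:                      # owned by the listener pod for auto-cleanup
    - apiVersion: v1
      kind: Pod
      name: <listener-pod-name>
      uid: <listener-pod-uid>
spec:
  priorityClassName: placeholder-workflow   # priority 10
  preemptionPolicy: Never
  terminationGracePeriodSeconds: 0
  containers:
    - name: pause
      image: public.ecr.aws/docker/library/alpine:3.21
      command: ["sleep", "900"]         # 15-minute safety timeout
      resources:
        requests:                       # from capacityAware.workflowResources
          cpu: "4"
          memory: 16Gi
  nodeSelector: {}                      # match the runner pod template's nodeSelector
  tolerations: []                       # match the runner pod template's tolerations
```

The Placeholder-Runner manifest is identical except for its name, role label, priorityClassName (placeholder-runner), and resource requests (runner.requests).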

Priority ladder:

Priority 20: Workflow pod      — preempts Placeholder-Workflow (10)
Priority 10: Placeholder-Workflow — survives Runner pod creation, protects workflow resources
Priority  0: Runner pod        — preempts Placeholder-Runner (-10), does NOT preempt Placeholder-Workflow (10)
Priority -10: Placeholder-Runner  — lowest priority, preempted first

Preemption sequence during job execution:

  1. Runner pod (priority 0) is created → preempts Placeholder-Runner (priority -10) → runner starts
  2. Placeholder-Workflow (priority 10) remains Running — its resources are protected
  3. Runner registers with GitHub, picks up job, calls runner-container-hooks
  4. Workflow pod (priority 20) is created → preempts Placeholder-Workflow (priority 10) → workflow starts
  5. Both placeholders are gone, runner + workflow are running on the reserved capacity

4. Capacity Calculation

The capacity monitor must answer: "How many runner+workflow pairs can this cluster guarantee right now?"

Placeholder pod pairs are the sole source of truth for available capacity. The monitor does NOT calculate node headroom directly — instead, it creates placeholder pairs (runner + workflow) and waits for the Kubernetes scheduler to confirm the reservation by transitioning both to Running. This eliminates double-counting across scale sets: the scheduler is the arbiter of resource contention, and a Running pair is proof that the resources are committed.

capacity = ready_pair_count - pending_runners_without_pairs

Where:

  • ready_pair_count = number of placeholder pairs where BOTH the runner and workflow placeholder are in Running state
  • pending_runners_without_pairs = runners that have been created but don't yet have a corresponding placeholder pair to preempt
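A minimal sketch of this calculation in Go (the function name and the clamp at zero are illustrative assumptions, not ARC code):

```go
package main

import "fmt"

// availableSlots implements capacity = ready_pair_count - pending_runners_without_pairs.
// Pending runners that will preempt a pair consume reserved capacity, so they are
// subtracted; the result is clamped at zero (an assumed safeguard).
func availableSlots(readyPairs, pendingRunnersWithoutPairs int) int {
	available := readyPairs - pendingRunnersWithoutPairs
	if available < 0 {
		available = 0
	}
	return available
}

func main() {
	// 5 confirmed pairs, 2 already spoken for by pending runners:
	fmt.Println(availableSlots(5, 2)) // prints 3
}
```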

The flow for detecting and reporting new capacity:

  1. Monitor detects potential headroom (node added, job completed, etc.)
  2. Monitor creates placeholder pair(s) — one Placeholder-Runner + one Placeholder-Workflow (scheduled independently, no same-node requirement)
  3. Both placeholders stay Pending until the scheduler places them — if multiple scale sets compete for the same headroom, the scheduler picks one and the others remain Pending
  4. Once both placeholders in a pair reach Running, monitor counts the pair as one available slot and updates maxRunners
  5. If either placeholder in a pair does not reach Running within placeholderReadyTimeout, the monitor deletes the entire pair and does NOT report the capacity — it will retry on the next recalculation cycle

This design means capacity is never reported speculatively. Every slot reported to GitHub via X-ScaleSetMaxCapacity is backed by a confirmed resource reservation for both the runner and the workflow pod.

The monitor recalculates on every relevant event (node added/removed, pod scheduled/deleted, NodePool status change) and on a periodic fallback interval (e.g., 30 seconds).

Detailed Flow

Steady State (No Queued Jobs)

  1. Capacity monitor creates placeholder pod pairs (runner + workflow) up to the configured proactiveCapacity limit
  2. Karpenter sees pending placeholder pods, provisions nodes if needed (e.g., previous nodes were consolidated)
  3. Scheduler places both placeholders (independently, no co-location required) — both transition to Running
  4. If either placeholder in a pair does not reach Running within placeholderReadyTimeout, monitor deletes the entire pair (capacity unavailable) and retries on next cycle
  5. Monitor counts only complete pairs where both placeholders are Running and calls SetMaxRunners(current_runners + ready_pairs)
  6. Listener polls GitHub with the updated X-ScaleSetMaxCapacity
  7. GitHub sees this scale set can handle N jobs — ready for the next burst

Job Burst Arrives

  1. GitHub assigns M jobs to this scale set (M ≤ maxRunners)
  2. Listener receives TotalAssignedJobs = M in statistics
  3. Scaler patches EphemeralRunnerSet.spec.replicas = M
  4. EphemeralRunnerSet controller creates M EphemeralRunner CRs
  5. EphemeralRunner controller creates M runner pods (priority 0)
  6. Runner pod (priority 0) preempts a Placeholder-Runner (priority -10) on the cluster. Because placeholders use terminationGracePeriodSeconds: 0, resources are freed near-instantly.
  7. Placeholder-Workflow (priority 10) survives — runner pod priority (0) is too low to preempt it. Workflow resources remain protected in the cluster (the placeholder-workflow may be on a different node).
  8. Runner pod starts, registers with GitHub, picks up job
  9. Runner creates workflow pod (priority 20) via runner-container-hooks
  10. Workflow pod (priority 20) preempts Placeholder-Workflow (priority 10) — workflow resources are freed and immediately claimed by the workflow pod
  11. Both placeholders are gone. Runner + workflow are running on the capacity that was reserved for them (possibly on different nodes).
  12. Capacity monitor detects the consumed pair:
    • Decreases available count
    • Calls SetMaxRunners(current_runners + remaining_ready_pairs)
    • Creates new placeholder pairs to replenish proactive capacity
    • Karpenter provisions new nodes for the new placeholders
  13. Next poll to GitHub reflects the updated capacity

Why split placeholders solve the resource protection gap: With a single combined placeholder, the runner pod preempts it entirely, freeing both runner and workflow resources at once. The runner consumes its portion, but the workflow resources sit unprotected in the cluster for seconds (GitHub registration, job pickup, hooks init) — any other pod can claim them. With split placeholders, the Placeholder-Workflow remains Running and holds the workflow resources until the actual workflow pod (higher priority) arrives to preempt it. The resources are never unprotected. Because placeholders are landing-place agnostic, the runner and workflow pods may end up on different nodes — what matters is that cluster-level capacity was reserved for both.

Why terminationGracePeriodSeconds: 0 is mandatory on both placeholder types: Without it, Kubernetes gives the placeholder 30 seconds to shut down before forcefully killing it. During those 30 seconds the resources are still held by the dying placeholder. With terminationGracePeriodSeconds: 0, the pause container is killed immediately and resources are freed for the preempting pod.

InsufficientCapacity (EC2 Exhaustion)

  1. Capacity monitor creates placeholder pods to trigger Karpenter provisioning
  2. Karpenter attempts to provision nodes but hits EC2 InsufficientInstanceCapacity
  3. Placeholder pods stay Pending indefinitely
  4. Capacity monitor detects: placeholder pods are not becoming Running within a timeout (e.g., 5 minutes)
  5. Monitor does NOT count pending placeholders as available capacity
  6. Calls SetMaxRunners(current_runners + ready_placeholders_only) — only capacity that's actually confirmed
  7. Next poll to GitHub reports reduced capacity
  8. GitHub stops assigning new jobs to this scale set
  9. New jobs flow to other clusters that still have capacity
  10. When EC2 capacity becomes available again, placeholders become Running, monitor increases maxRunners, jobs flow back

Job Claimed by Another Cluster

  1. Cluster A reports capacity, GitHub assigns jobs
  2. Cluster B also has capacity for the same runner labels, GitHub assigns some jobs there instead
  3. Cluster A's TotalAssignedJobs is lower than expected
  4. Scaler creates fewer runners than maxRunners
  5. Placeholder pods remain running (spare capacity)
  6. Karpenter's consolidation policy eventually reclaims underutilized nodes (placeholder pods are low-priority, easily evicted during consolidation)
  7. Monitor adjusts as nodes are consolidated

Scale to Zero

  1. All jobs complete, runners exit, EphemeralRunners are cleaned up
  2. Capacity monitor detects no active runners
  3. If proactiveCapacity > 0: maintains some placeholder pods to keep warm capacity for the next burst
  4. If proactiveCapacity == 0: deletes all placeholders, Karpenter consolidates nodes, SetMaxRunners(0)
  5. Cluster is fully scaled down but can report capacity again within one Karpenter provisioning cycle

Configuration

Runner Definition (modules/arc-runners/defs/*.yaml)

Each runner definition MUST set maxRunners — the absolute ceiling for that scale set. The capacity monitor will NEVER exceed this value, regardless of placeholder count or queued jobs.

runner:
  name: l-x86iavx512-8-16
  instance_type: c7a.48xlarge
  vcpu: 8
  memory: 16Gi
  gpu: 0
  disk_size: 150
  maxRunners: 100  # absolute ceiling — capacity monitor never exceeds this

The maxRunners value flows through the template into the Helm maxRunners field, which the ARC controller writes into the listener config. The capacity monitor reads it from config.MaxRunners and uses it as the ceiling for X-ScaleSetMaxCapacity.

Capacity-Aware Listener Config

New values added to the runner scale set Helm values (or listener config):

capacityAware:
  enabled: true
  # How many runner+workflow slots to proactively provision ahead of demand.
  # Default is 0 (disabled) — opt-in per scale set. Consider enabling for
  # runner types with frequent bursts where cold-start latency matters.
  proactiveCapacity: 0
  # How often to recalculate capacity (fallback; event-driven is primary)
  recalculateInterval: 30s
  # How long to wait for a placeholder to become Ready before considering
  # the capacity unavailable (InsufficientCapacity detection)
  placeholderReadyTimeout: 5m
  # Resource requirements for the workflow pod (runner resources come from
  # the pod template). This is the key input for placeholder sizing.
  # MUST be set to the MAXIMUM expected workflow resource requirements for
  # this runner type — not the average. A workflow that exceeds these
  # limits will fail to schedule even after preempting the placeholder.
  workflowResources:
    requests:
      cpu: "4"
      memory: "16Gi"
    # Optional: GPU requirements
    # nvidia.com/gpu: "1"
  # PriorityClasses for placeholder pods (created automatically by the capacity monitor)
  # placeholder-runner: priority -10 (preempted by runner pods)
  # placeholder-workflow: priority 10 (survives runner creation, preempted by workflow pods)

Each scale set (runner type) has its own workflowResources because different runner types run different workloads. A CPU runner's workflow pod needs 4 CPU + 16Gi. A GPU runner's workflow pod needs 8 CPU + 64Gi + 1 GPU. The placeholder pods are sized accordingly.

HUD API Integration (Queued Jobs)

The capacity monitor queries the PyTorch HUD API to discover how many jobs are currently queued for this runner's labels, and pre-provisions placeholder pairs for them in addition to the static proactiveCapacity.

API endpoint: https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
Auth header: x-hud-internal-bot: <secret>
Secret: Stored as a Kubernetes Secret in the arc-systems namespace:

apiVersion: v1
kind: Secret
metadata:
  name: pytorch-hud-token
  namespace: arc-systems
type: Opaque
stringData:
  token: "<hud-internal-bot-secret>"

The listener pod mounts this secret as an environment variable HUD_API_TOKEN.

Response format:

interface QueuedJobsForRunner {
  runner_label: string;   // e.g., "mt-l-x86iavx512-8-16"
  org: string;
  repo: string;
  num_queued_jobs: number;
  min_queue_time_minutes: number;
  max_queue_time_minutes: number;
}

Label discovery: At startup, the capacity monitor calls client.GetRunnerScaleSet(ctx, scaleSetID) to retrieve the RunnerScaleSet.Labels array — these are the labels GitHub matches against runs-on:. The monitor then filters the HUD response to entries where runner_label matches any of the scale set's configured labels, and sums num_queued_jobs across all matching entries.
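A sketch of the response parsing and label matching in Go (the struct tags follow the JSON keys documented above; sumQueuedJobs is an illustrative name, not actual monitor code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// QueuedJobsForRunner mirrors the HUD API response entry documented above.
type QueuedJobsForRunner struct {
	RunnerLabel         string  `json:"runner_label"`
	Org                 string  `json:"org"`
	Repo                string  `json:"repo"`
	NumQueuedJobs       int     `json:"num_queued_jobs"`
	MinQueueTimeMinutes float64 `json:"min_queue_time_minutes"`
	MaxQueueTimeMinutes float64 `json:"max_queue_time_minutes"`
}

// sumQueuedJobs filters HUD entries to those whose runner_label matches any of
// the scale set's labels and sums num_queued_jobs across the matching entries.
func sumQueuedJobs(entries []QueuedJobsForRunner, scaleSetLabels []string) int {
	labels := make(map[string]bool, len(scaleSetLabels))
	for _, l := range scaleSetLabels {
		labels[l] = true
	}
	total := 0
	for _, e := range entries {
		if labels[e.RunnerLabel] {
			total += e.NumQueuedJobs
		}
	}
	return total
}

func main() {
	raw := `[{"runner_label":"mt-l-x86iavx512-8-16","org":"pytorch","repo":"pytorch",
	          "num_queued_jobs":7,"min_queue_time_minutes":1,"max_queue_time_minutes":12},
	         {"runner_label":"other-label","org":"pytorch","repo":"pytorch",
	          "num_queued_jobs":3,"min_queue_time_minutes":2,"max_queue_time_minutes":5}]`
	var entries []QueuedJobsForRunner
	if err := json.Unmarshal([]byte(raw), &entries); err != nil {
		panic(err)
	}
	fmt.Println(sumQueuedJobs(entries, []string{"mt-l-x86iavx512-8-16"})) // prints 7
}
```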

Capacity Formulas

Desired placeholder pairs (how many pairs the monitor tries to maintain):

desired_placeholder_pairs = proactiveCapacity + nbr_queued_jobs_runner

Where nbr_queued_jobs_runner is the sum of num_queued_jobs from the HUD API for all labels matching this scale set.

X-ScaleSetMaxCapacity (reported to GitHub on every poll):

X-ScaleSetMaxCapacity = min(total_running_jobs + running_placeholder_pairs, maxRunners)

Where:

  • total_running_jobs = runners currently executing jobs
  • running_placeholder_pairs = placeholder pairs where BOTH the runner and workflow placeholder pods are in Running state
  • maxRunners = the absolute ceiling from the Helm values (runner def YAML)

The maxRunners ceiling is NEVER exceeded. Even if the HUD API reports 500 queued jobs, if maxRunners is 100, the capacity monitor caps at 100. The desired placeholder pairs are also capped: desired_placeholder_pairs = min(desired_placeholder_pairs, maxRunners - total_running_jobs).
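The two formulas, with the ceiling applied, can be sketched in Go (illustrative function names, not actual monitor code):

```go
package main

import "fmt"

// desiredPlaceholderPairs = min(proactiveCapacity + queuedJobs, maxRunners - runningJobs),
// floored at zero. queuedJobs is nbr_queued_jobs_runner from the HUD API.
func desiredPlaceholderPairs(proactiveCapacity, queuedJobs, maxRunners, runningJobs int) int {
	d := proactiveCapacity + queuedJobs
	if headroom := maxRunners - runningJobs; d > headroom {
		d = headroom
	}
	if d < 0 {
		d = 0
	}
	return d
}

// reportedCapacity = min(runningJobs + runningPlaceholderPairs, maxRunners):
// the X-ScaleSetMaxCapacity value sent to GitHub on every poll.
func reportedCapacity(runningJobs, runningPairs, maxRunners int) int {
	c := runningJobs + runningPairs
	if c > maxRunners {
		c = maxRunners
	}
	return c
}

func main() {
	// HUD reports 500 queued jobs, but maxRunners is 100 and 18 jobs are running:
	fmt.Println(desiredPlaceholderPairs(5, 500, 100, 18)) // prints 82
	fmt.Println(reportedCapacity(18, 82, 100))            // prints 100
}
```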

Resource Sizing for Placeholders

Each capacity slot creates two placeholder pods with separate resource requests:

placeholder-runner.requests   = runner.requests
placeholder-workflow.requests = workflow.requests

Where:

  • runner.requests comes from the existing pod template in the AutoscalingRunnerSet (e.g., 750m CPU, 512Mi memory for a standard runner)
  • workflow.requests comes from the capacityAware.workflowResources config — this MUST be set to the maximum expected workflow resource requirements, not the average. Different GitHub Actions workflows on the same runner type may request different resources. If the placeholder is sized for the average and a heavy workflow arrives, the workflow pod won't fit on the node even after preempting the placeholder.

Example for a standard CPU runner:

  • Placeholder-Runner: 750m CPU, 512Mi memory
  • Placeholder-Workflow: 4 CPU, 16Gi memory
  • Total per slot: 4750m CPU, 16.5Gi memory

Placeholder pods are scheduled independently (no same-node requirement). They reserve cluster-level capacity, not per-node capacity.

PriorityClass Setup

Four priority classes form the preemption ladder. The values are chosen so that each level only preempts the level(s) below it:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-runner
value: -10
globalDefault: false
description: "Runner placeholder — reserves runner resources, preempted by runner pods"
preemptionPolicy: Never

---

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: arc-runner
value: 0
globalDefault: false
description: "Runner pods — preempt runner placeholders, NOT workflow placeholders"

---

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-workflow
value: 10
globalDefault: false
description: "Workflow placeholder — reserves workflow resources, survives runner creation, preempted by workflow pods"
preemptionPolicy: Never

---

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: arc-workflow
value: 20
globalDefault: false
description: "Workflow pods — preempt workflow placeholders"

Preemption guarantees:

  • Runner pod (0) preempts Placeholder-Runner (-10) but NOT Placeholder-Workflow (10) — workflow resources stay protected
  • Workflow pod (20) preempts Placeholder-Workflow (10) — workflow resources are released only when the real workflow pod needs them
  • Neither placeholder type preempts anything (preemptionPolicy: Never)

What Gets Forked

Component                               Action           Reason
cmd/ghalistener/main.go                 Fork             Add CapacityMonitor goroutine to the errgroup
cmd/ghalistener/capacity/               New package      CapacityMonitor implementation, placeholder pod management, Karpenter NodePool queries
github.com/actions/scaleset             No change        Used as-is; SetMaxRunners() is the only integration point
controllers/actions.github.com/*        No change        All controllers run stock
ARC CRDs                                No change        No schema changes
gha-runner-scale-set-controller chart   Minimal change   Override the listener container image with the forked ghalistener binary; chart published from https://github.com/jeanschmidt/actions-runner-controller.git master branch
gha-runner-scale-set chart              No change        Runner pod templates stay the same

Maintenance Burden

The fork surface is minimal — one binary entry point (~150 lines today) plus a new capacity/ package. On ARC upgrades:

  1. Check if cmd/ghalistener/main.go changed (the entry point wiring)
  2. Check if the listener.Scaler interface changed (unlikely — it's been stable)
  3. Check if listener.SetMaxRunners() still exists (it's the public API)
  4. Rebase the fork

The capacity/ package is entirely ours — no upstream merge conflicts possible.

Implementation Plan

Phase 1: Proof of Concept (validate the protocol) — COMPLETED

Goal: Answer the fundamental question — does GitHub respect dynamic X-ScaleSetMaxCapacity changes mid-session?

Result: YES — validated. GitHub re-reads X-ScaleSetMaxCapacity on every poll. Setting maxRunners=0 stops job assignment. Reducing capacity mid-burst redirects queued jobs to other clusters. Latency is one poll cycle (~5-10 seconds).

POC implementation (ConfigMap-based manual knob) has been removed. Phase 2 replaces it with the full automated system.

Phase 2: Production Placeholder System + HUD Integration

Goal: Implement the full capacity-aware listener with placeholder pods, HUD API integration for demand-driven scaling, and proper deployment infrastructure.

2.1 OSDC Infrastructure (osdc/ repo)

POC Cleanup:

  • Remove scripts/python/capacity_setter.py (POC ConfigMap-based tool)
  • Remove DYNAMIC_CAPACITY_CONFIGMAP env var from modules/arc-runners/templates/runner.yaml.tpl
  • Remove capacity recipe from justfile
  • Remove any tests for capacity_setter

Runner Definitions:

  • Add maxRunners field to runner template (runner.yaml.tpl) — maps to the Helm maxRunners value
  • Add maxRunners support to generate_runners.py
  • Add maxRunners to all runner defs in modules/arc-runners/defs/*.yaml

Deploy Infrastructure:

  • Add Harbor osdc project creation to modules/arc/deploy.sh (same pattern as harbor-cache-recovery)
  • Update modules/arc/deploy.sh to pull the controller Helm chart from the published chart on https://github.com/jeanschmidt/actions-runner-controller.git master branch (not a local path)

PriorityClasses:

  • Create Kubernetes manifests for the four priority classes (placeholder-runner, arc-runner, placeholder-workflow, arc-workflow)
  • Deploy as part of the ARC module (applied before runner scale sets)

HUD API Secret:

  • Document the K8s Secret creation for the user to run manually:
    kubectl create secret generic pytorch-hud-token \
      --namespace arc-systems \
      --from-literal=token='<hud-internal-bot-secret>'
  • Add HUD_API_TOKEN env var to listener pod template (from the secret)

2.2 ARC Fork (actions-runner-controller repo)

POC Cleanup:

  • Remove any ConfigMap watcher code from the forked listener

Capacity Monitor Package (cmd/ghalistener/capacity/):

The core implementation. A single goroutine that runs in the listener's errgroup:

  1. Label Discovery (labels.go):

    • At startup, call client.GetRunnerScaleSet(ctx, scaleSetID) to get RunnerScaleSet.Labels
    • Cache the labels for HUD API matching
    • The scale set's labels are already fetched in main.go — pass them to the monitor
  2. HUD API Client (hud_client.go):

    • HTTP GET to https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
    • Header: x-hud-internal-bot: <token> (from HUD_API_TOKEN env var)
    • Parse response as []QueuedJobsForRunner
    • Filter by matching runner_label against the scale set's labels
    • Sum num_queued_jobs for matching entries → nbr_queued_jobs_runner
    • Poll interval: same as recalculateInterval (default 30s)
  3. Placeholder Manager (placeholder.go):

    • Create placeholder pairs (Placeholder-Runner + Placeholder-Workflow) as Kubernetes pods
    • Image: public.ecr.aws/docker/library/alpine:3.21, command: ["sleep", "900"]
    • Owner reference: listener pod (auto-cleanup on listener death/restart)
    • terminationGracePeriodSeconds: 0
    • preemptionPolicy: Never on both placeholders
    • No pod affinity — placeholders are landing-place agnostic (cluster-level capacity reservation)
    • Node selector + tolerations: match the runner pod template
    • Track pair state: both Running = ready slot, either Pending past timeout = delete pair
  4. Capacity Calculator (monitor.go):

    • Main reconciliation loop (event-driven + periodic fallback at recalculateInterval):
      nbr_queued = query_hud_api()
      desired_pairs = proactiveCapacity + nbr_queued
      desired_pairs = min(desired_pairs, maxRunners - total_running_jobs)
      desired_pairs = max(desired_pairs, 0)
      
      // Create or delete placeholder pairs to match desired count
      adjust_placeholders(desired_pairs)
      
      // Report capacity to GitHub
      running_pairs = count_running_placeholder_pairs()
      capacity = min(total_running_jobs + running_pairs, maxRunners)
      listener.SetMaxRunners(capacity)
      
    • Watch for pod events (placeholder state changes) to trigger immediate recalculation
    • Watch for EphemeralRunner changes (job started/completed) to adjust
  5. Integration (main.go changes):

    • Create CapacityMonitor in the errgroup alongside the listener
    • Pass: scaleset.Client, listener.SetMaxRunners, config.MaxRunners, scale set labels, K8s client
    • If capacityAware.enabled is false, skip monitor creation entirely (backward-compatible)

Unit Tests:

  • Test capacity formula with various inputs (queued jobs, running jobs, maxRunners ceiling)
  • Test HUD API response parsing and label matching
  • Test placeholder pair lifecycle (creation, timeout, deletion)
  • Mock K8s client for placeholder pod operations

2.3 Staging Validation

Deploy to staging and validate:

  1. Placeholder pods trigger Karpenter provisioning (no same-node requirement)
  2. Runner pods (priority 0) preempt Placeholder-Runner (-10) but NOT Placeholder-Workflow (10)
  3. Workflow pods (priority 20) preempt Placeholder-Workflow (10)
  4. Workflow resources remain protected between runner start and workflow pod creation
  5. maxRunners ceiling from Helm is never exceeded
  6. HUD API integration correctly discovers queued jobs per runner label
  7. Placeholder pairs scale up when jobs queue, scale down when jobs are assigned
  8. Placeholder timeout (15 min) works — pods self-terminate if not preempted
  9. Listener restart cleans up all placeholder pods (owner reference)
  10. InsufficientCapacity: placeholder stuck Pending past placeholderReadyTimeout → deleted, capacity not reported

Phase 3: Production Rollout

  1. Add Prometheus metrics for capacity monitoring (available slots, placeholder status, maxRunners changes, HUD API latency)
  2. Add Grafana dashboards
  3. Roll out to production clusters one at a time
  4. Tune proactiveCapacity per runner type based on observed burst patterns and HUD data

Phase 4: Multi-Cluster Optimization

  1. Validate multi-cluster behavior with dynamic maxRunners
  2. Consider adding cross-cluster capacity metrics (each cluster publishes its available capacity to a shared metric store)
  3. Tune consolidation delays to balance spare capacity cost vs. burst responsiveness

Risks and Mitigations

Risk: GitHub doesn't respect X-ScaleSetMaxCapacity dynamically

Likelihood: Low. The header is sent on every poll, and the protocol is designed for this.
Mitigation: Phase 1 PoC validates this before any significant investment.

Risk: Placeholder pod preemption race conditions

Scenario: Runner pod is created, preempts its placeholder, but the freed workflow resources are claimed by another pod before the workflow pod is created.
Mitigation: Solved by the split placeholder design. Each slot has two placeholders: Placeholder-Runner (priority -10) and Placeholder-Workflow (priority 10). When the runner pod (priority 0) is created, it preempts only Placeholder-Runner — Placeholder-Workflow survives because its priority (10) is higher than the runner's (0). The workflow resources remain protected until the actual workflow pod (priority 20) preempts Placeholder-Workflow. Both placeholder types MUST set terminationGracePeriodSeconds: 0 to ensure resources are freed instantly during preemption.

Risk: Multi-scale-set resource contention (double-counting headroom)

Scenario: Multiple runner types share the same NodePools. Two capacity monitors both see node headroom and report it as available.
Mitigation: This is solved by design. Capacity is never reported from headroom calculations — only from Running placeholder pods. When two monitors both detect headroom and create placeholders, the Kubernetes scheduler places one and the other stays Pending. The Pending placeholder times out, is deleted, and the monitor retries on the next cycle. The scheduler is the sole arbiter of resource contention; no cross-monitor coordination is needed.

Risk: Karpenter consolidation conflicts with placeholder pods

Scenario: Karpenter tries to consolidate nodes with placeholder pods while the capacity monitor is trying to maintain proactive capacity.
Mitigation: Placeholder pods do NOT use karpenter.sh/do-not-disrupt — that annotation blocks consolidation entirely with no TTL, which would cause idle nodes to accumulate indefinitely. Instead, placeholders are low-priority pods with preemptionPolicy: Never, which means Karpenter treats them as reschedulable during consolidation. When Karpenter evicts a placeholder, the capacity monitor detects the pod leaving Running state, decreases maxRunners, and creates a new placeholder pair on the next cycle (which may land on a different, more efficient node). This is the desired behavior — Karpenter optimizes node utilization, the capacity monitor reacts and re-provisions.

Risk: Cost of spare capacity

Scenario: Clusters maintain placeholder pods (and thus nodes) that never get used.
Mitigation: proactiveCapacity defaults to 0 (disabled) and is opt-in per scale set. Enable it only for runner types with frequent bursts where cold-start latency matters. Tune the value based on observed burst patterns for each runner type. Karpenter consolidation eventually reclaims unused capacity if no jobs arrive.

Risk: Fork maintenance burden

Likelihood: Low. The fork is ~200 lines of new code in a single binary. The scaleset package interface is stable (three methods, unchanged since v0.2.0).
Mitigation: The forked binary is a thin wrapper. ARC controller upgrades don't affect it. Only changes to the scaleset package's Scaler interface or SetMaxRunners() API would require fork updates.

Risk: HUD API unavailability

Scenario: The PyTorch HUD API at hud.pytorch.org is down, slow, or returns errors. The capacity monitor cannot determine queued job counts.
Mitigation: The HUD API is a best-effort enhancement. If the API is unavailable, the capacity monitor falls back to proactiveCapacity only (ignoring the nbr_queued_jobs_runner component). The monitor logs the failure but does not reduce capacity or stop functioning. Placeholder pairs based on static proactiveCapacity continue to work regardless of HUD API status.

Risk: JobAssigned messages are ignored

Context: The listener currently ignores JobAssigned messages (parses them but doesn't pass to the scaler). These messages contain per-job metadata (RequestLabels, RepositoryName, JobID, etc.) that could be useful for smarter capacity decisions.
Opportunity: A future enhancement could process JobAssigned messages to make per-job capacity decisions — e.g., "this job needs a GPU node, do I have GPU capacity?"
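Such a per-job decision could be as simple as a predicate over RequestLabels. This is a sketch of the idea only; the GPU label shown is illustrative, and the message struct is a subset of JobMessageBase:

```go
package main

import "fmt"

// JobAssignedMsg carries the JobMessageBase fields a capacity decision
// would need (subset only).
type JobAssignedMsg struct {
	JobID         int64
	RequestLabels []string
}

// needsGPU is a hypothetical predicate: it flags jobs whose request labels
// match a known set of GPU runner labels. The label set is illustrative.
func needsGPU(msg JobAssignedMsg) bool {
	gpuLabels := map[string]bool{"linux.g5.4xlarge.nvidia.gpu": true}
	for _, l := range msg.RequestLabels {
		if gpuLabels[l] {
			return true
		}
	}
	return false
}

func main() {
	gpuJob := JobAssignedMsg{JobID: 1, RequestLabels: []string{"linux.g5.4xlarge.nvidia.gpu"}}
	cpuJob := JobAssignedMsg{JobID: 2, RequestLabels: []string{"linux.2xlarge"}}
	fmt.Println(needsGPU(gpuJob), needsGPU(cpuJob)) // true false
}
```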

Protocol Reference

Key Types (from github.com/actions/scaleset)

// Statistics sent with every message
type RunnerScaleSetStatistic struct {
    TotalAvailableJobs     int
    TotalAcquiredJobs      int
    TotalAssignedJobs      int  // <-- this is what drives scaling
    TotalRunningJobs       int
    TotalRegisteredRunners int
    TotalBusyRunners       int
    TotalIdleRunners       int
}

// Message types
const (
    MessageTypeJobAssigned  = "JobAssigned"   // ignored by current listener
    MessageTypeJobStarted   = "JobStarted"    // runner picked up job
    MessageTypeJobCompleted = "JobCompleted"  // runner finished job
)

// Per-job metadata (available in all message types)
type JobMessageBase struct {
    RunnerRequestID    int64
    RepositoryName     string
    OwnerName          string
    JobID              int64
    JobWorkflowRef     string
    JobDisplayName     string
    WorkflowRunID      int64
    EventName          string
    RequestLabels      []string
    QueueTime          time.Time
    ScaleSetAssignTime time.Time
    RunnerAssignTime   time.Time
    FinishTime         time.Time
}

Listener Event Loop (from github.com/actions/scaleset/listener)

loop:
  1. GetMessage(ctx, lastMessageID, maxRunners)
     → HTTP GET to messageQueueURL
     → Header: X-ScaleSetMaxCapacity = maxRunners
     → Header: Authorization = Bearer <messageQueueAccessToken>

  2a. HTTP 202 (no messages):
     → call scaler.HandleDesiredRunnerCount(ctx, latestStatistics.TotalAssignedJobs)
     → loop

  2b. HTTP 200 (message batch):
     → parse message (Statistics + JobAssigned/Started/Completed arrays)
     → DELETE messageQueueURL/{messageID}  (ACK — all-or-nothing)
     → call scaler.HandleJobStarted() for each started job
     → call scaler.HandleJobCompleted() for each completed job
     → call scaler.HandleDesiredRunnerCount(ctx, msg.Statistics.TotalAssignedJobs)
     → loop

  2c. HTTP 401 (token expired):
     → PATCH sessions/{sessionId} to refresh token
     → retry

ARC Controller Chain (unchanged)

AutoscalingRunnerSet (Helm-created CR)
  ↓ controller creates
AutoscalingListener (CR + listener pod in arc-systems)
  ↓ listener patches
EphemeralRunnerSet.spec.replicas = N
  ↓ controller creates N
EphemeralRunner (one per runner)
  ↓ controller creates
Pod (runner pod, from template in AutoscalingRunnerSet)
  ↓ runner creates (via runner-container-hooks)
Pod (workflow pod, in arc-runners namespace)

Source Code References

actions/scaleset/listener/listener.go: Listener struct, Scaler interface, Run() loop, SetMaxRunners()
actions/scaleset/session_client.go: GetMessage (long-poll with X-ScaleSetMaxCapacity), DeleteMessage (ACK)
actions/scaleset/client.go: REST API client, authentication, JIT config generation
actions/scaleset/types.go: Protocol types (RunnerScaleSetStatistic, message types, RunnerScaleSet)
actions/scaleset/examples/dockerscaleset/scaler.go: Reference Scaler implementation showing capacity capping
actions-runner-controller/cmd/ghalistener/main.go: Listener binary entry point (the file we fork)
actions-runner-controller/cmd/ghalistener/scaler/scaler.go: Current Scaler implementation (patches EphemeralRunnerSet)
actions-runner-controller/controllers/actions.github.com/ephemeralrunnerset_controller.go: EphemeralRunnerSet reconciler (creates EphemeralRunner CRs)
actions-runner-controller/controllers/actions.github.com/ephemeralrunner_controller.go: EphemeralRunner reconciler (creates runner pods)
actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go: AutoscalingRunnerSet reconciler (manages listener + EphemeralRunnerSet)
actions-runner-controller/controllers/actions.github.com/resourcebuilder.go: Pod spec construction for listener and runner pods
