Proactive Capacity-Aware ARC Autoscaling
Problem Statement
ARC in kubernetes and kubernetes-novolume container modes has a fundamental architectural flaw: the workflow pod is created by the runner pod. At the time Kubernetes schedules the runner pod, neither the scheduler nor Karpenter has any awareness of the resources required for the workflow pod.
This creates two failure modes:
- Delayed execution: Runner pod starts, picks up a job, creates the workflow pod — but no node has capacity for it. Karpenter must provision a new node reactively, adding minutes of delay while the job is "running" but doing nothing.
- Hard failure (InsufficientCapacity): Runner pod starts and claims a job from GitHub, but the workflow pod can never be scheduled — e.g., AWS has no capacity for the required instance type. The job is claimed but cannot run. It doesn't return to the GitHub queue; it's stuck until the 24-hour timeout.
The second failure mode is the critical one. A job that stays queued on GitHub can be picked up by another cluster. A job that's claimed but unrunnable is dead weight.
Root Cause
ARC's autoscaling is count-based and capacity-unaware:
- The listener receives TotalAssignedJobs from GitHub's Actions Service
- It patches EphemeralRunnerSet.spec.replicas to match
- The EphemeralRunnerSet controller creates EphemeralRunner CRs
- The EphemeralRunner controller registers with GitHub and creates runner pods
- Runner pods pick up jobs and create workflow pods
At no point does any component check whether the cluster can actually fit the runner + workflow pod pair. The Kubernetes scheduler and Karpenter only react once pods exist — there is no forward-looking capacity assessment.
Solution: Capacity-Aware Listener with Proactive Provisioning
Key Insight: X-ScaleSetMaxCapacity
The GitHub Actions Service protocol already has a capacity signaling mechanism. On every long-poll GetMessage() request, the listener sends an X-ScaleSetMaxCapacity header:
// github.com/actions/scaleset/session_client.go
req.Header.Set("X-ScaleSetMaxCapacity", strconv.Itoa(maxCapacity))
GitHub only assigns jobs to a scale set up to its reported capacity. Today, maxRunners is static — set once from Helm values. The fix is to make it dynamic: a capacity monitor adjusts maxRunners in real time based on actual cluster capacity.
The listener exposes a thread-safe setter for this:
// github.com/actions/scaleset/listener/listener.go
func (l *Listener) SetMaxRunners(max uint32) {
    l.maxRunners.Store(max) // atomic, safe to call from any goroutine
}
The next GetMessage() poll automatically uses the updated value. No protocol changes, no controller changes, no CRD changes.
Strategy: Optimistic with Placeholder Pods (Strategy B)
Each cluster proactively provisions capacity ahead of demand using low-priority placeholder pods. It reports its full provisionable capacity to GitHub, greedily claiming as many jobs as it can fit. If another cluster claims the jobs first, the spare capacity sits idle until Karpenter consolidates it.
Multi-Cluster Behavior
Cluster A capacity monitor:
  "I have 20 available runner+workflow slots"
  → SetMaxRunners(20)
  → GitHub assigns up to 20 jobs to Cluster A

Cluster B capacity monitor:
  "I have 15 available runner+workflow slots"
  → SetMaxRunners(15)
  → GitHub assigns up to 15 jobs to Cluster B

Cluster A hits InsufficientCapacity (EC2 capacity exhausted):
  "I now have 0 provisionable slots, 18 running"
  → SetMaxRunners(18)
  → New jobs flow to Cluster B (or any other cluster with capacity)
Each cluster is greedy — it advertises its full capacity. GitHub distributes jobs across clusters based on their reported maximums. When a cluster can't provision more nodes, it lowers its maximum, and jobs naturally flow to clusters that still have capacity.
Architecture
Components
1. Forked ghalistener Binary (the only fork required)
The ghalistener binary is the listener process that runs as a pod per AutoscalingRunnerSet. It lives at cmd/ghalistener/main.go in the ARC repo — roughly 150 lines. Today it wires up:
- A scaleset.Client (GitHub Actions Service REST client)
- A listener.Listener (long-poll event loop, from github.com/actions/scaleset/listener)
- A scaler.Scaler (patches EphemeralRunnerSet replicas via K8s API)
The fork adds one new component: a CapacityMonitor goroutine that runs alongside the listener in the same errgroup. It queries cluster state and dynamically calls listener.SetMaxRunners().
Nothing else is forked. The ARC controllers (AutoscalingRunnerSet, EphemeralRunnerSet, EphemeralRunner) run stock. The CRDs are unchanged. The Helm charts need only a container image override for the listener.
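As a rough sketch of that wiring, under the assumption that main.go already uses an errgroup; capacity.NewMonitor, Listen, and the config fields are illustrative names, not upstream API:

// cmd/ghalistener/main.go (sketch); uses golang.org/x/sync/errgroup
func run(ctx context.Context, cfg Config, lst *listener.Listener, scaler listener.Scaler, k8s kubernetes.Interface, labels []string) error {
    g, ctx := errgroup.WithContext(ctx)

    g.Go(func() error {
        return lst.Listen(ctx, scaler) // stock listener event loop
    })

    if cfg.CapacityAware.Enabled { // hypothetical config field
        // hypothetical constructor in the new capacity/ package
        monitor := capacity.NewMonitor(k8s, lst.SetMaxRunners, cfg.MaxRunners, labels)
        g.Go(func() error {
            return monitor.Run(ctx) // watches cluster state, adjusts maxRunners
        })
    }

    return g.Wait() // a failure in either goroutine cancels the other
}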
2. Capacity Monitor
A goroutine inside the forked ghalistener binary. Responsibilities:
- Watch Karpenter NodePools: query limits (CPU, memory, GPU budgets) and current usage to decide when to create new placeholder pods
- Watch EphemeralRunner CRs: track current runner count and their states (pending, running)
- Watch placeholder pod status: detect when placeholder pairs (runner + workflow) both reach Running (confirmed reservation) or remain Pending past timeout (capacity unavailable)
- Calculate available slots: available = ready_pair_count — only pairs where both placeholders are Running are counted; the scheduler is the sole arbiter of resource availability
- Update maxRunners: call listener.SetMaxRunners(current_runners + available) whenever capacity changes
- Manage placeholder pair lifecycle: create placeholder pairs to claim potential headroom, delete timed-out pairs (if either placeholder stays Pending), and replenish pairs as runners preempt them
3. Placeholder Pods (Split Runner + Workflow)
Each capacity slot is reserved by two placeholder pods — one for the runner and one for the workflow. This split solves a critical race condition: without it, after the runner pod preempts a single combined placeholder, the freed workflow resources are unprotected until the workflow pod is created (seconds to tens of seconds later — GitHub registration, job pickup, hooks init). Any other pod in the cluster could claim those resources, leaving the workflow pod Pending and the job stuck.
With split placeholders, the workflow placeholder (priority 10) survives runner pod creation (priority 0) and continues to protect the workflow resources until the actual workflow pod (priority 20) preempts it.
Per slot, the capacity monitor creates:
| Pod | Resource requests | PriorityClass | Priority | Preempted by |
|---|---|---|---|---|
| Placeholder-Runner | runner.requests | placeholder-runner | -10 | Runner pod (0) |
| Placeholder-Workflow | workflow.requests | placeholder-workflow | 10 | Workflow pod (20) |
Placeholder pods are landing-place agnostic — they do NOT require same-node affinity. Each placeholder is scheduled independently by the Kubernetes scheduler. The purpose is to reserve cluster-level capacity (total CPU, memory, GPU across the cluster), not to guarantee specific node co-location. Karpenter provisions nodes based on pending pods regardless of where placeholders land.
Shared properties (both placeholder types):
- Node affinity: Matches the same nodeSelector and tolerations as the runner pods, ensuring placeholders trigger provisioning of the correct instance types.
- Labels: Clearly labeled with the scale set name, slot ID, role (placeholder-runner or placeholder-workflow), and a TTL annotation for cleanup.
- Lightweight image: public.ecr.aws/docker/library/alpine:3.21 with command: ["sleep", "900"] — same Alpine used by other OSDC DaemonSets (already cached on nodes). The 15-minute sleep acts as a safety timeout: if nothing preempts or deletes the placeholder, it self-terminates to prevent resource leaks.
- terminationGracePeriodSeconds: 0: Ensures preemption frees resources immediately — the default 30s grace period would delay scheduling.
- preemptionPolicy: Never: Placeholders never preempt other pods.
- Owner reference: Owned by the listener pod, so they're cleaned up automatically if the listener is deleted.
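A sketch of how the capacity monitor might build one placeholder pod with these properties, using client-go types; buildPlaceholderPod and its parameters are illustrative, not existing ARC code:

// cmd/ghalistener/capacity/placeholder.go (sketch)
package capacity

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildPlaceholderPod reflects the shared placeholder properties above.
func buildPlaceholderPod(scaleSet, slotID, role, priorityClass string,
    requests corev1.ResourceList, nodeSelector map[string]string,
    tolerations []corev1.Toleration, owner metav1.OwnerReference) *corev1.Pod {

    grace := int64(0)                 // free resources immediately on preemption
    preemption := corev1.PreemptNever // placeholders never preempt other pods

    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name: fmt.Sprintf("%s-%s-%s", scaleSet, role, slotID),
            Labels: map[string]string{
                "scale-set": scaleSet,
                "slot-id":   slotID,
                "role":      role, // placeholder-runner or placeholder-workflow
            },
            OwnerReferences: []metav1.OwnerReference{owner}, // GC'd with the listener pod
        },
        Spec: corev1.PodSpec{
            PriorityClassName:             priorityClass,
            PreemptionPolicy:              &preemption,
            TerminationGracePeriodSeconds: &grace,
            NodeSelector:                  nodeSelector, // same as runner pods
            Tolerations:                   tolerations,  // same as runner pods
            RestartPolicy:                 corev1.RestartPolicyNever,
            Containers: []corev1.Container{{
                Name:      "placeholder",
                Image:     "public.ecr.aws/docker/library/alpine:3.21",
                Command:   []string{"sleep", "900"}, // 15-minute safety timeout
                Resources: corev1.ResourceRequirements{Requests: requests},
            }},
        },
    }
}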
Priority ladder:
Priority 20: Workflow pod — preempts Placeholder-Workflow (10)
Priority 10: Placeholder-Workflow — survives Runner pod creation, protects workflow resources
Priority 0: Runner pod — preempts Placeholder-Runner (-10), does NOT preempt Placeholder-Workflow (10)
Priority -10: Placeholder-Runner — lowest priority, preempted first
Preemption sequence during job execution:
- Runner pod (priority 0) is created → preempts Placeholder-Runner (priority -10) → runner starts
- Placeholder-Workflow (priority 10) remains Running — its resources are protected
- Runner registers with GitHub, picks up job, calls runner-container-hooks
- Workflow pod (priority 20) is created → preempts Placeholder-Workflow (priority 10) → workflow starts
- Both placeholders are gone, runner + workflow are running on the reserved capacity
4. Capacity Calculation
The capacity monitor must answer: "How many runner+workflow pairs can this cluster guarantee right now?"
Placeholder pod pairs are the sole source of truth for available capacity. The monitor does NOT calculate node headroom directly — instead, it creates placeholder pairs (runner + workflow) and waits for the Kubernetes scheduler to confirm the reservation by transitioning both to Running. This eliminates double-counting across scale sets: the scheduler is the arbiter of resource contention, and a Running pair is proof that the resources are committed.
capacity = ready_pair_count - pending_runners_without_pairs
Where:
ready_pair_count = number of placeholder pairs where BOTH the runner and workflow placeholder are in Running state
pending_runners_without_pairs = runners that have been created but don't yet have a corresponding placeholder pair to preempt
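A minimal sketch of this count, assuming the monitor tracks pair and runner state as plain structs (names are ours):

// availableSlots implements the formula above over tracked monitor state.
type pairStatus struct {
    runnerRunning   bool // Placeholder-Runner is in phase Running
    workflowRunning bool // Placeholder-Workflow is in phase Running
}

func availableSlots(pairs []pairStatus, pendingRunnersWithoutPairs int) int {
    ready := 0
    for _, p := range pairs {
        if p.runnerRunning && p.workflowRunning { // BOTH must be Running
            ready++
        }
    }
    return ready - pendingRunnersWithoutPairs
}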
The flow for detecting and reporting new capacity:
- Monitor detects potential headroom (node added, job completed, etc.)
- Monitor creates placeholder pair(s) — one Placeholder-Runner + one Placeholder-Workflow (scheduled independently, no same-node requirement)
- Both placeholders stay Pending until the scheduler places them — if multiple scale sets compete for the same headroom, the scheduler picks one and the others remain Pending
- Once both placeholders in a pair reach Running, monitor counts the pair as one available slot and updates maxRunners
- If either placeholder in a pair does not reach Running within placeholderReadyTimeout, the monitor deletes the entire pair and does NOT report the capacity — it will retry on the next recalculation cycle
This design means capacity is never reported speculatively. Every slot reported to GitHub via X-ScaleSetMaxCapacity is backed by a confirmed resource reservation for both the runner and the workflow pod.
The monitor recalculates on every relevant event (node added/removed, pod scheduled/deleted, NodePool status change) and on a periodic fallback interval (e.g., 30 seconds).
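One plausible shape for that loop, combining informer events with a ticker fallback; Monitor, its fields, and recalculate are illustrative names:

// Run drives recalculation: any relevant cluster event triggers an
// immediate pass, and a ticker guarantees one at least every
// recalculateInterval even if an event is missed.
func (m *Monitor) Run(ctx context.Context) error {
    ticker := time.NewTicker(m.recalculateInterval) // fallback, e.g. 30s
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-m.events: // node/pod/NodePool change from informers
        case <-ticker.C: // periodic fallback
        }
        m.recalculate(ctx) // recount pairs, adjust placeholders, SetMaxRunners
    }
}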
Detailed Flow
Steady State (No Queued Jobs)
- Capacity monitor creates placeholder pod pairs (runner + workflow) up to the configured proactiveCapacity limit
- Karpenter sees pending placeholder pods, provisions nodes if needed (e.g., previous nodes were consolidated)
- Scheduler places both placeholders (independently, no co-location required) — both transition to Running
- If either placeholder in a pair does not reach Running within placeholderReadyTimeout, monitor deletes the entire pair (capacity unavailable) and retries on next cycle
- Monitor counts only complete pairs where both placeholders are Running and calls SetMaxRunners(current_runners + ready_pairs)
- Listener polls GitHub with the updated X-ScaleSetMaxCapacity
- GitHub sees this scale set can handle N jobs — ready for the next burst
Job Burst Arrives
- GitHub assigns M jobs to this scale set (M ≤ maxRunners)
- Listener receives TotalAssignedJobs = M in statistics
- Scaler patches EphemeralRunnerSet.spec.replicas = M
- EphemeralRunnerSet controller creates M EphemeralRunner CRs
- EphemeralRunner controller creates M runner pods (priority 0)
- Runner pod (priority 0) preempts a Placeholder-Runner (priority -10) on the cluster. Because placeholders use terminationGracePeriodSeconds: 0, resources are freed near-instantly.
- Placeholder-Workflow (priority 10) survives — runner pod priority (0) is too low to preempt it. Workflow resources remain protected in the cluster (the placeholder-workflow may be on a different node).
- Runner pod starts, registers with GitHub, picks up job
- Runner creates workflow pod (priority 20) via runner-container-hooks
- Workflow pod (priority 20) preempts Placeholder-Workflow (priority 10) — workflow resources are freed and immediately claimed by the workflow pod
- Both placeholders are gone. Runner + workflow are running on the capacity that was reserved for them (possibly on different nodes).
- Capacity monitor detects the consumed pair:
  - Decreases available count
  - Calls SetMaxRunners(current_runners + remaining_ready_pairs)
  - Creates new placeholder pairs to replenish proactive capacity
- Karpenter provisions new nodes for the new placeholders
- Next poll to GitHub reflects the updated capacity
Why split placeholders solve the resource protection gap: With a single combined placeholder, the runner pod preempts it entirely, freeing both runner and workflow resources at once. The runner consumes its portion, but the workflow resources sit unprotected in the cluster for seconds (GitHub registration, job pickup, hooks init) — any other pod can claim them. With split placeholders, the Placeholder-Workflow remains Running and holds the workflow resources until the actual workflow pod (higher priority) arrives to preempt it. The resources are never unprotected. Because placeholders are landing-place agnostic, the runner and workflow pods may end up on different nodes — what matters is that cluster-level capacity was reserved for both.
Why terminationGracePeriodSeconds: 0 is mandatory on both placeholder types: Without it, Kubernetes gives the placeholder 30 seconds to shut down before forcefully killing it. During those 30 seconds the resources are still held by the dying placeholder. With terminationGracePeriodSeconds: 0, the pause container is killed immediately and resources are freed for the preempting pod.
InsufficientCapacity (EC2 Exhaustion)
- Capacity monitor creates placeholder pods to trigger Karpenter provisioning
- Karpenter attempts to provision nodes but hits EC2 InsufficientInstanceCapacity
- Placeholder pods stay Pending indefinitely
- Capacity monitor detects: placeholder pods are not becoming Running within a timeout (e.g., 5 minutes)
- Monitor does NOT count pending placeholders as available capacity
- Calls SetMaxRunners(current_runners + ready_placeholders_only) — only capacity that's actually confirmed
- Next poll to GitHub reports reduced capacity
- GitHub stops assigning new jobs to this scale set
- New jobs flow to other clusters that still have capacity
- When EC2 capacity becomes available again, placeholders become Running, monitor increases maxRunners, jobs flow back
Job Claimed by Another Cluster
- Cluster A reports capacity, GitHub assigns jobs
- Cluster B also has capacity for the same runner labels, GitHub assigns some jobs there instead
- Cluster A's TotalAssignedJobs is lower than expected
- Scaler creates fewer runners than maxRunners
- Placeholder pods remain running (spare capacity)
- Karpenter's consolidation policy eventually reclaims underutilized nodes (placeholder pods are low-priority, easily evicted during consolidation)
- Monitor adjusts as nodes are consolidated
Scale to Zero
- All jobs complete, runners exit, EphemeralRunners are cleaned up
- Capacity monitor detects no active runners
- If proactiveCapacity > 0: maintains some placeholder pods to keep warm capacity for the next burst
- If proactiveCapacity == 0: deletes all placeholders, Karpenter consolidates nodes, SetMaxRunners(0)
- Cluster is fully scaled down but can report capacity again within one Karpenter provisioning cycle
Configuration
Runner Definition (modules/arc-runners/defs/*.yaml)
Each runner definition MUST set maxRunners — the absolute ceiling for that scale set. The capacity monitor will NEVER exceed this value, regardless of placeholder count or queued jobs.
runner:
  name: l-x86iavx512-8-16
  instance_type: c7a.48xlarge
  vcpu: 8
  memory: 16Gi
  gpu: 0
  disk_size: 150
  maxRunners: 100  # absolute ceiling — capacity monitor never exceeds this
The maxRunners value flows through the template into the Helm maxRunners field, which the ARC controller writes into the listener config. The capacity monitor reads it from config.MaxRunners and uses it as the ceiling for X-ScaleSetMaxCapacity.
Capacity-Aware Listener Config
New values added to the runner scale set Helm values (or listener config):
capacityAware:
  enabled: true

  # How many runner+workflow slots to proactively provision ahead of demand.
  # Default is 0 (disabled) — opt-in per scale set. Consider enabling for
  # runner types with frequent bursts where cold-start latency matters.
  proactiveCapacity: 0

  # How often to recalculate capacity (fallback; event-driven is primary)
  recalculateInterval: 30s

  # How long to wait for a placeholder to become Ready before considering
  # the capacity unavailable (InsufficientCapacity detection)
  placeholderReadyTimeout: 5m

  # Resource requirements for the workflow pod (runner resources come from
  # the pod template). This is the key input for placeholder sizing.
  # MUST be set to the MAXIMUM expected workflow resource requirements for
  # this runner type — not the average. A workflow that exceeds these
  # limits will fail to schedule even after preempting the placeholder.
  workflowResources:
    requests:
      cpu: "4"
      memory: "16Gi"
      # Optional: GPU requirements
      # nvidia.com/gpu: "1"

# PriorityClasses for placeholder pods (created automatically by the capacity monitor):
#   placeholder-runner: priority -10 (preempted by runner pods)
#   placeholder-workflow: priority 10 (survives runner creation, preempted by workflow pods)
Each scale set (runner type) has its own workflowResources because different runner types run different workloads. A CPU runner's workflow pod needs 4 CPU + 16Gi. A GPU runner's workflow pod needs 8 CPU + 64Gi + 1 GPU. The placeholder pods are sized accordingly.
HUD API Integration (Queued Jobs)
The capacity monitor queries the PyTorch HUD API to discover how many jobs are currently queued for this runner's labels, and pre-provisions placeholder pairs for them in addition to the static proactiveCapacity.
API endpoint: https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
Auth header: x-hud-internal-bot: <secret>
Secret: Stored as a Kubernetes Secret in the arc-systems namespace:
apiVersion: v1
kind: Secret
metadata:
  name: pytorch-hud-token
  namespace: arc-systems
type: Opaque
stringData:
  token: "<hud-internal-bot-secret>"
The listener pod mounts this secret as an environment variable HUD_API_TOKEN.
Response format:
interface QueuedJobsForRunner {
  runner_label: string;            // e.g., "mt-l-x86iavx512-8-16"
  org: string;
  repo: string;
  num_queued_jobs: number;
  min_queue_time_minutes: number;
  max_queue_time_minutes: number;
}
Label discovery: At startup, the capacity monitor calls client.GetRunnerScaleSet(ctx, scaleSetID) to retrieve the RunnerScaleSet.Labels array — these are the labels GitHub matches against runs-on:. The monitor then filters the HUD response to entries where runner_label matches any of the scale set's configured labels, and sums num_queued_jobs across all matching entries.
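A sketch of the query and label filter, assuming the endpoint, header, and response shape documented above; the package layout and helper names are ours:

// cmd/ghalistener/capacity/hud_client.go (sketch)
package capacity

import (
    "context"
    "encoding/json"
    "net/http"
)

const hudURL = "https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D"

// queuedJobsForLabels sums num_queued_jobs across HUD rows whose
// runner_label is one of this scale set's labels.
func queuedJobsForLabels(ctx context.Context, token string, labels map[string]bool) (int, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, hudURL, nil)
    if err != nil {
        return 0, err
    }
    req.Header.Set("x-hud-internal-bot", token) // from HUD_API_TOKEN

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    var rows []struct {
        RunnerLabel   string `json:"runner_label"`
        NumQueuedJobs int    `json:"num_queued_jobs"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&rows); err != nil {
        return 0, err
    }

    total := 0
    for _, row := range rows {
        if labels[row.RunnerLabel] { // only labels in RunnerScaleSet.Labels
            total += row.NumQueuedJobs
        }
    }
    return total, nil
}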
Capacity Formulas
Desired placeholder pairs (how many pairs the monitor tries to maintain):
desired_placeholder_pairs = proactiveCapacity + nbr_queued_jobs_runner
Where nbr_queued_jobs_runner is the sum of num_queued_jobs from the HUD API for all labels matching this scale set.
X-ScaleSetMaxCapacity (reported to GitHub on every poll):
X-ScaleSetMaxCapacity = min(total_running_jobs + running_placeholder_pairs, maxRunners)
Where:
total_running_jobs = runners currently executing jobs
running_placeholder_pairs = placeholder pairs where BOTH the runner and workflow placeholder pods are in Running state
maxRunners = the absolute ceiling from the Helm values (runner def YAML)
The maxRunners ceiling is NEVER exceeded. Even if the HUD API reports 500 queued jobs, if maxRunners is 100, the capacity monitor caps at 100. The desired placeholder pairs are also capped: desired_placeholder_pairs = min(desired_placeholder_pairs, maxRunners - total_running_jobs).
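The same formulas as straight-line Go, with both clamps explicit (function names are ours):

// desiredPairs: proactiveCapacity + queued jobs, clamped to the remaining
// headroom under the maxRunners ceiling and never negative.
func desiredPairs(proactiveCapacity, queuedJobs, runningJobs, maxRunners int) int {
    d := proactiveCapacity + queuedJobs
    if limit := maxRunners - runningJobs; d > limit {
        d = limit
    }
    if d < 0 {
        d = 0
    }
    return d
}

// reportedCapacity: what goes into X-ScaleSetMaxCapacity on the next poll.
func reportedCapacity(runningJobs, runningPairs, maxRunners int) int {
    c := runningJobs + runningPairs
    if c > maxRunners {
        c = maxRunners // the Helm ceiling is never exceeded
    }
    return c
}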
Resource Sizing for Placeholders
Each capacity slot creates two placeholder pods with separate resource requests:
placeholder-runner.requests = runner.requests
placeholder-workflow.requests = workflow.requests
Where:
runner.requests comes from the existing pod template in the AutoscalingRunnerSet (e.g., 750m CPU, 512Mi memory for a standard runner)
workflow.requests comes from the capacityAware.workflowResources config — this MUST be set to the maximum expected workflow resource requirements, not the average. Different GitHub Actions workflows on the same runner type may request different resources. If the placeholder is sized for the average and a heavy workflow arrives, the workflow pod won't fit on the node even after preempting the placeholder.
Example for a standard CPU runner:
- Placeholder-Runner: 750m CPU, 512Mi memory
- Placeholder-Workflow: 4 CPU, 16Gi memory
- Total per slot: 4750m CPU, 16.5Gi memory
Placeholder pods are scheduled independently (no same-node requirement). They reserve cluster-level capacity, not per-node capacity.
PriorityClass Setup
Four priority classes form the preemption ladder. The values are chosen so that each level only preempts the level(s) below it:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-runner
value: -10
globalDefault: false
description: "Runner placeholder — reserves runner resources, preempted by runner pods"
preemptionPolicy: Never
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: arc-runner
value: 0
globalDefault: false
description: "Runner pods — preempt runner placeholders, NOT workflow placeholders"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-workflow
value: 10
globalDefault: false
description: "Workflow placeholder — reserves workflow resources, survives runner creation, preempted by workflow pods"
preemptionPolicy: Never
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: arc-workflow
value: 20
globalDefault: false
description: "Workflow pods — preempt workflow placeholders"
Preemption guarantees:
- Runner pod (0) preempts Placeholder-Runner (-10) but NOT Placeholder-Workflow (10) — workflow resources stay protected
- Workflow pod (20) preempts Placeholder-Workflow (10) — workflow resources are released only when the real workflow pod needs them
- Neither placeholder type preempts anything (preemptionPolicy: Never)
What Gets Forked
| Component | Action | Reason |
|---|---|---|
| cmd/ghalistener/main.go | Fork | Add CapacityMonitor goroutine to the errgroup |
| cmd/ghalistener/capacity/ | New package | CapacityMonitor implementation, placeholder pod management, Karpenter NodePool queries |
| github.com/actions/scaleset | No change | Used as-is; SetMaxRunners() is the only integration point |
| controllers/actions.github.com/* | No change | All controllers run stock |
| ARC CRDs | No change | No schema changes |
| gha-runner-scale-set-controller chart | Minimal change | Override the listener container image to use the forked ghalistener binary. Chart published from https://github.com/jeanschmidt/actions-runner-controller.git master branch. |
| gha-runner-scale-set chart | No change | Runner pod templates stay the same |
Maintenance Burden
The fork surface is minimal — one binary entry point (~150 lines today) plus a new capacity/ package. On ARC upgrades:
- Check if cmd/ghalistener/main.go changed (the entry point wiring)
- Check if the listener.Scaler interface changed (unlikely — it's been stable)
- Check if listener.SetMaxRunners() still exists (it's the public API)
- Rebase the fork
The capacity/ package is entirely ours — no upstream merge conflicts possible.
Implementation Plan
Phase 1: Proof of Concept (validate the protocol) — COMPLETED
Goal: Answer the fundamental question — does GitHub respect dynamic X-ScaleSetMaxCapacity changes mid-session?
Result: YES — validated. GitHub re-reads X-ScaleSetMaxCapacity on every poll. Setting maxRunners=0 stops job assignment. Reducing capacity mid-burst redirects queued jobs to other clusters. Latency is one poll cycle (~5-10 seconds).
POC implementation (ConfigMap-based manual knob) has been removed. Phase 2 replaces it with the full automated system.
Phase 2: Production Placeholder System + HUD Integration
Goal: Implement the full capacity-aware listener with placeholder pods, HUD API integration for demand-driven scaling, and proper deployment infrastructure.
2.1 OSDC Infrastructure (osdc/ repo)
POC Cleanup:
- Remove scripts/python/capacity_setter.py (POC ConfigMap-based tool)
- Remove DYNAMIC_CAPACITY_CONFIGMAP env var from modules/arc-runners/templates/runner.yaml.tpl
- Remove capacity recipe from justfile
- Remove any tests for capacity_setter
Runner Definitions:
- Add maxRunners field to runner template (runner.yaml.tpl) — maps to the Helm maxRunners value
- Add maxRunners support to generate_runners.py
- Add maxRunners to all runner defs in modules/arc-runners/defs/*.yaml
Deploy Infrastructure:
- Add Harbor osdc project creation to modules/arc/deploy.sh (same pattern as harbor-cache-recovery)
- Update modules/arc/deploy.sh to pull the controller Helm chart from the published chart on https://github.com/jeanschmidt/actions-runner-controller.git master branch (not a local path)
PriorityClasses:
- Create Kubernetes manifests for the four priority classes (placeholder-runner, arc-runner, placeholder-workflow, arc-workflow)
- Deploy as part of the ARC module (applied before runner scale sets)
HUD API Secret:
- Document the K8s Secret creation for the user to run manually:
  kubectl create secret generic pytorch-hud-token \
    --namespace arc-systems \
    --from-literal=token='<hud-internal-bot-secret>'
- Add HUD_API_TOKEN env var to listener pod template (from the secret)
2.2 ARC Fork (actions-runner-controller repo)
POC Cleanup:
- Remove any ConfigMap watcher code from the forked listener
Capacity Monitor Package (cmd/ghalistener/capacity/):
The core implementation. A single goroutine that runs in the listener's errgroup:
- Label Discovery (labels.go):
  - At startup, call client.GetRunnerScaleSet(ctx, scaleSetID) to get RunnerScaleSet.Labels
  - Cache the labels for HUD API matching
  - The scale set's labels are already fetched in main.go — pass them to the monitor
- HUD API Client (hud_client.go):
  - HTTP GET to https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
  - Header: x-hud-internal-bot: <token> (from HUD_API_TOKEN env var)
  - Parse response as []QueuedJobsForRunner
  - Filter by matching runner_label against the scale set's labels
  - Sum num_queued_jobs for matching entries → nbr_queued_jobs_runner
  - Poll interval: same as recalculateInterval (default 30s)
- Placeholder Manager (placeholder.go):
  - Create placeholder pairs (Placeholder-Runner + Placeholder-Workflow) as Kubernetes pods
  - Image: public.ecr.aws/docker/library/alpine:3.21, command: ["sleep", "900"]
  - Owner reference: listener pod (auto-cleanup on listener death/restart)
  - terminationGracePeriodSeconds: 0
  - preemptionPolicy: Never on both placeholders
  - No pod affinity — placeholders are landing-place agnostic (cluster-level capacity reservation)
  - Node selector + tolerations: match the runner pod template
  - Track pair state: both Running = ready slot, either Pending past timeout = delete pair (see the sketch after this list)
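A sketch of that pair-state decision, assuming the client-go corev1 types the placeholder manager already imports; the function shape and state names are ours:

// pairState classifies one placeholder pair for the capacity calculation.
// Phases come from each pod's status.phase; createdAt from CreationTimestamp.
func pairState(runnerPhase, workflowPhase corev1.PodPhase, createdAt, now time.Time, readyTimeout time.Duration) string {
    switch {
    case runnerPhase == corev1.PodRunning && workflowPhase == corev1.PodRunning:
        return "ready" // counts as one available slot
    case now.Sub(createdAt) > readyTimeout:
        return "timed-out" // delete the whole pair; retry next cycle
    default:
        return "pending" // not yet confirmed by the scheduler; not counted
    }
}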
- Capacity Calculator (monitor.go):
  - Main reconciliation loop (event-driven + periodic fallback at recalculateInterval):

      nbr_queued = query_hud_api()
      desired_pairs = proactiveCapacity + nbr_queued
      desired_pairs = min(desired_pairs, maxRunners - total_running_jobs)
      desired_pairs = max(desired_pairs, 0)

      // Create or delete placeholder pairs to match desired count
      adjust_placeholders(desired_pairs)

      // Report capacity to GitHub
      running_pairs = count_running_placeholder_pairs()
      capacity = min(total_running_jobs + running_pairs, maxRunners)
      listener.SetMaxRunners(capacity)

  - Watch for pod events (placeholder state changes) to trigger immediate recalculation
  - Watch for EphemeralRunner changes (job started/completed) to adjust
- Integration (main.go changes):
  - Create CapacityMonitor in the errgroup alongside the listener
  - Pass: scaleset.Client, listener.SetMaxRunners, config.MaxRunners, scale set labels, K8s client
  - If capacityAware.enabled is false, skip monitor creation entirely (backward-compatible)
Unit Tests:
- Test capacity formula with various inputs (queued jobs, running jobs, maxRunners ceiling)
- Test HUD API response parsing and label matching
- Test placeholder pair lifecycle (creation, timeout, deletion)
- Mock K8s client for placeholder pod operations
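For instance, the ceiling behavior could be pinned with a table-driven test against a reportedCapacity helper like the sketch in the Capacity Formulas section (illustrative; lives in the capacity package with the standard testing import):

func TestReportedCapacityCeiling(t *testing.T) {
    cases := []struct {
        name                      string
        running, pairs, max, want int
    }{
        {"scale to zero", 0, 0, 100, 0},
        {"ec2 exhausted: only confirmed capacity", 18, 0, 100, 18},
        {"normal: running plus ready pairs", 10, 5, 100, 15},
        {"hud backlog cannot push past maxRunners", 90, 50, 100, 100},
    }
    for _, c := range cases {
        if got := reportedCapacity(c.running, c.pairs, c.max); got != c.want {
            t.Errorf("%s: got %d, want %d", c.name, got, c.want)
        }
    }
}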
2.3 Staging Validation
Deploy to staging and validate:
- Placeholder pods trigger Karpenter provisioning (no same-node requirement)
- Runner pods (priority 0) preempt Placeholder-Runner (-10) but NOT Placeholder-Workflow (10)
- Workflow pods (priority 20) preempt Placeholder-Workflow (10)
- Workflow resources remain protected between runner start and workflow pod creation
- maxRunners ceiling from Helm is never exceeded
- HUD API integration correctly discovers queued jobs per runner label
- Placeholder pairs scale up when jobs queue, scale down when jobs are assigned
- Placeholder timeout (15 min) works — pods self-terminate if not preempted
- Listener restart cleans up all placeholder pods (owner reference)
- InsufficientCapacity: placeholder stuck Pending past placeholderReadyTimeout → deleted, capacity not reported
Phase 3: Production Rollout
- Add Prometheus metrics for capacity monitoring (available slots, placeholder status, maxRunners changes, HUD API latency)
- Add Grafana dashboards
- Roll out to production clusters one at a time
- Tune proactiveCapacity per runner type based on observed burst patterns and HUD data
Phase 4: Multi-Cluster Optimization
- Validate multi-cluster behavior with dynamic maxRunners
- Consider adding cross-cluster capacity metrics (each cluster publishes its available capacity to a shared metric store)
- Tune consolidation delays to balance spare capacity cost vs. burst responsiveness
Risks and Mitigations
Risk: GitHub doesn't respect X-ScaleSetMaxCapacity dynamically
Likelihood: Low. The header is sent on every poll, and the protocol is designed for this.
Mitigation: Phase 1 PoC validates this before any significant investment.
Risk: Placeholder pod preemption race conditions
Scenario: Runner pod is created, preempts its placeholder, but the freed workflow resources are claimed by another pod before the workflow pod is created.
Mitigation: Solved by the split placeholder design. Each slot has two placeholders: Placeholder-Runner (priority -10) and Placeholder-Workflow (priority 10). When the runner pod (priority 0) is created, it preempts only Placeholder-Runner — Placeholder-Workflow survives because its priority (10) is higher than the runner's (0). The workflow resources remain protected until the actual workflow pod (priority 20) preempts Placeholder-Workflow. Both placeholder types MUST set terminationGracePeriodSeconds: 0 to ensure resources are freed instantly during preemption.
Risk: Multi-scale-set resource contention (double-counting headroom)
Scenario: Multiple runner types share the same NodePools. Two capacity monitors both see node headroom and report it as available.
Mitigation: This is solved by design. Capacity is never reported from headroom calculations — only from Running placeholder pods. When two monitors both detect headroom and create placeholders, the Kubernetes scheduler places one and the other stays Pending. The Pending placeholder times out, is deleted, and the monitor retries on the next cycle. The scheduler is the sole arbiter of resource contention; no cross-monitor coordination is needed.
Risk: Karpenter consolidation conflicts with placeholder pods
Scenario: Karpenter tries to consolidate nodes with placeholder pods while the capacity monitor is trying to maintain proactive capacity.
Mitigation: Placeholder pods do NOT use karpenter.sh/do-not-disrupt — that annotation blocks consolidation entirely with no TTL, which would cause idle nodes to accumulate indefinitely. Instead, placeholders are low-priority pods with preemptionPolicy: Never, which means Karpenter treats them as reschedulable during consolidation. When Karpenter evicts a placeholder, the capacity monitor detects the pod leaving Running state, decreases maxRunners, and creates a new placeholder pair on the next cycle (which may land on a different, more efficient node). This is the desired behavior — Karpenter optimizes node utilization, the capacity monitor reacts and re-provisions.
Risk: Cost of spare capacity
Scenario: Clusters maintain placeholder pods (and thus nodes) that never get used.
Mitigation: proactiveCapacity defaults to 0 (disabled) and is opt-in per scale set. Enable it only for runner types with frequent bursts where cold-start latency matters. Tune the value based on observed burst patterns for each runner type. Karpenter consolidation eventually reclaims unused capacity if no jobs arrive.
Risk: Fork maintenance burden
Likelihood: Low. The fork is ~200 lines of new code in a single binary. The scaleset package interface is stable (three methods, unchanged since v0.2.0).
Mitigation: The forked binary is a thin wrapper. ARC controller upgrades don't affect it. Only changes to the scaleset package's Scaler interface or SetMaxRunners() API would require fork updates.
Risk: HUD API unavailability
Scenario: The PyTorch HUD API at hud.pytorch.org is down, slow, or returns errors. The capacity monitor cannot determine queued job counts.
Mitigation: The HUD API is a best-effort enhancement. If the API is unavailable, the capacity monitor falls back to proactiveCapacity only (ignoring the nbr_queued_jobs_runner component). The monitor logs the failure but does not reduce capacity or stop functioning. Placeholder pairs based on static proactiveCapacity continue to work regardless of HUD API status.
Risk: JobAssigned messages are ignored
Context: The listener currently ignores JobAssigned messages (parses them but doesn't pass to the scaler). These messages contain per-job metadata (RequestLabels, RepositoryName, JobID, etc.) that could be useful for smarter capacity decisions.
Opportunity: A future enhancement could process JobAssigned messages to make per-job capacity decisions — e.g., "this job needs a GPU node, do I have GPU capacity?"
Protocol Reference
Key Types (from github.com/actions/scaleset)
// Statistics sent with every message
type RunnerScaleSetStatistic struct {
    TotalAvailableJobs     int
    TotalAcquiredJobs      int
    TotalAssignedJobs      int // <-- this is what drives scaling
    TotalRunningJobs       int
    TotalRegisteredRunners int
    TotalBusyRunners       int
    TotalIdleRunners       int
}

// Message types
const (
    MessageTypeJobAssigned  = "JobAssigned"  // ignored by current listener
    MessageTypeJobStarted   = "JobStarted"   // runner picked up job
    MessageTypeJobCompleted = "JobCompleted" // runner finished job
)

// Per-job metadata (available in all message types)
type JobMessageBase struct {
    RunnerRequestID    int64
    RepositoryName     string
    OwnerName          string
    JobID              int64
    JobWorkflowRef     string
    JobDisplayName     string
    WorkflowRunID      int64
    EventName          string
    RequestLabels      []string
    QueueTime          time.Time
    ScaleSetAssignTime time.Time
    RunnerAssignTime   time.Time
    FinishTime         time.Time
}
Listener Event Loop (from github.com/actions/scaleset/listener)
loop:
  1. GetMessage(ctx, lastMessageID, maxRunners)
     → HTTP GET to messageQueueURL
     → Header: X-ScaleSetMaxCapacity = maxRunners
     → Header: Authorization = Bearer <messageQueueAccessToken>

  2a. HTTP 202 (no messages):
      → call scaler.HandleDesiredRunnerCount(ctx, latestStatistics.TotalAssignedJobs)
      → loop

  2b. HTTP 200 (message batch):
      → parse message (Statistics + JobAssigned/Started/Completed arrays)
      → DELETE messageQueueURL/{messageID} (ACK — all-or-nothing)
      → call scaler.HandleJobStarted() for each started job
      → call scaler.HandleJobCompleted() for each completed job
      → call scaler.HandleDesiredRunnerCount(ctx, msg.Statistics.TotalAssignedJobs)
      → loop

  2c. HTTP 401 (token expired):
      → PATCH sessions/{sessionId} to refresh token
      → retry
ARC Controller Chain (unchanged)
AutoscalingRunnerSet (Helm-created CR)
  ↓ controller creates
AutoscalingListener (CR + listener pod in arc-systems)
  ↓ listener patches
EphemeralRunnerSet.spec.replicas = N
  ↓ controller creates N
EphemeralRunner (one per runner)
  ↓ controller creates
Pod (runner pod, from template in AutoscalingRunnerSet)
  ↓ runner creates (via runner-container-hooks)
Pod (workflow pod, in arc-runners namespace)
Source Code References
| File | What |
|---|---|
| actions/scaleset/listener/listener.go | Listener struct, Scaler interface, Run() loop, SetMaxRunners() |
| actions/scaleset/session_client.go | GetMessage (long-poll with X-ScaleSetMaxCapacity), DeleteMessage (ACK) |
| actions/scaleset/client.go | REST API client, authentication, JIT config generation |
| actions/scaleset/types.go | Protocol types — RunnerScaleSetStatistic, message types, RunnerScaleSet |
| actions/scaleset/examples/dockerscaleset/scaler.go | Reference Scaler implementation showing capacity capping |
| actions-runner-controller/cmd/ghalistener/main.go | Listener binary entry point (the file we fork) |
| actions-runner-controller/cmd/ghalistener/scaler/scaler.go | Current Scaler implementation (patches EphemeralRunnerSet) |
| actions-runner-controller/controllers/actions.github.com/ephemeralrunnerset_controller.go | EphemeralRunnerSet reconciler (creates EphemeralRunner CRs) |
| actions-runner-controller/controllers/actions.github.com/ephemeralrunner_controller.go | EphemeralRunner reconciler (creates runner pods) |
| actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go | AutoscalingRunnerSet reconciler (manages listener + EphemeralRunnerSet) |
| actions-runner-controller/controllers/actions.github.com/resourcebuilder.go | Pod spec construction for listener and runner pods |