Proactive Capacity-Aware ARC Autoscaling #499

@jeanschmidt

Description

Problem Statement

ARC's kubernetes and kubernetes-novolume container modes have a fundamental architectural flaw: the workflow pod is created by the runner pod. At the time Kubernetes schedules the runner pod, neither the scheduler nor Karpenter has any awareness of the resources required for the workflow pod.

This creates two failure modes:

  1. Delayed execution: Runner pod starts, picks up a job, creates the workflow pod — but no node has capacity for it. Karpenter must provision a new node reactively, adding minutes of delay while the job is "running" but doing nothing.
  2. Hard failure (InsufficientCapacity): Runner pod starts and claims a job from GitHub, but the workflow pod can never be scheduled — e.g., AWS has no capacity for the required instance type. The job is claimed but cannot run. It doesn't return to the GitHub queue; it's stuck until the 24-hour timeout.

The second failure mode is the critical one. A job that stays queued on GitHub can be picked up by another cluster. A job that's claimed but unrunnable is dead weight.

Root Cause

ARC's autoscaling is count-based and capacity-unaware:

  1. The listener receives TotalAssignedJobs from GitHub's Actions Service
  2. It patches EphemeralRunnerSet.spec.replicas to match
  3. The EphemeralRunnerSet controller creates EphemeralRunner CRs
  4. The EphemeralRunner controller registers with GitHub and creates runner pods
  5. Runner pods pick up jobs and create workflow pods

At no point does any component check whether the cluster can actually fit the runner + workflow pod pair. The Kubernetes scheduler and Karpenter only react once pods exist — there is no forward-looking capacity assessment.

Solution: Capacity-Aware Listener with Proactive Provisioning

Key Insight: X-ScaleSetMaxCapacity

The GitHub Actions Service protocol already has a capacity signaling mechanism. On every long-poll GetMessage() request, the listener sends an X-ScaleSetMaxCapacity header:

// github.com/actions/scaleset/session_client.go
req.Header.Set("X-ScaleSetMaxCapacity", strconv.Itoa(maxCapacity))

GitHub only assigns jobs to a scale set up to its reported capacity. Today, maxRunners is static — set once from Helm values. The fix is to make it dynamic: a capacity monitor adjusts maxRunners in real time based on actual cluster capacity.

The listener exposes a thread-safe setter for this:

// github.com/actions/scaleset/listener/listener.go
func (l *Listener) SetMaxRunners(max uint32) {
    l.maxRunners.Store(max)  // atomic, safe to call from any goroutine
}

The next GetMessage() poll automatically uses the updated value. No protocol changes, no controller changes, no CRD changes.

Strategy: Optimistic with Placeholder Pods (Strategy B)

Each cluster proactively provisions capacity ahead of demand using low-priority placeholder pods. It reports its full provisionable capacity to GitHub, greedily claiming as many jobs as it can fit. If another cluster claims the jobs first, the spare capacity sits idle until Karpenter consolidates it.

Multi-Cluster Behavior

Cluster A capacity monitor:
  "I have 20 available runner+workflow slots"
  → SetMaxRunners(20)
  → GitHub assigns up to 20 jobs to Cluster A

Cluster B capacity monitor:
  "I have 15 available runner+workflow slots"
  → SetMaxRunners(15)
  → GitHub assigns up to 15 jobs to Cluster B

Cluster A hits InsufficientCapacity (EC2 capacity exhausted):
  "I now have 0 provisionable slots, 18 running"
  → SetMaxRunners(18)
  → New jobs flow to Cluster B (or any other cluster with capacity)

Each cluster is greedy — it advertises its full capacity. GitHub distributes jobs across clusters based on their reported maximums. When a cluster can't provision more nodes, it lowers its maximum, and jobs naturally flow to clusters that still have capacity.

Architecture

Components

1. Forked ghalistener Binary (the only fork required)

The ghalistener binary is the listener process that runs as a pod per AutoscalingRunnerSet. It lives at cmd/ghalistener/main.go in the ARC repo — roughly 150 lines. Today it wires up:

  • A scaleset.Client (GitHub Actions Service REST client)
  • A listener.Listener (long-poll event loop, from github.com/actions/scaleset/listener)
  • A scaler.Scaler (patches EphemeralRunnerSet replicas via K8s API)

The fork adds one new component: a CapacityMonitor goroutine that runs alongside the listener in the same errgroup. It queries cluster state and dynamically calls listener.SetMaxRunners().

Nothing else is forked. The ARC controllers (AutoscalingRunnerSet, EphemeralRunnerSet, EphemeralRunner) run stock. The CRDs are unchanged. The Helm charts need only a container image override for the listener.

2. Capacity Monitor

A goroutine inside the forked ghalistener binary. Responsibilities:

  • Watch Karpenter NodePools: query limits (CPU, memory, GPU budgets) and current usage to decide when to create new placeholder pods
  • Watch EphemeralRunner CRs: track current runner count and their states (pending, running)
  • Watch placeholder pod status: detect when placeholder pairs (runner + workflow) both reach Running (confirmed reservation) or remain Pending past timeout (capacity unavailable)
  • Calculate available slots: available = ready_pair_count (only pairs where both placeholders are Running are counted; the scheduler is the sole arbiter of resource availability)
  • Update maxRunners: call listener.SetMaxRunners(current_runners + available) whenever capacity changes
  • Manage placeholder pair lifecycle: create placeholder pairs to claim potential headroom, delete timed-out pairs (if either placeholder stays Pending), and replenish pairs as runners preempt them

3. Placeholder Pods (Split Runner + Workflow)

Each capacity slot is reserved by two placeholder pods — one for the runner and one for the workflow. This split solves a critical race condition: without it, after the runner pod preempts a single combined placeholder, the freed workflow resources are unprotected until the workflow pod is created (seconds to tens of seconds later — GitHub registration, job pickup, hooks init). Any other pod in the cluster could claim those resources, leaving the workflow pod Pending and the job stuck.

With split placeholders, the workflow placeholder (priority 10) survives runner pod creation (priority 0) and continues to protect the workflow resources until the actual workflow pod (priority 20) preempts it.

Per slot, the capacity monitor creates:

Pod                    Resource requests   PriorityClass          Priority   Preempted by
Placeholder-Runner     runner.requests     placeholder-runner     -10        Runner pod (0)
Placeholder-Workflow   workflow.requests   placeholder-workflow    10        Workflow pod (20)

Placeholder pods are landing-place agnostic — they do NOT require same-node affinity. Each placeholder is scheduled independently by the Kubernetes scheduler. The purpose is to reserve cluster-level capacity (total CPU, memory, GPU across the cluster), not to guarantee specific node co-location. Karpenter provisions nodes based on pending pods regardless of where placeholders land.

Shared properties (both placeholder types):

  • Node placement: Matches the same nodeSelector and tolerations as the runner pods, ensuring placeholders trigger provisioning of the correct instance types.
  • Labels: Clearly labeled with the scale set name, slot ID, role (placeholder-runner or placeholder-workflow), and a TTL annotation for cleanup.
  • Lightweight image: public.ecr.aws/docker/library/alpine:3.21 with command: ["sleep", "900"] — same Alpine used by other OSDC DaemonSets (already cached on nodes). The 15-minute sleep acts as a safety timeout: if nothing preempts or deletes the placeholder, it self-terminates to prevent resource leaks.
  • terminationGracePeriodSeconds: 0: Ensures preemption frees resources immediately — the default 30s grace period would delay scheduling.
  • preemptionPolicy: Never: Placeholders never preempt other pods.
  • Owner reference: Owned by the listener pod, so they're cleaned up automatically if the listener is deleted.
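Combining the properties above, a Placeholder-Workflow manifest could look like the following sketch (the pod name, label keys, and owner-reference placeholders are illustrative; the exact keys are not specified here):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: placeholder-workflow-slot-0     # illustrative naming
  labels:
    scale-set: l-x86iavx512-8-16        # illustrative label keys
    slot-id: "0"
    role: placeholder-workflow
  ownerReferences:                      # owned by the listener pod for auto-cleanup
    - apiVersion: v1
      kind: Pod
      name: <listener-pod-name>
      uid: <listener-pod-uid>
spec:
  priorityClassName: placeholder-workflow   # priority 10
  preemptionPolicy: Never
  terminationGracePeriodSeconds: 0
  containers:
    - name: pause
      image: public.ecr.aws/docker/library/alpine:3.21
      command: ["sleep", "900"]         # 15-minute safety timeout
      resources:
        requests:                       # from capacityAware.workflowResources
          cpu: "4"
          memory: 16Gi
  nodeSelector: {}                      # match the runner pod template's nodeSelector
  tolerations: []                       # match the runner pod template's tolerations
```

The Placeholder-Runner manifest is identical except for its name, role label, priorityClassName (placeholder-runner), and resource requests (runner.requests).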

Priority ladder:

Priority 20: Workflow pod      — preempts Placeholder-Workflow (10)
Priority 10: Placeholder-Workflow — survives Runner pod creation, protects workflow resources
Priority  0: Runner pod        — preempts Placeholder-Runner (-10), does NOT preempt Placeholder-Workflow (10)
Priority -10: Placeholder-Runner  — lowest priority, preempted first

Preemption sequence during job execution:

  1. Runner pod (priority 0) is created → preempts Placeholder-Runner (priority -10) → runner starts
  2. Placeholder-Workflow (priority 10) remains Running — its resources are protected
  3. Runner registers with GitHub, picks up job, calls runner-container-hooks
  4. Workflow pod (priority 20) is created → preempts Placeholder-Workflow (priority 10) → workflow starts
  5. Both placeholders are gone, runner + workflow are running on the reserved capacity

4. Capacity Calculation

The capacity monitor must answer: "How many runner+workflow pairs can this cluster guarantee right now?"

Placeholder pod pairs are the sole source of truth for available capacity. The monitor does NOT calculate node headroom directly — instead, it creates placeholder pairs (runner + workflow) and waits for the Kubernetes scheduler to confirm the reservation by transitioning both to Running. This eliminates double-counting across scale sets: the scheduler is the arbiter of resource contention, and a Running pair is proof that the resources are committed.

capacity = ready_pair_count - pending_runners_without_pairs

Where:

  • ready_pair_count = number of placeholder pairs where BOTH the runner and workflow placeholder are in Running state
  • pending_runners_without_pairs = runners that have been created but don't yet have a corresponding placeholder pair to preempt
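A minimal sketch of this calculation in Go (the function name and the clamp at zero are illustrative assumptions, not ARC code):

```go
package main

import "fmt"

// availableSlots implements capacity = ready_pair_count - pending_runners_without_pairs.
// Pending runners that will preempt a pair consume reserved capacity, so they are
// subtracted; the result is clamped at zero (an assumed safeguard).
func availableSlots(readyPairs, pendingRunnersWithoutPairs int) int {
	available := readyPairs - pendingRunnersWithoutPairs
	if available < 0 {
		available = 0
	}
	return available
}

func main() {
	// 5 confirmed pairs, 2 already spoken for by pending runners:
	fmt.Println(availableSlots(5, 2)) // prints 3
}
```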

The flow for detecting and reporting new capacity:

  1. Monitor detects potential headroom (node added, job completed, etc.)
  2. Monitor creates placeholder pair(s) — one Placeholder-Runner + one Placeholder-Workflow (scheduled independently, no same-node requirement)
  3. Both placeholders stay Pending until the scheduler places them — if multiple scale sets compete for the same headroom, the scheduler picks one and the others remain Pending
  4. Once both placeholders in a pair reach Running, monitor counts the pair as one available slot and updates maxRunners
  5. If either placeholder in a pair does not reach Running within placeholderReadyTimeout, the monitor deletes the entire pair and does NOT report the capacity — it will retry on the next recalculation cycle

This design means capacity is never reported speculatively. Every slot reported to GitHub via X-ScaleSetMaxCapacity is backed by a confirmed resource reservation for both the runner and the workflow pod.

The monitor recalculates on every relevant event (node added/removed, pod scheduled/deleted, NodePool status change) and on a periodic fallback interval (e.g., 30 seconds).

Detailed Flow

Steady State (No Queued Jobs)

  1. Capacity monitor creates placeholder pod pairs (runner + workflow) up to the configured proactiveCapacity limit
  2. Karpenter sees pending placeholder pods, provisions nodes if needed (e.g., previous nodes were consolidated)
  3. Scheduler places both placeholders (independently, no co-location required) — both transition to Running
  4. If either placeholder in a pair does not reach Running within placeholderReadyTimeout, monitor deletes the entire pair (capacity unavailable) and retries on next cycle
  5. Monitor counts only complete pairs where both placeholders are Running and calls SetMaxRunners(current_runners + ready_pairs)
  6. Listener polls GitHub with the updated X-ScaleSetMaxCapacity
  7. GitHub sees this scale set can handle N jobs — ready for the next burst

Job Burst Arrives

  1. GitHub assigns M jobs to this scale set (M ≤ maxRunners)
  2. Listener receives TotalAssignedJobs = M in statistics
  3. Scaler patches EphemeralRunnerSet.spec.replicas = M
  4. EphemeralRunnerSet controller creates M EphemeralRunner CRs
  5. EphemeralRunner controller creates M runner pods (priority 0)
  6. Runner pod (priority 0) preempts a Placeholder-Runner (priority -10) on the cluster. Because placeholders use terminationGracePeriodSeconds: 0, resources are freed near-instantly.
  7. Placeholder-Workflow (priority 10) survives — runner pod priority (0) is too low to preempt it. Workflow resources remain protected in the cluster (the placeholder-workflow may be on a different node).
  8. Runner pod starts, registers with GitHub, picks up job
  9. Runner creates workflow pod (priority 20) via runner-container-hooks
  10. Workflow pod (priority 20) preempts Placeholder-Workflow (priority 10) — workflow resources are freed and immediately claimed by the workflow pod
  11. Both placeholders are gone. Runner + workflow are running on the capacity that was reserved for them (possibly on different nodes).
  12. Capacity monitor detects the consumed pair:
    • Decreases available count
    • Calls SetMaxRunners(current_runners + remaining_ready_pairs)
    • Creates new placeholder pairs to replenish proactive capacity
    • Karpenter provisions new nodes for the new placeholders
  13. Next poll to GitHub reflects the updated capacity

Why split placeholders solve the resource protection gap: With a single combined placeholder, the runner pod preempts it entirely, freeing both runner and workflow resources at once. The runner consumes its portion, but the workflow resources sit unprotected in the cluster for seconds (GitHub registration, job pickup, hooks init) — any other pod can claim them. With split placeholders, the Placeholder-Workflow remains Running and holds the workflow resources until the actual workflow pod (higher priority) arrives to preempt it. The resources are never unprotected. Because placeholders are landing-place agnostic, the runner and workflow pods may end up on different nodes — what matters is that cluster-level capacity was reserved for both.

Why terminationGracePeriodSeconds: 0 is mandatory on both placeholder types: Without it, Kubernetes gives the placeholder 30 seconds to shut down before forcefully killing it. During those 30 seconds the resources are still held by the dying placeholder. With terminationGracePeriodSeconds: 0, the pause container is killed immediately and resources are freed for the preempting pod.

InsufficientCapacity (EC2 Exhaustion)

  1. Capacity monitor creates placeholder pods to trigger Karpenter provisioning
  2. Karpenter attempts to provision nodes but hits EC2 InsufficientInstanceCapacity
  3. Placeholder pods stay Pending indefinitely
  4. Capacity monitor detects: placeholder pods are not becoming Running within a timeout (e.g., 5 minutes)
  5. Monitor does NOT count pending placeholders as available capacity
  6. Calls SetMaxRunners(current_runners + ready_placeholders_only) — only capacity that's actually confirmed
  7. Next poll to GitHub reports reduced capacity
  8. GitHub stops assigning new jobs to this scale set
  9. New jobs flow to other clusters that still have capacity
  10. When EC2 capacity becomes available again, placeholders become Running, monitor increases maxRunners, jobs flow back

Job Claimed by Another Cluster

  1. Cluster A reports capacity, GitHub assigns jobs
  2. Cluster B also has capacity for the same runner labels, GitHub assigns some jobs there instead
  3. Cluster A's TotalAssignedJobs is lower than expected
  4. Scaler creates fewer runners than maxRunners
  5. Placeholder pods remain running (spare capacity)
  6. Karpenter's consolidation policy eventually reclaims underutilized nodes (placeholder pods are low-priority, easily evicted during consolidation)
  7. Monitor adjusts as nodes are consolidated

Scale to Zero

  1. All jobs complete, runners exit, EphemeralRunners are cleaned up
  2. Capacity monitor detects no active runners
  3. If proactiveCapacity > 0: maintains some placeholder pods to keep warm capacity for the next burst
  4. If proactiveCapacity == 0: deletes all placeholders, Karpenter consolidates nodes, SetMaxRunners(0)
  5. Cluster is fully scaled down but can report capacity again within one Karpenter provisioning cycle

Configuration

Runner Definition (modules/arc-runners/defs/*.yaml)

Each runner definition MUST set maxRunners — the absolute ceiling for that scale set. The capacity monitor will NEVER exceed this value, regardless of placeholder count or queued jobs.

runner:
  name: l-x86iavx512-8-16
  instance_type: c7a.48xlarge
  vcpu: 8
  memory: 16Gi
  gpu: 0
  disk_size: 150
  maxRunners: 100  # absolute ceiling — capacity monitor never exceeds this

The maxRunners value flows through the template into the Helm maxRunners field, which the ARC controller writes into the listener config. The capacity monitor reads it from config.MaxRunners and uses it as the ceiling for X-ScaleSetMaxCapacity.

Capacity-Aware Listener Config

New values added to the runner scale set Helm values (or listener config):

capacityAware:
  enabled: true
  # How many runner+workflow slots to proactively provision ahead of demand.
  # Default is 0 (disabled) — opt-in per scale set. Consider enabling for
  # runner types with frequent bursts where cold-start latency matters.
  proactiveCapacity: 0
  # How often to recalculate capacity (fallback; event-driven is primary)
  recalculateInterval: 30s
  # How long to wait for a placeholder to become Ready before considering
  # the capacity unavailable (InsufficientCapacity detection)
  placeholderReadyTimeout: 5m
  # Resource requirements for the workflow pod (runner resources come from
  # the pod template). This is the key input for placeholder sizing.
  # MUST be set to the MAXIMUM expected workflow resource requirements for
  # this runner type — not the average. A workflow that exceeds these
  # limits will fail to schedule even after preempting the placeholder.
  workflowResources:
    requests:
      cpu: "4"
      memory: "16Gi"
    # Optional: GPU requirements
    # nvidia.com/gpu: "1"
  # PriorityClasses for placeholder pods (created automatically by the capacity monitor)
  # placeholder-runner: priority -10 (preempted by runner pods)
  # placeholder-workflow: priority 10 (survives runner creation, preempted by workflow pods)

Each scale set (runner type) has its own workflowResources because different runner types run different workloads. A CPU runner's workflow pod needs 4 CPU + 16Gi. A GPU runner's workflow pod needs 8 CPU + 64Gi + 1 GPU. The placeholder pods are sized accordingly.

HUD API Integration (Queued Jobs)

The capacity monitor queries the PyTorch HUD API to discover how many jobs are currently queued for this runner's labels, and pre-provisions placeholder pairs for them in addition to the static proactiveCapacity.

API endpoint: https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
Auth header: x-hud-internal-bot: <secret>
Secret: Stored as a Kubernetes Secret in the arc-systems namespace:

apiVersion: v1
kind: Secret
metadata:
  name: pytorch-hud-token
  namespace: arc-systems
type: Opaque
stringData:
  token: "<hud-internal-bot-secret>"

The listener pod mounts this secret as an environment variable HUD_API_TOKEN.

Response format:

interface QueuedJobsForRunner {
  runner_label: string;   // e.g., "mt-l-x86iavx512-8-16"
  org: string;
  repo: string;
  num_queued_jobs: number;
  min_queue_time_minutes: number;
  max_queue_time_minutes: number;
}

Label discovery: At startup, the capacity monitor calls client.GetRunnerScaleSet(ctx, scaleSetID) to retrieve the RunnerScaleSet.Labels array — these are the labels GitHub matches against runs-on:. The monitor then filters the HUD response to entries where runner_label matches any of the scale set's configured labels, and sums num_queued_jobs across all matching entries.
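A sketch of the response parsing and label matching in Go (the struct tags follow the JSON keys documented above; sumQueuedJobs is an illustrative name, not actual monitor code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// QueuedJobsForRunner mirrors the HUD API response entry documented above.
type QueuedJobsForRunner struct {
	RunnerLabel         string  `json:"runner_label"`
	Org                 string  `json:"org"`
	Repo                string  `json:"repo"`
	NumQueuedJobs       int     `json:"num_queued_jobs"`
	MinQueueTimeMinutes float64 `json:"min_queue_time_minutes"`
	MaxQueueTimeMinutes float64 `json:"max_queue_time_minutes"`
}

// sumQueuedJobs filters HUD entries to those whose runner_label matches any of
// the scale set's labels and sums num_queued_jobs across the matching entries.
func sumQueuedJobs(entries []QueuedJobsForRunner, scaleSetLabels []string) int {
	labels := make(map[string]bool, len(scaleSetLabels))
	for _, l := range scaleSetLabels {
		labels[l] = true
	}
	total := 0
	for _, e := range entries {
		if labels[e.RunnerLabel] {
			total += e.NumQueuedJobs
		}
	}
	return total
}

func main() {
	raw := `[{"runner_label":"mt-l-x86iavx512-8-16","org":"pytorch","repo":"pytorch",
	          "num_queued_jobs":7,"min_queue_time_minutes":1,"max_queue_time_minutes":12},
	         {"runner_label":"other-label","org":"pytorch","repo":"pytorch",
	          "num_queued_jobs":3,"min_queue_time_minutes":2,"max_queue_time_minutes":5}]`
	var entries []QueuedJobsForRunner
	if err := json.Unmarshal([]byte(raw), &entries); err != nil {
		panic(err)
	}
	fmt.Println(sumQueuedJobs(entries, []string{"mt-l-x86iavx512-8-16"})) // prints 7
}
```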

Capacity Formulas

Desired placeholder pairs (how many pairs the monitor tries to maintain):

desired_placeholder_pairs = proactiveCapacity + nbr_queued_jobs_runner

Where nbr_queued_jobs_runner is the sum of num_queued_jobs from the HUD API for all labels matching this scale set.

X-ScaleSetMaxCapacity (reported to GitHub on every poll):

X-ScaleSetMaxCapacity = min(total_running_jobs + running_placeholder_pairs, maxRunners)

Where:

  • total_running_jobs = runners currently executing jobs
  • running_placeholder_pairs = placeholder pairs where BOTH the runner and workflow placeholder pods are in Running state
  • maxRunners = the absolute ceiling from the Helm values (runner def YAML)

The maxRunners ceiling is NEVER exceeded. Even if the HUD API reports 500 queued jobs, if maxRunners is 100, the capacity monitor caps at 100. The desired placeholder pairs are also capped: desired_placeholder_pairs = min(desired_placeholder_pairs, maxRunners - total_running_jobs).
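The two formulas, with the ceiling applied, can be sketched in Go (illustrative function names, not actual monitor code):

```go
package main

import "fmt"

// desiredPlaceholderPairs = min(proactiveCapacity + queuedJobs, maxRunners - runningJobs),
// floored at zero. queuedJobs is nbr_queued_jobs_runner from the HUD API.
func desiredPlaceholderPairs(proactiveCapacity, queuedJobs, maxRunners, runningJobs int) int {
	d := proactiveCapacity + queuedJobs
	if headroom := maxRunners - runningJobs; d > headroom {
		d = headroom
	}
	if d < 0 {
		d = 0
	}
	return d
}

// reportedCapacity = min(runningJobs + runningPlaceholderPairs, maxRunners):
// the X-ScaleSetMaxCapacity value sent to GitHub on every poll.
func reportedCapacity(runningJobs, runningPairs, maxRunners int) int {
	c := runningJobs + runningPairs
	if c > maxRunners {
		c = maxRunners
	}
	return c
}

func main() {
	// HUD reports 500 queued jobs, but maxRunners is 100 and 18 jobs are running:
	fmt.Println(desiredPlaceholderPairs(5, 500, 100, 18)) // prints 82
	fmt.Println(reportedCapacity(18, 82, 100))            // prints 100
}
```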

Resource Sizing for Placeholders

Each capacity slot creates two placeholder pods with separate resource requests:

placeholder-runner.requests   = runner.requests
placeholder-workflow.requests = workflow.requests

Where:

  • runner.requests comes from the existing pod template in the AutoscalingRunnerSet (e.g., 750m CPU, 512Mi memory for a standard runner)
  • workflow.requests comes from the capacityAware.workflowResources config — this MUST be set to the maximum expected workflow resource requirements, not the average. Different GitHub Actions workflows on the same runner type may request different resources. If the placeholder is sized for the average and a heavy workflow arrives, the workflow pod won't fit on the node even after preempting the placeholder.

Example for a standard CPU runner:

  • Placeholder-Runner: 750m CPU, 512Mi memory
  • Placeholder-Workflow: 4 CPU, 16Gi memory
  • Total per slot: 4750m CPU, 16.5Gi memory

Placeholder pods are scheduled independently (no same-node requirement). They reserve cluster-level capacity, not per-node capacity.

PriorityClass Setup

Four priority classes form the preemption ladder. The values are chosen so that each level only preempts the level(s) below it:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-runner
value: -10
globalDefault: false
description: "Runner placeholder — reserves runner resources, preempted by runner pods"
preemptionPolicy: Never

---

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: arc-runner
value: 0
globalDefault: false
description: "Runner pods — preempt runner placeholders, NOT workflow placeholders"

---

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-workflow
value: 10
globalDefault: false
description: "Workflow placeholder — reserves workflow resources, survives runner creation, preempted by workflow pods"
preemptionPolicy: Never

---

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: arc-workflow
value: 20
globalDefault: false
description: "Workflow pods — preempt workflow placeholders"

Preemption guarantees:

  • Runner pod (0) preempts Placeholder-Runner (-10) but NOT Placeholder-Workflow (10) — workflow resources stay protected
  • Workflow pod (20) preempts Placeholder-Workflow (10) — workflow resources are released only when the real workflow pod needs them
  • Neither placeholder type preempts anything (preemptionPolicy: Never)

What Gets Forked

Component                               Action           Reason
cmd/ghalistener/main.go                 Fork             Add CapacityMonitor goroutine to the errgroup
cmd/ghalistener/capacity/               New package      CapacityMonitor implementation, placeholder pod management, Karpenter NodePool queries
github.com/actions/scaleset             No change        Used as-is; SetMaxRunners() is the only integration point
controllers/actions.github.com/*        No change        All controllers run stock
ARC CRDs                                No change        No schema changes
gha-runner-scale-set-controller chart   Minimal change   Override the listener container image with the forked ghalistener binary; chart published from https://github.com/jeanschmidt/actions-runner-controller.git master branch
gha-runner-scale-set chart              No change        Runner pod templates stay the same

Maintenance Burden

The fork surface is minimal — one binary entry point (~150 lines today) plus a new capacity/ package. On ARC upgrades:

  1. Check if cmd/ghalistener/main.go changed (the entry point wiring)
  2. Check if the listener.Scaler interface changed (unlikely — it's been stable)
  3. Check if listener.SetMaxRunners() still exists (it's the public API)
  4. Rebase the fork

The capacity/ package is entirely ours — no upstream merge conflicts possible.

Implementation Plan

Phase 1: Proof of Concept (validate the protocol) — COMPLETED

Goal: Answer the fundamental question — does GitHub respect dynamic X-ScaleSetMaxCapacity changes mid-session?

Result: YES — validated. GitHub re-reads X-ScaleSetMaxCapacity on every poll. Setting maxRunners=0 stops job assignment. Reducing capacity mid-burst redirects queued jobs to other clusters. Latency is one poll cycle (~5-10 seconds).

POC implementation (ConfigMap-based manual knob) has been removed. Phase 2 replaces it with the full automated system.

Phase 2: Production Placeholder System + HUD Integration

Goal: Implement the full capacity-aware listener with placeholder pods, HUD API integration for demand-driven scaling, and proper deployment infrastructure.

2.1 OSDC Infrastructure (osdc/ repo)

POC Cleanup:

  • Remove scripts/python/capacity_setter.py (POC ConfigMap-based tool)
  • Remove DYNAMIC_CAPACITY_CONFIGMAP env var from modules/arc-runners/templates/runner.yaml.tpl
  • Remove capacity recipe from justfile
  • Remove any tests for capacity_setter

Runner Definitions:

  • Add maxRunners field to runner template (runner.yaml.tpl) — maps to the Helm maxRunners value
  • Add maxRunners support to generate_runners.py
  • Add maxRunners to all runner defs in modules/arc-runners/defs/*.yaml

Deploy Infrastructure:

  • Add Harbor osdc project creation to modules/arc/deploy.sh (same pattern as harbor-cache-recovery)
  • Update modules/arc/deploy.sh to pull the controller Helm chart from the published chart on https://github.com/jeanschmidt/actions-runner-controller.git master branch (not a local path)

PriorityClasses:

  • Create Kubernetes manifests for the four priority classes (placeholder-runner, arc-runner, placeholder-workflow, arc-workflow)
  • Deploy as part of the ARC module (applied before runner scale sets)

HUD API Secret:

  • Document the K8s Secret creation for the user to run manually:
    kubectl create secret generic pytorch-hud-token \
      --namespace arc-systems \
      --from-literal=token='<hud-internal-bot-secret>'
  • Add HUD_API_TOKEN env var to listener pod template (from the secret)

2.2 ARC Fork (actions-runner-controller repo)

POC Cleanup:

  • Remove any ConfigMap watcher code from the forked listener

Capacity Monitor Package (cmd/ghalistener/capacity/):

The core implementation. A single goroutine that runs in the listener's errgroup:

  1. Label Discovery (labels.go):

    • At startup, call client.GetRunnerScaleSet(ctx, scaleSetID) to get RunnerScaleSet.Labels
    • Cache the labels for HUD API matching
    • The scale set's labels are already fetched in main.go — pass them to the monitor
  2. HUD API Client (hud_client.go):

    • HTTP GET to https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
    • Header: x-hud-internal-bot: <token> (from HUD_API_TOKEN env var)
    • Parse response as []QueuedJobsForRunner
    • Filter by matching runner_label against the scale set's labels
    • Sum num_queued_jobs for matching entries → nbr_queued_jobs_runner
    • Poll interval: same as recalculateInterval (default 30s)
  3. Placeholder Manager (placeholder.go):

    • Create placeholder pairs (Placeholder-Runner + Placeholder-Workflow) as Kubernetes pods
    • Image: public.ecr.aws/docker/library/alpine:3.21, command: ["sleep", "900"]
    • Owner reference: listener pod (auto-cleanup on listener death/restart)
    • terminationGracePeriodSeconds: 0
    • preemptionPolicy: Never on both placeholders
    • No pod affinity — placeholders are landing-place agnostic (cluster-level capacity reservation)
    • Node selector + tolerations: match the runner pod template
    • Track pair state: both Running = ready slot, either Pending past timeout = delete pair
  4. Capacity Calculator (monitor.go):

    • Main reconciliation loop (event-driven + periodic fallback at recalculateInterval):
      nbr_queued = query_hud_api()
      desired_pairs = proactiveCapacity + nbr_queued
      desired_pairs = min(desired_pairs, maxRunners - total_running_jobs)
      desired_pairs = max(desired_pairs, 0)
      
      // Create or delete placeholder pairs to match desired count
      adjust_placeholders(desired_pairs)
      
      // Report capacity to GitHub
      running_pairs = count_running_placeholder_pairs()
      capacity = min(total_running_jobs + running_pairs, maxRunners)
      listener.SetMaxRunners(capacity)
      
    • Watch for pod events (placeholder state changes) to trigger immediate recalculation
    • Watch for EphemeralRunner changes (job started/completed) to adjust
  5. Integration (main.go changes):

    • Create CapacityMonitor in the errgroup alongside the listener
    • Pass: scaleset.Client, listener.SetMaxRunners, config.MaxRunners, scale set labels, K8s client
    • If capacityAware.enabled is false, skip monitor creation entirely (backward-compatible)

Unit Tests:

  • Test capacity formula with various inputs (queued jobs, running jobs, maxRunners ceiling)
  • Test HUD API response parsing and label matching
  • Test placeholder pair lifecycle (creation, timeout, deletion)
  • Mock K8s client for placeholder pod operations

2.3 Staging Validation

Deploy to staging and validate:

  1. Placeholder pods trigger Karpenter provisioning (no same-node requirement)
  2. Runner pods (priority 0) preempt Placeholder-Runner (-10) but NOT Placeholder-Workflow (10)
  3. Workflow pods (priority 20) preempt Placeholder-Workflow (10)
  4. Workflow resources remain protected between runner start and workflow pod creation
  5. maxRunners ceiling from Helm is never exceeded
  6. HUD API integration correctly discovers queued jobs per runner label
  7. Placeholder pairs scale up when jobs queue, scale down when jobs are assigned
  8. Placeholder timeout (15 min) works — pods self-terminate if not preempted
  9. Listener restart cleans up all placeholder pods (owner reference)
  10. InsufficientCapacity: placeholder stuck Pending past placeholderReadyTimeout → deleted, capacity not reported

Phase 3: Production Rollout

  1. Add Prometheus metrics for capacity monitoring (available slots, placeholder status, maxRunners changes, HUD API latency)
  2. Add Grafana dashboards
  3. Roll out to production clusters one at a time
  4. Tune proactiveCapacity per runner type based on observed burst patterns and HUD data

Phase 4: Multi-Cluster Optimization

  1. Validate multi-cluster behavior with dynamic maxRunners
  2. Consider adding cross-cluster capacity metrics (each cluster publishes its available capacity to a shared metric store)
  3. Tune consolidation delays to balance spare capacity cost vs. burst responsiveness

Risks and Mitigations

Risk: GitHub doesn't respect X-ScaleSetMaxCapacity dynamically

Likelihood: Low. The header is sent on every poll, and the protocol is designed for this.
Mitigation: Phase 1 PoC validates this before any significant investment.

Risk: Placeholder pod preemption race conditions

Scenario: Runner pod is created, preempts its placeholder, but the freed workflow resources are claimed by another pod before the workflow pod is created.
Mitigation: Solved by the split placeholder design. Each slot has two placeholders: Placeholder-Runner (priority -10) and Placeholder-Workflow (priority 10). When the runner pod (priority 0) is created, it preempts only Placeholder-Runner — Placeholder-Workflow survives because its priority (10) is higher than the runner's (0). The workflow resources remain protected until the actual workflow pod (priority 20) preempts Placeholder-Workflow. Both placeholder types MUST set terminationGracePeriodSeconds: 0 to ensure resources are freed instantly during preemption.

Risk: Multi-scale-set resource contention (double-counting headroom)

Scenario: Multiple runner types share the same NodePools. Two capacity monitors both see node headroom and report it as available.
Mitigation: This is solved by design. Capacity is never reported from headroom calculations — only from Running placeholder pods. When two monitors both detect headroom and create placeholders, the Kubernetes scheduler places one and the other stays Pending. The Pending placeholder times out, is deleted, and the monitor retries on the next cycle. The scheduler is the sole arbiter of resource contention; no cross-monitor coordination is needed.

Risk: Karpenter consolidation conflicts with placeholder pods

Scenario: Karpenter tries to consolidate nodes with placeholder pods while the capacity monitor is trying to maintain proactive capacity.
Mitigation: Placeholder pods do NOT use karpenter.sh/do-not-disrupt — that annotation blocks consolidation entirely with no TTL, which would cause idle nodes to accumulate indefinitely. Instead, placeholders are low-priority pods with preemptionPolicy: Never, which means Karpenter treats them as reschedulable during consolidation. When Karpenter evicts a placeholder, the capacity monitor detects the pod leaving Running state, decreases maxRunners, and creates a new placeholder pair on the next cycle (which may land on a different, more efficient node). This is the desired behavior — Karpenter optimizes node utilization, the capacity monitor reacts and re-provisions.

Risk: Cost of spare capacity

Scenario: Clusters maintain placeholder pods (and thus nodes) that never get used.
Mitigation: proactiveCapacity defaults to 0 (disabled) and is opt-in per scale set. Enable it only for runner types with frequent bursts where cold-start latency matters. Tune the value based on observed burst patterns for each runner type. Karpenter consolidation eventually reclaims unused capacity if no jobs arrive.

Risk: Fork maintenance burden

Likelihood: Low. The fork is ~200 lines of new code in a single binary. The scaleset package interface is stable (three methods, unchanged since v0.2.0).
Mitigation: The forked binary is a thin wrapper. ARC controller upgrades don't affect it. Only changes to the scaleset package's Scaler interface or SetMaxRunners() API would require fork updates.

Risk: HUD API unavailability

Scenario: The PyTorch HUD API at hud.pytorch.org is down, slow, or returns errors. The capacity monitor cannot determine queued job counts.
Mitigation: The HUD API is a best-effort enhancement. If the API is unavailable, the capacity monitor falls back to proactiveCapacity only (ignoring the nbr_queued_jobs_runner component). The monitor logs the failure but does not reduce capacity or stop functioning. Placeholder pairs based on static proactiveCapacity continue to work regardless of HUD API status.

Risk: JobAssigned messages are ignored

Context: The listener currently ignores JobAssigned messages (parses them but doesn't pass to the scaler). These messages contain per-job metadata (RequestLabels, RepositoryName, JobID, etc.) that could be useful for smarter capacity decisions.
Opportunity: A future enhancement could process JobAssigned messages to make per-job capacity decisions — e.g., "this job needs a GPU node, do I have GPU capacity?"
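Such a per-job decision could be as simple as a predicate over RequestLabels. This is a sketch of the idea only; the GPU label shown is illustrative, and the message struct is a subset of JobMessageBase:

```go
package main

import "fmt"

// JobAssignedMsg carries the JobMessageBase fields a capacity decision
// would need (subset only).
type JobAssignedMsg struct {
	JobID         int64
	RequestLabels []string
}

// needsGPU is a hypothetical predicate: it flags jobs whose request labels
// match a known set of GPU runner labels. The label set is illustrative.
func needsGPU(msg JobAssignedMsg) bool {
	gpuLabels := map[string]bool{"linux.g5.4xlarge.nvidia.gpu": true}
	for _, l := range msg.RequestLabels {
		if gpuLabels[l] {
			return true
		}
	}
	return false
}

func main() {
	gpuJob := JobAssignedMsg{JobID: 1, RequestLabels: []string{"linux.g5.4xlarge.nvidia.gpu"}}
	cpuJob := JobAssignedMsg{JobID: 2, RequestLabels: []string{"linux.2xlarge"}}
	fmt.Println(needsGPU(gpuJob), needsGPU(cpuJob)) // true false
}
```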

Protocol Reference

Key Types (from github.com/actions/scaleset)

// Statistics sent with every message
type RunnerScaleSetStatistic struct {
    TotalAvailableJobs     int
    TotalAcquiredJobs      int
    TotalAssignedJobs      int  // <-- this is what drives scaling
    TotalRunningJobs       int
    TotalRegisteredRunners int
    TotalBusyRunners       int
    TotalIdleRunners       int
}

// Message types
const (
    MessageTypeJobAssigned  = "JobAssigned"   // ignored by current listener
    MessageTypeJobStarted   = "JobStarted"    // runner picked up job
    MessageTypeJobCompleted = "JobCompleted"  // runner finished job
)

// Per-job metadata (available in all message types)
type JobMessageBase struct {
    RunnerRequestID    int64
    RepositoryName     string
    OwnerName          string
    JobID              int64
    JobWorkflowRef     string
    JobDisplayName     string
    WorkflowRunID      int64
    EventName          string
    RequestLabels      []string
    QueueTime          time.Time
    ScaleSetAssignTime time.Time
    RunnerAssignTime   time.Time
    FinishTime         time.Time
}

Listener Event Loop (from github.com/actions/scaleset/listener)

loop:
  1. GetMessage(ctx, lastMessageID, maxRunners)
     → HTTP GET to messageQueueURL
     → Header: X-ScaleSetMaxCapacity = maxRunners
     → Header: Authorization = Bearer <messageQueueAccessToken>

  2a. HTTP 202 (no messages):
     → call scaler.HandleDesiredRunnerCount(ctx, latestStatistics.TotalAssignedJobs)
     → loop

  2b. HTTP 200 (message batch):
     → parse message (Statistics + JobAssigned/Started/Completed arrays)
     → DELETE messageQueueURL/{messageID}  (ACK — all-or-nothing)
     → call scaler.HandleJobStarted() for each started job
     → call scaler.HandleJobCompleted() for each completed job
     → call scaler.HandleDesiredRunnerCount(ctx, msg.Statistics.TotalAssignedJobs)
     → loop

  2c. HTTP 401 (token expired):
     → PATCH sessions/{sessionId} to refresh token
     → retry

ARC Controller Chain (unchanged)

AutoscalingRunnerSet (Helm-created CR)
  ↓ controller creates
AutoscalingListener (CR + listener pod in arc-systems)
  ↓ listener patches
EphemeralRunnerSet.spec.replicas = N
  ↓ controller creates N
EphemeralRunner (one per runner)
  ↓ controller creates
Pod (runner pod, from template in AutoscalingRunnerSet)
  ↓ runner creates (via runner-container-hooks)
Pod (workflow pod, in arc-runners namespace)

Source Code References

actions/scaleset/listener/listener.go: Listener struct, Scaler interface, Run() loop, SetMaxRunners()
actions/scaleset/session_client.go: GetMessage (long-poll with X-ScaleSetMaxCapacity), DeleteMessage (ACK)
actions/scaleset/client.go: REST API client, authentication, JIT config generation
actions/scaleset/types.go: Protocol types (RunnerScaleSetStatistic, message types, RunnerScaleSet)
actions/scaleset/examples/dockerscaleset/scaler.go: Reference Scaler implementation showing capacity capping
actions-runner-controller/cmd/ghalistener/main.go: Listener binary entry point (the file we fork)
actions-runner-controller/cmd/ghalistener/scaler/scaler.go: Current Scaler implementation (patches EphemeralRunnerSet)
actions-runner-controller/controllers/actions.github.com/ephemeralrunnerset_controller.go: EphemeralRunnerSet reconciler (creates EphemeralRunner CRs)
actions-runner-controller/controllers/actions.github.com/ephemeralrunner_controller.go: EphemeralRunner reconciler (creates runner pods)
actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go: AutoscalingRunnerSet reconciler (manages listener + EphemeralRunnerSet)
actions-runner-controller/controllers/actions.github.com/resourcebuilder.go: Pod spec construction for listener and runner pods
