Commit 7754e3e
Runtime rule hot-update for MAL and LAL (#13851)
Runtime rule hot-update: REST admin surface for MAL/LAL rule files.
Operators can add, override, inactivate, and delete MAL (`otel-rules`,
`log-mal-rules`, `telegraf-rules`) and LAL rule files at runtime without
restarting OAP. Edits compile and load into the OAP JVM on the fly; every
node in a cluster converges on its next periodic scan (~30 s).
The admin surface is **disabled by default** and listens on **port 17128**
when enabled (`SW_RECEIVER_RUNTIME_RULE=default`). It has **no built-in
authentication** — operators must gateway-protect with IP allow-lists and
never expose it to the public internet.
## REST endpoints
### Write
`POST /runtime/rule/addOrUpdate`
Body: raw rule YAML. Filter-only edits use a fast in-place swap;
structural edits run a cluster pause + DDL + verify + persist + resume
cycle.
Query parameters:
- `catalog` — required; one of `otel-rules`, `log-mal-rules`,
`telegraf-rules`, `lal`. Unknown returns
`400 invalid_catalog`.
- `name` — required; filesystem-style path under the
catalog root, no extension. Pattern:
`[A-Za-z0-9._-]+(/[A-Za-z0-9._-]+)*`.
- `allowStorageChange` — optional, default `false`. Set `true` to
permit shape-breaking edits (drops measure
data on BanyanDB; orphans rows on ES/JDBC).
- `force` — optional, default `false`. Recovery flag;
bypasses the byte-identical no-change
short-circuit so re-pushing known-good content
is treated as a fresh apply.
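
As a rough illustration of the query parameters above, the sketch below validates `catalog` and `name` on the client side and assembles an `/addOrUpdate` URL. The helper name `build_add_or_update_url` and the base URL are hypothetical, and sending the boolean flags as the literal string `true` is an assumption based on the wording above.

```python
import re
from urllib.parse import urlencode

# Catalogs and name pattern copied from the parameter list above.
CATALOGS = {"otel-rules", "log-mal-rules", "telegraf-rules", "lal"}
NAME_PATTERN = re.compile(r"[A-Za-z0-9._-]+(/[A-Za-z0-9._-]+)*")


def build_add_or_update_url(base, catalog, name,
                            allow_storage_change=False, force=False):
    """Hypothetical client helper: returns the URL to POST the rule YAML to."""
    if catalog not in CATALOGS:
        raise ValueError("invalid_catalog")
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError("name does not match the documented pattern")
    params = {"catalog": catalog, "name": name}
    # Both flags default to false, so they are only sent when set.
    if allow_storage_change:
        params["allowStorageChange"] = "true"
    if force:
        params["force"] = "true"
    return f"{base.rstrip('/')}/runtime/rule/addOrUpdate?{urlencode(params)}"
```

The rule YAML itself would go in the POST body; rejecting a bad `name` locally only saves a round trip and does not replace the server-side check.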
`POST /runtime/rule/inactivate`
Soft-pause. Stops dispatching for that rule but preserves the backend
measure + history so a later `/addOrUpdate` is lossless. The "off" intent
is durable across restarts.
Query parameters:
- `catalog`, `name` — required, same as above.
`POST /runtime/rule/delete`
Removes an INACTIVE row (ACTIVE rules return
`409 requires_inactivate_first`). Behaviour depends on `mode` and whether
a bundled YAML twin exists on disk:
- default mode + no bundled twin: drops the row, leaves the backend as
an inert artefact (matches bundled-rule deletion on disk).
- default mode + bundled twin: refused with
`409 requires_revert_to_bundled` so bundled cannot silently take over
without an explicit operator decision.
- `?mode=revertToBundled` + bundled twin: schema-change pipeline
(install runtime locally, apply bundled through the standard pipeline
so the runtime->bundled delta drops runtime-only metrics, installs
bundled-only metrics) before removing the row.
- `?mode=revertToBundled` + no bundled twin: returns
`400 no_bundled_twin`.
Query parameters:
- `catalog`, `name` — required.
- `mode` — optional, default empty. Set
`revertToBundled` to drive the schema-change
pipeline.
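
The `/delete` decision described above can be summarised as a small pure function. This is a sketch of the documented outcomes, not the server code: the function name, the order of the checks, and the two success labels (`deleted`, `reverted_to_bundled`) are assumptions; only the `409` and `400` codes come from the text.

```python
def delete_outcome(status, mode="", has_bundled_twin=False):
    """Return (http_status, code) for /delete on a row that is ACTIVE or INACTIVE."""
    if status == "ACTIVE":
        # ACTIVE rows must be inactivated first.
        return 409, "requires_inactivate_first"
    if mode == "revertToBundled":
        if not has_bundled_twin:
            return 400, "no_bundled_twin"
        # Schema-change pipeline runs, then the row is removed.
        return 200, "reverted_to_bundled"  # hypothetical success label
    if has_bundled_twin:
        # Default mode refuses so bundled cannot silently take over.
        return 409, "requires_revert_to_bundled"
    # Default mode, no twin: row dropped, backend left in place.
    return 200, "deleted"  # hypothetical success label
```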
### Read
`GET /runtime/rule`
One rule's YAML body. Default returns the runtime row; falls back to
bundled when the row is absent. Supports `ETag` and `If-None-Match` for
cheap 304s.
Query parameters:
- `catalog`, `name` — required.
- `source` — optional, `runtime` (default) or `bundled`.
`bundled` reads on-disk YAML even when a
runtime override is in place.
- HTTP `Accept` — `application/x-yaml` (default) or
`application/json` for the JSON envelope.
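
A conditional read might be prepared as in the sketch below, which only builds the request object and sends nothing. The helper name `build_rule_get` and the base URL are hypothetical; the query parameters and header names are the ones listed above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_rule_get(base, catalog, name, source="runtime",
                   etag=None, as_json=False):
    """Hypothetical helper: build a GET /runtime/rule request."""
    query = urlencode({"catalog": catalog, "name": name, "source": source})
    headers = {
        "Accept": "application/json" if as_json else "application/x-yaml",
    }
    if etag is not None:
        # Echo the ETag from an earlier response to allow a 304.
        headers["If-None-Match"] = etag
    url = f"{base.rstrip('/')}/runtime/rule?{query}"
    return Request(url, headers=headers, method="GET")
```

When such a request is passed to `urllib.request.urlopen`, a `304` response is raised as `urllib.error.HTTPError` with `code == 304`, so a caller would catch that and keep its cached copy.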
`GET /runtime/rule/bundled`
Bundled rules in one catalog as JSON, with an override flag joined from
runtime rows.
Query parameters:
- `catalog` — required.
- `withContent` — optional, default `true`. When `false`,
omits each YAML body (listing only).
`GET /runtime/rule/list`
Single JSON envelope `{generatedAt, loaderStats, rules}` merging stored
rules with this node's local state. Each row carries `loaderKind`,
`loaderName`, `bundled`, and `bundledContentHash` so a UI can render
override badges without a second roundtrip.
Query parameters:
- `catalog` — optional. Narrows the output to one
catalog. Unknown returns
`400 invalid_catalog`.
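
To show how a UI could derive an override badge from one `/list` row, here is a minimal sketch. It assumes each row is a parsed JSON object and that rows also expose `status` and `contentHash`; those two field names are inferred from the state matrix, not confirmed by the endpoint description, and the badge labels are invented for the example.

```python
def override_badge(row):
    """Classify one /list row (a dict) into an illustrative badge label."""
    if row.get("loaderKind") == "RUNTIME":
        if not row.get("bundled"):
            return "runtime-only"
        # A runtime override of a bundled rule: compare the two hashes.
        same = row.get("contentHash") == row.get("bundledContentHash")
        return "override-identical" if same else "override-modified"
    if row.get("status") == "INACTIVE":
        return "inactive"
    return "bundled" if row.get("bundled") else "unknown"
```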
`GET /runtime/rule/dump[/<catalog>]`
Tar.gz of stored rules + manifest.yaml for backup/DR. Trailing
`/<catalog>` narrows the dump.
### Catalog shortcut routes
Mirror the canonical paths for scripts that drive a single catalog:
- `/runtime/mal/otel/{addOrUpdate,inactivate,delete}` -> `catalog=otel-rules`
- `/runtime/mal/log/{addOrUpdate,inactivate,delete}` -> `catalog=log-mal-rules`
- `/runtime/lal/{addOrUpdate,inactivate,delete}` -> `catalog=lal`
`telegraf-rules` is supported via canonical routes only.
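
A script that wants to prefer the shortcut routes and fall back to the canonical path could pick the route as sketched below. The function name is hypothetical, and passing `name` as a query parameter on the shortcut routes is an assumption carried over from the canonical endpoints.

```python
# Shortcut prefixes from the list above; telegraf-rules has none.
SHORTCUT_PREFIX = {
    "otel-rules": "/runtime/mal/otel",
    "log-mal-rules": "/runtime/mal/log",
    "lal": "/runtime/lal",
}
WRITE_ACTIONS = ("addOrUpdate", "inactivate", "delete")


def write_route(catalog, action, name):
    """Return (path, query_params) for a write action on one rule."""
    if action not in WRITE_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    prefix = SHORTCUT_PREFIX.get(catalog)
    if prefix is None:
        # Canonical route: the catalog travels as a query parameter.
        return f"/runtime/rule/{action}", {"catalog": catalog, "name": name}
    # Shortcut route: the catalog is implied by the path.
    return f"{prefix}/{action}", {"name": name}
```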
## Lifecycle / status of a DSL rule
Status (DAO row + synthetic):
- `BUNDLED` — synthetic, shipped on disk with no operator override. Healthy steady state, no DAO row.
- `ACTIVE` — DAO row, runtime override is serving.
- `INACTIVE` — DAO row, soft-paused tombstone. Handlers torn down; backend preserved.
- `n/a` — synthetic transient: row was just removed and this node hasn't swept yet.
Local state (per node, transient):
- `RUNNING` — dispatching samples; commit complete.
- `SUSPENDED` — mid-structural-apply. `suspendOrigin` in {SELF, PEER, BOTH}.
- `NOT_LOADED` — after `/inactivate` or never installed; no handlers.
- (null) — boot-seeded bundled entry; gone-keys reconcile leaves it alone.
Loader kind (per-file classloader):
- `RUNTIME` — operator-pushed override active. Loader prefix `runtime-rule:`.
- `BUNDLED` — bundled rule served via fall-over loader. Loader prefix `bundled:`.
- `NONE` — no per-file loader (bundled-only via the OAP shared default loader, or INACTIVE).
State matrix:
| Operator action history | status | loaderKind | bundled | What is serving |
| ---------------------------------------- | ---------- | ---------- | ------- | --------------------------------------------------------------- |
| Bundled rule, never touched | `BUNDLED` | `NONE` | `true` | Bundled YAML, OAP shared default classloader. |
| `/addOrUpdate` overriding bundled | `ACTIVE` | `RUNTIME` | `true` | Runtime override; compare contentHash vs bundledContentHash. |
| `/addOrUpdate` brand-new (no twin) | `ACTIVE` | `RUNTIME` | `false` | Runtime override; no bundled fallback. |
| `/inactivate` of override | `INACTIVE` | `NONE` | `true` | Nothing. Bundled does NOT auto-resurrect. |
| `/inactivate` of bundled-only | `INACTIVE` | `NONE` | `true` | Nothing. Tombstone carries the bundled YAML at inactivate-time. |
| `/inactivate` of brand-new | `INACTIVE` | `NONE` | `false` | Nothing. Rule is off. |
| Post-`revertToBundled` row removed | `n/a` | `BUNDLED` | `true` | Bundled rule freshly compiled into a `bundled:` loader. |
Lifecycle transitions (linear form, friendly to plain-text diff views):
1. Initial state: `BUNDLED` (rule shipped on disk, no DAO row).
2. `/addOrUpdate` against a bundled or absent rule -> `ACTIVE` (loaderKind=RUNTIME).
3. `/addOrUpdate` against an `ACTIVE` rule -> `ACTIVE` (re-applies; filter-only fast path or structural pipeline depending on the diff).
4. `/inactivate` against `ACTIVE` -> `INACTIVE` (handlers torn down, backend preserved).
5. `/inactivate` against `BUNDLED` -> `INACTIVE` (tombstone row carrying the bundled YAML at inactivate time).
6. `/inactivate` against `INACTIVE` -> `INACTIVE` (idempotent, returns `200 already_inactive`).
From `INACTIVE`, three exits reactivate the rule or restore bundled:
7a. `/addOrUpdate` with same content -> `ACTIVE` (reactivate; full structural pipeline).
7b. `/addOrUpdate` with new content -> `ACTIVE` (reactivate with edits).
7c. `/delete?mode=revertToBundled` -> row gone, `BUNDLED` loader installed (only if a bundled twin exists on disk).
`/delete` (default mode) on `INACTIVE`:
8a. No bundled twin on disk -> row gone, backend left as an inert artefact.
8b. Bundled twin on disk -> `409 requires_revert_to_bundled` (refused; operator must opt in via 7c).
Constraints:
- `/delete` against `ACTIVE` always returns `409 requires_inactivate_first` — destruction goes through the explicit two-step `/inactivate -> /delete` workflow.
- The `INACTIVE` tombstone is durable across OAP restarts; bundled does NOT auto-resurrect when a runtime override is removed via `/inactivate`. Only path 7c brings bundled back.
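
The transitions and constraints above can be read as a lookup table of legal moves. The sketch below encodes them and replays a sequence of operator actions; it is a model of the text, not the server's state machine. `ABSENT` is a hypothetical label for "no row and no bundled file", and the transient `n/a` status after 7c is folded into `BUNDLED`.

```python
# (current status, action) -> next status; anything missing is refused.
LEGAL = {
    ("BUNDLED", "addOrUpdate"): "ACTIVE",                   # step 2
    ("ABSENT", "addOrUpdate"): "ACTIVE",                    # step 2
    ("ACTIVE", "addOrUpdate"): "ACTIVE",                    # step 3
    ("INACTIVE", "addOrUpdate"): "ACTIVE",                  # steps 7a, 7b
    ("ACTIVE", "inactivate"): "INACTIVE",                   # step 4
    ("BUNDLED", "inactivate"): "INACTIVE",                  # step 5
    ("INACTIVE", "inactivate"): "INACTIVE",                 # step 6
    ("INACTIVE", "delete?mode=revertToBundled"): "BUNDLED",  # step 7c
    ("INACTIVE", "delete"): "ABSENT",                       # step 8a
}


def step(status, action, has_bundled_twin):
    """Apply one action, raising ValueError when the text says it is refused."""
    if action == "delete" and has_bundled_twin:
        raise ValueError("requires_revert_to_bundled")      # step 8b
    if action == "delete?mode=revertToBundled" and not has_bundled_twin:
        raise ValueError("no_bundled_twin")
    try:
        return LEGAL[(status, action)]
    except KeyError:
        raise ValueError(f"{action} refused from {status}") from None


def replay(actions, start, has_bundled_twin):
    """Fold a list of actions into the final status."""
    status = start
    for action in actions:
        status = step(status, action, has_bundled_twin)
    return status
```

For example, replaying `addOrUpdate`, `inactivate`, `delete?mode=revertToBundled` from `BUNDLED` with a twin on disk ends back at `BUNDLED`, which is the only route that restores the bundled rule.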
## Persistence
Hot-updates survive OAP restart: at boot, OAP merges bundled rule files with
persisted runtime rules so the cluster never silently regresses to bundled
defaults.
DAO row shape: `(catalog, name, content, status, updateTime)`. Per-backend
DAO implementations:
- BanyanDB — etcd-backed property writes; cluster fences on `mod_revision` via Schema Barrier.
- Elasticsearch — upsert by row.
- JDBC (H2 / MySQL / PostgreSQL / TiDB / OceanBase) — upsert by row.
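
One way to picture the boot-time merge is the sketch below. It assumes, following the state matrix, that an `ACTIVE` row replaces the bundled content for the same `(catalog, name)` and that an `INACTIVE` row suppresses it. The `RuleRow` class mirrors the DAO row shape; the function name and the in-memory dictionaries are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleRow:
    # Mirrors the DAO row shape (catalog, name, content, status, updateTime).
    catalog: str
    name: str
    content: str
    status: str        # "ACTIVE" or "INACTIVE"
    update_time: int


def merge_at_boot(bundled, rows):
    """Return {(catalog, name): content} for the rules that should serve.

    bundled: dict mapping (catalog, name) to on-disk YAML content.
    rows: iterable of persisted RuleRow objects.
    """
    serving = dict(bundled)
    for row in rows:
        key = (row.catalog, row.name)
        if row.status == "ACTIVE":
            serving[key] = row.content      # runtime override wins
        elif row.status == "INACTIVE":
            serving.pop(key, None)          # tombstone: nothing serves
    return serving
```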
## Configuration
`application.yml` block (`oap-server/server-starter/src/main/resources/application.yml`):
| Knob | Env var | Default |
| -------------------------- | ------------------------------------------------------- | ---------------- |
| selector | `SW_RECEIVER_RUNTIME_RULE` | empty (disabled) |
| `restHost` | `SW_RECEIVER_RUNTIME_RULE_REST_HOST` | `0.0.0.0` |
| `restPort` | `SW_RECEIVER_RUNTIME_RULE_REST_PORT` | `17128` |
| `restContextPath` | `SW_RECEIVER_RUNTIME_RULE_REST_CONTEXT_PATH` | `/` |
| `restIdleTimeOut` | `SW_RECEIVER_RUNTIME_RULE_REST_IDLE_TIMEOUT` | `30000` |
| `restAcceptQueueSize` | `SW_RECEIVER_RUNTIME_RULE_REST_QUEUE_SIZE` | `0` |
| `httpMaxRequestHeaderSize` | `SW_RECEIVER_RUNTIME_RULE_HTTP_MAX_REQUEST_HEADER_SIZE` | `8192` |
| `reconcilerIntervalSeconds` | `SW_RECEIVER_RUNTIME_RULE_RECONCILER_INTERVAL_SECONDS` | `30` |
| `selfHealThresholdSeconds` | `SW_RECEIVER_RUNTIME_RULE_SELF_HEAL_THRESHOLD_SECONDS` | `60` |
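
For tooling that needs to know where the admin surface would listen, the env-var and default pairs in the table can be resolved as sketched below. This only mirrors the table; it is not how OAP itself parses `application.yml`, and the function name is hypothetical.

```python
import os

PREFIX = "SW_RECEIVER_RUNTIME_RULE"


def runtime_rule_settings(env=None):
    """Return the effective settings, or None when the module is disabled."""
    env = os.environ if env is None else env
    selector = env.get(PREFIX, "")
    if not selector:
        # Empty selector is the default: admin surface disabled.
        return None
    return {
        "selector": selector,
        "restHost": env.get(f"{PREFIX}_REST_HOST", "0.0.0.0"),
        "restPort": int(env.get(f"{PREFIX}_REST_PORT", "17128")),
        "restContextPath": env.get(f"{PREFIX}_REST_CONTEXT_PATH", "/"),
        "reconcilerIntervalSeconds": int(
            env.get(f"{PREFIX}_RECONCILER_INTERVAL_SECONDS", "30")),
        "selfHealThresholdSeconds": int(
            env.get(f"{PREFIX}_SELF_HEAL_THRESHOLD_SECONDS", "60")),
    }
```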
## Security
- Disabled by default; `selector` is empty out of the box.
- The admin port has **no authentication** in this iteration. Operators must
gateway-protect with IP allow-lists + auth and never expose port 17128 to
the public internet.
- Audit every request — rule content compiles into the OAP JVM, equivalent
to shell access on the OAP host.
- Cluster Suspend RPC rides the existing OAP cluster-bus gRPC server
(RemoteService / HealthCheck transport), separate from port 17128.
## Documentation
- `docs/en/setup/backend/backend-runtime-rule-api.md` — full API reference with applyStatus codes and per-backend `/delete` semantics.
- `docs/en/concepts-and-designs/runtime-rule-hot-update.md` — design doc.
- `docs/en/security/README.md` — security notice for the admin surface.
- `docs/en/setup/backend/configuration-vocabulary.md` — env-var reference.
1 parent 36a3f9c commit 7754e3e
224 files changed
Lines changed: 23809 additions & 964 deletions