
Commit 7754e3e

Runtime rule hot-update for MAL and LAL (#13851)
Runtime rule hot-update: REST admin surface for MAL/LAL rule files. Operators can add, override, inactivate, and delete MAL (`otel-rules`, `log-mal-rules`, `telegraf-rules`) and LAL rule files at runtime without restarting OAP. Edits compile and load into the OAP JVM on the fly; every node in a cluster converges on its next periodic scan (~30 s).

The admin surface is **disabled by default** and listens on **port 17128** when enabled (`SW_RECEIVER_RUNTIME_RULE=default`). It has **no built-in authentication** — operators must gateway-protect with IP allow-lists and never expose it to the public internet.

## REST endpoints

### Write

`POST /runtime/rule/addOrUpdate`

Body: raw rule YAML. Filter-only edits use a fast in-place swap; structural edits run a cluster pause + DDL + verify + persist + resume cycle.

Query parameters:

- `catalog` — required; one of `otel-rules`, `log-mal-rules`, `telegraf-rules`, `lal`. Unknown returns `400 invalid_catalog`.
- `name` — required; filesystem-style path under the catalog root, no extension. Pattern: `[A-Za-z0-9._-]+(/[A-Za-z0-9._-]+)*`.
- `allowStorageChange` — optional, default `false`. Set `true` to permit shape-breaking edits (drops measure data on BanyanDB; orphans rows on ES/JDBC).
- `force` — optional, default `false`. Recovery flag; bypasses the byte-identical no-change short-circuit so re-pushing known-good content is treated as a fresh apply.

`POST /runtime/rule/inactivate`

Soft-pause. Stops dispatching for that rule but preserves the backend measure + history so a later `/addOrUpdate` is lossless. The "off" intent is durable across restarts.

Query parameters:

- `catalog`, `name` — required, same as above.

`POST /runtime/rule/delete`

Removes an INACTIVE row (ACTIVE rules return `409 requires_inactivate_first`). Behaviour depends on `mode` and whether a bundled YAML twin exists on disk:

- default mode + no bundled twin: drops the row, leaves the backend as inert artefact (matches bundled-rule deletion on disk).
- default mode + bundled twin: refused with `409 requires_revert_to_bundled` so bundled cannot silently take over without an explicit operator decision.
- `?mode=revertToBundled` + bundled twin: schema-change pipeline (install runtime locally, apply bundled through the standard pipeline so the runtime->bundled delta drops runtime-only metrics, installs bundled-only metrics) before removing the row.
- `?mode=revertToBundled` + no bundled twin: returns `400 no_bundled_twin`.

Query parameters:

- `catalog`, `name` — required.
- `mode` — optional, default empty. Set `revertToBundled` to drive the schema-change pipeline.

### Read

`GET /runtime/rule`

One rule's YAML body. Default returns the runtime row; falls back to bundled when the row is absent. Supports `ETag` and `If-None-Match` for cheap 304s.

Query parameters:

- `catalog`, `name` — required.
- `source` — optional, `runtime` (default) or `bundled`. `bundled` reads on-disk YAML even when a runtime override is in place.
- HTTP `Accept` — `application/x-yaml` (default) or `application/json` for the JSON envelope.

`GET /runtime/rule/bundled`

Bundled rules in one catalog as JSON, with override flag joined from runtime rows.

Query parameters:

- `catalog` — required.
- `withContent` — optional, default `true`. When `false`, omits each YAML body (listing only).

`GET /runtime/rule/list`

Single JSON envelope `{generatedAt, loaderStats, rules}` merging stored rules with this node's local state. Each row carries `loaderKind`, `loaderName`, `bundled`, and `bundledContentHash` so a UI can render override badges without a second roundtrip.

Query parameters:

- `catalog` — optional. Narrows the output to one catalog. Unknown returns `400 invalid_catalog`.

`GET /runtime/rule/dump[/<catalog>]`

Tar.gz of stored rules + manifest.yaml for backup/DR. Trailing `/<catalog>` narrows the dump.
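
To make the `addOrUpdate` query parameters concrete, here is a minimal client-side sketch in Python. It only builds the request URL and checks `catalog` and `name` against the values and pattern quoted in the commit message. The function name `add_or_update_url` and the host `oap.internal` are illustrative assumptions and not part of the commit.

```python
import re
from urllib.parse import urlencode

# Catalog names and the `name` pattern are quoted from the commit message.
CATALOGS = {"otel-rules", "log-mal-rules", "telegraf-rules", "lal"}
NAME_PATTERN = re.compile(r"[A-Za-z0-9._-]+(/[A-Za-z0-9._-]+)*")


def add_or_update_url(base_url, catalog, name, allow_storage_change=False, force=False):
    """Build a /runtime/rule/addOrUpdate URL (hypothetical helper, not an OAP API)."""
    if catalog not in CATALOGS:
        raise ValueError(f"invalid catalog: {catalog}")
    if NAME_PATTERN.fullmatch(name) is None:
        raise ValueError(f"invalid rule name: {name}")
    params = {"catalog": catalog, "name": name}
    # Send the optional flags only when they differ from the documented default (false).
    if allow_storage_change:
        params["allowStorageChange"] = "true"
    if force:
        params["force"] = "true"
    # Keep "/" readable in the query string; the name is a path under the catalog root.
    return f"{base_url.rstrip('/')}/runtime/rule/addOrUpdate?{urlencode(params, safe='/')}"


if __name__ == "__main__":
    print(add_or_update_url("http://oap.internal:17128", "otel-rules", "vm/linux", force=True))
```

The rule YAML itself would go in the body of a `POST` to the returned URL. Because the surface has no built-in authentication, such a call should only ever be made from behind the protecting gateway.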
### Catalog shortcut routes

Mirror the canonical paths for scripts that drive a single catalog:

- `/runtime/mal/otel/{addOrUpdate,inactivate,delete}` -> `catalog=otel-rules`
- `/runtime/mal/log/{addOrUpdate,inactivate,delete}` -> `catalog=log-mal-rules`
- `/runtime/lal/{addOrUpdate,inactivate,delete}` -> `catalog=lal`

`telegraf-rules` is supported via canonical routes only.

## Lifecycle / status of a DSL rule

Status (DAO row + synthetic):

- `BUNDLED` — synthetic, shipped on disk with no operator override. Healthy steady state, no DAO row.
- `ACTIVE` — DAO row, runtime override is serving.
- `INACTIVE` — DAO row, soft-paused tombstone. Handlers torn down; backend preserved.
- `n/a` — synthetic transient: row was just removed and this node hasn't swept yet.

Local state (per node, transient):

- `RUNNING` — dispatching samples; commit complete.
- `SUSPENDED` — mid-structural-apply. `suspendOrigin` in {SELF, PEER, BOTH}.
- `NOT_LOADED` — after `/inactivate` or never installed; no handlers.
- (null) — boot-seeded bundled entry; gone-keys reconcile leaves it alone.

Loader kind (per-file classloader):

- `RUNTIME` — operator-pushed override active. Loader prefix `runtime-rule:`.
- `BUNDLED` — bundled rule served via fall-over loader. Loader prefix `bundled:`.
- `NONE` — no per-file loader (bundled-only via the OAP shared default loader, or INACTIVE).

State matrix:

| Operator action history                  | status     | loaderKind | bundled | What is serving                                                 |
| ---------------------------------------- | ---------- | ---------- | ------- | --------------------------------------------------------------- |
| Bundled rule, never touched              | `BUNDLED`  | `NONE`     | `true`  | Bundled YAML, OAP shared default classloader.                   |
| `/addOrUpdate` overriding bundled        | `ACTIVE`   | `RUNTIME`  | `true`  | Runtime override; compare contentHash vs bundledContentHash.    |
| `/addOrUpdate` brand-new (no twin)       | `ACTIVE`   | `RUNTIME`  | `false` | Runtime override; no bundled fallback.                          |
| `/inactivate` of override                | `INACTIVE` | `NONE`     | `true`  | Nothing. Bundled does NOT auto-resurrect.                       |
| `/inactivate` of bundled-only            | `INACTIVE` | `NONE`     | `true`  | Nothing. Tombstone carries the bundled YAML at inactivate-time. |
| `/inactivate` of brand-new               | `INACTIVE` | `NONE`     | `false` | Nothing. Rule is off.                                           |
| Post-`revertToBundled` row removed       | `n/a`      | `BUNDLED`  | `true`  | Bundled rule freshly compiled into a `bundled:` loader.         |

Lifecycle transitions (linear form, friendly to plain-text diff views):

1. Initial state: `BUNDLED` (rule shipped on disk, no DAO row).
2. `/addOrUpdate` against a bundled or absent rule -> `ACTIVE` (loaderKind=RUNTIME).
3. `/addOrUpdate` against an `ACTIVE` rule -> `ACTIVE` (re-applies; filter-only fast path or structural pipeline depending on the diff).
4. `/inactivate` against `ACTIVE` -> `INACTIVE` (handlers torn down, backend preserved).
5. `/inactivate` against `BUNDLED` -> `INACTIVE` (tombstone row carrying the bundled YAML at inactivate time).
6. `/inactivate` against `INACTIVE` -> `INACTIVE` (idempotent, returns `200 already_inactive`).

From `INACTIVE` there are exactly three legal exits:

- 7a. `/addOrUpdate` with same content -> `ACTIVE` (reactivate; full structural pipeline).
- 7b. `/addOrUpdate` with new content -> `ACTIVE` (reactivate with edits).
- 7c. `/delete?mode=revertToBundled` -> row gone, `BUNDLED` loader installed (only if a bundled twin exists on disk).

`/delete` (default mode) on `INACTIVE`:

- 8a. No bundled twin on disk -> row gone, backend left as inert artefact.
- 8b. Bundled twin on disk -> `409 requires_revert_to_bundled` (refused; operator must opt in via 7c).

Constraints:

- `/delete` against `ACTIVE` always returns `409 requires_inactivate_first` — destruction goes through the explicit two-step `/inactivate -> /delete` workflow.
- The `INACTIVE` tombstone is durable across OAP restarts; bundled does NOT auto-resurrect when a runtime override is removed via `/inactivate`. Only path 7c brings bundled back.

## Persistence

Hot-updates survive OAP restart: at boot, OAP merges bundled rule files with persisted runtime rules so the cluster never silently regresses to bundled defaults.

DAO row shape: `(catalog, name, content, status, updateTime)`.

Per-backend DAO implementations:

- BanyanDB — etcd-backed property writes; cluster fences on `mod_revision` via Schema Barrier.
- Elasticsearch — upsert by row.
- JDBC (H2 / MySQL / PostgreSQL / TiDB / OceanBase) — upsert by row.

## Configuration

Application.yml block (`oap-server/server-starter/src/main/resources/application.yml`):

| Knob                        | Env var                                                 | Default          |
| --------------------------- | ------------------------------------------------------- | ---------------- |
| selector                    | `SW_RECEIVER_RUNTIME_RULE`                              | empty (disabled) |
| `restHost`                  | `SW_RECEIVER_RUNTIME_RULE_REST_HOST`                    | `0.0.0.0`        |
| `restPort`                  | `SW_RECEIVER_RUNTIME_RULE_REST_PORT`                    | `17128`          |
| `restContextPath`           | `SW_RECEIVER_RUNTIME_RULE_REST_CONTEXT_PATH`            | `/`              |
| `restIdleTimeOut`           | `SW_RECEIVER_RUNTIME_RULE_REST_IDLE_TIMEOUT`            | `30000`          |
| `restAcceptQueueSize`       | `SW_RECEIVER_RUNTIME_RULE_REST_QUEUE_SIZE`              | `0`              |
| `httpMaxRequestHeaderSize`  | `SW_RECEIVER_RUNTIME_RULE_HTTP_MAX_REQUEST_HEADER_SIZE` | `8192`           |
| `reconcilerIntervalSeconds` | `SW_RECEIVER_RUNTIME_RULE_RECONCILER_INTERVAL_SECONDS`  | `30`             |
| `selfHealThresholdSeconds`  | `SW_RECEIVER_RUNTIME_RULE_SELF_HEAL_THRESHOLD_SECONDS`  | `60`             |

## Security

- Disabled by default; `selector` is empty out of the box.
- The admin port has **no authentication** in this iteration. Operators must gateway-protect with IP allow-lists + auth and never expose port 17128 to the public internet.
- Audit every request — rule content compiles into the OAP JVM, equivalent to shell access on the OAP host.
- Cluster Suspend RPC rides the existing OAP cluster-bus gRPC server (RemoteService / HealthCheck transport), separate from port 17128.
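
The lifecycle transitions and `/delete` rules described in the commit message can be read as a small state machine. The following Python sketch encodes only what the message states. The function `next_status`, the `GONE` label for "row removed, no bundled twin", and the choice to raise on undescribed combinations are illustrative assumptions, not OAP code. It also simplifies path 7c by returning `BUNDLED`, whereas the state matrix shows the transient status `n/a` with a `BUNDLED` loader.

```python
# Status names come from the commit message; GONE is a hypothetical label.
BUNDLED, ACTIVE, INACTIVE, GONE = "BUNDLED", "ACTIVE", "INACTIVE", "GONE"


def next_status(status, action, has_bundled_twin=False, mode=""):
    """Return (new_status, response_code) for one operator action.

    response_code is None where the commit message names no specific code.
    """
    if action == "addOrUpdate":
        # Transitions 2, 3, 7a, 7b: every starting point ends up ACTIVE.
        return ACTIVE, None
    if action == "inactivate" and status in (BUNDLED, ACTIVE):
        # Transitions 4 and 5: handlers torn down, tombstone row written.
        return INACTIVE, None
    if action == "inactivate" and status == INACTIVE:
        # Transition 6: idempotent.
        return INACTIVE, "200 already_inactive"
    if action == "delete" and status == ACTIVE:
        return ACTIVE, "409 requires_inactivate_first"
    if action == "delete" and status == INACTIVE:
        if mode == "revertToBundled":
            if has_bundled_twin:
                return BUNDLED, None  # path 7c
            return INACTIVE, "400 no_bundled_twin"
        if has_bundled_twin:
            return INACTIVE, "409 requires_revert_to_bundled"  # path 8b
        return GONE, None  # path 8a
    raise ValueError(f"combination not described in the commit message: {status} + {action}")


if __name__ == "__main__":
    print(next_status(INACTIVE, "delete", has_bundled_twin=True))
```

Writing the rules this way makes the constraint easy to see: the only branch that returns `BUNDLED` is the explicit `revertToBundled` delete.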
## Documentation

- `docs/en/setup/backend/backend-runtime-rule-api.md` — full API reference with applyStatus codes and per-backend `/delete` semantics.
- `docs/en/concepts-and-designs/runtime-rule-hot-update.md` — design doc.
- `docs/en/security/README.md` — security notice for the admin surface.
- `docs/en/setup/backend/configuration-vocabulary.md` — env-var reference.

1 parent 36a3f9c commit 7754e3e

224 files changed

Lines changed: 23809 additions & 964 deletions


.claude/skills/gh-pull-request/SKILL.md

Lines changed: 40 additions & 0 deletions
@@ -32,6 +32,46 @@ license-eye header check

 If invalid files are found, fix with `license-eye header fix` and re-check.

+### 3. Unnecessary fully-qualified class names
+
+The project checkstyle forbids inline FQCNs — every type reference in code should resolve
+through an `import`, not a fully-qualified name. Checkstyle does not always catch this (it
+misses cases like inline `java.util.HashMap`, `java.util.concurrent.TimeUnit`, or
+`org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics.Timer` used as a local
+variable type, generic parameter, or `new` target). Audit the files the branch touched
+before pushing:
+
+Use the `Grep` tool (ripgrep) rather than BSD `grep` on macOS — the scan below relies on a
+negative lookahead that BSD `grep` doesn't support and GNU `grep -P` does:
+
+```
+pattern: ^(?!\s*(import |package |\s*\*)).*\b(java\.util\.|java\.io\.|java\.nio\.|java\.util\.concurrent\.|javassist\.|org\.apache\.skywalking\.)[A-Z][A-Za-z0-9_]*
+glob: *.java
+output_mode: content
+-n: true
+```
+
+Scope the scan to files the branch touched, not the whole tree — pre-existing FQDNs on
+unrelated files generate noise. Use `git diff --name-only master...HEAD -- '*.java'` to get
+the changed list, then run the ripgrep pattern against each.
+
+Acceptable exceptions (same as the `CLAUDE.md` rule):
+- Two classes with the same simple name would collide if both imported.
+- A Javadoc `{@link}` where the short name would be ambiguous to the reader.
+- Inside a string literal (e.g., a class name passed to `Class.forName`).
+
+Fix every other hit — add an `import` and switch to the short name. This includes
+`new java.util.HashMap<>()`, `java.util.Set<String>` parameter types, and
+`org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics.Timer` as a local
+variable type. Field declarations, method signatures, local variables, and generic
+type arguments should all use the imported short name.
+
+Re-run checkstyle after the fix — a sloppy `sed`/`replace_all` can corrupt the `import`
+line itself (e.g., turning `import java.util.concurrent.locks.ReentrantLock;` into
+`import ReentrantLock;`), which causes a cryptic checkstyle `Range [0, -1) out of
+bounds for length N` error, not a normal violation line. If you see that error, inspect
+the imports block first.
+
 ## Commit and push

 After checks pass, commit and push:
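
The scan pattern added to `SKILL.md` in this hunk can be tried outside the `Grep` tool. Below is a minimal Python sketch that applies the same pattern line by line; Python's `re` module supports the negative lookahead the pattern relies on. The helper `find_inline_fqcns` and the `SAMPLE` source are illustrative assumptions, not part of the commit.

```python
import re

# Pattern copied from the SKILL.md hunk above, split over two literals for readability.
FQCN = re.compile(
    r"^(?!\s*(import |package |\s*\*)).*\b(java\.util\.|java\.io\.|java\.nio\."
    r"|java\.util\.concurrent\.|javassist\.|org\.apache\.skywalking\.)[A-Z][A-Za-z0-9_]*"
)


def find_inline_fqcns(java_source):
    """Return (line_number, line) pairs for lines that contain an inline FQCN."""
    return [
        (number, line)
        for number, line in enumerate(java_source.splitlines(), start=1)
        if FQCN.match(line)
    ]


# Hypothetical Java source used only to exercise the pattern.
SAMPLE = """\
package org.apache.skywalking.demo;
import java.util.Map;
/**
 * Uses java.util.HashMap internally.
 */
class Demo {
    Map<String, String> ok = null;
    Map<String, String> bad = new java.util.HashMap<>();
    java.util.concurrent.TimeUnit unit;
}
"""

if __name__ == "__main__":
    for number, line in find_inline_fqcns(SAMPLE):
        print(f"{number}: {line.strip()}")
```

The `package`, `import`, and Javadoc continuation lines are skipped by the lookahead, so only the two inline uses are reported. A class name inside a string literal would still be flagged and has to be dismissed by hand, as the exception list in the hunk notes.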

.github/workflows/skywalking.yaml

Lines changed: 15 additions & 1 deletion
@@ -294,7 +294,9 @@ jobs:
 distribution: temurin
 - name: Integration test
 run: |
-# Exclude slow integration tests and run those tests separately below.
+# Exclude slow integration tests (run in slow-integration-test). Runtime-rule
+# and BanyanDB storage CRUD are verified end-to-end in the dedicated e2e cases
+# (see test/e2e-v2/cases/runtime-rule/ and test/e2e-v2/cases/banyandb).
 ./mvnw -B clean integration-test -Dcheckstyle.skip -DskipUTs=true -DexcludedGroups=slow || \
 ./mvnw -B clean integration-test -Dcheckstyle.skip -DskipUTs=true -DexcludedGroups=slow

@@ -394,6 +396,18 @@ jobs:
 config: test/e2e-v2/cases/storage/es/es-sharding/e2e.yaml
 env: ES_VERSION=8.18.8

+- name: Runtime Rule MAL Storage BanyanDB
+config: test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/e2e.yaml
+- name: Runtime Rule MAL Storage PostgreSQL
+config: test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/e2e.yaml
+- name: Runtime Rule MAL Storage Elasticsearch 8.18.8
+config: test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/e2e.yaml
+env: ES_VERSION=8.18.8
+- name: Runtime Rule LAL Hot-Update
+config: test/e2e-v2/cases/runtime-rule/lal/e2e.yaml
+- name: Runtime Rule Cluster Convergence
+config: test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml
+
 - name: Alarm ES
 config: test/e2e-v2/cases/alarm/es/e2e.yaml
 - name: Alarm ES Sharding

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -43,4 +43,5 @@ test/script-cases/scripts/**/*.generated-classes/

 # Claude Code local settings
 .claude/settings.local.json
+.claude/scheduled_tasks.lock
 *.generated-classes/

.licenserc.yaml

Lines changed: 2 additions & 2 deletions
@@ -134,10 +134,10 @@ dependency:
 version: 1.12.0
 license: Apache-2.0
 - name: build.buf.protoc-gen-validate:pgv-java-stub
-version: 1.2.1
+version: 1.3.0
 license: Apache-2.0
 - name: build.buf.protoc-gen-validate:protoc-gen-validate
-version: 1.2.1
+version: 1.3.0
 license: Apache-2.0
 - name: com.aayushatharva.brotli4j:service
 version: 1.20.0

CLAUDE.md

Lines changed: 22 additions & 0 deletions
@@ -90,6 +90,17 @@ public class XxxModuleProvider extends ModuleProvider {
 - No star imports (`import xxx.*`)
 - No unused or redundant imports
 - No empty statements (standalone `;`)
+- No fully-qualified class names inline in code — always add an `import` statement and
+use the short name. Acceptable exceptions: (a) two classes with the same simple name
+would collide if both imported, (b) the class appears exactly once in a Javadoc
+`{@link}` where the short name would be ambiguous to the reader. Field declarations,
+method signatures, local variables, and generic type arguments should always use the
+imported short name — `private RemoteClientManager rcm;`, not `private
+org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager rcm;`.
+- No one-line delegate methods. A wrapper whose only body is a single forwarding call
+to another class (`return Other.foo(a, b);`) adds a hop without value. Inline the
+call at the use site, or call the underlying object directly (including via method
+reference: `obj::foo` instead of `this::wrapperOfFoo`).

 **Required patterns:**
 - `@Override` annotation required for overridden methods
@@ -105,6 +116,13 @@ public class XxxModuleProvider extends ModuleProvider {
 - Package names: `org.apache.skywalking.*` or `test.apache.skywalking.*`
 - Type names: `PascalCase` or `UPPER_CASE_WITH_UNDERSCORES`
 - Local variables/parameters/members: `camelCase`
+- **Function-oriented naming, not abstract metaphor**: classes and methods are named for
+what they do, not for an abstract concept. Prefer concrete verbs (`load`, `apply`,
+`unregister`, `compile`, `verify`, `commit`, `rollback`) over metaphorical ones
+(`seed`, `hydrate`, `bootstrap`, `prime`). Class names follow the same rule —
+`StaticRuleLoader` (loads static rules), not `StaticBundleSeeder`; `DSLSyncTimer` (syncs
+DB → state on a timer), not `TickRunner`. If you can't name a method without reaching
+for a metaphor, the method is probably doing too much; split it.

 **File limits:**
 - Max file length: 3000 lines
@@ -257,6 +275,10 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all
 10. **Relative paths in docs are valid**: Relative file paths (e.g., `../../../oap-server/...`) in documentation work both in the repo and on the documentation website, supported by website build tooling
 11. **Module service registration**: When adding a service to `CoreModule.services()`, update ALL `CoreModuleProvider` implementations — not just the main one. Search with `grep -rn "extends CoreModuleProvider" oap-server/ --include="*.java"`. The `MockCoreModuleProvider` in `server-tools/profile-exporter/` also needs it, or the profile exporter e2e test will fail at startup.
 12. **Multiple OAP packagings**: The OAP server is not only the main `server-starter`. The `server-tools/` directory contains standalone tools (e.g., profile exporter) that boot with mock module providers and a subset of modules. Changes to core module contracts (services, required modules) must be reflected in these tools too.
+13. **`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires — not at module boot. Wrapping the call in `try { ... } catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup).
+14. **Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service — it's already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly.
+15. **No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer — it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. The correct fix is almost always to **extend the interface** — add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines.
+16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced — do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence.

 ## Analysis and Design Principles
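
Rule 13 in the hunk above asks reviewers to grep for `moduleManager.find(` and to check each target against `requiredModules()`. A rough way to automate that audit is sketched below in Python. It is text-based, assumes `requiredModules()` returns its array literal directly, and only sees calls written literally as `moduleManager.find(X.NAME)`. The function `undeclared_modules` and the sample sources are hypothetical.

```python
import re

# Direct lookups of the form moduleManager.find(SomeModule.NAME).
FIND_CALL = re.compile(r"moduleManager\.find\(\s*(\w+)\.NAME\s*\)")
# The return statement of requiredModules(); assumes the array is returned directly.
REQUIRED_RETURN = re.compile(r"requiredModules\s*\(\s*\)\s*\{\s*return(.*?);", re.S)
MODULE_NAME = re.compile(r"(\w+)\.NAME")


def undeclared_modules(provider_source, module_sources=()):
    """Return module class names looked up via find() but absent from requiredModules()."""
    match = REQUIRED_RETURN.search(provider_source)
    declared = set(MODULE_NAME.findall(match.group(1))) if match else set()
    used = set()
    for source in (provider_source, *module_sources):
        used.update(FIND_CALL.findall(source))
    return used - declared


# Hypothetical provider and helper sources used only to exercise the check.
PROVIDER = """
public class DemoProvider extends ModuleProvider {
    @Override
    public String[] requiredModules() {
        return new String[] {
            CoreModule.NAME
        };
    }
}
"""
HELPER = """
moduleManager.find(CoreModule.NAME).provider();
moduleManager.find(AlarmModule.NAME).provider();
"""

if __name__ == "__main__":
    print(sorted(undeclared_modules(PROVIDER, [HELPER])))
```

A non-empty result is a prompt to inspect the listed modules by hand; lookups made through a differently named manager variable or a constant alias would need a proper Java parser to catch.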

apm-protocol/apm-network/pom.xml

Lines changed: 2 additions & 2 deletions
@@ -92,11 +92,11 @@
 protobuf-java version that grpc depends on.
 -->
 <protocArtifact>
-com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier}
+com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier}
 </protocArtifact>
 <pluginId>grpc-java</pluginId>
 <pluginArtifact>
-io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier}
+io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier}
 </pluginArtifact>
 </configuration>
 <executions>
