
Commit 53e180e

Runtime rule hot-update for MAL and LAL
Add a runtime-rule receiver that exposes a REST surface on port 17128 (off by default) so operators can hot-update OTEL MAL, log MAL, telegraf MAL, and LAL rule files without restarting OAP. State persists in management storage; every node in an OAP cluster converges on the new content within ~30 s.

REST endpoints (full reference in docs/en/setup/backend/backend-runtime-rule-api.md):
- POST /runtime/rule/addOrUpdate   create or replace
- POST /runtime/rule/inactivate    soft-pause (preserves measure + history)
- POST /runtime/rule/delete        destructive removal of an INACTIVE rule
- GET  /runtime/rule               fetch one rule (raw or JSON, ETag)
- GET  /runtime/rule/bundled       static-vs-runtime overlay per catalog
- GET  /runtime/rule/list          NDJSON rule state across the cluster
- GET  /runtime/rule/dump          tar.gz of all stored rules

Engine, cluster + storage:
- New management entity RuntimeRule (catalog,name,content,status,updateTime) with per-backend RuntimeRuleManagementDAO save/find/list/delete.
- DSLManager + RuleEngine SPI orchestrate compile / verify / commit / rollback per file; MAL and LAL each ship an engine, with a shared apply pipeline (StructuralCommitCoordinator, SuspendResumeCoordinator, ApplierResolver, PostApplyVerifier, alarm-window reset).
- Cluster Suspend/Resume RPC and Forward RPC over the cluster bus so any peer can drive a structural cutover that pauses dispatch on every node, persists the rule, then resumes, with a 60 s self-heal backstop for missed Resumes.
- LOCAL_CACHE_VERIFY mode: non-main nodes verify backend shape on boot and refuse to start when their declared model diverges from what the main installed, instead of silently registering against an incompatible schema.
- Storage-model remove lifecycle: StorageModels.remove + per-backend dropTable so a runtime-rule delete can drop the backing measure on BanyanDB and the model on every backend without restarting.
- BanyanDB schema-watch fence: schema mutations wait (best-effort, bounded 2 s) for every data node to apply the new revision before unparking dispatch, so the typical case gets a clean cutover where samples after 200 OK use the new shape.
- BanyanDB also gains shape-mismatch detection: at boot, resources whose on-disk shape diverges from the declared model are skipped with an ERROR diff, instead of silently dropping samples.

Bundled vs runtime overlay:
- StaticRuleRegistry holds bundled rules; the merge resolver lets a runtime rule override a bundled (catalog,name) without rewriting static files. RuntimeRuleOverrideResolver SPI threads operator-side overrides into the bundle resolution.

E2E coverage under test/e2e-v2/cases/runtime-rule/:
- mal-storage/{banyandb,elasticsearch,postgresql}: full 10-phase lifecycle (CREATE → FILTER_ONLY → STRUCTURAL → DUMP → 4× ILLEGAL → SHAPE-BREAK → INACTIVATE → ACTIVATE → DELETE → DUMP) with a per-phase `step` label so verification queries attribute data back to the phase that wrote it.
- lal: log-mal aggregation rule + LAL hot-swap; swctl asserts that the extracted metric carries the swap-flipped step label.
- cluster: 2-OAP convergence over ZooKeeper.

Dependency bumps (driven by BanyanDB schema-consistency RPCs whose generated validation code requires protobuf-java 4.x): gRPC 1.70 → 1.80, protobuf-java 3.25.5 → 4.33.1, pgv 1.2.1 → 1.3.0, Netty 4.2.10 → 4.2.12, Netty-tcnative 2.0.75 → 2.0.77.

Security: the admin port has no built-in authentication; the module is disabled by default. docs/en/security/README.md spells out the operator duty to gateway-protect with IP allow-lists, audit every request, and keep the port off the cluster-external interface.
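For orientation, here is a minimal sketch of how a client might build the `addOrUpdate` call described in the commit message, using the JDK `java.net.http` API. The query parameter names `catalog` and `name`, the `text/plain` content type, and the example values are assumptions made for illustration only. The commit message states that a rule is keyed by (catalog, name) and that the body is the rule content, and points to docs/en/setup/backend/backend-runtime-rule-api.md for the actual contract.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Illustrative helper, not part of the commit. */
final class RuntimeRuleRequests {
    private RuntimeRuleRequests() {
    }

    /** Builds (but does not send) a POST to the addOrUpdate endpoint. */
    static HttpRequest addOrUpdate(String baseUrl, String catalog, String name, String yaml) {
        // ASSUMPTION: catalog and name travel as query parameters with these names.
        String query = "catalog=" + URLEncoder.encode(catalog, StandardCharsets.UTF_8)
            + "&name=" + URLEncoder.encode(name, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(baseUrl + "/runtime/rule/addOrUpdate?" + query))
            // ASSUMPTION: the receiver accepts the raw rule YAML as plain text.
            .header("Content-Type", "text/plain")
            .POST(HttpRequest.BodyPublishers.ofString(yaml))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder rule body and names; substitute a real rule file.
        HttpRequest request = addOrUpdate(
            "http://127.0.0.1:17128", "otel-rules", "vm", "metricsRules: []\n");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Because the port has no built-in authentication, a real client would normally send this through the operator's protected gateway rather than straight to the OAP node.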
1 parent 36a3f9c commit 53e180e

218 files changed

Lines changed: 22508 additions & 915 deletions


.claude/skills/gh-pull-request/SKILL.md

Lines changed: 40 additions & 0 deletions
@@ -32,6 +32,46 @@ license-eye header check

 If invalid files are found, fix with `license-eye header fix` and re-check.

+### 3. Unnecessary fully-qualified class names
+
+The project checkstyle forbids inline FQCNs — every type reference in code should resolve
+through an `import`, not a fully-qualified name. Checkstyle does not always catch this (it
+misses cases like inline `java.util.HashMap`, `java.util.concurrent.TimeUnit`, or
+`org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics.Timer` used as a local
+variable type, generic parameter, or `new` target). Audit the files the branch touched
+before pushing:
+
+Use the `Grep` tool (ripgrep) rather than BSD `grep` on macOS — the scan below relies on a
+negative lookahead that BSD `grep` doesn't support and GNU `grep -P` does:
+
+```
+pattern: ^(?!\s*(import |package |\s*\*)).*\b(java\.util\.|java\.io\.|java\.nio\.|java\.util\.concurrent\.|javassist\.|org\.apache\.skywalking\.)[A-Z][A-Za-z0-9_]*
+glob: *.java
+output_mode: content
+-n: true
+```
+
+Scope the scan to files the branch touched, not the whole tree — pre-existing FQCNs on
+unrelated files generate noise. Use `git diff --name-only master...HEAD -- '*.java'` to get
+the changed list, then run the ripgrep pattern against each.
+
+Acceptable exceptions (same as the `CLAUDE.md` rule):
+- Two classes with the same simple name would collide if both imported.
+- A Javadoc `{@link}` where the short name would be ambiguous to the reader.
+- Inside a string literal (e.g., a class name passed to `Class.forName`).
+
+Fix every other hit — add an `import` and switch to the short name. This includes
+`new java.util.HashMap<>()`, `java.util.Set<String>` parameter types, and
+`org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics.Timer` as a local
+variable type. Field declarations, method signatures, local variables, and generic
+type arguments should all use the imported short name.
+
+Re-run checkstyle after the fix — a sloppy `sed`/`replace_all` can corrupt the `import`
+line itself (e.g., turning `import java.util.concurrent.locks.ReentrantLock;` into
+`import ReentrantLock;`), which causes a cryptic checkstyle `Range [0, -1) out of
+bounds for length N` error, not a normal violation line. If you see that error, inspect
+the imports block first.
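To see how the scan pattern above behaves on individual lines, here is a small sketch that applies the same expression with `java.util.regex`. The class name `FqcnScan` and the sample lines are made up for illustration. It only shows which lines the pattern flags: a flagged string literal (the third exception listed above) still has to be dismissed by hand.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: runs the scan pattern from this section against single source lines. */
final class FqcnScan {
    // Same expression as the scan pattern, escaped for a Java string literal.
    private static final Pattern INLINE_FQCN = Pattern.compile(
        "^(?!\\s*(import |package |\\s*\\*)).*\\b"
            + "(java\\.util\\.|java\\.io\\.|java\\.nio\\.|java\\.util\\.concurrent\\.|javassist\\.|org\\.apache\\.skywalking\\.)"
            + "[A-Z][A-Za-z0-9_]*");

    private FqcnScan() {
    }

    /** Returns true when the line is a candidate hit that needs a manual look. */
    static boolean isCandidate(String line) {
        return INLINE_FQCN.matcher(line).find();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "import java.util.HashMap;",                           // skipped: import line
            " * Javadoc text mentioning java.util.HashMap",        // skipped: comment continuation
            "    Map<String, String> m = new java.util.HashMap<>();",
            "    java.util.concurrent.TimeUnit unit = null;",
            "    Class.forName(\"java.util.HashMap\");");          // flagged, but an accepted exception
        for (String line : lines) {
            System.out.println(isCandidate(line) + "  " + line);
        }
    }
}
```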
+
 ## Commit and push

 After checks pass, commit and push:

.github/workflows/skywalking.yaml

Lines changed: 15 additions & 1 deletion
@@ -294,7 +294,9 @@ jobs:
 distribution: temurin
 - name: Integration test
 run: |
-# Exclude slow integration tests and run those tests separately below.
+# Exclude slow integration tests (run in slow-integration-test). Runtime-rule
+# and BanyanDB storage CRUD are verified end-to-end in the dedicated e2e cases
+# (see test/e2e-v2/cases/runtime-rule/ and test/e2e-v2/cases/banyandb).
 ./mvnw -B clean integration-test -Dcheckstyle.skip -DskipUTs=true -DexcludedGroups=slow || \
 ./mvnw -B clean integration-test -Dcheckstyle.skip -DskipUTs=true -DexcludedGroups=slow
@@ -394,6 +396,18 @@ jobs:
 config: test/e2e-v2/cases/storage/es/es-sharding/e2e.yaml
 env: ES_VERSION=8.18.8

+- name: Runtime Rule MAL Storage BanyanDB
+config: test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/e2e.yaml
+- name: Runtime Rule MAL Storage PostgreSQL
+config: test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/e2e.yaml
+- name: Runtime Rule MAL Storage Elasticsearch 8.18.8
+config: test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/e2e.yaml
+env: ES_VERSION=8.18.8
+- name: Runtime Rule LAL Hot-Update
+config: test/e2e-v2/cases/runtime-rule/lal/e2e.yaml
+- name: Runtime Rule Cluster Convergence
+config: test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml
+
 - name: Alarm ES
 config: test/e2e-v2/cases/alarm/es/e2e.yaml
 - name: Alarm ES Sharding

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -43,4 +43,5 @@ test/script-cases/scripts/**/*.generated-classes/

 # Claude Code local settings
 .claude/settings.local.json
+.claude/scheduled_tasks.lock
 *.generated-classes/

CLAUDE.md

Lines changed: 22 additions & 0 deletions
@@ -90,6 +90,17 @@ public class XxxModuleProvider extends ModuleProvider {
 - No star imports (`import xxx.*`)
 - No unused or redundant imports
 - No empty statements (standalone `;`)
+- No fully-qualified class names inline in code — always add an `import` statement and
+use the short name. Acceptable exceptions: (a) two classes with the same simple name
+would collide if both imported, (b) the class appears exactly once in a Javadoc
+`{@link}` where the short name would be ambiguous to the reader. Field declarations,
+method signatures, local variables, and generic type arguments should always use the
+imported short name — `private RemoteClientManager rcm;`, not `private
+org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager rcm;`.
+- No one-line delegate methods. A wrapper whose only body is a single forwarding call
+to another class (`return Other.foo(a, b);`) adds a hop without value. Inline the
+call at the use site, or call the underlying object directly (including via method
+reference: `obj::foo` instead of `this::wrapperOfFoo`).
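The delegate-method rule above can be illustrated with a short sketch. `DelegateExample` and `NameNormalizer` are hypothetical names invented for this example and do not exist in the codebase. The discouraged and preferred variants return the same result, so the wrapper only adds an extra hop.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Illustrative only: contrasts a one-line delegate with a direct method reference. */
final class DelegateExample {
    /** Hypothetical collaborator used only for this illustration. */
    static final class NameNormalizer {
        String normalize(String raw) {
            return raw.trim().toLowerCase(Locale.ROOT);
        }
    }

    private final NameNormalizer normalizer = new NameNormalizer();

    // Discouraged: the only body is a single forwarding call.
    private String normalizeName(String raw) {
        return normalizer.normalize(raw);
    }

    // Discouraged call site: goes through the wrapper.
    List<String> viaWrapper(List<String> names) {
        return names.stream().map(this::normalizeName).collect(Collectors.toList());
    }

    // Preferred call site: references the underlying object directly.
    List<String> direct(List<String> names) {
        return names.stream().map(normalizer::normalize).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        DelegateExample example = new DelegateExample();
        List<String> input = List.of(" Foo ", "BAR");
        System.out.println(example.viaWrapper(input));
        System.out.println(example.direct(input));
    }
}
```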

 **Required patterns:**
 - `@Override` annotation required for overridden methods
@@ -105,6 +116,13 @@ public class XxxModuleProvider extends ModuleProvider {
 - Package names: `org.apache.skywalking.*` or `test.apache.skywalking.*`
 - Type names: `PascalCase` or `UPPER_CASE_WITH_UNDERSCORES`
 - Local variables/parameters/members: `camelCase`
+- **Function-oriented naming, not abstract metaphor**: classes and methods are named for
+what they do, not for an abstract concept. Prefer concrete verbs (`load`, `apply`,
+`unregister`, `compile`, `verify`, `commit`, `rollback`) over metaphorical ones
+(`seed`, `hydrate`, `bootstrap`, `prime`). Class names follow the same rule —
+`StaticRuleLoader` (loads static rules), not `StaticBundleSeeder`; `DSLSyncTimer` (syncs
+DB → state on a timer), not `TickRunner`. If you can't name a method without reaching
+for a metaphor, the method is probably doing too much; split it.

 **File limits:**
 - Max file length: 3000 lines
@@ -257,6 +275,10 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all
 10. **Relative paths in docs are valid**: Relative file paths (e.g., `../../../oap-server/...`) in documentation work both in the repo and on the documentation website, supported by website build tooling
 11. **Module service registration**: When adding a service to `CoreModule.services()`, update ALL `CoreModuleProvider` implementations — not just the main one. Search with `grep -rn "extends CoreModuleProvider" oap-server/ --include="*.java"`. The `MockCoreModuleProvider` in `server-tools/profile-exporter/` also needs it, or the profile exporter e2e test will fail at startup.
 12. **Multiple OAP packagings**: The OAP server is not only the main `server-starter`. The `server-tools/` directory contains standalone tools (e.g., profile exporter) that boot with mock module providers and a subset of modules. Changes to core module contracts (services, required modules) must be reflected in these tools too.
+13. **`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires — not at module boot. Wrapping the call in `try { ... } catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup).
+14. **Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service — it's already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly.
+15. **No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer — it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. The correct fix is almost always to **extend the interface** — add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines.
+16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced — do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence.
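Item 15 in the list above can be made concrete with a small sketch. All names here (`ApplyModeExample`, `ApplyMode`, `PEER_MODE`) are hypothetical and chosen only for illustration. The example sets a `ThreadLocal` flag on the calling thread and then hands the work to another thread: the hidden flag is not visible there, while the explicit parameter carries the intent across the hand-off.

```java
import java.util.function.Supplier;

/** Illustrative only: hidden ThreadLocal flag versus a mode that appears in the signature. */
final class ApplyModeExample {
    /** Hypothetical mode enum; the constants are illustrative. */
    enum ApplyMode { OPERATOR_REQUEST, PEER_FORWARDED }

    // Discouraged: a side-channel that downstream code silently reads.
    static final ThreadLocal<Boolean> PEER_MODE = ThreadLocal.withInitial(() -> Boolean.FALSE);

    private ApplyModeExample() {
    }

    static String applyHidden(String rule) {
        return PEER_MODE.get() ? rule + ": skip re-broadcast" : rule + ": re-broadcast";
    }

    // Preferred: the caller's intent is part of the method signature.
    static String apply(String rule, ApplyMode mode) {
        return mode == ApplyMode.PEER_FORWARDED ? rule + ": skip re-broadcast" : rule + ": re-broadcast";
    }

    /** Runs the task on a fresh thread, standing in for an executor or event-loop hand-off. */
    static String runOnWorker(Supplier<String> task) {
        String[] result = new String[1];
        Thread worker = new Thread(() -> result[0] = task.get());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        PEER_MODE.set(Boolean.TRUE); // set on the caller thread only
        System.out.println(runOnWorker(() -> applyHidden("vm")));                      // flag is lost
        System.out.println(runOnWorker(() -> apply("vm", ApplyMode.PEER_FORWARDED)));  // intent survives
    }
}
```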

 ## Analysis and Design Principles

apm-protocol/apm-network/pom.xml

Lines changed: 2 additions & 2 deletions
@@ -92,11 +92,11 @@
 protobuf-java version that grpc depends on.
 -->
 <protocArtifact>
-com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier}
+com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier}
 </protocArtifact>
 <pluginId>grpc-java</pluginId>
 <pluginArtifact>
-io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier}
+io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier}
 </pluginArtifact>
 </configuration>
 <executions>

docker/.env

Lines changed: 1 addition & 1 deletion
@@ -6,6 +6,6 @@
 # docker compose up

 ELASTICSEARCH_IMAGE=docker.elastic.co/elasticsearch/elasticsearch-oss:7.4.2
-BANYANDB_IMAGE=ghcr.io/apache/skywalking-banyandb:7568a326bb7b10b6aa804bf0f4239904c347d9d5
+BANYANDB_IMAGE=ghcr.io/apache/skywalking-banyandb:69c8f4d20ebb6532ea4c16a7ed7114dd6ec9770b
 OAP_IMAGE=ghcr.io/apache/skywalking/oap:latest
 UI_IMAGE=ghcr.io/apache/skywalking/ui:latest

docker/oap/docker-entrypoint.sh

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+
 #!/bin/sh
 # Licensed to the Apache Software Foundation (ASF) under one
 # or more contributor license agreements. See the NOTICE file

docs/en/changes/changes.md

Lines changed: 38 additions & 1 deletion
@@ -1,8 +1,46 @@
 ## 10.5.0

 #### Project
+
+* **Runtime rule hot-update for MAL and LAL.** Operators can now ship metric (MAL) and log
+(LAL) rule changes without restarting OAP. A push to a new admin endpoint persists the rule
+to the configured storage backend, and every node in the cluster converges to the new
+content within ~30 seconds. Common workflows:
+* `addOrUpdate` — create or replace a rule. Body is the raw YAML you would normally ship
+with OAP's static rule files. Returns 200 once the rule is applied locally and
+persisted; peers pick it up on their next periodic scan (≤ 30 s).
+* `inactivate` — soft-pause a rule. The OAP stops emitting metrics for that rule but the
+backend measure (and its history) is preserved, so a later `addOrUpdate` to the same
+`(catalog, name)` is lossless. This is the safe way to take a rule offline.
+* `delete` — destructive removal. Requires the rule to already be `INACTIVE` (otherwise
+returns 409); drops the backend measure and removes the row.
+* `get` / `bundled` / `list` / `dump` — read-side endpoints for fetching a single rule's
+YAML (with `ETag` support), listing the static-vs-runtime overlay per catalog,
+inspecting cluster-wide rule state in NDJSON, and exporting all rules as a tar.gz for
+backup / DR.
+Hot-updates survive OAP restart: at boot OAP merges bundled rule files with persisted
+runtime rules, so the cluster never silently regresses to the bundled defaults.
+**The endpoint is disabled by default and listens on port `17128` when enabled. It has
+no built-in authentication — operators must gateway-protect it with IP allow-lists and
+never expose it to the public internet.**
+* **BanyanDB schema mismatches are now visible at boot, not silent.** If BanyanDB already
+holds a resource whose shape doesn't match what the current rule declares (e.g., a rule
+was edited on disk while OAP was offline), OAP now skips that resource, logs an ERROR
+with the declared-vs-backend diff, and continues booting — previously the mismatch was
+silently accepted and samples for the affected resource were quietly dropped. To
+re-shape a mismatched metric, push the desired YAML through
+`POST /runtime/rule/addOrUpdate`.
 * Bump infra-e2e to testcontainers-go v0.42.0 (apache/skywalking-infra-e2e#146), which uses Docker Compose v2 plugin natively and removes docker-compose v1 dependency.
 * Remove deprecated `version` field from all docker-compose files for Compose v2 compatibility.
+* **Best-effort schema-cutover fence for BanyanDB.** After firing a schema install or drop
+OAP now waits up to a bounded window (default 2s) for every BanyanDB data node to apply
+the change before resuming dispatch — the typical case gets a clean cutover where
+samples after `200 OK` use the new shape. On laggard timeout, OAP logs a warning and
+proceeds anyway so a single slow node doesn't wedge the apply.
+* Bump dependencies: gRPC `1.70.0` → `1.80.0`, protobuf-java `3.25.5` → `4.33.1`, Netty
+`4.2.10.Final` → `4.2.12.Final`, Netty-tcnative `2.0.75` → `2.0.77`, pgv (protoc-gen-validate)
+`1.2.1` → `1.3.0`. Driven by the new BanyanDB schema-consistency RPCs whose generated
+validation code requires the `protobuf-java 4.x` runtime.
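The best-effort fence described in the changelog entry above (wait for a bounded window, then warn and proceed) can be sketched as follows. This is not the OAP implementation: a `CountDownLatch` stands in for the acknowledgements from BanyanDB data nodes, and `BoundedFence` and `Outcome` are hypothetical names chosen for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Illustrative only: bounded wait that never blocks the apply past its window. */
final class BoundedFence {
    enum Outcome { APPLIED_EVERYWHERE, TIMED_OUT_PROCEEDING }

    private BoundedFence() {
    }

    /**
     * Waits until every pending acknowledgement has arrived or the window elapses.
     * ASSUMPTION: one latch count per data node that still has to apply the change.
     */
    static Outcome await(CountDownLatch pendingNodeAcks, long timeoutMillis) {
        try {
            if (pendingNodeAcks.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return Outcome.APPLIED_EVERYWHERE;
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt flag and fall through to the proceed-anyway path.
            Thread.currentThread().interrupt();
        }
        System.err.println("WARN: schema fence window elapsed, resuming dispatch anyway");
        return Outcome.TIMED_OUT_PROCEEDING;
    }

    public static void main(String[] args) {
        System.out.println(await(new CountDownLatch(0), 2000)); // all nodes already caught up
        System.out.println(await(new CountDownLatch(1), 50));   // one laggard, short window
    }
}
```

Returning an outcome instead of throwing keeps the caller's flow identical in both cases, which matches the stated goal that one slow node should not wedge the apply.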

 #### OAP Server
 * Add Zipkin Virtual GenAI e2e test. Use `zipkin_json` exporter to avoid protobuf dependency conflict
@@ -54,4 +92,3 @@
 * Add WeChat / Alipay Mini Program monitoring setup documentation, plus a client-side-monitoring section in the security guide covering public-internet ingress (OTLP + `/v3/segments`) for mobile / browser / mini-program SDKs.

 All issues and pull requests are [here](https://github.com/apache/skywalking/issues?q=milestone:10.5.0)
-