Skip to content

Commit d1369a8

Browse files
wu-shengclaude
andcommitted
Runtime rule hot-update for MAL and LAL
Adds /runtime/rule/{addOrUpdate,inactivate,delete,list,bundled,dump,get} on a new admin port (default 17128, disabled by default) for cluster-wide MAL/LAL rule management — push, soft-pause, drop, list, dump, single-rule fetch. The endpoint converges through a single elected main per (catalog, name) with Suspend/Resume RPCs broadcast to peers across the OAP cluster bus. Rules are stored in the management storage layer (BanyanDB / Elasticsearch / JDBC) so hot-updates survive OAP restart — at boot the merged view of bundled YAML and persisted runtime rows takes effect without a regression to defaults. The endpoint has no built-in authentication: operators must gateway-protect it with IP allow-lists and never expose it to the public internet. Engine model ------------ Three layers, with one boundary between each: scheduler (DSLManager + REST handler) — DSL-agnostic. Lock acquisition, cluster Suspend/Resume RPC fan-out, persistence, classloader graveyard, cross-file ownership enforcement, tick scheduling, self-heal. orchestrators (DSLRuntimeApply, DSLRuntimeUnregister, DSLRuntimeDelete) — drive the per-DSL phase pipeline through the engine SPI. engines (MalRuleEngine, LalRuleEngine) — DSL-specific. Implement compile / verify / commit / rollback / unregister / dropBackend / reloadStatic against a per-engine ApplyContext; classify, claimedKeys, storageImpactKeys drive cross-DSL routing and the storage-change guardrail. DSLClassLoaderManager (server-core) ----------------------------------- Process-wide singleton that owns every per-file RuleClassLoader. Boot-time bundled rules continue to load into the OAP main classloader (shared); per- file `static:` loaders only mint after a runtime override is removed and the bundled YAML must serve again. Loader name format `<kind>:<catalog>/<rule>@<MMdd-HHmmss>` is observable in stack traces and the graveyard's INFO/WARN log lines. The manager runs an internal daemon sweeper that observes phantom-reference collection and warns on retired loaders that stay alive past the configured threshold (the leak signal). API surface: newBuilder(catalog, rule, kind, hash) — mints a loader for compile; not yet registered as active. commit(loader) — promotes to active, returns the displaced prior so the caller decides whether to retire. retire(loader) — graveyard a specific loader for GC observability. dropRuntime(catalog, rule) — drops + retires the active loader. active(catalog, rule), activeCount(), pendingCount() — diagnostics. The split between newBuilder and commit means a failed compile leaves the live loader untouched: the failed loader is just garbage-collected, the manager's active map still points at the previously-serving one. /inactivate semantics (Design A) -------------------------------- /inactivate stamps localState=NOT_LOADED. Bundled rules do NOT auto- resurrect on /inactivate even when a bundled twin exists on disk — the operator's "off" intent is preserved across reboots. To bring bundled back, the operator runs /delete (drops the row, gone-keys reconcile reloads bundled) or /addOrUpdate (with bundled YAML or their own). /delete semantics ----------------- Default mode: removes the runtime row. No bundled twin — destructive cascade fires (BanyanDB measure / ES index / JDBC table dropped). Rule fully gone. Bundled twin exists — non-destructive: backend resources runtime claimed that bundled doesn't (or claims at different shape) are dropped; bundled-shared at matching shape is preserved. The runtime row is removed; bundled is reinstalled into a `static:` loader synchronously on the local node. Peers converge via gone-keys reconcile on their next tick. ?mode=revertToBundled is an explicit operator hint that requires a bundled twin (returns 400 no_bundled_twin when none exists) — useful for scripts that want to fail loudly on assumption mismatch. REST surface ------------ * /list returns a single JSON envelope {generatedAt, loaderStats, rules} (was NDJSON). Each row carries status, localState, suspendOrigin, loaderGc, loaderKind (RUNTIME/STATIC/NONE), loaderName, contentHash, bundled, bundledContentHash, updateTime, lastApplyError. status=BUNDLED replaces the prior STATIC for bundled-only rules. * /get accepts ?source=bundled to read bundled YAML even when a runtime row exists — closes the "compare runtime to bundled" gap for editor flows. * All JSON build sites use Gson. Catalog enum at REST boundary ----------------------------- The catalog query parameter parses to org.apache.skywalking.oap.server.core .classloader.Catalog at the REST boundary. RuntimeRuleService's public methods (addOrUpdate / inactivate / delete / get / listBundled / dumpCatalog) take Catalog; unknown wire values return 400 invalid_catalog uniformly. Internal helpers convert via getWireName() at DAO / cluster-RPC edges. DeleteMode enum --------------- ?mode= parses to DeleteMode at the REST boundary. The string never leaves the handler. ForwardTarget interface removed ------------------------------- Single production implementation (RuntimeRuleRestHandler) and zero test fakes. The cluster gRPC service now references RuntimeRuleService directly; the REST handler is left as a pure transport adapter (route bindings + parameter parsing). The Result POJO becomes RuntimeRuleService.ForwardResult. Cluster ------- Suspend / Resume / Forward RPCs over the cluster gRPC bus. Single-main routing (deterministic mainFor(catalog, name)) with REST forward-to-main. Self-heal sweeps SUSPENDED bundles whose main crashed mid-apply (60 s default). Storage ------- RuntimeRuleManagementDAO — per-backend upsert / read / delete on the rule rows. Implementations for BanyanDB / ES / JDBC. /inactivate runs under StorageManipulationOpt.localCacheOnly so the backend measure and history stay; /delete fires the destructive cascade unless a bundled twin makes delta-drop the right path. Per-rule lifecycle docs ----------------------- docs/en/setup/backend/backend-runtime-rule-api.md walks the full operator surface: routes, applyStatus matrix, per-row status decoding (status x loaderKind x bundled), reading bundled-vs-runtime YAML, consistency model. E2E --- test/e2e-v2/cases/runtime-rule/ covers the lifecycle on BanyanDB / ES / JDBC, plus a 2-node cluster scenario (Suspend/Resume + main routing) and the LAL pipeline. Verified end-to-end on BanyanDB locally: CREATE → UPDATE-FILTER → UPDATE-STRUCTURAL → DUMP → 4× illegal → SHAPE-BREAK → INACTIVATE → ACTIVATE → DELETE → DUMP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 315defe commit d1369a8

232 files changed

Lines changed: 23801 additions & 994 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.claude/skills/gh-pull-request/SKILL.md

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,46 @@ license-eye header check
3232

3333
If invalid files are found, fix with `license-eye header fix` and re-check.
3434

35+
### 3. Unnecessary fully-qualified class names
36+
37+
The project checkstyle forbids inline FQCNs — every type reference in code should resolve
38+
through an `import`, not a fully-qualified name. Checkstyle does not always catch this (it
39+
misses cases like inline `java.util.HashMap`, `java.util.concurrent.TimeUnit`, or
40+
`org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics.Timer` used as a local
41+
variable type, generic parameter, or `new` target). Audit the files the branch touched
42+
before pushing:
43+
44+
Use the `Grep` tool (ripgrep) rather than BSD `grep` on macOS — the scan below relies on a
45+
negative lookahead that BSD `grep` doesn't support and GNU `grep -P` does:
46+
47+
```
48+
pattern: ^(?!\s*(import |package |\s*\*)).*\b(java\.util\.|java\.io\.|java\.nio\.|java\.util\.concurrent\.|javassist\.|org\.apache\.skywalking\.)[A-Z][A-Za-z0-9_]*
49+
glob: *.java
50+
output_mode: content
51+
-n: true
52+
```
53+
54+
Scope the scan to files the branch touched, not the whole tree — pre-existing FQDNs on
55+
unrelated files generate noise. Use `git diff --name-only master...HEAD -- '*.java'` to get
56+
the changed list, then run the ripgrep pattern against each.
57+
58+
Acceptable exceptions (same as the `CLAUDE.md` rule):
59+
- Two classes with the same simple name would collide if both imported.
60+
- A Javadoc `{@link}` where the short name would be ambiguous to the reader.
61+
- Inside a string literal (e.g., a class name passed to `Class.forName`).
62+
63+
Fix every other hit — add an `import` and switch to the short name. This includes
64+
`new java.util.HashMap<>()`, `java.util.Set<String>` parameter types, and
65+
`org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics.Timer` as a local
66+
variable type. Field declarations, method signatures, local variables, and generic
67+
type arguments should all use the imported short name.
68+
69+
Re-run checkstyle after the fix — a sloppy `sed`/`replace_all` can corrupt the `import`
70+
line itself (e.g., turning `import java.util.concurrent.locks.ReentrantLock;` into
71+
`import ReentrantLock;`), which causes a cryptic checkstyle `Range [0, -1) out of
72+
bounds for length N` error, not a normal violation line. If you see that error, inspect
73+
the imports block first.
74+
3575
## Commit and push
3676

3777
After checks pass, commit and push:

.github/workflows/skywalking.yaml

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -294,7 +294,9 @@ jobs:
294294
distribution: temurin
295295
- name: Integration test
296296
run: |
297-
# Exclude slow integration tests and run those tests separately below.
297+
# Exclude slow integration tests (run in slow-integration-test). Runtime-rule
298+
# and BanyanDB storage CRUD are verified end-to-end in the dedicated e2e cases
299+
# (see test/e2e-v2/cases/runtime-rule/ and test/e2e-v2/cases/banyandb).
298300
./mvnw -B clean integration-test -Dcheckstyle.skip -DskipUTs=true -DexcludedGroups=slow || \
299301
./mvnw -B clean integration-test -Dcheckstyle.skip -DskipUTs=true -DexcludedGroups=slow
300302
@@ -394,6 +396,18 @@ jobs:
394396
config: test/e2e-v2/cases/storage/es/es-sharding/e2e.yaml
395397
env: ES_VERSION=8.18.8
396398

399+
- name: Runtime Rule MAL Storage BanyanDB
400+
config: test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/e2e.yaml
401+
- name: Runtime Rule MAL Storage PostgreSQL
402+
config: test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/e2e.yaml
403+
- name: Runtime Rule MAL Storage Elasticsearch 8.18.8
404+
config: test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/e2e.yaml
405+
env: ES_VERSION=8.18.8
406+
- name: Runtime Rule LAL Hot-Update
407+
config: test/e2e-v2/cases/runtime-rule/lal/e2e.yaml
408+
- name: Runtime Rule Cluster Convergence
409+
config: test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml
410+
397411
- name: Alarm ES
398412
config: test/e2e-v2/cases/alarm/es/e2e.yaml
399413
- name: Alarm ES Sharding

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,4 +43,5 @@ test/script-cases/scripts/**/*.generated-classes/
4343

4444
# Claude Code local settings
4545
.claude/settings.local.json
46+
.claude/scheduled_tasks.lock
4647
*.generated-classes/

.licenserc.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -134,10 +134,10 @@ dependency:
134134
version: 1.12.0
135135
license: Apache-2.0
136136
- name: build.buf.protoc-gen-validate:pgv-java-stub
137-
version: 1.2.1
137+
version: 1.3.0
138138
license: Apache-2.0
139139
- name: build.buf.protoc-gen-validate:protoc-gen-validate
140-
version: 1.2.1
140+
version: 1.3.0
141141
license: Apache-2.0
142142
- name: com.aayushatharva.brotli4j:service
143143
version: 1.20.0

CLAUDE.md

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -90,6 +90,17 @@ public class XxxModuleProvider extends ModuleProvider {
9090
- No star imports (`import xxx.*`)
9191
- No unused or redundant imports
9292
- No empty statements (standalone `;`)
93+
- No fully-qualified class names inline in code — always add an `import` statement and
94+
use the short name. Acceptable exceptions: (a) two classes with the same simple name
95+
would collide if both imported, (b) the class appears exactly once in a Javadoc
96+
`{@link}` where the short name would be ambiguous to the reader. Field declarations,
97+
method signatures, local variables, and generic type arguments should always use the
98+
imported short name — `private RemoteClientManager rcm;`, not `private
99+
org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager rcm;`.
100+
- No one-line delegate methods. A wrapper whose only body is a single forwarding call
101+
to another class (`return Other.foo(a, b);`) adds a hop without value. Inline the
102+
call at the use site, or call the underlying object directly (including via method
103+
reference: `obj::foo` instead of `this::wrapperOfFoo`).
93104

94105
**Required patterns:**
95106
- `@Override` annotation required for overridden methods
@@ -105,6 +116,13 @@ public class XxxModuleProvider extends ModuleProvider {
105116
- Package names: `org.apache.skywalking.*` or `test.apache.skywalking.*`
106117
- Type names: `PascalCase` or `UPPER_CASE_WITH_UNDERSCORES`
107118
- Local variables/parameters/members: `camelCase`
119+
- **Function-oriented naming, not abstract metaphor**: classes and methods are named for
120+
what they do, not for an abstract concept. Prefer concrete verbs (`load`, `apply`,
121+
`unregister`, `compile`, `verify`, `commit`, `rollback`) over metaphorical ones
122+
(`seed`, `hydrate`, `bootstrap`, `prime`). Class names follow the same rule —
123+
`StaticRuleLoader` (loads static rules), not `StaticBundleSeeder`; `DSLSyncTimer` (syncs
124+
DB → state on a timer), not `TickRunner`. If you can't name a method without reaching
125+
for a metaphor, the method is probably doing too much; split it.
108126

109127
**File limits:**
110128
- Max file length: 3000 lines
@@ -257,6 +275,10 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all
257275
10. **Relative paths in docs are valid**: Relative file paths (e.g., `../../../oap-server/...`) in documentation work both in the repo and on the documentation website, supported by website build tooling
258276
11. **Module service registration**: When adding a service to `CoreModule.services()`, update ALL `CoreModuleProvider` implementations — not just the main one. Search with `grep -rn "extends CoreModuleProvider" oap-server/ --include="*.java"`. The `MockCoreModuleProvider` in `server-tools/profile-exporter/` also needs it, or the profile exporter e2e test will fail at startup.
259277
12. **Multiple OAP packagings**: The OAP server is not only the main `server-starter`. The `server-tools/` directory contains standalone tools (e.g., profile exporter) that boot with mock module providers and a subset of modules. Changes to core module contracts (services, required modules) must be reflected in these tools too.
278+
13. **`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires — not at module boot. Wrapping the call in `try { ... } catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup).
279+
14. **Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service — it's already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly.
280+
15. **No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer — it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. The correct fix is almost always to **extend the interface** — add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines.
281+
16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced — do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence.
260282

261283
## Analysis and Design Principles
262284

apm-protocol/apm-network/pom.xml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -92,11 +92,11 @@
9292
protobuf-java version that grpc depends on.
9393
-->
9494
<protocArtifact>
95-
com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier}
95+
com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier}
9696
</protocArtifact>
9797
<pluginId>grpc-java</pluginId>
9898
<pluginArtifact>
99-
io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier}
99+
io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier}
100100
</pluginArtifact>
101101
</configuration>
102102
<executions>

0 commit comments

Comments
 (0)